#164 · 2025-05-17
Microsoft MEMENTO: LLMs that compress their own chain-of-thought
Microsoft Research teaches reasoning models to summarise their own thinking mid-generation: 2.5× smaller peak KV cache, roughly 2× higher throughput, and a surprising 'hidden channel' in the KV states that on its own is worth 15 accuracy points on AIME24.