MSA: retrieving thought, not text

There are three ways to give an LLM memory. All three are broken.

Fine-tune the weights? High precision, but fixed capacity and catastrophic forgetting.

External retrieval (RAG)? Scales to any corpus, but retrieves in text/embedding space, not the space the model actually thinks in. That mismatch sets a structural ceiling no amount of reranking can raise.

Compress into a latent state? Efficient, but fixed-size states forget at scale. RWKV drops from 100% to 53% accuracy at 1M tokens on needle-in-a-haystack.

I’ve been exploring all three for Sketch’s memory layer. None of them fully solves the problem, which is why this paper caught my attention.

A fourth option

Memory Sparse Attention (MSA), from Evermind.

Instead of retrieving text chunks from an external database, MSA retrieves the model’s own internal representations. Learned “Router Projectors” inside the transformer score documents by their KV cache signatures, select the top-k most relevant, and feed only those into attention. Retrieval and generation share the same forward pass, same loss function, same representation space.
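To make the routing step concrete, here is a minimal NumPy sketch of scoring documents in a learned space and keeping the top-k. It is an illustration under assumptions, not the paper's implementation: the names `doc_signature`, `route_top_k`, `w_key`, and `w_router` are hypothetical, and mean-pooling projected keys into one vector per document is my simplification of a "KV cache signature".

```python
import numpy as np

def doc_signature(keys, w_key):
    """Collapse one document's cached keys (n_tokens, d_model) into a single vector.

    Assumption: mean-pooling after a learned projection. The paper may pool differently.
    """
    return (keys @ w_key).mean(axis=0)

def route_top_k(query_hidden, doc_signatures, w_router, k):
    """Score every document against the query and return the top-k indices, best first."""
    q = query_hidden @ w_router            # project the query into the routing space
    scores = doc_signatures @ q            # one dot-product score per document
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k without a full sort
    top = top[np.argsort(-scores[top])]         # order the k winners by score
    return top, scores
```

In a real model the projections would be trained with the same loss as generation, which is the point of the design: the scores are computed from the model's own hidden states rather than from a separate embedding model.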

RAG retrieves text. MSA retrieves thought.

Documents are encoded once, offline, with doc-wise RoPE. At query time a router scores them by their KV signature, selects the top-k, and concatenates them with the local context so attention runs over both when generating the answer.
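The query-time step can be sketched as plain attention over a concatenation of the selected documents' cached keys and values plus the local context. This is a single-head toy with no causal mask; `attend_over_selected` and the `memory_kv` layout (a list of `(keys, values)` pairs, one per document) are hypothetical names chosen for illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend_over_selected(q, memory_kv, selected, local_k, local_v):
    """Attend only to the routed documents' cached KVs plus the local context.

    q:         (n_query, d) query vectors
    memory_kv: list of (keys, values) arrays, one pair per document, encoded offline
    selected:  indices of the documents chosen by the router
    """
    ks = [memory_kv[i][0] for i in selected] + [local_k]
    vs = [memory_kv[i][1] for i in selected] + [local_v]
    k = np.concatenate(ks, axis=0)
    v = np.concatenate(vs, axis=0)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # scaled dot-product attention
    return weights @ v
```

Documents the router does not select never enter the attention computation, so the cost per query depends on k and the local context length, not on the size of the memory bank.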

The trick that makes it scale

Document-wise RoPE. Each document gets independent position IDs starting from 0, decoupling positional encoding from corpus size. Train on 64K tokens. Infer on 100 million. The model never sees out-of-distribution positions because each document looks like a standalone input it’s seen thousands of times.
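A small sketch of the position-ID difference, assuming documents are simply laid out one after another. The function names are mine, and this only builds the integer position IDs; applying the rotary embedding itself is left out.

```python
import numpy as np

def global_position_ids(doc_lengths):
    """Conventional layout: positions keep counting across the whole corpus."""
    return np.arange(sum(doc_lengths))

def docwise_position_ids(doc_lengths):
    """Document-wise layout: every document restarts at position 0."""
    if not doc_lengths:
        return np.arange(0)
    return np.concatenate([np.arange(n) for n in doc_lengths])

lengths = [3, 2, 4]
print(global_position_ids(lengths).tolist())   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(docwise_position_ids(lengths).tolist())  # [0, 1, 2, 0, 1, 0, 1, 2, 3]
```

With the global layout the largest position grows with the corpus. With the document-wise layout it is bounded by the longest single document, which is why adding more documents does not push the model past the positions it saw in training.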

Results

All on a 4B model running on 2 A800 GPUs:

Less than 9% degradation scaling from 16K to 100M tokens
94.84% needle-in-a-haystack at 1M tokens (backbone collapses to 25%)
Beats RAG systems backed by 235B generators on 4 of 9 QA benchmarks
16% average improvement over standard RAG on same-backbone comparisons

What the ablation really says

The biggest contributor isn’t the routing architecture. It’s injecting the original document text after routing. Compressed KVs find the right documents. Raw text generates the answers. MSA is a hybrid: latent routing, text generation.

The caveats

They matter. The results come from a single backbone (Qwen3-4B). The memory parallelism that enables 100M tokens depends on the model being small enough to replicate per GPU, so it’s unclear if this scales to 70B+. The evaluations report no confidence intervals. And the memory bank is static: it requires offline encoding.

Where this points

RAG isn’t dead. It still wins on updateability, interpretability, and cost. But for large static knowledge bases where retrieval quality is the bottleneck, operating in the model’s native representation space closes a gap that pipeline architectures structurally cannot.

The real direction here is internalization. Retrieval moving from external pipeline into the model itself. MSA is one step. It won’t be the last.

Paper: MSA: Memory Sparse Attention (arXiv: 2603.23516)