DeepSeek Engram

Context: N-Gram Models

"engram"? → Embedding vector itself is the "n-gram"

N-gram Models illustrate higher-order correlations between N words (unigram, etc.)

We can scale Embedding tables >10. This is used in Ads and Pinterest. State of the art for modeling clicks, conversions, etc. Evolved to perform regression, memorize patterns, and train on trillions of examples and huge QPS inference.

We chunk the query ("reebok new running shoes") into 3-grams. The query is Lemmatized, which reduces words to their base or dictionary form, called a "lemma" (i.e. "running" turns into "run" and "shoes" into "shoe" to reduce cardinality). We then hash the string and look into the sparse embedding table to fetch and sum embeddings. The embeddings are then fed into the neural networks.

Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

It goes back to the original string and actually tokenizes is differently. The Engram block grabs level two grams and three grams and uses that to produce the lookups for the embeddings. Then they also use the hidden like the residual backbone. And they use that to perform like a gating also. The lookups happen directly from the input string to the transformer. and then a gating function is the only thing that looks at the hidden state.

So when it's training, they are taking the input IDs like the input sequence. They can in parallel, start computing the transformer from the unigrams and then they're fetching that and the embeddings from like the lookup and then they're injecting them into the later parts of the architecture. So you can like overlap this computation so you can be sending parsing and sending the message.

They do the same canonicalization with stripping space and lowercasing to get a 23 reduction in the vocabulary projection layer. Vocabularies in transformers go up to 50 or 200 To be more precise: if you process a 128 tokenizer with a fixed budget, the number of symbols decreases by 23 and for 2-grams, this would be a 0.77×0.77=59.29 or a 40.71 reduction due to quadratic nature.

Mathematics: Context-aware Gating (From Paper)

The retrieved embeddings (e_t) serve as context-independent priors. Being static, however, they inherently lack contextual adaptability and may suffer from noise due to hash collisions or polysemy. They employ a context-aware gating mechanism via a hidden state (h_t)— which has aggregated global context via preceding attention layers—as a dynamic Query, while the retrieved memory

(e_t) serves as the source for both Key and Value projections:

(k_t)=(W_K)*(e_t), (v_t)=(W_V)*(e_t)

where (W_K),(W_V) are learnable projection matrices. To ensure gradient stability, we apply RMSNorm to the Query and Key before computing the scalar gate (α_t)∈(0,1):

(α_t)=σ*((RMSNorm*((h_t))⊤*RMSNorm*((k_t)))/√(,d))

The gated output is defined as (v_t)=(α_t)⋅(v_t). This design enforces semantic alignment: if the retrieved memory (e_t) contradicts the current context (h_t), the gate (α_t) tends toward zero, effectively suppressing the noise.

Finally, to expand the receptive field and enhance the model’s non-linearity, we introduce a short, depthwise causal convolution. Let V∈ℝ(T×d) denote the sequence of gated values. Using a kernel size w (set to 4), dilation δ (set to the max N-gram order) and SiLU activation, the final output Y is computed as:

Y=SiLU*(Conv*1*D*(RMSNorm*(V)))+V,

The Engram module is integrated into the backbone via a residual connection: H(ℓ)←H(ℓ)+Y, followed by the standard Attention and MoE.

Impact: Scaling and Performance

For a fixed parameter count, the sweet spot is around 20 of the parameters being used for the embeddings. The second graph shows that with the non-scaling O(1) overhead of Engram, the curve follows a strict power law (linear in log-space), indicating that Engram provides a predictable scaling knob. Regarding scaling efficiency: while the direct averaging approach of OverEncoding benefits from larger memory tables, Engram unlocks much larger scaling potential from the same memory budget.

Moreover, the throughput hit for adding a 100 Engram on the CPU is small even when adding to a 4 or 8 models.

This Engram blocks gives it a shortcut to avoid complicated computations from the Transformer blocks to get the multi-token lookup. The underlying argument: we are separating out the components of the fact-storing versus the overall semantic meaning of a sentence and we can take advantage of cheap RAM + CPU cycles VS expensive VRAM + GPU cycles.

Questions

"Gated residual connections, make sense. So now do the authors also compare gated residual connections within the network between the layers to just these gated connections to the embedding table and do they justify that specifically?"