DeepSeek Engram
Context: N-Gram Models
"engram"?
N-gram models capture higher-order correlations between N consecutive words (unigram, bigram, trigram, ...)
We can scale embedding tables cheaply: a lookup costs O(1) regardless of table size
We chunk the query ("reebok new running shoes") into n-grams
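The chunking step can be sketched in Python; the query string comes from the note above, and the function name is my own:

```python
def ngrams(text, n):
    """Split a whitespace-tokenized query into overlapping n-grams."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

query = "reebok new running shoes"
bigrams = ngrams(query, 2)   # ['reebok new', 'new running', 'running shoes']
trigrams = ngrams(query, 3)  # ['reebok new running', 'new running shoes']
```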
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
The Engram block goes back to the original string and tokenizes it differently: it extracts 2-grams and 3-grams and uses them as keys for the embedding lookups. It also uses the hidden state from the residual backbone, but only to perform gating. So the lookups themselves depend solely on the input string to the transformer; the gating function is the only component that reads the hidden state.
During training, the model takes the input IDs and can, in parallel, start computing the transformer forward pass over the unigram tokens while fetching the n-gram embeddings from the lookup table; the retrieved embeddings are injected into later parts of the architecture. Because the lookup never touches hidden states, the fetch can be overlapped with the early layers' computation.
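The overlap is possible precisely because the lookup reads only the raw input. A toy sketch of that scheduling idea (both functions are illustrative stand-ins, not the paper's implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def run_backbone_layers(input_ids):
    # toy stand-in for the early transformer layers over unigram token IDs
    return sum(input_ids)

def fetch_engram_embeddings(text):
    # toy stand-in for the CPU-side n-gram table lookup; note it reads only
    # the raw input string, never the transformer's hidden states
    return [hash(word) % 100 for word in text.split()]

with ThreadPoolExecutor(max_workers=1) as pool:
    # kick off the lookup, then run the early layers while it is in flight
    future = pool.submit(fetch_engram_embeddings, "reebok new running shoes")
    early_out = run_backbone_layers([3, 1, 4, 1])
    engram_embs = future.result()  # ready for injection into later layers
```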
They apply the same canonicalization, stripping whitespace and lowercasing, to get a canonical key for the lookup table.
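One plausible reading of that canonicalization, sketched in Python (the exact normalization rules are an assumption; the function name is my own):

```python
def canonical_key(ngram):
    """Canonicalize an n-gram key: lowercase, trim, and collapse whitespace."""
    return " ".join(ngram.lower().split())

canonical_key("  Running  Shoes ")  # -> "running shoes"
```

This makes "Running Shoes" and "running shoes" hit the same table entry, at the cost of conflating case-distinct strings.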
Mathematics: Context-aware Gating (From Paper)
The retrieved embeddings $\mathbf{e}_t$ are modulated by a context-aware gate computed from the backbone hidden state $\mathbf{h}_t$:

$$\mathbf{g}_t = \sigma\left(\mathbf{W}_g \mathbf{h}_t\right),$$

where $\sigma$ is the sigmoid function and $\mathbf{W}_g$ is a learned projection. The gated output is defined as

$$\tilde{\mathbf{e}}_t = \mathbf{g}_t \odot \mathbf{e}_t,$$

with $\odot$ denoting elementwise multiplication.

Finally, to expand the receptive field and enhance the model's non-linearity, we introduce a short, depthwise causal convolution. Let $\mathrm{DWConv}$ denote a depthwise 1-D convolution whose kernel at position $t$ sees only positions $\leq t$; the module output is

$$\mathbf{o}_t = \mathrm{DWConv}(\tilde{\mathbf{e}})_t.$$

The Engram module is integrated into the backbone via a residual connection:

$$\mathbf{h}'_t = \mathbf{h}_t + \mathbf{o}_t.$$
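The gating, depthwise causal convolution, and residual steps described above can be sketched in NumPy. All shapes, weights, and the gate parameterization here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, k = 5, 8, 3                    # sequence length, hidden size, conv kernel width

h = rng.normal(size=(T, d))          # backbone hidden states
e = rng.normal(size=(T, d))          # retrieved n-gram embeddings
W_g = rng.normal(size=(d, d)) * 0.1  # gate projection (hypothetical shape)
W_c = rng.normal(size=(k, d)) * 0.1  # depthwise kernel: one filter per channel

g = 1.0 / (1.0 + np.exp(-(h @ W_g)))   # context-aware gate from the hidden state
e_tilde = g * e                        # gated retrieved embeddings

# depthwise causal convolution: output at t mixes positions t-k+1 .. t only
pad = np.vstack([np.zeros((k - 1, d)), e_tilde])
o = np.stack([(W_c * pad[t:t + k]).sum(axis=0) for t in range(T)])

h_out = h + o                          # residual integration into the backbone
```

Left-padding with `k - 1` zeros is what makes the convolution causal: position `t` never sees embeddings retrieved for later positions.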
Impact: Scaling and Performance
For a fixed parameter count, the sweet spot is around
Moreover, the throughput hit for adding a
This Engram block gives the model a shortcut: multi-token lookups that would otherwise require several layers of Transformer computation are replaced by a cheap table fetch. The underlying argument is a separation of concerns: fact storage is split from the overall semantic processing of a sentence, so we can exploit cheap RAM + CPU cycles instead of expensive VRAM + GPU cycles.
Questions
"Gated residual connections, make sense. So now do the authors also compare gated residual connections within the network between the layers to just these gated connections to the embedding table and do they justify that specifically?"