Attention Residuals

Part I

The Flaw Hiding in Every Transformer

Every modern language model is a stack of sequential layers. GPT, Gemini, Claude, Llama, DeepSeek. Data enters at the bottom, passes through dozens or hundreds of these layers, and exits at the top as a prediction. Each layer performs a different transformation, progressively refining the representation from raw token patterns toward abstract reasoning.

This depth is only possible because of an idea from 2015 called residual connections. Before residual connections, training deep networks was practically impossible. The learning signal, which flows backward through the network during training, would shrink to nearly zero by the time it reached the early layers. This is the vanishing gradient problem. The fix was elegant: let the input to each layer skip past it and get added directly to the output. This creates a highway for information and gradients to pass through unimpeded.

That single idea unlocked modern deep learning. We went from networks with a few dozen layers to networks with hundreds. Every transformer architecture since 2017 uses this design. It is foundational.

And it has a critical flaw.

Because every layer's output gets added together with equal weight of 1, the cumulative signal grows without bound as depth increases. By layer 100, the hidden state is a massive pile of accumulated contributions from every previous layer. The contribution of any single layer, especially early ones, becomes statistically invisible.

The Signal Dilution Problem

The Kimi team measured this directly: hidden state magnitudes grow roughly linearly with depth in standard transformers. This creates two cascading failures. First, early layers lose their influence on the final output. Second, later layers must produce disproportionately large outputs just to have any noticeable effect, because they are competing against the cumulative weight of everything before them.

Researchers have tried scaling tricks: multiplying residuals by small constants, adding learnable gates, using multi stream recurrences. None of these solved the core issue. They all still feed each layer a single aggregated result of everything before it. The soup stays mixed.

Part II

A Problem We Already Solved

Here is where the paper gets genuinely clever. The Kimi team recognized that this exact problem has an exact structural analog. And the field already solved that analog a decade ago.

Before transformers, the dominant architecture for language was the Recurrent Neural Network (RNN). An RNN processes text one token at a time, compressing everything it has seen so far into a single fixed size hidden state. To process word 100 in a sentence, it uses a summary of words 1 through 99. The failure mode is obvious in hindsight: by the end of a long paragraph, the information from the beginning has been compressed and overwritten so many times that it is practically gone.

The transformer architecture fixed this by introducing attention over the sequence dimension. Instead of a single compressed memory, every token can look back directly at every previous token and compute learned, content dependent weights that determine what is relevant. The model decides what matters on the fly. That single change made modern AI possible.

The Structural Parallel

The core insight. Standard residual connections create the exact same bottleneck along the depth dimension that RNNs had along the sequence dimension. Each layer receives a single compressed summary of all previous layers, with no ability to selectively retrieve specific information from specific depths. RNNs compress along time. Residual connections compress along depth. Transformers fixed the first. Attention Residuals fix the second.

Part III

The Fix: Attention Over Depth

The mechanism is straightforward if you understand how transformer attention works. In standard token attention, every token has a query ("what am I looking for?"), every previous token has a key ("here is what I contain"), and a value ("here is my actual data"). The model computes which keys match each query and retrieves the corresponding values with learned weights.

Attention Residuals apply this exact mechanism along the depth axis instead. Each layer gets a learned query vector that asks: "Which previous layers hold information relevant to what I am currently processing?" Each previous layer's output serves as both key and value. The model computes softmax attention weights over all preceding layer outputs and assembles a weighted combination.

Full Attention Residuals (AttnRes)

The critical difference: weights are not fixed at 1. They are computed dynamically based on what the network is actually processing. A layer handling a logical inference might pull heavily from early reasoning layers. A layer processing emotional tone might reach for entirely different depths. The model routes information through itself based on what the input requires.

# Standard residual (every transformer since 2017)
h[l] = h[l-1] + f(h[l-1])          # fixed weight = 1, always

# Attention Residual
h[l] = Σ alpha[i→l] * v[i]         # learned, input dependent weights
       # alpha = softmax(query_l · key_i)
       # weights sum to 1 → bounded signal

Each layer has a learned query vector. The keys and values come from previous layer outputs. The softmax normalization ensures the weights sum to one, which bounds the hidden state magnitude and prevents the unbounded growth that plagues standard residuals. The signal stays stable no matter how deep the network goes.

Part IV

Making It Work at Scale

Full Attention Residuals have an obvious engineering problem. If every layer attends to every previous layer, memory cost grows as O(L·d) where L is the layer count and d is the hidden dimension. For a model with hundreds of layers distributed across multiple server racks connected by fiber optic cables, this creates an explosion of cross machine communication. Each layer would need to fetch data from potentially every previous server. That is not viable at the scale of frontier models.

The Kimi team solved this with Block Attention Residuals. The model is partitioned into blocks, roughly 8 in practice. Within each block, layers accumulate normally using standard residual connections. At block boundaries, the accumulated representation gets stored as a summary. The attention mechanism then operates over these block level summaries instead of individual layer outputs.

Block Attention Residuals

The team tested block counts from 1 (equivalent to full AttnRes) up to 32. Block counts of 2, 4, and 8 all reached nearly identical validation loss. Beyond 16, the benefit degraded back toward baseline. Eight blocks became the default: small enough for manageable cross machine communication, large enough to capture most of the benefit.

Training overhead

Under 4%

Inference latency overhead

Under 2%

Memory per layer vs standard

5.5d vs 3d (standard) vs 34d (mHC)

Architecture changes needed

None. Drop in replacement.

These numbers matter. Block AttnRes is not a research prototype that requires a new training infrastructure. You swap the residual connection module. Everything else stays the same. The attention heads, the feed forward layers, the routing logic, the optimizer, the data pipeline. Unchanged.

Part V

What the Numbers Say

The Kimi team trained models with and without AttnRes across five different sizes and integrated it into their Kimi Linear architecture (48B total parameters, 3B activated, trained on 1.4T tokens). The improvements were consistent across every size and every benchmark.

1.25x Less compute needed to reach the same performance level

+7.5 Points on GPQA Diamond (graduate level science)

Stable Signal magnitude bounded across all depths (no exponential growth)

Even Gradient distribution across all layers (balanced learning)

The 1.25x compute efficiency alone is significant. At the scale where a single training run costs tens of millions of dollars, that translates directly into millions saved or a meaningfully more capable model for the same budget. The paper notes that AttnRes outperforms DeepSeek's multi head constraint hyperconnections (mHC) on the same benchmarks while using significantly less memory per layer.

The reasoning improvements are the more revealing result. The biggest jumps appear on tasks that require long chains of multi step reasoning: graduate level science, mathematics, code generation. These are exactly the domains where signal dilution hurts most, because solving them requires the model to hold early reasoning steps in focus while performing dozens of subsequent operations. When early signals survive instead of getting buried, the model can actually use them.

The paper also ran a depth versus width experiment that deserves attention. They trained 25 models with identical parameter counts but different shapes, ranging from short and wide to deep and narrow. Both standard and AttnRes models improved as they went deeper. But standard models hit a wall: past a certain depth, performance collapsed as dilution took over. AttnRes models kept improving. The constraint that has limited model depth since 2015 is gone.

Depth is no longer a liability. It is an advantage.

Part VI

What the Model Learns to Do

The most interesting finding in the paper is not the benchmark improvements. It is the attention pattern visualizations.

When the researchers mapped how layers attend to each other, three distinct behaviors emerged. First, locality: most layers attend most strongly to their immediate predecessors, preserving step by step processing. Second, long range connections: certain deep layers suddenly reach all the way back to the very first layers, skipping everything in between. The model learns to revisit its original premises when the computation demands it. Third, specialization: different layers develop different functional roles. Some focus locally, acting as working memory. Others act as global coordinators, pulling information from across the entire depth of the network.

Emergent Attention Patterns

This is no longer a static pipeline. The model dynamically routes information through itself based on what the input requires. For each input, it constructs a custom pathway through its layers, uses it, and moves on. The architecture becomes adaptive in a way that standard residual connections physically cannot support.

The practical takeaway. Attention Residuals are a drop in replacement for standard residual connections with minimal overhead. They fix a measurable structural flaw, improve performance across all evaluated benchmarks, and unlock deeper architectures. The paper includes open source code and reference implementations. If you are building or training transformer models, this is worth integrating. If you are following how these models are designed, this is worth understanding because the constraint it removes (useful depth) has shaped every architecture decision for the past decade.

Residual connections were the ceiling preventing useful depth. With that ceiling removed, the design space for transformer architectures expands considerably. Deeper models with specialized layers, dynamic routing, and long range internal connections become practical to build and train. The paper does not claim this is the final architecture. But it removes one of the fundamental constraints that has shaped model design since 2015. And sometimes removing the right constraint is the thing that changes everything.

AttentionResiduals

The Flaw Hiding in Every Transformer

A Problem We Already Solved

The Fix: Attention Over Depth

Making It Work at Scale

What the Numbers Say

What the Model Learns to Do

Attention
Residuals