In this post, I’ll show how a small set of reasonable assumptions can recover the Transformer attention mechanism. Some parts of attention are theoretically motivated, while others are arbitrary choices. I’ll explicitly call out which is which.
To see why attention exists, it helps to recall its predecessor: the recurrent neural network (RNN). Classic encoder-decoder RNNs process a sequence token by token. Each new hidden state incorporates the current token and the previous hidden state, producing a vector you can think of as an “accumulator” of everything seen so far. After ingesting the final token, that accumulated vector is repeatedly fed to the decoder, which predicts output tokens until it emits a STOP symbol.
The problem is long-range dependence. If an important token appeared far earlier in the sequence (say, the first of 10,000 tokens), its influence becomes diluted as the RNN processes additional tokens. The model simply forgets.
Ideally, the model should use all previously seen tokens to compute the information needed to predict the next token, weighting each earlier token by how relevant it is for that prediction. That suggests computing a relevance score between a position i and each other position j, and then combining some function of the embeddings accordingly.
Formally, define a scalar relevance function that takes the embeddings X along with indices i and j:
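$$u(X, i, j) \in \mathbb{R}$$

where $X = (x_1, \dots, x_N)$ is the sequence of token embeddings.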
We work in embedding space rather than raw token IDs to avoid meaningless geometric assumptions (e.g., token 9 is not inherently “closer” to token 10). One-hot encodings would also avoid this problem, but they are far sparser and higher-dimensional than learned embeddings.
Then the model’s output vector at position i can be written as:
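$$y_i = G\big(\{(x_j,\ u(X, i, j))\}_j\big)$$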
where G aggregates some function of the embeddings xj alongside their relevance to position i. (During autoregressive generation, i corresponds to the most recently produced token.) We don’t yet know the form of G or u. Our goal is to characterize the simplest constraints that lead directly to Transformer-style attention.
Observation: Enforce permutation symmetry
We want to constrain the space of possible functions for G and u.
Once we have the relevance scores u(i, j), the output yi should not depend on the order in which the pairs (xj, u(i, j)) are provided. In other words, if we reorder the elements indexed by j, the result should remain the same. This requires G to be “permutation-invariant” over the set {(xj, u(i, j))}j.
The Deep Sets theorem (Zaheer et al., 2017) tells us that any such function can be written as:
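$$y_i = G\big(\{(x_j,\ u(i, j))\}_j\big) = \rho\Big(\sum_j \phi\big(x_j,\ u(i, j)\big)\Big)$$

where, from here on, I abbreviate $u(X, i, j)$ as $u(i, j)$.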
Here ρ and φ are arbitrary differentiable functions. We fix the index i, since the invariance only applies over j. Differentiability ensures that the overall model can be trained with gradient-based methods.
At this point, ρ and φ are still completely general, and we also need to define u. We will impose further assumptions to narrow down their form.
Assumption 1: ρ is the identity function
ρ could output many different types of objects. For example:
It could output a scalar, but that would discard most of the information from the embeddings.
It could output an O(N)-dimensional vector, with one component per input element, but that would make the output scale with sequence length and defeat the purpose of summarizing information.
It could output a vector in some intermediate dimension, or even map into a different space/manifold entirely.
All of these are technically possible. In practice, Transformers set ρ to be the identity function, so:
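$$y_i = \sum_j \phi\big(x_j,\ u(i, j)\big)$$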
This simplifies the structure of G and lets us focus on constraining φ and u.
Assumption 2: Relevance-contribution proportionality
Even with ρ set to the identity, φ could be any function of the embedding xj and the relevance score u(i, j). To simplify the form, we assume that if a token’s relevance is scaled by a constant k, its contribution scales by the same factor:
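$$\phi\big(x_j,\ k \cdot u(i, j)\big) = k \cdot \phi\big(x_j,\ u(i, j)\big)$$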
This is not the only possible relationship. For example, we could have chosen a quadratic or some other monotonic transformation in u(i, j). The key requirement is simply that φ should separate into:
A scalar measuring how important xj is
A vector capturing what xj contributes
Under the linear version of this assumption, we get:
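$$\phi\big(x_j,\ u(i, j)\big) = \phi\big(x_j,\ u(i, j) \cdot 1\big) = u(i, j)\,\phi(x_j,\ 1)$$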
Define v(xj) = φ(xj, 1), yielding:
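$$y_i = \sum_j u(i, j)\, v(x_j)$$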
This makes φ explicitly separable, where u(i, j) purely controls magnitude (relevance), and v(xj) determines the content being contributed.
Assumption 3: Linear change of coordinates
At this point, v(xj) could be any function of xj. To simplify the model and keep it efficient to compute, we assume v is a linear transformation of xj:
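$$v(x_j) = W_V\, x_j$$

for some learned matrix $W_V$ (the value projection).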
Substituting this into the previous expression gives:
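$$y_i = \sum_j u(i, j)\, W_V\, x_j$$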
This means each token contributes a linearly transformed version of its embedding, weighted by its relevance score u(i, j).
Observation: Constrain u for efficient parallel computation
We want u(i, j) to be computable efficiently on hardware like GPUs. Here, “efficient” refers to low sequential depth in the computational graph, not necessarily a low number of arithmetic operations. GPUs can execute many multiplications in parallel, but long chains of dependent operations create bottlenecks. For example, a recurrent computation with O(N) sequential steps is slow for long sequences, but a matrix multiply has O(1) sequential depth and is highly parallelizable.
If we allowed a fully general relevance function such as:
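$$u(i, j) = f_\theta\big(x_i,\ x_j,\ \mathrm{context}(X)\big)$$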
where $f_\theta$ is some neural network and context(X) examines all tokens at once, we would need to evaluate this network O(N²) times for a single layer, which is too slow.
Alternatively, we could define a single model:
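$$U = g_\theta(X) \in \mathbb{R}^{N \times N}$$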
that outputs all pairwise relevance values directly. But that would require storing and training parameters of size O(N²), which locks the model to a fixed input length and scales poorly.
To keep computation parallelizable and scalable, we restrict u to be built from tensor operations such as:
Linear projections
Element-wise functions
Inner products
Reductions like sums
and avoid control flow or long sequential recurrences.
Assumption 4: Dot product similarity for u
A simple way to score the interaction between xi and xj is with a dot product. However, we don’t necessarily want similarity in the embedding space; we want similarity in a space optimized for relevance.
So, as we did for v(xj), we first apply learned linear projections:
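$$q_i = W_Q\, x_i, \qquad k_j = W_K\, x_j$$

with learned matrices $W_Q$ and $W_K$.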
Then we define the relevance score as:
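$$u'(i, j) = \langle q_i,\ k_j \rangle = q_i^\top k_j$$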
We denote this version as u’ because additional modifications will be applied later.
Assumption 5: Pick a normalization for u
Next, we want the relevance scores u’(i, j) to measure relative importance. If the same constant were added to all scores, or if they were scaled uniformly, the ranking of tokens should not change. This motivates applying a differentiable normalization function over j.
There are several possibilities (e.g., softmax, Gumbel-Softmax). In practice, Transformers use softmax.
One final issue: the dot product ⟨qi, kj⟩ tends to grow in magnitude with the key/query dimension dk. To prevent extremely large values from dominating the softmax, we scale the logits:
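$$u'(i, j) = \frac{\langle q_i,\ k_j \rangle}{\sqrt{d_k}}$$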
Applying softmax normalization over j then gives:
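$$u(i, j) = \frac{\exp\big(u'(i, j)\big)}{\sum_{j'} \exp\big(u'(i, j')\big)}$$

and, substituting back into $y_i = \sum_j u(i, j)\, W_V\, x_j$:

$$y_i = \sum_j \mathrm{softmax}_j\!\left(\frac{q_i^\top k_j}{\sqrt{d_k}}\right) W_V\, x_j$$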
This is exactly the scaled dot-product attention used in Transformers.
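To make this concrete, here is a minimal single-head NumPy sketch of the mechanism we just derived (no masking, no multiple heads; the variable names and dimensions are just illustrative choices, not a reference implementation):

```python
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Single-head attention over a sequence of embeddings X with shape (N, d_model)."""
    Q = X @ W_q                       # row i plays the role of q_i
    K = X @ W_k                       # row j plays the role of k_j
    V = X @ W_v                       # row j plays the role of v(x_j)
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)   # u'(i, j) = <q_i, k_j> / sqrt(d_k)
    logits -= logits.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over j
    return weights @ V                # y_i = sum_j u(i, j) v(x_j)

# Example usage with random inputs
rng = np.random.default_rng(0)
N, d_model, d_k = 5, 16, 8
X = rng.normal(size=(N, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_model))
Y = scaled_dot_product_attention(X, W_q, W_k, W_v)  # Y has shape (N, d_model)
```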
Does a better attention mechanism exist?
So there you have it. If we impose the following assumptions:
ρ is the identity function
Each token’s contribution scales proportionally with its relevance score
A linear transformation maps embeddings to the value, key, and query vectors
Relevance is based on a dot product
Relevance scores are normalized with a softmax
we obtain the exact scaled dot-product attention used in Transformers.
While some of these choices were forced by the constraints we imposed, others weren’t theoretically required. There may be better options for ρ, for the similarity measure, or for the normalization function. Even more fundamentally, the Deep Sets form at the beginning came from imposing permutation invariance, yet in practice we reinject order information through positional encodings. Exploring these variations could reveal new attention mechanisms with different computational or modeling advantages.

