Intro to Routing: Mixture-of-Experts and Expert Choice
I derive MoE and EC from first principles.
In this post, I’ll cover routing mechanisms for large language models. People often talk about Mixture-of-Experts and Expert Choice, so my goal is to give a first-principles walkthrough that explains how these methods arise naturally. You can think of this as how I would have derived them myself, or as an ex post facto explanation that clarifies the logic behind their design.
Mixture-of-Experts (MoE)
Historical Roots of MoE
The engineering motivation behind MoE is straightforward. The goal is to take $N$ expert functions $f_i$ and compute a weighted average of their outputs, weighted by how confident the model is in each expert.

The basic idea is simple. First, compute logits for each expert, typically with a linear router:

$$z(x) = W_g x \in \mathbb{R}^N.$$

Then convert these logits into a probability distribution:

$$g_i(x) = \operatorname{softmax}(z(x))_i = \frac{\exp(z_i(x))}{\sum_{j=1}^{N} \exp(z_j(x))}.$$

Finally, compute the convex combination of expert outputs:

$$y = \sum_{i=1}^{N} g_i(x)\, f_i(x).$$

Training proceeds exactly as it does for a standard feedforward layer. This formulation matches the original work of Jacobs et al. (1991).
The drawback is that this approach requires evaluating every expert for every token, including experts with very low probability. This becomes expensive as N grows.
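To make this concrete, here is a minimal sketch of the dense formulation in PyTorch. It assumes a linear router and a small feedforward block per expert; the module and its names (`SoftMoE`, `n_experts`) are my own illustration, not taken from any particular codebase.

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Dense mixture of experts: every expert runs on every token."""

    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # produces the logits z(x)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, d_model), a batch of T tokens
        g = torch.softmax(self.router(x), dim=-1)                  # (T, N) gates
        outs = torch.stack([f(x) for f in self.experts], dim=-1)  # (T, d, N)
        return torch.einsum("tn,tdn->td", g, outs)                # convex combination
```

The `einsum` computes exactly $\sum_i g_i(x)\, f_i(x)$ per token, and every expert runs on every token, which is the cost problem just described.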
Top-1 Gating
Let’s say we want to perform “Top-1 gating,” where we select only the highest-scoring expert. Specifically, we want to compute all $g_i(x)$, pick the top expert $s = \operatorname{argmax}_i g_i(x)$, run only that expert, and set:

$$y = g_s(x)\, f_s(x).$$
This has a property you might find unexpected. Even if all $f_i$ point in roughly the same direction, the magnitude of $y$ is smaller than that of the raw output $f_s(x)$, since $f_s$ is scaled by $g_s \le 1$. You can argue that this is sometimes desirable, because lower confidence often corresponds to smaller updates in many other ML architectures. But this is a post hoc justification; the real reason is simply that attempts to renormalize $y$ tend to perform worse in practice.
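Continuing the hypothetical `SoftMoE` module from the sketch above, a Top-1 forward pass might look like this:

```python
def top1_forward(self, x: torch.Tensor) -> torch.Tensor:
    # x: (T, d_model)
    g = torch.softmax(self.router(x), dim=-1)  # (T, N) gating probabilities
    s = g.argmax(dim=-1)                       # (T,) winning expert per token
    y = torch.zeros_like(x)
    for i, expert in enumerate(self.experts):
        mask = s == i                          # tokens routed to expert i
        if mask.any():
            # output is scaled by g_s, so its magnitude shrinks when g_s < 1
            y[mask] = g[mask, i].unsqueeze(-1) * expert(x[mask])
    return y
```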
Backpropagation follows from the product rule. For the selected expert’s parameters $\theta_{f_s}$:

$$\frac{\partial \mathcal{L}}{\partial \theta_{f_s}} = \frac{\partial \mathcal{L}}{\partial y}\, g_s(x)\, \frac{\partial f_s(x)}{\partial \theta_{f_s}}.$$

By symmetry, the gradient for the router parameters $\theta_g$ is:

$$\frac{\partial \mathcal{L}}{\partial \theta_g} = \frac{\partial \mathcal{L}}{\partial y}\, f_s(x)\, \frac{\partial g_s(x)}{\partial \theta_g}.$$

For $i \neq s$, we have $\partial \mathcal{L} / \partial \theta_{f_i} = 0$, since those experts do not run. But $\partial \mathcal{L} / \partial \theta_{g_i}$ is not zero, because the softmax couples all logits $z_i$: every logit influences $g_s(x)$, and hence the scaling of $f_s(x)$.
The gradients are undefined at the exact boundaries where the identity of the top expert changes. But that’s typically fine, just like it’s not an issue for ReLU or other piecewise differentiable functions.
The major problem is that unused experts never improve. Training collapses to a solution where $g_s \approx 1$ for whichever expert happened to win early in training. Ideally, we would want every expert $f_j$ with $j \neq s$ to receive at least some tokens so it continues to learn.
So we define the proportion of tokens routed to expert $i$ in a batch of $T$ tokens:

$$p_i = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}[s_t = i],$$

where $s_t$ is the expert selected for token $x_t$. You might attempt to regularize this by optimizing:

$$\mathcal{L} + \lambda\, \mathrm{KL}(p \,\|\, U),$$

where $U$ is the uniform distribution over experts. This would flatten the token allocation and, in principle, could prevent collapse.
But since $p_i$ above depends on an argmax (through $s_t$), the gradient of $p_i$ is zero almost everywhere. The model receives no useful signal from this penalty.

As a result, we need some other differentiable penalty that discourages any single expert from dominating the routing. The goal is to reduce the confidence $g_i$ for experts that receive too many tokens and increase it for experts that receive too few.
Flattening the Argmax Distribution
Since $p_i$ isn’t useful for differentiation, we try using the Gumbel-max trick, which provides a way to make the argmax behave like a soft, differentiable sampling process. As a general rule, if $z_j$ are logits and $\varepsilon_j \sim \mathrm{Gumbel}(0, 1)$ are i.i.d., then:

$$\Pr\!\left[\operatorname{argmax}_j\,(z_j + \varepsilon_j) = i\right] = \frac{\exp(z_i)}{\sum_j \exp(z_j)} = \operatorname{softmax}(z)_i.$$

This distribution gives us $\Pr[s_t = i]$ for each expert $i$. (In principle, we could even use the noisy sampling in the forward pass.) More importantly, this lets us compute an expected load for each expert and attempt to flatten it.
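As a quick numerical sanity check (my own illustration, not from the papers), we can verify the Gumbel-max identity by Monte Carlo:

```python
import torch

torch.manual_seed(0)
z = torch.tensor([1.0, 0.5, -0.3, 2.0])  # logits for 4 hypothetical experts
n = 200_000
# Gumbel(0, 1) samples: -log(-log(U)) with U ~ Uniform(0, 1)
eps = -torch.log(-torch.log(torch.rand(n, len(z))))
counts = torch.bincount((z + eps).argmax(dim=-1), minlength=len(z))
print(counts.float() / n)        # empirical Pr[argmax = i]
print(torch.softmax(z, dim=0))   # matches up to Monte Carlo error
```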
In our case, let $z_j$ be the logit that produces the gating probability $g_j(x_t) = \operatorname{softmax}(z(x_t))_j$. Then we assume:

$$s_t = \operatorname{argmax}_j \left( z_j(x_t) + \varepsilon_j \right), \qquad \varepsilon_j \sim \mathrm{Gumbel}(0, 1).$$

This implies:

$$\Pr[s_t = i] = g_i(x_t),$$

and therefore the expected load for expert $i$ is:

$$\mathbb{E}[\mathrm{load}_i] = \frac{1}{T} \sum_{t=1}^{T} g_i(x_t).$$
With this expected load vector, we can now try to flatten it. Possible approaches include:
- $\mathrm{KL}(\mathbb{E}[\mathrm{load}] \,\|\, U)$
- L2 distance to uniform
- Entropy maximization
A common alternative is to minimize the coefficient of variation (or a similar quantity), which flattens the distribution without the numerical instability of KL or entropy when some components get small:

$$\mathrm{CV}(\mathrm{load}) = \frac{\operatorname{std}(\mathrm{load})}{\operatorname{mean}(\mathrm{load})}.$$

We add an auxiliary term:

$$\mathcal{L}_{\mathrm{aux}} = \lambda \cdot \mathrm{CV}\!\left(\mathbb{E}[\mathrm{load}]\right)^2,$$

where the CV is squared for optimization convenience. So the full loss becomes:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{task}} + \lambda \cdot \mathrm{CV}\!\left(\mathbb{E}[\mathrm{load}]\right)^2.$$
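Here is a minimal sketch of the CV-based penalty, assuming the expected load is estimated by the mean gate probability over the batch (the function name and the epsilon guard are illustrative):

```python
def cv_squared_loss(gates: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    # gates: (T, N) softmax outputs of the router for a batch of T tokens
    expected_load = gates.mean(dim=0)  # differentiable E[load_i], shape (N,)
    # coefficient of variation squared: variance / mean^2
    return expected_load.var() / (expected_load.mean() ** 2 + eps)
```

The training objective would then be something like `task_loss + lam * cv_squared_loss(gates)` for a small coefficient `lam`.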
Practical Implementation Today
In modern MoE implementations, a simpler surrogate auxiliary loss is used. It’s not statistically derived or theoretically clean, but it works well in practice:

$$\mathcal{L}_{\mathrm{aux}} = \lambda \cdot N \sum_{i=1}^{N} p_i\, q_i, \qquad q_i = \frac{1}{T} \sum_{t=1}^{T} g_i(x_t),$$

where $p_i$ is the observed fraction of tokens routed to expert $i$ and $q_i$ is the mean router probability for expert $i$.
In theory, you might reach this form by starting from the objective:

$$\min_{p} \sum_{i=1}^{N} p_i^2.$$

Since the $p_i$ sum to 1, minimizing this objective encourages a uniform allocation ($p_i = 1/N$), as the method of Lagrange multipliers shows.

Then, assuming we really are sampling with Gumbel noise, and given a sufficiently large batch, the empirical load fraction can be approximated via the soft probabilities $g_i$:

$$p_i \approx q_i = \frac{1}{T} \sum_{t=1}^{T} g_i(x_t),$$

where $q_i$ is a differentiable surrogate for the observed load proportion. Using this approximation on one of the two factors:

$$\sum_i p_i^2 \approx \sum_i p_i\, q_i,$$

which, up to the scaling constant $\lambda N$, is the surrogate used above.
The effect is straightforward. If an expert receives too many tokens (that is, if $p_i$ is too large), the loss increases, which pushes the model to reduce $g_i$.

The key detail is that the derivative of $p_i$ is locally zero, since $p_i$ depends on an argmax that does not change in a small neighborhood. For this reason, implementations mark $p_i$ as a “no-grad” quantity (a stop-gradient).
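A sketch of the surrogate, with the two factors of the loss made explicit (the scaling by $N$ follows the common convention, but treat the details as illustrative):

```python
def load_balancing_loss(gates: torch.Tensor, top1: torch.Tensor) -> torch.Tensor:
    # gates: (T, N) soft router probabilities; top1: (T,) argmax expert indices
    n_experts = gates.shape[-1]
    # p_i: observed fraction of tokens routed to expert i; built from argmax
    # indices, so it is automatically detached from the autograd graph
    p = torch.bincount(top1, minlength=n_experts).float() / top1.numel()
    # q_i: mean router probability per expert; this factor carries the gradient
    q = gates.mean(dim=0)
    return n_experts * torch.sum(p * q)
```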
Compared to this heuristic approach, the formulation based on Gumbel noise and the coefficient of variation is mathematically cleaner. The stochastic forward pass aligns naturally with the probabilistic reasoning used in the backward pass, and the load penalties follow from that framework without ad hoc constructions. Despite this conceptual clarity, the surrogate loss above remains more widely used.
Generalizing to Top-K
If we want to use the Top-K experts rather than Top-1, we first need a constructive definition of the statistical process. As before, we compute $g_i(x_t)$ for each expert $i$. But this time, we select the Top-K experts instead of a single one.
We’ll try to take the same approach as before. To compute the expected load for expert $i$, we need

$$\Pr[i \in S_t],$$

where $S_t$ is the set of the $K$ experts selected for token $x_t$. The expected load is:

$$\mathbb{E}[\mathrm{load}_i] = \frac{1}{T} \sum_{t=1}^{T} \Pr[i \in S_t].$$

Just like in the Top-1 case, we can minimize $\mathrm{CV}\!\left(\mathbb{E}[\mathrm{load}]\right)^2$. In theory, $\mathbb{E}[\mathrm{load}_i]$ is differentiable because Top-K selection over Gumbel-perturbed logits follows a Plackett–Luce distribution. But in practice, the gradient is computationally intractable. So we end up using the same surrogate as in the previous section.
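Continuing the hypothetical `SoftMoE` sketch, Top-K routing might look as follows. Note that the selected gates are not renormalized, matching the derivation above:

```python
def topk_forward(self, x: torch.Tensor, k: int = 2) -> torch.Tensor:
    g = torch.softmax(self.router(x), dim=-1)  # (T, N)
    topv, topi = torch.topk(g, k, dim=-1)      # top-K gates and indices per token
    y = torch.zeros_like(x)
    for j in range(k):                         # for each of the K slots
        for i, expert in enumerate(self.experts):
            mask = topi[:, j] == i             # tokens whose j-th choice is expert i
            if mask.any():
                y[mask] += topv[mask, j].unsqueeze(-1) * expert(x[mask])
    return y
```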
Finally, it’s worth noting that the original formulation in Shazeer et al. (2017) used a different approach. Instead of Gumbel noise, the authors added normal noise to the logits, and they renormalized the weights gi after selecting the Top-K elements. The methodology and derivation differ from the conceptual framework presented above.
Expert Choice (EC)
Expert Choice (Zhou et al., 2022) observes that the biggest pitfall of MoE is load imbalance. If too many tokens want the same expert, that expert overloads and receives an outsized share of the gradient. If too few tokens go to an expert, that expert’s weights stagnate and it fails to train. That is why we needed the hacky regularization term earlier.
Imagine you’re Google serving inference at massive scale. You don’t care about routing every token perfectly. You care about keeping all experts busy, avoiding hot spots, and guaranteeing predictable latency.
Rather than computing $g_i(x_t)$ for each token and selecting $\operatorname{argmax}_i g_i$ (letting the tokens pick the experts), you can let the experts pick which tokens they want to serve. Each expert receives a fixed budget of $M$ tokens and selects the $M$ tokens for which it believes it is most useful.
You still evaluate $g_i(x_t)$ for all experts $i$ and all tokens $t$ in the batch. For each expert $i$, you select:

$$T_i = \operatorname{TopM}_t\, g_i(x_t),$$

the set of the $M$ tokens with the largest gate values for expert $i$. Each expert receives exactly $M$ tokens (or fewer if the batch is small). How this affects backpropagation depends on what you do when multiple experts select the same token. For simplicity, assume the output for token $t$ is the sum:

$$y_t = \sum_{i \,:\, t \in T_i} g_i(x_t)\, f_i(x_t).$$

(Note that real EC implementations use more complicated aggregations.) In this case, backpropagation for the expert parameters is exactly the same as in MoE. If an expert is not selected, it receives no gradient. If it is selected, then

$$\frac{\partial \mathcal{L}}{\partial \theta_{f_i}} = \sum_{t \in T_i} \frac{\partial \mathcal{L}}{\partial y_t}\, g_i(x_t)\, \frac{\partial f_i(x_t)}{\partial \theta_{f_i}}.$$

The gradient with respect to the router parameters is surprisingly simple:

$$\frac{\partial \mathcal{L}}{\partial \theta_g} = \sum_{i} \sum_{t \in T_i} \frac{\partial \mathcal{L}}{\partial y_t}\, f_i(x_t)\, \frac{\partial g_i(x_t)}{\partial \theta_g}.$$
Notice that we didn’t have to differentiate through the Top-M operator at all, since gradients only flow through $g_i(x_t)$ for the tokens actually selected by expert $i$.

There’s a glaring pitfall here. What if a token isn’t selected by any expert? In practice, implementations handle this by increasing $M$ or by routing stray tokens to the expert with the largest $g_i(x_t)$.
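A minimal sketch of Expert Choice under the simple sum aggregation above, again reusing the hypothetical `SoftMoE` fields (real EC implementations differ in the aggregation and in how they batch the gather/scatter):

```python
def expert_choice_forward(self, x: torch.Tensor, m: int) -> torch.Tensor:
    g = torch.softmax(self.router(x), dim=-1)  # (T, N) token-expert affinities
    m = min(m, x.shape[0])                     # budget cannot exceed batch size
    y = torch.zeros_like(x)
    for i, expert in enumerate(self.experts):
        topv, topt = torch.topk(g[:, i], m)    # expert i picks its M best tokens
        # tokens selected by no expert are simply left at y = 0 in this sketch
        y[topt] += topv.unsqueeze(-1) * expert(x[topt])
    return y
```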
Future Directions
I hope this post was informative. In the future, I plan to cover other routing mechanisms, such as Mixture-of-Depths (MoD) and the Switch Transformer, a paper by authors I respect a lot.
Conceptually, MoD pushes sparsity in a more radical direction. Instead of selecting which expert network should process a token, the router selects which layers of the transformer the token should flow through. This sounds simple, but it breaks a core assumption underlying MoE, which is that each expert implements a well-defined function f(x) at a fixed location in the computation graph. In MoD, the location where a layer applies its transformation is no longer fixed, and the domain of each layer becomes ambiguous. The resulting interactions make MoD a significantly harder routing problem than MoE or Expert Choice.
When the research landscape matures further, or when I have time to think more deeply about these issues, I plan to revisit this topic with a dedicated analysis.