Intro to Routing: Mixture-of-Experts and Expert Choice
I derive MoE and EC from first principles.
In this post, I’ll cover routing mechanisms for large language models. People often talk about Mixture-of-Experts and Expert Choice, so my goal is to give a first-principles walkthrough that explains how these methods arise naturally. You can think of this as how I would have derived them myself, or as an ex post facto explanation that clarifies the logic behind their design.
Mixture-of-Experts (MoE)
Historical Roots of MoE
The engineering motivation behind MoE is straightforward. The goal is to take $N$ expert functions $f_i$ and compute a weighted average of their outputs, weighted by how confident the model is in each expert.

The basic idea is simple. First, compute logits for each expert, typically with a linear router:

$$z(x) = W_g x \in \mathbb{R}^N.$$

Then convert these logits into a probability distribution:

$$g_i(x) = \operatorname{softmax}(z(x))_i = \frac{\exp(z_i(x))}{\sum_{j=1}^{N} \exp(z_j(x))}.$$

Finally, compute the convex combination of expert outputs:

$$y = \sum_{i=1}^{N} g_i(x)\, f_i(x).$$

Training proceeds exactly as it does for a standard feedforward layer. This formulation matches the original work of Jacobs et al. (1991).
The drawback is that this approach requires evaluating every expert for every token, including experts with very low probability. This becomes expensive as N grows.
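To make this concrete, here is a minimal sketch of the dense formulation in PyTorch. It assumes a linear router and a small feedforward block per expert; the module and its names (`SoftMoE`, `n_experts`) are my own illustration, not taken from any particular codebase.

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Dense mixture of experts: every expert runs on every token."""

    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # produces the logits z(x)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, d_model), a batch of T tokens
        g = torch.softmax(self.router(x), dim=-1)                  # (T, N) gates
        outs = torch.stack([f(x) for f in self.experts], dim=-1)  # (T, d, N)
        return torch.einsum("tn,tdn->td", g, outs)                # convex combination
```

The `einsum` computes exactly $\sum_i g_i(x)\, f_i(x)$ per token, and every expert runs on every token, which is the cost problem just described.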
Top-1 Gating
Let’s say we want to perform “Top-1 gating,” where we select only the highest-scoring expert. Specifically, we want to compute all $g_i(x)$, pick the top expert $s = \operatorname{argmax}_i g_i(x)$, run only that expert, and set:

$$y = g_s(x)\, f_s(x).$$
This has a property you might find unexpected. Even if all $f_i$ point in roughly the same direction, the magnitude of $y$ is smaller than that of the raw output $f_s(x)$, since $f_s$ is scaled by $g_s \le 1$. You can argue that this is sometimes desirable, because lower confidence often corresponds to smaller updates in many other ML architectures. But this is a post hoc justification; the real reason is simply that attempts to renormalize $y$ tend to perform worse in practice.
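Continuing the hypothetical `SoftMoE` module from the sketch above, a Top-1 forward pass might look like this:

```python
def top1_forward(self, x: torch.Tensor) -> torch.Tensor:
    # x: (T, d_model)
    g = torch.softmax(self.router(x), dim=-1)  # (T, N) gating probabilities
    s = g.argmax(dim=-1)                       # (T,) winning expert per token
    y = torch.zeros_like(x)
    for i, expert in enumerate(self.experts):
        mask = s == i                          # tokens routed to expert i
        if mask.any():
            # output is scaled by g_s, so its magnitude shrinks when g_s < 1
            y[mask] = g[mask, i].unsqueeze(-1) * expert(x[mask])
    return y
```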
Backpropagation follows from the product rule. For the selected expert’s parameters $\theta_{f_s}$:

$$\frac{\partial \mathcal{L}}{\partial \theta_{f_s}} = \frac{\partial \mathcal{L}}{\partial y}\, g_s(x)\, \frac{\partial f_s(x)}{\partial \theta_{f_s}}.$$

By symmetry, the gradient for the router parameters $\theta_g$ is:

$$\frac{\partial \mathcal{L}}{\partial \theta_g} = \frac{\partial \mathcal{L}}{\partial y}\, f_s(x)\, \frac{\partial g_s(x)}{\partial \theta_g}.$$

For $i \neq s$, we have $\partial \mathcal{L} / \partial \theta_{f_i} = 0$, since those experts do not run. But $\partial \mathcal{L} / \partial \theta_{g_i}$ is not zero, because the softmax couples all logits $z_i$: every logit influences $g_s(x)$, and hence the scaling of $f_s(x)$.
The gradients are undefined at the exact boundaries where the identity of the top expert changes. But that’s typically fine, just like it’s not an issue for ReLU or other piecewise differentiable functions.
The major problem is that unused experts never improve. Training collapses to a solution where $g_s \approx 1$ for whichever expert happened to win early in training. Ideally, we would want every expert $f_j$ with $j \neq s$ to receive at least some tokens so it continues to learn.
So we define the proportion of tokens routed to expert $i$ in a batch of $T$ tokens:

$$p_i = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}[s_t = i],$$

where $s_t$ is the expert selected for token $x_t$. You might attempt to regularize this by optimizing:

$$\mathcal{L} + \lambda\, \mathrm{KL}(p \,\|\, U),$$

where $U$ is the uniform distribution over experts. This would flatten the token allocation and, in principle, could prevent collapse.
But since $p_i$ above depends on an argmax (through $s_t$), the gradient of $p_i$ is zero almost everywhere. The model receives no useful signal from this penalty.

As a result, we need some other differentiable penalty that discourages any single expert from dominating the routing. The goal is to reduce the confidence $g_i$ for experts that receive too many tokens and increase it for experts that receive too few.
Flattening the Argmax Distribution
Since $p_i$ isn’t useful for differentiation, we try using the Gumbel-max trick, which provides a way to make the argmax behave like a soft, differentiable sampling process. As a general rule, if $z_j$ are logits and $\varepsilon_j \sim \mathrm{Gumbel}(0, 1)$ are i.i.d., then:

$$\Pr\!\left[\operatorname{argmax}_j\,(z_j + \varepsilon_j) = i\right] = \frac{\exp(z_i)}{\sum_j \exp(z_j)} = \operatorname{softmax}(z)_i.$$

This distribution gives us $\Pr[s_t = i]$ for each expert $i$. (In principle, we could even use the noisy sampling in the forward pass.) More importantly, this lets us compute an expected load for each expert and attempt to flatten it.
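As a quick numerical sanity check (my own illustration, not from the papers), we can verify the Gumbel-max identity by Monte Carlo:

```python
import torch

torch.manual_seed(0)
z = torch.tensor([1.0, 0.5, -0.3, 2.0])  # logits for 4 hypothetical experts
n = 200_000
# Gumbel(0, 1) samples: -log(-log(U)) with U ~ Uniform(0, 1)
eps = -torch.log(-torch.log(torch.rand(n, len(z))))
counts = torch.bincount((z + eps).argmax(dim=-1), minlength=len(z))
print(counts.float() / n)        # empirical Pr[argmax = i]
print(torch.softmax(z, dim=0))   # matches up to Monte Carlo error
```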
In our case, let $z_j$ be the logit that produces the gating probability $g_j(x_t) = \operatorname{softmax}(z(x_t))_j$. Then we assume:

$$s_t = \operatorname{argmax}_j \left( z_j(x_t) + \varepsilon_j \right), \qquad \varepsilon_j \sim \mathrm{Gumbel}(0, 1).$$

This implies:

$$\Pr[s_t = i] = g_i(x_t),$$

and therefore the expected load for expert $i$ is:

$$\mathbb{E}[\mathrm{load}_i] = \frac{1}{T} \sum_{t=1}^{T} g_i(x_t).$$
With this expected load vector, we can now try to flatten it. Possible approaches include:
- $\mathrm{KL}(\mathbb{E}[\mathrm{load}] \,\|\, U)$
- L2 distance to uniform
- Entropy maximization
A common alternative is to minimize the coefficient of variation (or a similar quantity), which flattens the distribution without the numerical instability of KL or entropy when some components get small:

$$\mathrm{CV}(\mathrm{load}) = \frac{\operatorname{std}(\mathrm{load})}{\operatorname{mean}(\mathrm{load})}.$$

We add an auxiliary term:

$$\mathcal{L}_{\mathrm{aux}} = \lambda \cdot \mathrm{CV}\!\left(\mathbb{E}[\mathrm{load}]\right)^2,$$

where the CV is squared for optimization convenience. So the full loss becomes:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{task}} + \lambda \cdot \mathrm{CV}\!\left(\mathbb{E}[\mathrm{load}]\right)^2.$$
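Here is a minimal sketch of the CV-based penalty, assuming the expected load is estimated by the mean gate probability over the batch (the function name and the epsilon guard are illustrative):

```python
def cv_squared_loss(gates: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    # gates: (T, N) softmax outputs of the router for a batch of T tokens
    expected_load = gates.mean(dim=0)  # differentiable E[load_i], shape (N,)
    # coefficient of variation squared: variance / mean^2
    return expected_load.var() / (expected_load.mean() ** 2 + eps)
```

The training objective would then be something like `task_loss + lam * cv_squared_loss(gates)` for a small coefficient `lam`.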
Practical Implementation Today
In modern MoE implementations, a simpler surrogate auxiliary loss is used. It’s not statistically derived or theoretically clean, but it works well in practice:

$$\mathcal{L}_{\mathrm{aux}} = \lambda \cdot N \sum_{i=1}^{N} p_i\, q_i, \qquad q_i = \frac{1}{T} \sum_{t=1}^{T} g_i(x_t),$$

where $p_i$ is the observed fraction of tokens routed to expert $i$ and $q_i$ is the mean router probability for expert $i$.
In theory, you might reach this form by starting from the objective:

$$\min_{p} \sum_{i=1}^{N} p_i^2.$$

Since the $p_i$ sum to 1, minimizing this objective encourages a uniform allocation ($p_i = 1/N$), as the method of Lagrange multipliers shows.

Then, assuming we really are sampling with Gumbel noise, and given a sufficiently large batch, the empirical load fraction can be approximated via the soft probabilities $g_i$:

$$p_i \approx q_i = \frac{1}{T} \sum_{t=1}^{T} g_i(x_t),$$

where $q_i$ is a differentiable surrogate for the observed load proportion. Using this approximation on one of the two factors:

$$\sum_i p_i^2 \approx \sum_i p_i\, q_i,$$

which, up to the scaling constant $\lambda N$, is the surrogate used above.
The effect is straightforward. If an expert receives too many tokens (that is, if $p_i$ is too large), the loss increases, which pushes the model to reduce $g_i$.

The key detail is that the derivative of $p_i$ is locally zero, since $p_i$ depends on an argmax that does not change in a small neighborhood. For this reason, implementations mark $p_i$ as a “no-grad” quantity (a stop-gradient).
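A sketch of the surrogate, with the two factors of the loss made explicit (the scaling by $N$ follows the common convention, but treat the details as illustrative):

```python
def load_balancing_loss(gates: torch.Tensor, top1: torch.Tensor) -> torch.Tensor:
    # gates: (T, N) soft router probabilities; top1: (T,) argmax expert indices
    n_experts = gates.shape[-1]
    # p_i: observed fraction of tokens routed to expert i; built from argmax
    # indices, so it is automatically detached from the autograd graph
    p = torch.bincount(top1, minlength=n_experts).float() / top1.numel()
    # q_i: mean router probability per expert; this factor carries the gradient
    q = gates.mean(dim=0)
    return n_experts * torch.sum(p * q)
```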
Compared to this heuristic approach, the formulation based on Gumbel noise and the coefficient of variation is mathematically cleaner. The stochastic forward pass aligns naturally with the probabilistic reasoning used in the backward pass, and the load penalties follow from that framework without ad hoc constructions. Despite this conceptual clarity, the surrogate loss above remains more widely used.
Generalizing to Top-K
If we want to use the Top-K experts rather than Top-1, we first need a constructive definition of the statistical process. As before, we compute $g_i(x_t)$ for each expert $i$. But this time, we select the Top-K experts instead of a single one.
We’ll try to take the same approach as before. To compute the expected load for expert $i$, we need

$$\Pr[i \in S_t],$$

where $S_t$ is the set of the $K$ experts selected for token $x_t$. The expected load is:

$$\mathbb{E}[\mathrm{load}_i] = \frac{1}{T} \sum_{t=1}^{T} \Pr[i \in S_t].$$

Just like in the Top-1 case, we can minimize $\mathrm{CV}\!\left(\mathbb{E}[\mathrm{load}]\right)^2$. In theory, $\mathbb{E}[\mathrm{load}_i]$ is differentiable because Top-K selection over Gumbel-perturbed logits follows a Plackett–Luce distribution. But in practice, the gradient is computationally intractable. So we end up using the same surrogate as in the previous section.
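Continuing the hypothetical `SoftMoE` sketch, Top-K routing might look as follows. Note that the selected gates are not renormalized, matching the derivation above:

```python
def topk_forward(self, x: torch.Tensor, k: int = 2) -> torch.Tensor:
    g = torch.softmax(self.router(x), dim=-1)  # (T, N)
    topv, topi = torch.topk(g, k, dim=-1)      # top-K gates and indices per token
    y = torch.zeros_like(x)
    for j in range(k):                         # for each of the K slots
        for i, expert in enumerate(self.experts):
            mask = topi[:, j] == i             # tokens whose j-th choice is expert i
            if mask.any():
                y[mask] += topv[mask, j].unsqueeze(-1) * expert(x[mask])
    return y
```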
Finally, it’s worth noting that the original formulation in Shazeer et al. (2017) used a different approach. Instead of Gumbel noise, the authors added normal noise to the logits, and they renormalized the weights gi after selecting the Top-K elements. The methodology and derivation differ from the conceptual framework presented above.
Expert Choice (EC)
Expert Choice (Zhou et al., 2022) observes that the biggest pitfall of MoE is load imbalance. If too many tokens want the same expert, that expert overloads and receives an outsized share of the gradient. If too few tokens go to an expert, that expert’s weights stagnate and it fails to train. That is why we needed the hacky regularization term earlier.
Imagine you’re Google serving inference at massive scale. You don’t care about routing every token perfectly. You care about keeping all experts busy, avoiding hot spots, and guaranteeing predictable latency.
Rather than computing $g_i(x_t)$ for each token and selecting $\operatorname{argmax}_i g_i$ (letting the tokens pick the experts), you can let the experts pick which tokens they want to serve. Each expert receives a fixed budget of $M$ tokens and selects the $M$ tokens for which it believes it is most useful.
You still evaluate $g_i(x_t)$ for all experts $i$ and all tokens $t$ in the batch. For each expert $i$, you select:

$$T_i = \operatorname{TopM}_t\, g_i(x_t),$$

the set of the $M$ tokens with the largest gate values for expert $i$. Each expert receives exactly $M$ tokens (or fewer if the batch is small). How this affects backpropagation depends on what you do when multiple experts select the same token. For simplicity, assume the output for token $t$ is the sum:

$$y_t = \sum_{i \,:\, t \in T_i} g_i(x_t)\, f_i(x_t).$$

(Note that real EC implementations use more complicated aggregations.) In this case, backpropagation for the expert parameters is exactly the same as in MoE. If an expert is not selected, it receives no gradient. If it is selected, then

$$\frac{\partial \mathcal{L}}{\partial \theta_{f_i}} = \sum_{t \in T_i} \frac{\partial \mathcal{L}}{\partial y_t}\, g_i(x_t)\, \frac{\partial f_i(x_t)}{\partial \theta_{f_i}}.$$

The gradient with respect to the router parameters is surprisingly simple:

$$\frac{\partial \mathcal{L}}{\partial \theta_g} = \sum_{i} \sum_{t \in T_i} \frac{\partial \mathcal{L}}{\partial y_t}\, f_i(x_t)\, \frac{\partial g_i(x_t)}{\partial \theta_g}.$$
Notice that we didn’t have to differentiate through the Top-M operator at all, since gradients only flow through $g_i(x_t)$ for the tokens actually selected by expert $i$.

There’s a glaring pitfall here. What if a token isn’t selected by any expert? In practice, implementations handle this by increasing $M$ or by routing stray tokens to the expert with the largest $g_i(x_t)$.
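A minimal sketch of Expert Choice under the simple sum aggregation above, again reusing the hypothetical `SoftMoE` fields (real EC implementations differ in the aggregation and in how they batch the gather/scatter):

```python
def expert_choice_forward(self, x: torch.Tensor, m: int) -> torch.Tensor:
    g = torch.softmax(self.router(x), dim=-1)  # (T, N) token-expert affinities
    m = min(m, x.shape[0])                     # budget cannot exceed batch size
    y = torch.zeros_like(x)
    for i, expert in enumerate(self.experts):
        topv, topt = torch.topk(g[:, i], m)    # expert i picks its M best tokens
        # tokens selected by no expert are simply left at y = 0 in this sketch
        y[topt] += topv.unsqueeze(-1) * expert(x[topt])
    return y
```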
Future Directions
I hope this post was informative. In the future, I plan to cover other routing mechanisms, such as Mixture-of-Depths (MoD) and the Switch Transformer, a paper by authors I respect a lot.
Conceptually, MoD pushes sparsity in a more radical direction. Instead of selecting which expert network should process a token, the router selects which layers of the transformer the token should flow through. This sounds simple, but it breaks a core assumption underlying MoE, which is that each expert implements a well-defined function f(x) at a fixed location in the computation graph. In MoD, the location where a layer applies its transformation is no longer fixed, and the domain of each layer becomes ambiguous. The resulting interactions make MoD a significantly harder routing problem than MoE or Expert Choice.
When the research landscape matures further, or when I have time to think more deeply about these issues, I plan to revisit this topic with a dedicated analysis.