<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Neel Somani's Blog]]></title><description><![CDATA[Former Citadel quant Neel Somani publishes research & essays ranging from machine learning to longevity. Neel previously founded Eclipse (raised $65M) and has incubated several blockchain infrastructure projects.]]></description><link>https://www.neelsomaniblog.com</link><image><url>https://substackcdn.com/image/fetch/$s_!hQpT!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabe6002d-878f-494b-abd0-78f82f7d3d87_412x412.png</url><title>Neel Somani&apos;s Blog</title><link>https://www.neelsomaniblog.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 04 Apr 2026 06:15:49 GMT</lastBuildDate><atom:link href="https://www.neelsomaniblog.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Neel Somani]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[njs@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[njs@substack.com]]></itunes:email><itunes:name><![CDATA[Neel Somani]]></itunes:name></itunes:owner><itunes:author><![CDATA[Neel Somani]]></itunes:author><googleplay:owner><![CDATA[njs@substack.com]]></googleplay:owner><googleplay:email><![CDATA[njs@substack.com]]></googleplay:email><googleplay:author><![CDATA[Neel Somani]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Autoformalization and the Future of Math Research]]></title><description><![CDATA[Formal methods reveal which informal concepts we rely on 
most.]]></description><link>https://www.neelsomaniblog.com/p/autoformalization-and-the-future</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/autoformalization-and-the-future</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Fri, 30 Jan 2026 04:50:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!l2e0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91404284-112b-4f28-9d6e-08ac9555a135_1200x1200.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>Last week, I gathered a group of bright undergraduate students to construct <a href="https://www.ocf.berkeley.edu/~neel/erdos.html">GPT-Erdos</a>. We ran a uniform procedure across the open Erd&#337;s problems, evaluating GPT-5.2 Pro, Deep Research, and, when possible, Aristotle by Harmonic. This produced 3 accepted solutions, 3 partial results, and 4 previously undocumented rediscoveries, all <a href="https://github.com/neelsomani/gpt-erdos">open-sourced</a>.</p><p>What I found is that the value of autoformalization goes beyond the raw tech. (By &#8220;autoformalization,&#8221; I mean the tooling that takes a human-readable math proof and converts it into a machine-checkable format like Lean or Coq.) By formalizing our work, we expose underspecified concepts that implicitly guide research, like novelty, progress, and correctness.</p><h2>The Most Common Failure is Underspecification</h2><p>Early experiments made it clear that failure cases can be nuanced. For example, GPT-5.2 Pro <a href="https://www.erdosproblems.com/397">produced an accepted solution</a> for problem #397, but shortly thereafter Terence Tao used <a href="https://chatgpt.com/share/69632932-7308-800e-81de-fa5ea2432d62">Deep Research</a> to find a closely related partial result. That partial result took a different path from GPT-5.2 Pro&#8217;s solution, but it could likely be extended to solve the same problem. 
Depending on who&#8217;s judging, this could be described as a novel result, a rediscovery, or an extension of existing literature.</p><p>So I became interested in the failure modes when GPT-5.2 Pro fails to give a novel result, and in when it might succeed. The methodology was simple: paste the exact LaTeX problem statement into GPT-5.2 Pro, then ask Deep Research to surface any previous solutions, mirroring the process that #397 followed. No human intervention was allowed during generation. Each response was independently reviewed and classified according to predefined categories. Our open-source <a href="https://github.com/neelsomani/gpt-erdos/blob/main/README.md">results</a> are consistent with what others have found:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l2e0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91404284-112b-4f28-9d6e-08ac9555a135_1200x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l2e0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91404284-112b-4f28-9d6e-08ac9555a135_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!l2e0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91404284-112b-4f28-9d6e-08ac9555a135_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!l2e0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91404284-112b-4f28-9d6e-08ac9555a135_1200x1200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!l2e0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91404284-112b-4f28-9d6e-08ac9555a135_1200x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l2e0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91404284-112b-4f28-9d6e-08ac9555a135_1200x1200.png" width="1200" height="1200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/91404284-112b-4f28-9d6e-08ac9555a135_1200x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Sankey Diagram of GPT-Erdos Results&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Sankey Diagram of GPT-Erdos Results" title="Sankey Diagram of GPT-Erdos Results" srcset="https://substackcdn.com/image/fetch/$s_!l2e0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91404284-112b-4f28-9d6e-08ac9555a135_1200x1200.png 424w, https://substackcdn.com/image/fetch/$s_!l2e0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91404284-112b-4f28-9d6e-08ac9555a135_1200x1200.png 848w, https://substackcdn.com/image/fetch/$s_!l2e0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91404284-112b-4f28-9d6e-08ac9555a135_1200x1200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!l2e0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91404284-112b-4f28-9d6e-08ac9555a135_1200x1200.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Sankey Diagram of ChatGPT Responses to Erdos Problems. (Classification requires human judgment. 
Boundaries are debatable, and anyone is welcome to re-label the dataset.)</figcaption></figure></div><p>What stands out is that, among the non-trivial attempts that go beyond pure literature recitation, underspecification causes at least as many failures as outright errors.</p><p>Relatedly: finding existing literature, expanding on it, and producing structurally new proofs are all useful results, but I found that, psychologically, people want a clear &#8220;novel&#8221; result attributed to an LLM, with no historical literature or partial results. Disagreements over novelty aren&#8217;t purely epistemic. Novelty functions as a proxy for intellectual contribution, and it&#8217;s also a measure of which proving system is the most advanced. That creates pressure to draw clean delineations around what&#8217;s &#8220;actually novel.&#8221;</p><h2>Informal Goals Guiding Formal Methods</h2><p>Even top mathematicians disagree on whether a result is novel. For example, in <a href="https://www.erdosproblems.com/forum/thread/652">problem #652</a>, Tao classifies the GPT-5.2 Pro response as novel, even though the result heavily relies on Mathialagan&#8217;s bipartite theorem. One solution (#281) was interesting enough that Nat Sothanaphan wrote a great <a href="https://drive.google.com/file/d/1uejEN0DBX_BjUePNOddCW8w8k82n1B5a/view?usp=sharing">write-up</a> on it and Tao wrote about the method on his <a href="https://terrytao.wordpress.com/2026/01/19/rogers-theorem-on-sieving/">blog</a>.</p><p>Of course, in math almost everything builds in some way upon previous results. The question is whether a non-trivial insight, perspective, or approach was applied to the problem. One extreme interpretation is that nothing is novel, or that everything is. 
Neither reconciles with our intuitions around novelty.</p><p>Maybe the correct definition is something closer to the minimum complexity of expressing a proof, while being allowed to reference existing results with constant cost. In other words, if a proof is an existing theorem with new parameters, that&#8217;s pretty simple to express, but if a proof requires several new non-trivial theorems and there&#8217;s no way around expressing it that way, that&#8217;s more complex and probably novel.</p><p>Or another possible formalism might take inspiration from zero-knowledge research, where we show it&#8217;s possible to be convinced of the truth of a statement while not being able to reconstruct the underlying witness given polynomial time computation. In an analogous sense, we can define mathematical &#8220;knowledge&#8221; as the ability to reconstruct a proof using existing results in polynomial time. So retrieving an existing theorem and substituting new parameters wouldn&#8217;t count.</p><p>A formalism like that helps if we give the LLM the problem that we want to solve. But at some point the LLM needs to have some sense for which problems are the most &#8220;interesting&#8221; to solve, too. And that seems harder to formalize. Interestingness is sometimes a proxy for utility (does this help us solve other problems in math/physics?) but sometimes the utility is unclear.</p><p>The critique obviously extends beyond just math. The LLM has no way of knowing what business ideas are most interesting or novel, or which art pieces are the most meaningful. None of that is a part of the training process, and the post-training process is so heuristic that I wouldn&#8217;t bet all the properties we care about are emergent. So maybe these are important concepts to formalize. 
Applying formal methods has a way of revealing which informal concepts we rely on most.</p><h2>Where&#8217;s the Space Heading?</h2><p>I can&#8217;t find the tweet, but somewhere Daniel Litt says something along the lines of: &#8220;Even if LLM progress completely stalled, the existing technology would substantially impact the practice of mathematics.&#8221; I think that&#8217;s very true. Mathematicians spend time hand-verifying proofs that Aristotle could check; people spend time on &#8220;open&#8221; problems that are sometimes already solved in the literature; and quickly getting up to speed on existing approaches to a problem can save significant time. Of course, I think the models are going to get way better at math, though we might need new techniques.</p><p>Regardless, people keep asking me where this is going. What good is theorem proving? Sometimes you hear answers like quant finance, but we don&#8217;t really prove things formally in quant finance. We do, however, develop models that you&#8217;d want to be provably correct.</p><p>For example, can we say with certainty that a particular C/C++ CUDA kernel has no memory safety violations? Can we say that a program handles all exceptions gracefully? Are there outputs that a statistical model provably cannot internally represent?</p><p>These were all well-studied questions before autoformalization, but no one wants to use a super sophisticated type system, so I see autoformalization as a way of applying formal methods at scale. In that same vein, LLMs are producing so much slop code that no one&#8217;s checking it carefully. Provable guarantees become a lot more valuable when there&#8217;s no close supervision.</p><p>The other thing I hope emerges from autoformalization research is some concept of &#8220;closeness&#8221; to completion. No such metric is widely accepted, to my knowledge. 
I think Terence Tao has some old essay or video where he points out that experienced mathematicians make errors too, but the errors tend to &#8220;cancel out&#8221; because the intuition is sound. Conversely, a single error can throw a junior mathematician completely off course. In short, there&#8217;s a difference between a proof that&#8217;s trivially repairable and one that&#8217;s fatally flawed, but final formal verification is binary: either the proof verifies or it doesn&#8217;t. In an ideal world, that closeness function would be a differentiable surrogate so we could optimize it directly.</p><p>The concept of closeness to completion matters in a bunch of domains. First, it serves as a search oracle when you&#8217;re trying to find the right solution to a problem. Second, there are many domains where the binary &#8220;proves vs. doesn&#8217;t prove&#8221; would fail. The Einstein field equations were famously discovered via a variety of heuristics and metaphors, and they were only cleanly formalized by later physicists. In general, the process of discovering truth is heuristic, not formal. Coincidentally, that resembles the state of the art in autoformalization today: the gold standard is GPT-5.2 Pro for finding the proof, then Aristotle by Harmonic for verifying it. 
Almost all of the AI-solved Erdos problems were derived this way to my knowledge.</p><p>There are lots of other cool domains to apply autoformalization and formal methods to, so I&#8217;m curious to hear other people&#8217;s ideas!</p>]]></content:encoded></item><item><title><![CDATA[The Endgame for Mechanistic Interpretability]]></title><description><![CDATA[The endgame for mechanistic interpretability is formal methods.]]></description><link>https://www.neelsomaniblog.com/p/the-endgame-for-mechanistic-interpretability</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/the-endgame-for-mechanistic-interpretability</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Mon, 05 Jan 2026 23:53:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/278caba2-f5bf-462c-9f98-02c207a1c21a_4000x3003.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Mechanistic interpretability is currently pulled between competing visions. On one side, Neel Nanda argues for <a href="https://www.alignmentforum.org/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability">pragmatic interpretability</a>: grounding work in real models and judging progress by empirical feedback, even when the underlying understanding is partial. On the other, Leo Gao defends <a href="https://www.alignmentforum.org/posts/Hy6PX43HGgmfiTaKu/an-ambitious-vision-for-interpretability">ambitious interpretability</a>: a long-term bet on building circuits that are necessary and sufficient for behavior, on the view that deeper mechanistic understanding is what will generalize across model changes. This disagreement is treated as a question of methods, but the deeper divide lies elsewhere.</p><p>The disagreement persists because mechanistic interpretability lacks an agreed-upon end goal. 
What, exactly, would count as success?</p><h2>The Telos of Mechanistic Interpretability</h2><p>Today, feature labeling, circuit discovery, probing, activation patching, and causal interventions coexist as productive methods, yet they are only loosely coordinated. The field lacks a shared ideal for what interpretability methods are ultimately meant to deliver.</p><p>One possible telos is legibility: producing explanations that are intelligible to humans. On this view, interpretability succeeds when it tells a coherent story about why a model behaves as it does. But explanations that fail under counterfactual intervention may sound plausible, yet provide no reliable handle for control. Even advocates of curiosity-driven interpretability typically hope their findings will eventually support some downstream use, and legibility alone does not guarantee this.</p><p>A second telos is scientific understanding. Intervention is used to test hypotheses, identify causal structure, and build general explanations. Mechanistic interpretability often operates successfully under this ideal. But LLMs are not natural objects. They are engineered software artifacts. Focusing on understanding alone leaves a powerful affordance unused. Interventions need not merely reveal structure, they can permanently modify the system while preserving formally specified properties. A purely scientific telos does not demand patchability, certification, or correctness under modification.</p><p>A third telos is capability enhancement. From this perspective, interpretability is valuable only insofar as it accelerates optimization. The natural equilibrium of this ideal favors systems that are maximally effective and maximally opaque.</p><p>These ideals are not mutually exclusive, but no single one provides a stable foundation for the field. 
From this perspective, mechanistic interpretability ought to orient itself toward debuggability: the ability to localize failures to specific mechanisms, intervene on those mechanisms predictably, and certify that the intervention preserves desired behavior on bounded domains. This telos subsumes legibility and scientific understanding, while resisting the drift toward opacity implicit in a purely capability-driven view.</p><p>In what follows, I make this notion precise.</p><h2>Desired Goals for LLM Debuggability</h2><p>In an idealized setting, debugging an LLM would proceed from localization, to intervention, to certification. Each stage places strictly stronger demands on our mechanistic understanding, and each rules out large classes of explanations.</p><h3>Localization: Identifying the Responsible Mechanism</h3><p>The first requirement of debuggability is localization: the ability to identify which internal mechanisms are responsible for a given behavior, and to distinguish mechanisms that generalize from those that merely correlate with it.</p><p>In the strongest form, a debuggable localization supports counterexample search. For a bounded LLM input domain D, this means being able to determine whether the behavior can occur without the mechanism being active, or whether the mechanism can be active without producing the behavior (and, when possible, to surface concrete inputs that witness such cases).</p><h3>Intervention: Surgical, Mechanism-Level Debugging</h3><p>Localization is only meaningful if it admits intervention. 
Once a failure is traced to a mechanism, we&#8217;d like to modify that mechanism in a way that is predictable and targeted:</p><ol><li><p>The responsible head, MLP, or subspace can be modified or constrained.</p></li><li><p>The intervention removes the undesired behavior on a specified domain.</p></li><li><p>The intervention does not induce collateral damage elsewhere in that domain.</p></li></ol><h3>Certification: Domain-Bounded Safety Guarantees</h3><p>The final goal of debuggability is certification: the ability to make exhaustive, falsifiable claims about model behavior on bounded domains.</p><p>For a formally specified domain D, this can mean proving that no harmful token is produced for any input in D, a claim that is in principle achievable for sufficiently constrained settings. Certification may also take the form of subcircuit-level bounds, for example structural invariants that rule out entire classes of behavior by construction, such as proving that a circuit cannot bypass a guard layer unless a specific feature is active.</p><h2>What a Debuggable Explanation Can (and Can&#8217;t) Promise</h2><p>First, some anti-goals:</p><ol><li><p>Debuggability does not imply that a trained Transformer can be cleanly de-compiled into a single symbolic program. Transformer models are not limited to discrete algorithmic control flow. They exploit continuous geometry in high-dimensional embedding spaces, superposition, and distributed representations. Any realistic abstraction must preserve this expressive freedom rather than erase it.</p></li><li><p>Debuggability does not entail identifying a single, privileged &#8220;cause&#8221; of an output or phenomenon. Model behavior is mediated by deep, branching causal pathways with redundancy, overlap, and compensatory mechanisms. A debuggable explanation need not be unique. 
What matters is that the identified mechanisms are enough to explain and control the behavior within scope, and that alternative bypasses would be surfaced by counterexample search rather than hidden by storytelling.</p></li><li><p>Debuggability does not aim at global safety proofs of the form &#8220;this model will never produce harmful output.&#8221; Even if &#8220;harmful&#8221; were formally defined, the unconstrained input space of frontier LLMs is beyond the reach of any existing or foreseeable verification technique. Any agenda that predicates success on global guarantees is doomed to either vacuity or false confidence.</p></li></ol><p>Instead, debuggability is about constructing a family of verified, compositional abstractions that faithfully reproduce model behavior on bounded domains and support predictable intervention. These abstractions are partial, local, and plural, but they are exact where they apply.</p><p>The relevant analogy is debugging a large, safety-critical software system. One cannot prove &#8220;Chrome will never crash.&#8221; But one can prove that specific routines are memory-safe, that sandboxing prevents certain classes of process escape, that critical invariants are preserved across refactors, and that a given patch eliminates a vulnerability without introducing regressions.</p><p>The same logic applies to LLMs. Meaningful debuggability consists in guarantees such as:</p><ul><li><p>This subcircuit cannot activate a forbidden feature on domain D.</p></li><li><p>This intervention removes a failure mode while preserving all other behaviors in scope.</p></li><li><p>This pathway is structurally incapable of bypassing a guard unless a specific internal condition is met.</p></li></ul><h2>The Necessity of Formal Methods</h2><p>Note that the above are universal claims over bounded domains. They are not distributional or probabilistic, and they do not rest on sampling. 
This is why a debuggability-oriented interpretability agenda is necessarily coupled to formal methods. SMT solvers, abstract interpretation, and neural verification frameworks are not optional add-ons. They are the only frameworks in which claims of impossibility, preservation, or closure under intervention can be made precise.</p><p>This vision does not require that today&#8217;s frontier LLMs be fully verifiable end-to-end. What matters is that the debuggability of Transformer models has precedent:</p><ul><li><p>Sparse circuit extraction shows that models contain relatively isolated, algorithmic subcircuits that remain stable under targeted intervention.</p></li><li><p>Symbolic Circuit Distillation is an early example of automated extraction, where neural mechanisms can be proven formally equivalent to symbolic programs.</p></li><li><p>Neural verification work (e.g. Reluplex and Marabou) establishes that exhaustive reasoning is possible once models are reduced to verification-friendly components on bounded domains.</p></li><li><p>Alternative attention mechanisms suggest that standard attention (the dominant barrier to SMT verification in Transformers) is an architectural choice rather than a theoretical necessity, opening the door to verification-aware model design with comparable performance.</p></li></ul><p>Taken together, these results shift the problem from conceptual impossibility to engineering integration and scale. Debuggability is about being able to say, with confidence: &#8220;This mechanism, on this domain, behaves this way, and if it didn&#8217;t, we would know.&#8221;</p><h2>What De-compiling Actually Looks Like</h2><p>Here&#8217;s what a possible &#8220;de-compilation&#8221; pipeline might look like:</p><h3>1. Identify stable linear regions (local programs)</h3><p>The smallest unit of analysis is a particular mechanism that has a stable branch structure on a bounded domain. 
Many verification-friendly components (affine maps, threshold gates, max/Top-K selection) behave like ordinary programs. Once you know which branch you&#8217;re in (e.g. which segment of a piecewise function is active, which items win a Top-K) the remaining computation is just affine arithmetic.</p><p>A &#8220;local program&#8221; is a region of inputs defined by explicit guard conditions (linear inequalities) together with the affine map executed under those guards. The stability part matters, because you want margins on the guards (e.g. thresholds not near zero, Top-K winners separated from runners-up) so that small perturbations or permissible interventions don&#8217;t flip the branch decisions and invalidate the explanation.</p><p>Here&#8217;s what a concrete example might look like:</p><blockquote><p>Head 31.2: Helps break text into paragraphs</p><p>Empirically verified: attention weight exceeds &#949; when a token from set {&#8216;\n&#8217;, &#8216;\n\n&#8217;} or a discourse marker token occurs at position t&#8722;1.</p></blockquote><h3>2. Factor into meaningful subspaces</h3><p>Factoring into meaningful subspaces is the step where you decompose a mechanism&#8217;s activations into low-dimensional directions that have stable semantics across inputs and contexts, such as syntactic markers, sentiment, or safety-relevant features. A single local program may operate over several subspaces, and the same subspace may participate in many different local programs.</p><p>Without subspaces, interventions are blunt (ablating whole heads or MLPs). With them, interventions can be surgical (editing or bounding specific directions while leaving others untouched).</p><p>Ideally, these subspaces exhibit &#8220;functional coherence,&#8221; where moving along the subspace produces predictable, monotonic changes in model behavior on a bounded domain.</p><h3>3. 
Extract formally verifiable causal circuits</h3><p>In this step, we compose local programs and subspaces into a single object that supports global, counterfactual claims about behavior on a bounded domain. Formally, this means specifying an interface, a domain, and a set of admissible interventions, and then proving that the neural subcircuit is equivalent to (or soundly approximated by) a symbolic specification on that domain.</p><p>My project <a href="https://github.com/neelsomani/symbolic-circuit-distillation">Symbolic Circuit Distillation</a> builds in this direction by providing formally verified functional abstractions on bounded domains. Achieving this level of debuggability places strong pressure to redesign core Transformer components that are poorly suited to formal verification.</p><p>Multiple circuits may implement overlapping or reconstructed features elsewhere in the model. What matters is that, within scope, the abstraction is correct and closed under counterfactuals. If a bypass exists, formal search will find a concrete counterexample. This is the point where mechanistic interpretability becomes robust to patching and refactoring, and where safety-relevant guarantees become possible. Circuits stop being explanatory stories and become objects you can edit, reason about, and certify.</p><p>How these verified control abstractions are surfaced to human operators (whether through explicit query languages, automated tooling, or learned interfaces) is an important but orthogonal problem, and not required for the core claim of debuggability.</p><h2>Interpretability as Control</h2><p>The central question is whether mechanistic insight can support reliable, bounded, and verifiable control over systems that matter.</p><p>The result is a patchwork of verified abstractions: local programs, meaningful subspaces, and formally specified circuits. Many such decompositions may exist. 
Verification removes arbitrariness not by enforcing uniqueness, but by enforcing sufficiency.</p><p>I am curious to hear thoughts from other researchers. You can reach me on X: <a href="https://x.com/neelsomani">@neelsomani</a></p><p><em>Thanks to <a href="https://x.com/mmaaz_98?s=21">Maaz</a> for giving feedback prior to posting.</em></p>]]></content:encoded></item><item><title><![CDATA[Intro to Routing: Mixture-of-Experts and Expert Choice]]></title><description><![CDATA[I derive MoE and EC from first-principles.]]></description><link>https://www.neelsomaniblog.com/p/intro-to-routing-mixture-of-experts</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/intro-to-routing-mixture-of-experts</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Fri, 14 Nov 2025 21:19:02 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/53e47187-8153-43b4-b683-3f673d2a97be_5359x3976.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this post, I&#8217;ll cover routing mechanisms for large language models. People often talk about Mixture-of-Experts and Expert Choice, so my goal is to give a first-principles walkthrough that explains how these methods arise naturally. You can think of this as how I would have derived them myself, or as an ex post facto explanation that clarifies the logic behind their design.</p><h2>Mixture-of-Experts (MoE)</h2><h3>Historical Roots of MoE</h3><p>The engineering motivation behind MoE is straightforward. The goal is to take N expert functions f<sub>i</sub> and compute a weighted average of their outputs based on how confident the model is in each expert.</p><p>The basic idea is simple. 
First, compute logits for each expert:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z_i = (W_g x + b)_i&quot;,&quot;id&quot;:&quot;KJVHTFSAFX&quot;}" data-component-name="LatexBlockToDOM"></div><p>Then convert these logits into a probability distribution:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;g_i(x) = \\text{softmax}(z(x))_i&quot;,&quot;id&quot;:&quot;TUYKGLZBTT&quot;}" data-component-name="LatexBlockToDOM"></div><p>Finally, compute the convex combination of expert outputs:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y = \\sum_i g_i(x) * f_i(x)&quot;,&quot;id&quot;:&quot;NJTDQOJCMA&quot;}" data-component-name="LatexBlockToDOM"></div><p>Training proceeds exactly as it does for a standard feedforward layer. This formulation matches the original work of Jacobs et al. (1991).</p><p>The drawback is that this approach requires evaluating every expert for every token, including experts with very low probability. This becomes expensive as N grows.</p><h3>Top-1 Gating</h3><p>Let&#8217;s say we want to perform &#8220;Top-1 gating,&#8221; where we select only the highest-scoring expert. Specifically, we want to compute all g<sub>i</sub>(x), pick the top expert s, run only that expert, and set:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y = g_s(x) * f_s(x)&quot;,&quot;id&quot;:&quot;MGPIKWVRPQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>This has a property you might find unexpected. Even if all f<sub>i</sub> point in roughly the same direction, the magnitude of y is smaller than the raw output of f<sub>s</sub>(x), since f<sub>s</sub> is scaled by g<sub>s</sub>. You can argue that this is sometimes desirable, because lower confidence often corresponds to smaller updates in many other ML architectures.
But this is a post hoc justification, and the real reason is just that attempts to renormalize y tend to perform worse in practice.</p><p>Backpropagation follows from the product rule. For the expert parameters &#952;<sub>f</sub>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{dL}{d\\theta_{f_s}} = \\frac{dL}{dy} g_s(x) \\frac{df_s(x)}{d\\theta_{f_s}}&quot;,&quot;id&quot;:&quot;LZTCGMXCLP&quot;}" data-component-name="LatexBlockToDOM"></div><p>By symmetry, the gradient for the router parameters &#952;<sub>g</sub> is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{dL}{d\\theta_{g_s}} = \\frac{dL}{dy} f_s(x) \\frac{dg_s(x)}{d\\theta_{g_s}}&quot;,&quot;id&quot;:&quot;RWPHLJBGTJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>For i != s, we have dL/d&#952;<sub>f_i</sub> = 0, since those experts do not run. But dL/d&#952;<sub>g_i</sub> is not zero, because the softmax couples all logits z<sub>i</sub>, so each g<sub>i</sub> influences the scaling of f<sub>s</sub>.</p><p>The gradients are undefined at the exact boundaries where the identity of the top expert changes. But that&#8217;s typically fine, just like it&#8217;s not an issue for ReLU or other piecewise differentiable functions.</p><p>The major problem is that unused experts never improve. Training collapses to a solution where g<sub>s</sub> ~= 1 for whichever expert happened to win early in training. 
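</p><p>For concreteness, the dense forward pass and top-1 gating described above can be sketched in a few lines of numpy. This is an illustrative toy with random linear experts and a single token, not any particular production implementation:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 4, 3
W_g = rng.normal(size=(n_experts, d))  # router ("gating") weights
b = np.zeros(n_experts)
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy linear experts

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.normal(size=d)
g = softmax(W_g @ x + b)  # gate probabilities g_i(x)

# Dense MoE: every expert runs, outputs combined by confidence
y_dense = sum(g[i] * (experts[i] @ x) for i in range(n_experts))

# Top-1 gating: only the highest-scoring expert runs, scaled by its gate
s = int(np.argmax(g))
y_top1 = g[s] * (experts[s] @ x)
```

<p>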
In an ideal setting, we would want all experts f<sub>j</sub> for j != s to receive at least some tokens so they continue to learn.</p><p>So we define the proportion of tokens routed to expert i in a batch:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p_i = \\frac{1}{|B|} \\sum_{t \\in B} \\mathbf{1}[s_t = i]&quot;,&quot;id&quot;:&quot;FMHAVQZJQT&quot;}" data-component-name="LatexBlockToDOM"></div><p>You might attempt to regularize this by optimizing something like:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L&#8217; = L + \\lambda * \\mathrm{KL}(p || U)&quot;,&quot;id&quot;:&quot;GFTGOLHVEO&quot;}" data-component-name="LatexBlockToDOM"></div><p>where U is a uniform distribution. This would flatten the token allocation, and in principle could prevent collapse.</p><p>But since p<sub>i</sub> above depends on an argmax (through s<sub>t</sub>), the gradient of p<sub>i</sub> is zero almost everywhere. The model receives no useful signal from this penalty.</p><p>As a result, we need some other differentiable penalty that discourages any single expert from dominating the routing. The goal is to reduce the confidence g<sub>i</sub> for experts that receive too many tokens and increase it for experts that receive too few.</p><h3>Flattening the Argmax Distribution</h3><p>Since p<sub>i</sub> isn&#8217;t useful for differentiation, we try using the Gumbel max trick, which provides a way to make the argmax behave like a soft, differentiable sampling process. As a general rule, if z<sub>j</sub> are logits and &#949;<sub>j</sub> ~ Gumbel(0, 1), then:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\arg\\max_j { z_j + \\varepsilon_j } \\sim \\text{Categorical}(\\text{softmax}(z))&quot;,&quot;id&quot;:&quot;ZZXJCHOKSX&quot;}" data-component-name="LatexBlockToDOM"></div><p>This distribution gives us Pr[s<sub>t</sub> = i] for each expert i. 
(In principle, we could even use the noisy sampling in the forward pass.) More importantly, this lets us compute an expected load for each expert and attempt to flatten it.</p><p>In our case, let z<sub>j</sub> be the logit that produces the gating probability g<sub>j</sub>(x<sub>t</sub>) = softmax(z(x<sub>t</sub>))<sub>j</sub>. Then we assume:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s_t = \\arg\\max_j { z_j(x_t) + \\varepsilon_j }&quot;,&quot;id&quot;:&quot;GIHATVRCYR&quot;}" data-component-name="LatexBlockToDOM"></div><p>This implies:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\Pr[s_t = i] = \\text{softmax}(z(x_t))_i&quot;,&quot;id&quot;:&quot;SSXAJAPHUJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>and therefore the expected load for expert i is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{E}[\\text{load}_i] = \\sum_{t \\in B} \\text{softmax}(z(x_t))_i = \\sum_{t \\in B} g_i(x_t)&quot;,&quot;id&quot;:&quot;AJPEGLEBYU&quot;}" data-component-name="LatexBlockToDOM"></div><p>With this expected load vector, we can now try to flatten it. 
Possible approaches include:</p><ul><li><p>KL(E[load] || U)</p></li><li><p>L2 distance to uniform</p></li><li><p>Entropy maximization</p></li></ul><p>A common alternative is to minimize the coefficient of variation (or a similar quantity), which flattens the distribution without the numerical instability of KL or entropy when some components get small:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{CV}(\\text{E}[\\text{load}]) = \\frac{\\text{std_dev}(\\text{E}[\\text{load}])}{\\text{mean}(\\text{E}[\\text{load}])}&quot;,&quot;id&quot;:&quot;UDPVBYXHGH&quot;}" data-component-name="LatexBlockToDOM"></div><p>We add an auxiliary term:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{\\text{aux}} = \\text{CV}(\\text{E}[\\text{load}])^2&quot;,&quot;id&quot;:&quot;ZBFPSVAJIO&quot;}" data-component-name="LatexBlockToDOM"></div><p>where CV is squared for optimization convenience. So the full loss becomes:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L&#8217; = L + \\lambda * L_{\\text{aux}}&quot;,&quot;id&quot;:&quot;HAEJNXNYIG&quot;}" data-component-name="LatexBlockToDOM"></div><h3>Practical Implementation Today</h3><p>In modern MoE implementations, a simpler surrogate auxiliary loss is used. 
It&#8217;s not statistically derived or theoretically clean, but it works well in practice:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{\\text{aux}} = \\sum_i \\frac{1}{|B|} \\left[ p_i(B) \\sum_{t \\in B} g_i(x_t) \\right]&quot;,&quot;id&quot;:&quot;CGPBVLMCQN&quot;}" data-component-name="LatexBlockToDOM"></div><p>In theory, you might reach this form by starting from the objective:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_i {p_i}^2&quot;,&quot;id&quot;:&quot;LMEWLGHMRZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Since the p<sub>i</sub> sum to 1, minimizing this objective encourages a uniform allocation via the method of Lagrange multipliers.</p><p>Then, assuming we really are sampling with Gumbel noise, and given a sufficiently large batch, the empirical load fraction can be approximated via the soft probabilities g<sub>i</sub>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p_i \\approx q_i = \\left( \\frac{1}{|B|} \\sum_{t \\in B} g_i(x_t) \\right)&quot;,&quot;id&quot;:&quot;TEKEINCQVZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where q<sub>i</sub> is a differentiable surrogate for the observed load proportion. Using this approximation on one of the factors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_i {p_i(B)}^2  \\approx \\sum_i p_i(B) * \\left( \\frac{1}{|B|} \\sum_{t \\in B} g_i(x_t) \\right)&quot;,&quot;id&quot;:&quot;RYQOPBHJZN&quot;}" data-component-name="LatexBlockToDOM"></div><p>which is the surrogate used above.</p><p>The effect is straightforward. If an expert receives too many tokens (that is, if p<sub>i</sub> is too large), the loss increases, which pushes the model to reduce g<sub>i</sub>.</p><p>The key detail is that the derivative of p<sub>i</sub>(B) is locally zero, since p<sub>i</sub> depends on an argmax that does not change in a small neighborhood.
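</p><p>As a concrete illustration, here is a minimal numpy sketch of this surrogate on random router logits. Numpy has no autograd, so the no-gradient treatment of the hard counts is indicated in comments rather than enforced:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, n_experts = 512, 4
logits = rng.normal(size=(n_tokens, n_experts))  # router logits z(x_t)

# Row-wise softmax: g[t, i] = g_i(x_t)
e = np.exp(logits - logits.max(axis=1, keepdims=True))
g = e / e.sum(axis=1, keepdims=True)

s = g.argmax(axis=1)                                 # hard top-1 assignment s_t
p = np.bincount(s, minlength=n_experts) / n_tokens   # empirical load fraction p_i(B)
q = g.mean(axis=0)                                   # soft surrogate q_i

# Surrogate auxiliary loss: sum_i p_i * mean_t g_i(x_t).
# In an autograd framework, p would be detached, so the gradient
# flows only through the soft factor q.
L_aux = float(np.sum(p * q))
```

<p>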
For this reason, implementations mark p<sub>i</sub>(B) as a &#8220;no-grad&#8221; quantity.</p><p>Compared to this heuristic approach, the formulation based on Gumbel noise and the coefficient of variation is mathematically cleaner. The stochastic forward pass aligns naturally with the probabilistic reasoning used in the backward pass, and the load penalties follow from that framework without ad hoc constructions. Despite this conceptual clarity, the surrogate loss above remains more widely used.</p><h3>Generalizing to Top-K</h3><p>If we want to use the Top-K experts rather than Top-1, first we need a constructive definition of the statistical process. As before, we compute g<sub>i</sub>(x<sub>t</sub>) for each expert i. But this time, we select the Top-K experts instead of a single one.</p><p>We&#8217;ll try to take the same approach as before. To compute the expected load for expert i, we need</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;q_i(x_t) = \\Pr[i \\in S]&quot;,&quot;id&quot;:&quot;LOEQQEDXBZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where S is the set of the K selected experts. The expected load is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{E}[\\text{load}_i] = \\sum_t q_i(x_t)&quot;,&quot;id&quot;:&quot;ACHBSXRFUL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Just like in the Top-1 case, we can minimize CV(E[load<sub>i</sub>])<sup>2</sup>. In theory, E[load<sub>i</sub>] is differentiable because Top-K sampling follows a Plackett-Luce distribution. But in practice, the gradient is computationally intractable. So we end up using the same surrogate as in the previous section.</p><p>Finally, it&#8217;s worth noting that the original formulation in Shazeer et al. (2017) used a different approach. Instead of Gumbel noise, the authors added normal noise to the logits, and they renormalized the weights g<sub>i</sub> after selecting the Top-K elements. 
The methodology and derivation differ from the conceptual framework presented above.</p><h2>Expert Choice (EC)</h2><p>Expert Choice (Zhou et al. 2022) observes that the biggest pitfall of MoE is that an expert can get overloaded. If too many tokens want the same expert, that expert overloads and gets a huge share of the gradient. If too few tokens go to an expert, that expert&#8217;s weights collapse and it fails to train. That is why we needed that hacky regularization term earlier.</p><p>Imagine you&#8217;re Google serving inference at massive scale. You don&#8217;t care about routing every token perfectly. You care about keeping all experts busy, avoiding hot spots, and guaranteeing predictable latency.</p><p>Rather than computing g<sub>i</sub>(x<sub>t</sub>) for each token and selecting argmax<sub>i</sub> g<sub>i</sub> (letting the tokens pick the experts), you can let the <em>experts</em> pick which tokens they want to serve. Each expert receives a fixed budget of M tokens and selects the M tokens for which it believes it is most useful.</p><p>You still evaluate g<sub>i</sub>(x<sub>t</sub>) for all experts i and all tokens t in the batch. For each expert i, you select:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;B_i = \\text{TopM}_{t \\in B} (g_i(x_t))&quot;,&quot;id&quot;:&quot;CIEYPZVGZU&quot;}" data-component-name="LatexBlockToDOM"></div><p>Each expert receives exactly M tokens (or up to M if the batch is small). How this affects backpropagation depends on what you do when multiple experts select the same token. For simplicity, assume the output is the sum:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y = \\sum_{i : x_t \\in B_i} g_i(x_t)* f_i(x_t)&quot;,&quot;id&quot;:&quot;NNXEOHVESQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>(Note that real EC implementations use more complicated aggregations.) 
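</p><p>A minimal numpy sketch of this selection rule and the sum aggregation, using hypothetical random linear experts:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, n_experts, M, d = 8, 3, 2, 4
x = rng.normal(size=(n_tokens, d))
W_g = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy linear experts

# Row-wise softmax: g[t, i] = g_i(x_t)
z = x @ W_g
e = np.exp(z - z.max(axis=1, keepdims=True))
g = e / e.sum(axis=1, keepdims=True)

# Each expert i picks the M tokens with the highest g[:, i]
picks = {i: np.argsort(-g[:, i])[:M] for i in range(n_experts)}

# Sum aggregation: token t's output sums over the experts that selected it
y = np.zeros_like(x)
for i, tokens in picks.items():
    for t in tokens:
        y[t] += g[t, i] * (experts[i] @ x[t])
```

<p>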
In this case, backpropagation for the expert parameters is exactly the same as in MoE. If an expert is not selected, it receives no gradient. If it is selected, then</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{dL}{d\\theta_{f_i}} = \\frac{dL}{dy} g_i(x_t) \\frac{\\partial f_i(x_t)}{\\partial \\theta_{f_i}}&quot;,&quot;id&quot;:&quot;ONKGSOWARE&quot;}" data-component-name="LatexBlockToDOM"></div><p>The gradient with respect to the router parameters is surprisingly simple:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{dL}{d\\theta_{g_i}} = \\sum_{x_t \\in B_i} \\frac{dL}{dy} f_i(x_t) \\frac{\\partial g_i(x_t)}{\\partial \\theta_{g_i}}&quot;,&quot;id&quot;:&quot;UEACOHWIOE&quot;}" data-component-name="LatexBlockToDOM"></div><p>Notice that we didn&#8217;t have to differentiate through the Top-M operator at all, since the gradients only flow through g<sub>i</sub> for the tokens actually selected by expert i.</p><p>There&#8217;s a glaring pitfall here. What if a token isn&#8217;t selected by any of the experts? In practice, implementations handle this by increasing M or by routing those stray tokens to the expert with the largest g<sub>i</sub>(x<sub>t</sub>).</p><h2>Future Directions</h2><p>I hope this post was informative. In the future, I plan to cover other routing mechanisms such as Mixture-of-Depths (MoD) or <a href="https://arxiv.org/pdf/2101.03961">Switch Transformers</a>, a paper by authors I respect a lot.</p><p>Conceptually, MoD pushes sparsity in a more radical direction. Instead of selecting which expert network should process a token, the router selects which layers of the transformer the token should flow through. 
The resulting interactions make MoD a significantly harder routing problem than MoE or Expert Choice.</p><p>When the research landscape matures further, I plan to revisit this topic with a dedicated analysis.</p><p><em>Note: I cover the content of this blog post in a <a href="https://www.youtube.com/watch?v=gnHNom6yokQ">YouTube explainer</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[A Minimal Route to Transformer Attention]]></title><description><![CDATA[Is attention inevitable?]]></description><link>https://www.neelsomaniblog.com/p/a-minimal-route-to-transformer-attention</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/a-minimal-route-to-transformer-attention</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Thu, 30 Oct 2025 00:25:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7820a78a-af15-4e9b-8577-7cbedfdd5fa4_600x400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this post, I&#8217;ll show how a small set of reasonable assumptions can recover the Transformer attention mechanism. Some parts of attention are theoretically motivated, while others are arbitrary choices. I&#8217;ll explicitly call out which is which.</p><p>To see why attention exists, it helps to recall its predecessor: the recurrent neural network (RNN). Classic encoder-decoder RNNs process a sequence token by token. Each new hidden state incorporates the current token and the previous hidden state, producing a vector you can think of as an &#8220;accumulator&#8221; of everything seen so far. After ingesting the final token, that accumulated vector is repeatedly fed to the decoder, which predicts output tokens until it emits a STOP symbol.</p><p>The problem is long-range dependence. If an important token appeared far earlier in the sequence (say, the first of 10,000 tokens), its influence becomes diluted as the RNN processes additional tokens. 
The model simply forgets.</p><p>Ideally, the model should use all previously seen tokens to compute the information needed to predict the next token, weighting each earlier token by how relevant it is for that prediction. That suggests computing a relevance score between a position i and each other position j, and then combining some function of the embeddings accordingly.</p><p>Formally, define a scalar relevance function that takes the embeddings X along with indices i and j:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;u(i, j | X_1, &#8230;, X_n)&quot;,&quot;id&quot;:&quot;RTZWKQBXDB&quot;}" data-component-name="LatexBlockToDOM"></div><p>We work in embedding space rather than raw token IDs to avoid meaningless geometric assumptions (e.g., token 9 is not inherently &#8220;closer&#8221; to token 10). One-hot encodings would work, but are much more sparse.</p><p>Then the model&#8217;s output vector at position i can be written as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y_i = G(\\{x_j, u(i, j | X)\\}_j)&quot;,&quot;id&quot;:&quot;PZFHKXLNUK&quot;}" data-component-name="LatexBlockToDOM"></div><p>where G aggregates some function of the embeddings x<sub>j</sub> alongside their relevance to position i. (During autoregressive generation, i corresponds to the most recently produced token.) We don&#8217;t yet know the form of G or u. Our goal is to characterize the simplest constraints that lead directly to Transformer-style attention.</p><h2>Observation: Enforce permutation symmetry</h2><p>We want to constrain the space of possible functions for G and u.</p><p>Once we have the relevance scores u(i, j), the output y<sub>i</sub> should not depend on the order in which the pairs (x<sub>j</sub>, u(i, j)) are provided. In other words, if we reorder the elements indexed by j, the result should remain the same. 
This requires G to be &#8220;permutation-invariant&#8221; over the set {(x<sub>j</sub>, u(i, j))}<sub>j</sub>.</p><p>The Deep Sets theorem (Zaheer et al., 2017) tells us that any such function can be written as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;G(\\{x_j, u(i, j)\\}_j) = &#961;\\left(\\sum_j &#966;(x_j, u(i, j))\\right)&quot;,&quot;id&quot;:&quot;KBWDGHMNQP&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here &#961; and &#966; are arbitrary differentiable functions. We fix the index i, since the invariance only applies over j. Differentiability ensures that the overall model can be trained with gradient-based methods.</p><p>At this point, &#961; and &#966; are still completely general, and we also need to define u. We will impose further assumptions to narrow down their form.</p><h2>Assumption 1: &#961; is the identity function</h2><p>&#961; could output many different types of objects. For example:</p><ul><li><p>It could output a scalar, but that would discard most of the information from the embeddings.</p></li><li><p>It could output an O(N)-dimensional vector, with one component per input element, but that would make the output scale with sequence length and defeat the purpose of summarizing information.</p></li><li><p>It could output a vector in some intermediate dimension, or even map into a different space/manifold entirely.</p></li></ul><p>All of these are technically possible. 
In practice, Transformers set &#961; to be the identity function, so:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;G(X) = \\sum_j &#966;(x_j, u(i, j))&quot;,&quot;id&quot;:&quot;RZLEKDPYLD&quot;}" data-component-name="LatexBlockToDOM"></div><p>This simplifies the structure of G and lets us focus on constraining &#966; and u.</p><h2>Assumption 2: Relevance-contribution proportionality</h2><p>Even with &#961; set to the identity, &#966; could be any function of the embedding x<sub>j</sub> and the relevance score u(i, j). To simplify the form, we assume that if a token&#8217;s relevance is scaled by a constant k, its contribution scales by the same factor:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&#966;(x_j, k * u(i, j)) = k * &#966;(x_j, u(i, j))&quot;,&quot;id&quot;:&quot;POGEWWYKET&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is not the only possible relationship. For example, we could have chosen a quadratic or some other monotonic transformation in u(i, j). 
The key requirement is simply that &#966; should separate into:</p><ul><li><p>A scalar measuring how important x<sub>j</sub> is</p></li><li><p>A vector capturing what x<sub>j</sub> contributes</p></li></ul><p>Under the linear version of this assumption, we get:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&#966;(x_j, u(i, j)) = u(i, j) * &#966;(x_j, 1)&quot;,&quot;id&quot;:&quot;OJDDWNNAGW&quot;}" data-component-name="LatexBlockToDOM"></div><p>Define v(x<sub>j</sub>) = &#966;(x<sub>j</sub>, 1), yielding:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&#966;(x_j, u(i, j)) = u(i, j) * v(x_j)&quot;,&quot;id&quot;:&quot;KBEYTHXWYU&quot;}" data-component-name="LatexBlockToDOM"></div><p>This makes &#966; explicitly separable, where u(i, j) purely controls magnitude (relevance), and v(x<sub>j</sub>) determines the content being contributed.</p><h2>Assumption 3: Linear change of coordinates</h2><p>At this point, v(x<sub>j</sub>) could be any function of x<sub>j</sub>. To simplify the model and keep it efficient to compute, we assume v is a linear transformation of x<sub>j</sub>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;v(x_j) = W_V x_j&quot;,&quot;id&quot;:&quot;ENUTBTKFEQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Substituting this into the previous expression gives:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;G(X) = \\sum_j {u(i, j) * W_V x_j}&quot;,&quot;id&quot;:&quot;VBGILWDWMH&quot;}" data-component-name="LatexBlockToDOM"></div><p>This means each token contributes a linearly transformed version of its embedding, weighted by its relevance score u(i, j).</p><h2>Observation: Constrain u for efficient parallel computation</h2><p>We want u(i, j) to be computable efficiently on hardware like GPUs. 
Here, &#8220;efficient&#8221; refers to low sequential depth in the computational graph, not necessarily a low number of arithmetic operations. GPUs can execute many multiplications in parallel, but long chains of dependent operations create bottlenecks. For example, a recurrent computation with O(N) sequential steps is slow for long sequences, but a matrix multiply has O(1) sequential depth and is highly parallelizable.</p><p>If we allowed a fully general relevance function such as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;u(i, j | X) = g_\\theta(x_i, x_j, \\text{context}(X))&quot;,&quot;id&quot;:&quot;WNVJNGRTLN&quot;}" data-component-name="LatexBlockToDOM"></div><p>where context(X) examines all tokens at once, we would need to evaluate this network O(N<sup>2</sup>) times for a single layer, which is too slow.</p><p>Alternatively, we could define a single model:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;g_&#952;: X &#8594; &#8477;^{n \\times n}&quot;,&quot;id&quot;:&quot;UFQMMJUQLC&quot;}" data-component-name="LatexBlockToDOM"></div><p>that outputs all pairwise relevance values directly. But that would require storing and training parameters of size O(N<sup>2</sup>), which locks the model to a fixed input length and scales poorly.</p><p>To keep computation parallelizable and scalable, we restrict u to be built from tensor operations such as:</p><ul><li><p>Linear projections</p></li><li><p>Element-wise functions</p></li><li><p>Inner products</p></li><li><p>Reductions like sums</p></li></ul><p>and avoid control flow or long sequential recurrences.</p><h2>Assumption 4: Dot product similarity for u</h2><p>A simple way to score the interaction between x<sub>i</sub> and x<sub>j</sub> is with a dot product. 
However, we don&#8217;t necessarily want similarity in the embedding space - we want similarity in a space optimized for relevance.</p><p>So, as we did for v(x<sub>j</sub>), we first apply learned linear projections:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\nq_i &amp;= W_Q x_i \\\\\nk_j &amp;= W_K x_j\n\\end{aligned}&quot;,&quot;id&quot;:&quot;NJCWYXQOAD&quot;}" data-component-name="LatexBlockToDOM"></div><p>Then we define the relevance score as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;u'(i, j | X) = &#10216;q_i, k_j&#10217;&quot;,&quot;id&quot;:&quot;SRYKSKIIZT&quot;}" data-component-name="LatexBlockToDOM"></div><p>We denote this version as u&#8217; because additional modifications will be applied later.</p><h2>Assumption 5: Pick a normalization for u</h2><p>Next, we want the relevance scores u&#8217;(i, j) to measure relative importance. If the same constant were added to all scores, or if they were scaled uniformly, the ranking of tokens should not change. This motivates applying a differentiable normalization function over j.</p><p>There are several possibilities (e.g., softmax, Gumbel-Softmax). In practice, Transformers use softmax.</p><p>One final issue: the dot product &#10216;q<sub>i</sub>, k<sub>j</sub>&#10217; tends to grow in magnitude with the key/query dimension d<sub>k</sub>. 
To prevent extremely large values from dominating the softmax, we scale the logits:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;u&#8217;(i, j) = \\frac{&#10216;q_i, k_j&#10217;}{\\sqrt{d_k}}&quot;,&quot;id&quot;:&quot;ZAHOXENOPA&quot;}" data-component-name="LatexBlockToDOM"></div><p>Applying softmax normalization over j then gives:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;G(x_i) = \n\\sum_j\n\\operatorname{softmax}_j\\!\\left(\n\\frac{\\langle W_Q x_i, W_K x_j \\rangle}{\\sqrt{d_k}}\n\\right)\n\\, W_V x_j&quot;,&quot;id&quot;:&quot;QXJXAAVCNM&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is exactly the scaled dot-product attention used in Transformers.</p><h2>Does a better attention mechanism exist?</h2><p>So there you have it. If we impose the following assumptions:</p><ol><li><p>&#961; is the identity function</p></li><li><p>Each token&#8217;s contribution scales proportionally with its relevance score</p></li><li><p>A linear transformation maps embeddings to the value, key, and query vectors</p></li><li><p>Relevance is based on a dot product</p></li><li><p>Relevance scores are normalized with a softmax</p></li></ol><p>we obtain the exact scaled dot-product attention used in Transformers.</p><p>While some of the choices were forced, they weren&#8217;t all theoretically required. There may be better options for &#961;, for the similarity measure, or for the normalization function. Even more fundamentally, the Deep Sets form at the beginning was from imposing permutation-invariance, but we end up reinjecting positional encodings in practice. 
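</p><p>For concreteness, here is a single-head numpy sketch of the mechanism these assumptions produce (no masking, no positional encodings, and hypothetical random weights), along with a check of the permutation symmetry we started from:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, d_k = 5, 8, 4
X = rng.normal(size=(n, d))
W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d))

def attention(X):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = (Q @ K.T) / np.sqrt(d_k)     # u'(i, j) = <q_i, k_j> / sqrt(d_k)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)  # softmax over j
    return A @ V                          # y_i = sum_j A[i, j] * W_V x_j

Y = attention(X)

# With no positional encodings, the layer is permutation-equivariant:
# shuffling the input tokens just shuffles the outputs the same way.
perm = rng.permutation(n)
equivariant = np.allclose(attention(X[perm]), Y[perm])
```

<p>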
Exploring these variations could reveal new attention mechanisms with different computational or modeling advantages.</p><p><em>Note: I cover the content of this blog post in a <a href="https://www.youtube.com/watch?v=_Q57Ff_NNw4">YouTube explainer</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[Killing the GIL: How To Use Python 3.14's Free-Threading Upgrade]]></title><description><![CDATA[The global interpreter lock (GIL) has been interfering with true parallelism in Python. That ends with Python 3.14.]]></description><link>https://www.neelsomaniblog.com/p/killing-the-gil-how-to-use-python</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/killing-the-gil-how-to-use-python</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Tue, 14 Oct 2025 23:14:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kWj7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1573cb29-49ff-49d2-a469-7cb512969a5c_1656x458.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For almost three decades, Python&#8217;s Global Interpreter Lock (GIL) has been the single mechanism standing between your CPU cores and real parallelism.</p><p>That changes with Python 3.14.</p><p>The new free-threaded build removes the GIL, allowing multiple threads to execute Python bytecode simultaneously. No multiprocessing, no pickle, no hacks.</p><p>In this post, I&#8217;ll:</p><ol><li><p>Explain why the GIL existed and what it was protecting</p></li><li><p>Compare Python&#8217;s old concurrency models (threading, multiprocessing, asyncio)</p></li><li><p>Build Python 3.14 with the GIL disabled</p></li><li><p>Run a short multithreaded benchmark that finally scales with cores</p></li><li><p>Explain the results</p></li></ol><h2>Why the GIL existed in the first place</h2><p>The GIL is a global mutex that historically allowed only one thread to execute Python bytecode at a time. 
It was there to protect CPython, the C implementation of the interpreter.</p><p>Every Python object in CPython lives on the heap as a C struct with a reference count. Each assignment or function call increments and decrements those counters constantly. If two threads updated the same object&#8217;s reference count simultaneously, you could get memory corruption or premature frees that crash the interpreter.</p><p>Adding locks around every Python object would have been complex and slow, so early CPython took the simple route: wrap the entire interpreter in one global lock. That made single-threaded execution safe, but prevented true multithreading for CPU-bound workloads.</p><h2>How concurrency previously worked</h2><p>Before 3.14, Python offered three main concurrency models, each with trade-offs:</p><ul><li><p>threading (old): uses real OS threads, but only one can execute Python bytecode at a time because of the GIL. Good for I/O, useless for parallel CPU work.</p></li><li><p>multiprocessing: spawns multiple processes, each with its own interpreter and GIL. True parallelism, but expensive, requiring separate memory, pickling overhead, and slower process startup.</p></li><li><p>asyncio (green threads): runs everything cooperatively on one thread. Excellent for high-concurrency I/O, but it never uses more than one core.</p></li></ul><p>With Python 3.14&#8217;s free-threaded build, threading becomes the best of all worlds: true parallelism across cores, shared memory without serialization, and minimal overhead.</p><h2>Building Python 3.14 without the GIL</h2><p>Compile it yourself:</p><pre><code><code>git clone https://github.com/python/cpython
cd cpython
git checkout v3.14.0
./configure --prefix=$HOME/.py-314-ft --disable-gil
make -j &amp;&amp; make install
$HOME/.py-314-ft/bin/python3 -V</code></code></pre><p>Or with pyenv:</p><pre><code><code>pyenv uninstall -f 3.14.0 || true
PYTHON_CONFIGURE_OPTS="--disable-gil" pyenv install 3.14.0
pyenv local 3.14.0</code></code></pre><p>Verify that you&#8217;re running a free-threaded build:</p><pre><code><code>python3 - &lt;&lt;'PY'
import sys
print("Free-threaded build:", not sys._is_gil_enabled())
PY</code></code></pre><p>You want: <code>Free-threaded build: True</code>.</p><h2>Running a realistic multithreaded benchmark</h2><p>Here&#8217;s a bit-optimized N-Queens solver. You can also download the <a href="https://github.com/neelsomani/python-freethreading">repo</a> on GitHub. It&#8217;s already efficient and CPU-bound, a good test of whether threads can finally scale.</p><pre><code><code># nqueens.py
import threading, time

def solve_row(n, cols=0, diags1=0, diags2=0, row=0):
    if row == n: return 1
    count = 0
    free = (~(cols | diags1 | diags2)) &amp; ((1 &lt;&lt; n) - 1)
    while free:
        bit = free &amp; -free
        free -= bit
        count += solve_row(
            n, cols|bit, (diags1|bit)&lt;&lt;1, (diags2|bit)&gt;&gt;1, row+1
        )
    return count

def solve_threaded(n, n_threads):
    first_row = [(1 &lt;&lt; c) for c in range(n)]
    chunks = [first_row[i::n_threads] for i in range(n_threads)]
    total = 0
    lock = threading.Lock()

    def work(chunk):
        nonlocal total
        local = 0
        for bit in chunk:
            local += solve_row(
                n, cols=bit, diags1=bit&lt;&lt;1, diags2=bit&gt;&gt;1, row=1
            )
        with lock:
            total += local

    threads = [threading.Thread(target=work, args=(c,)) for c in chunks]
    for t in threads: t.start()
    for t in threads: t.join()
    return total

if __name__ == "__main__":
    for threads in (1, 2, 4, 8):
        t0 = time.perf_counter()
        solve_threaded(14, threads)
        dt = time.perf_counter() - t0
print(f"threads={threads:&lt;2}  time={dt:.2f}s")</code></code></pre><p>Run it once with standard CPython 3.14 (GIL on) and once with your free-threaded build. With the GIL, all runs take about the same time. With the free-threaded build, performance improves almost linearly with thread count.</p><h2>Results</h2><p>Example benchmark:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kWj7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1573cb29-49ff-49d2-a469-7cb512969a5c_1656x458.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kWj7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1573cb29-49ff-49d2-a469-7cb512969a5c_1656x458.png 424w, https://substackcdn.com/image/fetch/$s_!kWj7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1573cb29-49ff-49d2-a469-7cb512969a5c_1656x458.png 848w, https://substackcdn.com/image/fetch/$s_!kWj7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1573cb29-49ff-49d2-a469-7cb512969a5c_1656x458.png 1272w, https://substackcdn.com/image/fetch/$s_!kWj7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1573cb29-49ff-49d2-a469-7cb512969a5c_1656x458.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kWj7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1573cb29-49ff-49d2-a469-7cb512969a5c_1656x458.png" width="1456" height="403" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1573cb29-49ff-49d2-a469-7cb512969a5c_1656x458.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:403,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58709,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.neelsomaniblog.com/i/176187677?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1573cb29-49ff-49d2-a469-7cb512969a5c_1656x458.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kWj7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1573cb29-49ff-49d2-a469-7cb512969a5c_1656x458.png 424w, https://substackcdn.com/image/fetch/$s_!kWj7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1573cb29-49ff-49d2-a469-7cb512969a5c_1656x458.png 848w, https://substackcdn.com/image/fetch/$s_!kWj7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1573cb29-49ff-49d2-a469-7cb512969a5c_1656x458.png 1272w, https://substackcdn.com/image/fetch/$s_!kWj7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1573cb29-49ff-49d2-a469-7cb512969a5c_1656x458.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">14-Queens on an M1 Pro, Python 3.14</figcaption></figure></div><p>That&#8217;s an ~8x speed-up without changing a single line of logic - just running under the free-threaded interpreter.</p><p>Caveats:</p><ul><li><p>C extensions: any binary package must be recompiled for free-threading, or it might quietly re-enable the GIL.</p></li><li><p>Thread safety: without the GIL, race conditions are real. Protect shared state with locks, queues, or immutable data.</p></li><li><p>Single-thread overhead: expect a 5-10% slowdown for purely single-threaded scripts due to atomic ops and internal locks.</p></li></ul><h2>Closing thoughts</h2><p>The GIL made CPython simple and safe to implement, but it locked Python to a single core.</p><p>With Python 3.14, that trade-off is gone. 
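</p><p>The thread-safety caveat above is worth making concrete: without the GIL serializing bytecode, an unsynchronized read-modify-write like <code>total += 1</code> can silently lose updates. A minimal sketch of the locked pattern (my own illustration, not from the benchmark repo):</p>

```python
import threading

def add_with_lock(n_threads: int = 4, n_iters: int = 100_000) -> int:
    """Increment a shared counter from several threads, safely."""
    total = 0
    lock = threading.Lock()

    def work() -> None:
        nonlocal total
        for _ in range(n_iters):
            with lock:  # protects the read-modify-write on `total`
                total += 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

print(add_with_lock())  # always n_threads * n_iters, GIL or no GIL
```

<p>Drop the <code>with lock:</code> line and the final count can come up short on a free-threaded build.</p><p>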
For the first time, standard Python threads can run in true parallel on modern CPUs.</p><p>So go ahead and kill the GIL, and let me know how it works for you.</p>]]></content:encoded></item><item><title><![CDATA[What You Didn't Learn in Berkeley CS 188 — Part 4]]></title><description><![CDATA[Is GRPO broken?]]></description><link>https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley-242</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley-242</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Sat, 11 Oct 2025 00:36:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-DOA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1ece17-5323-41aa-bea0-5fb9506a49a9_2498x1601.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is the fourth and final piece in my series on reinforcement learning. Previously, we covered <a href="https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley">classical RL</a>, <a href="https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley-b29">continuous control</a>, and <a href="https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley-9b3">off-policy methods</a>. 
The topic of LLM post-training is discussed all over X, so this primer should help anyone get up to speed.</p><p>Here&#8217;s how I like to think about post-training methodologies:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-DOA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1ece17-5323-41aa-bea0-5fb9506a49a9_2498x1601.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-DOA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1ece17-5323-41aa-bea0-5fb9506a49a9_2498x1601.jpeg 424w, https://substackcdn.com/image/fetch/$s_!-DOA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1ece17-5323-41aa-bea0-5fb9506a49a9_2498x1601.jpeg 848w, https://substackcdn.com/image/fetch/$s_!-DOA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1ece17-5323-41aa-bea0-5fb9506a49a9_2498x1601.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!-DOA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1ece17-5323-41aa-bea0-5fb9506a49a9_2498x1601.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-DOA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1ece17-5323-41aa-bea0-5fb9506a49a9_2498x1601.jpeg" width="1456" height="933" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f1ece17-5323-41aa-bea0-5fb9506a49a9_2498x1601.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:933,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:172997,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.neelsomaniblog.com/i/175844591?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1ece17-5323-41aa-bea0-5fb9506a49a9_2498x1601.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-DOA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1ece17-5323-41aa-bea0-5fb9506a49a9_2498x1601.jpeg 424w, https://substackcdn.com/image/fetch/$s_!-DOA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1ece17-5323-41aa-bea0-5fb9506a49a9_2498x1601.jpeg 848w, https://substackcdn.com/image/fetch/$s_!-DOA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1ece17-5323-41aa-bea0-5fb9506a49a9_2498x1601.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!-DOA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f1ece17-5323-41aa-bea0-5fb9506a49a9_2498x1601.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">2x2 Quadrant of Post-Training Methods</figcaption></figure></div><p>SFT is simple. It&#8217;s just applying additional training iterations like the pre-training stage, but on a curated set of ideal (prompt, response) pairs. You might make this more efficient with a LoRA adapter.</p><p>In this post, we&#8217;ll focus on quadrant 2: DPO and offline GRPO. Along the way, I&#8217;ll point out how methods like online PPO and online GRPO fit in. 
Historically, online PPO came first, so understanding it helps explain DPO.</p><h2>Theory of Relative Scoring</h2><p>Before getting into the objective function of direct preference optimization (DPO), we need to motivate the idea of relative scoring.</p><p>We&#8217;re given lists of prompts x and pairwise responses a<sub>+</sub> and a<sub>-</sub>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(x, a_{+}, a_{-})&quot;,&quot;id&quot;:&quot;ZZIOCQBZSL&quot;}" data-component-name="LatexBlockToDOM"></div><p>All we know is that a<sub>+</sub> is preferred to a<sub>-</sub>. A human may have rated them that way, or another signal might imply it (e.g. code that compiles &gt; code that fails).</p><p>That setup doesn&#8217;t immediately lend itself to the methods we&#8217;ve seen so far. On-policy methods don&#8217;t work because neither response may be likely under the current policy. Off-policy methods still don&#8217;t work because we lack a defined reward.</p><p>You might try to make the model more likely to output a<sub>+</sub> than a<sub>-</sub> by optimizing Pr[&#960;(a<sub>+</sub> | x) &gt; &#960;(a<sub>-</sub> | x)]. But that expression makes no sense. &#960;(a | x) are constants given the model. Unless we add tunable parameters (say, some &#947; where f<sub>&#947;</sub>(&#960;) produces a new model), those probabilities don&#8217;t change.</p><p>You could define f<sub>&#947;</sub>(&#960;)(x, a) and optimize Pr[f<sub>&#947;</sub>(&#960;)(x, a<sub>+</sub>) &gt; f<sub>&#947;</sub>(&#960;)(x, a<sub>-</sub>)], or build an even more general model f<sub>&#947;</sub>(&#960;, x, a<sub>+</sub>, a<sub>-</sub>) that directly outputs the likelihood that a<sub>+</sub> is better. But f is still abstract. It&#8217;s unclear how to parameterize it.</p><p>Instead, DPO (and the original online PPO post-training) take a simpler route by introducing a latent reward. 
The assumption is that if a human preferred a<sub>+</sub> to a<sub>-</sub>, then there exists some implicit reward function r such that</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;r(a_{+} | x) + &#949;_{+} > r(a_{-} | x) + &#949;_{-}&quot;,&quot;id&quot;:&quot;HXOCDHEXHZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where &#949; represents human noise or ambiguity. If we can learn that reward function, we can optimize the model accordingly.</p><h2>Learning the Reward Function</h2><p>One approach is maximum likelihood estimation. We denote a<sub>+</sub> &#8827; a<sub>-</sub> if a<sub>+</sub> is preferred. We&#8217;d like a function g such that:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;g(r_\\phi(a_{+} | x), r_\\phi(a_{-} | x)) = Pr[a_{+} &#8827; a_{-}]&quot;,&quot;id&quot;:&quot;HHUMANBTYW&quot;}" data-component-name="LatexBlockToDOM"></div><p>and then optimize &#966; to maximize:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\ \\prod Pr[a_{+} &#8827; a_{-}]\\ &quot;,&quot;id&quot;:&quot;OSPVKVGWTF&quot;}" data-component-name="LatexBlockToDOM"></div><p>Let&#8217;s try to define g. Notice:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\nPr[a_{+} &#8827; a_{-}] &amp;= Pr[r(a_{+} | x) + &#949;_{+} > r(a_{-} | x) + &#949;_{-}] \\\\\n&amp;= Pr[r(a_{+} | x) &#8722; r(a_{-} | x) > &#949;_{-} &#8722; &#949;_{+}]\n\\end{align*}&quot;,&quot;id&quot;:&quot;WCAPQQRNJF&quot;}" data-component-name="LatexBlockToDOM"></div><p>So preference depends only on the difference between rewards. That implies translational invariance: g(u, v) = g(u + c, v + c). 
That property implies that g must be f(r(a<sub>+</sub> | x) - r(a<sub>-</sub> | x)) for some function f, since g(u, v) = g(u &#8722; v, 0) = f(u &#8722; v), where the first equality follows by translational invariance.</p><p>Second, if r(a<sub>++</sub> | x) &gt; r(a<sub>+</sub> | x) &gt; r(a<sub>-</sub> | x), the higher-reward response should never be less preferred. In other words, f must be non-decreasing: f&#8217;(x) &gt;= 0</p><p>Finally, f(r(a<sub>+</sub> | x) - r(a<sub>-</sub> | x)) + f(r(a<sub>-</sub> | x) - r(a<sub>+</sub> | x)) = 1, which along with the previous condition, implies f(0) = &#189;, lim <sub>t&#8594;&#8734;</sub> f(t) = 1, and lim <sub>t&#8594;-&#8734;</sub> f(t) = 0.</p><p>Many functions f satisfy these conditions. The choice depends on what noise distribution you assume. In practice, DPO uses the logistic sigmoid, which assumes Gumbel noise:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n&#963;(x) &amp;:= \\frac{1}{1 + e^{-x}} \\vphantom{\\frac{1}{1 + e^{-x}}} \\\\\nU(a) &amp;= r(a) + &#949;,\\quad &#949; &#8764; \\text{Gumbel}(0, 1) \\vphantom{\\frac{1}{1 + e^{-x}}} \\\\\n\\implies Pr[U(a_{+}) > U(a_{-})] &amp;= &#963;(r(a_{+}) &#8722; r(a_{-})) \\vphantom{\\frac{1}{1 + e^{-x}}}\n\\end{aligned}&quot;,&quot;id&quot;:&quot;ERJADFCGRK&quot;}" data-component-name="LatexBlockToDOM"></div><p>If noise were Gaussian, you&#8217;d recover the probit model instead.</p><p>The final objective to optimize r<sub>&#966;</sub> is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\\max J(\\phi) &amp;= \\sum \\log \\sigma(r_{\\phi}(a_{+}) &#8722; r_{\\phi}(a_{-})) \\\\[4pt]\n\\nabla J(\\phi) &amp;= \\sum (1 &#8722; \\sigma(r_{\\phi}(a_{+}) &#8722; r_{\\phi}(a_{-}))) [\\nabla r_{\\phi}(a_{+}) &#8722; \\nabla r_{\\phi}(a_{-})]\n\\end{align*}\n&quot;,&quot;id&quot;:&quot;RXKYWFYHOD&quot;}" data-component-name="LatexBlockToDOM"></div><h2>The KL Divergence 
Penalty &amp; PPO</h2><p>Now we have a reward function. Just like the traditional methodology for REINFORCE, you can optimize your policy with respect to the objective function:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\max_\\theta J(\\theta) = E_{\\pi_{\\theta}}[r(x, a)]&quot;,&quot;id&quot;:&quot;WRMDNWJJEM&quot;}" data-component-name="LatexBlockToDOM"></div><p>It&#8217;s a bit different from REINFORCE since there&#8217;s no discounted sum of rewards across a trajectory. Instead, it&#8217;s just a single-step reward that we&#8217;re optimizing with respect to. The problem with this approach is that it&#8217;s going to completely alter your model. The optimization will force the policy to output a<sub>+</sub> with very high probability, at the cost of everything else.</p><p>So the actual optimization for online PPO and DPO actually adds a constraint to prevent the policy from diverging too much from the original policy, &#960;<sub>ref</sub>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\max_&#952; E_{\\pi_{\\theta}}[r(x, a)] \\ \\text{s.t. } KL(\\pi_\\theta | \\pi_\\text{ref}) < &#948;&quot;,&quot;id&quot;:&quot;BWQUCKHCWN&quot;}" data-component-name="LatexBlockToDOM"></div><p>That KL divergence constraint might make you think of PPO. But that similarity is completely superficial. 
Recall that the KL divergence constraint for PPO came from rewriting the objective function:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*} \\max_\\theta J(&#952;) &amp;= E_\\tau[g(\\tau)] = E_{x&#8764;d^{&#960;_{\\text{old}}},a&#8764;&#960;_{\\text{new}}}[A^{&#960;_{\\text{old}}}(x, a)] \\end{align*}&quot;,&quot;id&quot;:&quot;AHFMGIEDFY&quot;}" data-component-name="LatexBlockToDOM"></div><p>We needed to constrain d<sup>&#960;_old</sup> &#8776; d<sup>&#960;_new</sup> so we didn&#8217;t have to re-sample, and the best we could do was penalize KL(&#960;<sub>new</sub> || &#960;<sub>old</sub>) and establish an upper bound on the divergence of the state distributions.</p><p>The KL divergence constraint for online PPO and DPO is not fundamentally justified in the same way. It is simply the heuristic notion that we want &#960;<sub>new</sub> to be not too different from &#960;<sub>ref</sub>. You could theoretically derive this constraint if you think the true model follows a Boltzmann distribution, and you impose &#960;<sub>ref</sub> as a prior. This leads to the same objective function as above. 
But that&#8217;s not really where this KL divergence constraint comes from.</p><p>If you&#8217;re running online PPO, you&#8217;ll see the KL divergence penalty in the objective function to keep the policy &#960;<sub>new</sub> close to &#960;<sub>ref</sub>, and a clipping mechanism to keep the policy &#960;<sub>new</sub> within the trust region of &#960;<sub>old</sub>.</p><p>To finish the derivation for online PPO, we add the constraint that &#960;<sub>&#952;</sub>(a | x) must sum to 1:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\nJ(&#952;) &amp;= \\mathbb{E}_{\\pi_\\theta}[r(x, a)] \n    - \\frac{1}{&#946;} \\, KL(\\pi_\\theta \\,\\|\\, \\pi_\\text{ref}) \n    - &#955; \\left[\\sum_a \\pi_\\theta(a | x) - 1\\right] \\\\[6pt]\nJ(&#952;) &amp;= \\sum_a \\pi_\\theta(a | x) \n    \\left[r(x, a) - \\tfrac{1}{&#946;}\\big(\\log \\pi_\\theta(a | x) - \\log \\pi_\\text{ref}(a | x)\\big) - &#955;\\right] \n    - &#955;\n\\end{align*}\n&quot;,&quot;id&quot;:&quot;TWXFCDYHTC&quot;}" data-component-name="LatexBlockToDOM"></div><p>Taking the gradient:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\\nabla J(&#952;) &amp;= \\sum_a \\nabla \\pi_\\theta(a | x)\n    \\left[r(x, a) - \\tfrac{1}{&#946;}\\big(\\log \\pi_\\theta(a | x) - \\log \\pi_\\text{ref}(a | x)\\big) - &#955;\\right] \\\\[6pt]\n&amp;\\quad - \\tfrac{1}{&#946;} \\sum_a \\pi_\\theta(a | x) \\, \\nabla \\log \\pi_\\theta(a | x)\n\\end{align*}\n&quot;,&quot;id&quot;:&quot;RZPJBZNNAG&quot;}" data-component-name="LatexBlockToDOM"></div><p>You can go ahead and optimize &#960;<sub>&#952;</sub> using this gradient, and that&#8217;s exactly where methods like online PPO (or as we cover later, online GRPO) fit in.</p><h2>Direct Preference Optimization (DPO)</h2><p>DPO, on the other hand, attempts to turn this into a supervised learning problem, eliminating the need for rollouts or trajectories altogether. 
DPO starts by solving for the closed form solution of &#960;<sub>&#952;</sub>. Note that &#960;<sub>&#952;</sub>(a | x) * &#8711;log(&#960;<sub>&#952;</sub>(a | x)) = &#8711;&#960;<sub>&#952;</sub>(a | x) by the log-gradient trick, so that second summation sums to 1:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\sum &#8711;&#960;_&#952;(a|x)[r(x,a) - \\tfrac{1}{&#946;}(\\log &#960;_&#952;(a|x) - \\log &#960;_{\\text{ref}}(a|x)) - &#955;] - \\tfrac{1}{&#946;} = 0&quot;,&quot;id&quot;:&quot;WOUEKDTACK&quot;}" data-component-name="LatexBlockToDOM"></div><p>Simplifying,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\beta r(x,a) - [\\log &#960;_&#952;(a|x) - \\log &#960;_{\\text{ref}}(a|x)] - &#955; - 1 = 0&quot;,&quot;id&quot;:&quot;PIVOJUJVHO&quot;}" data-component-name="LatexBlockToDOM"></div><p>so:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\log &#960;_&#952;(a|x) = \\beta r(x,a) + \\log &#960;_{\\text{ref}}(a|x) + &#955; - 1&quot;,&quot;id&quot;:&quot;NLGJRVXDNT&quot;}" data-component-name="LatexBlockToDOM"></div><p>and exponentiating gives:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&#960;_&#952;(a|x) = \\frac{&#960;_{\\text{ref}}(a|x) e^{\\beta r(x,a)}}{C(x)}&quot;,&quot;id&quot;:&quot;FRYYCKDIRM&quot;}" data-component-name="LatexBlockToDOM"></div><p>where C(x) is the normalization constant ensuring probabilities sum to one.</p><p>Then, DPO moves in the reverse direction, substituting this definition of &#960;<sub>&#952;</sub> to express r:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\ r(x, a)=\\frac{1}{&#946;}[ \\log C(x)+\\log &#960;_&#952;(a | x)&#8722;\\log &#960;_\\text{ref}(a | x) ]&quot;,&quot;id&quot;:&quot;CVBABDJPXC&quot;}" data-component-name="LatexBlockToDOM"></div><p>Previously we solved this expression, which we&#8217;ll use for MLE:</p><div 
class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Pr[U(a_{+}) > U(a_{-})] = &#963;(r(a_{+}) &#8722; r(a_{-}))&quot;,&quot;id&quot;:&quot;ZRXQYUSIQZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Plugging in r:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*} Pr[U(a_{+}) > U(a_{-})] &amp;= \\sigma\\Big(\\frac{1}{\\beta}[(\\log \\pi_{\\theta}(a_{+} | x) &#8722; \\log \\pi_{\\text{ref}}(a_{+} | x)) &#8722; (\\log \\pi_{\\theta}(a_{-} | x) &#8722; \\log \\pi_\\text{ref}(a_{-} | x))]\\Big) \\end{align*}&quot;,&quot;id&quot;:&quot;GUUWNJCGIQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Then we maximize likelihood:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\begin{align*} J(&#952;) &amp;= &#8721; \\log \\sigma\\Big(\\frac{1}{\\beta}[(\\log \\pi_{\\theta}(a_{+} | x) &#8722; \\log \\pi_\\text{ref}(a_{+} | x)) &#8722; (\\log \\pi_{\\theta}(a_{-} | x) &#8722; \\log \\pi_\\text{ref}(a_{-} | x))]\\Big) \\end{align*}&quot;,&quot;id&quot;:&quot;ZGIUFWLGJB&quot;}" data-component-name="LatexBlockToDOM"></div><p>That&#8217;s the final DPO objective. It can be optimized via standard supervised learning on your dataset. Choosing &#946; controls the trade-off between imitation and divergence. But note that you no longer get additional signal beyond the dataset, unlike online PPO.</p><h2>Offline Group Relative Policy Optimization (GRPO)</h2><p>Now we reach the modern variant. GRPO was introduced by DeepSeek in 2024.</p><p>GRPO begins with the same pairwise setup as DPO. 
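</p><p>As a reference point, the final DPO objective fits in a few lines of plain Python. A toy sketch with made-up log-probabilities, using the 1/&#946; convention from this post (all names and numbers illustrative):</p>

```python
import math

def dpo_loss(pairs, beta=0.1):
    """Mean negative log-sigmoid DPO loss over toy preference pairs.

    Each pair is (logp_pos, logref_pos, logp_neg, logref_neg): log-probs of
    the preferred and rejected responses under the current policy and the
    frozen reference policy.
    """
    total = 0.0
    for logp_pos, logref_pos, logp_neg, logref_neg in pairs:
        # z(x) from the objective above, with the 1/beta convention
        z = ((logp_pos - logref_pos) - (logp_neg - logref_neg)) / beta
        total += -math.log(1 / (1 + math.exp(-z)))  # -log sigma(z)
    return total / len(pairs)

pairs = [(-1.2, -1.5, -2.0, -1.8), (-0.9, -1.0, -1.4, -1.1)]
print(dpo_loss(pairs))
```

<p>In practice the log-probabilities come from summing token log-likelihoods under the policy and the frozen reference model.</p><p>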
In fact, pairwise GRPO is mathematically identical to DPO, just rewritten.</p><p>To simplify notation, define:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z(x):=\\frac{1}{&#946;}[(\\log \\pi_{\\theta}(a_{+} | x)&#8722;\\log \\pi_\\text{ref}(a_{+} | x))&#8722;(\\log \\pi_{\\theta}(a_{-} | x)&#8722;\\log \\pi_\\text{ref}(a_{-} | x))]&quot;,&quot;id&quot;:&quot;IGQRTPOSXJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Then the objective function for DPO becomes:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\nJ(&#952;) &amp;= \\sum \\log &#963;(z(x)) \\\\[4pt]\n\\nabla J(&#952;) &amp;= \\sum (1 - &#963;(z(x))) \\, \\nabla z(x) \\\\[4pt]\n\\nabla z(x) &amp;= \\frac{1}{&#946;}\\left[\\nabla \\log \\pi_{\\theta}(a_{+} \\mid x) - \\nabla \\log \\pi_{\\theta}(a_{-} \\mid x)\\right]\n\\end{align*}\n&quot;,&quot;id&quot;:&quot;FNDZBTTHMA&quot;}" data-component-name="LatexBlockToDOM"></div><p>Define a shorthand:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;w(x,a_{+},a_{-}):=\\frac{1&#8722;&#963;(z(x))}{&#946;}&quot;,&quot;id&quot;:&quot;XSMNPXXYJL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Then:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&#8711;J(&#952;)=&#8721; w(x,a_{+},a_{-})[&#8711; \\log \\pi_\\theta(a_{+} | x) &#8722; &#8711; \\log \\pi_\\theta(a_{-} | x)]&quot;,&quot;id&quot;:&quot;SCSCEFJROX&quot;}" data-component-name="LatexBlockToDOM"></div><p>Next, GRPO defines some synthetic reward function &#340;:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{R}(a) :=\n\\begin{cases}\n+w(x, a_{+}, a_{-}), &amp; a = a_{+} \\\\[4pt]\n-w(x, a_{+}, a_{-}), &amp; a = a_{-} \\\\[4pt]\n0, &amp; \\text{otherwise}\n\\end{cases}&quot;,&quot;id&quot;:&quot;NWDWLBUXLV&quot;}" data-component-name="LatexBlockToDOM"></div><p>Then we might rewrite the gradient 
as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&#8711;J(&#952;)=&#8721; \\hat{R}(a) &#8711; \\log &#960;_&#952;(a | x)&quot;,&quot;id&quot;:&quot;HTOTLKLRLT&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is exactly the REINFORCE gradient! It&#8217;s the basic formulation for offline GRPO in the pairwise case. As you can see, all we did was make a few substitutions, but we didn&#8217;t fundamentally change the optimization. Thus, offline GRPO (pairwise) &#8801; DPO &#8801; REINFORCE in disguise.</p><h2>Extending to Groups</h2><p>So that&#8217;s the pairwise case. But the &#8220;group&#8221; in &#8220;group relative policy optimization&#8221; implies that you can have more than two responses. To be clear, with a<sub>1</sub> &gt; a<sub>2</sub> &gt; a<sub>3</sub>, you could decompose that into pairs (a<sub>1</sub> &gt; a<sub>2</sub>, a<sub>2</sub> &gt; a<sub>3</sub>, &#8230;), but GRPO treats the group as a first-class citizen.</p><p>Here&#8217;s where the theory gets shaky. In DPO, the weights w<sub>i</sub> are strictly determined as &#177;(1&#8722;&#963;(z))/&#946;. GRPO merely observes that these weights satisfy &#8721; w<sub>i</sub> = 0 and generalizes: any set of scores with &#8721; w<sub>i</sub> = 0 is allowed.</p><p>The same supervised objective then applies:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&#8711;J(&#952;)=&#8721; \\hat{R}(a_{i}) &#8711; \\log &#960;_&#952;(a_{i} | x)&quot;,&quot;id&quot;:&quot;XAXPNTCCIA&quot;}" data-component-name="LatexBlockToDOM"></div><p>where &#340; uses the custom group weights.</p><p>To adapt this to online GRPO, we reuse the same idea as online PPO. After generating k responses for a prompt, compute their scores, center them (subtract the mean), and treat those as the rewards.</p><h2>End of the Series</h2><p>So is GRPO broken? Many people report that it works for them empirically. 
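</p><p>The pairwise bookkeeping is easy to sanity-check numerically: the DPO weight w = (1 &#8722; &#963;(z))/&#946; turns into the synthetic GRPO rewards &#177;w, which sum to zero. A toy check with made-up log-probabilities (all numbers illustrative):</p>

```python
import math

beta = 0.1
# Toy log-probs of the preferred / rejected response under the
# current policy and the frozen reference policy (illustrative).
logp_pos, logp_neg = -1.2, -2.0
logref_pos, logref_neg = -1.5, -1.8

# z(x) as defined above, with the 1/beta convention from this post
z = ((logp_pos - logref_pos) - (logp_neg - logref_neg)) / beta
sigma = 1 / (1 + math.exp(-z))

# DPO gradient weight, and the synthetic GRPO rewards it induces
w = (1 - sigma) / beta
rewards = {"a_plus": +w, "a_minus": -w}

assert w > 0                               # preferred response is pushed up
assert abs(sum(rewards.values())) < 1e-12  # group rewards are centered
print(rewards)
```

<p>The group extension keeps only that zero-sum property, which is exactly where the theory loosens.</p><p>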
But it&#8217;s fair to say that GRPO&#8217;s theoretical foundations are weaker than many other methods. I&#8217;ll end this with a take I posted about GRPO: </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fq3K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb1ba6-9208-4863-8a98-5a9121d890c8_1199x2427.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fq3K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb1ba6-9208-4863-8a98-5a9121d890c8_1199x2427.png 424w, https://substackcdn.com/image/fetch/$s_!Fq3K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb1ba6-9208-4863-8a98-5a9121d890c8_1199x2427.png 848w, https://substackcdn.com/image/fetch/$s_!Fq3K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb1ba6-9208-4863-8a98-5a9121d890c8_1199x2427.png 1272w, https://substackcdn.com/image/fetch/$s_!Fq3K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb1ba6-9208-4863-8a98-5a9121d890c8_1199x2427.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fq3K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb1ba6-9208-4863-8a98-5a9121d890c8_1199x2427.png" width="1199" height="2427" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33cb1ba6-9208-4863-8a98-5a9121d890c8_1199x2427.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2427,&quot;width&quot;:1199,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:723723,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.neelsomaniblog.com/i/175844591?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb1ba6-9208-4863-8a98-5a9121d890c8_1199x2427.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fq3K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb1ba6-9208-4863-8a98-5a9121d890c8_1199x2427.png 424w, https://substackcdn.com/image/fetch/$s_!Fq3K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb1ba6-9208-4863-8a98-5a9121d890c8_1199x2427.png 848w, https://substackcdn.com/image/fetch/$s_!Fq3K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb1ba6-9208-4863-8a98-5a9121d890c8_1199x2427.png 1272w, https://substackcdn.com/image/fetch/$s_!Fq3K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33cb1ba6-9208-4863-8a98-5a9121d890c8_1199x2427.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://twitter.com/neelsomani/status/1976690361553895711">https://x.com/neelsomani/status/1976690361553895711</a></figcaption></figure></div><p>I hope to cover other RL/ML topics in future posts, but that concludes my blog series on reinforcement learning. 
Feedback is appreciated!</p>]]></content:encoded></item><item><title><![CDATA[What You Didn't Learn in Berkeley CS 188 — Part 3]]></title><description><![CDATA[Off-policy methods, for better sample efficiency and scalability.]]></description><link>https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley-9b3</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley-9b3</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Thu, 09 Oct 2025 00:57:02 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a1429d12-1261-4de3-878c-f5abfcd3146e_1000x1000.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>So far in this series on reinforcement learning, we&#8217;ve covered <a href="https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley">classical methods</a> and the foundations of <a href="https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley-b29">continuous-control methods</a>.</p><p>Let&#8217;s say you want to use these methods at scale. Ideally, we&#8217;d have a method to run as many actors as we want, alongside some way to consolidate the results.</p><p>Basic knowledge of PPO tells us that we can do this as long as the actors are using a policy that isn&#8217;t &#8220;too far&#8221; from the latest policy &#960;<sub>current</sub>. But that restriction inherently caps how many actors we can run concurrently. If an update drifts &#960;<sub>current</sub> too far, then all of the other actors&#8217; work becomes much less valuable.</p><p>This naturally leads us to the &#8220;off-policy&#8221; methods: DDPG, TD3, and SAC.</p><h2><strong>Deep Deterministic Policy Gradient (DDPG)</strong></h2><p>One nice thing about Q-learning was that, in theory, you could fill out the state-action table in parallel. 
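</p><p>As a toy sketch of that parallel-friendly tabular update (the states and actions below are hypothetical, and &#945; here weights the old estimate, matching the convention used in this post):</p>

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.9, gamma=0.99):
    """One tabular Q-learning update:
    Q(s,a) <- alpha * Q(s,a) + (1 - alpha) * (r + gamma * max_a' Q(s',a'))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = alpha * Q[(s, a)] + (1 - alpha) * target

Q = defaultdict(float)  # every unseen (s, a) pair starts at 0
q_update(Q, s=0, a="right", r=1.0, s_next=1, actions=["left", "right"])
print(round(Q[(0, "right")], 6))  # 0.1
```

<p>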
Work was never really wasted, because every visit to an (s, a) pair contributed information toward the optimal value function.</p><p>Of course, the argument for convergence of Q-learning relied on the contraction property of the Bellman update operator. That&#8217;s going to be harder to prove in a continuous action space, because we have to use something like a neural net to output Q-values (DQN), meaning we can&#8217;t guarantee that the policy is strictly improving like in the tabular method. In fact, there is no clean convergence proof for DQN.</p><p>Regardless, the question remains: is there a reasonable method that never &#8220;throws away&#8221; old samples and can use every (s, a) collected? Note that the reason we couldn&#8217;t use Q-learning in a continuous action space is that we couldn&#8217;t solve the max&#8336; in this update:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q(s, a) \\leftarrow &#945; Q(s, a) + (1 - &#945;)(r + &#947; \\max_{a&#8217;} Q(s&#8217;, a&#8217;))&quot;,&quot;id&quot;:&quot;KFYFIDTUAJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>What if we try just outputting the maximizing action arg max&#8336; Q(s&#8242;, a&#8242;) directly? Is it possible to build an algorithm around this?</p><h3><strong>Derivation of the Deterministic Policy Gradient Theorem</strong></h3><p>The deterministic policy gradient theorem is a way to differentiate our objective function by integrating over states rather than actions.
Let&#8217;s define a &#8220;deterministic policy&#8221; &#956;:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&#956;_&#952; : S &#8594; A&quot;,&quot;id&quot;:&quot;XXBDRGVBTZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>and its objective:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;J(&#952;) := &#120124;_{s&#8764;d^{&#956;_&#952;}}[Q^{&#956;_&#952;}(s, &#956;_&#952;(s))]&quot;,&quot;id&quot;:&quot;OAFTOYFDWT&quot;}" data-component-name="LatexBlockToDOM"></div><p>Our goal is to compute &#8711;<sub>&#952;</sub>J(&#952;) so we can perform gradient ascent. Starting from:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; J(&#952;) = \\sum_t &#947;^t r(s_t, &#956;_&#952;(s_t)) = \\sum_s \\left(\\sum_t &#947;^t \\Pr[s_t=s | &#956;_&#952;]\\right) r(s, &#956;_&#952;(s))&quot;,&quot;id&quot;:&quot;DFZIWEIHSQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Direct differentiation is messy because Pr[s<sub>t </sub>= s] depends on &#952;. We&#8217;d like to express this in terms of value functions instead, and eliminate the Pr[s<sub>t </sub>= s] term.</p><p>Intuitively, this discounted and probability-weighted summation of rewards across each state is equivalent to the expected value of the game from its starting state: &#120124;<sub>s_0</sub>[V(s&#8320;)]. Let&#8217;s prove that equivalence formally.</p><h4><strong>Step 1.
Relating reward and value</strong></h4><p>We relate r and V via the Bellman equation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V^{&#956;_&#952;}(s) := r(s, &#956;_&#952;(s)) + &#947; &#120124;_{s&#8217;}[V^{&#956;_&#952;}(s&#8217;)]&quot;,&quot;id&quot;:&quot;UJHERWERND&quot;}" data-component-name="LatexBlockToDOM"></div><p>and define the &#8220;discounted state visitation&#8221;:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; p_&#952;(s) := \\sum_t &#947;^t \\Pr[s_t=s]&quot;,&quot;id&quot;:&quot;HVUREAFUKR&quot;}" data-component-name="LatexBlockToDOM"></div><p>Then:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla J(&#952;) = \\nabla \\sum_s p_&#952;(s) r(s, &#956;_&#952;(s)) = \\nabla \\sum_s p_&#952;(s)[V^{&#956;_&#952;}(s) - &#947; &#120124;_{s&#8217;}[V^{&#956;_&#952;}(s&#8217;)]]&quot;,&quot;id&quot;:&quot;VYQNAUWFKI&quot;}" data-component-name="LatexBlockToDOM"></div><p>So we want an expression for &#8711;&#931;&#8347; p<sub>&#952;</sub>(s)V<sup>&#956;_&#952;</sup>(s) and/or &#8711;&#931;&#8347; p<sub>&#952;</sub>(s)&#947;&#120124;[V<sup>&#956;_&#952;</sup>(s&#8242;)].</p><h4><strong>Step 2. 
Recursive property of state visitation</strong></h4><p>We expand transitions:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\Pr[s_{t+1}=s&#8217;] = \\sum_s \\Pr[s_t=s] \\Pr[s&#8217; | s, &#956;_&#952;(s)]&quot;,&quot;id&quot;:&quot;FOLSMQDHEB&quot;}" data-component-name="LatexBlockToDOM"></div><p>Sum over t:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\sum_t \\Pr[s_{t+1}=s&#8217;] = \\sum_t \\sum_s \\Pr[s_t=s] \\Pr[s&#8217; | s, &#956;_&#952;(s)]&quot;,&quot;id&quot;:&quot;EABSSOZYGZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Apply discounting by &#947;:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_t &#947;^{t+1} \\Pr[s_{t+1}=s&#8217;] = \\sum_t &#947;^{t+1} \\sum_s \\Pr[s_t=s] \\Pr[s&#8217; | s, &#956;_&#952;(s)]&quot;,&quot;id&quot;:&quot;JQXZECPLNV&quot;}" data-component-name="LatexBlockToDOM"></div><p>Define:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;d_0(s) := \\Pr[s_0=s]&quot;,&quot;id&quot;:&quot;KDYLXRWOSU&quot;}" data-component-name="LatexBlockToDOM"></div><p>The left-hand side is p<sub>&#952;</sub>(s&#8242;) except it omits t = 0:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\\sum_t \\gamma^{t+1} \\Pr[s_{t+1}=s'] \n&amp;= \\sum_{u=1}^\\infty \\gamma^u \\Pr[s_u=s'] = p_\\theta(s') - d_0(s') \\\\\np_\\theta(s') \n&amp;= d_0(s') + \\gamma \\sum_s p_\\theta(s)\\, \\Pr[s' \\mid s, \\mu_\\theta(s)]\n\\end{align}\n&quot;,&quot;id&quot;:&quot;PMWHVTGEXQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>In words: Each state&#8217;s discounted occupancy p<sub>&#952;</sub>(s&#8242;) consists of two parts: the initial-state contribution d&#8320;(s&#8242;) and the &#947;-discounted flow of visitation mass from predecessor states, weighted by their transition probabilities under &#956;<sub>&#952;</sub>.</p><p>It&#8217;s starting to look 
pretty close to what we want above.</p><h4><strong>Step 3. Multiplying by V(s&#8242;) and summing over s&#8242;</strong></h4><p>Multiply both sides by V(s&#8242;) and sum over s&#8242;:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{s&#8217;} p_&#952;(s&#8217;) V(s&#8217;) = \\sum_{s&#8217;} d_0(s&#8217;) V(s&#8217;) + &#947; \\sum_s p_&#952;(s)\\sum_{s&#8217;} \\Pr[s&#8217; | s, &#956;_&#952;(s)] V(s&#8217;)&quot;,&quot;id&quot;:&quot;YQNECDUYFA&quot;}" data-component-name="LatexBlockToDOM"></div><p>Recognize that the inner sum is an expectation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{s&#8217;} \\Pr[s&#8217; | s, &#956;_&#952;(s)] V(s&#8217;) = &#120124;_{s&#8217;}[V(s&#8217;)]&quot;,&quot;id&quot;:&quot;AGJGQGOAED&quot;}" data-component-name="LatexBlockToDOM"></div><p>So:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{s&#8217;} p_&#952;(s&#8217;) V(s&#8217;) = \\sum_{s&#8217;} d_0(s&#8217;) V(s&#8217;) + &#947; \\sum_s p_&#952;(s) &#120124;_{s&#8217;}[V(s&#8217;)]&quot;,&quot;id&quot;:&quot;XFKGMAOMNN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now recall our earlier expression for the gradient, and substitute our expression above back in:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\\nabla J(\\theta) &amp;= \\nabla \\sum_s p_\\theta(s)\\left[V^{\\mu_\\theta}(s) - \\gamma\\, \\mathbb{E}_{s'}[V^{\\mu_\\theta}(s')]\\right] \\\\\n\\nabla J(\\theta) &amp;= \\nabla \\left(\\sum_{s'} d_0(s')\\, V^{\\mu_\\theta}(s')\\right)\n\\end{align}\n&quot;,&quot;id&quot;:&quot;HLGOGUKZDP&quot;}" data-component-name="LatexBlockToDOM"></div><p>Or equivalently,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla \\left(\\sum_s p_&#952;(s)V(s) - &#947;\\sum_s p_&#952;(s)&#120124;[V(s&#8217;)]\\right) = \\nabla 
&#120124;_{s_0}[V(s_0)]&quot;,&quot;id&quot;:&quot;RRDHWOBFMI&quot;}" data-component-name="LatexBlockToDOM"></div><h4><strong>Step 4. Chain rule on Q</strong></h4><p>Now the expression is much friendlier. How do we differentiate V(s)? Since V(s) = Q(s, &#956;<sub>&#952;</sub>(s)) for deterministic policies, we apply the multivariable chain rule:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\\nabla_\\theta \\mathbb{E}_{s_0}[Q(s_0, \\mu_\\theta(s_0))] \n&amp;= \\sum_s \\Pr[s_0 = s] \\big[ \\nabla_\\theta Q(s, a)\\big|_{a=\\mu_\\theta(s)} \\\\\n&amp;\\quad + \\nabla_a Q(s, a)\\big|_{a=\\mu_\\theta(s)} \\nabla_\\theta \\mu_\\theta(s) \\big]\n\\end{align}\n&quot;,&quot;id&quot;:&quot;YCDTTZJCAE&quot;}" data-component-name="LatexBlockToDOM"></div><p>The first term, &#8711;<sub>&#952;</sub>Q(s,a), is recursive with respect to the original gradient &#8711;J via the Bellman equation, except now it&#8217;s over the distribution of s<sub>1</sub> rather than the starting distribution of s<sub>0</sub>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_&#952; Q(s,a)_{|{a=&#956;_&#952;(s)}} = \\nabla_&#952;[r(s,a) + &#947; &#120124;_{s&#8217;}V^{&#956;_&#952;}(s&#8217;)] = &#947; \\nabla_&#952; &#120124;_{s&#8217;}[V^{&#956;_&#952;}(s&#8217;)]&quot;,&quot;id&quot;:&quot;CGYPNSUWDG&quot;}" data-component-name="LatexBlockToDOM"></div><p>Unrolling this recursion gives:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\\nabla_\\theta J(\\theta)\n&amp;= \\gamma\\,\\nabla_\\theta \\mathbb{E}_{s_1}[V^{\\mu_\\theta}(s_1)]\n  + \\nabla_a Q(s_0, a)\\big|_{a=\\mu_\\theta(s_0)}\\,\\nabla_\\theta \\mu_\\theta(s_0) \\\\[6pt]\n\\nabla_\\theta J(\\theta)\n&amp;= \\gamma^2\\,\\nabla_\\theta \\mathbb{E}_{s_2}[V^{\\mu_\\theta}(s_2)]\n  + \\gamma\\,\\nabla_a Q(s_1, a)\\big|_{a=\\mu_\\theta(s_1)}\\,\\nabla_\\theta \\mu_\\theta(s_1) \\\\[-2pt]\n&amp;\\quad\n 
 + \\nabla_a Q(s_0, a)\\big|_{a=\\mu_\\theta(s_0)}\\,\\nabla_\\theta \\mu_\\theta(s_0)\n\\end{align*}\n&quot;,&quot;id&quot;:&quot;QZSMWPKHJK&quot;}" data-component-name="LatexBlockToDOM"></div><p>Continuing indefinitely:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_\\theta J(\\theta)\n= \\sum_{t=0}^{\\infty} \\gamma^t \\, \\mathbb{E}_{s_t}\n\\!\\left[\n  \\nabla_a Q(s_t, a)\\big|_{a=\\mu_\\theta(s_t)} \\,\n  \\nabla_\\theta \\mu_\\theta(s_t)\n\\right]&quot;,&quot;id&quot;:&quot;CUKBYZVVPC&quot;}" data-component-name="LatexBlockToDOM"></div><p>Equivalently, using the discounted state-visitation form:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\nabla_&#952; J(&#952;) = &#120124;_{s&#8764;p_&#952;}[\\nabla_a Q(s,a)|_{a=&#956;_&#952;(s)} \\nabla_&#952; &#956;_&#952;(s)]&quot;,&quot;id&quot;:&quot;RYFTVJFQAZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is the Deterministic Policy Gradient Theorem. The key difference from the traditional (stochastic) policy gradient theorem is that the expectation is taken over the state distribution rather than the action distribution.</p><p>That single shift, integrating over p<sub>&#952;</sub>(s) instead of &#960;(a|s), makes deterministic continuous control methods tractable and forms the foundation for DDPG and its successors.</p><h3><strong>Practical Considerations of DDPG</strong></h3><p>We can&#8217;t compute this expectation analytically, so we approximate Q and &#956; with neural networks.</p><p>The first relaxation that we need to make, in line with our effort to increase sample efficiency, is to allow ourselves to use samples that don&#8217;t necessarily come from the current state distribution p<sub>&#952;</sub>. 
The original DPG paper shows that for any sampling distribution p with sufficient coverage:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_&#952; J(&#952;) &#8733; &#120124;_{s&#8764;p}[\\nabla_&#952; &#956;_&#952;(s) \\nabla_a Q^{&#956;_&#952;}(s,a)|_{a=&#956;_&#952;(s)}]&quot;,&quot;id&quot;:&quot;SMVVGJXICJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Thus, we can reuse samples from &#8220;replay buffers,&#8221; where we store all of the previous samples that we&#8217;ve observed, even after our gradient has had many updates. This is the key to off-policy learning.</p><p>Second, DDPG uses two networks: the actor &#956;<sub>&#952;</sub> and the critic Q<sub>&#966;</sub>. During exploration, Gaussian noise is added to &#956;<sub>&#952;</sub>(s). The critic is trained by minimizing:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L(&#966;) = \\frac{1}{N} \\sum (Q_&#966;(s_t,a_t) - y_t)^2&quot;,&quot;id&quot;:&quot;EOOFKTUGQS&quot;}" data-component-name="LatexBlockToDOM"></div><p>where</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y_t = r_t + &#947; Q_{&#966;&#8217;}(s_{t+1}, &#956;_{&#952;&#8217;}(s_{t+1}))&quot;,&quot;id&quot;:&quot;JEWOANARSG&quot;}" data-component-name="LatexBlockToDOM"></div><p>and the actor loss is simply:</p><pre><code>actions = actor(states)
q_values = critic(states, actions)  # Q_phi(s, mu_theta(s))
actor_loss = -q_values.mean()       # maximize E[Q] by descending its negative
actor_loss.backward()</code></pre><p>In practice, DDPG maintains frozen target networks Q<sub>&#966;&#8217;</sub> and &#956;<sub>&#952;&#8217;</sub>. These networks are updated &#8220;softly&#8221;:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&#966;&#8217; &#8592; &#964;&#966; + (1-&#964;)&#966;&#8217;, \\quad &#952;&#8217; &#8592; &#964;&#952; + (1-&#964;)&#952;&#8217;&quot;,&quot;id&quot;:&quot;ZWRPGZTZMB&quot;}" data-component-name="LatexBlockToDOM"></div><p>for small &#964;. In practice, this reduces the variance of the network updates. But buyer beware: DDPG is still known to be very unstable and sensitive to hyperparameters.</p><h2><strong>Twin Delayed DDPG (TD3)</strong></h2><p>TD3 is directly a response to DDPG. It makes three changes to DDPG, all starting with the letter D, two of which are fairly simple:</p><ol><li><p>Double critics<strong>:</strong> Instead of one critic network, now we have two, and we use min(Q_{&#966;&#8242;&#8321;}, Q_{&#966;&#8242;&#8322;}). The reasoning is that Q-networks can be spiky and randomly assign too high of a value to some states. &#956;_&#952;(s) is computing arg max&#8336; Q(s, a), which implicitly relies on max&#8336; Q(s, a). This leads to systematic over-estimation of the true Q-value. By taking min(Q_{&#966;&#8242;&#8321;}, Q_{&#966;&#8242;&#8322;}), we counteract this overestimation bias and err toward underestimating the true objective function.</p></li><li><p>Delayed actor updates<strong>:</strong> The heuristic reasoning for convergence (not formally proven) is structurally identical to the convergence argument for policy iteration. We want the state values to converge first, and then we update the policy. 
TD3 solves this by only updating the actor once every two times the critic network is updated.</p></li></ol><p>The last change to DDPG, called deterministic target smoothing, is more nuanced.</p><h3><strong>Deterministic Target Smoothing Analysis</strong></h3><p>The critical insight here is that (s, a) is continuous, and therefore it&#8217;s extremely unlikely that you&#8217;ll ever hit the same (s, a) twice. That means that if you have a randomly high Q(s&#8242;, a&#8242;) from initialization, that spike will permanently distort that part of the Q-network. Worse, it propagates downstream through the Bellman update:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q(s, a) &#8592; r + &#947; Q(s', a')&quot;,&quot;id&quot;:&quot;PPAJRJGHLN&quot;}" data-component-name="LatexBlockToDOM"></div><p>This contamination affects nearby states too, since Q-networks generalize over continuous space. The double critics mitigate but don&#8217;t eliminate this problem.</p><p>We solve it by smoothing out that local, spurious (s, a) pair with its neighbors. Before, the target was:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y_t = r_t + &#947; \\min_i Q_{&#966;'_{i}}(s_{t+1}, &#956;_{&#952;'}(s_{t+1}))&quot;,&quot;id&quot;:&quot;YFUPQONIXP&quot;}" data-component-name="LatexBlockToDOM"></div><p>Instead, TD3 uses a smoothed target:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y_t = r_t + &#947; &#120124;_{\\epsilon}\\left[\\min_{i}Q_{\\phi'_i}(s_{t+1}, &#956;_{\\theta'}(s_{t+1}) + \\epsilon)\\right], \\quad \\epsilon &#8764; &#119977;(0, \\sigma^2)&quot;,&quot;id&quot;:&quot;ZJLGGFQAGE&quot;}" data-component-name="LatexBlockToDOM"></div><p>Maybe a single (s, a) pair has a random spike, but on average, the neighborhood of points is likely to be reasonable. Computing that expectation is intractable, but we can approximate it via Monte Carlo. 
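</p><p>A minimal sketch of that Monte Carlo approximation, in pure Python with hypothetical one-dimensional actor and critic functions (note that TD3 also clips the sampled noise):</p>

```python
import random

def td3_smoothed_target(r, s_next, actor_target, critic_targets,
                        gamma=0.99, sigma=0.2, noise_clip=0.5, n_samples=1):
    """Estimate y = r + gamma * E_eps[min_i Q_i(s', mu(s') + eps)], eps ~ N(0, sigma^2)."""
    total = 0.0
    for _ in range(n_samples):
        eps = max(-noise_clip, min(noise_clip, random.gauss(0.0, sigma)))
        a_next = actor_target(s_next) + eps
        total += min(q(s_next, a_next) for q in critic_targets)
    return r + gamma * total / n_samples

# Hypothetical targets: two critics that disagree; min() counteracts overestimation
actor = lambda s: 0.5 * s
critics = [lambda s, a: -(a - s) ** 2, lambda s, a: -(a - s) ** 2 + 1.0]
y = td3_smoothed_target(r=1.0, s_next=2.0, actor_target=actor, critic_targets=critics)
```

<p>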
In practice, a single &#949;-sample per update is sufficient to get a good estimate.</p><p>If we take a second-order Taylor expansion of Q around the mean action &#956;<sub>&#952;&#8242;</sub>(s<sub>t+1</sub>), we find that this expectation implicitly penalizes the Laplacian of Q with respect to its action input. That is, the added noise regularizes curvature, discouraging sharp spikes in Q-values that would otherwise destabilize learning.</p><p>Together, these three changes to DDPG make TD3 a stable and popular alternative.</p><h2><strong>Soft Actor-Critic (SAC)</strong></h2><p>Now it would be awesome if somehow SAC were a natural continuation of DDPG/TD3. That doesn&#8217;t do it justice, though, because SAC is actually philosophically and foundationally distinct from all of the methods that we&#8217;ve covered so far, even in previous posts.</p><h3><strong>Philosophical Justification</strong></h3><p>The thing about our original objective function (J(&#952;) := &#120124;<sub>&#964;</sub>[G(&#964;)]) is that it only cares about expected return, not diversity of exploration. All of the exploration we&#8217;ve done so far has been either by solving for the parameters of a stochastic policy (e.g. PPO) or by adding randomness to the actions (e.g. DDPG). The optimal policy is still allowed to collapse into a deterministic mapping.</p><p>The reality is that there are many possible policies that could explain our observations. Which one should we prefer?</p><p>The <a href="https://en.wikipedia.org/wiki/Principle_of_maximum_entropy">principle of maximum entropy</a> tells us that, out of all the distributions consistent with our observations, we should pick the one with maximum entropy. 
This is the distribution that makes the fewest assumptions about the underlying data.</p><p>For example, if we have a variable x such that &#120124;<sub>p</sub>[f(x)] = c, we should solve:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\max_p H(p) \\quad \\text{s.t.} \\quad &#120124;_p[f(x)] = c&quot;,&quot;id&quot;:&quot;TNDAQMGQJO&quot;}" data-component-name="LatexBlockToDOM"></div><p>The Lagrangian is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L(p, &#955;) = -\\sum_x p(x)\\log p(x) + &#955;(&#120124;_p[f(x)] - c)&quot;,&quot;id&quot;:&quot;KRVMBASFBP&quot;}" data-component-name="LatexBlockToDOM"></div><p>and setting its derivative to zero yields:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p^*(x) &#8733; e^{&#955; f(x)}&quot;,&quot;id&quot;:&quot;YKQSHKUXIZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is the Boltzmann distribution, the unique maximum-entropy distribution that satisfies the constraint.</p><h3><strong>Applying to Reinforcement Learning</strong></h3><p>Applying this idea to RL, if we assume we want some level of return R&#770;, then we should pick (in discrete form):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\max_q H(q) = -\\sum_{\\tau} q(\\tau)\\log q(\\tau)&quot;,&quot;id&quot;:&quot;AAMIWBZJLF&quot;}" data-component-name="LatexBlockToDOM"></div><p>subject to</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{\\tau} q(\\tau) = 1, \\qquad &#120124;_q[R(\\tau)] = \\hat{R}&quot;,&quot;id&quot;:&quot;JEYDXEJLPB&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here q(&#964;) is a hypothetical &#8220;best&#8221; trajectory distribution that reflects our observations, and &#120124;<sub>q</sub> denotes expectation over trajectories induced by q. 
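</p><p>You can see the trade-off between target return and entropy numerically. In this small sketch (the trajectory returns are hypothetical), a larger &#946; concentrates q on high-return trajectories, raising &#120124;[R] while lowering entropy:</p>

```python
import math

def boltzmann(returns, beta):
    """q(tau) proportional to exp(beta * R(tau)) over a finite set of trajectories."""
    weights = [math.exp(beta * r) for r in returns]
    z = sum(weights)
    return [w / z for w in weights]

def entropy(q):
    return -sum(p * math.log(p) for p in q if p > 0)

R = [0.0, 1.0, 2.0, 3.0]  # hypothetical returns for four trajectories
for beta in [0.0, 1.0, 5.0]:
    q = boltzmann(R, beta)
    mean_return = sum(p * r for p, r in zip(q, R))
    print(f"beta={beta}: E[R]={mean_return:.2f}, H(q)={entropy(q):.2f}")
```

<p>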
Of course, if we set R&#770; arbitrarily high, the resulting entropy will approach 0.</p><p>Forming the Lagrangian and setting the derivative to 0:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L(q, &#955;, &#946;) = -\\sum_{&#964;} q(&#964;)\\log q(&#964;) + &#955;\\left(\\sum_{&#964;} q(&#964;) - 1\\right) + &#946;\\left(\\sum_{&#964;} q(&#964;)R(&#964;) - \\hat{R}\\right)&quot;,&quot;id&quot;:&quot;DYVZKESKTO&quot;}" data-component-name="LatexBlockToDOM"></div><p>Taking the derivative with respect to q(&#964;):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{&#8706;L}{&#8706;q(&#964;)} = -(\\log q(&#964;) + 1) + &#955; + &#946; R(&#964;) = 0&quot;,&quot;id&quot;:&quot;LVPWKXDTFL&quot;}" data-component-name="LatexBlockToDOM"></div><p>which gives:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;q(&#964;) &#8733; e^{&#946; R(&#964;)}&quot;,&quot;id&quot;:&quot;IIOZDXCFGO&quot;}" data-component-name="LatexBlockToDOM"></div><p>Setting &#945; := 1/&#946;, we obtain the canonical maximum entropy form used in SAC:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;q(&#964;) &#8733; \\exp\\left(\\tfrac{1}{&#945;}\\sum_t r(s_t, a_t)\\right)&quot;,&quot;id&quot;:&quot;OFSZFJXOPZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>The last missing piece is that we&#8217;ve defined what the optimal trajectory distribution looks like, but the agent only controls the policy distribution. 
The environment dynamics also affect the likelihood of any trajectory:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p_0(&#964;) = p(s_0)\\prod_t p(s_{t+1}|s_t, a_t)&quot;,&quot;id&quot;:&quot;BHRXNUKDIS&quot;}" data-component-name="LatexBlockToDOM"></div><p>Using Bayes&#8217; rule (Pr[A | B] = Pr[A and B] / Pr[B]), we include this prior over trajectories:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;q^*(&#964;) &#8733; p_0(&#964;) \\exp\\left(\\tfrac{1}{&#945;}\\sum_t r(s_t, a_t)\\right)&quot;,&quot;id&quot;:&quot;HLUARWEYUI&quot;}" data-component-name="LatexBlockToDOM"></div><h3><strong>Properties of the Optimal Policy</strong></h3><p>So we&#8217;ve solved for the optimal trajectory distribution q*. But we can only control the policy &#960;<sub>&#952;</sub>(a|s), which induces its own trajectory distribution through the environment dynamics:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p_&#960;(&#964;) = p(s_0)\\prod_t &#960;(a_t|s_t)p(s_{t+1}|s_t,a_t)&quot;,&quot;id&quot;:&quot;ZGYSIUCAGQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>The question becomes: which policy-induced trajectory distribution p<sub>&#960;</sub>(&#964;) is closest to q*(&#964;)?
This leads naturally to minimizing the KL divergence:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&#960;^* = \\arg\\min_{&#960;} D_{KL}(p_&#960;(&#964;) || q^*(&#964;))&quot;,&quot;id&quot;:&quot;OPXTXPGEYC&quot;}" data-component-name="LatexBlockToDOM"></div><p>Expanding:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\\pi^* \n&amp;= \\arg\\min_\\pi \\mathbb{E}_{p_\\pi}\\!\\left[\\log p_\\pi(\\tau) - \\log q^*(\\tau)\\right] \\\\[4pt]\n&amp;= \\arg\\min_\\pi \\mathbb{E}_{p_\\pi}\\!\\left[\n    \\log p_\\pi(\\tau) \n    - \\log p_0(\\tau)\n    - \\tfrac{1}{\\alpha} \\sum_t r(s_t, a_t)\n\\right]\n\\end{align}\n&quot;,&quot;id&quot;:&quot;BWRFMTLQVC&quot;}" data-component-name="LatexBlockToDOM"></div><p>Since log p_&#960;(&#964;) &#8722; log p&#8320;(&#964;) = &#8721;&#8348; log &#960;(a_t|s_t), this simplifies to:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&#960;^* = \\arg\\min_&#960; &#120124;_{p_&#960;}\\sum_t [\\log &#960;(a_t|s_t) - \\tfrac{1}{&#945;}r(s_t,a_t)]&quot;,&quot;id&quot;:&quot;VYSLETZGGA&quot;}" data-component-name="LatexBlockToDOM"></div><p>Rewriting as a maximization gives:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&#960;^* = \\arg\\max_&#960; &#120124;_{p_&#960;}\\sum_t [r(s_t,a_t) + &#945; H(&#960;(&#183;|s_t))]&quot;,&quot;id&quot;:&quot;YRWUWGQMFN&quot;}" data-component-name="LatexBlockToDOM"></div><p>since &#945; is a constant so multiplying by it doesn&#8217;t change the outcome of the argmax. Thus, SAC explicitly maximizes both expected reward and policy entropy, balancing exploitation with exploration in a single unified framework.</p><h3><strong>State Value and Q-Function Derivation</strong></h3><p>To implement this, we&#8217;re going to need expressions for both state values and Q-values. 
The state value function is just the objective above, with discounting:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V^*(s) = \\max_&#960; &#120124;_{&#964;|s_0=s}\\left[\\sum_t &#947;^t(r(s_t,a_t) + &#945; H(&#960;(&#183;|s_t)))\\right]&quot;,&quot;id&quot;:&quot;LKQFJKHVUK&quot;}" data-component-name="LatexBlockToDOM"></div><p>By the Bellman equations:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V^*(s) = \\max_&#960; &#120124;_{a,s&#8217;}[r(s,a) + &#945; H(&#960;(&#183;|s)) + &#947; V^*(s&#8217;)]&quot;,&quot;id&quot;:&quot;CGOJYATQGL&quot;}" data-component-name="LatexBlockToDOM"></div><p>and the corresponding Q-function:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q^*(s,a) = r(s,a) + &#947; &#120124;_{s&#8217;}[V^*(s&#8217;)]&quot;,&quot;id&quot;:&quot;CQLYELYTRA&quot;}" data-component-name="LatexBlockToDOM"></div><p>We can rewrite V* in terms of Q*:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V^*(s) = \\max_&#960; &#120124;_{a}[Q^*(s,a) + &#945; H(&#960;(&#183;|s))]&quot;,&quot;id&quot;:&quot;LVFVKICOTG&quot;}" data-component-name="LatexBlockToDOM"></div><p>This form is friendlier because now we&#8217;re only taking an expectation over the action a. That expectation is equal to:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V^*(s) = \\max_&#960; \\sum_a &#960;(a|s)[Q^*(s,a) - &#945; \\log &#960;(a|s)], \\quad \\text{s.t.}\\ \\sum_a &#960;(a|s)=1&quot;,&quot;id&quot;:&quot;IABWTSOFIN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Let&#8217;s try to solve for that V*. 
We write the Lagrangian and set the derivative to 0:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L(&#960;, &#955;) = \\sum_a &#960;(a|s)[Q^*(s,a) - &#945; \\log &#960;(a|s)] + &#955;(\\sum_a &#960;(a|s) - 1)&quot;,&quot;id&quot;:&quot;UVPQFOHLQO&quot;}" data-component-name="LatexBlockToDOM"></div><p>gives:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q^*(s,a) - &#945; \\log &#960;^*(a|s) - &#945; + &#955; = 0&quot;,&quot;id&quot;:&quot;DRRZHKMFNU&quot;}" data-component-name="LatexBlockToDOM"></div><p>Therefore:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\pi^*(a \\mid s) \\propto \\exp\\!\\left(\\frac{Q^*(s,a)}{\\alpha}\\right)\n&quot;,&quot;id&quot;:&quot;VFXTIDWYJB&quot;}" data-component-name="LatexBlockToDOM"></div><p>Normalizing yields the softmax policy:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\pi^*(a \\mid s)\n= \\frac{\\exp\\!\\left(\\tfrac{Q^*(s,a)}{\\alpha}\\right)}\n       {\\sum_{a'} \\exp\\!\\left(\\tfrac{Q^*(s,a')}{\\alpha}\\right)}\n&quot;,&quot;id&quot;:&quot;VMEKRYPGJL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now we can substitute back into V:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V^*(s)\n= \\alpha \\log \\sum_a \\exp\\!\\left(\\frac{Q^*(s,a)}{\\alpha}\\right)&quot;,&quot;id&quot;:&quot;XNHMOMVQRL&quot;}" data-component-name="LatexBlockToDOM"></div><h3><strong>Soft Policy Improvement</strong></h3><p>Finally, we need a way to computationally solve for the optimal policy pi*. We could theoretically fit a network to the definition of the optimal pi* above. But in continuous spaces that denominator is intractable. 
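In discrete toy problems, though, the softmax policy and the log-sum-exp value above are cheap to compute and sanity-check. Here is a minimal sketch (the Q-values and temperature below are made up, and the helper name is mine, not from the post's codebase):

```python
import math

def soft_value_and_policy(q_values, alpha):
    """Compute V*(s) = alpha * log sum_a exp(Q(s,a)/alpha) and the softmax
    policy pi*(a|s) proportional to exp(Q(s,a)/alpha) for one state.

    Subtracting the max before exponentiating is the standard log-sum-exp
    trick, which keeps the computation numerically stable for small alpha.
    """
    scaled = [q / alpha for q in q_values]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    value = alpha * (m + math.log(z))      # V*(s) = alpha * (m + log z)
    policy = [e / z for e in exps]         # normalized softmax over actions
    return value, policy

# Toy check: three actions with made-up Q-values
v, pi = soft_value_and_policy([1.0, 2.0, 0.5], alpha=0.5)
```

As &#945; shrinks toward 0, the policy concentrates on the argmax action and V* approaches max_a Q(s,a); as &#945; grows, the policy flattens toward uniform, which is exactly the exploration&#8211;exploitation dial the temperature controls.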
Instead, we define a &#8220;soft policy improvement&#8221; operator:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;&#960;_{new} = \\arg\\max_&#960; &#120124;_{s&#8764;D, a&#8764;&#960;}[Q_&#952;(s,a) + &#945; H(&#960;)]&quot;,&quot;id&quot;:&quot;NYPIHNVSOQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>This operator has the same fixed point as the optimal policy above. (You can try substituting it in.) In practice, we minimize its negative:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;J_&#960;(&#966;) = &#120124;_{a&#8764;&#960;}[&#945; \\log &#960;(a|s) - Q_&#952;(s,a)]&quot;,&quot;id&quot;:&quot;DQZRPZEXJO&quot;}" data-component-name="LatexBlockToDOM"></div><p>The critic still follows a soft Bellman target:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q^*(s,a) = r(s,a) + &#947; &#120124;_{s&#8217;}[V^*(s&#8217;)]&quot;,&quot;id&quot;:&quot;KMFYKLMNBR&quot;}" data-component-name="LatexBlockToDOM"></div><p>with:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V^*(s) = \\alpha \\log \\sum_a \\exp\\!\\left(\\frac{Q^*(s,a)}{\\alpha}\\right)\n&quot;,&quot;id&quot;:&quot;RQAXZYZCJM&quot;}" data-component-name="LatexBlockToDOM"></div><p>Finally:</p><ul><li><p>SAC uses double critics (like TD3) to mitigate bias, and</p></li><li><p>Updates the temperature &#945; automatically to maintain a target entropy.</p></li></ul><p>SAC&#8217;s soft Bellman operator is a &#947;-contraction in the tabular case, ensuring convergence under idealized assumptions.</p><h2><strong>Wrapping Up</strong></h2><p>That concludes our discussion of the off-policy methods. 
In the next and final post of the series, I&#8217;ll cover incorporating human feedback, which is relevant in post-training LLMs: DPO and GRPO.</p>]]></content:encoded></item><item><title><![CDATA[What You Didn’t Learn in Berkeley CS 188 — Part 2]]></title><description><![CDATA[Implementing the policy gradient methods: REINFORCE, A2C, TRPO, PPO.]]></description><link>https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley-b29</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley-b29</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Tue, 07 Oct 2025 03:15:59 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/85acc66d-d37e-493a-b067-4f456aa45297_800x391.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my last post, I covered classical reinforcement learning methods. Some of these appeared in CS 188, but not at the depth needed to understand why they work. In this post, I show how these basic methods can be rethought or extended to handle very large state spaces or continuous action spaces.</p><p>If you recall, Q-learning, value iteration, and other tabular methods require storing a full set of state&#8211;action values. The policy is implicitly a function of the Q-values: iterate over actions and pick the one that maximizes expected value.</p><p>Even in a continuous state space, the idea still applies. 
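As a quick refresher, the tabular version of that implicit policy, together with the Q-learning backup itself, fits in a few lines. This is a toy sketch with dictionary-based tables (not code from the course):

```python
def greedy_action(q_table, state):
    """The implicit policy of tabular methods: scan the stored Q-values
    for this state and pick the argmax action."""
    return max(q_table[state], key=q_table[state].get)

def q_learning_update(q_table, s, a, r, s_next, lr=0.1, gamma=0.99):
    """One tabular Q-learning backup toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(q_table[s_next].values())
    q_table[s][a] += lr * (target - q_table[s][a])
```

The point of what follows is that once the state space is continuous, the table `q_table[state]` has to become a function approximator.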
Define a parameterized Q-function Q<sub>&#952;</sub>(s,a) and an implicit greedy policy</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\pi(s) := \\arg\\max_a Q_\\theta(s,a)&quot;,&quot;id&quot;:&quot;GWLUPNUXWI&quot;}" data-component-name="LatexBlockToDOM"></div><p>Define a Bellman-type residual when you sample an action (a):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;J(\\theta) := \\big[r + \\gamma \\max_{a&#8217;} Q_\\theta(s&#8217;,a&#8217;)\\big] - Q_\\theta(s,a)&quot;,&quot;id&quot;:&quot;LFWHQGHGGY&quot;}" data-component-name="LatexBlockToDOM"></div><p>You can take gradient steps on Q to reduce this residual, typically by minimizing its square. In practice, the target term (r + &#947; max<sub>a&#8217;</sub>Q<sub>&#952;</sub>(s&#8217;,a&#8217;)) is computed with a stale copy &#952;<sup>-</sup> to reduce instability from target chasing due to stochastic rewards and transitions. This is a Deep Q-Network (DQN).</p><p>That works for discrete action spaces. In continuous action spaces, computing max<sub>a</sub>Q<sub>&#952;</sub>(s, a) is generally intractable. This motivates learning the policy directly rather than inferring it from Q-values. We introduce a policy over a continuous action space, that is, a probability density function. Let &#960;<sub>&#952;</sub>(a | s) be a policy parameterized by &#952;, for example the parameters of a Gaussian. If we can properly define a loss function, we can optimize &#952; using SGD or Adam.</p><p>Introducing the policy gradient methods. These are the methods you&#8217;ll often hear about if you scroll X. In this post, I implement a couple of these methods on the <em>Pendulum</em> environment.</p><p>Code: <a href="https://github.com/neelsomani/policy-gradient">https://github.com/neelsomani/policy-gradient</a></p><h2><strong>REINFORCE: Policy-Gradient Derivation</strong></h2><p>This is a common derivation which you can find in many places. 
Let:</p><ul><li><p>&#960;<sub>&#952;</sub> be the policy,</p></li><li><p>&#964; = [(s<sub>1</sub>, a<sub>1</sub>), &#8230;, (s<sub>n</sub>, a<sub>n</sub>)] be a trajectory,</p></li><li><p>G<sub>&#964;</sub> be the discounted return of &#964;,</p></li><li><p>&#981;(&#964;) = &#8719;<sub>t=1,&#8230;,n</sub>P(s<sub>t+1</sub> | s<sub>t</sub>, a<sub>t</sub>) &#8719;<sub>t=1,&#8230;,n</sub> &#960;<sub>&#952;</sub>(a<sub>t</sub> | s<sub>t</sub>) be the probability of &#964;.</p></li></ul><p>Then we can define our objective as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;J(\\theta) = \\mathbb{E}_\\tau[G_\\tau] = \\sum_\\tau \\phi(\\tau)G_\\tau&quot;,&quot;id&quot;:&quot;UPZODETXHD&quot;}" data-component-name="LatexBlockToDOM"></div><p>The gradient of the objective function is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\nabla_\\theta J(\\theta) = \\sum_\\tau \\Big(\\prod_{t=1}^{n} P(s_{t+1}\\mid s_t,a_t)\\Big) \\nabla_\\theta \\Big(\\prod_{t=1}^{n} \\pi_\\theta(a_t\\mid s_t)\\Big) G_\\tau.&quot;,&quot;id&quot;:&quot;FOGUHUIRQA&quot;}" data-component-name="LatexBlockToDOM"></div><p>But we don&#8217;t want to compute the product rule across &#8719;<sub>t=1,&#8230;,n</sub> &#960;<sub>&#952;</sub>(a<sub>t</sub> | s<sub>t</sub>). 
The classic way to get around that is using the log-trick, &#8711;f = f * &#8711;log(f):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n\\nabla_\\theta J(\\theta)\n&amp;= \\sum_{\\tau}\n   \\Bigg(\\prod_{t=1}^{n} P(s_{t+1}\\mid s_t,a_t)\\Bigg)\n   \\Bigg(\\prod_{t=1}^{n} \\pi_\\theta(a_t\\mid s_t)\\Bigg)\n   \\nabla_\\theta \\sum_{t=1}^{n} \\log \\pi_\\theta(a_t\\mid s_t)\\, G_\\tau \\\\\n&amp;= \\mathbb{E}_{\\tau}\\Bigg[ \\sum_{t=1}^{n} \\nabla_\\theta \\log \\pi_\\theta(a_t\\mid s_t)\\, G_\\tau \\Bigg]\n\\end{aligned}\n&quot;,&quot;id&quot;:&quot;KQRJKTTVWR&quot;}" data-component-name="LatexBlockToDOM"></div><p>Written as a single expectation over trajectories, this is the basic REINFORCE gradient estimator, which is unbiased but has high variance:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\nabla_\\theta J(\\theta) = \\mathbb{E}_\\tau \\Big[\\sum_{t=1}^{n} \\nabla_\\theta \\log \\pi_\\theta(a_t\\mid s_t) G_\\tau\\Big]&quot;,&quot;id&quot;:&quot;OCMFBQHCIV&quot;}" data-component-name="LatexBlockToDOM"></div><h3><strong>The Causality Argument</strong></h3><p>We now justify focusing on the return from time t onward.
First expand the trajectory expectation as a tower of expectations:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n\\mathbb{E}_{\\tau}[f(\\tau)]\n&amp;= \\sum_{\\tau}\n   \\Bigg(\\prod_{t=1}^{n} P(s_{t+1} \\mid s_t, a_t)\\Bigg)\n   \\Bigg(\\prod_{t=1}^{n} \\pi_\\theta(a_t \\mid s_t)\\Bigg)\n   f(\\tau) \\\\\n&amp;= \\mathbb{E}_{a_1 \\sim \\pi(\\cdot \\mid s_1)}\n   \\mathbb{E}_{s_2 \\sim P(\\cdot \\mid s_1, a_1)}\n   \\mathbb{E}_{a_2 \\sim \\pi(\\cdot \\mid s_2)} \\cdots\n   \\big[f(\\tau)\\big].\n\\end{aligned}&quot;,&quot;id&quot;:&quot;QBUBJIPYFS&quot;}" data-component-name="LatexBlockToDOM"></div><p>For any fixed t,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_\\theta\\left[\\log \\pi_\\theta(a_t\\mid s_t)G_\\tau\\right] = \\nabla_\\theta\\left[\\log \\pi_\\theta(a_t\\mid s_t) \\big(\\text{const} + \\sum_{k=t}^{n} \\gamma^{k-1} r_k\\big)\\right]&quot;,&quot;id&quot;:&quot;DFPJFFTQSV&quot;}" data-component-name="LatexBlockToDOM"></div><p>and the &#8220;const&#8221; term depends only on (s<sub>1</sub>, a<sub>1</sub>), &#8230;, (s<sub>t-1</sub>, a<sub>t-1</sub>). Taking the conditional expectation over a<sub>t</sub> ~ &#960;<sub>&#952;</sub>( . | s<sub>t</sub>) and using the log-trick in reverse,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbb{E}_{a_t}\\big[\\nabla_\\theta \\log \\pi_\\theta(a_t\\mid s_t)\\big] = \\nabla_\\theta \\sum_{a_t} \\pi_\\theta(a_t\\mid s_t) = \\nabla_\\theta 1 = 0&quot;,&quot;id&quot;:&quot;ZMCMGDVUCA&quot;}" data-component-name="LatexBlockToDOM"></div><p>so all terms prior to t vanish in expectation.
Define the Monte Carlo return from t:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; G_t := \\sum_{k=t}^{n} \\gamma^{k-1} r_k&quot;,&quot;id&quot;:&quot;DKTYZQEVUM&quot;}" data-component-name="LatexBlockToDOM"></div><p>then:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_\\theta J(\\theta) = \\mathbb{E}_\\tau\\Big[\\sum_{t=1}^{n} \\nabla_\\theta \\log \\pi_\\theta(a_t\\mid s_t) G_t\\Big]&quot;,&quot;id&quot;:&quot;ZQBYJZFNUF&quot;}" data-component-name="LatexBlockToDOM"></div><p>This argument is often called causality.</p><h3><strong>Implementing REINFORCE in PyTorch</strong></h3><p>In PyTorch, you don&#8217;t pass gradients directly. You define a loss built from PyTorch primitives. Anything that has parameters that you need to differentiate the loss with respect to must be written using PyTorch&#8217;s primitives. A common surrogate for the objective above is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; L(\\theta) := - \\sum_{t} \\log \\pi_\\theta(a_t\\mid s_t) G_t&quot;,&quot;id&quot;:&quot;VWKOCXPWIF&quot;}" data-component-name="LatexBlockToDOM"></div><p>which satisfies:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_\\theta L(\\theta) = -\\nabla_\\theta J(\\theta)&quot;,&quot;id&quot;:&quot;RFQHAJEPFD&quot;}" data-component-name="LatexBlockToDOM"></div><p>As long as we sample trajectories in an unbiased way, we are optimizing with respect to an unbiased estimate of &#8711;<sub>&#952;</sub>&#8203;J(&#952;).</p><p>How do we represent &#960;(a | s) in continuous action spaces? We could try to build a model that takes (s, a) and outputs a probability, but a pdf must be non-negative and integrate to 1. Neural nets output arbitrary real numbers. With a discrete action set we could normalize with a softmax, but that does not extend to a continuum of actions. 
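Before turning to continuous actions, the surrogate loss above is a one-liner once you have returns-to-go. A plain-Python sketch (the helper names are mine; a real implementation would keep the log-probabilities as differentiable PyTorch tensors, and this uses the common &#947;^(k&#8722;t) returns-to-go convention, which rescales each term of the post's &#947;^(k&#8722;1) version by a constant):

```python
def returns_to_go(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed in one backward pass."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def reinforce_loss(log_probs, rewards, gamma):
    """Surrogate L(theta) = -sum_t log pi(a_t|s_t) * G_t.

    Minimizing this with autograd takes a step along the (estimated)
    policy gradient, since grad L = -grad J."""
    gs = returns_to_go(rewards, gamma)
    return -sum(lp * g for lp, g in zip(log_probs, gs))
```

In training code, `log_probs` would come from the policy network's distribution for the sampled actions, so that backpropagating through the loss updates &#952;.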
Instead, we make the network output the parameters of a distribution, for example a Gaussian with mean &#956;<sub>&#952;</sub>&#8203;&#8203;(s) and scale &#963;<sub>&#952;</sub>&#8203;&#8203;(s), then sample from it.</p><p>For <em>Pendulum</em>, actions lie in (-2, 2). One method to output within those bounds: A tanh head gives (-1, 1), which we scale to (-2, 2).</p><p>We also need &#963; &gt; 0. Rather than predict &#963; directly, predict log(&#963;) and map it with exponentiation or softplus.</p><p>There are a ton of tricks like this to enforce bounds.</p><ul><li><p>Just use the raw head if you want to output across all of &#8477;</p></li><li><p>tanh or sigmoid if you want to keep it within a range and ensure it&#8217;s differentiable</p></li><li><p>Clip if you don&#8217;t care if it&#8217;s differentiable outside the range (common for log(&#963;))</p></li><li><p>Exponentiate or softplus to make it (0, inf)</p></li></ul><p>Typically in PyTorch, the module&#8217;s forward method returns deterministic parameters (&#956;, &#963;), and sampling happens in a separate method.</p><p>Still, even with a correct REINFORCE, convergence can be slow.</p><h2><strong>Baselines and the justification for A2C</strong></h2><p>From the original REINFORCE paper, subtracting a baseline B(s<sub>t</sub>) leaves the gradient unbiased:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_\\theta J(\\theta) = \\mathbb{E}_\\tau \\Big[\\sum_t \\nabla_\\theta \\log \\pi_\\theta(a_t\\mid s_t) (G_t - B(s_t))\\Big]&quot;,&quot;id&quot;:&quot;ZYWXLTFVJC&quot;}" data-component-name="LatexBlockToDOM"></div><p>Proof:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbb{E}_\\tau[\\nabla_\\theta \\log \\pi_\\theta(a_t\\mid s_t) B(s_t)] = \\mathbb{E}_{s_t}\\left[ B(s_t) \\mathbb{E}_{a_t\\sim \\pi}[\\nabla_\\theta \\log \\pi_\\theta(a_t\\mid s_t)] \\right] = 0&quot;,&quot;id&quot;:&quot;VGIYWGFVXF&quot;}" data-component-name="LatexBlockToDOM"></div><p>by the same reasoning as the causality argument above.</p><p>Baselines can reduce the variance of the gradient computation. Choosing B(s<sub>t</sub>) = V<sup>&#960;</sup>(s<sub>t</sub>) yields:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_\\theta J(\\theta) = \\mathbb{E}_\\tau \\Big[\\sum_t \\nabla_\\theta \\log \\pi_\\theta(a_t\\mid s_t) (G_t - V^\\pi(s_t))\\Big]&quot;,&quot;id&quot;:&quot;WTZQGKSNET&quot;}" data-component-name="LatexBlockToDOM"></div><p>where A(s, a) = Q(s,a) - V<sup>&#960;</sup>(s) is called the &#8220;advantage&#8221;. Estimating V<sup>&#960;</sup> with another model, called a critic network, gives the actor&#8211;critic framework.</p><p>Practical notes from my final implementation for <em>Pendulum</em>:</p><ul><li><p>Full Monte Carlo returns had too much variance, so I used TD(0) targets for the critic.</p></li><li><p>I also found the algorithm was highly sensitive to &#947;. &#947;=0.99 did not converge, while &#947;=0.9 did. The learning rate for the optimizer barely mattered.</p></li><li><p>Finally, the log standard deviation was not learning properly, and the recommendation in this <a href="https://colab.research.google.com/github/MrSyee/pg-is-all-you-need/blob/master/01.A2C.ipynb">notebook</a> helped by using a softplus stabilization.</p></li></ul><h2><strong>From TRPO to PPO</strong></h2><h3><strong>Motivation: Reusing Data</strong></h3><p>Suppose you compute a batch of trajectories under &#960;<sub>old</sub> to estimate the policy gradient:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\nabla_\\theta J(\\theta) = \\mathbb{E}_\\tau\\Big[\\sum_t \\nabla_\\theta \\log \\pi_\\theta(a_t\\mid s_t) G_t\\Big]&quot;,&quot;id&quot;:&quot;JXYRCCTQJC&quot;}" data-component-name="LatexBlockToDOM"></div><p>All of that work gives you a single update to &#952;. After you update, the batch is no longer on-policy.
To reuse the data, you would need to reweight old observations so that expectations match those under &#960;<sub>new</sub>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\mathbb{E}_{\\tau\\sim \\pi_{\\text{new}}}[G_\\tau] = \\mathbb{E}_{\\tau\\sim \\pi_{\\text{old}}}\\Big[\\Big(\\prod_t \\frac{\\pi_{\\text{new}}(a_t\\mid s_t)}{\\pi_{\\text{old}}(a_t\\mid s_t)}\\Big) G_\\tau\\Big]&quot;,&quot;id&quot;:&quot;OZOREQGOIH&quot;}" data-component-name="LatexBlockToDOM"></div><p>but the product of ratios has high variance.</p><h3><strong>Performance Difference Lemma</strong></h3><p>To simplify things, we&#8217;re going to define the &#8220;discounted state visitation&#8221; distribution:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;d^\\pi(s) = (1-\\gamma) \\sum_{t=0}^\\infty \\gamma^t \\Pr(s_t=s \\mid \\pi)&quot;,&quot;id&quot;:&quot;IRJFKGRQLP&quot;}" data-component-name="LatexBlockToDOM"></div><p>Then, as we&#8217;ll prove in this section, here&#8217;s what&#8217;s called the &#8220;performance difference lemma&#8221;:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;J(\\pi_{\\text{new}}) - J(\\pi_{\\text{old}})\n= \\frac{1}{1 - \\gamma} \\,\n\\mathbb{E}_{s \\sim d^{\\pi_{\\text{new}}},\\, a \\sim \\pi_{\\text{new}}}\n\\big[ A^{\\pi_{\\text{old}}}(s, a) \\big].&quot;,&quot;id&quot;:&quot;GSNVOTUYWR&quot;}" data-component-name="LatexBlockToDOM"></div><p>Notice you&#8217;re taking an expectation over &#960;<sub>new</sub>, but you&#8217;re computing the advantages based on &#960;<sub>old</sub>.</p><p>First, note that:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;J(\\pi) = \\mathbb{E}_{s_0}\\big[ V^{\\pi}(s_0) \\big]&quot;,&quot;id&quot;:&quot;VNNYBRQOTM&quot;}" data-component-name="LatexBlockToDOM"></div><p>Then:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\nJ(\\pi_{\\text{new}}) - 
J(\\pi_{\\text{old}})\n&amp;= \\mathbb{E}_{s_0}\\!\\left[ V^{\\pi_{\\text{new}}}(s_0) - V^{\\pi_{\\text{old}}}(s_0) \\right] \\\\\n&amp;= \\mathbb{E}_{s_0,\\,a_0 \\sim \\pi_{\\text{new}}(\\cdot \\mid s_0)}\\!\n   \\left[ r(s_0,a_0) + \\gamma \\,\\mathbb{E}_{s_1 \\sim P(\\cdot \\mid s_0,a_0)} \\big[ V^{\\pi_{\\text{new}}}(s_1) \\big]\n          - V^{\\pi_{\\text{old}}}(s_0) \\right]\n\\end{aligned}&quot;,&quot;id&quot;:&quot;GPLHXBUAUW&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now, we&#8217;ll use an add &amp; subtract trick to expose the advantage within the expectation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n&amp;= \\mathbb{E}_{s_0,\\,a_0 \\sim \\pi_{\\text{new}}}\\!\n   \\left[\n      \\underbrace{ r(s_0,a_0) + \\gamma \\,\\mathbb{E}_{s_1}[ V^{\\pi_{\\text{old}}}(s_1) ] - V^{\\pi_{\\text{old}}}(s_0) }_{=\\,A^{\\pi_{\\text{old}}}(s_0,a_0)}\n   \\right] \\\\\n&amp;\\qquad + \\gamma \\,\\mathbb{E}_{s_0,\\,a_0 \\sim \\pi_{\\text{new}},\\,s_1 \\sim P(\\cdot \\mid s_0,a_0)}\n   \\left[ V^{\\pi_{\\text{new}}}(s_1) - V^{\\pi_{\\text{old}}}(s_1) \\right] \\\\\n&amp;= \\mathbb{E}_{s_0,\\,a_0 \\sim \\pi_{\\text{new}}}\\!\\left[ A^{\\pi_{\\text{old}}}(s_0,a_0) \\right]\n   \\;+\\; \\gamma \\,\\mathbb{E}_{s_1,\\,a_1 \\sim \\pi_{\\text{new}}}\\!\\left[ V^{\\pi_{\\text{new}}}(s_1) - V^{\\pi_{\\text{old}}}(s_1) \\right]\n\\end{aligned}&quot;,&quot;id&quot;:&quot;WTWQLPGOLL&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is useful because we can write:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; J(\\theta_{\\text{new}}) = J(\\theta_{\\text{old}}) + \\frac{1}{1-\\gamma} \\mathbb{E}_{s\\sim d^{\\pi_{\\text{new}}}, a\\sim \\pi_{\\text{new}}}\\big[A^{\\pi_{\\text{old}}}(s, a)\\big]&quot;,&quot;id&quot;:&quot;HWCCPQMMNM&quot;}" data-component-name="LatexBlockToDOM"></div><p>Maximizing J(&#952;<sub>new</sub>) amounts to maximizing:</p><div class="latex-rendered"
data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbb{E}_{s\\sim d^{\\pi_{\\text{new}}}, a\\sim \\pi_{\\text{new}}}[A^{\\pi_{\\text{old}}}(s,a)]&quot;,&quot;id&quot;:&quot;BODYDKXEJQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>since the first term is a constant and 1/(1-&#947;) is a scale. The issue is that this expectation relies on &#960;<sub>new</sub> (in the distributions of both the actions and the states), which would require resampling.</p><p>In theory we could importance weight both the state distribution and the action distribution:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\mathbb{E}_{s\\sim d^{\\pi_{\\text{new}}}, a\\sim \\pi_{\\text{new}}}[A^{\\pi_{\\text{old}}}(s,a)] = \\mathbb{E}_{s\\sim d^{\\pi_{\\text{old}}}, a\\sim \\pi_{\\text{old}}} \\left[\\frac{d^{\\pi_{\\text{new}}}(s)}{d^{\\pi_{\\text{old}}}(s)} \\cdot \\frac{\\pi_{\\text{new}}(a\\mid s)}{\\pi_{\\text{old}}(a\\mid s)} \\cdot A^{\\pi_{\\text{old}}}(s,a)\\right]&quot;,&quot;id&quot;:&quot;MBIOOQCNTS&quot;}" data-component-name="LatexBlockToDOM"></div><p>We do not have access to d<sup>&#960;</sup>. Instead, TRPO assumes:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;d^{\\pi_{\\text{new}}} \\sim d^{\\pi_{\\text{old}}}&quot;,&quot;id&quot;:&quot;HLMBTCODLU&quot;}" data-component-name="LatexBlockToDOM"></div><p>While we cannot enforce this directly, we can bound |d<sup>&#960;_new</sup> - d<sup>&#960;_old</sup>| by controlling a policy divergence. We bound the following expression:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\max_s D_{\\mathrm{KL}}(\\pi_{\\text{old}}(\\cdot\\mid s)|\\pi_{\\text{new}}(\\cdot\\mid s))&quot;,&quot;id&quot;:&quot;NICOJAQFLT&quot;}" data-component-name="LatexBlockToDOM"></div><p>which allows the authors to get a lower bound on J(&#960;<sub>new</sub>). 
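For discrete action distributions, the per-state KL term being bounded here is straightforward to compute directly; a minimal sketch (toy distributions, not from the post):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_a p(a) * log(p(a) / q(a)) for two discrete
    action distributions over the same support; terms with p(a) = 0
    contribute nothing."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0.0)
```

Note the asymmetry: D_KL(p || q) generally differs from D_KL(q || p), which is why the direction of the divergence in the trust-region constraint matters.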
In practice we constrain the expected KL under d<sup>&#960;_old</sup>, which is tractable.</p><p>With the state distribution approximated as unchanged, the remaining scaling &#960;<sub>new</sub>/&#960;<sub>old </sub>is called &#8220;importance sampling.&#8221; TRPO&#8217;s final surrogate and constraint become:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\max_{\\theta} \\;\n\\mathbb{E}_{s \\sim d^{\\pi_{\\text{old}}},\\, a \\sim \\pi_{\\text{old}}}\n\\!\\left[\n\\frac{\\pi_{\\theta}(a \\mid s)}{\\pi_{\\text{old}}(a \\mid s)} \\,\nA^{\\pi_{\\text{old}}}(s,a)\n\\right]\n\\quad\n\\text{s.t.} \\quad\n\\mathbb{E}_{s \\sim d^{\\pi_{\\text{old}}}}\n\\!\\big[\nD_{\\mathrm{KL}}\\big(\n\\pi_{\\text{old}}(\\cdot \\mid s)\n\\;\\|\\;\n\\pi_{\\theta}(\\cdot \\mid s)\n\\big)\n\\big]\n\\;\\le\\;\n\\delta&quot;,&quot;id&quot;:&quot;UBMADYAUPI&quot;}" data-component-name="LatexBlockToDOM"></div><h2><strong>Proximal Policy Optimization (PPO)</strong></h2><h3><strong>PPO-Penalty</strong></h3><p>The first variant of PPO comes directly from the Lagrangian of the TRPO objective with &#955; &gt; 0:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}(\\theta,\\lambda) = \\mathbb{E}\\left[\\frac{\\pi_\\theta}{\\pi_{\\text{old}}} * A\\right] - \\lambda\\Big(\\mathbb{E}[D_{\\mathrm{KL}}(\\pi_{\\text{old}}|\\pi_\\theta)] - \\delta\\Big).&quot;,&quot;id&quot;:&quot;ZOLXZMBJZL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Dropping the constant &#955; * &#948; and defining &#946; := &#955;, we arrive at the standard form:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\max_\\theta \\mathbb{E}\\left[\\frac{\\pi_\\theta}{\\pi_{\\text{old}}} * A\\right] - \\beta *  \\mathbb{E}\\left[D_{\\mathrm{KL}}(\\pi_{\\text{old}}|\\pi_\\theta)\\right]&quot;,&quot;id&quot;:&quot;CZVQCXTHSM&quot;}" data-component-name="LatexBlockToDOM"></div><p>In practice, &#946; is adapted so the 
empirical KL stays close to &#948;.</p><h3><strong>PPO-Clip</strong></h3><p>PPO-Clip takes a slightly different approach to staying within the trust region. Consider the importance ratio for a single sample:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; r_t(\\theta) := \\frac{\\pi_\\theta(a_t\\mid s_t)}{\\pi_{\\text{old}}(a_t\\mid s_t)}&quot;,&quot;id&quot;:&quot;PWXCOCJMNN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Instead of constraining the mean KL, we can remove the incentive for &#960;<sub>new</sub> to deviate wildly from &#960;<sub>old</sub> in the first place. Intuitively, if &#960;<sub>new</sub> stays close to &#960;<sub>old</sub> at every sample, a bound on the KL divergence follows. If we could enforce per-sample constraints, we would maximize:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;r_t A_t \\quad\\text{s.t.}\\quad 1-\\varepsilon \\le r_t \\le 1+\\varepsilon&quot;,&quot;id&quot;:&quot;LEQVLXNZEM&quot;}" data-component-name="LatexBlockToDOM"></div><p>But it&#8217;s hard to jointly impose that many constraints over a single &#952;. Instead, PPO modifies the <em>objective</em> so there is no incentive to push r<sub>t</sub> outside the interval.
A naive attempt is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbb{E}\\big[\\mathrm{clip}(r_t,1-\\varepsilon,1+\\varepsilon) * A_t\\big]&quot;,&quot;id&quot;:&quot;SJJGMFWZPD&quot;}" data-component-name="LatexBlockToDOM"></div><p>But this surrogate can overestimate the true objective in two cases:</p><ul><li><p>A<sub>t </sub>&lt; 0 and r<sub>t </sub>&gt; 1 + &#949; where the penalty is capped at (1 + &#949;) * A<sub>t</sub> but should be more negative, and</p></li><li><p>A<sub>t </sub>&gt; 0 and r<sub>t </sub>&lt; 1 - &#949; where the reward should be smaller than (1 - &#949;) * A<sub>t</sub>.</p></li></ul><p>We need the surrogate to only underestimate the true objective, because that ensures that maximizing the surrogate also maximizes a lower bound on the objective. The conservative surrogate fixes both by lower bounding the unclipped objective:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; L_{\\text{clip}}(\\theta) = \\mathbb{E}\\big[\\min\\big(r_t(\\theta)A_t, \\mathrm{clip}(r_t(\\theta),1-\\varepsilon,1+\\varepsilon) * A_t\\big)\\big]&quot;,&quot;id&quot;:&quot;GSPVMIUYVG&quot;}" data-component-name="LatexBlockToDOM"></div><p>This discourages large deviations from &#960;<sub>old</sub>. In practice we also track the mean KL over the batch and stop early if it exceeds the target &#948;. And that&#8217;s the second variant of PPO.</p><h2><strong>Scaling</strong></h2><p>The policies above assume trajectories are sampled on-policy from the current &#960;. At scale, actors may be lagged or the data may be offline.</p><p>In future posts, I plan to cover off-policy methods such as DDPG, TD3, and SAC. 
I also plan to write a primer on incorporating human feedback using GRPO and non-RL approaches like DPO.</p><p>If you liked this material and want a reference for these algorithms and more, I recommend: <a href="https://lilianweng.github.io/posts/2018-04-08-policy-gradient/">Lilian Weng&#8217;s overview of policy gradients</a></p>]]></content:encoded></item><item><title><![CDATA[What You Didn’t Learn in Berkeley CS 188 — Part 1]]></title><description><![CDATA[Why isn&#8217;t there model-free policy iteration?]]></description><link>https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Sat, 04 Oct 2025 02:00:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rxn4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98848110-b17b-4c6b-9727-4416a6797ff7_2531x1580.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Berkeley&#8217;s CS 188 covers many important foundations of reinforcement learning. But there&#8217;s still a gap between what&#8217;s taught in that undergraduate course and the baseline expected if you&#8217;re working in the field.</p><p><a href="https://inst.eecs.berkeley.edu/~cs188/su24/">Berkeley&#8217;s course</a> covers, in no particular order:</p><ul><li><p>Basic search algorithms</p></li><li><p>Constraint satisfaction problems (CSPs)</p></li><li><p>Minimax (with alpha-beta pruning)</p></li><li><p>Bayes nets</p></li><li><p>Markov Decision Process (MDP) definition</p></li><li><p>Policy iteration</p></li><li><p>Q-learning (and some variations)</p></li></ul><p>This material is foundational, but the way it&#8217;s taught often feels fragmented. My goal here is to reorganize the basics into a clearer ontology that naturally sets up modern, continuous-control methods. 
The information hierarchy, I&#8217;d argue, could be sharper than what&#8217;s presented in CS 188.</p><p>This post is the first part in a series on reinforcement learning. Later posts will cover <a href="https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley-b29">continuous control</a>, <a href="https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley-9b3">off-policy methods</a>, and <a href="https://www.neelsomaniblog.com/p/what-you-didnt-learn-in-berkeley-242">RL for post-training</a>.</p><h2>CS 188 Recap: Markov Decision Process Definition</h2><p>If you already remember the basics from 188, you can skip this section. Reinforcement learning is typically formalized as a Markov Decision Process (MDP). The MDP specifies the environment:</p><ul><li><p><strong>States S</strong>: possible configurations of the world.</p></li><li><p><strong>Actions A</strong>: moves the agent can take.</p></li><li><p><strong>Transitions P(s&#8217; | s, a)</strong>: probability of landing in s&#8217; after taking action a in s.</p></li><li><p><strong>Rewards R(s, a, s&#8217;)</strong>: immediate payoff for (s, a) &#8594; s&#8217;.</p></li><li><p><strong>Discount &#947; &#8712; [0, 1)</strong>: how much you value the future.</p></li></ul><p>Those five define the problem itself. On the agent side, we define constructs that depend on the MDP:</p><ul><li><p><strong>Policy &#960;</strong>: a mapping from states to actions.</p></li><li><p><strong>Value function V<sup>&#960;</sup>(s)</strong>: expected discounted return from state s under &#960;.</p></li><li><p><strong>Q-function Q<sup>&#960;</sup>(s, a)</strong>: expected return from (s, a) under &#960;. If no &#960; is specified, then Q refers to the Q-value estimates that we have established so far.</p></li></ul><h2>A Clearer Ontology</h2><p>CS 188 distinguishes &#8220;model-based&#8221; vs. 
&#8220;model-free&#8221; methods:</p><ul><li><p><strong>Model-based</strong>: assumes access to the transition probabilities and rewards (P and R).</p></li><li><p><strong>Model-free</strong>: learns from sampled experience without ever observing P or R directly.</p></li></ul><p>Another useful axis is &#8220;policy-based&#8221; vs. &#8220;value-based&#8221;:</p><ul><li><p><strong>Policy-based</strong>: directly solve for the optimal policy, then improve the value estimates by following that policy, repeating until convergence. (Many methods also use value estimates as baselines or for other purposes, e.g. actor-critic.)</p></li><li><p><strong>Value-based</strong>: solve V or Q directly until convergence, using the greedy policy that maximizes the expected value of the next state:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\pi(s) := \\arg\\max_a Q(s,a) = \\sum_{s&#8217;} P(s&#8217; \\mid s,a) *\\big(R(s,a,s&#8217;) + \\gamma V(s&#8217;)\\big)&quot;,&quot;id&quot;:&quot;GFBHAIJXZO&quot;}" data-component-name="LatexBlockToDOM"></div><p>So we have two orthogonal axes: model-based vs. model-free, and value-based vs. policy-based. 
Together they give us a 2-by-2 view of classical RL methods:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rxn4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98848110-b17b-4c6b-9727-4416a6797ff7_2531x1580.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rxn4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98848110-b17b-4c6b-9727-4416a6797ff7_2531x1580.jpeg 424w, https://substackcdn.com/image/fetch/$s_!rxn4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98848110-b17b-4c6b-9727-4416a6797ff7_2531x1580.jpeg 848w, https://substackcdn.com/image/fetch/$s_!rxn4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98848110-b17b-4c6b-9727-4416a6797ff7_2531x1580.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!rxn4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98848110-b17b-4c6b-9727-4416a6797ff7_2531x1580.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rxn4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98848110-b17b-4c6b-9727-4416a6797ff7_2531x1580.jpeg" width="1456" height="909" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98848110-b17b-4c6b-9727-4416a6797ff7_2531x1580.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:909,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:159911,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.neelsomaniblog.com/i/175240248?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98848110-b17b-4c6b-9727-4416a6797ff7_2531x1580.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rxn4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98848110-b17b-4c6b-9727-4416a6797ff7_2531x1580.jpeg 424w, https://substackcdn.com/image/fetch/$s_!rxn4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98848110-b17b-4c6b-9727-4416a6797ff7_2531x1580.jpeg 848w, https://substackcdn.com/image/fetch/$s_!rxn4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98848110-b17b-4c6b-9727-4416a6797ff7_2531x1580.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!rxn4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98848110-b17b-4c6b-9727-4416a6797ff7_2531x1580.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is the ontology I propose. The last quadrant, model-free policy iteration, is the most interesting, and we&#8217;ll work our way toward it.</p><h2>Value Iteration</h2><p>Value iteration iteratively updates the value function until convergence. When you know P and R, the &#8220;Bellman optimality update&#8221; is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V(s) \\leftarrow \\max_a \\sum_{s&#8217;} P(s&#8217; \\mid s,a) * \\big(R(s,a,s&#8217;) + \\gamma V(s&#8217;)\\big)&quot;,&quot;id&quot;:&quot;FBFCIHPTCD&quot;}" data-component-name="LatexBlockToDOM"></div><p>In other words, the value of a state is the maximum expected reward + discounted value of the next state. It&#8217;s implicitly summing an infinite discounted series. 
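To make the backup concrete, here is a minimal sketch of tabular value iteration in Python. The two-state, two-action MDP (and all of its transitions and rewards) is a toy example of my own, not one from the course:

```python
# Tabular value iteration on a hypothetical two-state, two-action MDP.
# P[s][a] maps next states to probabilities; R[s][a][s2] is the reward.
# Action 1 jumps to state 1 (reward 1 from state 0, reward 2 from state 1);
# action 0 returns to state 0 with reward 0.
GAMMA = 0.9

P = {0: {0: {0: 1.0}, 1: {1: 1.0}},
     1: {0: {0: 1.0}, 1: {1: 1.0}}}
R = {0: {0: {0: 0.0}, 1: {1: 1.0}},
     1: {0: {0: 0.0}, 1: {1: 2.0}}}

def bellman_update(V):
    # V(s) <- max_a sum_s' P(s'|s,a) * (R(s,a,s') + GAMMA * V(s'))
    return {s: max(sum(p * (R[s][a][s2] + GAMMA * V[s2])
                       for s2, p in P[s][a].items())
                   for a in P[s])
            for s in P}

V = {s: 0.0 for s in P}
for _ in range(1000):
    V = bellman_update(V)
```

With gamma = 0.9, V converges to V(0) = 19 and V(1) = 20: the self-loop in state 1 earns reward 2 forever, summing to 2 / (1 - 0.9) = 20.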
If V* is the fixed point of the update operator above, then:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V^*(s) = \\mathbb{E}[R(s,a,s')] + \\gamma \\mathbb{E}[R(s',a',s'')] + \\gamma^2 \\mathbb{E}[R(s'',a'',s''')] + \\cdots&quot;,&quot;id&quot;:&quot;APLJIKBTJP&quot;}" data-component-name="LatexBlockToDOM"></div><h3>Bellman Operator and Contraction</h3><p>It makes sense that if we can solve for the fixed point V*, then we can define the optimal policy by greedily following whichever action maximizes the expected value of the next state. But how do we prove that the iterative process above actually converges?</p><p>To do so, we define the &#8220;Bellman optimality operator&#8221; (T) - the new value function if you perform a single iteration above:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(TV)(s) = \\max_a \\sum_{s'} P(s' \\mid s,a) * \\big(R(s,a,s') + \\gamma V(s')\\big).&quot;,&quot;id&quot;:&quot;RLZAUTZYBM&quot;}" data-component-name="LatexBlockToDOM"></div><p>Then for any two value functions V and W:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n|(TV)(s) - (TW)(s)|\n&amp;= \\Big| \\max_{a} \\mathbb{E}_{s'}[R(s,a,s') + \\gamma V(s')] \n    - \\max_{a} \\mathbb{E}_{s'}[R(s,a,s') + \\gamma W(s')] \\Big| \\\\\n&amp;\\leq \\max_{a} \\Big| \\mathbb{E}_{s'}[R(s,a,s') + \\gamma V(s')] \n                 - \\mathbb{E}_{s'}[R(s,a,s') + \\gamma W(s')] \\Big| \\\\\n&amp;= \\max_{a} \\Big| \\mathbb{E}_{s'}[\\gamma (V(s') - W(s'))] \\Big| \\\\\n&amp;\\leq \\gamma \\max_{a} \\mathbb{E}_{s'}\\big[|V(s') - W(s')|\\big] \\\\\n&amp;\\leq \\gamma \\max_{s'} |V(s') - W(s')| \\\\\n&amp;= \\gamma \\, \\|V - W\\|_\\infty.\n\\end{align*}&quot;,&quot;id&quot;:&quot;QQNCWUATZY&quot;}" data-component-name="LatexBlockToDOM"></div><p>And since we didn&#8217;t specify which s:</p><div class="latex-rendered"
data-attrs="{&quot;persistentExpression&quot;:&quot;\\quad \\|TV - TW\\|_\\infty \\;\\leq\\; \\gamma \\, \\|V - W\\|_\\infty.&quot;,&quot;id&quot;:&quot;BILCEHCEFC&quot;}" data-component-name="LatexBlockToDOM"></div><p>where that infinity operator refers to the maximum distance between any two states. In other words, when you apply the Bellman update operator to any two value functions, the resulting value functions are closer together.</p><p>That matters because you can now show that T must converge to a single fixed point. The proof is simple. Assume that there are two possible fixed points. Then the update above moves them closer together, meaning that there cannot be any non-zero distance between the points. To show existence, we need to know that if you keep getting closer and closer to some limit, then that limit is still a valid value function. For finite state spaces this is obvious because value functions are just real vectors, and in R<sup>n</sup> every Cauchy sequence converges. This is called the &#8220;Banach fixed point theorem&#8221;.</p><h3>Iterative Expansion</h3><p>Notice that after one application:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;TV(s) = \\mathbb{E}_{s&#8217;}[R(s,a_1,s&#8217;)] + \\gamma \\mathbb{E}_{s&#8217;}[V(s&#8217;)]&quot;,&quot;id&quot;:&quot;AJVCKCBNDP&quot;}" data-component-name="LatexBlockToDOM"></div><p>where a<sub>1 </sub>is the action recommended by the greedy policy given V. After two iterations:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;T^2V(s) = \\mathbb{E}_{s''}[R(s,a_2,s'')] + \\gamma \\mathbb{E}_{s'}[R(s,a_1,s')] + \\gamma^2 \\mathbb{E}_{s'}[V(s')]&quot;,&quot;id&quot;:&quot;XZJAEXEGQH&quot;}" data-component-name="LatexBlockToDOM"></div><p>Each step adds one more discounted term. Early actions can be wrong, but their contribution shrinks geometrically. Eventually you converge to V*. 
In practice, you can truncate after k terms.</p><h2>Policy Iteration</h2><p>Now the policy-based, model-based quadrant. Unlike value iteration, here we separate policy evaluation from policy improvement:</p><p>1. <strong>Policy evaluation</strong>: solve</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V^{\\pi}(s) = \\mathbb{E}_{s&#8217;}\\big[R(s,\\pi(s),s&#8217;) + \\gamma V^{\\pi}(s&#8217;)\\big]&quot;,&quot;id&quot;:&quot;PHCFZLMIVL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Unlike value iteration, this is a linear system of equations that you can invert directly, since &#960; is known.</p><p>2. <strong>Policy improvement</strong>: set</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\pi&#8217;(s) = \\arg\\max_a \\mathbb{E}_{s&#8217;}\\big[R(s,a,s&#8217;) + \\gamma V^{\\pi}(s&#8217;)\\big]&quot;,&quot;id&quot;:&quot;BVTEABXJKL&quot;}" data-component-name="LatexBlockToDOM"></div><p>And repeat until convergence.</p><h3>Proof of Convergence</h3><p>The first step is showing that V(s) can only increase for any given s.</p><p>Define T<sup>&#960;</sup> as the &#8220;Bellman expectation operator,&#8221; which updates a state-value function V in the &#8220;direction of&#8221; &#960;:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(T^\\pi V)(s) = \\mathbb{E}_{s&#8217;}[R(s,\\pi(s),s&#8217;) + \\gamma V(s&#8217;)]&quot;,&quot;id&quot;:&quot;NSWCEQPDVA&quot;}" data-component-name="LatexBlockToDOM"></div><p>(Note that V might not have been generated by following &#960;.) By definition, V<sup>&#960;</sup> = T<sup>&#960;</sup> V<sup>&#960;</sup>. Notice that if you apply T<sup>&#960;&#8217;</sup> enough times to any value function V, you&#8217;ll end up with V<sup>&#960;&#8217;</sup>. If &#960;&#8217; is greedy w.r.t.
V<sup>&#960;</sup>, then</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V^\\pi \\le T^{\\pi&#8217;}V^\\pi \\le (T^{\\pi&#8217;})^2 V^\\pi \\le \\dots \\to V^{\\pi&#8217;}.&quot;,&quot;id&quot;:&quot;RONLDRDRYX&quot;}" data-component-name="LatexBlockToDOM"></div><p>since after enough iterations of T<sup>&#960;&#8217;</sup>, the policy becomes &#960;&#8217;. The first inequality holds because we are using the same state-values, and only deviating if the new action leads to a higher state-value. The second inequality is more subtle, but it relies on the monotonicity of the Bellman update operator: if you have two value functions with V(s) &#8805; U(s) for every s, then T<sup>&#960;</sup>V(s) &#8805; T<sup>&#960;</sup>U(s) for every s. This result comes directly from the Bellman operator definition if you compare each term of the expression. We&#8217;re just applying this property to the last two terms of the inequality to produce the next term, infinitely many times.</p><p>So each greedy improvement can only increase value. Since there are finitely many deterministic policies (|A|<sup>|S|</sup>), this process must terminate eventually.</p><h2>Model-Free Value Iteration (Q-Learning)</h2><p>In the real world, we might not know P and R. We just have to start taking actions and figuring it out empirically. This leads us to the &#8220;model-free&#8221; methods.
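For contrast with the model-free methods that follow, the model-based policy iteration loop above can be sketched in plain Python. The two-state MDP is a hypothetical toy of my own; the evaluation step solves the linear system (I - gamma * P_pi) V = r_pi exactly, as described above:

```python
# Model-based policy iteration on a hypothetical two-state, two-action MDP.
# Action 1 jumps to state 1 (reward 1 from state 0, reward 2 from state 1);
# action 0 returns to state 0 with reward 0.
GAMMA = 0.9
N = 2  # number of states
P = [[{0: 1.0}, {1: 1.0}], [{0: 1.0}, {1: 1.0}]]  # P[s][a] -> {s2: prob}
R = [[{0: 0.0}, {1: 1.0}], [{0: 0.0}, {1: 2.0}]]  # R[s][a] -> {s2: reward}

def evaluate(pi):
    # Exact policy evaluation: solve (I - GAMMA * P_pi) V = r_pi directly.
    A = [[(1.0 if i == j else 0.0) - GAMMA * P[i][pi[i]].get(j, 0.0)
          for j in range(N)] for i in range(N)]
    b = [sum(p * R[s][pi[s]][s2] for s2, p in P[s][pi[s]].items())
         for s in range(N)]
    for col in range(N):               # Gaussian elimination (fine at toy sizes)
        for row in range(col + 1, N):
            f = A[row][col] / A[col][col]
            A[row] = [x - f * y for x, y in zip(A[row], A[col])]
            b[row] -= f * b[col]
    V = [0.0] * N
    for row in reversed(range(N)):     # back substitution
        V[row] = (b[row] - sum(A[row][j] * V[j]
                               for j in range(row + 1, N))) / A[row][row]
    return V

def improve(V):
    # Greedy policy improvement with respect to the current values.
    return [max(range(2), key=lambda a: sum(p * (R[s][a][s2] + GAMMA * V[s2])
                                            for s2, p in P[s][a].items()))
            for s in range(N)]

pi = [0, 0]
while True:
    V = evaluate(pi)
    new_pi = improve(V)
    if new_pi == pi:
        break
    pi = new_pi
```

On this toy problem the loop terminates after a single improvement step, with pi = [1, 1] and V = [19, 20].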
We can&#8217;t directly compute the fixed point from before, because it relies on knowing P and the true reward function R:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V^{*}(s) = \\max_a \\sum_{s&#8217;} P(s&#8217; \\mid s,a) * \\big(R(s,a,s&#8217;) + \\gamma V^{*}(s&#8217;)\\big)&quot;,&quot;id&quot;:&quot;KWACJIOHNV&quot;}" data-component-name="LatexBlockToDOM"></div><p>So instead we define a new fixed point, which relies on a value function for each state-action pair:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q^*(s,a) = \\mathbb{E}[R(s,a,s&#8217;)] + \\gamma \\mathbb{E}[\\max_{a&#8217;} Q^*(s&#8217;,a&#8217;)]&quot;,&quot;id&quot;:&quot;YXULIBIRMR&quot;}" data-component-name="LatexBlockToDOM"></div><p>In other words, the value of a state-action pair is the expected reward, plus the discounted value of the best state-action pair available in the next state.</p><p>Naively you might try approximating this by gathering observations:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q(s,a) \\leftarrow r + \\gamma \\max_{a&#8217;} Q(s&#8217;,a&#8217;)&quot;,&quot;id&quot;:&quot;OUQZTFWUXH&quot;}" data-component-name="LatexBlockToDOM"></div><p>But that doesn&#8217;t quite work. r is noisy - it varies based on which s&#8217; you land in. You need an averaging scheme. The natural instinct might be to use a sample mean:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q(s,a) \\leftarrow \\tfrac{1}{t+1}[r+\\gamma\\max_{a&#8217;}Q(s&#8217;,a&#8217;)] + \\tfrac{t}{t+1}Q(s,a)&quot;,&quot;id&quot;:&quot;PJFLWBHGHR&quot;}" data-component-name="LatexBlockToDOM"></div><p>This would work if the learning target were stationary, but it isn&#8217;t: Q itself is changing, so early samples are inaccurate.
Instead, we use an exponential moving average (EMA):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q(s,a) \\leftarrow \\alpha[r+\\gamma\\max_{a&#8217;}Q(s&#8217;,a&#8217;)] + (1-\\alpha)Q(s,a)&quot;,&quot;id&quot;:&quot;ZEJRYSWOQM&quot;}" data-component-name="LatexBlockToDOM"></div><p>Iterate, slowly lower alpha, and the process converges. Each update adds another discounted term, just like value iteration.</p><h3>Proof of Convergence</h3><p>This is trickier than proving convergence for the model-based methods, because now it&#8217;s stochastic. The standard proof defines each update as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q_{t+1}(s_t,a_t)-Q_t(s_t,a_t) = \\alpha [TQ_t + M_{t+1} - Q_t(s_t,a_t)]&quot;,&quot;id&quot;:&quot;EVASGJECEC&quot;}" data-component-name="LatexBlockToDOM"></div><p>where M is zero-mean noise. Ignoring noise, this looks like:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{Q_{t+1}-Q_t}{\\alpha} = TQ_t - Q_t&quot;,&quot;id&quot;:&quot;VINWVLQJGP&quot;}" data-component-name="LatexBlockToDOM"></div><p>Then they define a continuous version of Q called q, expressed with respect to a new time variable tau:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tau=\\sum\\alpha&quot;,&quot;id&quot;:&quot;CGLCUDHGOJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>In the limit as alpha goes to 0,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{dq}{d\\tau} = Tq - q&quot;,&quot;id&quot;:&quot;HWEUWBIVDD&quot;}" data-component-name="LatexBlockToDOM"></div><p>This ODE is used to demonstrate convergence.
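Setting the convergence machinery aside, the EMA update itself is only a few lines. The two-state environment, the constants, and the epsilon-greedy exploration below are all illustrative assumptions of mine:

```python
import random

# Tabular Q-learning with the EMA update, on a hypothetical two-state MDP:
# action 1 jumps to state 1 (reward 1 from state 0, reward 2 from state 1);
# action 0 returns to state 0 with reward 0. Constants are illustrative.
random.seed(0)
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.2

def step(s, a):
    # Environment dynamics: returns (next_state, reward). Deterministic here,
    # but the agent never reads P or R directly - it only samples transitions.
    return (1, 1.0 if s == 0 else 2.0) if a == 1 else (0, 0.0)

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
s = 0
for _ in range(20000):
    # Epsilon-greedy behavior so every state-action pair keeps being visited.
    if random.random() < EPS:
        a = random.choice((0, 1))
    else:
        a = max((0, 1), key=lambda act: Q[(s, act)])
    s2, r = step(s, a)
    target = r + GAMMA * max(Q[(s2, 0)], Q[(s2, 1)])
    Q[(s, a)] = ALPHA * target + (1 - ALPHA) * Q[(s, a)]  # the EMA update
    s = s2
```

Under these assumed dynamics the true fixed point has Q*(1, 1) = 20 and Q*(0, 1) = 19, and the estimates land near those values without the agent ever reading P or R.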
The full analysis is beyond the scope of this post.</p><h2>Model-Free Policy Iteration</h2><p>Now let&#8217;s try to apply the same methodology to policy iteration, just as Q-learning stochastically approximated value iteration.</p><p>In principle, we could evaluate Q<sup>&#960;</sup> by sampling, then improve &#960;, and repeat. But exact evaluation by sampling is very slow and requires waiting for convergence each round. Worse, unlike the model-based case, you can&#8217;t invert the system of equations for V<sup>&#960;</sup>. So the advantages of policy iteration vanish in the model-free setting.</p><p>A practical compromise is approximate policy iteration: run only k evaluation steps before improving the policy. This weakens convergence guarantees. With k=1, you get a popular method called SARSA.</p><p>SARSA follows the action that Q recommends most of the time, but with probability epsilon takes a random action - an &#8220;epsilon-greedy policy&#8221;:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\pi(a \\mid s) =\n\\begin{cases}\n1 - \\varepsilon + \\dfrac{\\varepsilon}{|A|}, &amp; a = \\arg\\max_{a'} Q(s, a') \\\\[6pt]\n\\dfrac{\\varepsilon}{|A|}, &amp; \\text{otherwise.}\n\\end{cases}&quot;,&quot;id&quot;:&quot;RVXLITCVAR&quot;}" data-component-name="LatexBlockToDOM"></div><p>We update with:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q(s_t,a_t) \\leftarrow (1-\\alpha)Q(s_t,a_t) + \\alpha\\big[r(s_t,a_t,s_{t+1}) + \\gamma Q(s_{t+1},a_{t+1})\\big]&quot;,&quot;id&quot;:&quot;VTJZAJATHN&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is called &#8220;TD(0)&#8221;, or temporal-difference learning with lambda equal to 0. Temporal-difference learning allows us to extend further.
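As a minimal sketch of the epsilon-greedy SARSA update above (the two-state environment and constants are my own toy assumptions, not from the post):

```python
import random

# SARSA (the k = 1 approximate policy iteration described above) on a
# hypothetical two-state MDP: action 1 jumps to state 1 (reward 1 from
# state 0, reward 2 from state 1); action 0 returns to state 0 with reward 0.
random.seed(0)
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.1

def step(s, a):
    # Environment dynamics, sampled rather than known to the agent.
    return (1, 1.0 if s == 0 else 2.0) if a == 1 else (0, 0.0)

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

def policy(s):
    # Epsilon-greedy policy derived from the current Q estimates.
    if random.random() < EPS:
        return random.choice((0, 1))
    return max((0, 1), key=lambda act: Q[(s, act)])

s = 0
a = policy(s)
for _ in range(30000):
    s2, r = step(s, a)
    a2 = policy(s2)  # on-policy: bootstrap with the action we will actually take
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * Q[(s2, a2)])
    s, a = s2, a2
```

Because SARSA is on-policy, it converges toward the value of the epsilon-greedy policy itself, so the learned Q-values sit slightly below the optimal ones - but the greedy policy they induce still picks action 1 in both states.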
Instead of just one-step lookahead, you can mix multiple n-step returns G<sub>n</sub>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;G_t^{(\\lambda)} = (1-\\lambda)\\sum_{n=1}^\\infty \\lambda^{n-1} G_n&quot;,&quot;id&quot;:&quot;XALAGQMDMI&quot;}" data-component-name="LatexBlockToDOM"></div><p>In practice, computing this infinite sum is approximated dynamically. Lots of simple algebra goes into deriving the recursion (eligibility traces), but I&#8217;ll skip it here.</p><h2>Wrap-Up</h2><p>This is why you don&#8217;t really see a neat &#8220;model-free policy iteration.&#8221; Without P and R, exact evaluation is gone, and requiring near-converged sample evaluation before every improvement is prohibitively inefficient.</p><p>Is that all we need to know for reinforcement learning? Unfortunately the methods above don&#8217;t work in many domains. They require not &#8220;too big&#8221; of a state-action table, and there&#8217;s no way to handle continuous action spaces. I&#8217;ll cover how you can solve these more complex scenarios in my next post.</p><p>For a comprehensive overview that overlaps CS 188 material and this post, see <a href="https://lilianweng.github.io/posts/2018-02-19-rl-overview/">Lilian Weng&#8217;s excellent writeup</a>.</p>]]></content:encoded></item><item><title><![CDATA[A Free Market for Eyeballs]]></title><description><![CDATA[Many takes on the attention economy, with very few attention economists.]]></description><link>https://www.neelsomaniblog.com/p/a-free-market-for-eyeballs</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/a-free-market-for-eyeballs</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Tue, 29 Jul 2025 02:52:15 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0dd0767b-71b6-4a5b-a335-2933fbb44827_512x512.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I have a friend who blew up on Twitter/X a few years ago. 
His account started like mine, just a couple thousand followers, mostly his friends. But by the end of the year, he had so many viral tweets that he was at 100K+.</p><p>I brought this up to my mom, a first-generation immigrant. She said, "So? What has he gotten out of that?" I remember thinking that she was completely out of touch, but I had no strong argument to defend my intuition.</p><p>Since then, this kid has met billionaires like Sam Altman and Elon Musk; he's working at a top company making seven figures per year; and he manages a small fund on the side.</p><p>I think this anecdote says a lot about X and why exactly it's valuable. I used to work at a hedge fund, so I'm cursed to always think in finance terms. <strong>To me, posting on X is a form of regulatory arbitrage, because attention is capital that isn't taxed.</strong></p><h2>The Optics-to-Economics Pipeline</h2><p>Finance moguls often preach about the "Section 1031 exchange," a tax feature that allows you to repeatedly swap your investment properties for more valuable ones, and continuously defer capital gains tax. What I've observed is that the same phenomenon happens with attention on X.</p><p>In some ways, my entrepreneurial career started on X. I was a quant at Citadel at the time, and a shitposter on the side.
I left in 2022 at the calling of my wise friends, who advised me to build on the Terra blockchain.</p><p>Just two months later, Terra had collapsed:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9MeI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a3c637d-d6ba-4567-864d-213810967e03_1202x738.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9MeI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a3c637d-d6ba-4567-864d-213810967e03_1202x738.png 424w, https://substackcdn.com/image/fetch/$s_!9MeI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a3c637d-d6ba-4567-864d-213810967e03_1202x738.png 848w, https://substackcdn.com/image/fetch/$s_!9MeI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a3c637d-d6ba-4567-864d-213810967e03_1202x738.png 1272w, https://substackcdn.com/image/fetch/$s_!9MeI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a3c637d-d6ba-4567-864d-213810967e03_1202x738.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9MeI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a3c637d-d6ba-4567-864d-213810967e03_1202x738.png" width="1202" height="738" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a3c637d-d6ba-4567-864d-213810967e03_1202x738.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:738,&quot;width&quot;:1202,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9MeI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a3c637d-d6ba-4567-864d-213810967e03_1202x738.png 424w, https://substackcdn.com/image/fetch/$s_!9MeI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a3c637d-d6ba-4567-864d-213810967e03_1202x738.png 848w, https://substackcdn.com/image/fetch/$s_!9MeI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a3c637d-d6ba-4567-864d-213810967e03_1202x738.png 1272w, https://substackcdn.com/image/fetch/$s_!9MeI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a3c637d-d6ba-4567-864d-213810967e03_1202x738.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://x.com/neelsomani/status/1525172426803380225">https://x.com/neelsomani/status/1525172426803380225</a></figcaption></figure></div><p>I received condolences via iMessage and DMs. But I wasn't alarmed, because I intuitively knew that I was better off. I had gained an asset: attention.</p><p>Just a few months later, I had raised $15 million for my next project. And I'm not the only one. Roy Lee was a student at Columbia who was expelled for using AI to cheat on his software engineering interviews; the incident went <a href="https://x.com/im_roy_lee/status/1905063484783472859">viral</a>, and he went on to raise a $15 million round from a16z.</p><h2>Reversal of Fortune</h2><p>There's a famous documentary where a homeless man, Ted Rodrigue, is given $100K. 
Within six months, he unfortunately went back to being broke and living in a tent.</p><p>Examples like the poly-employed software engineer <a href="https://techcrunch.com/2025/07/03/who-is-soham-parekh-the-serial-moonlighter-silicon-valley-startups-cant-stop-hiring/">Soham Parekh</a> are the Ted Rodrigues of X. Not everyone is built for viral attention, and when you don't know how to manage it, you squander it. Soham's interview on TBPN was <a href="https://x.com/Austen/status/1940947073261858887">widely criticized</a>. His story was inconsistent, and the attention wasn't directed toward anything greater.</p><p>That's not the only way things go wrong. Sometimes people have a huge following, but they channel that attention toward horrible ideas. Even as a founder with attention, you can pick the wrong idea. That's probably the most common reason why some influencers never convert their attention capital: they just can't figure out how to monetize it.</p><h2>This Scroll Could Change Your Life</h2><p>The value of posting is obvious. You get to build this valuable intangible, untaxable asset called attention. But why are we still scrolling? Who's doing the consuming?</p><p>I'm reminded of a time I was skiing in Aspen with my good friend from college, and we were talking about how, when we're with our families, we find it tempting to scroll on our phones. That's obviously a very sad thing - those are our loved ones!
But this chart explains it:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MLnr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8431672e-355f-49d3-ba63-a754775ff491_1306x772.png"><img src="https://substackcdn.com/image/fetch/$s_!MLnr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8431672e-355f-49d3-ba63-a754775ff491_1306x772.png" width="1306" height="772" alt="TRUMP memecoin price chart" loading="lazy"></a><figcaption class="image-caption">TRUMP memecoin chart: <a href="https://coinmarketcap.com/currencies/official-trump/">https://coinmarketcap.com/currencies/official-trump/</a></figcaption></figure></div><p>Donald Trump's token (TRUMP) traded at just $2 to $3 for almost an hour after it launched. It was dinnertime in California, and people weren't on their phones. Word took a while to spread, but over the next 24 hours, people moved their capital over and drove the price up to $60-70.</p><p>Who's to say whether the TRUMP coin would have kept climbing from there. But I use it as an example of the extreme returns that come from being first to act on asymmetric information. In some sense, every post we read is a search for a metaphorical "TRUMP coin." Every so often, we come across a post so valuable that it justifies reading all of the slop: a job opportunity, the release of a hugely time-saving app, a Luma page for an awesome event happening in our city.</p><p>That's our opportunity cost when we're not scrolling.
And that's why we're still on X.</p><h2>A Tool To Save You Time</h2><p>I hacked together a small, open-source project called <a href="https://github.com/neelsomani/tweet-insight-daily">Today On Tech Twitter</a>.</p><p>I created an X account that follows what I think is a representative sample of accounts on "Tech Twitter". The website scrapes that account's feed every evening and passes the posts to ChatGPT, which summarizes the day's events.</p><p>I built the website because the underlying primitive is useful in many ways. For the casual scroller, it offers an easy way to take a break and later catch up on what you missed. My intention was for the representative sample to include diverse views, to mitigate the possibility of users getting stuck in a "bubble."</p><p>For engineers, this data can be the input to a process that uses AI to generate content tied to current events: viral videos, say, or automated posts from your company account on X. The JSON API is publicly available: <a href="https://www.todayontechtwitter.com/api/s3-data?utc_date=2025-07-29">https://www.todayontechtwitter.com/api/s3-data?utc_date=2025-07-29</a></p><p>X isn't for everyone, but for those who are here, you might as well exploit the arbitrage.</p>]]></content:encoded></item><item><title><![CDATA[The BLAST Playbook]]></title><description><![CDATA[I argue that the $500B-1T in annual software investments should reallocate to BLAST: assets that monetize boredom, loneliness, and scarcity.]]></description><link>https://www.neelsomaniblog.com/p/the-blast-playbook</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/the-blast-playbook</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Mon, 24 Feb 2025 23:53:15 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9130f087-f151-4139-a1d6-0bd10b1b1b5e_420x300.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A growing consensus is forming:
software investing is in a horrible position.</p><p>The software VC model is getting squeezed from both ends:</p><ol><li><p>AI tools like Cursor and Devin have made software fast and cheap to build, meaning great builders don't need to take external capital.</p></li><li><p>Any software product is easily copied, so software ARR is less defensible by the day.</p></li></ol><p>The implication? The $500B-1T+ annual software investing machine now lacks investable companies. <strong>Over the next 3-5 years, software funds will underperform relative to their peers, and many traditional VCs will eventually close shop.</strong></p><p>Two big questions remain:</p><ol><li><p>Where should capital be deployed instead?</p></li><li><p>What should a founder build today?</p></li></ol><p>Spoiler alert: The answers don&#8217;t match.</p><h2>Where To Deploy Capital: BLAST</h2><p>Smart capital already sees the problem. But what assets can scalably absorb hundreds of billions of dollars?</p><p>Introducing BLAST: the <strong>Boredom, Loneliness, and Scarcity Thesis</strong>:</p><ul><li><p>Boredom &#8594; People still need distractions (see: TikTok addiction, memecoin speculation).</p></li><li><p>Loneliness &#8594; People still want to feel special and seen (e.g. social communities).</p></li><li><p>Scarcity &#8594; People will pay more for things that others can&#8217;t have (e.g. Birkins, natural resources).</p></li></ul><p>But while there's potential to profit in these sectors, it's difficult to deploy hundreds of billions of dollars into incumbents while achieving outsized returns. Capital needs a way to access BLAST assets at scale, in novel investments.</p><p><strong>The ultimate BLAST investment is an entirely new city or country. </strong>Land is inherently scarce, particularly oceanfront property. New city development opens the door not only to luxury housing but also to luxury services.
Elite private schools, exclusive gyms, and high-end detox clinics absorb capital while entertaining residents and fostering community.</p><p>Key questions:</p><ul><li><p>Where should these cities be built? Possibilities include land near existing cities, expensive oceanfront locations, or cheap land in the middle of nowhere.</p></li><li><p>What makes them unique? These cities could be friendly to select industries.</p></li><li><p>What productive assets within them can absorb venture-scale capital?</p></li></ul><h2>Software Founders: Castles Without Moats</h2><p>Thin AI wrappers should not be raising venture at all. Raising a $10M Series A with $5M ARR is a bad deal for the founder: it constrains optionality, forcing them to shoot for a billion-dollar exit instead of extracting as much cash as possible during the (likely limited) shelf life of the product.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k7GO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c5d8785-5793-4143-b112-265f0506a19e_1188x1408.png"><img src="https://substackcdn.com/image/fetch/$s_!k7GO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c5d8785-5793-4143-b112-265f0506a19e_1188x1408.png" width="1188" height="1408" alt="" loading="lazy"></a><figcaption class="image-caption">Source: <a href="https://x.com/arfurrock/status/1892977861604036863?s=46">https://x.com/arfurrock/status/1892977861604036863?s=46</a></figcaption></figure></div><p>Making commitments to keep growing in today&#8217;s software environment is too risky, because as <a
href="https://docs.google.com/document/d/103cGe8qixC7ZzFsRu5Ww2VEW5YgH9zQaiaqbBsZ1lcc/edit?tab=t.0">Chris Paik</a> puts it, the "end of software" is near. The new playbook:</p><ol><li><p>Use AI tooling to quickly and cheaply build projects that spit out cash immediately.</p></li><li><p>Forget about long-term defensibility, and as a result, <strong>don't take venture capital</strong>.</p></li><li><p>Move on to the next project.</p></li></ol><p>Many founders are avoiding these "castles without moats." But moats only matter when building is hard and expensive. If software takes days and &lt;$10K to launch, you have so many shots on goal that your probability of succeeding is much higher, so smaller projects become positive EV.</p><p>This arbitrage only exists for so long, because soon, AI agents themselves will be the ones rapidly spitting out new cash-generating software projects.</p><p>The closest thing to a "moat" for software built today: <strong>there are so many ideas available that it&#8217;s easier for competitors to build something new than to copy you.</strong> (At least at first.)</p><p>Alternatively, if you insist on raising venture, you need to go full moonshot. That means pursuing ideas so outlandish that they seem irrational: projects demanding massive capital, breakthrough research, or other forms of strong defensibility.</p><p>SaaS wasn't contrarian enough anyway.
The next wave will be weirder, riskier, and hopefully more interesting.</p>]]></content:encoded></item><item><title><![CDATA[Privatize the FDA]]></title><description><![CDATA[I advocate for privatizing the FDA by replacing its drug approval monopoly with a competitive market to reduce delays, lower costs, and foster innovation.]]></description><link>https://www.neelsomaniblog.com/p/privatize-the-fda</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/privatize-the-fda</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Sat, 11 Jan 2025 01:34:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/40cd1649-ea00-4cb6-bb8d-ed90ed7fae8a_400x400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The FDA, among other goals, is mandated by Congress to ensure that drugs are safe and effective. However, this directive has come at an unforeseen cost: <strong>delays that have resulted in preventable deaths, and drugs that are difficult for informed patients to obtain.</strong></p><p>Examples include vaccines such as Fluad, which became available in the US <a href="https://www.cato.org/regulation/fall-2019/fda-needs-more-accountability-not-more-independence">18 years</a> after gaining widespread use in Europe, and peptides that demonstrate promising early results but face FDA distribution <a href="https://www.fda.gov/news-events/press-announcements/fda-roundup-december-17-2024">warnings</a>.</p><p>These delays have potentially cost hundreds of thousands, if not millions, of <a href="https://ascopost.com/issues/october-25-2015/delays-in-drug-approval-are-deadly-highlighting-the-need-for-improved-regulatory-efficiency/">life-years</a> annually.
Moreover, the FDA's centralized, inefficient processes stifle innovation, discouraging biotech investment and leaving society with fewer life-saving and life-enhancing options.</p><p>This essay advocates for dismantling the FDA's drug approval function and transferring this responsibility to privatized entities known as "Drug Certification Bodies" (DCBs).</p><h2><strong>Problem: The FDA holds a monopoly on drug approvals.</strong></h2><p>While the FDA does not explicitly make unapproved drugs illegal, its regulatory powers over manufacturing, marketing, and distribution render many drugs de facto impossible to obtain.</p><p>There is no natural counterbalance to the FDA's power. It has no economic competitors, and legal action against the FDA is impractical for bio companies, who risk retaliation and future delays. The major issues are as follows:</p><p>1. Time and Cost of Approval</p><p>Submitting a drug for review costs over <a href="https://www.clinicaltrialsarena.com/news/fda-cost-revealed-2025-application-drug/">$4 million</a> in the United States, compared to less than <a href="https://www.ema.europa.eu/en/documents/other/explanatory-note-general-fees-payable-european-medicines-agency-1-april-2024_en.pdf">$250,000</a> in Europe. On average, it takes 10 years and <a href="https://www.genengnews.com/gen-edge/the-unbearable-cost-of-drug-development-deloitte-report-shows-15-jump-in-rd-to-2-3-billion/">$1-2 billion</a> to bring a new drug to market. This timeline includes not only the FDA's own review process but also the extensive trials and data collection required by the agency. These burdens result in fewer drugs funded and developed.</p><p>2. Economic Inefficiency</p><p>Since Congress has granted the FDA a monopoly, the agency has no incentive to maximize its ROI. Pharmaceutical companies must work with the FDA, allowing the agency to manually set "user fees" without market competition. 
These user fees&#8212;payments from drug companies for reviewing their products&#8212;make up ~45% of the FDA's $7.2 billion annual spend, which supports 18,000 employees.</p><p>While user fees have led to <a href="https://www.ncbi.nlm.nih.gov/books/NBK603243/">faster drug approvals</a>, the lack of a competitive market means there's no standard for what these fees should be. For example, pharmaceutical companies might be willing to pay even higher fees to receive quicker reviews.</p><p>3. A One-Size-Fits-All Approach</p><p>The FDA's approval process generally applies the same standards to all patients, regardless of each patient's individual risk tolerance or demographic profile. This binary system, where a drug is either approved for all or none, ignores the diverse needs of patients and leaves many without viable options. While Congress allows the FDA to make limited exceptions for terminally ill patients and orphan drugs, these pathways are insufficient to address the broader systemic issues.</p><h2><strong>We should not eliminate all drug regulations.</strong></h2><p>A variety of approval models could be explored, but the first step is for Congress to relax the constraint that the FDA has final authority over all approvals. On the other hand, complete deregulation isn't likely to end well:</p><p>1. Private actors have demonstrated a poor track record in self-regulating, such as promoting <a href="https://www.fda.gov/files/about%20fda/published/The-Sulfanilamide-Disaster.pdf">unsafe</a> drugs or <a href="https://x.com/ktkadakia/status/1612847108523958280?s=46">inadequately testing</a> devices before market release when not mandated. The FDA was initially created to address these misaligned incentives.</p><p>2. Laypeople are often unequipped to interpret statistical data on their own, so the laissez-faire market suffers from inefficiencies similar to those caused by imperfect information. The FDA acts as a trusted resource that the public relies on.</p><p>3. 
Eliminating the FDA runs the risk of enabling drug abuse. Addictive drugs are a national security and public health risk, e.g. the Opium Wars.</p><p>For these reasons, it is inadvisable to remove the FDA drug approval process with no alternative in its place.</p><h2><strong>Solution: Build a privatized FDA alternative.</strong></h2><p>My recommendation is for a competitive, privatized system of <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1503162">Drug Certification Bodies</a> to be established. A Drug Certification Body (DCB) is a private sector entity that conducts drug approvals and subjects itself to relevant regulations.</p><p>DCBs should handle all drug approvals and adopt the following three reforms:</p><p>1. Separate the safety and efficacy approvals, where safety is the only requirement for usage: Safety and efficacy are inherently connected, as approvals weigh a drug's benefits against its risks. However, even when efficacy has not yet been demonstrated, drugs should be approved if they meet a sufficient safety threshold. This approach mirrors current off-label drug use practices, where <a href="https://www.cato.org/commentary/abolish-fda">20-30%</a> of prescriptions involve drugs prescribed for conditions beyond FDA approval. In such cases, physicians must assess whether the potential benefits justify the risks. Additionally, we might require that patients consuming a yet-to-be-proven drug consent to their physician sharing their medical records to build an argument for efficacy.</p><p>For instance, Tafamidis, a treatment for transthyretin amyloid cardiomyopathy (ATTR), could have reached the US market sooner. Initially, the FDA rejected Tafamidis as a safe but ineffective treatment for polyneuropathy. Later, it was found effective for a different condition, ATTR, and approved in the United States in 2019. During the interim, Tafamidis was only available in Europe pending its second FDA review. 
Such cases are common, as pharmaceutical companies investing billions in a safe drug are motivated to identify conditions where it proves efficacious.</p><p>DCBs can separately offer efficacy certifications. Efficacy will remain a desirable approval to receive, since insurers and payors will prefer to cover drugs that are proven to be effective. Safety approvals should allow for greater side effects when a drug's benefits outweigh its risks.</p><p>2. Adopt a heterogeneous review process, allowing for demographic-specific approvals: The current FDA review process is cumbersome and inconsistent, with multiple pathways. The timeline typically consists of preliminary tests and three phases of clinical trials (~6 years). This all leads to a 100,000+ page new drug application, which takes 1+ year to review alongside a manufacturing inspection.</p><p>Ironically, Congress implicitly acknowledges that a faster approval process is possible via the FDA's emergency pathways, such as the process for COVID-19 vaccines. Under a privatized system, DCBs will develop voluntary standards tailored to each drug's unique risks and benefits.</p><p>In some cases, DCBs might approve drugs for only <a href="https://www.neelsomaniblog.com/race-science-for-non-racists.php">subsets of the population</a> based on available data, allowing faster access for targeted groups. 
New studies should include diverse demographic groups in line with the DEPICT Act of 2023, promoting equitable access, but this relaxation allows DCBs to utilize historical data, data from non-US jurisdictions like Honduras, or more unusual data that wouldn't meet the standard of a full clinical trial, such as human challenge studies for small target populations.</p><p>Some might argue that a lack of diversity in patient trials has led to failures in fields like <a href="https://www.amazon.co.uk/Malignant-Policy-Evidence-People-Cancer/dp/1421437635">oncology</a> where studies did not generalize, but the real error is that the FDA granted efficacy approvals when only safety was properly supported.</p><p>3. Establish public credibility by bearing the cost of unsafe approvals &amp; publishing results: A chief concern is that DCBs will only focus on minimizing approval times or offering competitive user fees, without prioritizing safety.</p><p>To align incentives, DCBs should offer liability insurance (up to a reasonable limit) to pharmaceutical companies for health hazards resulting from the use of approved drugs. The minimum required amount of liability insurance is to be determined.</p><p>All drug reviews should be published for public auditability. 
The results might be posted somewhere as simple as Arxiv.</p><h2><strong>How can this get congressional support?</strong></h2><p>The FDA should reduce its role to ensuring DCBs are run properly, similar to the regulation of credit rating agencies:</p><ul><li><p>Conflict-of-interest protections: DCBs must adhere to stringent rules, like those used by the Department of Defense, to prevent bribery or undue influence.</p></li><li><p>Enforcement of labeling: Relevant marketing claims made by drug manufacturers must be approved by DCBs.</p></li><li><p>Fraud prevention: Any data submitted to a DCB must be accurate.</p></li></ul><p>With a more limited scope, the FDA can focus its resources on accomplishing the above efficiently, while DCBs focus on approvals.</p><p>To get widespread support, the model must first be trialed successfully. Precedents, such as the expansion of the Third-Party Review Program for Class II medical devices under the 1997 Modernization Act, offer a roadmap for scaling privatized reviews. This involved selecting a representative sample of Class II devices and providing access to FDA databases and review templates.</p><p>The logical starting point might be "wellness therapies." Congress defines a drug as anything intended to (a) treat or prevent disease or (b) otherwise affect the body's structure or function. Privatized approvals may be more suitable for drugs that solely affect the body's structure or function ("wellness therapies"), like peptide injections.</p><p>Wellness therapies often involve novel mechanisms, which means the FDA must commit substantial resources to their review, unlike generics. This category of drugs is underserved, since there are already existing pathways for expedited approvals and expanded access for terminal illness drugs or orphan drugs. 
Wellness therapies have lower risks associated with an erroneous approval or rejection, since the therapies do not directly treat diseases, and their consumers are often high-paying and informed. Lastly, this category naturally expands to cover other drugs like homeopathic treatments, which have a fraught history with the FDA.</p><p>I am interested in collaborating with others who are working toward FDA privatization. While this possibility has been discussed for <a href="https://www.jstor.org/stable/26659541">decades</a>, the next four years under the Trump administration present a rare opportunity to finally reform this broken process. This might involve spinning up the first DCB that functions as a true alternative to the FDA.</p>]]></content:encoded></item><item><title><![CDATA[The New Economy]]></title><description><![CDATA[We have massively overallocated our youth to roles like software engineering and medicine - which will be shortly replaced en masse by AI. I propose how the working population should reallocate.]]></description><link>https://www.neelsomaniblog.com/p/the-new-economy</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/the-new-economy</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Fri, 25 Oct 2024 00:59:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5973729f-4f80-4de8-838b-e73c98cd6d70_225x225.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this essay, I propose that the labor force will reallocate away from jobs like software engineering &amp; medicine, and instead toward construction in the short-term and entertainment in the long-term.</p><p>Homelessness &amp; joblessness is obviously a rampant problem in San Francisco; India's youth population (~400 million people) is 20%+ unemployed; and AGI will render millions of high-earners even in the United States (e.g. 
software engineers) without work.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gbit!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106a6bf0-12bc-4bd7-9eaf-1621e4f99947_1112x948.png"><img src="https://substackcdn.com/image/fetch/$s_!Gbit!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F106a6bf0-12bc-4bd7-9eaf-1621e4f99947_1112x948.png" width="1112" height="948" alt="" loading="lazy"></a><figcaption class="image-caption">UC Berkeley Grads with 4.0 GPAs Cannot Find Jobs (<a href="https://www.linkedin.com/posts/jamesfobrien_tech-jobs-have-dried-upand-arent-coming-activity-7242613292479696897-gCyT?utm_source=share&amp;utm_medium=member_desktop">LinkedIn</a>)</figcaption></figure></div><p>These are all instances of the same problem. What I am interested in is a human equivalent of Bitcoin mining. Bitcoin mining allows us to take idle compute resources and contribute them toward something valuable. <strong>What can we do with idle people?</strong></p><h2>Refining the Problem Statement</h2><p>AGI poses many interesting questions, but this is a separate question from:</p><p>- What goals should we align humanity around in the long-term? It may be arrogant to suggest that such alignment is even possible or desirable. Instead, the problem here describes a medium-term issue facing the labor market.
AGI's role may not be to guide humanity toward one grand objective, but rather to support a diversity of purposes and individual goals.</p><p>- If universal basic income is implemented, how should we capture and re-distribute it? Unrelated, but a useful lens for the profiteer.</p><p>- How should we keep people entertained, so they don't go insane? Entertainment/fulfillment is valuable, but not the only thing to optimize for.</p><p>- What should we use "extra" (~$0 marginal cost) inference power for? The problem above refers to the "extra" people, not compute.</p><p>Even without full-blown AGI, a solution for economically unproductive people is already useful today, as automation continues to displace workers.</p><p>When AGI fully arrives, access to AGI might not be globally democratized, and the economic surplus from AI won't be re-distributed purely to the same individuals who held those high-paying jobs. People might either prefer to operate at their previous level of income/wealth, or they might find it fulfilling to contribute toward something greater than themselves, so this question remains interesting to me.</p><h2>What constitutes a valid solution?</h2><p>1. Is the work valuable? "Value" might not be measured in dollars generated; it might be measured in fulfillment, and it will sometimes be debatable whether something is valuable. This criterion eliminates meaningless redistributions of wealth or <a href="https://www.economist.com/buttonwoods-notebook/2010/07/19/keynes-at-work">digging holes to fill them up again</a>.</p><p>2. Is the job scalable? The solution must be able to employ hundreds of thousands, if not millions, of people.</p><p>3. Is it ethical? That said, unethical solutions are still worth highlighting, if only because they might lead to an ethical one.</p><p>4. Is it AGI-resistant? The solution should still be useful even in a post-AGI world. This is difficult to predict and might not be possible. 
Solutions that aren't AGI-resistant are still useful in the interim period where masses are unemployed but AGI cannot yet produce everything desired.</p><p>To satisfy the last criterion, a valid solution cannot rely on superior thinking ability to produce value. The timeframe matters here, since certain tasks might take years longer to replace than others.</p><h2>What are the possible solutions?</h2><p>I've ranked the following categories alongside my view of their AGI-resistance:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j-tQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700c9e17-ad94-4b12-b08e-49a70efeced2_1152x425.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j-tQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700c9e17-ad94-4b12-b08e-49a70efeced2_1152x425.png 424w, https://substackcdn.com/image/fetch/$s_!j-tQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700c9e17-ad94-4b12-b08e-49a70efeced2_1152x425.png 848w, https://substackcdn.com/image/fetch/$s_!j-tQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700c9e17-ad94-4b12-b08e-49a70efeced2_1152x425.png 1272w, https://substackcdn.com/image/fetch/$s_!j-tQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700c9e17-ad94-4b12-b08e-49a70efeced2_1152x425.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!j-tQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700c9e17-ad94-4b12-b08e-49a70efeced2_1152x425.png" width="1152" height="425" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/700c9e17-ad94-4b12-b08e-49a70efeced2_1152x425.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:425,&quot;width&quot;:1152,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67294,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j-tQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700c9e17-ad94-4b12-b08e-49a70efeced2_1152x425.png 424w, https://substackcdn.com/image/fetch/$s_!j-tQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700c9e17-ad94-4b12-b08e-49a70efeced2_1152x425.png 848w, https://substackcdn.com/image/fetch/$s_!j-tQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700c9e17-ad94-4b12-b08e-49a70efeced2_1152x425.png 1272w, https://substackcdn.com/image/fetch/$s_!j-tQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F700c9e17-ad94-4b12-b08e-49a70efeced2_1152x425.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Below, I dive into each of these categories to illustrate what the reallocation of labor resources might look like:</p><h3>Medium-Term Solutions</h3><p>While fields like construction and elderly care are fulfilling, even these tasks can likely be replaced by agents/robotics on a reasonable timeframe. At the same time, construction in particular offers a compelling vision that can align and employ millions in the near-term.</p><p>1. Construction: Build something grand that justifies the use of manual labor.</p><p>An example here would be if India were to commission the development of a Taj Mahal 2. 
The Indian government would guarantee a living wage to anyone who contributed, and the development of such magnificent structures is worthwhile as works of art.</p><p>The United States administration should institute a "new" <a href="https://en.wikipedia.org/wiki/New_Deal">New Deal</a>, where the government employs millions of Americans to develop bleeding-edge public infrastructure. This is the most compelling vision to me in the short term.</p><p>This might involve directing resources toward the necessary construction for interplanetary exploration and colonization: space ports, launch infrastructure, and habitats for human life on Mars.</p><p>This work would be valuable, scalable, and ethical, but not fully AGI-resistant.</p><p>2. Community work: Care for other humans, in capacities where humans are preferred.</p><p>I am interested in someone defining what it means to be healthy, and then designing the surrounding environment and checks to promote healthy child rearing. This comes from a concern that children who lack real human interaction in their upbringing will be emotionally and socially stunted.</p><p>This category might also include elderly care or running in-person, human-only communities.</p><p>3. Biological utility: Use your human body to produce biological data or materials.</p><p>Individuals like Bryan Johnson generate a tremendous amount of high-fidelity biological data that could theoretically be used to improve drug development. A more dystopian expression would be financially incentivizing experimental drug testing, since real human bodies will be superior to biological simulations or models for some period of time.</p><p>Another instance of this might be surrogates, if babies from surrogates are superior to lab-grown babies. 
We can reject unethical examples like organ donation or <a href="https://en.wikipedia.org/wiki/Fifteen_Million_Merits">using the human metabolism for energy production</a>, which is too metabolically inefficient anyway.</p><p>These roles are not particularly scalable, and they provoke ethical questions.</p><h3>Long-Term Solutions</h3><p>1. Bias mitigation: Some decisions inherently require bias, and we prefer that these decisions be made by biased humans rather than biased models.</p><p>Examples include the interpretation of law and ethics by judges and lawyers, as well as governance of systems related to AGI itself. This might also include oversight of some types of AGI output.</p><p>I don't view this as particularly scalable, and I'm concerned that human error rates might be too high to be useful.</p><p>2. Entertainment: Self-explanatory; entertain/serve other humans.</p><p>Unethical examples abound in this category, from prostitution to real-life Squid Games, gladiators, and so forth. But entertainment has so far proven resistant to AI alternatives. For example, while <a href="https://en.wikipedia.org/wiki/Human%E2%80%93computer_chess_matches">chess bots are definitively stronger than human players</a>, we still prefer watching humans play, though widespread sentiment can of course change over time.</p><p>This includes the service industry, where it's higher status to have human labor over machines, or the Olympics, where we even limit drugs that could potentially interfere with natural human performance. This work is valuable, scalable, and there are sufficient ethical examples.</p><h2>How do we act on this?</h2><p>I am concerned about a potential shock to the labor market where large segments of the population are rendered unemployed very quickly. I have serious doubts that governments would adapt quickly enough to issue UBI, for a variety of reasons. In my view, it is wise for us to proactively institute the relevant legislation (e.g. 
a "new" New Deal) or large-scale private funding to incentivize a shift toward roles that are sustainable in the medium-term.</p><p>I'm interested in hearing feedback and connecting with others who are interested in this problem space. You can reach me on Twitter at <a href="https://twitter.com/neelsomani?lang=en">@neelsomani</a>.</p>]]></content:encoded></item><item><title><![CDATA[A Year Around The Sun With Eclipse]]></title><description><![CDATA[I reflect on the Eclipse team's journey in building the Eclipse Mainnet, Ethereum's fastest L2. A reflection on rollups and challenges for the app-specific rollup thesis.]]></description><link>https://www.neelsomaniblog.com/p/a-year-around-the-sun-with-eclipse</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/a-year-around-the-sun-with-eclipse</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Tue, 03 Oct 2023 01:07:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c0a0fb28-5a0e-4562-a085-b8a7df29fc1b_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The Eclipse team has been hard at work for over a year now, so I thought I'd take this opportunity to reflect a bit on how we got here.</p><p>For those who have been following Eclipse, you'll know that our focus has zeroed in on building <a href="https://x.com/EclipseFND/status/1704178668543824309?s=20">Eclipse Mainnet</a>: Ethereum's fastest L2, powered by the SVM. Its architecture represents the culmination of our learnings from deploying our rollup framework for a variety of applications.</p><p>Before I get into where we came from, I want to touch on where we are now. I don't want to oversell where Eclipse is today. We're very much at the beginning of our journey. We still need to launch mainnet, grow an active ecosystem, decentralize our proofs, strengthen bridge contract upgradability, and many other things. That won't happen overnight. 
We're heads down focused on a successful mainnet launch over the coming months and will continue to build for years after.</p><p>Nonetheless, we have made some great progress already. This is what we've learned along the way:</p><h2>The Dawn of Eclipse</h2><p>We started talking to app developers a bit over a year ago, offering to spin up <a href="https://mirror.xyz/neelsalami.eth/rvhK5mEcFTOjyu_DFsqS2cYR7U6Fjvbw3nf8tI-pr-Q">customizable rollups</a> using the Solana Virtual Machine (SVM). A <a href="https://ethereum-magicians.org/t/a-rollup-centric-ethereum-roadmap/4698">rollup-centric roadmap</a> seemed to imply a world with thousands of rollups, so there was a lot of interest. We ended up running 30+ testnet chains alongside the application teams trying them out.</p><p>The operational burden was non-trivial. When chains went down at 2AM, our Head of Engineering David would receive the call. When teams had issues with their infrastructure integrations, our core engineers were expected to act as liaisons and intermediaries. When apps wanted to launch a native token with the chain, we coordinated with all necessary parties. This workload wasn't scalable or sustainable.</p><p>Even for our customers, app-specific rollups weren't optimal. It's more difficult to onboard users, bridge between rollups, compose with apps on the L1, and bootstrap meaningful economic activity. Each additional rollup added complexity and reduced interoperability.</p><p>We dove into the myriad proposed solutions. Self-service bridges, shared sequencers, indexers-as-a-service, new "settlement layers" as liquidity hubs. Dozens of companies have been founded to service the purportedly imminent influx of thousands of app-specific rollups. Trying to solve self-engineered complexity by adding new layers of complexity didn't strike us as convincing. 
We started to re-evaluate our position toward app-specific rollups.</p><h2>More Rollups, More Problems</h2><p>What the world truly needs is just one more rollup &#8211; especially if it's ours.</p><h3>Problem 1: App-specific rollups are uneconomical for most applications.</h3><p>We discovered the open secret that most app-specific rollups have a very high fixed cost. I even gave a talk at the Modular Summit about it: <a href="https://www.youtube.com/watch?v=EIekN6przb0">Rollups-as-a-Service Are Going To Zero</a>.</p><p>After running 30+ testnet chains ourselves, we quickly realized the magnitude of these fixed costs. Even a bare minimum rollup configuration demands significant expenses, including:</p><ul><li><p>Sequencer</p></li><li><p>Full nodes for the executor, verifier(s), fast finality bridge</p></li><li><p>Indexers</p></li><li><p>Engineering support</p></li><li><p>Posting state commitments and sometimes additional sequencer data to the L1</p></li></ul><p>...before considering additional infrastructure integrations. These expenses are higher for mainnet chains.</p><p>Aside from the costs above, app-specific rollup developers face high startup costs from infrastructure partners. Major RaaS providers charge on the order of $60K-$100K+ annually and take a percentage of all sequencer fees. Additionally, these teams face the implicit costs of increased developer complexity and user friction.</p><p>It's also worth noting that certain popular rollup stacks make the economics for smaller app-chains (which lack very high transaction throughput) even more challenging today. For example, <a href="https://subscriptions.theinformation.com/newsletters/slow-burn/archive/a-dapp-developers-guide-to-appchains">OP Stack chains carry particularly high fixed costs</a> because they routinely post to the L1 regardless of L2 activity. 
(Note that this specific inefficiency <a href="https://twitter.com/liamihorne/status/1690790715037470720">can be changed in the future</a>, and not all stacks have this issue.)</p><p>Overall, it's far more economically efficient to reduce overhead, deduplicate work, and amortize these high infrastructure costs across a single shared chain.</p><h3>Problem 2: The customizations offered by app-specific rollups are largely unnecessary.</h3><p>This lesson hurt, because customizability was one of the original motivations for Eclipse.</p><p>Customizing your own chain sounds nice in theory, but the reality is that most apps don't need or want it. These changes are generally far more trouble than they're worth, since they must be audited from both a technical and cryptoeconomic perspective. Each novel customization means increased complexity and potentially worse interoperability. <a href="https://www.neelsomani.com/blog/rollups-as-a-service-are-going-to-zero.php">This equally applies to L1 app chains</a>:</p><p><em>"The Cosmos SDK is incredibly generic and yet it never inspired the plethora of diverse chains that you might expect. This could be because customization requires too much technical sophistication, or more likely because the long tail of applications is well-suited by a handful of architectures."</em></p><p>To be fair, there are some cases where we think app chains <a href="https://forum.makerdao.com/t/explore-a-fork-of-the-solana-codebase-for-newchain/21822/24?u=neelsomani">make sense</a>, but it's not really about customization. 
These cases are driven by <a href="https://x.com/0xSydney/status/1692355457611129129?s=20">ownership</a>, sovereignty, and the community's ability to control forks and upgrades.</p><h3>Problem 3: Non-Ethereum settlement layers are a nerd trap.</h3><p>The original idea for Eclipse was to launch our own settlement layer with the other app-specific Eclipse rollups deployed as "L3s":</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zc7W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98dd7c0-c37c-4334-94c9-46f9c6ded20c_1600x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zc7W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98dd7c0-c37c-4334-94c9-46f9c6ded20c_1600x559.png 424w, https://substackcdn.com/image/fetch/$s_!zc7W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98dd7c0-c37c-4334-94c9-46f9c6ded20c_1600x559.png 848w, https://substackcdn.com/image/fetch/$s_!zc7W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98dd7c0-c37c-4334-94c9-46f9c6ded20c_1600x559.png 1272w, https://substackcdn.com/image/fetch/$s_!zc7W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98dd7c0-c37c-4334-94c9-46f9c6ded20c_1600x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zc7W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98dd7c0-c37c-4334-94c9-46f9c6ded20c_1600x559.png" width="1456" height="509" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e98dd7c0-c37c-4334-94c9-46f9c6ded20c_1600x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:509,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:383777,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zc7W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98dd7c0-c37c-4334-94c9-46f9c6ded20c_1600x559.png 424w, https://substackcdn.com/image/fetch/$s_!zc7W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98dd7c0-c37c-4334-94c9-46f9c6ded20c_1600x559.png 848w, https://substackcdn.com/image/fetch/$s_!zc7W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98dd7c0-c37c-4334-94c9-46f9c6ded20c_1600x559.png 1272w, https://substackcdn.com/image/fetch/$s_!zc7W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98dd7c0-c37c-4334-94c9-46f9c6ded20c_1600x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Old and Not-So-Great Eclipse Architecture</figcaption></figure></div><p>We did this primarily because it made the settlement process easier. For a naive implementation of optimistic SVM settlement, a custom settlement layer gave us the optionality to introduce custom precompiles or other operations to facilitate the settlement of our rollups. It also would've been cheaper than using Ethereum as a settlement layer.</p><p>But we always wanted to use Ethereum. A good settlement layer has a lot of native liquidity, high security (both safety and liveness), easy verifiability, and credible neutrality. Ethereum checks all boxes. ETH is the lingua franca of crypto: it's how we pay our gas, denominate our trades, and purchase our NFTs. 
Bitcoin is the only chain that's competitive on those properties, and Bitcoin doesn't have the functionality needed to support enshrined settlement.</p><p>Our engineering team made quick progress on our <a href="https://mirror.xyz/eclipsemainnet.eth/me7bXLWJDS177V6nl8j1uzF1mxpX6nbGOLNeyBAwXgs">zk-VM</a>, which made zk-fault proofs on Ethereum feasible. And it turns out settlement is pretty cheap, even on Ethereum. An optimistic rollup pays on the order of ~<a href="https://x.com/neelsalami/status/1688718618660622336?s=20">$5 per day</a> to Ethereum. For these reasons, we abandoned our L2 settlement layer, and instead we opted to use Ethereum for settlement.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KaxN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9b0e6c9-6e12-4f23-bef4-f038d6b3ed55_1600x369.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KaxN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9b0e6c9-6e12-4f23-bef4-f038d6b3ed55_1600x369.png 424w, https://substackcdn.com/image/fetch/$s_!KaxN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9b0e6c9-6e12-4f23-bef4-f038d6b3ed55_1600x369.png 848w, https://substackcdn.com/image/fetch/$s_!KaxN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9b0e6c9-6e12-4f23-bef4-f038d6b3ed55_1600x369.png 1272w, https://substackcdn.com/image/fetch/$s_!KaxN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9b0e6c9-6e12-4f23-bef4-f038d6b3ed55_1600x369.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KaxN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9b0e6c9-6e12-4f23-bef4-f038d6b3ed55_1600x369.png" width="1456" height="336" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c9b0e6c9-6e12-4f23-bef4-f038d6b3ed55_1600x369.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:336,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:253814,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KaxN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9b0e6c9-6e12-4f23-bef4-f038d6b3ed55_1600x369.png 424w, https://substackcdn.com/image/fetch/$s_!KaxN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9b0e6c9-6e12-4f23-bef4-f038d6b3ed55_1600x369.png 848w, https://substackcdn.com/image/fetch/$s_!KaxN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9b0e6c9-6e12-4f23-bef4-f038d6b3ed55_1600x369.png 1272w, https://substackcdn.com/image/fetch/$s_!KaxN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9b0e6c9-6e12-4f23-bef4-f038d6b3ed55_1600x369.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Eclipse Mainnet Architecture</figcaption></figure></div><p>And as 
mentioned above, it's difficult for a non-Ethereum settlement layer to be economically sustainable, because settlement layers generate very little revenue directly. Settlement transactions are super cheap, especially for optimistic rollups. It's just writing a handful of bytes (a state commitment) to the settlement layer periodically.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZRPa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5781de-5987-44f3-8f4b-97f7c5d96a44_1182x1174.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZRPa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5781de-5987-44f3-8f4b-97f7c5d96a44_1182x1174.png 424w, https://substackcdn.com/image/fetch/$s_!ZRPa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5781de-5987-44f3-8f4b-97f7c5d96a44_1182x1174.png 848w, https://substackcdn.com/image/fetch/$s_!ZRPa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5781de-5987-44f3-8f4b-97f7c5d96a44_1182x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!ZRPa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5781de-5987-44f3-8f4b-97f7c5d96a44_1182x1174.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZRPa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5781de-5987-44f3-8f4b-97f7c5d96a44_1182x1174.png" width="1182" height="1174" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd5781de-5987-44f3-8f4b-97f7c5d96a44_1182x1174.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1174,&quot;width&quot;:1182,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:416170,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZRPa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5781de-5987-44f3-8f4b-97f7c5d96a44_1182x1174.png 424w, https://substackcdn.com/image/fetch/$s_!ZRPa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5781de-5987-44f3-8f4b-97f7c5d96a44_1182x1174.png 848w, https://substackcdn.com/image/fetch/$s_!ZRPa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5781de-5987-44f3-8f4b-97f7c5d96a44_1182x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!ZRPa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5781de-5987-44f3-8f4b-97f7c5d96a44_1182x1174.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://twitter.com/effortcapital/status/1688907679933014016?s=46&amp;t=dLgKzm8V9vF6gP2FmfUQdw">@zmanian on Twitter</a></figcaption></figure></div><p>The only way for a settlement layer to be economically sustainable is by indirect value capture. Most importantly, ETH becomes the de facto money everyone holds. (Ethereum also has native L1 transactions which generate gas fees. But I suspect this is not nearly as important for ETH as its "moneyness.")</p><p>Trying to build a new settlement layer at this point feels like a complicated and unnecessary form of lock-in. 
Just use Ethereum.</p><h2>We Can "Have Our Cake And Eat It Too"</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dfSJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ed1463-75c5-45f8-a732-1d016a081a70_1194x1252.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dfSJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ed1463-75c5-45f8-a732-1d016a081a70_1194x1252.png 424w, https://substackcdn.com/image/fetch/$s_!dfSJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ed1463-75c5-45f8-a732-1d016a081a70_1194x1252.png 848w, https://substackcdn.com/image/fetch/$s_!dfSJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ed1463-75c5-45f8-a732-1d016a081a70_1194x1252.png 1272w, https://substackcdn.com/image/fetch/$s_!dfSJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ed1463-75c5-45f8-a732-1d016a081a70_1194x1252.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dfSJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ed1463-75c5-45f8-a732-1d016a081a70_1194x1252.png" width="1194" height="1252" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29ed1463-75c5-45f8-a732-1d016a081a70_1194x1252.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1252,&quot;width&quot;:1194,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:416082,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dfSJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ed1463-75c5-45f8-a732-1d016a081a70_1194x1252.png 424w, https://substackcdn.com/image/fetch/$s_!dfSJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ed1463-75c5-45f8-a732-1d016a081a70_1194x1252.png 848w, https://substackcdn.com/image/fetch/$s_!dfSJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ed1463-75c5-45f8-a732-1d016a081a70_1194x1252.png 1272w, https://substackcdn.com/image/fetch/$s_!dfSJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ed1463-75c5-45f8-a732-1d016a081a70_1194x1252.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://x.com/cburniske/status/1598043017071718400?s=20">@cburniske on Twitter</a></figcaption></figure></div><p>Finally, our learnings from this past year coupled with several technical advancements brought us to the Eclipse Mainnet architecture. 
It's a shared general-purpose L2 that addresses the challenges app developers actually face without sacrificing UX or fragmenting liquidity.</p><p>We're excited to build in public and support the cutting-edge apps that developers build, kicking off a new wave of innovation on Ethereum.</p>]]></content:encoded></item><item><title><![CDATA[Rollups-as-a-Service Are Going To Zero]]></title><description><![CDATA[The app-specific rollup space is poorly defined, so I have taken it upon myself to define the market landscape and explain the economics for the uninitiated.]]></description><link>https://www.neelsomaniblog.com/p/rollups-as-a-service-are-going-to</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/rollups-as-a-service-are-going-to</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Wed, 09 Aug 2023 01:13:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ca973768-3629-4430-8c8e-8a0326c879b8_500x500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Long live Rollups-as-a-Service.</h2><p><em>This blog post is adapted from a <a href="https://www.youtube.com/watch?v=EIekN6przb0">presentation</a> that I gave at Modular Summit.</em></p><p>At Eclipse, we're building customizable app-specific rollup infrastructure to support verticals like gaming &amp; social, DePIN, and DeFi.</p><p>Since we've been working on this for ~10 months, I feel compelled to push back on misconceptions in the space. Here are some thoughts about app-specific rollups:</p><h2>Existing market segmentations have it wrong.</h2><h3>Rollup frameworks aren't charities.</h3><p>Rollup frameworks like OP Stack are codebases that implement the key components of a rollup. They're not going to charge you to use their code, but they need to capture value somehow.
At a high level, there are three places to capture value:</p><ol><li><p>Execution: sequencing transactions, executing, and (for a zk-rollup) proving</p></li><li><p>Settlement: bridging and verifying validity proofs or fault proofs</p></li><li><p>Data availability: publishing the order of transactions</p></li></ol><p>But only execution is suitable as the rollup framework's business model:</p><ul><li><p>Settlement: Post-Bedrock, Optimism only pays <a href="https://dune.com/optimismfnd/optimism-l1-batch-submission-fees-security-costs">~$5 a day</a> to Ethereum for settlement. The rest of the OP Stack costs are from posting blocks and the associated overhead. A competitive settlement layer would likely earn even less.</p></li><li><p>Data availability: A fragmented DA layer will have less stake securing the network compared to a shared DA layer such as Celestia. Many rollups don't want to move their DA off of Ethereum anyway because they would sacrifice their Ethereum-alignment.</p></li></ul><p>Any market segmentation should also include rollup frameworks in at least one category related to execution, and any product that offers execution is competitive with the rollup framework.</p><h3>Isolated Rollups-as-a-Service aren't defensible.</h3><p>The naive interpretation of RaaS is actually <strong>isolated Sequencers-as-a-Service</strong> (iSaaS). These are companies who have no protocol of their own, but they're deploying existing open-source rollup frameworks and running a sequencer. OP Stack has a partnership with an iSaaS.</p><p>The business model for iSaaS is to charge some recurring fiat amount in addition to some percent of sequencer fees. (Additional support services, consulting, or custom feature development don't represent scalable business models.) 
To be clear, this would be a direct competitor to shared sequencer networks such as Espresso, Astria, Radius, and more, but iSaaS providers have some fatal disadvantages.</p><p>A big problem with iSaaS is that it is at odds with the rollup framework. As described above, an optimistic rollup framework like OP Stack has to monetize via sequencer fees. (A zk-rollup framework might be okay with forgoing sequencer fees and keeping only prover fees.)</p><p>Other high-level problems with such a business are that it is commoditized, the market is easy to enter, and, unlike a shared sequencer, it has no network effects. iSaaS lacks the economies of scale of a shared sequencer since each sequencer is isolated.</p><h3>Optimistic rollup frameworks must offer their own sequencer-as-a-service.</h3><p>To play nice, the iSaaS might return sequencer fees to the optimistic rollup framework, keeping only the recurring fiat payment for itself.</p><p>But now the iSaaS and the rollup framework must both independently be profitable. For a large enterprise, the ideal pricing would be a high recurring fiat payment but a low sequencer fee. But the iSaaS doesn't have the flexibility to decrease sequencer fees, since the sequencer fees aren't theirs to begin with; they're passed back to the rollup framework. If the iSaaS doesn't share revenue with the rollup framework, the rollup framework can deploy its own iSaaS and likely penetrate the market more deeply due to established trust.</p><p>The reason so many iSaaS are popping up is that the model seems attractive to the unsophisticated reader. It looks like SaaS, so a non-crypto investor might find it easier to reason about the fiat revenue. But iSaaS will have difficulty competing with a rollup framework that runs its own sequencer-as-a-service, which has protocol-native revenue and a token.
The latter has more optionality in pricing, and the token can be used to subsidize customer acquisition costs and <a href="https://mirror.xyz/electriccap.eth/SD0wT7qSSfis9gLT_Ki1gY6_oTYEqgwcGE0hDw7kMDY">fixed costs</a> of running a chain (described below) for promising projects, which pays itself off as protocol native revenue.</p><p>Protocol-native network effects and amortized fixed costs will create stronger unit economics for protocols with traction, making rollup providers somewhat winner-takes-all.</p><h3>Refined Market Maps</h3><p>Now I can show how I'd adjust the graphic in this <a href="https://messari.io/report/the-rollups-as-a-service-ecosystem">Messari piece</a>, which I thought looked reasonable at the time:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!em-v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b97248e-61c5-4107-99f2-1cd15db9677c_799x673.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!em-v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b97248e-61c5-4107-99f2-1cd15db9677c_799x673.png 424w, https://substackcdn.com/image/fetch/$s_!em-v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b97248e-61c5-4107-99f2-1cd15db9677c_799x673.png 848w, https://substackcdn.com/image/fetch/$s_!em-v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b97248e-61c5-4107-99f2-1cd15db9677c_799x673.png 1272w, 
https://substackcdn.com/image/fetch/$s_!em-v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b97248e-61c5-4107-99f2-1cd15db9677c_799x673.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!em-v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b97248e-61c5-4107-99f2-1cd15db9677c_799x673.png" width="799" height="673" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b97248e-61c5-4107-99f2-1cd15db9677c_799x673.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:673,&quot;width&quot;:799,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:517226,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!em-v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b97248e-61c5-4107-99f2-1cd15db9677c_799x673.png 424w, https://substackcdn.com/image/fetch/$s_!em-v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b97248e-61c5-4107-99f2-1cd15db9677c_799x673.png 848w, https://substackcdn.com/image/fetch/$s_!em-v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b97248e-61c5-4107-99f2-1cd15db9677c_799x673.png 1272w, 
https://substackcdn.com/image/fetch/$s_!em-v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b97248e-61c5-4107-99f2-1cd15db9677c_799x673.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Messari Market Map</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!P1t7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12de9d87-f4f1-433e-bd4a-47943d7891d7_1350x950.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P1t7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12de9d87-f4f1-433e-bd4a-47943d7891d7_1350x950.png 424w, https://substackcdn.com/image/fetch/$s_!P1t7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12de9d87-f4f1-433e-bd4a-47943d7891d7_1350x950.png 848w, https://substackcdn.com/image/fetch/$s_!P1t7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12de9d87-f4f1-433e-bd4a-47943d7891d7_1350x950.png 1272w, https://substackcdn.com/image/fetch/$s_!P1t7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12de9d87-f4f1-433e-bd4a-47943d7891d7_1350x950.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P1t7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12de9d87-f4f1-433e-bd4a-47943d7891d7_1350x950.png" width="1350" height="950" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12de9d87-f4f1-433e-bd4a-47943d7891d7_1350x950.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:950,&quot;width&quot;:1350,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60493,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P1t7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12de9d87-f4f1-433e-bd4a-47943d7891d7_1350x950.png 424w, https://substackcdn.com/image/fetch/$s_!P1t7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12de9d87-f4f1-433e-bd4a-47943d7891d7_1350x950.png 848w, https://substackcdn.com/image/fetch/$s_!P1t7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12de9d87-f4f1-433e-bd4a-47943d7891d7_1350x950.png 1272w, https://substackcdn.com/image/fetch/$s_!P1t7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12de9d87-f4f1-433e-bd4a-47943d7891d7_1350x950.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Refined Messari Market Map</figcaption></figure></div><p>I'd rename the No Code Deployment category, and I would rename Rollup SDKs to Rollup Frameworks, because many rollup frameworks don't provide a full SDK to developers. 
I would also modify this Celestia ecosystem diagram:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3S6Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e73f9e3-15a7-4d5b-8ef2-601f2ab8dacc_1000x582.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3S6Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e73f9e3-15a7-4d5b-8ef2-601f2ab8dacc_1000x582.jpeg 424w, https://substackcdn.com/image/fetch/$s_!3S6Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e73f9e3-15a7-4d5b-8ef2-601f2ab8dacc_1000x582.jpeg 848w, https://substackcdn.com/image/fetch/$s_!3S6Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e73f9e3-15a7-4d5b-8ef2-601f2ab8dacc_1000x582.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!3S6Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e73f9e3-15a7-4d5b-8ef2-601f2ab8dacc_1000x582.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3S6Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e73f9e3-15a7-4d5b-8ef2-601f2ab8dacc_1000x582.jpeg" width="1000" height="582" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e73f9e3-15a7-4d5b-8ef2-601f2ab8dacc_1000x582.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:203129,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3S6Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e73f9e3-15a7-4d5b-8ef2-601f2ab8dacc_1000x582.jpeg 424w, https://substackcdn.com/image/fetch/$s_!3S6Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e73f9e3-15a7-4d5b-8ef2-601f2ab8dacc_1000x582.jpeg 848w, https://substackcdn.com/image/fetch/$s_!3S6Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e73f9e3-15a7-4d5b-8ef2-601f2ab8dacc_1000x582.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!3S6Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e73f9e3-15a7-4d5b-8ef2-601f2ab8dacc_1000x582.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Celestia Ecosystem Map</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x8KY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2c3cd1b-745e-4dbc-a7d9-7fce416410ce_1100x450.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x8KY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2c3cd1b-745e-4dbc-a7d9-7fce416410ce_1100x450.jpeg 424w, https://substackcdn.com/image/fetch/$s_!x8KY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2c3cd1b-745e-4dbc-a7d9-7fce416410ce_1100x450.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!x8KY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2c3cd1b-745e-4dbc-a7d9-7fce416410ce_1100x450.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!x8KY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2c3cd1b-745e-4dbc-a7d9-7fce416410ce_1100x450.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x8KY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2c3cd1b-745e-4dbc-a7d9-7fce416410ce_1100x450.jpeg" width="1100" height="450" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2c3cd1b-745e-4dbc-a7d9-7fce416410ce_1100x450.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44365,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x8KY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2c3cd1b-745e-4dbc-a7d9-7fce416410ce_1100x450.jpeg 424w, https://substackcdn.com/image/fetch/$s_!x8KY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2c3cd1b-745e-4dbc-a7d9-7fce416410ce_1100x450.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!x8KY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2c3cd1b-745e-4dbc-a7d9-7fce416410ce_1100x450.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!x8KY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2c3cd1b-745e-4dbc-a7d9-7fce416410ce_1100x450.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Refined Celestia Ecosystem Map</figcaption></figure></div><p>I'd remove Rollups-as-a-Service, Settlement Layers, and 
Virtual Machines. Projects in the Rollup Framework bucket will almost certainly have to place themselves in another category as well, because otherwise they can't monetize.</p><h2>No Free Lunch: Economic and Technical Limits</h2><h3>Most apps should not have their own rollup.</h3><p>The easiest way to demonstrate the economics of app-specific rollups is by looking at a live rollup: Optimism (post-Bedrock). Props to the Optimism team for making this <a href="https://dune.com/optimismfnd/optimism-l1-batch-submission-fees-security-costs">Dune dashboard</a>.</p><p>The following assumes a ~25 gwei gas price on Ethereum:</p><ol><li><p>One-time cost of deployment for an OP Stack mainnet chain: ~1 ETH</p></li><li><p>Fixed cost of an OP Stack chain, even if 0 transactions are run: ~0.5 ETH a day</p></li><li><p>Variable cost: 7.5 * 10^-5 ETH per transaction</p></li></ol><p>To get the fixed cost, I took the average overhead cost per transaction, multiplied by the number of transactions run that day, and confirmed the result by running an OP Stack chain on mainnet.</p><p>This variable cost is cheap but not quite Solana-level cheap, and the fixed cost can be amortized over many transactions. In the future with EIP-4844, we might generously assume this cost comes down by 10x. Still, assuming a $2,000 ETH price, this represents something like a $0.015 lower bound per transaction plus some amortized fixed cost.</p><p>We might consider something like 0.00001 ETH (~$0.02 at the time of writing) as a reasonable transaction markup to cover this fixed cost, so we need 50,000 transactions per day for an app-specific rollup to make sense. The price for each transaction is roughly $0.17 before EIP-4844, and optimistically $0.03 after.
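</p><p>As a sanity check, these figures can be recomputed directly. The following is a rough sketch using only the estimates stated above (the variable cost, markup, and daily fixed cost); none of these inputs are new measurements:</p>

```python
# Sanity check of the app-specific rollup unit economics described above.
# All inputs are this post's rough estimates, not measured data.
ETH_USD = 2_000                  # assumed ETH price
FIXED_COST_ETH_PER_DAY = 0.5     # OP Stack chain overhead, even with zero usage
VARIABLE_COST_ETH = 7.5e-5       # per-transaction cost, pre-EIP-4844
MARKUP_ETH = 1e-5                # per-transaction markup to cover the fixed cost

# Daily transactions needed for the markup to pay for the fixed cost
break_even_tx_per_day = FIXED_COST_ETH_PER_DAY / MARKUP_ETH

# All-in per-transaction price, before and (assuming a generous 10x
# cost reduction) after EIP-4844
price_pre_4844 = (VARIABLE_COST_ETH + MARKUP_ETH) * ETH_USD
price_post_4844 = (VARIABLE_COST_ETH / 10 + MARKUP_ETH) * ETH_USD

print(f"break-even: {break_even_tx_per_day:,.0f} tx/day")
print(f"pre-4844:  ${price_pre_4844:.2f} per tx")
print(f"post-4844: ${price_post_4844:.3f} per tx")
```

<p>At 50,000 transactions a day, the 0.00001 ETH markup exactly covers the 0.5 ETH daily fixed cost, which is where the break-even figure comes from.</p><p>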
We might add a small premium so it's economical for a (shared) sequencer to support the chain.</p><p>So as cool as something like <a href="https://twitter.com/doganeth_en/status/1640062610161688577">Opclave</a> is (I really like the idea, we're chatting with Dogan's team and we might incorporate this feature into Eclipse rollups), it doesn't make sense as a mainnet OP Stack chain. The constraint here is that OP Stack chains are anchored to Ethereum, which has expensive blockspace, and Optimism is intent on Ethereum-alignment.</p><p>With these unit economics in mind, small DeFi dApps and NFT projects don't make much sense as their own chains. These dApps might instead subsidize gas costs if the long-run unit economics of an Ethereum L2 work for them, or knowingly take a loss on their app chain.</p><p>If an app requires very high transaction volume, then an Ethereum-anchored rollup doesn't work either, because a transaction fee greater than $0.01 is likely too high.
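</p><p>The unit economics above can be sketched in a few lines of Python (a rough back-of-envelope script; the constants are this post's estimates, not live data):</p>

```python
# Back-of-envelope OP Stack rollup economics, using the estimates above.
ETH_PRICE_USD = 2_000
FIXED_COST_ETH_PER_DAY = 0.5       # batch-posting cost even with zero transactions
VARIABLE_COST_ETH_PER_TX = 7.5e-5  # L1 cost per transaction
MARKUP_ETH_PER_TX = 1e-5           # per-transaction premium to cover the fixed cost

# Daily transactions needed for the markup to cover the fixed cost (~50,000)
break_even_txs = FIXED_COST_ETH_PER_DAY / MARKUP_ETH_PER_TX

# Per-transaction price before EIP-4844 (~$0.17), and optimistically after,
# assuming the variable cost drops 10x (~$0.035)
price_usd = (VARIABLE_COST_ETH_PER_TX + MARKUP_ETH_PER_TX) * ETH_PRICE_USD
price_4844_usd = (VARIABLE_COST_ETH_PER_TX / 10 + MARKUP_ETH_PER_TX) * ETH_PRICE_USD
```

<p>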
These kinds of apps would require a novel approach such as what Eclipse is building with our highly parallelized virtual machine and sovereign rollup architecture.</p><h3>Customizable rollups must be constrained.</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!re5B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7733b42f-ba35-4c41-a7bd-47aed5fcc128_1700x550.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!re5B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7733b42f-ba35-4c41-a7bd-47aed5fcc128_1700x550.png 424w, https://substackcdn.com/image/fetch/$s_!re5B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7733b42f-ba35-4c41-a7bd-47aed5fcc128_1700x550.png 848w, https://substackcdn.com/image/fetch/$s_!re5B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7733b42f-ba35-4c41-a7bd-47aed5fcc128_1700x550.png 1272w, https://substackcdn.com/image/fetch/$s_!re5B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7733b42f-ba35-4c41-a7bd-47aed5fcc128_1700x550.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!re5B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7733b42f-ba35-4c41-a7bd-47aed5fcc128_1700x550.png" width="1456" height="471" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7733b42f-ba35-4c41-a7bd-47aed5fcc128_1700x550.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:471,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:180395,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!re5B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7733b42f-ba35-4c41-a7bd-47aed5fcc128_1700x550.png 424w, https://substackcdn.com/image/fetch/$s_!re5B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7733b42f-ba35-4c41-a7bd-47aed5fcc128_1700x550.png 848w, https://substackcdn.com/image/fetch/$s_!re5B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7733b42f-ba35-4c41-a7bd-47aed5fcc128_1700x550.png 1272w, https://substackcdn.com/image/fetch/$s_!re5B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7733b42f-ba35-4c41-a7bd-47aed5fcc128_1700x550.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: <a href="https://stack.optimism.io/docs/build/hacks/">Introduction to OP Stack Hacks</a></figcaption></figure></div><p>As mentioned in the screenshot above, OP Hacks won't be included as part of Optimism's Superchain. That makes sense: in order to properly settle or provide stateful sequencing for a rollup, we need some invariants to hold. Any modifications also need an audit before they can support real economic value.</p><p>Another good reason to constrain app-specific rollups comes from the adoption of Cosmos. The Cosmos SDK is incredibly generic, and yet it never inspired the plethora of diverse chains that you might expect. This could be because customization requires too much technical sophistication, or more likely because the long tail of applications is well served by a handful of architectures.
On the other hand, sector-specific templates can solve popular pain points for different verticals and provide repeatable architectures.</p><p>I'm interested to hear the community's thoughts. Feel free to reach out via Twitter <a href="https://twitter.com/neelsomani">@neelsomani</a> or <a href="https://twitter.com/EclipseFND">@EclipseFND</a>.</p>]]></content:encoded></item><item><title><![CDATA[An Alternate Interchain Security Proposal]]></title><description><![CDATA[This post outlines a Cosmos interchain security proposal that involves a CLI tool and a novel fee market, streamlining the process to launch an app chain.]]></description><link>https://www.neelsomaniblog.com/p/an-alternate-interchain-security</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/an-alternate-interchain-security</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Wed, 06 Jul 2022 01:20:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/789a398e-1e28-4458-a6d5-5c86f418c67b_400x400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this post, I give two recommendations to the Cosmos ecosystem:</p><ol><li><p>A CLI script to easily deploy a new app chain</p></li><li><p>A novel fee market for interchain security</p></li></ol><h2>Where Is Everyone?</h2><p>After the Terra de-peg, I put my <a href="https://www.neelsomani.com/blog/future-of-terra-defi.php">Terra EVM</a> project on pause. I was thinking about where to build next, and I found I have issues with almost every ecosystem:</p><ul><li><p>Ethereum is incredibly slow at 15 TPS</p></li><li><p>Solana has known <a href="https://cointelegraph.com/news/solana-suffers-7th-outage-in-2022-as-bots-invade-the-network">stability issues</a></p></li><li><p>Cosmos has low TVL ($600m) and its activity is dwarfed by Ethereum's</p></li></ul><p>But I like Cosmos the best architecturally.
The app chain thesis has been vindicated by the recent <a href="https://dydx.exchange/blog/dydx-chain">deployment of dYdX</a>, an indictment of the theory that everything would soon become an Ethereum rollup. There are several advantages to making a Cosmos app chain vs. a dApp elsewhere:</p><ul><li><p>Tendermint has <a href="https://galois.com/blog/2021/07/formally-verifying-the-tendermint-blockchain-protocol/">formally verified</a> liveness.</p></li><li><p>If a serious error occurs, your governance can vote to roll back the chain or take remedial action. This isn't possible if a tragedy happens on a chain like Ethereum, such as the $30m <a href="https://blog.openzeppelin.com/on-the-parity-wallet-multisig-hack-405a8c12e8f7/">Parity multisig hack</a>.</p></li><li><p>There is no congestion or competition for block space with other dApps.</p></li><li><p>With ABCI, you can theoretically use whatever language you want.</p></li><li><p>App chains avoid state bloat, which every monolithic L1 will have to deal with.</p></li></ul><p>So it raises the question: Why isn't everyone deploying as a Cosmos app chain? Moreover, why do people still prefer deploying their dApps as smart contracts on general-purpose chains?</p><h2>The Status Quo</h2><h3>It's too complicated to launch an app chain.</h3><p>A developer who writes Solidity dApps must now <a href="https://tutorials.cosmos.network/academy/1-what-is-cosmos/">learn about</a> Tendermint, ABCI, and the Cosmos SDK. Even with the help of <a href="https://docs.ignite.com/guide/hello#say-hello-ignite-cli">Ignite CLI</a>, just to instantiate a "Hello World" application, we need to modify protocol buffers and learn about <a href="https://docs.cosmos.network/master/building-modules/keeper.html">Keeper</a>.
And deploying to production is a beast of its own: there is no <code>ignite deploy</code> command.</p><h3>You can't bootstrap a validator set.</h3><p>After implementing your app with the Cosmos SDK and launching your chain, now you need to bootstrap your validator set. The difficulty here is that no one wants to validate for a token with no known economic value.</p><p>The <a href="https://interchainsecurity.dev/">current proposal</a> for interchain security: you basically apply for interchain security, and the governance for the "provider chain" votes on whether they want to validate for you. $ATOM delegators and validators are rewarded via additional fees from your chain via the <a href="https://github.com/cosmos/gaia/blob/main/docs/interchain-security.md#provider-chain-distribution-module">distribution module</a>.</p><p>What's good about the current proposal is that it creates a use case for $ATOM. $ATOM certainly needs use cases beyond governance. Some issues:</p><ul><li><p>Security: Security for your chain is actually better if you use your own token. You can simply hold a large percentage of tokens yourself and make an attack prohibitively expensive or impossible.</p></li><li><p>Flexibility: Once interchain security is turned on for a chain, an individual validator cannot opt out. They're stuck validating for this new chain. (I've been told ICS v2 will allow individual validators to opt in or out.)</p></li><li><p>Resource efficiency: Some consumer chains are resource-intensive to validate, and some validator nodes have greater resources than others. Since all validators from the provider chain must now validate for the consumer chain, we fail to capture this property.</p></li><li><p>Startup time: A governance vote takes time to process.</p></li></ul><h2>The Alternate Proposal</h2><h3>Simplify launching an app chain.</h3><p>The first step is to greatly simplify the process of deploying an app chain with a script. 
It has to be as easy as deploying a Solidity smart contract to Ethereum, similar to what Informal is describing with their <a href="https://informal.systems/2022/05/09/building-with-interchain-security/">CLI tool</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!suSQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86efaa1a-3259-4f77-9cb4-24688fac4183_557x340.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!suSQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86efaa1a-3259-4f77-9cb4-24688fac4183_557x340.png 424w, https://substackcdn.com/image/fetch/$s_!suSQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86efaa1a-3259-4f77-9cb4-24688fac4183_557x340.png 848w, https://substackcdn.com/image/fetch/$s_!suSQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86efaa1a-3259-4f77-9cb4-24688fac4183_557x340.png 1272w, https://substackcdn.com/image/fetch/$s_!suSQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86efaa1a-3259-4f77-9cb4-24688fac4183_557x340.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!suSQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86efaa1a-3259-4f77-9cb4-24688fac4183_557x340.png" width="557" height="340" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/86efaa1a-3259-4f77-9cb4-24688fac4183_557x340.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:340,&quot;width&quot;:557,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:35500,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!suSQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86efaa1a-3259-4f77-9cb4-24688fac4183_557x340.png 424w, https://substackcdn.com/image/fetch/$s_!suSQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86efaa1a-3259-4f77-9cb4-24688fac4183_557x340.png 848w, https://substackcdn.com/image/fetch/$s_!suSQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86efaa1a-3259-4f77-9cb4-24688fac4183_557x340.png 1272w, https://substackcdn.com/image/fetch/$s_!suSQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86efaa1a-3259-4f77-9cb4-24688fac4183_557x340.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Make a fee market for interchain security.</h3><p>We can solve the issues with the current interchain security proposal by making it a market, creating an <strong>even better use case for $ATOM</strong>, although theoretically this works for any token.</p><p>From a validator's perspective, you make an ask of the form: "In exchange for you bonding $x of $ATOM to me, I will provide 1 validator on your app chain."</p><p>For a consumer chain that needs interchain security, the bid structure is more interesting. I would model the problem as each consumer chain requiring k_c validators and being willing to pay any price to get them (inelastic demand). In this case, what should be the price for each validator?</p><p>The competitive market structure here is a uniform clearing-price (UCP) auction.
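</p><p>A minimal sketch of that clearing rule in Python (the function and example numbers are illustrative, not from any existing implementation):</p>

```python
# Uniform clearing-price auction for validator slots: validators post asks
# (price per slot, in ATOM), and consumer chains demand a total number of slots.
def ucp_clearing_price(asks: list[float], demand: int) -> float:
    """Sort asks ascending; the demand-th cheapest ask sets the single
    price paid to every validator whose ask was at or below it."""
    if demand > len(asks):
        raise ValueError("not enough validators to fill demand")
    return sorted(asks)[demand - 1]

# Five validators post asks; consumer chains collectively want 3 slots.
# The three cheapest asks (5.0, 7.0, 9.0) are filled at the uniform price 9.0.
print(ucp_clearing_price([12.0, 5.0, 9.0, 20.0, 7.0], demand=3))
```

<p>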
We order the validators from cheapest to most expensive, and if there is demand for k* total validators, then the price that the k*th validator offered is the price for all validators. This market structure incentivizes validators to make their ask price as low as possible, and it enables orders to be filled quickly.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6KGA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c0a60-5c05-4a01-9e67-ddecfab7c9d4_500x500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6KGA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c0a60-5c05-4a01-9e67-ddecfab7c9d4_500x500.png 424w, https://substackcdn.com/image/fetch/$s_!6KGA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c0a60-5c05-4a01-9e67-ddecfab7c9d4_500x500.png 848w, https://substackcdn.com/image/fetch/$s_!6KGA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c0a60-5c05-4a01-9e67-ddecfab7c9d4_500x500.png 1272w, https://substackcdn.com/image/fetch/$s_!6KGA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c0a60-5c05-4a01-9e67-ddecfab7c9d4_500x500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6KGA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c0a60-5c05-4a01-9e67-ddecfab7c9d4_500x500.png" width="500" height="500" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/271c0a60-5c05-4a01-9e67-ddecfab7c9d4_500x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:11699,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6KGA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c0a60-5c05-4a01-9e67-ddecfab7c9d4_500x500.png 424w, https://substackcdn.com/image/fetch/$s_!6KGA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c0a60-5c05-4a01-9e67-ddecfab7c9d4_500x500.png 848w, https://substackcdn.com/image/fetch/$s_!6KGA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c0a60-5c05-4a01-9e67-ddecfab7c9d4_500x500.png 1272w, https://substackcdn.com/image/fetch/$s_!6KGA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F271c0a60-5c05-4a01-9e67-ddecfab7c9d4_500x500.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We could modify the structure to allow consumer chains to include a bid price: "In exchange for $x of $ATOM, you will serve as a validator for my app chain." This scheme allows validators and consumer chains to place orders consistent with their resource specifications. The market clearing price for validators is simply where the supply and demand curves intersect. The result is something like a decentralized Amazon Web Services for validators to provide app-specific compute.</p><p>When you think your token has enough economic value to get your own validator set, you can simply unbond your $ATOM, and your old validators will fill the next orders placed to the fee market consistent with their ask. As the fee market becomes more popular, we might include a devops environment for validators to quickly spin up new nodes for networks without eventual resource starvation.</p><h2>Who's In Charge?</h2><p>Where would we run a fee market like this?
The most natural option might be the Hub itself, but dApps like Gravity Bridge and Gravity DEX have proven to be controversial. The right answer might be a dedicated chain that supports a central limit order book.</p><p>I wanted to keep this post short, but I am interested to hear the Cosmos community's thoughts. Let me know what you think on Twitter <a href="https://twitter.com/neelsomani">@neelsomani</a> or email at neeljaysomani [at] gmail.com.</p>]]></content:encoded></item><item><title><![CDATA[The Future of Terra DeFi]]></title><description><![CDATA[Terra is a rapidly growing blockchain offering UST, the largest uncollateralized stablecoin ever. In this post, I give recommendations to the Terra ecosystem and propose Terranova: an EVM on Terra.]]></description><link>https://www.neelsomaniblog.com/p/the-future-of-terra-defi</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/the-future-of-terra-defi</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Wed, 06 Apr 2022 01:24:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4805cfe0-d9f8-4ebb-abdc-62632a5fb26b_650x650.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>tl;dr: 75% of $UST is held in Anchor because of its high fixed yield. When the yield drops, in order for $UST to hold its peg, there need to be organic use cases for $UST. 
I propose a Terra EVM (<a href="https://www.terranova.finance/">Terranova</a>) in addition to some novel DeFi infrastructure below.</em></p><p>History will view algorithmic stablecoins as either a) a disaster as inevitable as the subprime mortgage crisis or b) the greatest recent innovation in financial history.</p><p>In this post, I will give an economic analysis of Terra's $UST, leading to concrete recommendations for the Terra ecosystem.</p><h2>A Brief History of Terra</h2><p>If you're already familiar with Terra, you can skip this section.</p><p>Context for the uninitiated: $UST is the largest <a href="https://www.coindesk.com/tech/2021/07/06/the-quest-for-a-truly-decentralized-stablecoin/">algorithmic stablecoin</a>, with other examples being <a href="https://bean.money/">Beanstalk</a> and Basis (rest in peace). Much of the draw of $UST comes from the high fixed yield available to $UST holders via Anchor Protocol, Terra's lending platform: if you leave your money in $UST for one year, Anchor will guarantee 20% APY. And this isn't 20% of imaginary Monopoly money, because $UST is pegged to the US dollar.</p><p>So what's the problem? Well, some people say that 20% can't last forever. Anchor can only afford to pay out that 20% yield by loaning out $UST, and if it can't make the 20% through cash flows, the difference comes from its <a href="https://thedefiant.io/anchor-all-time-high-tvl-savings-luna/">reserve</a>, which is being depleted. Eventually, the Anchor Protocol will not be able to provide this 20% APY. Recently, Anchor adopted a proposal for a <a href="https://forum.anchorprotocol.com/t/dynamic-anchor-earn-rate/3042">dynamic rate</a>, which will gracefully reduce the APY.</p><p>When the APY drops, people might sell their $UST. This is a big deal, since about <strong>75% of <a href="https://coinmarketcap.com/currencies/terrausd/">all $UST</a> is locked in <a href="https://app.anchorprotocol.com/">Anchor</a></strong>!
While Terra will attempt to maintain the $UST peg by minting $LUNA, this <a href="https://mirror.xyz/damsondao.eth/OVeBrmrfcWm7uKLlA2Q4W1XTVkFU3cMKfNWhgf7mQuM">great article</a> by <a href="https://twitter.com/damsondao">@damsondao</a> explains the risk of a death spiral:</p><p><em>"UST redemptions in favor of LUNA that is being sold on the market by arbitrageurs leads to a significant decrease in its price, which necessitates more LUNA being minted for each UST burned, creating a hyper-inflationary loop in LUNA's supply. This then trigger [sic] a crisis of confidence in LUNA's ability to retain value that further reduces demand for UST until the mechanism implodes as it fails to adequately reduce supply and UST's peg inevitably breaks."</em></p><p>It leads to the question: <strong>How can we incentivize people to hold their $UST once the Anchor yields decline?</strong> (We can ask the same question about $LUNA, since it is used to stabilize $UST. A simplified explanation is that when $UST falls below peg, the system issues $LUNA and buys back $UST until it re-pegs. When $UST rises above peg, the system sells $UST and burns the $LUNA.)</p><h2>Make Something People Want</h2><p>A natural starting point is to ask how the US dollar does it. 
It all comes down to supply and demand:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yqyw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5041708-8467-49e2-af98-f81efc1ca7a8_341x313.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yqyw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5041708-8467-49e2-af98-f81efc1ca7a8_341x313.jpeg 424w, https://substackcdn.com/image/fetch/$s_!yqyw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5041708-8467-49e2-af98-f81efc1ca7a8_341x313.jpeg 848w, https://substackcdn.com/image/fetch/$s_!yqyw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5041708-8467-49e2-af98-f81efc1ca7a8_341x313.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!yqyw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5041708-8467-49e2-af98-f81efc1ca7a8_341x313.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yqyw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5041708-8467-49e2-af98-f81efc1ca7a8_341x313.jpeg" width="341" height="313" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5041708-8467-49e2-af98-f81efc1ca7a8_341x313.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:313,&quot;width&quot;:341,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:11828,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yqyw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5041708-8467-49e2-af98-f81efc1ca7a8_341x313.jpeg 424w, https://substackcdn.com/image/fetch/$s_!yqyw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5041708-8467-49e2-af98-f81efc1ca7a8_341x313.jpeg 848w, https://substackcdn.com/image/fetch/$s_!yqyw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5041708-8467-49e2-af98-f81efc1ca7a8_341x313.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!yqyw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5041708-8467-49e2-af98-f81efc1ca7a8_341x313.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>At any given "price" for dollars, some people decide to continue holding dollars. The US dollar must provide sufficient incentives for people to hold this currency over all others, or else the currency will fall in "price" (depreciate).</p><p>Moreover, the US dollar must be especially good at incentivizing people to hold, since it qualifies as a <a href="http://thinking.farm/essays/2021-01-17-beware-the-coupon-clipper/">coupon coin</a>. While the Federal Reserve can temporarily decrease the money supply by selling securities through open market operations, eventually the principal must be repaid in addition to any interest, so the money supply only increases in the long run.
So the supply curve in the diagram above is continually shifting right, and yet the dollar doesn't have a depreciation crisis.</p><p>In fact, the US does such a good job incentivizing us to hold $USD that most of us don't even think about whether we're going to trade our dollars for some other currency. <strong>The primary reason to hold the US dollar is that we use it in our day-to-day life: to receive our paychecks, to pay for our groceries, to buy our stocks.</strong></p><p>Which leads me to this tweet by Do Kwon:</p><blockquote><p>8/ New algo stablecoin developers need to remember that their challenge is economy building &gt; mech design.<br><br>The only way to stability [is] sustainable use cases around the stablecoins, and stability will increase as these use cases become more sticky, distributed and uncorrelated.</p><p>&#8212; Do Kwon &#127765; (@stablekwon) <a href="https://twitter.com/stablekwon/status/1405742215960219659?ref_src=twsrc%5Etfw">June 18, 2021</a></p></blockquote><p>How can we think about the sources of demand for a currency?
Here is a (very incomplete) framework:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XD4G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6992cc4-2475-4fed-8aa0-614710124e48_960x540.png"><img src="https://substackcdn.com/image/fetch/$s_!XD4G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6992cc4-2475-4fed-8aa0-614710124e48_960x540.png" width="960" height="540" alt="" loading="lazy"></a></figure></div><p>In the diagram above, the recently introduced <a href="https://www.coindesk.com/markets/2022/02/23/terras-luna-jumps-15-as-ust-stablecoin-gets-1b-bitcoin-reserve/">Bitcoin reserve</a> increases the intrinsic value of $UST. While Anchor yields will decline over time, staking yields will climb as more protocols launch on Terra.</p><h2>The Terra DeFi Ecosystem Must Grow</h2><p>Recommendations:</p><h3>1. EVM compatibility</h3><p>Composability drives network effects: the more that is built on Terra, the stickier it becomes.</p><p>In summer 2021, Binance Smart Chain usage exploded with a variety of applications that resembled popular Ethereum dApps: SushiSwap vs. Uniswap, Ellipsis vs. Curve, etc. <strong>This is why I'm building <a href="https://www.terranova.finance/">Terranova</a>: an Ethereum Virtual Machine (EVM) on Terra.</strong></p><p>EVM compatibility on Terra will win for several reasons.
In a multichain future, EVM-compatible L1s will capture the growing transaction volume and mitigate Ethereum network congestion. Terra becomes a contender as the dominant EVM-compatible chain in the Cosmos ecosystem (cf. Evmos). Ethereum dApps will interoperate easily with Terra dApps through native "cross-chain messaging" between Terranova and Terra, giving access to yields through Anchor, synthetics through Mirror, etc. And Ethereum dApps will unlock more use cases for $UST.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cqdp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7310041-e4eb-4129-9009-93707ac212fd_960x817.png"><img src="https://substackcdn.com/image/fetch/$s_!cqdp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7310041-e4eb-4129-9009-93707ac212fd_960x817.png" width="960" height="817" alt="" loading="lazy"></a></figure></div><p>An EVM on Terra brings the top Ethereum dApps (e.g., money markets, options protocols, DEXes) to Terra and empowers developers who are already familiar with Ethereum tooling. Terra also gains access to the tremendous amount of liquidity on EVM-compatible chains, since some of that liquidity is likely to avoid chains that are not EVM-compatible. If you are interested in working on this problem, email me at neeljaysomani [at] gmail.com.</p><h3>2. 
Build novel DeFi infrastructure</h3><p>I see Terra as a high-potential contender to have the most sophisticated DeFi infrastructure in crypto, creating network effects that both usher in and retain institutional capital.</p><p>Some novel areas we should explore:</p><ul><li><p>A wider variety of oracles: Given my background as a quant in power pricing, some oracles that are interesting to me involve the weather, power prices, and gas prices. Such oracles would enable institutional investors to participate in popular commodities trading instruments, such as <a href="https://quant.stackexchange.com/questions/1687/what-is-a-heat-rate-option">heat rate call options</a>, <a href="https://www.eia.gov/todayinenergy/detail.php?id=9911">spark spreads</a>, and <a href="https://www.cmegroup.com/education/articles-and-reports/weather-options-overview.html">weather options</a>. The large institutional electricity trading market is a great fit for trading on the blockchain because <a href="https://www.cnbc.com/2021/12/04/bitcoin-miners-say-theyre-fixing-texas-electric-grid-ted-cruz-agrees.html">crypto is becoming increasingly relevant for power traders</a> anyway. For example, Talen Energy (owner of the coal plant Brandon Shores) recently announced a <a href="https://talenenergy.investorroom.com/2021-08-03-Talen-Energy-Corporation-Announces-Zero-Carbon-Bitcoin-Mining-Joint-Venture-with-TeraWulf-Inc">Bitcoin mining venture</a>.</p></li><li><p>More sophisticated financial instruments: First we had Mirror for synthetics; soon we'll have <a href="https://sig.finance/sigma_wp.pdf">Sigma</a> for options. Next we should develop <a href="https://en.wikipedia.org/wiki/Futures_contract">futures</a> and forwards on Terra. These would likely rely on oracles like those described above. By combining forwards on the price of Bitcoin and the electricity price, a Bitcoin miner can hedge out their price exposure.
In addition, Terra's native stablecoins are perfect for constructing derivatives like <a href="https://en.wikipedia.org/wiki/Currency_swap">currency swaps</a>.</p></li></ul><h2>What's next?</h2><p>We're about to witness the outcome of a groundbreaking experiment in DeFi. $UST's success depends on whether we can collectively build the necessary network effects to prevent it from collapsing. I'm actively building a <a href="https://www.terranova.finance/">Terra EVM</a> to support the ecosystem and would love to talk with others who are interested in working on some of the related problems. I'm curious to hear the Terra community's thoughts, so let me know what you think on Twitter (<a href="https://www.twitter.com/neelsomani">@neelsomani</a>, <a href="https://www.twitter.com/TerranovaEVM">@TerranovaEVM</a>) or at neeljaysomani [at] gmail.com.</p>]]></content:encoded></item><item><title><![CDATA[Explaining to My Parents What I Do]]></title><description><![CDATA[My parents don't understand what I do as a quant, so I wrote this to help explain to them. This post goes into the basics of power pricing and the electricity market.]]></description><link>https://www.neelsomaniblog.com/p/explaining-to-my-parents-what-i-do</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/explaining-to-my-parents-what-i-do</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Sun, 09 Jan 2022 02:29:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RcH8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3be2a10-9901-4135-a9cf-bbc89474b10e_1282x1036.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Albert Einstein said: "If you can't explain it to a six year old, you don't understand it yourself." I like to think this is true about explaining things to my parents.</p><p>As is the case for many people my age, my parents don't understand what I do. 
They're both in medicine and have no background in math or economics, while I'm a quantitative researcher at a hedge fund. In this post I will attempt to give a high-level explanation of one of the problems that I work on: <strong>power pricing</strong>. This is a simplified explanation that neglects details and edge cases.</p><h2>Intro: the power balance</h2><p>Power (or electricity) must be generated by power plants. That power is consumed by "load" (demand), often a load-serving entity (LSE) that distributes that power to other people. My parents are in northern California, where the LSE is Pacific Gas &amp; Electric. Your electricity bill comes in kilowatt-hours (kWh), but I typically think about things in megawatt-hours (MWh).</p><p>At any given moment, the amount of power produced must exactly equal the amount of power consumed. This is called the "balance." If the grid does not balance, then the frequency of the grid will drop (or rise) from the 60 Hz at which it must operate. If the frequency of the grid is not exactly right, then it can cause serious damage to the power generators. This is why when the grid does not balance, generators prefer to shut down completely, causing brownouts or blackouts.</p><h2>Why does the price for power change?</h2><p>For some supply, each additional MWh of power is very cheap to produce. This cost is called the "marginal cost": the cost to produce an additional unit of power. Fuel types with very low marginal cost are solar, wind, and nuclear. Gas and coal have higher marginal costs.</p><p>If demand is very low (such as in the middle of the night, called off-peak hours), then the cheapest producers will supply the electricity, and therefore the price is not very high. When demand increases, we must incentivize the less efficient producers to supply that power.
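</p><p>This merit-order logic can be sketched in a few lines of Python. This is a toy model: the generator fleet, the costs, and the <code>clearing_price</code> helper are made up for illustration, and real dispatch involves many more constraints.</p>

```python
# Toy merit-order dispatch: generators run from cheapest to most
# expensive until demand is met; the clearing price is the marginal
# cost of the last (most expensive) unit that had to run.

def clearing_price(generators, demand_mw):
    """generators: list of (marginal_cost_per_mwh, capacity_mw) tuples."""
    supplied = 0.0
    for cost, capacity in sorted(generators):
        supplied += capacity
        if supplied >= demand_mw:
            return cost  # this generator sets the price for everyone
    raise ValueError("not enough capacity to meet demand")

fleet = [
    (5.0, 300),   # solar/wind: near-zero marginal cost
    (12.0, 400),  # nuclear
    (35.0, 500),  # efficient gas
    (80.0, 400),  # peaker
]

print(clearing_price(fleet, 600))   # off-peak demand -> 12.0
print(clearing_price(fleet, 1300))  # peak demand -> 80.0
```

<p>At low demand the cheap units cover everything, so the price stays low; once demand forces the expensive peakers online, the clearing price jumps.</p><p>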
It costs them more money to make the power, so the price increases.</p><p>The power is always supplied by the producer that minimizes the total cost to the system. This is ensured by a central organization called a regional transmission organization (RTO), to which all of the producers submit their marginal costs and LSEs submit their demand. The RTO in California is called CAISO, and my parents are in northern California, so they are specifically in the region called NP-15. Other RTOs include PJM, which covers where I live in Chicago, and ERCOT, which you might have heard about in Texas.</p><h2>How is power traded?</h2><p>Power is traded in many ways (you can trade the average price of power, the theoretical profitability of a coal-powered plant, the number of hot or cold days in a month, the ratio or difference between power and gas prices, etc.). I'm going to just give a couple of examples in this section.</p><p>A "forward" is an agreement to buy or sell something at a certain price in the future. So let's say I buy a forward on a banana for $10 one year from now. One year from now the contract settles and the market is pricing the banana at $25. That means I buy the banana at $10 (because of my forward) and I get to sell at $25, so I made $15 profit. If the price went down, then I would have lost money.</p><p>Now onto power: there is a "day-ahead" market for power, and a "real-time" market. I am not going to go into the details of the day-ahead market, but you can buy forwards on the day-ahead price.</p><p>The real-time market recalculates the price for power every 5 minutes throughout the day. The price is calculated using an optimization that is simplified below.</p><p>One of the most basic instruments is a "bal-day," short for balance of the day. The bal-day represents the average price for power over the peak hours in the real-time market. 
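</p><p>Settlement for both instruments is just the difference between the contract price and the realized price. A quick sketch (the <code>forward_pnl</code> helper and the 5-minute prices are hypothetical):</p>

```python
def forward_pnl(contract_price, settlement_price, quantity=1):
    """Profit to the buyer of a forward: (settle - contract) * quantity."""
    return (settlement_price - contract_price) * quantity

# The banana example: buy a forward at $10, market settles at $25.
print(forward_pnl(10, 25))  # 15

# A bal-day: buy at $30/MWh, and it settles against the realized
# average of the real-time peak-hour prices.
real_time_peak_prices = [28, 35, 42, 55, 40]  # hypothetical 5-minute prices
avg = sum(real_time_peak_prices) / len(real_time_peak_prices)
print(forward_pnl(30, avg))  # 10.0
```

<p>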
So in the morning you might see that you can buy the bal-day at $30, but you think the average price in the day will be $40. You would buy the bal-day, wait until the end of the day when the contract settles, and make or lose the difference.</p><p>Another distinction is that most people will trade the average price of power over a given area called a "zone." But there are thousands of physical buses that people connect to, each of which has its own power price, and you can trade those via nodal trading (with instruments like FTRs and ARRs, again not worth going into).</p><h2>How does the RTO figure out the market price for power?</h2><p>You might have seen graphs like this before:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RcH8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3be2a10-9901-4135-a9cf-bbc89474b10e_1282x1036.png"><img src="https://substackcdn.com/image/fetch/$s_!RcH8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3be2a10-9901-4135-a9cf-bbc89474b10e_1282x1036.png" width="1282" height="1036" alt="" loading="lazy"></a></figure></div><p>The line sloping downwards represents demand. As quantity increases (bottom axis), the price people are willing to pay decreases. The upward sloping line represents supply. As we discussed above, for small quantities, the power does not cost much to produce, so people are willing to supply it for very cheap prices.
As the quantity increases, the price that producers require also increases.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TrXe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb55f0aa9-0eff-46a0-8385-496bc9520405_1282x1036.png"><img src="https://substackcdn.com/image/fetch/$s_!TrXe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb55f0aa9-0eff-46a0-8385-496bc9520405_1282x1036.png" width="1282" height="1036" alt="" loading="lazy"></a></figure></div><p>The area in green is called "economic surplus." It is good! We want to maximize that area, because it represents all of the people who are getting a good deal. The people who demand power got it for much less than they were willing to pay, and the producers are selling power for more than they were willing to sell it for. In fact, the "efficient market price" is the price that maximizes this area.</p><p>For power specifically, demand is described as "inelastic," meaning it doesn't really change much even if the price changes. So the downward sloping line is <em>almost</em> a straight line down. Therefore maximizing this area is equivalent to just picking the cheapest suppliers to supply the power!</p><p>The problem is that we have "constraints."
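</p><p>That equivalence is easy to check numerically. In this hypothetical sketch (the fleet and costs are made up), demand is perfectly inelastic, so total willingness-to-pay is a constant and maximizing surplus is the same as minimizing total production cost:</p>

```python
# With perfectly inelastic demand, surplus = (constant willingness-to-pay)
# minus production cost, so the surplus-maximizing dispatch is exactly
# the least-cost (merit-order) one.

fleet = [(5.0, 300), (35.0, 500), (80.0, 400)]  # (cost $/MWh, capacity MW)
demand = 700.0

def production_cost(dispatch):
    """dispatch: MW produced by each generator, in fleet order."""
    assert sum(dispatch) == demand  # the grid must balance
    return sum(mw * cost for mw, (cost, _cap) in zip(dispatch, fleet))

cheap = production_cost([300, 400, 0])     # merit order: fill cheapest first
wasteful = production_cost([300, 0, 400])  # skip the mid-cost plant

print(cheap, wasteful)  # 15500.0 33500.0 -- merit order is cheaper
```

<p>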
Power lines can only transmit so much power and producers can only supply so much. There are actually tons of other constraints that I won't go into. But in short, the problem statement looks something like this:</p><p>Maximize the economic surplus (minimize the cost to produce power), subject to:</p><ol><li><p>(power supplied) = (demand)</p></li><li><p>The amount of power transmitted across each line cannot exceed the line's capacity</p></li><li><p>Each producer cannot produce more power than its capacity</p></li></ol><p>Or the formulation in the ISO New England (ISONE) slides:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i1wW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adad1ee-b3e3-4e04-99ac-08cb35dcd168_1592x1064.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i1wW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adad1ee-b3e3-4e04-99ac-08cb35dcd168_1592x1064.png 424w, https://substackcdn.com/image/fetch/$s_!i1wW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adad1ee-b3e3-4e04-99ac-08cb35dcd168_1592x1064.png 848w, https://substackcdn.com/image/fetch/$s_!i1wW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adad1ee-b3e3-4e04-99ac-08cb35dcd168_1592x1064.png 1272w, https://substackcdn.com/image/fetch/$s_!i1wW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adad1ee-b3e3-4e04-99ac-08cb35dcd168_1592x1064.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!i1wW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adad1ee-b3e3-4e04-99ac-08cb35dcd168_1592x1064.png" width="1456" height="973" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7adad1ee-b3e3-4e04-99ac-08cb35dcd168_1592x1064.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:973,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:209619,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i1wW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adad1ee-b3e3-4e04-99ac-08cb35dcd168_1592x1064.png 424w, https://substackcdn.com/image/fetch/$s_!i1wW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adad1ee-b3e3-4e04-99ac-08cb35dcd168_1592x1064.png 848w, https://substackcdn.com/image/fetch/$s_!i1wW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adad1ee-b3e3-4e04-99ac-08cb35dcd168_1592x1064.png 1272w, https://substackcdn.com/image/fetch/$s_!i1wW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7adad1ee-b3e3-4e04-99ac-08cb35dcd168_1592x1064.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This model is called a production cost model. One thing that makes this more complicated is that generators have a "start cost": it costs them some amount of money to even turn on. Another complication is that power generators must either be fully on or fully off, and when they are on, they must pay a "no load cost," the cost of just existing while you're on.</p><p>This problem becomes complicated enough that it is a well-known type of problem, called a mixed integer program (MIP). MIPs are NP-hard, which puts them among the hardest classes of problems in computer science. What's nice about this model is that if you solve it, one of the outputs ends up being the price at each location. 
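</p><p>To make that structure concrete, here is a toy sketch of the on/off decision in Python. The generator numbers are invented, and a brute-force search over commitments stands in for the MIP solver a real market operator would use:</p>

```python
from itertools import product

# Toy "unit commitment" sketch: each generator is either fully on or off.
# If on, it pays a start cost plus a no-load cost, and can produce up to its
# capacity at a marginal cost per MWh. With a handful of units we can
# brute-force the on/off choices; real production cost models are large MIPs.
# All numbers are made up for illustration.

def total_cost(units, on_flags, demand):
    committed = sorted((u for u, on in zip(units, on_flags) if on),
                       key=lambda u: u["marginal"])
    cost = sum(u["start"] + u["no_load"] for u in committed)
    remaining = demand
    for u in committed:  # dispatch committed units cheapest-first
        take = min(u["capacity"], remaining)
        cost += take * u["marginal"]
        remaining -= take
    return cost if remaining <= 0 else float("inf")  # inf = can't meet demand

def cheapest_commitment(units, demand):
    flags = min(product([0, 1], repeat=len(units)),
                key=lambda f: total_cost(units, f, demand))
    return flags, total_cost(units, flags, demand)

units = [
    {"start": 100, "no_load": 20, "marginal": 10, "capacity": 80},
    {"start": 10, "no_load": 5, "marginal": 40, "capacity": 100},
]
print(cheapest_commitment(units, demand=90))
# ((1, 1), 1335)
```

<p>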
It takes hours to solve the complete version of this problem.</p><h2>So what do I do exactly?</h2><p>I develop various models to estimate the possible outcomes for a variety of different investments. I hope this is helpful!</p>]]></content:encoded></item><item><title><![CDATA[How To Derive Useful Financial Approximations]]></title><description><![CDATA[Many useful financial rules of thumb can be derived using Taylor approximations. In this blog post, I solve for common approximation formulas.]]></description><link>https://www.neelsomaniblog.com/p/how-to-derive-useful-financial-approximations</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/how-to-derive-useful-financial-approximations</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Fri, 16 Oct 2020 02:58:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/eb7431b5-6563-48a7-9de1-a657e84c2021_1526x1526.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I was recently doing some finance training for my new job, and I thought it would be interesting to demonstrate how you can derive useful financial approximations using Taylor polynomials. These rules of thumb can get you close enough to the truth to do napkin math. In this post, I'll work through some examples.</p><h2>Calculating Bond Yields</h2><h3>Prerequisite Knowledge</h3><p>For the purpose of this section, a bond (loan) is a financial instrument where you pay some price P, and every year you receive dollar amounts C (called coupons). At the end of the loan (n years) you receive back what's called the par value V.</p><p>One metric that investors are interested in is called the yield y of the bond. This is essentially a measure of the bond's rate of return, or the percent yearly return you'd have to make on another investment in order to be indifferent between that other investment and this bond. 
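</p><p>As a quick numerical illustration (with made-up bond numbers), you can solve for the yield by bisection, since the discounted value of the cash flows falls as y rises, and compare the result to the rule of thumb derived below:</p>

```python
# Solve for a bond's yield y: find the y at which the discounted coupons plus
# the discounted par value equal the price P. Bond numbers are hypothetical.

def price(y, C, V, n):
    return sum(C / (1 + y) ** t for t in range(1, n + 1)) + V / (1 + y) ** n

def yield_bisect(P, C, V, n, lo=0.0, hi=1.0, tol=1e-10):
    # price() is decreasing in y, so bisect until the bracket is tiny
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if price(mid, C, V, n) > P:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

P, C, V, n = 95.0, 5.0, 100.0, 10
exact = yield_bisect(P, C, V, n)
approx = C / V + ((V - P) / V) / n  # the rule of thumb from this section
print(round(exact, 4), round(approx, 4))
```

<p>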
We find the yield by solving for y in:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P = \\frac{C}{(1 + y)^1} +\\ ...\\ + \\frac{C}{(1 + y)^n} + \\frac{V}{(1 + y)^n}&quot;,&quot;id&quot;:&quot;EPHJBAZBBQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Since in general y &gt; 0, we are valuing payments lower the further they are in the future (called "discounting").</p><h3>The Approximation</h3><p>Here is <a href="https://www.iotafinance.com/en/Formula-Bond-Yield-Quick-Approximation.html">an approximation for the bond's yield</a>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y \\approx \\frac{C}{V} + \\frac{\\frac{V - P}{V}}{n}&quot;,&quot;id&quot;:&quot;KBPZSRNGAE&quot;}" data-component-name="LatexBlockToDOM"></div><p>The intuitive understanding is that the coupon payments contribute C/V to the yield every year. To explain the other term, you initially paid P and you ultimately receive V, contributing about (V-P)/V to the yield over the entire course of the bond. So for each year, it's about [(V - P)/(V)]/n in yield.</p><p>Why does this approximation work? We start with the normal equation for calculating bond yield, but we don't discount the coupon payments - that is, we pretend like we don't care about when the coupon payments are made:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P \\approx \\underbrace{C +\\ ...\\ + C}_{n\\ \\text{times}} + \\frac{V}{(1 + y)^n}\n= n * C + V * \\frac{1}{(1 + y)^n}&quot;,&quot;id&quot;:&quot;PTVPBLHSSY&quot;}" data-component-name="LatexBlockToDOM"></div><p>To simplify the remainder of the equation, we recall the <a href="https://en.wikipedia.org/wiki/Linear_approximation">first-order Taylor approximation</a> centered on 0. 
That is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f(y) \\approx f(0) + f'(0) * y&quot;,&quot;id&quot;:&quot;OYMPZPVEWR&quot;}" data-component-name="LatexBlockToDOM"></div><p>We'll just use the Taylor approximation as a tool in this post rather than discussing it at length. We see that:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{d}{dy}\\frac{1}{(1 + y)^n} = \\frac{-n}{(1 + y)^{n + 1}}\n\n\\implies \\frac{1}{(1 + y)^n} \\approx 1 + \\frac{-n}{(1 + 0)^{n + 1}} * y = 1 - n * y\n\n&quot;,&quot;id&quot;:&quot;DHWMLJFZPV&quot;}" data-component-name="LatexBlockToDOM"></div><p>So after simplifying and isolating for y:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(P \\approx n * C + V * (1 - n * y))\n\n\\implies (y \\approx \\frac{C}{V} + \\frac{\\frac{V - P}{V}}{n})&quot;,&quot;id&quot;:&quot;TRBNMWMATX&quot;}" data-component-name="LatexBlockToDOM"></div><h2>Time For Money To Double: The Rule Of 72</h2><h3>Prerequisite Knowledge</h3><p>Let's say that you have an investment that grows by r% every year. So for some initial investment P, after n years, you'll have P * (1 + r)^n. The question: after how many years does your investment double?</p><h3>The Approximation</h3><p>The common approximation given is that <a href="https://en.wikipedia.org/wiki/Rule_of_72">you divide 72 by your rate of return</a>, and that gives you the number of years it takes to double. So for an investment that gives a 6% return, it should take about 12 years to double.</p><p>We start with the exact equation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;2 * P = P * (1 + r)^n\n\n\\implies 2 = (1 + r)^n\n\n\\implies ln(2) = n * ln(1 + r)&quot;,&quot;id&quot;:&quot;IXFRQRMPHR&quot;}" data-component-name="LatexBlockToDOM"></div><p>First we observe that ln(2) ~= .693. 
Then we calculate the first-order Taylor approximation centered on 0 of the right-hand side:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{d}{dr}ln(1 + r) = \\frac{1}{1 + r} \\implies ln(1 + r) \\approx \\frac{1}{1 + 0} * r + 0 = r&quot;,&quot;id&quot;:&quot;PVKLIEBUCW&quot;}" data-component-name="LatexBlockToDOM"></div><p>So we finally get:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;n \\approx \\frac{.693}{r}&quot;,&quot;id&quot;:&quot;JRDLQGKCMW&quot;}" data-component-name="LatexBlockToDOM"></div><p>where .693 is frequently substituted with .72 to keep the math simple, since 72 is easily divisible by many numbers. To see when it is most accurate, we can see when this function is equal to the <a href="https://en.wikipedia.org/wiki/Taylor%27s_theorem#Statement_of_the_theorem">second-order Taylor approximation</a>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{d}{dr}\\frac{1}{1 + r} = -\\frac{1}{(1 + r)^2} \\implies ln(1 + r) \\approx r - \\frac{r^2}{2}&quot;,&quot;id&quot;:&quot;YIBFJYHKIX&quot;}" data-component-name="LatexBlockToDOM"></div><p>so we solve for:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{ln(2)}{r - \\frac{r^2}{2}} = \\frac{.72}{r}&quot;,&quot;id&quot;:&quot;PQQFKGIFJA&quot;}" data-component-name="LatexBlockToDOM"></div><p>which simplifies to a linear equation. By solving for r, we see that the rule is most accurate for rates around 7.46%.</p><h2>Chance Of Winning A Poker Hand: The Rule Of Fours</h2><h3>Prerequisite Knowledge</h3><p>This is a fun one. In a standard game of Texas hold'em, you're holding on to two cards, and there are some cards on the table. Your goal is to make the best hand possible out of the cards in your hand plus the cards on the table. Once five cards are on the table, no more cards will be drawn. 
So you might be curious what your chance of winning is if there are three cards out (known as the flop) and two to go.</p><h3>The Approximation</h3><p>The common rule is that you count the number of cards that would lead to your win (called "outs") and <a href="https://www.pokerstarsschool.com/lessons/the-rule-of-two-and-four/689/">multiply that number by 4%</a> for your approximate chance of winning.</p><p>For example, if you have a 3, 4, 5, and 6, then you are hoping for a 2 or a 7 to complete your <a href="https://www.pagat.com/poker/rules/ranking.html#standard">straight</a>. There are four 2's and four 7's, so you have roughly an 8 * 4% = 32% chance of getting the straight.</p><p>We start with the exact probability as usual. Since you have two cards in your hand and three on the table, there are 52 - 5 = 47 cards remaining. The easiest way to calculate your chance of winning is to calculate 1 - Pr[you lose]. Since there are k cards that would lead to your win, there are 47 - k cards that would not lead to you winning after the next card is drawn, or a (47 - k)/47 chance. The final card has a (46 - k)/46 chance of you not winning by the same reasoning. 
So:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Pr[you win]} = 1 - \\frac{47 - k}{47} * \\frac{46 - k}{46} = 1 - \\frac{47 * 46 - 93 * k + k^2}{47 * 46} = \\frac{93 * k}{47 * 46} - \\frac{k^2}{47 * 46}&quot;,&quot;id&quot;:&quot;LQNXPPADTA&quot;}" data-component-name="LatexBlockToDOM"></div><p>Since this is a polynomial, the first-order Taylor approximation is just the first term of the polynomial:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\approx \\frac{93 * k}{47 * 46} \\approx .043 * k \\approx .04 * k&quot;,&quot;id&quot;:&quot;SKRUOPYUYK&quot;}" data-component-name="LatexBlockToDOM"></div><p>To see for how many outs k the approximation 4% * k is most accurate, we compare it to the exact solution and solve for k:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;.04 * k = \\frac{93 * k}{47 * 46} - \\frac{k^2}{47 * 46}&quot;,&quot;id&quot;:&quot;WCQOLUCISJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>which gives about k = 6.52.</p>]]></content:encoded></item><item><title><![CDATA[Three Controversial Beliefs About Living Things]]></title><description><![CDATA[I describe a few of my more unusual beliefs about evolution and the nature of living things. I think that evolution is frequently misunderstood.]]></description><link>https://www.neelsomaniblog.com/p/three-controversial-beliefs-about</link><guid isPermaLink="false">https://www.neelsomaniblog.com/p/three-controversial-beliefs-about</guid><dc:creator><![CDATA[Neel Somani]]></dc:creator><pubDate>Thu, 24 Oct 2019 03:06:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ec83ccb5-fa76-487d-9ddb-28e503d54999_532x532.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I've always been interested in philosophy of science, and earlier today I was thinking a bit about some commonly-held ideas in the field of biology which I disagree with. 
Here are a few controversial views that I have about living things and evolution. They're strong beliefs that are loosely held. What are your thoughts?</p><h2>1. There is no essential difference between a living thing and a non-living thing.</h2><h3>We cannot define life.</h3><p>Below are a few proposed definitions of "life," but they don't work. I'll explain how they either encompass too much or too little (<a href="http://www.aim.univ-paris7.fr/enseig/exobiologie_PDF/Biblio/Cleland%20and%20Chyba%20_2002.pdf">Cleland and Chyba 2002</a>).</p><p><em>Life is matter that can reproduce itself and evolve as survival dictates</em> (<a href="https://web.archive.org/web/20120322185054/http://www.etsu.edu/physics/lutter/courses/astr1020/a1020chap12.pdf">source</a>). This common definition stems from Darwinian evolution. Let's consider someone who is infertile. They cannot reproduce, yet we still clearly consider them to be living, so the definition doesn't work. (In the <a href="https://www.neelsomaniblog.com/p/three-controversial-beliefs-about">next section</a>, I'll demonstrate that "evolve as survival dictates" is meaningless.)</p><ul><li><p>That definition seems to stem from a misunderstanding of evolution anyway. Survival and reproduction are just terms that come up in the description of natural selection. That is, organisms that have a better chance of survival and reproduction are going to be overrepresented in the next generation (i.e., have greater reproductive fitness). Those aren't essential features of life.</p></li></ul><p>For any other definition, I can come up with exceptions just as easily. Something with a metabolism? Look at a car. Something with <a href="https://en.wikipedia.org/wiki/Entropy_and_life#Negative_entropy">negative entropy</a>? 
Here's my refrigerator.</p><p>In <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC516796/#pbio-0020302-McKay1">McKay's 2004 article</a> on how to search for life, he acknowledges (and cites) the difficulties associated with defining life, but ultimately concludes that we should search for "energy, carbon, liquid water, and a few other elements such as nitrogen, sulfur, and phosphorus." I don't see the reason to search for any of those things if we're not even confident that we would identify "life" if we saw it.</p><p>I'll admit that even if we can't define what it means for an individual thing to be alive, there might still be notable characteristics about a <em>group</em> of things that are alive. My point is that there is nothing fundamentally different about an individual living thing.</p><p>I'll also admit that just because we can't define something doesn't mean that there's nothing that fundamentally distinguishes it. As Cleland and Chyba suggest, we had no acceptable definition of water before the development of molecular theory. To be fair, a similar definition for life would require discovering some sort of fundamental "life force" that falls outside of our current framework for explaining things. That seems unlikely to me and almost more like a religious belief.</p><h3>What about the argument that life is a spectrum?</h3><p>One solution is that life is a spectrum from "undeniably non-living" to "undeniably living." That would explain why there's no hard and fast rule as to whether something is alive.</p><p>If that's the case, then the extreme of "living" still needs to be described. I think that if it is a spectrum, then the side of "living" is very human-centric (or anthropocentric). That is, when we say something is "living," we really mean "more like humans." No one would deny that a gorilla is alive (it looks so much like us), a sponge is questionable at first glance, and a virus barely doesn't meet the cutoff. 
Of course we'd place humans on the extreme of "undeniably living," which is suspicious to me and makes me think that it's still an arbitrary spectrum.</p><h3>Why do people feel like there's something special about life?</h3><p>You might wonder why we have such a strong urge to believe that there's something fundamentally different about living things if it's not actually the case. My argument is that our propensity for identifying things as "living" just served evolutionary utility. To identify something as living allowed us to see it as predator or prey and respond accordingly. The organisms that didn't have this sense were at a huge disadvantage.</p><h2>2. Evolution is just a description of how a group of entities, whether living or non-living, can change over time. It is not a "force."</h2><p>Example of the misconception:</p><p>"Evolution is the single greatest force in the universe; it is the only thing that is permanent and it drives everything." - Ray Dalio's <a href="https://inside.bwater.com/publications/principles_excerpt">"Principles"</a></p><p>In the quote, Dalio fails to understand that evolution is not a force, and it does not drive anything. (In Dalio's defense, he might not have been using "evolution" in the traditional biological sense.)</p><h3>None of the mechanisms of evolution require things to be living.</h3><p>Evolution happens in a few different ways, but there's nothing about those mechanisms that's specific to living things. They're just logical statements that basically amount to describing the only ways that a population of things (anything, not just life) could possibly change. Evolution isn't a "force."</p><ul><li><p><a href="https://en.wikipedia.org/wiki/Natural_selection">Natural selection</a> is almost circular. It's saying that if something has characteristics that make it more likely to exist in the following generation, then it will be overrepresented in the next generation. 
There's nothing about that process that's specific to living things. For example, maybe we're looking at a bowl of Starburst candies (red, pink, orange, and yellow) each day. If people tend to eat the red and pink candies more frequently, then we're going to be left with a bowl of yellow and orange candies. That's essentially the mechanism of natural selection. (I'm sort of clumping artificial selection in this same group.)</p></li><li><p>Genetic drift is the random fluctuation of the proportions of different groups. For example, I might have a group of shirts. One day, I accidentally spill something on one of my blue shirts, so I have to throw it out. The proportion of blue shirts in the group has decreased, but it had nothing to do with the color or characteristics of those shirts. It just happened by chance, as opposed to natural selection.</p></li><li><p>Gene flow is like another group joining the first one and affecting the proportions. Let's say I get my brother's wardrobe when he moves out. 
If he owned a higher proportion of white shirts than I did, then my wardrobe is going to have a higher proportion of white shirts than before.</p></li></ul><p>Although the names "genetic drift" and "gene flow" appear to be tied to living things, the underlying concepts are not.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v5zm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59285ce-7fec-48f0-a79d-92398f45e161_893x635.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v5zm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59285ce-7fec-48f0-a79d-92398f45e161_893x635.png 424w, https://substackcdn.com/image/fetch/$s_!v5zm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59285ce-7fec-48f0-a79d-92398f45e161_893x635.png 848w, https://substackcdn.com/image/fetch/$s_!v5zm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59285ce-7fec-48f0-a79d-92398f45e161_893x635.png 1272w, https://substackcdn.com/image/fetch/$s_!v5zm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59285ce-7fec-48f0-a79d-92398f45e161_893x635.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v5zm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59285ce-7fec-48f0-a79d-92398f45e161_893x635.png" width="893" height="635" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e59285ce-7fec-48f0-a79d-92398f45e161_893x635.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:635,&quot;width&quot;:893,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72385,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v5zm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59285ce-7fec-48f0-a79d-92398f45e161_893x635.png 424w, https://substackcdn.com/image/fetch/$s_!v5zm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59285ce-7fec-48f0-a79d-92398f45e161_893x635.png 848w, https://substackcdn.com/image/fetch/$s_!v5zm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59285ce-7fec-48f0-a79d-92398f45e161_893x635.png 1272w, https://substackcdn.com/image/fetch/$s_!v5zm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59285ce-7fec-48f0-a79d-92398f45e161_893x635.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>3. Evolution does not help us understand the "purpose" of humans in any way whatsoever.</h2><p>Here's a random blog post that makes reference to this misconception: <a href="https://blogs.scientificamerican.com/guest-blog/is-the-meaning-of-your-life-to-make-babies">https://blogs.scientificamerican.com/guest-blog/is-the-meaning-of-your-life-to-make-babies</a></p><p>"So is making babies -- and having genes survive through the generations -- the meaning of life? The answer is yes -- from an evolutionary gene's eye view&#8230; This is modern knowledge that is not to be taken lightly."</p><p>I disagree. Evolution doesn't imply that it's "good" for your genes to survive or that your genes have a "meaning" to perpetuate themselves. It's just that the genes that do happen to perpetuate themselves will be the genes that exist next generation. That's all there is to it. 
To evolution, it's not a good thing, it's not a bad thing, it's just necessarily the case.</p><p>To add on, the only reason why we have a will to survive and reproduce is because the organisms without that drive were at a reproductive disadvantage. Fewer of them made it to the following generations, so we're all basically left with the will to survive and reproduce. It still doesn't make it good or bad.</p><p>That means that evolution doesn't imply that the "purpose" of living things is to survive and reproduce. It's not your purpose, and it's not your genes' purpose.</p><p>It also means that evolution doesn't justify any other behavior. For example, someone might argue that hoarding resources makes them more likely to survive, so it's justified by evolution. That's not the case, since evolution does not imply that our purpose is to increase our reproductive fitness. If you want that to be your purpose, that's fine, but that's not what evolution says.</p>]]></content:encoded></item></channel></rss>