What You Didn’t Learn in Berkeley CS 188 — Part 2
Implementing the policy gradient methods: REINFORCE, A2C, TRPO, PPO.
In my last post, I covered classical reinforcement learning methods. Some of these appeared in CS 188, but not at the depth needed to understand why they work. In this post, I show how these basic methods can be rethought or extended to handle very large state spaces or continuous action spaces.
If you recall, Q-learning, value iteration, and other tabular methods require storing a full table of values for every state or state–action pair. The policy is implicit in those values: iterate over the actions and pick the one that maximizes expected value.
Even in a continuous state space, the idea still applies. Define a parameterized Q-function Qθ(s, a) and an implicit greedy policy π(s) = argmaxa Qθ(s, a).
When you sample an action a in state s, observe reward r, and land in s’, define a Bellman-type residual: δ = r + γ maxa’ Qθ(s’, a’) − Qθ(s, a).
You can take gradient steps on θ to reduce this residual, typically by minimizing its square. In practice, the target term (r + γ maxa’ Qθ(s’, a’)) is computed with a stale copy of the parameters θ⁻, called a target network, to reduce the instability of chasing a moving target under stochastic rewards and transitions. This is a Deep Q-Network (DQN).
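As a minimal sketch of that loss in PyTorch, assuming a q_net and a periodically synced target_net with the same architecture (the names are illustrative, not taken from the repo linked below):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    # batch: (states, action indices, rewards, next states, done flags)
    s, a, r, s_next, done = batch
    # Q_theta(s, a) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # r + gamma * max_a' Q(s', a'), computed with the stale target network
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    # Minimize the squared Bellman residual
    return F.mse_loss(q_sa, target)
```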
That works for discrete action spaces. In continuous action spaces, computing maxaQθ(s, a) is generally intractable. This motivates learning the policy directly rather than inferring it from Q-values. We introduce a policy over a continuous action space, that is, a probability density function. Let πθ(a | s) be a policy parameterized by θ, for example the parameters of a Gaussian. If we can properly define a loss function, we can optimize θ using SGD or Adam.
Introducing the policy gradient methods. These are the methods you’ll often hear about if you scroll X. In this post, I implement a couple of these methods on the Pendulum environment.
Code: https://github.com/neelsomani/policy-gradient
REINFORCE: Policy-Gradient Derivation
This is a common derivation which you can find in many places. Let:
πθ be the policy,
τ = [(s1, a1), …, (sn, an)] be a trajectory,
Gτ be the discounted return of τ,
ϕ(τ) = ∏t=1,…,nP(st+1 | st, at) ∏t=1,…,n πθ(at | st) be the probability of τ.
Then we can define our objective as the expected discounted return: J(θ) = Eτ∼ϕ[Gτ] = ∫ ϕ(τ) Gτ dτ.
The gradient of the objective function is: ∇θJ(θ) = ∇θ ∫ ϕ(τ) Gτ dτ = ∫ ∇θϕ(τ) Gτ dτ.
But we don’t want to compute the product rule across ∏t=1,…,n πθ(at | st). The classic way to get around that is the log-trick, ∇f = f ∇log(f): ∇θJ(θ) = ∫ ϕ(τ) ∇θlog ϕ(τ) Gτ dτ = Eτ∼ϕ[∇θlog ϕ(τ) Gτ] = Eτ∼ϕ[(Σt=1,…,n ∇θlog πθ(at | st)) Gτ], where the last step uses the fact that the transition probabilities do not depend on θ.
To reduce the variance of the gradient, we use the equivalent unbiased estimator that pushes the return inside the sum: ∇θJ(θ) = Eτ∼ϕ[Σt=1,…,n ∇θlog πθ(at | st) Gt], where Gt is the return from time t onward.
The Causality Argument
We now justify focusing on the return from time t onward. First expand the trajectory expectation as a tower of expectations: Eτ∼ϕ[X(τ)] = Es1,a1[Es2,a2 | s1,a1[… Esn,an | s1,a1,…,sn−1,an−1[X(τ)] …]].
For any fixed t, ∇θlog πθ(at | st) Gτ = ∇θlog πθ(at | st) (const + γ^(t−1) Gt),
and the “const” term depends only on (s1, a1), …, (st−1, at−1). Taking the conditional expectation over at ∼ πθ(· | st) and using the log-trick in reverse, Eat∼πθ(· | st)[∇θlog πθ(at | st) · const] = const · ∫ ∇θπθ(a | st) da = const · ∇θ ∫ πθ(a | st) da = const · ∇θ1 = 0,
so all terms prior to t vanish in expectation. Define the Monte Carlo return from t: Gt = Σk=t,…,n γ^(k−t) rk,
then: ∇θJ(θ) = Eτ∼ϕ[Σt=1,…,n ∇θlog πθ(at | st) Gt] (dropping the γ^(t−1) weighting in front of each term, as is standard in implementations).
This argument is often called causality.
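In code, computing Gt for every timestep of a sampled trajectory is a single backward sweep over the rewards. A minimal helper of my own (illustrative, not necessarily how the linked repo does it):

```python
import torch

def returns_to_go(rewards, gamma=0.9):
    """G_t = sum_{k >= t} gamma^(k - t) * r_k for one trajectory."""
    returns = torch.zeros(len(rewards))
    running = 0.0
    # Walk backward so each step reuses the discounted tail already accumulated
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```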
Implementing REINFORCE in PyTorch
In PyTorch, you don’t pass gradients in directly: you define a loss built from PyTorch primitives and let autograd differentiate it. Anything you need to differentiate the loss with respect to must therefore be expressed with those primitives. A common surrogate for the objective above is L(θ) = Σt=1,…,n log πθ(at | st) Gt, summed over the sampled trajectories,
which satisfies ∇θL(θ) = Σt=1,…,n ∇θlog πθ(at | st) Gt, i.e., a sample of the policy gradient above.
As long as we sample trajectories in an unbiased way, we are optimizing with respect to an unbiased estimate of ∇θJ(θ).
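Concretely, a minimal version of that surrogate, assuming a policy module exposing a log_prob(states, actions) method (an assumption for this sketch; the linked repo may organize things differently):

```python
import torch

def reinforce_loss(policy, states, actions, returns):
    """Negative surrogate: minimizing this takes an ascent step on the policy gradient."""
    # log pi_theta(a_t | s_t) for every step in the batch
    log_probs = policy.log_prob(states, actions)
    # Returns are targets, not something to differentiate through
    return -(log_probs * returns.detach()).sum()
```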
How do we represent π(a | s) in continuous action spaces? We could try to build a model that takes (s, a) and outputs a probability, but a pdf must be non-negative and integrate to 1. Neural nets output arbitrary real numbers. With a discrete action set we could normalize with a softmax, but that does not extend to a continuum of actions. Instead, we make the network output the parameters of a distribution, for example a Gaussian with mean μθ(s) and scale σθ(s), then sample from it.
For Pendulum, actions lie in [-2, 2]. One way to respect those bounds: a tanh head gives (-1, 1), which we scale by 2 to cover (-2, 2).
We also need σ > 0. Rather than predict σ directly, predict log(σ) and map it back with exponentiation or softplus.
There are a ton of tricks like this for enforcing output constraints:
Use the raw head if the output can range over all of R
Use tanh or sigmoid to keep the output in a bounded range while staying differentiable
Clip if you don’t need gradients outside the range (common for log(σ))
Exponentiate or apply softplus to map the output to (0, ∞)
Typically in PyTorch, the module’s forward method returns deterministic parameters (μ, σ), and sampling happens in a separate method.
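Here is a minimal sketch of such a policy for Pendulum, combining the tricks above (tanh-scaled mean, softplus on a raw head for σ). It is illustrative and not necessarily identical to the code in the linked repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim=3, hidden=64, action_scale=2.0):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mu_head = nn.Linear(hidden, 1)     # raw mean, squashed in forward()
        self.sigma_head = nn.Linear(hidden, 1)  # raw value, mapped to sigma > 0
        self.action_scale = action_scale

    def forward(self, obs):
        h = self.body(obs)
        # tanh keeps the mean in (-1, 1); scaling by 2 covers Pendulum's [-2, 2]
        mu = self.action_scale * torch.tanh(self.mu_head(h))
        # softplus keeps sigma strictly positive; the epsilon avoids collapse to zero
        sigma = F.softplus(self.sigma_head(h)) + 1e-5
        return mu, sigma

    def sample(self, obs):
        mu, sigma = self(obs)
        dist = torch.distributions.Normal(mu, sigma)
        action = dist.sample()
        return action, dist.log_prob(action)

    def log_prob(self, obs, actions):
        mu, sigma = self(obs)
        return torch.distributions.Normal(mu, sigma).log_prob(actions)
```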
Still, even with a correct REINFORCE, convergence can be slow.
Baselines and the justification for A2C
From the original REINFORCE paper, subtracting a baseline B(st) leaves the gradient unbiased: ∇θJ(θ) = Eτ∼ϕ[Σt=1,…,n ∇θlog πθ(at | st) (Gt − B(st))].
Proof: Eat∼πθ(· | st)[∇θlog πθ(at | st) B(st)] = B(st) · ∇θ ∫ πθ(a | st) da = B(st) · ∇θ1 = 0,
by the same reasoning as the causality argument above.
Baselines can reduce the variance of the gradient computation. Choosing B(st) = Vπ(st) yields: ∇θJ(θ) = Eτ∼ϕ[Σt=1,…,n ∇θlog πθ(at | st) (Gt − Vπ(st))] = Eτ∼ϕ[Σt=1,…,n ∇θlog πθ(at | st) A(st, at)], since E[Gt | st, at] = Q(st, at),
where A(s, a) = Q(s,a) - Vπ(s) is called the “advantage”. Estimating Vπ with another model, called a critic network, gives the actor–critic framework.
Practical notes from my final implementation for Pendulum:
Full Monte Carlo returns had too much variance, so I used TD(0) targets for the critic.
Second, I found the algorithm was highly sensitive to γ. γ=0.99 did not converge, while γ=0.9 did. The learning rate for the optimizer barely mattered.
Finally, the log standard deviation was not learning properly; the softplus stabilization recommended in this notebook helped.
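Putting those notes together, a minimal sketch of the actor–critic losses with TD(0) critic targets, assuming the GaussianPolicy above plus a small value network (again illustrative, not the exact code in the repo):

```python
import torch
import torch.nn.functional as F

def a2c_losses(policy, value_net, batch, gamma=0.9):
    states, actions, rewards, next_states, dones = batch
    v_s = value_net(states).squeeze(-1)
    with torch.no_grad():
        # TD(0) target: r + gamma * V(s') for non-terminal transitions
        targets = rewards + gamma * (1 - dones) * value_net(next_states).squeeze(-1)
    critic_loss = F.mse_loss(v_s, targets)
    # Advantage estimate, detached so it only scales the actor gradient
    advantages = (targets - v_s).detach()
    log_probs = policy.log_prob(states, actions).squeeze(-1)
    actor_loss = -(log_probs * advantages).mean()
    return actor_loss, critic_loss
```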
From TRPO to PPO
Motivation: Reusing Data
Suppose you compute a batch of trajectories under πold to estimate the policy gradient: ∇θJ(θ) ≈ (1/N) Σi=1,…,N Σt ∇θlog πθ(at | st) At, evaluated at θ = θold and averaged over N trajectories.
All of that work gives you a single update to θ. After you update, the batch is no longer on-policy. To reuse the data, you would need to reweight old observations so that expectations match those under πnew: Eτ∼πnew[f(τ)] = Eτ∼πold[(∏t πnew(at | st) / πold(at | st)) f(τ)],
but the product of ratios has high variance: if each per-step ratio is around 1.1, a 100-step trajectory already carries a weight of roughly 1.1^100 ≈ 14,000.
Performance Difference Lemma
To simplify things, we’re going to define the “discounted state visitation” distribution: dπ(s) = (1 − γ) Σt=1,…,∞ γ^(t−1) P(st = s | π).
Then, as we’ll prove in this section, here’s what’s called the “performance difference lemma”: J(πnew) − J(πold) = (1/(1 − γ)) Es∼dπ_new, a∼πnew[Aπ_old(s, a)].
Notice you’re taking an expectation over πnew, but you’re computing the advantages based on πold.
First, note that: J(πold) = Es1[Vπ_old(s1)], and the initial state s1 has the same distribution under both policies.
Then: J(πnew) − J(πold) = Eτ∼πnew[Σt≥1 γ^(t−1) rt] − Es1[Vπ_old(s1)] = Eτ∼πnew[Σt≥1 γ^(t−1) rt − Vπ_old(s1)].
Now, we’ll use an add & subtract trick to expose the advantage within the expectation: adding and subtracting γ^(t−1) Vπ_old(st) for every t ≥ 2 telescopes against −Vπ_old(s1), so J(πnew) − J(πold) = Eτ∼πnew[Σt≥1 γ^(t−1) (rt + γ Vπ_old(st+1) − Vπ_old(st))].
This is useful because we can write: E[rt + γ Vπ_old(st+1) | st, at] = Qπ_old(st, at), so each term is, in expectation, the advantage Aπ_old(st, at), and J(πnew) = J(πold) + (1/(1 − γ)) Es∼dπ_new, a∼πnew[Aπ_old(s, a)].
Maximizing J(θnew) amounts to maximizing: Es∼dπ_new, a∼πnew[Aπ_old(s, a)],
since the first term, J(πold), is a constant and 1/(1 − γ) is just a positive scale. The issue is that this expectation relies on πnew (in the distributions of both the actions and the states), which would require resampling.
In theory we could importance weight both the state distribution and the action distribution: Es∼dπ_new, a∼πnew[Aπ_old(s, a)] = Es∼dπ_old, a∼πold[(dπ_new(s) / dπ_old(s)) (πnew(a | s) / πold(a | s)) Aπ_old(s, a)].
We do not have access to dπ_new, so that ratio cannot be computed. Instead, TRPO assumes dπ_new ≈ dπ_old.
While we cannot enforce this directly, the shift in the state distribution can be bounded by controlling a policy divergence. The resulting guarantee has the form J(πnew) ≥ Lπ_old(πnew) − C maxs DKL(πold(· | s) ‖ πnew(· | s)), where Lπ_old(πnew) is the surrogate that uses dπ_old in place of dπ_new, C = 4εγ/(1 − γ)^2, and ε = maxs,a |Aπ_old(s, a)|,
which is the authors’ lower bound on J(πnew). In practice we constrain the expected KL under dπ_old, which is tractable.
With the state distribution approximated as unchanged, the remaining reweighting by πnew/πold is the familiar importance-sampling ratio. TRPO’s final surrogate and constraint become: maximize over θ the quantity Es∼dπ_old, a∼πold[(πθ(a | s) / πold(a | s)) Aπ_old(s, a)] subject to Es∼dπ_old[DKL(πold(· | s) ‖ πθ(· | s))] ≤ δ.
Proximal Policy Optimization (PPO)
PPO-Penalty
The first variant of PPO comes directly from the Lagrangian relaxation of the TRPO problem with multiplier λ > 0: maximize over θ the quantity Es∼dπ_old, a∼πold[(πθ(a | s) / πold(a | s)) Aπ_old(s, a)] − λ (Es∼dπ_old[DKL(πold(· | s) ‖ πθ(· | s))] − δ).
Dropping the constant λδ and defining β := λ, we arrive at the standard form: maximize over θ the quantity Es∼dπ_old, a∼πold[(πθ(a | s) / πold(a | s)) Aπ_old(s, a) − β DKL(πold(· | s) ‖ πθ(· | s))].
In practice, β is adapted so the empirical KL stays close to δ.
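One simple adaptation rule, the one suggested in the PPO paper, doubles or halves β depending on how far the measured KL has drifted from the target (the 1.5 and 2 factors are heuristics from that paper):

```python
def adapt_beta(beta, measured_kl, target_kl):
    """Adaptive KL penalty coefficient for PPO-Penalty."""
    if measured_kl > 1.5 * target_kl:
        beta *= 2.0   # policy moved too far: penalize the KL more heavily
    elif measured_kl < target_kl / 1.5:
        beta /= 2.0   # policy barely moved: relax the penalty
    return beta
```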
PPO-Clip
PPO-Clip takes a slightly different approach to staying within the trust region. Consider the importance ratio for a single sample: rt(θ) = πθ(at | st) / πold(at | st).
Instead of constraining the mean KL, we could simply remove the incentive to adjust πnew in ways that deviate wildly from πold. (You can imagine that keeping rt close to 1 for every sample also bounds the KL divergence.) If we could enforce per-sample constraints, we would maximize: Et[rt(θ) At] subject to 1 − ε ≤ rt(θ) ≤ 1 + ε for every t.
But it’s hard to jointly impose that many constraints over a single θ. Instead, PPO modifies the objective so there is no incentive to push rt outside the interval. A naive attempt is: Et[clip(rt(θ), 1 − ε, 1 + ε) At].
But this surrogate can overestimate the true objective in two cases:
At < 0 and rt > 1 + ε, where the clipped term is capped at (1 + ε) At but the unclipped term rt At is more negative, and
At > 0 and rt < 1 − ε, where the unclipped term rt At is smaller than the clipped value (1 − ε) At.
We need the surrogate to only underestimate the true objective, because that ensures that maximizing the surrogate also maximizes a lower bound on the objective. The conservative surrogate fixes both cases by lower bounding the unclipped objective: LCLIP(θ) = Et[min(rt(θ) At, clip(rt(θ), 1 − ε, 1 + ε) At)].
This discourages large deviations from πold. In practice we also track the mean KL over the batch and stop early if it exceeds the target δ. And that’s the second variant of PPO.
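A minimal PyTorch sketch of that clipped loss, assuming old_log_probs and advantages were computed from the πold rollout and are already detached (names are illustrative):

```python
import torch

def ppo_clip_loss(policy, states, actions, old_log_probs, advantages, eps=0.2):
    # r_t(theta) = pi_theta(a_t | s_t) / pi_old(a_t | s_t), via a log-space difference
    log_probs = policy.log_prob(states, actions).squeeze(-1)
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The min keeps the surrogate a lower bound on the unclipped objective
    return -torch.min(unclipped, clipped).mean()
```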
Scaling
The methods above assume trajectories are sampled on-policy from the current π. At scale, actors may lag behind the learner, or the data may be entirely offline.
In future posts, I plan to cover off-policy methods such as DDPG, TD3, and SAC, as well as large-scale variants like IMPALA and V-trace. I also plan a primer on incorporating human feedback using GRPO and non-RL approaches like DPO.
If you liked this material and want a reference for these algorithms and more, I recommend Lilian Weng’s overview of policy gradients.