What You Didn’t Learn in Berkeley CS 188 — Part 2
Implementing the policy gradient methods: REINFORCE, A2C, TRPO, PPO.
In my last post, I covered classical reinforcement learning methods. Some of these appeared in CS 188, but not at the depth needed to understand why they work. In this post, I show how these basic methods can be rethought or extended to handle very large state spaces or continuous action spaces.
If you recall, Q-learning, value iteration, and other tabular methods require storing a full table of values for every state or state–action pair. The policy is implicit in those values: iterate over the actions and pick the one that maximizes expected value.
Even in a continuous state space, the idea still applies. Define a parameterized Q-function Qθ(s, a) and an implicit greedy policy π(s) = argmaxa Qθ(s, a).
When you sample an action a in state s, observe reward r, and land in s’, define a Bellman-type residual: δ = r + γ maxa’ Qθ(s’, a’) − Qθ(s, a).
You can take gradient steps on θ to reduce this residual, typically by minimizing its square. In practice, the target term (r + γ maxa’ Qθ(s’, a’)) is computed with a stale copy of the parameters θ⁻, called a target network, to reduce the instability of chasing a moving target under stochastic rewards and transitions. This is a Deep Q-Network (DQN).
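As a minimal sketch of that loss in PyTorch, assuming a q_net and a periodically synced target_net with the same architecture (the names are illustrative, not taken from the repo linked below):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    # batch: (states, action indices, rewards, next states, done flags)
    s, a, r, s_next, done = batch
    # Q_theta(s, a) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # r + gamma * max_a' Q(s', a'), computed with the stale target network
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    # Minimize the squared Bellman residual
    return F.mse_loss(q_sa, target)
```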
That works for discrete action spaces. In continuous action spaces, computing maxaQθ(s, a) is generally intractable. This motivates learning the policy directly rather than inferring it from Q-values. We introduce a policy over a continuous action space, that is, a probability density function. Let πθ(a | s) be a policy parameterized by θ, for example the parameters of a Gaussian. If we can properly define a loss function, we can optimize θ using SGD or Adam.
Introducing the policy gradient methods. These are the methods you’ll often hear about if you scroll X. In this post, I implement a couple of these methods on the Pendulum environment.
Code: https://github.com/neelsomani/policy-gradient
REINFORCE: Policy-Gradient Derivation
This is a common derivation which you can find in many places. Let:
πθ be the policy,
τ = [(s1, a1), …, (sn, an)] be a trajectory,
Gτ be the discounted return of τ,
ϕ(τ) = ∏t=1,…,nP(st+1 | st, at) ∏t=1,…,n πθ(at | st) be the probability of τ.
Then we can define our objective as the expected discounted return: J(θ) = Eτ∼ϕ[Gτ] = ∫ ϕ(τ) Gτ dτ.
The gradient of the objective function is: ∇θJ(θ) = ∇θ ∫ ϕ(τ) Gτ dτ = ∫ ∇θϕ(τ) Gτ dτ.
But we don’t want to compute the product rule across ∏t=1,…,n πθ(at | st). The classic way to get around that is the log-trick, ∇f = f ∇log(f): ∇θJ(θ) = ∫ ϕ(τ) ∇θlog ϕ(τ) Gτ dτ = Eτ∼ϕ[∇θlog ϕ(τ) Gτ] = Eτ∼ϕ[(Σt=1,…,n ∇θlog πθ(at | st)) Gτ], where the last step uses the fact that the transition probabilities do not depend on θ.
To reduce the variance of the gradient, we use the equivalent unbiased estimator that pushes the return inside the sum: ∇θJ(θ) = Eτ∼ϕ[Σt=1,…,n ∇θlog πθ(at | st) Gt], where Gt is the return from time t onward.
The Causality Argument
We now justify focusing on the return from time t onward. First expand the trajectory expectation as a tower of expectations: Eτ∼ϕ[X(τ)] = Es1,a1[Es2,a2 | s1,a1[… Esn,an | s1,a1,…,sn−1,an−1[X(τ)] …]].
For any fixed t, ∇θlog πθ(at | st) Gτ = ∇θlog πθ(at | st) (const + γ^(t−1) Gt),
and the “const” term depends only on (s1, a1), …, (st−1, at−1). Taking the conditional expectation over at ∼ πθ(· | st) and using the log-trick in reverse, Eat∼πθ(· | st)[∇θlog πθ(at | st) · const] = const · ∫ ∇θπθ(a | st) da = const · ∇θ ∫ πθ(a | st) da = const · ∇θ1 = 0,
so all terms prior to t vanish in expectation. Define the Monte Carlo return from t: Gt = Σk=t,…,n γ^(k−t) rk,
then: ∇θJ(θ) = Eτ∼ϕ[Σt=1,…,n ∇θlog πθ(at | st) Gt] (dropping the γ^(t−1) weighting in front of each term, as is standard in implementations).
This argument is often called causality.
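In code, computing Gt for every timestep of a sampled trajectory is a single backward sweep over the rewards. A minimal helper of my own (illustrative, not necessarily how the linked repo does it):

```python
import torch

def returns_to_go(rewards, gamma=0.9):
    """G_t = sum_{k >= t} gamma^(k - t) * r_k for one trajectory."""
    returns = torch.zeros(len(rewards))
    running = 0.0
    # Walk backward so each step reuses the discounted tail already accumulated
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```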
Implementing REINFORCE in PyTorch
In PyTorch, you don’t pass gradients in directly: you define a loss built from PyTorch primitives and let autograd differentiate it. Anything you need to differentiate the loss with respect to must therefore be expressed with those primitives. A common surrogate for the objective above is L(θ) = Σt=1,…,n log πθ(at | st) Gt, summed over the sampled trajectories,
which satisfies ∇θL(θ) = Σt=1,…,n ∇θlog πθ(at | st) Gt, i.e., a sample of the policy gradient above.
As long as we sample trajectories in an unbiased way, we are optimizing with respect to an unbiased estimate of ∇θJ(θ).
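Concretely, a minimal version of that surrogate, assuming a policy module exposing a log_prob(states, actions) method (an assumption for this sketch; the linked repo may organize things differently):

```python
import torch

def reinforce_loss(policy, states, actions, returns):
    """Negative surrogate: minimizing this takes an ascent step on the policy gradient."""
    # log pi_theta(a_t | s_t) for every step in the batch
    log_probs = policy.log_prob(states, actions)
    # Returns are targets, not something to differentiate through
    return -(log_probs * returns.detach()).sum()
```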
How do we represent π(a | s) in continuous action spaces? We could try to build a model that takes (s, a) and outputs a probability, but a pdf must be non-negative and integrate to 1. Neural nets output arbitrary real numbers. With a discrete action set we could normalize with a softmax, but that does not extend to a continuum of actions. Instead, we make the network output the parameters of a distribution, for example a Gaussian with mean μθ(s) and scale σθ(s), then sample from it.
For Pendulum, actions lie in [-2, 2]. One way to respect those bounds: a tanh head gives (-1, 1), which we scale by 2 to cover (-2, 2).
We also need σ > 0. Rather than predict σ directly, predict log(σ) and map it back with exponentiation or softplus.
There are a ton of tricks like this for enforcing output constraints:
Use the raw head if the output can range over all of R
Use tanh or sigmoid to keep the output in a bounded range while staying differentiable
Clip if you don’t need gradients outside the range (common for log(σ))
Exponentiate or apply softplus to map the output to (0, ∞)
Typically in PyTorch, the module’s forward method returns deterministic parameters (μ, σ), and sampling happens in a separate method.
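Here is a minimal sketch of such a policy for Pendulum, combining the tricks above (tanh-scaled mean, softplus on a raw head for σ). It is illustrative and not necessarily identical to the code in the linked repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim=3, hidden=64, action_scale=2.0):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mu_head = nn.Linear(hidden, 1)     # raw mean, squashed in forward()
        self.sigma_head = nn.Linear(hidden, 1)  # raw value, mapped to sigma > 0
        self.action_scale = action_scale

    def forward(self, obs):
        h = self.body(obs)
        # tanh keeps the mean in (-1, 1); scaling by 2 covers Pendulum's [-2, 2]
        mu = self.action_scale * torch.tanh(self.mu_head(h))
        # softplus keeps sigma strictly positive; the epsilon avoids collapse to zero
        sigma = F.softplus(self.sigma_head(h)) + 1e-5
        return mu, sigma

    def sample(self, obs):
        mu, sigma = self(obs)
        dist = torch.distributions.Normal(mu, sigma)
        action = dist.sample()
        return action, dist.log_prob(action)

    def log_prob(self, obs, actions):
        mu, sigma = self(obs)
        return torch.distributions.Normal(mu, sigma).log_prob(actions)
```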
Still, even with a correct REINFORCE, convergence can be slow.
Baselines and the justification for A2C
From the original REINFORCE paper, subtracting a baseline B(st) leaves the gradient unbiased: ∇θJ(θ) = Eτ∼ϕ[Σt=1,…,n ∇θlog πθ(at | st) (Gt − B(st))].
Proof: Eat∼πθ(· | st)[∇θlog πθ(at | st) B(st)] = B(st) · ∇θ ∫ πθ(a | st) da = B(st) · ∇θ1 = 0,
by the same reasoning as the causality argument above.
Baselines can reduce the variance of the gradient computation. Choosing B(st) = Vπ(st) yields: ∇θJ(θ) = Eτ∼ϕ[Σt=1,…,n ∇θlog πθ(at | st) (Gt − Vπ(st))] = Eτ∼ϕ[Σt=1,…,n ∇θlog πθ(at | st) A(st, at)], since E[Gt | st, at] = Q(st, at),
where A(s, a) = Q(s,a) - Vπ(s) is called the “advantage”. Estimating Vπ with another model, called a critic network, gives the actor–critic framework.
Practical notes from my final implementation for Pendulum:
Full Monte Carlo returns had too much variance, so I used TD(0) targets for the critic.
Second, I found the algorithm was highly sensitive to γ. γ=0.99 did not converge, while γ=0.9 did. The learning rate for the optimizer barely mattered.
Finally, the log standard deviation was not learning properly; the softplus stabilization recommended in this notebook helped.
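Putting those notes together, a minimal sketch of the actor–critic losses with TD(0) critic targets, assuming the GaussianPolicy above plus a small value network (again illustrative, not the exact code in the repo):

```python
import torch
import torch.nn.functional as F

def a2c_losses(policy, value_net, batch, gamma=0.9):
    states, actions, rewards, next_states, dones = batch
    v_s = value_net(states).squeeze(-1)
    with torch.no_grad():
        # TD(0) target: r + gamma * V(s') for non-terminal transitions
        targets = rewards + gamma * (1 - dones) * value_net(next_states).squeeze(-1)
    critic_loss = F.mse_loss(v_s, targets)
    # Advantage estimate, detached so it only scales the actor gradient
    advantages = (targets - v_s).detach()
    log_probs = policy.log_prob(states, actions).squeeze(-1)
    actor_loss = -(log_probs * advantages).mean()
    return actor_loss, critic_loss
```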
From TRPO to PPO
Motivation: Reusing Data
Suppose you compute a batch of trajectories under πold to estimate the policy gradient: ∇θJ(θ) ≈ (1/N) Σi=1,…,N Σt ∇θlog πθ(at | st) At, evaluated at θ = θold and averaged over N trajectories.
All of that work gives you a single update to θ. After you update, the batch is no longer on-policy. To reuse the data, you would need to reweight old observations so that expectations match those under πnew: Eτ∼πnew[f(τ)] = Eτ∼πold[(∏t πnew(at | st) / πold(at | st)) f(τ)],
but the product of ratios has high variance: if each per-step ratio is around 1.1, a 100-step trajectory already carries a weight of roughly 1.1^100 ≈ 14,000.
Performance Difference Lemma
To simplify things, we’re going to define the “discounted state visitation” distribution: dπ(s) = (1 − γ) Σt=1,…,∞ γ^(t−1) P(st = s | π).
Then, as we’ll prove in this section, here’s what’s called the “performance difference lemma”: J(πnew) − J(πold) = (1/(1 − γ)) Es∼dπ_new, a∼πnew[Aπ_old(s, a)].
Notice you’re taking an expectation over πnew, but you’re computing the advantages based on πold.
First, note that: J(πold) = Es1[Vπ_old(s1)], and the initial state s1 has the same distribution under both policies.
Then: J(πnew) − J(πold) = Eτ∼πnew[Σt≥1 γ^(t−1) rt] − Es1[Vπ_old(s1)] = Eτ∼πnew[Σt≥1 γ^(t−1) rt − Vπ_old(s1)].
Now, we’ll use an add & subtract trick to expose the advantage within the expectation: adding and subtracting γ^(t−1) Vπ_old(st) for every t ≥ 2 telescopes against −Vπ_old(s1), so J(πnew) − J(πold) = Eτ∼πnew[Σt≥1 γ^(t−1) (rt + γ Vπ_old(st+1) − Vπ_old(st))].
This is useful because we can write: E[rt + γ Vπ_old(st+1) | st, at] = Qπ_old(st, at), so each term is, in expectation, the advantage Aπ_old(st, at), and J(πnew) = J(πold) + (1/(1 − γ)) Es∼dπ_new, a∼πnew[Aπ_old(s, a)].
Maximizing J(θnew) amounts to maximizing: Es∼dπ_new, a∼πnew[Aπ_old(s, a)],
since the first term, J(πold), is a constant and 1/(1 − γ) is just a positive scale. The issue is that this expectation relies on πnew (in the distributions of both the actions and the states), which would require resampling.
In theory we could importance weight both the state distribution and the action distribution: Es∼dπ_new, a∼πnew[Aπ_old(s, a)] = Es∼dπ_old, a∼πold[(dπ_new(s) / dπ_old(s)) (πnew(a | s) / πold(a | s)) Aπ_old(s, a)].
We do not have access to dπ_new, so that ratio cannot be computed. Instead, TRPO assumes dπ_new ≈ dπ_old.
While we cannot enforce this directly, the shift in the state distribution can be bounded by controlling a policy divergence. The resulting guarantee has the form J(πnew) ≥ Lπ_old(πnew) − C maxs DKL(πold(· | s) ‖ πnew(· | s)), where Lπ_old(πnew) is the surrogate that uses dπ_old in place of dπ_new, C = 4εγ/(1 − γ)^2, and ε = maxs,a |Aπ_old(s, a)|,
which is the authors’ lower bound on J(πnew). In practice we constrain the expected KL under dπ_old, which is tractable.
With the state distribution approximated as unchanged, the remaining reweighting by πnew/πold is the familiar importance-sampling ratio. TRPO’s final surrogate and constraint become: maximize over θ the quantity Es∼dπ_old, a∼πold[(πθ(a | s) / πold(a | s)) Aπ_old(s, a)] subject to Es∼dπ_old[DKL(πold(· | s) ‖ πθ(· | s))] ≤ δ.
Proximal Policy Optimization (PPO)
PPO-Penalty
The first variant of PPO comes directly from the Lagrangian relaxation of the TRPO problem with multiplier λ > 0: maximize over θ the quantity Es∼dπ_old, a∼πold[(πθ(a | s) / πold(a | s)) Aπ_old(s, a)] − λ (Es∼dπ_old[DKL(πold(· | s) ‖ πθ(· | s))] − δ).
Dropping the constant λδ and defining β := λ, we arrive at the standard form: maximize over θ the quantity Es∼dπ_old, a∼πold[(πθ(a | s) / πold(a | s)) Aπ_old(s, a) − β DKL(πold(· | s) ‖ πθ(· | s))].
In practice, β is adapted so the empirical KL stays close to δ.
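One simple adaptation rule, the one suggested in the PPO paper, doubles or halves β depending on how far the measured KL has drifted from the target (the 1.5 and 2 factors are heuristics from that paper):

```python
def adapt_beta(beta, measured_kl, target_kl):
    """Adaptive KL penalty coefficient for PPO-Penalty."""
    if measured_kl > 1.5 * target_kl:
        beta *= 2.0   # policy moved too far: penalize the KL more heavily
    elif measured_kl < target_kl / 1.5:
        beta /= 2.0   # policy barely moved: relax the penalty
    return beta
```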
PPO-Clip
PPO-Clip takes a slightly different approach to staying within the trust region. Consider the importance ratio for a single sample: rt(θ) = πθ(at | st) / πold(at | st).
Instead of constraining the mean KL, we could simply remove the incentive to adjust πnew in ways that deviate wildly from πold. (You can imagine that keeping rt close to 1 for every sample also bounds the KL divergence.) If we could enforce per-sample constraints, we would maximize: Et[rt(θ) At] subject to 1 − ε ≤ rt(θ) ≤ 1 + ε for every t.
But it’s hard to jointly impose that many constraints over a single θ. Instead, PPO modifies the objective so there is no incentive to push rt outside the interval. A naive attempt is: Et[clip(rt(θ), 1 − ε, 1 + ε) At].
But this surrogate can overestimate the true objective in two cases:
At < 0 and rt > 1 + ε, where the clipped term is capped at (1 + ε) At but the unclipped term rt At is more negative, and
At > 0 and rt < 1 − ε, where the unclipped term rt At is smaller than the clipped value (1 − ε) At.
We need the surrogate to only underestimate the true objective, because that ensures that maximizing the surrogate also maximizes a lower bound on the objective. The conservative surrogate fixes both cases by lower bounding the unclipped objective: LCLIP(θ) = Et[min(rt(θ) At, clip(rt(θ), 1 − ε, 1 + ε) At)].
This discourages large deviations from πold. In practice we also track the mean KL over the batch and stop early if it exceeds the target δ. And that’s the second variant of PPO.
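A minimal PyTorch sketch of that clipped loss, assuming old_log_probs and advantages were computed from the πold rollout and are already detached (names are illustrative):

```python
import torch

def ppo_clip_loss(policy, states, actions, old_log_probs, advantages, eps=0.2):
    # r_t(theta) = pi_theta(a_t | s_t) / pi_old(a_t | s_t), via a log-space difference
    log_probs = policy.log_prob(states, actions).squeeze(-1)
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The min keeps the surrogate a lower bound on the unclipped objective
    return -torch.min(unclipped, clipped).mean()
```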
Scaling
The methods above assume trajectories are sampled on-policy from the current π. At scale, actors may lag behind the learner, or the data may be entirely offline.
In future posts, I plan to cover off-policy methods such as DDPG, TD3, and SAC, as well as large-scale variants like IMPALA and V-trace. I also plan a primer on incorporating human feedback using GRPO and non-RL approaches like DPO.
If you liked this material and want a reference for these algorithms and more, I recommend Lilian Weng’s overview of policy gradients.