What You Didn't Learn in Berkeley CS 188 — Part 3
Off-policy methods, for better sample efficiency and scalability.
So far in this series on reinforcement learning, we’ve covered classical methods and the foundations of continuous-control methods.
Let’s say you want to use these methods at scale. Ideally, we’d have a method to run as many actors as we want, alongside some way to consolidate the results.
Basic knowledge of PPO tells us that we can do this as long as the actors are using a policy that isn’t “too far” from the latest policy πcurrent. But that restriction inherently caps how many actors we can run concurrently. If an update drifts πcurrent too far, then all of the other actors’ work becomes much less valuable.
This naturally leads us to the “off-policy” methods: DDPG, TD3, and SAC.
Deep Deterministic Policy Gradient (DDPG)
One nice thing about Q-learning was that, in theory, you could fill out the state-action table in parallel. Work was never really wasted, because every visit to an (s, a) pair contributed information toward the optimal value function.
Of course, the argument for convergence of Q-learning relied on the contraction property of the Bellman update operator. That argument breaks down once we leave the tabular setting: with large or continuous state spaces we have to approximate Q with something like a neural network (as in DQN), and with function approximation we can no longer guarantee that the policy strictly improves the way it does in the tabular method. In fact, there is no clean convergence proof for DQN.
Regardless, the question remains: is there a reasonable method that never “throws away” old samples and can use every (s, a) collected? Note that the reason we couldn’t use Q-learning in a continuous action space is that we couldn’t solve the maxₐ in this update:
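For reference, the tabular update in question is the standard Q-learning rule (α here is a learning rate):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\Big(r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\Big)$$

The maxₐ′ is trivial over a small discrete action set, but with continuous actions it is itself an optimization problem.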
What if we train a network to output the maximizing action directly, so that Q(s′, μ(s′)) stands in for maxₐ′ Q(s′, a′)? Is it possible to build an algorithm around this?
Derivation of the Deterministic Policy Gradient Theorem
The deterministic policy gradient theorem is a way to differentiate our objective function by integrating over states rather than actions. Let’s define a “deterministic policy” μ:
and its objective:
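One standard way to write these (a sketch of the notation, assuming a discount factor γ and reward function r(s, a)):

$$a_t = \mu_\theta(s_t), \qquad J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r\big(s_t, \mu_\theta(s_t)\big)\right] = \sum_{t=0}^{\infty} \gamma^t \sum_{s} \Pr[s_t = s]\, r\big(s, \mu_\theta(s)\big)$$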
Our goal is to compute ∇θJ(θ) so we can perform gradient ascent, starting from the expression above.
Direct differentiation is messy because Pr[st = s] depends on θ. We’d like to express this in terms of value functions instead, and eliminate the Pr[st = s] term.
Intuitively, this discounted, probability-weighted sum of rewards over states is just the expected value of the start state, 𝔼s₀[V(s₀)]. Let’s prove that equivalence formally.
Step 1. Relating reward and value
We relate r and V via the Bellman equation:
and define the “discounted state visitation”:
Then:
So we want an expression for ∇Σₛ pθ(s)Vμ_θ(s) and/or ∇Σₛ pθ(s)γ𝔼[Vμ_θ(s′)].
Step 2. Recursive property of state visitation
We expand transitions:
Sum over t:
Apply discounting by γ:
Define:
The left-hand side is pθ(s′) except it omits t = 0:
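Written out, assuming the definition pθ(s) := Σₜ γᵗ Pr[sₜ = s] from Step 1 and writing d₀ for the start-state distribution, the relation is:

$$p_\theta(s') = d_0(s') + \gamma \sum_{s} p_\theta(s)\, P\big(s' \mid s, \mu_\theta(s)\big)$$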
In words: Each state’s discounted occupancy pθ(s′) consists of two parts: the initial-state contribution d₀(s′) and the γ-discounted flow of visitation mass from predecessor states, weighted by their transition probabilities under μθ.
It’s starting to look pretty close to what we want above.
Step 3. Multiplying by V(s′) and summing over s′
Multiply both sides by V(s′) and sum over s′:
Recognize that the inner sum is an expectation:
So:
Now recall our earlier expression for the gradient, and substitute our expression above back in:
Or equivalently,
Step 4. Chain rule on Q
Now the expression is much friendlier. How do we differentiate V(s)? Since V(s) = Q(s, μθ(s)) for deterministic policies, we apply the multivariable chain rule:
The first term, ∇θQ(s,a), is recursive with respect to the original gradient ∇J via the Bellman equation, except now it’s over the distribution of s1 rather than the starting distribution of s0:
Unrolling this recursion gives:
Continuing indefinitely:
Equivalently, using the discounted state-visitation form:
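Written out in the notation above (treating pθ as the discounted state-visitation measure), the resulting statement, matching the original DPG paper, is:

$$\nabla_\theta J(\theta) = \sum_{s} p_\theta(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s, a)\big|_{a = \mu_\theta(s)} = \mathbb{E}_{s \sim p_\theta}\Big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s, a)\big|_{a = \mu_\theta(s)}\Big]$$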
This is the Deterministic Policy Gradient Theorem. The key difference from the traditional (stochastic) policy gradient theorem is that the expectation is taken only over the state distribution; there is no expectation over actions, because the gradient of Q is evaluated at the single action a = μθ(s).
That single shift, integrating over pθ(s) instead of sampling from π(a|s), makes deterministic continuous-control methods tractable and forms the foundation for DDPG and its successors.
Practical Considerations of DDPG
We can’t compute this expectation analytically, so we approximate Q and μ with neural networks.
The first relaxation we need to make, in line with our effort to increase sample efficiency, is to allow ourselves to use samples that don’t necessarily come from the current state distribution pθ. The original DPG paper argues that, for any behavior distribution with sufficient coverage of the state space, we can take the expectation under that distribution instead (dropping a correction term) and still obtain a useful gradient estimator.
Thus, we can reuse samples from “replay buffers,” where we store all of the previous samples that we’ve observed, even after our gradient has had many updates. This is the key to off-policy learning.
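As a concrete sketch (class and parameter names here are illustrative, not from any particular codebase), a replay buffer is usually just a bounded FIFO of transitions with uniform sampling:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s', done) transitions."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=256):
        # Uniform sampling: every stored transition stays usable no matter
        # which (older) policy generated it -- the core of off-policy learning.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```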
Second, DDPG uses two networks: the actor μθ and the critic Qφ. During exploration, Gaussian noise is added to μθ(s). The critic is trained by minimizing the mean-squared Bellman error 𝔼[(Qφ(s, a) − y)²], where the target y = r + γ Qφ′(s′, μθ′(s′)) is computed with the frozen target networks introduced below,
and the actor loss is simply:
# PyTorch-style pseudocode for the DDPG actor update
actions = actor(states)               # a = μ_θ(s) for a minibatch of sampled states
q_values = critic(states, actions)    # Q_φ(s, μ_θ(s)), differentiable w.r.t. θ through `actions`
actor_loss = -q_values.mean()         # gradient ascent on Q = gradient descent on -Q
actor_loss.backward()                 # backprop through the critic into the actor's parameters
In practice, DDPG maintains frozen target networks Qφ′ and μθ′. These networks are updated “softly”, via Polyak averaging θ′ ← τθ + (1 − τ)θ′ and φ′ ← τφ + (1 − τ)φ′,
for small τ. In practice, this stabilizes the bootstrapped targets and reduces the variance of the updates. But buyer beware: DDPG is still known to be very unstable and sensitive to hyperparameters.
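A minimal sketch of that soft update in PyTorch-style code (function and variable names are illustrative):

```python
import torch

def soft_update(target_net, online_net, tau=0.005):
    """Polyak-average the online network's weights into the frozen target network."""
    with torch.no_grad():
        for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
            # theta' <- tau * theta + (1 - tau) * theta'
            p_target.mul_(1.0 - tau).add_(tau * p_online)
```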
Twin Delayed DDPG (TD3)
TD3 is directly a response to DDPG. It makes three changes to DDPG, all starting with the letter D, two of which are fairly simple:
Double critics: Instead of one critic network, now we have two, and we use min(Q_{φ′₁}, Q_{φ′₂}). The reasoning is that Q-networks can be spiky and randomly assign too high of a value to some states. μ_θ(s) is computing arg maxₐ Q(s, a), which implicitly relies on maxₐ Q(s, a). This leads to systematic over-estimation of the true Q-value. By taking min(Q_{φ′₁}, Q_{φ′₂}), we counteract this overestimation bias and err toward underestimating the true objective function.
Delayed actor updates: The heuristic reasoning here (not a formal convergence proof) is structurally identical to the convergence argument for policy iteration: we want the value estimates to settle first, and only then update the policy. TD3 approximates this by updating the actor only once for every two critic updates.
The last change to DDPG, called deterministic target smoothing, is more nuanced.
Deterministic Target Smoothing Analysis
The critical insight here is that (s, a) is continuous, and therefore it’s extremely unlikely that you’ll ever hit the same (s, a) twice. That means that if you have a randomly high Q(s′, a′) from initialization, that spike will permanently distort that part of the Q-network. Worse, it propagates downstream through the Bellman update:
This contamination affects nearby states too, since Q-networks generalize over continuous space. The double critics mitigate but don’t eliminate this problem.
We solve it by smoothing out that local, spurious (s, a) pair with its neighbors. Before, the target simply bootstrapped from Q at the target policy’s exact action μθ′(sₜ₊₁).
Instead, TD3 averages the target over a small neighborhood of that action:
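In the usual notation (a sketch consistent with the double critics above; σ̃ and c are the noise scale and clipping bound), the smoothed target is:

$$y_t = r_t + \gamma\, \mathbb{E}_{\epsilon}\Big[\min_{i=1,2} Q_{\phi_i'}\big(s_{t+1},\, \mu_{\theta'}(s_{t+1}) + \epsilon\big)\Big], \qquad \epsilon \sim \mathrm{clip}\big(\mathcal{N}(0, \tilde{\sigma}^2),\, -c,\, c\big)$$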
Maybe a single (s, a) pair has a random spike, but on average, the neighborhood of points is likely to be reasonable. Computing that expectation is intractable, but we can approximate it via Monte Carlo. In practice, a single ε-sample per update is sufficient to get a good estimate.
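A sketch of that single-sample target computation in PyTorch-style code (names such as actor_target and critic1_target are illustrative; inputs are assumed to be batched tensors):

```python
import torch

def td3_target(rewards, next_states, dones,
               actor_target, critic1_target, critic2_target,
               gamma=0.99, sigma=0.2, noise_clip=0.5, max_action=1.0):
    """Compute the TD3 critic target with clipped-noise smoothing and twin critics."""
    with torch.no_grad():
        mu_next = actor_target(next_states)
        # One clipped-Gaussian sample per transition approximates the expectation over epsilon.
        noise = (torch.randn_like(mu_next) * sigma).clamp(-noise_clip, noise_clip)
        next_actions = (mu_next + noise).clamp(-max_action, max_action)

        # Twin critics: take the element-wise minimum to counteract overestimation bias.
        q_next = torch.min(critic1_target(next_states, next_actions),
                           critic2_target(next_states, next_actions))

        return rewards + gamma * (1.0 - dones) * q_next
```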
If we take a second-order Taylor expansion of Q around the mean action μθ′(st+1), we find that the expectation equals Q at the mean action plus a term proportional to the Laplacian (the trace of the Hessian) of Q with respect to the action. Sharp spikes are local maxima with strongly negative curvature, so the smoothed target pulls them back down: the added noise regularizes curvature and keeps the targets from chasing spurious spikes in Q.
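A sketch of that expansion, for zero-mean noise with per-dimension variance σ² and ignoring the clipping:

$$\mathbb{E}_{\epsilon}\big[Q(s_{t+1}, a + \epsilon)\big] \approx Q(s_{t+1}, a) + \tfrac{\sigma^2}{2}\,\operatorname{tr}\!\big(\nabla_a^2 Q(s_{t+1}, a)\big), \qquad a = \mu_{\theta'}(s_{t+1})$$

A sharp spike is a local maximum with strongly negative curvature, so its smoothed target sits below the spike itself.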
Together, these three changes to DDPG make TD3 a stable and popular alternative.
Soft Actor-Critic (SAC)
It would be tidy if SAC were simply a natural continuation of DDPG/TD3. That framing doesn’t do it justice, though, because SAC is philosophically and foundationally distinct from all of the methods that we’ve covered so far, even in previous posts.
Philosophical Justification
The thing about our original objective function (J(θ) := 𝔼τ[G(τ)]) is that it only cares about expected return, not diversity of exploration. All of the exploration we’ve done so far has been either by solving for the parameters of a stochastic policy (e.g. PPO) or by adding randomness to the actions (e.g. DDPG). The optimal policy is still allowed to collapse into a deterministic mapping.
The reality is that there are many possible policies that could explain our observations. Which one should we prefer?
The principle of maximum entropy tells us that, out of all the distributions consistent with our observations, we should pick the one with maximum entropy. This is the distribution that makes the fewest assumptions about the underlying data.
For example, if we have a variable x with a known constraint 𝔼p[f(x)] = c, we should solve: maximize the entropy H(p), subject to 𝔼p[f(x)] = c and Σₓ p(x) = 1.
Writing the Lagrangian with a multiplier β for the constraint and λ for normalization, and setting its derivative with respect to p(x) to zero, yields:
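The stationarity condition is −log p(x) − 1 + β f(x) + λ = 0; solving it and normalizing (a standard computation) gives:

$$p^*(x) = \frac{\exp\big(\beta f(x)\big)}{\sum_{x'} \exp\big(\beta f(x')\big)}$$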
This is the Boltzmann distribution, the unique maximum-entropy distribution that satisfies the constraint.
Applying to Reinforcement Learning
Applying this idea to RL, if we assume we want some level of return R̂, then we should pick (in discrete form) the trajectory distribution q that maximizes its entropy H(q), subject to 𝔼q[R(τ)] = R̂ and Στ q(τ) = 1.
Here q(τ) is a hypothetical “best” trajectory distribution that reflects our observations, and 𝔼q denotes expectation over trajectories induced by q. Of course, as we push R̂ toward the maximum achievable return, the resulting entropy shrinks toward 0.
Forming the Lagrangian with a multiplier β for the return constraint (plus one for normalization) and setting its derivative with respect to q(τ) to zero gives q(τ) ∝ exp(β R(τ)).
Setting α := 1/β, we obtain the canonical maximum entropy form used in SAC:
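That is (a sketch in the notation used here, with R(τ) the trajectory return):

$$q^*(\tau) \;\propto\; \exp\!\Big(\tfrac{1}{\alpha} R(\tau)\Big)$$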
The last missing piece is that we’ve defined what the optimal trajectory distribution looks like, but the agent only controls the policy distribution. The environment dynamics also affect the likelihood of any trajectory:
Using the definition of conditional probability (Pr[A | B] = Pr[A and B] / Pr[B]), we include this prior over trajectories:
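One standard way to write the result (consistent with the p₀(τ) notation used below, where p₀ collects only the start-state and transition probabilities):

$$p_0(\tau) := d_0(s_0) \prod_{t} P(s_{t+1} \mid s_t, a_t), \qquad q^*(\tau) \;\propto\; p_0(\tau)\, \exp\!\Big(\tfrac{1}{\alpha} R(\tau)\Big)$$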
Properties of the Optimal Policy
So we’ve solved for the optimal trajectory distribution q*. But we can only control the policy πθ(a|s), which induces its own trajectory distribution through the environment dynamics:
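Explicitly, with the same p₀(τ) as above:

$$p_\pi(\tau) = d_0(s_0) \prod_{t} \pi(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t) = p_0(\tau) \prod_{t} \pi(a_t \mid s_t)$$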
The question becomes: which policy-induced trajectory distribution pπ(τ) is closest to q*(τ)? This leads naturally to minimizing the KL divergence:
Expanding:
Since log p_π(τ) − log p₀(τ) = ∑ₜ log π(a_t|s_t), this simplifies to:
Rewriting as a maximization gives:
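Up to constants, this is the familiar maximum-entropy RL objective:

$$\pi^* = \arg\max_\pi \; \sum_{t} \mathbb{E}_{(s_t, a_t) \sim p_\pi}\Big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]$$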
(Multiplying by the constant α doesn’t change the outcome of the argmax.) Thus, SAC explicitly maximizes both expected reward and policy entropy, balancing exploitation with exploration in a single unified framework.
State Value and Q-Function Derivation
To implement this, we’re going to need expressions for both state values and Q-values. The state value function is just the objective above, with discounting:
By the Bellman equations:
and the corresponding Q-function:
We can rewrite V* in terms of Q*:
This form is friendlier because now we’re only taking an expectation over the action a. That expectation is equal to:
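That is, folding the entropy bonus into the expectation (a sketch in the standard soft-value notation):

$$V^*(s) = \mathbb{E}_{a \sim \pi^*(\cdot \mid s)}\Big[ Q^*(s, a) - \alpha \log \pi^*(a \mid s) \Big]$$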
Let’s try to solve for that V*. We treat it as a maximization over the per-state action distribution: writing the Lagrangian with a multiplier for the normalization constraint and setting its derivative with respect to π(a|s) to zero gives π(a|s) ∝ exp(Q*(s, a)/α); normalizing yields the softmax policy π*(a|s) = exp(Q*(s, a)/α) / ∫ exp(Q*(s, a′)/α) da′.
Now we can substitute back into V:
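Substituting the softmax policy into the expectation above gives the standard log-sum-exp (“soft max”) form:

$$V^*(s) = \alpha \log \int \exp\!\Big(\tfrac{1}{\alpha} Q^*(s, a)\Big)\, da$$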
Soft Policy Improvement
Finally, we need a way to actually compute the optimal policy π*. We could in theory fit a network directly to the definition of π* above, but in continuous action spaces that normalizing denominator is intractable. Instead, we define a “soft policy improvement” operator:
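Following the SAC paper’s soft policy improvement step (a sketch, with the temperature α made explicit and Z(s) the intractable normalizer from the softmax policy above):

$$\pi_{\text{new}}(\cdot \mid s) = \arg\min_{\pi'} \; D_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s) \,\Big\|\, \frac{\exp\!\big(\tfrac{1}{\alpha} Q^{\pi_{\text{old}}}(s, \cdot)\big)}{Z(s)} \right)$$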
This operator has the same fixed point as the optimal policy above. (You can try substituting it in.) In practice, we minimize its negative:
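A sketch of that practical objective, with D the replay buffer:

$$J_\pi(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\Big[\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ \alpha \log \pi_\theta(a \mid s) - Q_\phi(s, a) \big] \Big]$$

In practice the inner expectation is estimated with the reparameterization trick, so gradients flow through the sampled action into θ.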
The critic is still trained by regression to a soft Bellman target that includes the entropy bonus:
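A sketch in the same notation, with a′ sampled fresh from the current policy at the next state and Qφ′ᵢ the target critics:

$$y = r + \gamma \Big( \min_{i=1,2} Q_{\phi_i'}(s', a') - \alpha \log \pi_\theta(a' \mid s') \Big), \qquad a' \sim \pi_\theta(\cdot \mid s')$$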
Finally, two practical details: SAC uses double critics (like TD3) to mitigate overestimation bias, and it updates the temperature α automatically to maintain a target entropy.
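A sketch of the automatic temperature update in PyTorch-style code (variable names are illustrative; the target entropy is typically set to minus the action dimensionality):

```python
import torch

action_dim = 6                       # example: a 6-dimensional action space
target_entropy = -float(action_dim)  # common heuristic: -|A|

# Learn log(alpha) so that alpha = exp(log_alpha) stays positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_temperature(log_probs):
    """log_probs: log pi(a|s) for actions freshly sampled from the current policy."""
    # Push entropy toward the target: alpha grows when the policy is too deterministic.
    alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()    # current alpha for the actor and critic losses
```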
SAC’s soft Bellman operator is a γ-contraction in the tabular case, ensuring convergence under idealized assumptions.
Wrapping Up
That concludes our discussion of the off-policy methods. In the next and final post of the series, I’ll cover incorporating human feedback, which is relevant in post-training LLMs: DPO and GRPO.