16. Policy gradients
In this last lecture on planning, we look at policy search through the lens of applying gradient ascent. We start by proving the so-called policy gradient theorem, which is then shown to give rise to an efficient way of constructing noisy but unbiased gradient estimates in the presence of a simulator. We discuss at a high level the ideas underlying gradient ascent and stochastic gradient ascent methods (as opposed to the more common case in machine learning where the goal is to minimize a loss, or objective function, here we are maximizing rewards, hence ascending on the objective rather than descending). We then find out about the limitations of policy gradients even in the presence of “perfect representation” (unrestricted policy classes, the tabular case) and perfect gradient information, which motivates the introduction of a variant known as “natural policy gradients” (NPG). We then uncover a close relationship between this method and Politex. The lecture concludes with a comparison of the results available for NPG and Politex.
The policy gradient theorem
Fix an MDP
denote the expected value of using policy
Hence, the derivative,
The theorem we give, though, is not limited to this case: it also applies when the action space is infinite and even when the policy is deterministic. For the theorem statement, recall that for a policy
Theorem (Policy Gradient Theorem): Fix an MDP
exists and is continuous in a neighborhood of and exists; exists and is continuous in a neighborhood of and exists;
Then,
where the last equality holds if
For the second expression, we treat
Above, the second expression, in which the derivative with respect to the parameter is moved inside, will only be valid in infinite state spaces when some additional regularity assumption is met. One such assumption is that
In words, the theorem shows that the derivative of the performance of a policy can be obtained by integrating a simple derivative that involves the action-value function of the policy.
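For concreteness, in the finite state-action case the conclusion of the theorem is commonly written in the following form; the symbols below are an assumption made here and may differ from the lecture’s own notation: $J$ is the objective, $\tilde\nu^{\pi_\theta}_\mu$ the discounted state-occupancy measure induced by $\pi_\theta$ when the initial state is drawn from $\mu$, and $q^{\pi_\theta}$ the action-value function of $\pi_\theta$:

$$
\nabla_\theta J(\theta) \;=\; \sum_{s} \tilde\nu^{\pi_\theta}_\mu(s) \sum_{a} q^{\pi_\theta}(s,a)\, \nabla_\theta \pi_\theta(a\mid s)\,.
$$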
Of the two conditions of the theorem, the first is generally the easier one to verify. In particular, the condition on the continuous differentiability of
Proof: The proof is based on a short calculation that starts with writing the value difference identity for
The details are as follows: Recall from Calculus 101 the following result: Assume that
exists and is continuous in a neighborhood of and exists; exists and is continuous in a neighborhood of and exists.
Then
Let
where the last equality just used that
Now, focusing on the first term on the right-hand-side, let
Provided that
Taking the derivative of both sides of
Now let
Finally, the conditions to apply
When the action-space is discrete and
and thus, for finite
While this can be used as the basis of evaluating (or approximating) the gradient, it may be worthwhile to point out an alternate form which is available when
Using this in
which has the pleasant property that it takes the form of an expected value over the actions of the score function of the policy map correlated with the action-value function.
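Spelled out, and provided that $\pi_\theta(a\mid s)>0$ for all state-action pairs, this form reads (in the assumed notation introduced above):

$$
\nabla_\theta J(\theta)
\;=\; \sum_{s} \tilde\nu^{\pi_\theta}_\mu(s) \sum_{a} \pi_\theta(a\mid s)\, q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a\mid s)
\;=\; \frac{1}{1-\gamma}\, \mathbb{E}\big[\, q^{\pi_\theta}(S,A)\, \nabla_\theta \log \pi_\theta(A\mid S) \,\big]\,,
$$

where in the expectation $S$ is drawn from the normalized occupancy measure $(1-\gamma)\tilde\nu^{\pi_\theta}_\mu$ and $A\sim \pi_\theta(\cdot\mid S)$.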
Before moving on it is worth pointing out that an equivalent expression is obtained if
This may have significance when using simulation to evaluate derivatives: one may attempt to use an appropriate “bias” term to reduce the variance of the estimate of the gradient. Before discussing simulation any further, it is also worthwhile to discuss what happens when the action space is infinite; but first, the short calculation below explains why such a bias term leaves the gradient unchanged.
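Indeed, for any function $b$ of the state only (a hypothetical baseline, in the assumed notation of the previous displays), the following standard identity holds:

$$
\sum_a \pi_\theta(a\mid s)\, \nabla_\theta \log \pi_\theta(a\mid s) \;=\; \nabla_\theta \sum_a \pi_\theta(a\mid s) \;=\; \nabla_\theta\, 1 \;=\; 0\,,
$$

so replacing $q^{\pi_\theta}(S,A)$ by $q^{\pi_\theta}(S,A)-b(S)$ in the expectation above does not change its value, while a good choice of $b$ can reduce the variance of the corresponding sample-based estimate.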
For countably infinite action spaces, the only difference is that
For uncountably infinite action spaces, this argument works with the minimal necessary changes. In the most general case,
In fact, this is a strictly more general form:
In the special case when
Indeed, in this case,
and hence
and thus, if either
If
which is known as the “deterministic policy gradient formula”.
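For reference, the formula of Silver et al. (2014) is commonly written as follows; the notation is again assumed rather than taken from the lecture: $\pi_\theta$ maps states to actions, the action space is a subset of a Euclidean space, and $\tilde\nu^{\pi_\theta}_\mu$ is the discounted state-occupancy measure of $\pi_\theta$:

$$
\nabla_\theta J(\theta) \;=\; \sum_{s} \tilde\nu^{\pi_\theta}_\mu(s)\, \nabla_\theta \pi_\theta(s)\, \nabla_a q^{\pi_\theta}(s,a)\big|_{a=\pi_\theta(s)}\,.
$$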
Gradient methods
The idea of gradient methods is to make small steps in the parameter space in the direction of the gradient of an objective function that is to be maximized. In the context of policy search, this works as follows: If
where for
if
or
For any
In all cases the key to the success of gradient methods is the appropriate choice of the stepsizes; these choices are based on refinements of the above simple argument showing that moving in the direction of the gradient helps. There are also ways of “speeding up” convergence; these “acceleration methods” use a refined iteration (two iterates updated simultaneously) and can greatly speed up convergence. As there are many excellent texts that describe these and other aspects of gradient methods, we will not delve into them any further, but rather give some pointers to this literature in the endnotes.
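To fix ideas, here is a minimal Python sketch of the basic gradient ascent loop; the gradient oracle grad_J, the initial point theta0 and the constant stepsize are placeholders supplied by the user and are not tied to any particular method from this lecture:

```python
import numpy as np

def gradient_ascent(grad_J, theta0, stepsize=0.01, num_iters=1000):
    """Plain gradient ascent: theta <- theta + stepsize * grad_J(theta).

    grad_J is assumed to return the gradient of the objective at theta;
    in the policy search setting it would typically be a noisy estimate,
    in which case the same loop becomes stochastic gradient ascent.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iters):
        theta = theta + stepsize * grad_J(theta)
    return theta

# Hypothetical usage on the toy concave objective J(theta) = -||theta - 1||^2,
# whose gradient is -2 (theta - 1); the iterates converge to the maximizer (1, 1, 1).
theta_star = gradient_ascent(lambda th: -2.0 * (th - 1.0), theta0=np.zeros(3))
```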
The elephant in the room here is that the gradient of
Gradient estimation
Generally speaking, there are two types of errors when constructing an estimate of the gradient: one that is purely random, and one that is not. Defining
When the gradient estimates are biased, the bias will in general put a limit on how close a gradient method can get to a stationary point. While generally a zero bias is preferred to a nonzero bias, a nonzero bias which is positively aligned with the gradient (
The next question is, of course, how to estimate the gradient. Many approaches have been proposed for this in the literature. When a simulator is available, as in our case, a straightforward approach is to start from the policy gradient theorem. Indeed, under mild regularity conditions (e.g., if there are finitely many states)
Now note that
is an unbiased estimate of
The argument to show this has partially been given earlier, in Lecture 8. One can also show that
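To make the simulator-based construction concrete, here is a hedged Python sketch of one such gradient sample, unbiased up to the occupancy-measure normalization. The interfaces (simulator, policy, grad_log_policy) are assumptions made for this sketch, and the sampling scheme (geometric stopping to sample a state from the discounted occupancy measure, followed by a geometrically stopped rollout to estimate the action value) is one standard way to implement the idea, not necessarily the construction of Lecture 8:

```python
import numpy as np

def pg_gradient_sample(simulator, policy, grad_log_policy, theta, gamma, s0, n_actions, rng):
    """One REINFORCE-style sample of the policy gradient (up to a 1/(1-gamma) factor).

    Assumed interfaces (placeholders, not the lecture's own code):
      simulator(s, a)              -> (next_state, reward)
      policy(theta, s)             -> length-n_actions probability vector
      grad_log_policy(theta, s, a) -> gradient of log pi_theta(a|s) w.r.t. theta
    """
    # Draw S from the discounted state-occupancy measure: starting from s0,
    # stop with probability 1 - gamma before each further transition.
    s = s0
    while rng.random() < gamma:
        a = rng.choice(n_actions, p=policy(theta, s))
        s, _ = simulator(s, a)
    # Draw A ~ pi_theta(. | S).
    a = rng.choice(n_actions, p=policy(theta, s))

    # Estimate q^pi(S, A): with geometric stopping at rate 1 - gamma, the sum of
    # *undiscounted* rewards collected before stopping has expectation q^pi(S, A).
    q_hat, s2, a2 = 0.0, s, a
    while True:
        s2, r = simulator(s2, a2)
        q_hat += r
        if rng.random() >= gamma:
            break
        a2 = rng.choice(n_actions, p=policy(theta, s2))

    # Score-function form of the gradient sample.
    return q_hat * grad_log_policy(theta, s, a)
```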
Vanilla policy gradients (PG) with some special policy classes
Given the hardness result presented in the previous lecture, there is no hope that gradient methods, or any other method, will find the global optimum of the objective function in policy search in a policy-class agnostic manner. To guarantee computational efficiency, one then
- either needs to give up on convergence to a global optimum, or
- needs to give up on generality, i.e., on the requirement that the method should work for any policy class and/or policy parameterization.
Gradient ascent to find a good policy (“vanilla policy gradients”) is one possible approach to take even if it faces these restrictions. In fact, gradient ascent in some cases will find a globally optimal policy.
In particular, it has long been known that with small enough stepsizes, gradient ascent converges at a reasonable speed to a global optimum provided that two conditions hold (a commonly used way of writing these conditions is sketched after the list):
- The objective function is smooth (its derivative is Lipschitz continuous);
- The objective function is gradient dominated, i.e., with some constants it satisfies a gradient domination inequality at every parameter vector.
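One commonly used way of writing these two conditions for a maximization objective $J$ over a parameter set $\Theta$ is the following (the particular constants, norms and the variational form of gradient domination are an assumption here, not taken from the lecture):

$$
\begin{aligned}
&\|\nabla J(\theta) - \nabla J(\theta')\| \le \beta\, \|\theta - \theta'\| && \text{for all } \theta, \theta' \in \Theta\,,\\
&\sup_{\theta'\in\Theta} J(\theta') - J(\theta) \le c\, \max_{\bar\theta\in\Theta} \langle \bar\theta - \theta,\, \nabla J(\theta)\rangle && \text{for all } \theta \in \Theta\,.
\end{aligned}
$$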
An example where both of these conditions are met is the direct policy parameterization, which does not allow any compression and is thus not helpful per se, but which can serve as a test case to see how far policy gradient (PG) methods can be pushed.
In this case, the parameter vector
This is done as follows: When a proposed update moves the parameter vector outside of
and
Thus, the probability of an action in a state is increased in proportion to the action-value of that action in the state.
That the action-value of action
where
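As an illustration of the projection step under the direct parameterization, here is a short Python sketch; project_to_simplex is the standard Euclidean projection onto the probability simplex, and the state-by-state application in projected_pg_step assumes one probability vector per state (an assumption about the parameter layout, not the lecture’s own code):

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex.

    Standard sort-and-threshold algorithm; returns the closest probability
    vector to v in Euclidean distance.
    """
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

def projected_pg_step(pi, grad, stepsize):
    """One projected gradient-ascent step with the direct parameterization.

    pi, grad: arrays of shape (num_states, num_actions); each row of pi is a
    distribution over the actions, and each row of the updated parameter is
    projected back onto the simplex.
    """
    return np.apply_along_axis(project_to_simplex, 1, pi + stepsize * grad)
```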
Policy gradient methods can be sensitive to how policies are parameterized. For illustration, consider again the “tabular case”, but now change the way the memoryless policies are represented. One possibility is to use the Boltzmann, also known as softmax, representation. In this case
A straightforward calculation gives
and hence
where recall that
which is a function mapping state-action pairs to reals, is called the advantage function underlying policy
The gradient ascent rule prescribes that
where
Just like in the previous update rule, we also see the occupancy measure “weighting” the update. This is again not necessarily helpful and, if anything, again speaks to the arbitrariness of gradient methods. While this does not entirely prevent policy gradient from finding an optimal policy, and one can even show that the speed of convergence is geometric, the algorithm, as before, altogether fails to run in polynomial time in the relevant quantities. For the following theorem, which we give without proof, recall that
Theorem (PG is slow with Boltzmann policies): There exist universal constants
iterations.
As one expects that without any compression, the chosen planner should behave reasonably, this rules out the “vanilla” version of policy gradient.
Natural policy gradient (NPG) methods
In fact, a quite unsatisfactory property of gradient ascent is that the speed at which it converges can greatly depend on the parameterization used. Thus, for the same policy class, there are many possible “gradient directions”, depending on the parameterization chosen. What is a gradient direction for one parameterization is not necessarily a gradient direction for another one. What is common about these directions is that an infinitesimal step along any of them is guaranteed to increase the objective. One can in fact take a direction obtained with one parameterization and look at what direction it gives with another parameterization. To bring some order, consider transforming all these directions into the space that corresponds to the direct parameterization. It is not hard to see that all possible directions that are within 90 degrees of the gradient direction under this parameterization can be obtained by considering an appropriate parameterization.
More generally, regardless of parameterization, all directions within 90 degrees of the gradient direction are ascent directions. This motivates changing the stepsize
There are many ways to choose a matrix stepsize. Newton’s method chooses it so that the resulting direction is the “best” one when the function is replaced by its local quadratic approximation. This provably helps to reduce the number of iterations when the objective function is “ill-conditioned”, though all matrix stepsize methods incur an additional cost per iteration, which will often offset the gains.
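In formulas, Newton’s method (written here for maximization, in assumed notation) replaces the objective at the current iterate by its second-order Taylor model and steps to the maximizer of that model:

$$
\theta_{k+1} \;=\; \theta_k + \arg\max_{d}\Big\{ \langle \nabla J(\theta_k), d\rangle + \tfrac12\, d^\top \nabla^2 J(\theta_k)\, d \Big\}
\;=\; \theta_k - \big(\nabla^2 J(\theta_k)\big)^{-1} \nabla J(\theta_k)\,,
$$

where the last equality holds when the Hessian is negative definite; the “matrix stepsize” is thus the negated inverse Hessian.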
Another idea, which comes from statistical problems where one often works with distributions is to find the direction of update which coincides with the direction one would obtain if one used the steepest descent direction directly in the space of distributions where distances are measured with respect to relative entropy. In some cases, this approach, which was coined the “natural gradient” approach, has been shown to give better results, though the evidence is purely empirical.
As it turns out, the matrix stepsize to be used with this approach is the (pseudo)inverse of the so-called Fisher information matrix. In our context, for every state, we have a distribution over the actions. Fixing a state
To get the “information rate” over the states, one can sum these matrices up, weighted by the discounted state occupancy measure underlying
The update rule then takes the form
where for a square matrix
Proposition: We have
where
Proof: Just recall the formula that gives the solution to a least-squares problem. The details are left to the reader.
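To make the proposition concrete, here is a small Python sketch of the computation it suggests; the inputs (sampled score vectors, advantage values and sampling weights) are assumptions made for this sketch, and the point being illustrated is that the pseudoinverse-of-Fisher direction is also a solution of a weighted least-squares problem:

```python
import numpy as np

def npg_direction(score_features, advantages, weights):
    """Natural-gradient direction via the Fisher-matrix pseudoinverse.

    Assumed inputs (placeholders, not the lecture's own code):
      score_features: shape (n, d), row i is the gradient of log pi_theta(a_i | s_i)
      advantages:     shape (n,), advantage (or action-value) at (s_i, a_i)
      weights:        shape (n,), sampling weight of the pair (s_i, a_i)

    The returned vector also solves the weighted least-squares problem
        min_w  sum_i weights[i] * (score_features[i] @ w - advantages[i])**2,
    which is the content of the proposition above.
    """
    W = np.diag(weights)
    fisher = score_features.T @ W @ score_features   # (weighted) Fisher information matrix
    grad = score_features.T @ W @ advantages         # policy gradient, up to normalization
    return np.linalg.pinv(fisher) @ grad

def npg_update(theta, score_features, advantages, weights, stepsize):
    """One natural policy gradient step."""
    return theta + stepsize * npg_direction(score_features, advantages, weights)
```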
As an example of how things look, consider the case when
where
Then, the natural policy gradient update takes the form
where
In the tabular case
and thus
Note that this update rule eliminates the term
NPG is known to enjoy a reasonable speed of convergence, which gives altogether polynomial planning time. This is promising. No similar results are available for the nontabular case.
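For reference, in the tabular softmax case the two update rules discussed here and in the previous section are commonly written as follows (assumed notation: stepsize $\eta$, current policy $\pi_t = \pi_{\theta_t}$ with advantage function $A^{\pi_t}$, and discounted occupancy measure $\tilde\nu^{\pi_t}_\mu$; the NPG update is stated up to a per-state constant that does not affect the policy):

$$
\begin{aligned}
\text{(vanilla PG)}\qquad \theta_{t+1}(s,a) &= \theta_t(s,a) + \eta\, \tilde\nu^{\pi_t}_\mu(s)\, \pi_t(a\mid s)\, A^{\pi_t}(s,a)\,,\\
\text{(NPG)}\qquad \theta_{t+1}(s,a) &= \theta_t(s,a) + \frac{\eta}{1-\gamma}\, A^{\pi_t}(s,a)\,,
\end{aligned}
$$

the latter being equivalent to the multiplicative update $\pi_{t+1}(a\mid s) \propto \pi_t(a\mid s)\exp\big(\tfrac{\eta}{1-\gamma}A^{\pi_t}(s,a)\big)$, which makes the connection to Politex-style updates apparent.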
Note that if we (arbitrarily) change the definition of
Note that the only difference between Q-NPG and Politex is that in Politex one uses
where
The price of not using
where
gives a bound on how much the distribution
In contrast, the same quantity in Politex is
Not only the uncontrolled constant
The proof of the Calculus 101 result
For completeness, here is the proof of
Hence, it suffices to show that
To minimize clutter we will write
By definition we have
and
Putting these together we get
where the last equality follows if
That the result also holds under the assumption that
Summary
While policy gradient methods remain extremely popular and the idea of directly searching in the set of policies is attractive, at the moment it appears that they not only lack theoretical support, but the theoretical results suggest that it is hard to find any setting where policy gradient methods would be provably competitive with alternatives. At minimum, they need careful choices of policy parameterizations and even in that case the update rule may need to be changed to guarantee efficiency and effectiveness, as we have seen above. As an approach to algorithm design their main advantage is their generality and a strong support through various software libraries. Compared to vanilla “dynamic programming” methods they make generally smaller, more incremental changes to the policies, which seems useful. However, this is also achieved by methods like Politex, which is derived using a “bound minimization” approach. While this may seem more ad hoc than following gradients, in fact, one may argue that following gradients is more ad hoc as it fails to guarantee good performance. However, perhaps the most important point here is that one should not care too much about how a method is derived, or what “interpretation” it may have (is Politex a gradient algorithm? does this matter?). What matters is the outcomes: In this case how the methods perform. It is thus wise to learn about all possible ways of designing algorithms as there remains much room for improving the performance of current algorithms.
Notes
TBA
References
- Sutton, Richard S., David A. McAllester, Satinder P. Singh, and Yishay Mansour. 1999. “Policy Gradient Methods for Reinforcement Learning with Function Approximation.” In Neural Information Processing Systems 12, pages 1057–1063.
- Silver, David, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. “Deterministic Policy Gradient Algorithms.” In ICML. http://hal.inria.fr/hal-00938992/.
- Bhandari, Jalaj, and Daniel Russo. 2019. “Global Optimality Guarantees For Policy Gradient Methods,” June. https://arxiv.org/abs/1906.01786v1.
- Agarwal, Alekh, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. 2019. “On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1908.00261.
- Mei, Jincheng, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. 2020. “On the Global Convergence Rates of Softmax Policy Gradient Methods.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2005.06392.
- Zhang, Junyu, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, and Mengdi Wang. 2020. “Variational Policy Gradient Method for Reinforcement Learning with General Utilities.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2007.02151.
- Bhandari, Jalaj, and Daniel Russo. 2020. “A Note on the Linear Convergence of Policy Gradient Methods.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2007.11120.
- Chung, Wesley, Valentin Thomas, Marlos C. Machado, and Nicolas Le Roux. 2020. “Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2008.13773.
- Li, Gen, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. 2021. “Softmax Policy Gradient Methods Can Take Exponential Time to Converge.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2102.11270.
The paper to read about natural gradient methods:
- Martens, James. 2014. “New Insights and Perspectives on the Natural Gradient Method,” December. https://arxiv.org/abs/1412.1193v9. Last update: September, 2020.