least-squares policy iteration with G-optimal design (LSPI-G) can produce a policy $\pi$ such that the suboptimality gap $\delta$ of $\pi$ satisfies a bound of the form
$$\delta = O\Big(\tfrac{\sqrt{d}}{(1-\gamma)^2}\,\varepsilon\Big)\,,$$
up to terms that can be made arbitrarily small at the price of more computation, where $\varepsilon$ is the worst-case error with which the $d$-dimensional features can approximate the action-value functions of memoryless policies of the MDP $M$. In fact, the result continues to hold if we restrict the memoryless policies to those that are $\varphi$-measurable in the sense that the probability assigned by such a policy to taking some action in some state $s$ depends only on the feature vectors $(\varphi(s,a))_{a\in\mathcal{A}}$ associated with that state. Denote the set of such policies by $\Pi_\varphi$. Then, for an MDP $M$ and associated feature-map $\varphi$, let
$$\varepsilon_\infty(M,\varphi) = \sup_{\pi\in\Pi_\varphi}\,\inf_{\theta\in\mathbb{R}^d}\,\|\Phi\theta - q^\pi\|_\infty\,.$$
Checking the proof and noticing that LSPI produces $\varphi$-measurable policies only, it follows that, provided the first policy it uses is also $\varphi$-measurable, $\varepsilon$ in the above bound can be replaced by $\varepsilon_\infty(M,\varphi)$.
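For a fixed policy of a small, explicitly given MDP, the inner infimum in this definition (the best sup-norm fit of $q^\pi$ by the features) can be computed exactly with a linear program; taking the supremum over $\Pi_\varphi$ is what makes $\varepsilon_\infty(M,\varphi)$ hard to evaluate in general. The sketch below is only meant to make the definition concrete; all names are illustrative and not part of the lecture.

```python
import numpy as np
from scipy.optimize import linprog

def qpi(P, r, gamma, pi):
    """Action-value function q^pi of a memoryless policy on a small MDP.
    P: (S, A, S) transition probabilities, r: (S, A) rewards, pi: (S, A) action probs."""
    S, A = r.shape
    P_pi = np.einsum("sap,sa->sp", P, pi)     # P_pi(s, s') = sum_a pi(a|s) P(s'|s, a)
    r_pi = np.einsum("sa,sa->s", r, pi)
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * P @ v                  # q^pi(s, a) = r(s, a) + gamma * E[v(s')]

def best_linf_fit(Phi, q):
    """inf_theta ||Phi theta - q||_inf via a linear program:
    minimize t subject to -t <= Phi theta - q <= t."""
    n, d = Phi.shape
    c = np.zeros(d + 1); c[-1] = 1.0          # variables: (theta, t); objective: t
    A_ub = np.block([[Phi, -np.ones((n, 1))], [-Phi, -np.ones((n, 1))]])
    b_ub = np.concatenate([q, -q])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * d + [(0, None)])
    return res.fun                            # the best achievable sup-norm error
```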
Earlier, we also proved that the amplification of $\varepsilon$ by the $\sqrt{d}$-factor is unavoidable by any efficient planner. However, this leaves open the question of whether the amplification by a polynomial power of $1/(1-\gamma)$ is necessary and, in particular, whether the quadratic dependence is necessary. Our first result, which is given without proof, shows that in the case of LSPI this amplification is real and the quadratic dependence cannot be improved.
Theorem (LSPI error amplification lower bound): The quadratic dependence on $1/(1-\gamma)$ in the above bound is tight: there exists a universal constant $c>0$ such that for every $\gamma\in[1/2,1)$ and every $\varepsilon>0$ there exists a featurized MDP $(M,\varphi)$ with $\varepsilon_\infty(M,\varphi)\le\varepsilon$, a policy $\pi_0$ of the MDP, and a distribution $\mu$ over the states such that LSPI, when it is allowed infinitely many rollouts of infinite length and is started from $\pi_0$, produces a sequence of policies $\pi_1,\pi_2,\dots$ such that
$$\limsup_{k\to\infty}\,\mu\big(v^* - v^{\pi_k}\big) \;\ge\; \frac{c\,\varepsilon}{(1-\gamma)^2}\,.$$
The result of the theorem holds even when LSPI is used with state-aggregation. Intuitively, state-aggregation means that the states are grouped into a number of groups and states belonging to the same group are treated identically when it comes to representing value functions. Thus, value functions based on state-aggregation are constant over any group. When we are concerned with state-value functions, aggregating the states based on a partitioning of the state space $\mathcal{S}$ into the groups $\mathcal{S}_1,\dots,\mathcal{S}_d$ (i.e., $\mathcal{S} = \mathcal{S}_1\cup\dots\cup\mathcal{S}_d$ and all the subsets are disjoint from each other), a feature-map that allows us to represent these piecewise constant functions is
$$\varphi(s) = \big(\mathbb{I}(s\in\mathcal{S}_1),\dots,\mathbb{I}(s\in\mathcal{S}_d)\big)^\top\,,$$
where $\mathbb{I}$ is the indicator function that takes the value of one when its argument (a logical expression) is true, and is zero otherwise. In other words, $\varphi_i(s)=1$ if $s\in\mathcal{S}_i$ and $\varphi_i(s)=0$ otherwise. Any feature-map of this form defines a partitioning of the state space and thus corresponds to state-aggregation. Note that the piecewise constant functions can also be represented if we rotate all the features by the same rotation. The only important aspect here is that the features of different states are either identical or orthogonal to each other, making the rows of the feature matrix an orthonormal system.
For approximating action-value functions, state-aggregation uses the same partitioning of states regardless of the identity of the actions: in effect, for each action, one uses the feature-map $\varphi$ from above, but with a private parameter vector. This effectively amounts to stacking $\varphi$ $|\mathcal{A}|$ times, to get one copy of it for each action $a\in\mathcal{A}$. Note that for state-aggregation, there is no amplification of the approximation errors: state-aggregation is extrapolation friendly, as will be explained at the end of the lecture.
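As an illustration (not part of the original construction), here is a minimal Python sketch of the stacked state-aggregation feature-map just described; the function and variable names are ours, and the ordering of the $(s,a)$ pairs is an arbitrary convention.

```python
import numpy as np

def state_aggregation_features(group_of_state, num_groups, num_actions):
    """Stacked state-aggregation features for action-value functions.
    group_of_state: array of length S mapping each state to its group index.
    Returns Phi of shape (S * A, num_groups * A): the state feature block is
    repeated once per action, i.e. each action gets a private parameter block."""
    S = len(group_of_state)
    phi_s = np.zeros((S, num_groups))
    phi_s[np.arange(S), group_of_state] = 1.0  # phi_i(s) = I{s in S_i}
    Phi = np.zeros((S * num_actions, num_groups * num_actions))
    for a in range(num_actions):
        # rows of (s, a) pairs with action a use the a-th copy of the state features
        Phi[a::num_actions, a * num_groups:(a + 1) * num_groups] = phi_s
    return Phi

# Example: 5 states in 2 groups, 2 actions.
Phi = state_aggregation_features(np.array([0, 0, 1, 1, 1]), num_groups=2, num_actions=2)
```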
Returning to the result, an inspection of the actual proof reveals that in this case LSPI leads to a sequence of policies that alternate between the initial policy and one other policy. “Convergence” is fast, yet the guarantee is far from satisfactory. In particular, in the same example, an alternative algorithm, which we will cover next, can reduce the quadratic dependence on the horizon to a linear dependence.
Politex
Politex comes from “Policy Iteration with Expert Advice”. Assume that one is given a featurized MDP $(M,\varphi)$ with state-action feature-map $\varphi$ and access to a simulator, together with a G-optimal design for $\varphi$.
Politex generates a sequence of policies $\pi_0,\pi_1,\pi_2,\dots$ such that for $k\ge 1$,
$$\pi_k(a|s) \propto \exp\big(\eta\,\bar q_{k-1}(s,a)\big)\,, \qquad (s,a)\in\mathcal{S}\times\mathcal{A}\,,$$
where
$$\bar q_{k-1} = \hat q_0 + \hat q_1 + \dots + \hat q_{k-1}$$
with
$$\hat q_j(s,a) = \Pi\big(\langle \varphi(s,a), \hat\theta_j\rangle\big)\,, \qquad (s,a)\in\mathcal{S}\times\mathcal{A}\,,$$
where for $0\le j\le k-1$, $\hat\theta_j$ is the parameter vector obtained by running the least-squares policy evaluation algorithm based on G-optimal design (LSPE-G) to evaluate policy $\pi_j$ (see this lecture). In particular, recall that this algorithm rolls out policy $\pi_j$ from the points of a G-optimal design to produce a fixed number of independent trajectories of a fixed length each, calculates the average return for each of these design points, and then solves the (weighted) least-squares regression problem where the features are used to regress on the obtained values.
Above, $\Pi$ truncates its argument to the $[0,\tfrac{1}{1-\gamma}]$ interval:
$$\Pi(x) = \max\Big(0, \min\Big(x, \tfrac{1}{1-\gamma}\Big)\Big)\,, \qquad x\in\mathbb{R}\,.$$
Note that, because of the truncation, to calculate $\pi_k(\cdot|s)$ one does need to keep all the parameter vectors $\hat\theta_0,\dots,\hat\theta_{k-1}$, calculate each $\hat q_j(s,a)$ separately, and then compute their sum $\bar q_{k-1}(s,a)$; the parameter vectors cannot simply be summed up beforehand.
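The following sketch shows how the $k$-th Politex policy can be computed at a single state from the stored LSPE-G parameter vectors; the $[0,1/(1-\gamma)]$ truncation range and all names are our own conventions, meant only as an illustration.

```python
import numpy as np

def politex_policy(phi_sa, thetas, eta, gamma):
    """Action distribution of the k-th Politex policy at one state.
    phi_sa: (A, d) features of the (state, action) pairs at this state;
    thetas: list of the k parameter vectors returned by LSPE-G so far.
    Because of the truncation, each q_j(s, a) is truncated individually
    before the sum is taken; the thetas cannot be pre-summed."""
    hi = 1.0 / (1.0 - gamma)                        # value range assumed to be [0, 1/(1-gamma)]
    q_sum = np.zeros(phi_sa.shape[0])
    for theta in thetas:
        q_sum += np.clip(phi_sa @ theta, 0.0, hi)   # q_j(s, .) = Pi(<phi(s, .), theta_j>)
    z = eta * q_sum
    z -= z.max()                                    # stabilize the softmax
    p = np.exp(z)
    return p / p.sum()
```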
Unlike in policy iteration, the policy returned by Politex after $k$ iterations is either the “mixture policy”
$$\bar\pi_k = \frac{1}{k}\big(\pi_0 + \pi_1 + \dots + \pi_{k-1}\big)$$
or the policy among $\pi_0,\dots,\pi_{k-1}$ which gives the best value with respect to the start state, or start distribution. For simplicity, let us just consider the case when $\bar\pi_k$ is used as the output. The meaning of a mixture policy is simply that one of the policies $\pi_0,\dots,\pi_{k-1}$ is selected uniformly at random and then the selected policy is followed for the rest of time. Homework 3 gives precise definitions and asks you to prove that the value function of $\bar\pi_k$ is just the mean of the value functions of the constituent policies:
$$v^{\bar\pi_k} = \frac{1}{k}\sum_{j=0}^{k-1} v^{\pi_j}\,.$$
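Taking the identity above as given, the value function of the mixture policy can be computed by simply averaging exact policy-evaluation results, as in the following self-contained sketch (the random MDP and all names are only for illustration).

```python
import numpy as np

def v_pi(P, r, gamma, pi):
    """Exact state-value function of a memoryless policy on a small MDP."""
    S, A = r.shape
    P_pi = np.einsum("sap,sa->sp", P, pi)   # P_pi(s, s') = sum_a pi(a|s) P(s'|s, a)
    r_pi = np.einsum("sa,sa->s", r, pi)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

rng = np.random.default_rng(0)
S, A, gamma, k = 4, 3, 0.9, 5
P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))
policies = [rng.random((S, A)) for _ in range(k)]
policies = [pi / pi.sum(axis=1, keepdims=True) for pi in policies]

# By the identity above, the value of the mixture policy is the average of the
# constituent value functions:
v_mixture = np.mean([v_pi(P, r, gamma, pi) for pi in policies], axis=0)
```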
We now argue that the dependence of the suboptimality gap of $\bar\pi_k$ on the approximation error only scales with $1/(1-\gamma)$, unlike the $1/(1-\gamma)^2$ scaling in the case of approximate policy iteration.
For this, recall that by the value difference identity, for any $0\le j\le k-1$,
$$v^{\pi^*} - v^{\pi_j} = (I-\gamma P_{\pi^*})^{-1}\big[ M_{\pi^*} q^{\pi_j} - M_{\pi_j} q^{\pi_j}\big]\,,$$
where $M_\pi$ maps an action-value function $q$ to the state-value function $(M_\pi q)(s) = \sum_a \pi(a|s)\,q(s,a)$.
Summing up over $j$, dividing by $k$, and using $v^{\bar\pi_k} = \frac1k\sum_{j=0}^{k-1} v^{\pi_j}$ gives
$$v^{\pi^*} - v^{\bar\pi_k} = \frac{1}{k}\,(I-\gamma P_{\pi^*})^{-1}\sum_{j=0}^{k-1}\big[ M_{\pi^*} q^{\pi_j} - M_{\pi_j} q^{\pi_j}\big]\,.$$
Now, $M_{\pi^*} q^{\pi_j} - M_{\pi_j} q^{\pi_j} = \big(M_{\pi^*}\hat q_j - M_{\pi_j}\hat q_j\big) + \big(M_{\pi^*} - M_{\pi_j}\big)\big(q^{\pi_j} - \hat q_j\big)$. Also, $(I-\gamma P_{\pi^*})^{-1}\mathbf{1} = \frac{1}{1-\gamma}\,\mathbf{1}$. Let
$$T_1 = \frac{1}{k}\sum_{j=0}^{k-1}\big(M_{\pi^*}\hat q_j - M_{\pi_j}\hat q_j\big)\,, \qquad T_2 = \frac{1}{k}\sum_{j=0}^{k-1}\big(M_{\pi^*} - M_{\pi_j}\big)\big(q^{\pi_j} - \hat q_j\big)\,.$$
Elementary algebra then gives
$$v^{\pi^*} - v^{\bar\pi_k} = (I-\gamma P_{\pi^*})^{-1}\big(T_1 + T_2\big)\,.$$
We see that the approximation errors appear only in term $T_2$. In particular, taking pointwise absolute values and using the triangle inequality, we get that
$$(I-\gamma P_{\pi^*})^{-1}\,|T_2| \le \frac{2}{1-\gamma}\;\max_{0\le j\le k-1}\big\|q^{\pi_j}-\hat q_j\big\|_\infty\,\mathbf{1}\,,$$
which shows the promised $1/(1-\gamma)$ dependence. It remains to show that $T_1$ above is also under control. However, this is left to the next lecture.
Notes
State aggregation and extrapolation friendliness
The $\sqrt{d}$ factor in our results comes from controlling the extrapolation errors of linear prediction. In the case of state-aggregation, however, this extra error amplification is completely avoided: clearly, if we measure a function with a precision $\varepsilon$ and there is at least one measurement per part, then by using the value measured at each part (at an arbitrary state there) over the whole part, the worst-case error is bounded by $\varepsilon$. Weighted least-squares in this context just takes the weighted average of the responses over each part and uses this as the prediction, so it also avoids amplifying approximation errors.
In this case, our analysis of extrapolation errors is clearly conservative. The extrapolation error was controlled in two steps: in our first lemma, for weighted least-squares we reduced this problem to that of controlling $\max_{z}\|\varphi(z)\|_{G_\rho^{-1}}$, where $G_\rho=\sum_z \rho(z)\varphi(z)\varphi(z)^\top$ is the moment matrix for the design $\rho$. In fact, the proof of this lemma is the culprit: by carefully inspecting the proof, we can see that the application of Jensen’s inequality introduces an unnecessary $\sqrt{d}$ factor. For the case of state-aggregation (orthonormal feature matrix), the least-squares prediction over any group is just the response measured at that group’s design point, and hence the extrapolation error is no larger than the measurement error, as long as the design is such that it chooses any group exactly once. Thus, the case of state-aggregation shows that some feature-maps are more extrapolation friendly than others. Also, note that the Kiefer-Wolfowitz theorem, of course, still gives that $\sqrt{d}$ is the smallest value that we can get for $\max_z\|\varphi(z)\|_{G_\rho^{-1}}$ when optimizing over the design $\rho$.
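As a sanity check of this claim (not part of the original notes), the following self-contained sketch builds state-aggregation features, measures an aggregable function with precision $\varepsilon$ at one design point per group, and verifies that the least-squares prediction error never exceeds $\varepsilon$; the sizes and names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_per_group, eps = 6, 4, 0.1                # m groups, several states per group
group = np.repeat(np.arange(m), n_per_group)   # group index of each state
Phi = np.eye(m)[group]                         # phi_i(s) = I{s in S_i}

f = rng.random(m)[group]                       # a piecewise-constant target function
design = np.arange(0, m * n_per_group, n_per_group)    # one design point per group
y = f[design] + rng.uniform(-eps, eps, size=m)         # responses measured with precision eps

# Least squares on the design: with one point per group this is exactly
# "copy the measured value of the group over the whole group".
theta, *_ = np.linalg.lstsq(Phi[design], y, rcond=None)
extrapolation_error = np.max(np.abs(Phi @ theta - f))
assert extrapolation_error <= eps + 1e-12      # no sqrt(d) amplification
```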
It is a fascinating question how extrapolation errors behave for various feature-maps.
Least-squares value iteration (LSVI)
In Homework 2, Question 3 was concerned with least-squares value iteration. The algorithm concerned (call it LSVI-G) uses a random approximation of the Bellman operator, based on a G-optimal design (and action-value functions). The problem was to show that a result similar to the one that holds for LSPI-G holds for LSVI-G as well. That is, for any MDP feature-map pair and any excess suboptimality target, with a total runtime that is polynomial in all the relevant quantities,
least-squares value iteration with G-optimal design (LSVI-G) can produce a policy $\pi$ such that the suboptimality gap $\delta$ of $\pi$ satisfies a bound of the same form as the one displayed above for LSPI-G, except that the approximation error $\varepsilon$ is replaced by a different error measure, $\varepsilon_{\text{BOO}}$.
Thus, the dependence on the horizon of the approximation-error term is similar to the one that was obtained for LSPI. Note that the definition of $\varepsilon_{\text{BOO}}$ is different from the error measure $\varepsilon_\infty$ we have used in analyzing LSPI:
$$\varepsilon_{\text{BOO}} = \sup_{\theta\in\mathbb{R}^d}\,\inf_{\theta'\in\mathbb{R}^d}\,\big\|\Phi\theta' - T\,\Pi(\Phi\theta)\big\|_\infty\,.$$
Above, $T$ is the Bellman optimality operator for action-value functions and $\Pi$ is defined so that for $q:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$, $\Pi q$ is also such a function, obtained from $q$ by truncating the value at each input to the $[0,\tfrac{1}{1-\gamma}]$ interval: $(\Pi q)(s,a) = \max(0,\min(q(s,a),\tfrac{1}{1-\gamma}))$. In $\varepsilon_{\text{BOO}}$, “BOO” stands for “Bellman-optimality operator” in reference to the appearance of $T$ in the definition.
In general, the error measure $\varepsilon_\infty$ used in analyzing LSPI and $\varepsilon_{\text{BOO}}$ are incomparable. The latter quantity measures a “one-step error”, while the former is concerned with approximating value functions defined over an infinite horizon.
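As an illustration of the kind of update LSVI-G performs, here is a noise-free Python sketch that assumes the MDP model is available explicitly; in the homework version, the exact Bellman backup at the design points would instead be estimated from rollouts obtained from the simulator. All names, shapes, and the $[0,1/(1-\gamma)]$ truncation are our own conventions.

```python
import numpy as np

def bellman_optimality_step(P, r, gamma, q):
    """(T q)(s, a) = r(s, a) + gamma * sum_s' P(s' | s, a) * max_a' q(s', a')."""
    return r + gamma * np.einsum("sap,p->sa", P, q.max(axis=1))

def lsvi_g_iteration(P, r, gamma, Phi, theta, design, weights):
    """One idealized (noise-free) LSVI-style step: truncate the current linear
    action-value estimate, apply the Bellman optimality operator, then fit the
    result at the design points by weighted least squares.
    Phi: (S*A, d) feature matrix with rows ordered as (s, a) -> s * A + a."""
    S, A = r.shape
    q = np.clip((Phi @ theta).reshape(S, A), 0.0, 1.0 / (1.0 - gamma))  # Pi(Phi theta)
    target = bellman_optimality_step(P, r, gamma, q).reshape(S * A)
    w = np.sqrt(np.asarray(weights))                    # design weights (e.g. G-optimal)
    X, y = w[:, None] * Phi[design], w * target[design]
    return np.linalg.lstsq(X, y, rcond=None)[0]         # next parameter vector
```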
Linear MDPs
Call an MDP linear if both the reward function and the next-state distributions for each state-action pair lie in the span of the features: $r = \Phi\theta_r$ with some $\theta_r\in\mathbb{R}^d$, and $P$, as an $\mathrm{SA}\times\mathrm{S}$ matrix, takes the form $P = \Phi W$ with some $W\in\mathbb{R}^{d\times\mathrm{S}}$. Clearly, this is a notion that captures how well the “dynamics” (including the reward) of the MDP can be “compressed”.
When an MDP is linear, $\varepsilon_\infty(M,\varphi)=0$. We also have in this case that $\varepsilon_{\text{BOO}}=0$. More generally, defining $\varepsilon_r$ and $\varepsilon_P$ as the worst-case errors with which the features can approximate the reward function and the next-state distributions, respectively, it is not hard to see that both $\varepsilon_\infty(M,\varphi)$ and $\varepsilon_{\text{BOO}}$ can be bounded in terms of $\varepsilon_r$ and $\varepsilon_P$, which shows that both policy iteration (and its soft versions) and value iteration are “valid” approaches, though, by ignoring the fact that we are comparing upper bounds, this also shows that value iteration may have an edge over policy iteration when the MDP itself is compressible. This should not be too surprising given that value iteration is “more direct” in aiming to calculate $q^*$. Yet, there may exist cases when the action-value functions are compressible, while the dynamics are not.
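As a quick illustration of the first claim, the following sketch (our own notation: $\Phi$ is the $\mathrm{SA}\times d$ feature matrix, $W$ the $d\times\mathrm{S}$ matrix from the definition) returns a parameter vector that represents $q^\pi$ exactly in a linear MDP, showing that the approximation error vanishes in this case.

```python
import numpy as np

def qpi_parameter(Phi, theta_r, W, gamma, pi, S, A):
    """In a linear MDP (r = Phi theta_r, P = Phi W as an SA x S matrix),
    q^pi = Phi (theta_r + gamma * W v^pi), so every action-value function
    lies in the span of the features. Rows of Phi are ordered (s, a) -> s * A + a."""
    P = (Phi @ W).reshape(S, A, S)              # next-state distributions
    r = (Phi @ theta_r).reshape(S, A)           # rewards
    P_pi = np.einsum("sap,sa->sp", P, pi)       # transition matrix of policy pi
    r_pi = np.einsum("sa,sa->s", r, pi)         # reward vector of policy pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # v^pi
    return theta_r + gamma * W @ v              # parameter of q^pi: Phi @ (...) == q^pi
```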
Stationary points of a policy search objective
Let $J(\pi) = \mu\,v^\pi$. A stationary point of $J$ with respect to some set $\Pi_0$ of memoryless policies is any $\pi\in\Pi_0$ such that no feasible direction increases $J$ to first order, that is,
$$\langle \nabla_\pi J(\pi),\, \pi' - \pi\rangle \le 0 \qquad \text{for all } \pi'\in\Pi_0\,.$$
It is known that if $\varphi$ are state-aggregation features then any stationary point of $J$ satisfies a suboptimality bound whose approximation-error term is governed by $\varepsilon_\infty(M,\varphi)$, where $\varepsilon_\infty(M,\varphi)$ is defined as the worst-case error of approximating the action-value functions of $\varphi$-measurable policies with the features (the same constant as used in the analysis of approximate policy iteration).
Soft-policy iteration with averaging
Politex can be seen as a “soft” version of policy iteration with averaging. The softness is controlled by $\eta$: when $\eta\to\infty$, Politex uses a greedy policy with respect to an average of all previous $q$-functions. Notice that in this case, if Politex were to use a greedy policy with respect to the last $q$-function only, then it would reduce exactly to LSPI-G. As we have seen, in LSPI-G the approximation error can get quadratically amplified with the horizon $1/(1-\gamma)$. Thus, one way to avoid this quadratic amplification is to stay soft with averaging. As we shall see in the next lecture, the price of this is a relatively slower convergence to a target suboptimality excess value. Nevertheless, the promise is that the algorithm will still stay polynomial in all the relevant quantities.
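A tiny numerical illustration of the role of $\eta$ (arbitrary values, our own naming): as $\eta$ grows, the Boltzmann distribution used by Politex concentrates on the greedy action.

```python
import numpy as np

def soft_greedy(q_row, eta):
    """Boltzmann ("soft greedy") distribution over one state's action values."""
    z = eta * (q_row - q_row.max())
    p = np.exp(z)
    return p / p.sum()

q_last = np.array([1.0, 1.2, 0.7])
for eta in (1.0, 10.0, 100.0):
    print(eta, np.round(soft_greedy(q_last, eta), 3))
# As eta grows, the distribution concentrates on argmax_a q(s, a), the greedy choice
# LSPI-G would make from the last q-estimate; Politex applies the same map to the
# sum of all previous q-estimates, which is what "soft with averaging" refers to.
```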
References
Politex was introduced in the paper
Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvári, C., and Weisz, G. POLITEX: Regret bounds for policy iteration using expert prediction. In ICML, pages 3692–3702, 2019.
However, as this paper also notes, the basic idea goes back to the MDP-E algorithm by Even-Dar et al:
Even-Dar, E., Kakade, S. M., and Mansour, Y. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
This algorithm considered a tabular MDP with nonstationary rewards – a completely different setting. Nevertheless, this paper introduces the basic argument presented above. The Politex paper notices that the argument can be extended to the case of function approximation. In particular, it also notes the nature of the function approximator is irrelevant as long as the approximation and estimation errors can be tightly controlled.
The Politex paper presented an analysis for online RL and average reward MDPs. Both add significant complications. The argument shown here is therefore a simpler version. Connecting Politex to LSPE-G in the discounted setting is trivial, but has not been presented before in the literature.
The first paper to use the error decomposition shown here together with function approximation is
Abbasi-Yadkori, Y., Lazic, N., and Szepesvári, C. Model-free linear quadratic control via reduction to expert prediction. In AISTATS, 2019.