The following lemma can be extracted from the calculations found at the end of the last lecture:
Lemma (Mixture policy suboptimality): Fix an MDP . For any sequence of policies, any sequence of functions, and any policy , the mixture policy satisfies
In particular, the only restriction so far is on the policy , namely that it has to be a memoryless policy. To control the suboptimality of the mixture policy, one only needs to control the action-value approximation errors and the term , and for this we are free to choose the policies in any way we like. To help with this choice, let us now inspect for a fixed state :
where, abusing notation, we use for . Now, recall that will be computed based on , while is unknown. One may thus wonder whether it is possible to control this term at all.
Online linear optimization
As it happens, the problem of controlling terms of this type is the central problem studied in a subfield of learning theory, online learning. In particular, in online linear optimization, the following problem is studied:
An adversary and a learner play a zero-sum minimax game in discrete rounds, taking actions in an alternating manner. In round (), first the learner needs to choose a vector . Then the adversary chooses a vector, . Before its choice, the adversary learns about all previous choices of the learner, and the learner likewise learns about all previous choices of the adversary. They also remember their own choices. For simplicity, let us constrain both the adversary and the learner to be deterministic. The payoff to the adversary at the end of the rounds is
In particular, the adversary’s goal is to maximize this, while the learner’s goal is to minimize it (the game is zero-sum). Both the adversary and the learner are given and the sets . Letting denote the learner’s strategy (a sequence of maps of histories to ) and denote the adversary’s strategy (a sequence of maps of histories to ), the above quantity depends on and : .
Taking the perspective of the learner, the quantity defined in is called the learner’s regret. Denote the minimax value of the game by : .
Thus, this only depends on , and . The dependence is suppressed when it is clear from the context. The central question then is how depends on and also on and . In online linear optimization both sets and are convex.
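To make the protocol and the regret concrete, here is a small simulation sketch. All choices in it are for illustration only and are not fixed by the text: the learner's set is taken to be the probability simplex, the adversary's set the unit cube, the learner always plays the uniform vector, and the adversary plays random payoff vectors.

```python
import numpy as np

# Minimal simulation of the online linear optimization protocol above.
# The sets, strategies, and horizon are illustrative placeholders: the learner
# always plays the uniform distribution over d coordinates, the adversary
# plays random payoff vectors in [0, 1]^d.

rng = np.random.default_rng(0)
n, d = 100, 4
learner_payoff = 0.0
adversary_choices = []

for t in range(n):
    x_t = np.ones(d) / d                   # learner moves first
    y_t = rng.uniform(0.0, 1.0, size=d)    # adversary moves, having seen the history
    learner_payoff += y_t @ x_t
    adversary_choices.append(y_t)

# Regret: the best fixed choice from the simplex in hindsight is a vertex,
# i.e., the coordinate with the largest cumulative payoff.
cumulative = np.sum(adversary_choices, axis=0)
print(cumulative.max() - learner_payoff)
```

Of course, always playing the uniform vector is a poor strategy; the point of what follows is to do better.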
Connecting these games to our problem, we can see that in matches the regret definition in if we let , be the simplex of and . Furthermore, needs to be chosen first, which is followed by the choice of . While will not be chosen in an adversarial fashion, a bound on the regret against arbitrary choices will also serve as a bound for the specific choice we will need to make for .
Mirror descent
Mirror descent (MD) is an algorithm that originates in optimization theory. In the context of online linear optimization, MD is a strategy for the learner which is known to guarantee near minimax regret for the learner under a wide range of circumstances.
To align with the large body of literature on online linear optimization, it will be beneficial to switch signs. Thus, in what follows we assume that the learner will aim at minimizing by its choice and the adversary will aim at maximizing the same expression over its choice . This means that we also redefine the regret to
Everything else remains the same: the game is zero-sum and minimax, the regret is the payoff of the adversary, and the negative regret is the payoff of the learner. This version is called a loss game. The reason to prefer the loss game is that most of optimization theory is written for minimizing convex functions rather than for maximizing concave functions; clearly, this is an arbitrary choice. The second form of the regret shows that the learner’s goal is to compete with the best single decision from , chosen with the hindsight of knowing all the choices of the adversary. That is, the learner’s goal is to keep its cumulative loss close to, or even below, the best cumulative loss in hindsight, . (With this, matches when we change .)
MD is defined recursively and in its simplest form has two design parameters. The first is an extended real-valued convex function , called the “regularizer”, while the second is a stepsize, or learning rate, parameter . (The extended reals are just together with and an appropriate extension of basic arithmetic. Allowing convex functions to take the value makes it possible to merge “constraints” with objectives in a seamless fashion. The value is added because sometimes we have to work with negated extended real-valued convex functions.)
The specification of MD is as follows: In round , is picked to minimize :
In what follows, we assume that all the minimizers that we need in the definition of MD do exist. In the specific case that we need, is the simplex, which is a closed convex set, and since convex functions are also continuous, the minimizers that we will need are guaranteed to exist.
Then, in round , MD chooses as follows:
Here,
is the remainder term in the first-order Taylor-series expansion of the value of at when the expansion is carried out at ; for simplicity, we assume that is differentiable on the interior of its domain . Since the first-order Taylor approximation of a convex function always stays below the graph of the function, we immediately get that is nonnegative valued. For an illustration see the figure on the right, which shows a convex function together with its first-order Taylor approximation at some point.
One should think of as a “nonlinear distance inducing function”; above can be thought of as a penalty imposed on deviating from . However, more often than not, is not a distance; for example, it is often not even symmetric. Because of this, we can’t really call a distance. Hence, it is called a divergence. In particular, is called the Bregman divergence of from .
In the definition of the MD update rule, we tacitly assumed that is well-defined. This requires that should be differentiable at , which one needs to check when applying MD. In our specific case, this will hold, again.
The idea of the MD update rule is to (1) allow the learner to react to the last loss vector chosen by the adversary, while also (2) limiting how much can depart from , thus effectively stabilizing the algorithm, with the tradeoff governed by the choice of . (Separating from only makes sense because there are some standard choices for ; is really just a scale parameter for .) In particular, the larger the value of is, the less “data-sensitive” MD will be (here, constitute the data), and vice versa: the smaller is, the more data-sensitive MD will be.
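To make the defining argmin concrete, the following numerical sketch computes one MD step on the probability simplex (chosen here for concreteness) by solving the minimization problem directly with a generic solver. All names in it are illustrative; the closed-form update for the regularizer we will actually use is derived below.

```python
import numpy as np
from scipy.optimize import minimize

# A numerical sketch of one mirror descent step on the probability simplex,
# solving the defining argmin directly. The regularizer and all names here
# are illustrative placeholders.

def bregman(F, gradF, x, y):
    """Bregman divergence D_F(x, y) = F(x) - F(y) - <gradF(y), x - y>."""
    return F(x) - F(y) - np.dot(gradF(y), x - y)

def md_step(x_prev, y_loss, eta, F, gradF):
    """x_next = argmin over the simplex of eta*<y_loss, x> + D_F(x, x_prev)."""
    d = len(x_prev)
    obj = lambda x: eta * np.dot(y_loss, x) + bregman(F, gradF, x, x_prev)
    cons = [{"type": "eq", "fun": lambda x: np.sum(x) - 1.0}]
    res = minimize(obj, x_prev, bounds=[(0.0, 1.0)] * d, constraints=cons)
    return res.x

# Example with the squared Euclidean regularizer F(x) = 0.5*||x||^2, for which
# D_F is the squared Euclidean distance and MD is projected gradient descent.
F = lambda x: 0.5 * np.dot(x, x)
gradF = lambda x: x
x0 = np.ones(3) / 3
print(md_step(x0, np.array([1.0, 0.0, 0.5]), eta=0.1, F=F, gradF=gradF))
```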
Where is the mirror?
Under some technical conditions on , the update rule has a two-step implementation:
The first equation above explains the name: to obtain , one first transforms using to the “mirror” (dual) space where “gradients”/“slopes” live; there, one adds to the result , which can be seen as a “gradient step” (interpreting as the gradient of some loss). Finally, the result is mapped back to the original (primal) space using the inverse of .
The second step of the update takes the resulting point and “projects” it to in a way that respects the “geometry induced by ” on the space .
The use of complex terminology like “primal” and “dual” spaces, which here happen to be the same old Euclidean space , probably sounds like overkill. Indeed, in the simple case we consider, where these spaces are identical, it is. The distinction becomes important when working with infinite-dimensional spaces, which we leave to others for now.
Besides helping with understanding the terminology, the two-step update shown can also be useful for computation. In fact, this will be the case in the special case that we need.
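As a quick illustration (not needed for anything that follows), consider the squared Euclidean regularizer. Using generic notation, with x_t for the current iterate, y_t for the last loss vector, η for the learning rate, and 𝒳 for the decision set, the gradient map and its inverse are both the identity, so the two steps read

$$
\tilde{x}_{t+1} = x_t - \eta y_t,
\qquad
x_{t+1} = \arg\min_{x \in \mathcal{X}} \tfrac{1}{2}\|x - \tilde{x}_{t+1}\|_2^2 ,
$$

that is, a gradient step followed by a Euclidean projection onto the decision set, which is just projected gradient descent.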
Mirror descent on the simplex
We have seen that in the special case we need,
To use MD we need to specify the regularizer and the learning rate. For the former, we choose
which is known as the unnormalized negentropy function. Note that takes on finite values when (since , we set whenever ). Outside of this quadrant, we define the value of to be . The plot of for is shown on the right.
It is not hard to verify that is convex: First, is convex. Taking the first derivative, we find that for any ,
where is applied componentwise. Taking the derivative again, we find that for ,
i.e., the matrix whose th diagonal entry is . Clearly, this is a positive definite matrix, which suffices to verify that is a convex function.
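Spelling out the calculation for the unnormalized negentropy (a small completion in generic notation, writing F for the regularizer and d for the dimension):

$$
F(x) = \sum_{i=1}^d \big( x_i \log x_i - x_i \big), \qquad
\nabla F(x) = \log(x), \qquad
\nabla^2 F(x) = \mathrm{diag}\Big(\frac{1}{x_1}, \dots, \frac{1}{x_d}\Big),
\qquad x \in (0, \infty)^d ,
$$

where the logarithm is applied componentwise and the diagonal entries of the Hessian are positive, as claimed.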
The Bregman divergence induced by is
where again we use an “intuitive” notation in which operations are applied componentwise (i.e., denotes a vector whose th component is ). Note that the domain of is . If both and lie in the -simplex, becomes the well-known relative entropy, or Kullback-Leibler (KL) divergence.
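As a small numerical illustration (the function name and the test vectors below are made up), the divergence can be computed directly from its definition and checked against the KL divergence on the simplex:

```python
import numpy as np

# Hedged sketch: the unnormalized-negentropy Bregman divergence
# D_F(x, y) = sum_i [ x_i*log(x_i/y_i) - x_i + y_i ]  (with 0*log 0 := 0),
# which reduces to the KL divergence when x and y both lie on the simplex.

def negentropy_bregman(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    terms = np.where(x > 0, x * np.log(x / y), 0.0)
    return np.sum(terms - x + y)

x = np.array([0.7, 0.2, 0.1])
y = np.array([1/3, 1/3, 1/3])
kl = np.sum(x * np.log(x / y))
print(negentropy_bregman(x, y), kl)  # the two values agree on the simplex
```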
It is not hard to verify that can be obtained as shown in - and in particular this two-step update takes the form
Unrolling the recursion, we can also see that this is the same as
Based on this, it is obvious that MD can be efficiently implemented with this choice of . As far as the regret is concerned, the following theorem holds:
Theorem (MD with negentropy on the simplex): Let and . Then, no matter the adversary, a learner using MD with
is guaranteed that its regret in rounds is at most
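In code, the resulting learner is just exponential weighting of the cumulative losses. The sketch below runs it on made-up losses in the unit interval and reports the realized regret; the learning rate is a placeholder of the order suggested by the theorem (the exact recommended constant is the one stated above).

```python
import numpy as np

# A minimal sketch of MD with the unnormalized negentropy regularizer on the
# probability simplex (exponential weights), run on made-up loss vectors in
# [0, 1]^d. The learning rate is a placeholder of the right order.

def exp_weights_regret(losses, eta):
    n, d = losses.shape
    x = np.ones(d) / d                     # x_1 minimizes the negentropy on the simplex
    total_loss = 0.0
    for t in range(n):
        total_loss += losses[t] @ x
        x = x * np.exp(-eta * losses[t])   # multiplicative ("mirror") step
        x = x / x.sum()                    # normalization = projection to the simplex
    best_in_hindsight = losses.sum(axis=0).min()
    return total_loss - best_in_hindsight

rng = np.random.default_rng(0)
n, d = 1000, 5
losses = rng.uniform(0.0, 1.0, size=(n, d))
print(exp_weights_regret(losses, eta=np.sqrt(2 * np.log(d) / n)))
```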
When the adversary plays in with , we can use MD on the transformed sequence . Then, for any ,
where the third equality used that . Taking the maximum over gives that
By the update rule in ,
Note that the “shift” by cancels out in the normalization step. Hence, MD in this case takes the form
which is the same as before, except that the learning rate is scaled by . In particular, in this case one can set
and use update rule .
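For completeness, here is the same kind of sketch adapted to the scaled case, with a placeholder bound B on the losses; per the discussion above, only the learning rate changes (it gets divided by B).

```python
import numpy as np

# Self-contained sketch for losses in [0, B] with B known. One runs the same
# exponential-weights update, but with the learning rate divided by B
# (equivalently, MD on the rescaled losses). B, the horizon, and the losses
# below are placeholders.

rng = np.random.default_rng(3)
n, d, B = 1000, 5, 10.0
losses = rng.uniform(0.0, B, size=(n, d))
eta = np.sqrt(2 * np.log(d) / n) / B       # base rate, scaled by 1/B

x, total = np.ones(d) / d, 0.0
for t in range(n):
    total += losses[t] @ x
    x = x * np.exp(-eta * losses[t])
    x /= x.sum()
print(total - losses.sum(axis=0).min())    # realized regret
```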
MD applied to MDP planning
As agreed, from takes the form of a -round regret against in online linear optimization on the simplex with losses in . This suggests using MD in a state-by-state manner to control . Using and gives
to be used with
Note that this is the update used by Politex. Then, gives that simultaneously for all ,
Putting things together, we get the following result:
Theorem (Politex suboptimality gap bound): Pick a featurized MDP with a full rank feature-map and let . Assume that B2 holds for and the rewards in are in the interval. For , define
Then, in iterations, Politex produces a mixture policy such that with probability , the suboptimality gap of satisfies
In particular, for any , choosing so that
policy is -optimal with
while the total computation cost is .
Note that compared to the result of LSPI with G-optimal design, the amplification of the approximation error is reduced by a factor of , as promised. The price is that the number of iterations is now polynomial in , whereas before it was logarithmic. This suggests that perhaps a higher learning rate could help initially to speed up convergence, getting the best of both worlds.
Proof: As in the proof of the suboptimality gap for LSPI, we get that for any , with probability at least , for any ,
where the first inequality uses that takes values in . On the event when the above inequalities hold, by and ,
The details of this calculation are left to the reader.
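To summarize the algorithmic content, the following is a schematic, tabular rendering of the Politex update described above. It is a sketch only: approx_q is a placeholder for the routine that produces the action-value estimate of the current policy (in the lecture, a least-squares estimate based on features), and the state and action counts are illustrative.

```python
import numpy as np

# Schematic sketch of the Politex policy updates in tabular form. `approx_q`
# stands in for the routine returning an estimate of the action-value function
# of the current policy; everything else is illustrative.

def politex_policies(num_states, num_actions, K, eta, approx_q):
    policy = np.full((num_states, num_actions), 1.0 / num_actions)  # pi_0: uniform
    q_sum = np.zeros((num_states, num_actions))
    policies = []
    for _ in range(K):
        policies.append(policy)
        q_sum += approx_q(policy)                    # q_k: estimate of q^{pi_k}
        logits = eta * q_sum
        logits -= logits.max(axis=1, keepdims=True)  # for numerical stability only
        w = np.exp(logits)                           # pi_{k+1}(a|s) prop. to exp(eta * sum_j q_j(s,a))
        policy = w / w.sum(axis=1, keepdims=True)
    return policies  # the mixture policy picks one of these uniformly at random

# Toy usage with a fake evaluator that ignores the policy:
fake_q = lambda pi: np.ones_like(pi)
print(len(politex_policies(num_states=3, num_actions=2, K=5, eta=0.1, approx_q=fake_q)))
```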
Notes
Optimality of the Final Policy
Notice that we said the policy returned by Politex after iterations should be the mixture policy . A more natural policy to return is the final policy . The question then is: can one ensure similar optimality guarantees for the final policy as we have seen for ? The answer turns out to be yes, if we use the unnormalized negentropy regularizer for mirror descent (as we have already been doing in this lecture note). To see this, we aim to bound . We begin by writing
Notice that and are defined as before, and we already have bounds for both of them. It is also easy to see that takes a very similar form to and can be bounded in the same way as . If we can show that for all , then we get the result that
This is identical to the result of the main theorem in the lecture note above, except with the constant scaling replaced by a constant scaling in front of the approximation error (since uses the same bound as ).
It remains to show that indeed . To do this, we first write out in vector notation, to help us align with the math to come. Fix a state ; then
Since we will hold fixed for all the following steps, we slightly abuse notation in favor of avoiding clutter and write the above equation as follows, where it is understood that all functions are first evaluated at .
Next recall that the policy selected by MD at iteration is defined as
where we have negated to formulate our problem as a minimization problem as was needed for the MD analysis. If we set to the unnormalized negentropy regularizer (as was done in the notes above)
we have that
which turns out to be equivalent to
The above equation is the policy selection made by the Follow The Regularized Leader (FTRL) algorithm. For further details on the equivalence between MD and FTRL when is the unnormalized negentropy, one can refer to Chapter 28 of the Bandit Book. Importantly, the above equation will be useful for our proof.
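As a sanity check of this equivalence (an illustration, not part of the argument), one can verify numerically that the FTRL minimizer over the simplex with the unnormalized negentropy regularizer matches the exponential-weights closed form; the names, data, and tolerance below are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

# Numerical check: with the unnormalized negentropy regularizer, the FTRL
# minimizer over the simplex equals the exponential-weights / MD closed form,
# x proportional to exp(-eta * cumulative_loss). All data here are made up.

def negentropy(x):
    return float(np.sum(np.where(x > 0, x * np.log(x) - x, 0.0)))

def ftrl_step(cum_loss, eta):
    """argmin over the simplex of eta*<cum_loss, x> + F(x), solved numerically."""
    d = len(cum_loss)
    obj = lambda x: eta * cum_loss @ x + negentropy(x)
    cons = [{"type": "eq", "fun": lambda x: np.sum(x) - 1.0}]
    res = minimize(obj, np.ones(d) / d, bounds=[(1e-9, 1.0)] * d, constraints=cons)
    return res.x

rng = np.random.default_rng(1)
eta, losses = 0.5, rng.uniform(size=(10, 4))
closed_form = np.exp(-eta * losses.sum(axis=0))
closed_form /= closed_form.sum()
print(np.allclose(ftrl_step(losses.sum(axis=0), eta), closed_form, atol=1e-4))
```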
We will now show that by showing that
To do this notice that
where the first inequality holds since by we know that
The second inequality holds by repeatedly applying the first two steps.
The third inequality holds since was initialized as , so we have that . This concludes the argument.
Online convex optimization, online learning
Online linear optimization is a special case of online convex/concave optimization, where the learner chooses elements of some nonempty convex set and the adversary needs to choose an element of a nonempty set of concave functions over : . Then, the definition of regret is changed to
where as before is the choice of the learner for round and is the choice of the adversary for the same round. Identifying any vector of with the linear map , we see that online linear optimization is a special case of this problem.
Of course, by negating all functions in (i.e., letting ) and redefining the regret to
we get a definition that is used in the literature, which prefers the convex case to the concave. Here, the interpretation is that is a “loss function” chosen by the adversary in round .
The standard function notation ( is applied to ) injects unwarranted asymmetry into the notation. After all, from the perspective of the learner, they need to choose a value in that works for the various functions in . Thus, we can consider any element of as a function that maps elements of to reals through . Whether contains functions or does matters little; it is the interconnection between and that matters more. For this reason, one can study online learning where above is replaced by , where is a specific map that assigns payoffs to every pair of points in and . When the map is fixed, one can spare an extra symbol by just using in place of , which almost brings things full circle, given that we started with the linear case when .
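To illustrate the general protocol and this regret definition, here is a tiny sketch with made-up quadratic losses on the unit interval and a simple learner that plays the running mean of past targets; all choices are for illustration only.

```python
import numpy as np

# Illustration of the online convex optimization protocol and the loss-based
# regret definition above, with made-up quadratic losses f_t(x) = (x - c_t)^2
# on X = [0, 1] and a learner that plays the running mean of past targets.

rng = np.random.default_rng(2)
targets = rng.uniform(0.0, 1.0, size=200)   # c_t chosen by the "adversary"

learner_loss, seen = 0.0, []
for c in targets:
    x = np.mean(seen) if seen else 0.5      # learner's choice before seeing c
    learner_loss += (x - c) ** 2
    seen.append(c)

# The best fixed decision in hindsight for quadratic losses is the overall mean.
best_fixed = np.mean(targets)
best_loss = np.sum((best_fixed - targets) ** 2)
print("regret:", learner_loss - best_loss)
```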
Truncation or no truncation?
We introduced truncation to simplify the analysis. The proof can be made to go through even without it, with a mild increase of the suboptimality gap (or runtime). The advantage of removing the projection is that without projection, , which leads to a practically significant reduction of the runtime.
References
The optimality of the final policy presented in the Notes was shown by Tadashi Kozuno when he taught this lecture in Winter 2022.