
14. Politex


The following lemma can be extracted from the calculations found at the
end of the last lecture:


Lemma (Mixture policy suboptimality): Fix an MDP $M$. For any sequence $\pi_0,\dots,\pi_{k-1}$ of policies, any sequence $\hat q_0,\dots,\hat q_{k-1}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ of functions, and any policy $\pi$, the mixture policy $\bar\pi_k = \frac{1}{k}(\pi_0+\dots+\pi_{k-1})$ satisfies

$$
v^\pi - v^{\bar\pi_k} \le \frac{1}{k}(I-\gamma P_\pi)^{-1}\underbrace{\sum_{j=0}^{k-1} M_\pi \hat q_j - M_{\pi_j}\hat q_j}_{T_1} + \frac{2\max_{0\le j\le k-1}\|q^{\pi_j}-\hat q_j\|_\infty}{1-\gamma}\,. \tag{1}
$$

In particular, the only restriction so far is on the policy $\pi$: it has to be a memoryless policy. To control the suboptimality of the mixture policy, one just needs to control the action-value approximation errors $\|q^{\pi_j}-\hat q_j\|_\infty$ and the term $T_1$, and for the latter we are free to choose the policies $\pi_0,\dots,\pi_{k-1}$ in any way we like. To help with this choice, let us now inspect $T_1(s)$ for a fixed state $s$:

$$
T_1(s) = \sum_{j=0}^{k-1} \langle \pi(s,\cdot),\hat q_j(s,\cdot)\rangle - \langle \pi_j(s,\cdot),\hat q_j(s,\cdot)\rangle\,, \tag{2}
$$

where, abusing notation, we use $\pi(s,a)$ for $\pi(a|s)$. Now, recall that $\hat q_j$ will be computed based on $\pi_j$, while $\pi$ is unknown. One must thus wonder whether it is possible to control this term at all.

Online linear optimization

As it happens, the problem of controlling terms of this type is the central problem studied in a subfield of learning theory, online learning. In particular, in online linear optimization, the following problem is studied:

An adversary and a learner are playing a zero-sum minimax game in $k$ discrete rounds, taking actions in an alternating manner. In round $j$ ($0\le j\le k-1$), first the learner needs to choose a vector $x_j\in\mathcal{X}\subset\mathbb{R}^d$. Then, the adversary chooses a vector $y_j\in\mathcal{Y}\subset\mathbb{R}^d$. Before its choice, the adversary learns about all previous choices of the learner, and the learner also learns about all previous choices of the adversary. They also remember their own choices. For simplicity, let us constrain the adversary and the learner to be deterministic. The payoff to the adversary at the end of the $k$ rounds is

$$
R_k = \max_{x\in\mathcal{X}} \sum_{j=0}^{k-1} \langle x, y_j\rangle - \langle x_j, y_j\rangle\,. \tag{3}
$$

In particular, the adversary's goal is to maximize this, while the learner's goal is to minimize it (the game is zero-sum). Both the adversary and the learner are given $k$ and the sets $\mathcal{X},\mathcal{Y}$. Letting $L$ denote the learner's strategy (a sequence of maps from histories to $\mathcal{X}$) and $A$ denote the adversary's strategy (a sequence of maps from histories to $\mathcal{Y}$), the above quantity depends on both: $R_k = R_k(A,L)$.

Taking the perspective of the learner, the quantity defined in (3) is called the learner's regret. Denote the minimax value of the game by $R_k^*$: $R_k^* = \inf_L \sup_A R_k(A,L)$.

Thus, $R_k^*$ only depends on $k$, $\mathcal{X}$ and $\mathcal{Y}$; this dependence is suppressed when it is clear from the context. The central question then is how $R_k^*$ depends on $k$, and also on $\mathcal{X}$ and $\mathcal{Y}$. In online linear optimization both sets $\mathcal{X}$ and $\mathcal{Y}$ are convex.
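Before connecting this game to our planning problem, here is a minimal simulation sketch of the protocol, purely for illustration. The callables `learner` and `adversary` are placeholder strategies (the adversary is also shown the learner's current move, since the learner moves first within a round), and the best fixed comparator is approximated by brute force over a finite set `X_candidates`; these names are ours, not standard.

```python
import numpy as np

def play_game(learner, adversary, k, X_candidates):
    """Simulate k rounds of online linear optimization and return the regret (3)."""
    history, learner_payoff = [], 0.0
    cumulative_y = None
    for _ in range(k):
        x_j = learner(history)                 # learner moves first
        y_j = adversary(history, x_j)          # adversary answers, knowing the history and x_j
        learner_payoff += float(np.dot(x_j, y_j))
        cumulative_y = y_j if cumulative_y is None else cumulative_y + y_j
        history.append((x_j, y_j))
    # Regret: best fixed x in hindsight versus what the learner accumulated.
    best_fixed = max(float(np.dot(x, cumulative_y)) for x in X_candidates)
    return best_fixed - learner_payoff
```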

Connecting these games to our problem, we can see that $T_1(s)$ in (2) matches the regret definition in (3) if we let $d=\mathrm{A}$, $\mathcal{X}=\mathcal{M}_1(\mathcal{A})=\{p\in[0,1]^{\mathrm{A}}: \sum_a p_a = 1\}$ be the $\mathrm{A}-1$ simplex of $\mathbb{R}^{\mathrm{A}}$, and $\mathcal{Y}=[0,1/(1-\gamma)]^{\mathrm{A}}$. Furthermore, $\pi_j(s,\cdot)$ needs to be chosen first, which is followed by the choice of $\hat q_j(s,\cdot)$. While $\hat q_j(s,\cdot)$ will not be chosen in an adversarial fashion, a bound $B$ on the regret against arbitrary choices will also serve as a bound for the specific choice we will need to make for $\hat q_j(s,\cdot)$.

Mirror descent

Mirror descent (MD) is an algorithm that originates in optimization theory. In the context of online linear optimization, MD is a strategy for the learner which is known to guarantee near minimax regret for the learner under a wide range of circumstances.

To align with the large body of literature on online linear optimization, it will be beneficial to switch signs. Thus, in what follows we assume that the learner aims at minimizing $\langle x,y\rangle$ through its choice $x\in\mathcal{X}$, while the adversary aims at maximizing the same expression through its choice $y\in\mathcal{Y}$. This means that we also redefine the regret to

$$
R_k = \max_{x\in\mathcal{X}} \sum_{j=0}^{k-1} \langle x_j, y_j\rangle - \langle x, y_j\rangle
    = \sum_{j=0}^{k-1} \langle x_j, y_j\rangle - \min_{x\in\mathcal{X}}\sum_{j=0}^{k-1}\langle x, y_j\rangle\,. \tag{4}
$$

Everything else remains the same: the game is zero-sum and minimax, the regret is the payoff of the adversary, and the negative regret is the payoff of the learner. This version is called a loss game. The reason to prefer the loss game is that most of optimization theory is written for minimizing convex functions rather than maximizing concave ones; clearly, this is an arbitrary choice. The second form of the regret shows that the player's goal is to compete with the best single decision from $\mathcal{X}$, chosen with the hindsight of knowing all the choices of the adversary. That is, the learner's goal is to keep its cumulative loss $\sum_{j=0}^{k-1}\langle x_j,y_j\rangle$ close to, or even below, the best cumulative loss in hindsight, $\min_{x\in\mathcal{X}}\sum_{j=0}^{k-1}\langle x,y_j\rangle$. (With this, $T_1(s)$ matches $R_k$ when we change to $\mathcal{Y}=[-1/(1-\gamma),0]^{\mathrm{A}}$.)

MD is recursively defined and in its simplest form has two design parameters. The first is an extended real-valued convex function $F:\mathbb{R}^d\to\bar{\mathbb{R}}$, called the “regularizer”, while the second is a stepsize, or learning rate, parameter $\eta>0$. (The extended reals are just $\mathbb{R}$ together with $+\infty$ and $-\infty$, and an appropriate extension of basic arithmetic. Allowing convex functions to take the value $+\infty$ makes it possible to merge “constraints” with objectives in a seamless fashion. The value $-\infty$ is added because sometimes we have to work with negated extended real-valued convex functions.)

The specification of MD is as follows: In round $0$, $x_0\in\mathcal{X}$ is picked to minimize $F$:

$$
x_0 = \arg\min_{x\in\mathcal{X}} F(x)\,.
$$

In what follows, we assume that all the minimizers needed in the definition of MD exist. In the specific case that we need, $\mathcal{X}$ is the $d-1$ simplex, which is a compact convex set, and the functions we minimize over it are continuous there, so the minimizers we need are guaranteed to exist.

Then, in round $j>0$, MD chooses $x_j$ as follows:

$$
x_j = \arg\min_{x\in\mathcal{X}} \; \eta\langle x, y_{j-1}\rangle + D_F(x,x_{j-1})\,. \tag{5}
$$

Here,

$$
D_F(x,x') = F(x) - \left( F(x') + \langle \nabla F(x'), x - x'\rangle \right)
$$

is the remainder term in the first-order Taylor series expansion of $F$ at $x$ when the expansion is carried out at $x'$; for simplicity, we assume that $F$ is differentiable on the interior of its domain, $\mathrm{dom}(F)=\{x\in\mathbb{R}^d\,:\,F(x)<+\infty\}$. Since any linear approximation of a convex function stays below the graph of the function, we immediately get that $D_F$ is nonnegative valued. For an illustration see the figure on the right, which shows a convex function together with its first-order Taylor approximation at some point.

One should think of $F$ as a “nonlinear distance inducing function”; above, $D_F(x,x')$ can be thought of as a penalty imposed for deviating from $x'$. However, more often than not, $D_F$ is not a distance; for instance, it is often not even symmetric. Because of this, we can't really call $D_F$ a distance. Hence, it is called a divergence. In particular, $D_F(x,x')$ is called the Bregman divergence of $x$ from $x'$.
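As a quick, standard example (not specific to these notes), take the squared Euclidean norm as the regularizer, $F(x)=\frac12\|x\|_2^2$. Then $\nabla F(x)=x$ and

$$
D_F(x,x') = \tfrac12\|x\|_2^2 - \tfrac12\|x'\|_2^2 - \langle x', x-x'\rangle = \tfrac12\|x-x'\|_2^2\,,
$$

so the MD update (5) becomes $x_j=\arg\min_{x\in\mathcal{X}} \eta\langle x,y_{j-1}\rangle+\frac12\|x-x_{j-1}\|_2^2$: move from $x_{j-1}$ in the direction $-\eta y_{j-1}$ and project back onto $\mathcal{X}$ in the Euclidean distance, i.e., projected gradient descent.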

In the definition of the MD update rule, we tacitly assumed that $D_F(x,x_{j-1})$ is well-defined. This requires $F$ to be differentiable at $x_{j-1}$, which one needs to check when applying MD. In our specific case, this will again hold.

The idea of the MD update rule is to (1) allow the learner to react to the last loss vector $y_{j-1}$ chosen by the adversary, while also (2) limiting how much $x_j$ can depart from $x_{j-1}$, thus effectively stabilizing the algorithm, with the tradeoff governed by the choice of $\eta>0$. (Separating $\eta$ from $F$ only makes sense because there are some standard choices for $F$; $\eta$ is really just a scale parameter for $F$.) In particular, the larger the value of $\eta$, the more “data-sensitive” MD will be (here, $y_0,\dots,y_{k-1}$ constitute the data), and vice versa: the smaller $\eta$ is, the less data-sensitive MD will be.

Where is the mirror?

Under some technical conditions on $F$, the update rule (5) has a two-step implementation:

$$
\tilde x_j = (\nabla F)^{-1}\left( \nabla F(x_{j-1}) - \eta y_{j-1}\right)\,, \tag{6}
$$

$$
x_j = \arg\min_{x\in\mathcal{X}} D_F(x, \tilde x_j)\,. \tag{7}
$$

The first equation above explains the name: to obtain $\tilde x_j$, one first transforms $x_{j-1}$ using $\nabla F:\mathrm{dom}(F)\to\mathbb{R}^d$ to the “mirror” (dual) space where “gradients”/“slopes” live; there, one adds $-\eta y_{j-1}$ to the result, which can be seen as a “gradient step” (interpreting $y_{j-1}$ as the gradient of some loss). Finally, the result is mapped back to the original (primal) space using the inverse of $\nabla F$.

The second step of the update takes the resulting point $\tilde x_j$ and “projects” it to $\mathcal{X}$ in a way that respects the “geometry induced by $F$” on the space $\mathbb{R}^d$.

The use of fancy terminology, like “primal” and “dual” spaces, which here happen to be the same old Euclidean space $\mathbb{R}^d$, probably sounds like overkill. Indeed, in the simple case we consider, when these spaces are identical, it is. The distinction becomes important when working with infinite dimensional spaces, which we leave to others for now.

Besides helping with understanding the terminology, the two-step update shown can also be useful for computation. In fact, this will be the case in the special case that we need.

Mirror descent on the simplex

We have seen that in the special case we need,

$$
\mathcal{X} = \mathcal{P}_{d-1} := \Big\{ p\in[0,1]^d \,:\, \sum_a p_a = 1 \Big\}\,, \qquad
\mathcal{Y} = [-1/(1-\gamma),0]^d\,, \qquad \text{and} \qquad d = \mathrm{A}\,.
$$

To use MD we need to specify the regularizer $F$ and the learning rate. For the former, we choose

$$
F(x) = \sum_i x_i\log(x_i) - x_i\,,
$$

which is known as the unnormalized negentropy function. Note that $F$ takes on finite values when $x\in[0,\infty)^d$ (since $\lim_{u\to 0+} u\log(u)=0$, we set $x_i\log(x_i)=0$ whenever $x_i=0$). Outside of this orthant, we define the value of $F$ to be $+\infty$. The plot of $u\mapsto u\log(u)-u$ for $u\ge 0$ is shown on the right.

It is not hard to verify that $F$ is convex: First, $\mathrm{dom}(F)=[0,\infty)^d$ is convex. Taking the first derivative, we find that for any $x\in(0,\infty)^d$,

$$
\nabla F(x) = \log(x)\,,
$$

where $\log$ is applied componentwise. Taking the derivative again, we find that for $x\in(0,\infty)^d$,

$$
\nabla^2 F(x) = \mathrm{diag}(1/x)\,,
$$

i.e., the matrix whose $(i,i)$th diagonal entry is $1/x_i$. Clearly, this is a positive definite matrix, which suffices to verify that $F$ is a convex function.

The Bregman divergence induced by F is

$$
D_F(x,x') = \langle \boldsymbol{1}, x\log(x) - x\rangle - \langle \boldsymbol{1}, x'\log(x') - x'\rangle - \langle \log(x'), x - x'\rangle
          = \langle \boldsymbol{1}, x\log(x/x') - x + x'\rangle\,,
$$

where again we use an “intuitive” notation in which operations are applied componentwise (i.e., $x\log(x)$ denotes the vector whose $i$th component is $x_i\log(x_i)$, and $x/x'$ the vector with components $x_i/x'_i$). Note that the domain of $D_F$ is $[0,\infty)^d\times(0,\infty)^d$. If both $x$ and $x'$ lie in the $d-1$ simplex, $D_F$ becomes the well-known relative entropy, or Kullback-Leibler (KL) divergence.
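Indeed, in that case $\langle\boldsymbol{1},x\rangle=\langle\boldsymbol{1},x'\rangle=1$, so the last two terms cancel and

$$
D_F(x,x') = \sum_i x_i\log\frac{x_i}{x'_i} = \mathrm{KL}(x\,\|\,x')\,.
$$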

It is not hard to verify that $x_j$ can be obtained as shown in (6)-(7), and in particular this two-step update takes the form

$$
\tilde x_{j,i} = x_{j-1,i}\exp(-\eta y_{j-1,i})\,, \qquad x_{j,i} = \frac{\tilde x_{j,i}}{\sum_{i'}\tilde x_{j,i'}}\,, \qquad i\in[d]\,.
$$

Unrolling the recursion, we can also see that this is the same as

$$
\tilde x_{j,i} = \exp\big(-\eta(y_{0,i}+\dots+y_{j-1,i})\big)\,, \qquad x_{j,i} = \frac{\tilde x_{j,i}}{\sum_{i'}\tilde x_{j,i'}}\,, \qquad i\in[d]\,. \tag{8}
$$

Based on this, it is clear that MD can be efficiently implemented with this choice of $F$. As far as the regret is concerned, the following theorem holds:


Theorem (MD with negentropy on the simplex): Let $\mathcal{X}=\mathcal{P}_{d-1}$ and $\mathcal{Y}=[0,1]^d$. Then, no matter the adversary, a learner using MD with

$$
\eta = \sqrt{\frac{2\log(d)}{k}}
$$

is guaranteed that its regret $R_k$ in $k$ rounds is at most

$$
R_k \le \sqrt{2k\log(d)}\,.
$$

When the adversary plays in $\mathcal{Y}=[a,b]^d$ with $a<b$, we can use MD on the transformed sequence $\tilde y_j=(y_j-a\boldsymbol{1})/(b-a)\in[0,1]^d$. Then, for any $x\in\mathcal{X}$,

$$
R_k(x) := \sum_{j=0}^{k-1}\langle x_j - x, y_j\rangle
        = \sum_{j=0}^{k-1}\langle x_j - x, (b-a)\tilde y_j + a\boldsymbol{1}\rangle
        = (b-a)\sum_{j=0}^{k-1}\langle x_j - x, \tilde y_j\rangle
        \le (b-a)\sqrt{2k\log(d)}\,,
$$

where the third equality used that $\langle x_j,\boldsymbol{1}\rangle = \langle x,\boldsymbol{1}\rangle = 1$. Taking the maximum over $x\in\mathcal{X}$ gives

$$
R_k \le (b-a)\sqrt{2k\log(d)}\,. \tag{9}
$$

By the update rule in (8),

$$
\tilde x_{j,i} = \exp\big(-\eta(\tilde y_{0,i}+\dots+\tilde y_{j-1,i})\big)
             = \exp\Big(-\tfrac{\eta}{b-a}\big(y_{0,i}+\dots+y_{j-1,i} - ja\big)\Big)\,, \qquad i\in[d]\,.
$$

Note that the “shift” by $ja$ cancels out in the normalization step. Hence, MD in this case takes the form

$$
\tilde x_{j,i} = \exp\Big(-\tfrac{\eta}{b-a}(y_{0,i}+\dots+y_{j-1,i})\Big)\,, \qquad x_{j,i} = \frac{\tilde x_{j,i}}{\sum_{i'}\tilde x_{j,i'}}\,, \qquad i\in[d]\,, \tag{10}
$$

which is the same as before, except that the learning rate is scaled by $1/(b-a)$. In particular, in this case one can set

$$
\eta = \frac{1}{b-a}\sqrt{\frac{2\log(d)}{k}} \tag{11}
$$

and use update rule (8).
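To make the algorithm concrete, here is a minimal sketch of this learner (MD with the unnormalized negentropy regularizer on the simplex, i.e., exponential weights), written for losses in a known range $[a,b]$ as in (10)-(11). The function and variable names are ours, for illustration only.

```python
import numpy as np

def exponential_weights(losses, a, b):
    """MD with the negentropy regularizer on the simplex.

    losses: (k, d) array of the loss vectors y_0, ..., y_{k-1}, entries in [a, b].
    Returns the (k, d) array of plays x_0, ..., x_{k-1}, computed via the
    unrolled update (8) with the scaled learning rate (11).
    """
    k, d = losses.shape
    eta = np.sqrt(2.0 * np.log(d) / k) / (b - a)   # learning rate (11)
    plays = np.empty((k, d))
    cum = np.zeros(d)                              # running sum y_0 + ... + y_{j-1}
    for j in range(k):
        z = -eta * cum
        z -= z.max()                               # constant shift; cancels in the normalization
        w = np.exp(z)                              # unnormalized weights as in (8)
        plays[j] = w / w.sum()
        cum += losses[j]
    return plays
```

The constant shift before exponentiation is only for numerical stability; as noted above, such shifts cancel in the normalization step.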

MD applied to MDP planning

As agreed, $T_1(s)$ from (2) takes the form of a $k$-round regret against $\pi(s,\cdot)$ in online linear optimization on the simplex with losses in $[-1/(1-\gamma),0]^{\mathrm{A}}$. This suggests using MD in a state-by-state manner to control $T_1(s)$. Using (8) and (11) gives

$$
E_j(s,a) = \exp\big(\eta(\hat q_0(s,a)+\dots+\hat q_{j-1}(s,a))\big)\,, \qquad
\pi_j(a|s) = \frac{E_j(s,a)}{\sum_{a'}E_j(s,a')}\,, \qquad a\in\mathcal{A}\,,
$$

to be used with

$$
\eta = (1-\gamma)\sqrt{\frac{2\log(\mathrm{A})}{k}}\,.
$$

Note that this is the update used by Politex. Then, (9) gives that, simultaneously for all $s\in\mathcal{S}$,

$$
|T_1(s)| \le \frac{1}{1-\gamma}\sqrt{2k\log(\mathrm{A})}\,. \tag{12}
$$
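As a concrete illustration, here is a minimal tabular-style sketch of the Politex policy computation at iteration $j$; `q_hat_sum` is assumed to hold $\hat q_0+\dots+\hat q_{j-1}$ as an $S\times A$ array (the names and shapes are ours, not from a particular codebase).

```python
import numpy as np

def politex_policy(q_hat_sum, gamma, k):
    """Return pi_j(a|s) proportional to exp(eta * (q_hat_0 + ... + q_hat_{j-1})(s, a)).

    q_hat_sum: (S, A) array with the running sum of action-value estimates.
    Returns an (S, A) array whose rows are probability distributions over actions.
    """
    S, A = q_hat_sum.shape
    eta = (1.0 - gamma) * np.sqrt(2.0 * np.log(A) / k)
    logits = eta * q_hat_sum
    logits -= logits.max(axis=1, keepdims=True)   # per-state shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)
```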

Putting things together, we get the following result:


Theorem (Politex suboptimality gap bound): Pick a featurized MDP $(M,\phi)$ with a full rank feature map $\phi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^d$ and let $K,m,H\ge 1$. Assume that $\mathrm{B2}_{\varepsilon}$ holds for $(M,\phi)$ and that the rewards in $M$ are in the $[0,1]$ interval. For $0\le\zeta<1$, define

$$
\kappa(\zeta) = \varepsilon(1+\sqrt{d}) + \sqrt{d}\left( \frac{\gamma^H}{1-\gamma} + \frac{1}{1-\gamma}\sqrt{\frac{\log\big(d(d+1)K/\zeta\big)}{2m}} \right).
$$

Then, in $K$ iterations, Politex produces a mixed policy $\bar\pi_K$ such that, with probability $1-\zeta$, the suboptimality gap $\delta$ of $\bar\pi_K$ satisfies

$$
\delta \le \frac{1}{(1-\gamma)^2}\sqrt{\frac{2\log(\mathrm{A})}{K}} + \frac{2\kappa(\zeta)}{1-\gamma}\,.
$$

In particular, for any $\varepsilon'>0$, choosing $K,H,m$ so that

$$
K \ge \frac{32\log(\mathrm{A})}{(1-\gamma)^4(\varepsilon')^2}\,, \qquad
H \ge H_{\gamma,(1-\gamma)\varepsilon'/(8\sqrt{d})}\,, \qquad \text{and} \qquad
m \ge \frac{32 d}{(1-\gamma)^4(\varepsilon')^2}\log\big((d+1)^2 K/\zeta\big)\,,
$$

the mixed policy $\bar\pi_K$ is $\delta$-optimal with

$$
\delta \le \frac{2(1+\sqrt{d})}{1-\gamma}\,\varepsilon + \varepsilon'\,,
$$

while the total computation cost is $\mathrm{poly}\big(\tfrac{1}{1-\gamma}, d, \mathrm{A}, \tfrac{1}{(\varepsilon')^2}, \log(1/\zeta)\big)$.
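For a feel of the magnitudes involved, the following hedged helper simply evaluates these sufficient parameter choices for given problem constants. It assumes the effective horizon $H_{\gamma,\epsilon} = \lceil \log\big(1/(\epsilon(1-\gamma))\big)/\log(1/\gamma)\rceil$ (so that $\gamma^H/(1-\gamma)\le\epsilon$), which is a common convention and an assumption here, not part of the statement above.

```python
import math

def politex_parameters(gamma, eps_prime, d, A, zeta):
    """Evaluate the sufficient choices of K, H, m from the theorem (sketch only)."""
    K = math.ceil(32 * math.log(A) / ((1 - gamma) ** 4 * eps_prime ** 2))
    eps_H = (1 - gamma) * eps_prime / (8 * math.sqrt(d))       # target for gamma^H / (1 - gamma)
    H = math.ceil(math.log(1.0 / (eps_H * (1 - gamma))) / math.log(1.0 / gamma))
    m = math.ceil(32 * d / ((1 - gamma) ** 4 * eps_prime ** 2)
                  * math.log((d + 1) ** 2 * K / zeta))
    return K, H, m

# Example: gamma = 0.9, eps' = 0.1, d = 10, A = 5, zeta = 0.05
print(politex_parameters(0.9, 0.1, 10, 5, 0.05))
```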


Note that, compared to the result for LSPI with G-optimal design, the amplification of the approximation error $\varepsilon$ is reduced by a factor of $1/(1-\gamma)$, as promised. The price is that the number of iterations $K$ is now polynomial in $\tfrac{1}{(1-\gamma)\varepsilon'}$, whereas before it was logarithmic. This suggests that perhaps a higher learning rate can help initially to speed up convergence, to get the best of both worlds.

Proof: As in the proof of the suboptimality gap bound for LSPI, we get that for any $0<\zeta<1$, with probability at least $1-\zeta$, for all $0\le k\le K-1$,

$$
\|q^{\pi_k} - \hat q_k\|_\infty = \|q^{\pi_k} - \Pi\Phi\hat\theta_k\|_\infty \le \|q^{\pi_k} - \Phi\hat\theta_k\|_\infty \le \kappa(\zeta)\,,
$$

where the first inequality uses that $q^{\pi_k}$ takes values in $[0,1/(1-\gamma)]$, the range onto which $\Pi$ truncates. On the event when the above inequalities hold, by (1) and (12),

$$
\delta \le \frac{1}{(1-\gamma)^2}\sqrt{\frac{2\log(\mathrm{A})}{K}} + \frac{2\kappa(\zeta)}{1-\gamma}\,.
$$

The details of this calculation are left to the reader.

Notes

Optimality of the Final Policy

Notice that we said the policy returned by Politex after $k$ iterations should be the mixture policy $\bar\pi_k=\frac{1}{k}(\pi_0+\dots+\pi_{k-1})$. A more natural policy to return is the final policy $\pi_{k-1}$. The question then is: can one ensure similar optimality guarantees for the final policy $\pi_{k-1}$ as we have seen for $\bar\pi_k$? The answer turns out to be yes, provided that we use the unnormalized negentropy regularizer in mirror descent (as we have been doing in this lecture note). To see this, we aim to bound $v^\pi - v^{\pi_{k-1}}$. We begin by writing

$$
\begin{aligned}
v^\pi - v^{\pi_{k-1}}
&= v^\pi - v^{\bar\pi_k} + v^{\bar\pi_k} - v^{\pi_{k-1}} \\
&= \frac{1}{k}(I-\gamma P_\pi)^{-1}\sum_{j=0}^{k-1} M_\pi q^{\pi_j} - M_{\pi_j}q^{\pi_j}
 + \frac{1}{k}(I-\gamma P_{\pi_{k-1}})^{-1}\sum_{j=0}^{k-1} M_{\pi_j}q^{\pi_j} - M_{\pi_{k-1}}q^{\pi_j} \\
&= \frac{1}{k}(I-\gamma P_\pi)^{-1}\underbrace{\sum_{j=0}^{k-1} M_\pi \hat q_j - M_{\pi_j}\hat q_j}_{T_1}
 + \frac{1}{k}(I-\gamma P_\pi)^{-1}\underbrace{\sum_{j=0}^{k-1} (M_\pi - M_{\pi_j})(q^{\pi_j}-\hat q_j)}_{T_2} \\
&\quad + \frac{1}{k}(I-\gamma P_{\pi_{k-1}})^{-1}\underbrace{\sum_{j=0}^{k-1} M_{\pi_j}\hat q_j - M_{\pi_{k-1}}\hat q_j}_{T_3}
 + \frac{1}{k}(I-\gamma P_{\pi_{k-1}})^{-1}\underbrace{\sum_{j=0}^{k-1} (M_{\pi_j} - M_{\pi_{k-1}})(q^{\pi_j}-\hat q_j)}_{T_4}\,.
\end{aligned}
$$

Notice that $T_1$ and $T_2$ are defined as before, and we already have bounds for both of them. It is also easy to see that $T_4$ takes a very similar form to $T_2$ and can be bounded in the same way. If we can show that $T_3(s)\le 0$ for all $s\in\mathcal{S}$, then we would get that

$$
v^\pi - v^{\pi_{k-1}} \le \frac{1}{(1-\gamma)^2}\sqrt{\frac{2\log(\mathrm{A})}{K}} + \frac{4\kappa(\zeta)}{1-\gamma}\,,
$$

which is identical to the bound of the main theorem in the lecture note above, except that the constant $2$ scaling the approximation error is replaced by $4$ (since $T_4$ is bounded the same way as $T_2$).

We are left to show that indeed $T_3(s)\le 0$. To do this, we first write out $T_3(s)$ in vector notation, to help align with the derivation to come. Fix a state $s\in\mathcal{S}$; then

$$
T_3(s) = \sum_{j=0}^{k-1} \langle \pi_j(\cdot|s), \hat q_j(s,\cdot)\rangle - \langle \pi_{k-1}(\cdot|s), \hat q_j(s,\cdot)\rangle\,.
$$

Since we will hold $s$ fixed in all of the following steps, we slightly abuse notation in favor of avoiding clutter and write the above equation as follows, where it is understood that all functions are first evaluated at $s$:

$$
T_3 = \sum_{j=0}^{k-1} \langle \pi_j, \hat q_j\rangle - \langle \pi_{k-1}, \hat q_j\rangle\,.
$$

Next, recall that the policy selected by MD at iteration $k$ is defined as

$$
\pi_k = \arg\min_{\pi\in\mathcal{M}_1(\mathcal{A})} \eta\langle \pi, -\hat q_{k-1}\rangle + D_F(\pi,\pi_{k-1})
      = \arg\max_{\pi\in\mathcal{M}_1(\mathcal{A})} \eta\langle \pi, \hat q_{k-1}\rangle - D_F(\pi,\pi_{k-1})\,,
$$

where we have negated $\hat q_{k-1}$ to formulate our problem as a minimization problem, as was needed for the MD analysis. If we set $F$ to the unnormalized negentropy regularizer (as was done in the notes above),

$$
F(x) = \sum_i x_i\log(x_i) - x_i\,,
$$

we have that

$$
\pi_k = \arg\max_{\pi\in\mathcal{M}_1(\mathcal{A})} \eta\langle \pi, \hat q_{k-1}\rangle - \mathrm{KL}(\pi\,\|\,\pi_{k-1})\,,
$$

which turns out to be equivalent to

$$
\pi_k = \arg\max_{\pi\in\mathcal{M}_1(\mathcal{A})} \eta\Big\langle \pi, \sum_{j=0}^{k-1}\hat q_j\Big\rangle - F(\pi)\,. \tag{13}
$$

The above equation is the policy selection made by the Follow The Regularized Leader (FTRL) algorithm. For further details on the equivalence between MD and FTRL when $F$ is the unnormalized negentropy, one can refer to Chapter 28 of the Bandit Book. Importantly, the above equation will be useful for our proof.
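As a quick numerical sanity check of this equivalence (not part of the original argument), one can compare the recursive MD update, started from the uniform distribution, with the FTRL closed form (13); with the negentropy regularizer both produce the same policy:

```python
import numpy as np

rng = np.random.default_rng(0)
A, k, eta = 4, 10, 0.5
q_hats = rng.uniform(0.0, 1.0, size=(k, A))   # stand-ins for q-hat_0, ..., q-hat_{k-1}

def softmax(z):
    z = z - z.max()
    w = np.exp(z)
    return w / w.sum()

# Recursive MD: pi_j proportional to pi_{j-1} * exp(eta * q-hat_{j-1}), pi_0 uniform.
pi_md = np.full(A, 1.0 / A)
for j in range(k):
    pi_md = softmax(np.log(pi_md) + eta * q_hats[j])

# FTRL form (13): pi_k proportional to exp(eta * (q-hat_0 + ... + q-hat_{k-1})).
pi_ftrl = softmax(eta * q_hats.sum(axis=0))

assert np.allclose(pi_md, pi_ftrl)
```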

We will now show that $T_3\le 0$ by showing that

$$
\sum_{j=0}^{k-1} \langle \pi_{k-1}, \hat q_j\rangle \ge \sum_{j=0}^{k-1}\langle \pi_j, \hat q_j\rangle\,.
$$

To do this notice that

$$
\begin{aligned}
\eta\sum_{j=0}^{k-1}\langle \pi_{k-1}, \hat q_j\rangle
&= \eta\langle \pi_{k-1}, \hat q_{k-1}\rangle + \eta\Big\langle \pi_{k-1}, \sum_{j=0}^{k-2}\hat q_j\Big\rangle - F(\pi_{k-1}) + F(\pi_{k-1}) \\
&\ge \eta\langle \pi_{k-1}, \hat q_{k-1}\rangle + \eta\Big\langle \pi_{k-2}, \sum_{j=0}^{k-2}\hat q_j\Big\rangle - F(\pi_{k-2}) + F(\pi_{k-1}) \\
&\ge \eta\sum_{j=0}^{k-1}\langle \pi_j, \hat q_j\rangle - F(\pi_0) + F(\pi_{k-1}) \\
&\ge \eta\sum_{j=0}^{k-1}\langle \pi_j, \hat q_j\rangle\,,
\end{aligned}
$$

where the first inequality holds since, by (13), we know that

$$
\pi_{k-1} = \arg\max_{\pi\in\mathcal{M}_1(\mathcal{A})} \eta\Big\langle \pi, \sum_{j=0}^{k-2}\hat q_j\Big\rangle - F(\pi)\,.
$$

The second inequality holds by repeatedly applying the first two steps.

The third inequality holds since $\pi_0$ was initialized as $\pi_0=\arg\min_{\pi\in\mathcal{M}_1(\mathcal{A})}F(\pi)$, so we have that $F(\pi_{k-1}) - F(\pi_0)\ge 0$, which concludes the argument.

Online convex optimization, online learning

Online linear optimization is a special case of online convex/concave optimization, where the learner chooses elements of some nonempty convex set $\mathcal{X}\subset\mathbb{R}^d$ and the adversary needs to choose an element of a nonempty set $\mathcal{Y}$ of concave functions over $\mathcal{X}$: $\mathcal{Y}\subset\{f:\mathcal{X}\to\mathbb{R}\,:\,f \text{ is concave}\}$. Then, the definition of regret is changed to

$$
R_k = \max_{x\in\mathcal{X}}\sum_{j=0}^{k-1} y_j(x) - y_j(x_j)\,, \tag{14}
$$

where, as before, $x_j\in\mathcal{X}$ is the choice of the learner in round $j$ and $y_j\in\mathcal{Y}$ is the choice of the adversary for the same round. Identifying any vector $u\in\mathbb{R}^d$ with the linear map $x\mapsto\langle x,u\rangle$, we see that online linear optimization is a special case of this problem.

Of course, by negating all functions in $\mathcal{Y}$ (i.e., letting $\tilde{\mathcal{Y}}=\{-y\,:\,y\in\mathcal{Y}\}$) and redefining the regret to

$$
R_k = \max_{x\in\mathcal{X}}\sum_{j=0}^{k-1}\tilde y_j(x_j) - \tilde y_j(x)\,, \tag{15}
$$

we get the definition that is used in the literature, which prefers the convex case to the concave one. Here, the interpretation is that $\tilde y_j\in\tilde{\mathcal{Y}}$ is a “loss function” chosen by the adversary in round $j$.

The standard function notation ($y_j$ is applied to $x$) injects unwarranted asymmetry into the notation. After all, from the perspective of the learner, they need to choose a value in $\mathcal{X}$ that works well against the various functions in $\mathcal{Y}$. Thus, we can consider any element of $\mathcal{X}$ as a function that maps elements of $\mathcal{Y}$ to reals through $y\mapsto y(x)$. Whether $\mathcal{Y}$ has functions in it or $\mathcal{X}$ has functions in it does not matter that much; it is the interconnection between $\mathcal{X}$ and $\mathcal{Y}$ that matters more. For this reason, one can study online learning where $y(x)$ above is replaced by $b(x,y)$, where $b:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ is a specific map that assigns a payoff to every pair of points in $\mathcal{X}$ and $\mathcal{Y}$. When the map is fixed, one can spare an extra symbol by just writing $[x,y]$ in place of $b(x,y)$, which brings things almost full circle, given that we started with the linear case where $[x,y]=\langle x,y\rangle$.

Truncation or no truncation?

We introduced truncation to simplify the analysis. The proof can be made to go through even without it, with a mild increase of the suboptimality gap (or of the runtime). The advantage of removing the projection is that, without it, $\hat q_0+\dots+\hat q_{j-1} = \Phi(\hat\theta_0+\dots+\hat\theta_{j-1})$, which leads to a practically significant reduction of the runtime.
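To see why this matters computationally, here is a hedged sketch (the names `features_s` and `theta_sum` are ours): without truncation, it suffices to maintain the single $d$-dimensional running sum of the fitted parameter vectors, and the Politex policy at any queried state can be computed directly from the features, instead of storing or re-evaluating all previous action-value estimates.

```python
import numpy as np

def politex_policy_at_state(features_s, theta_sum, eta):
    """pi_j(.|s) proportional to exp(eta * phi(s, .) @ (theta_0 + ... + theta_{j-1})).

    features_s: (A, d) matrix whose rows are the feature vectors phi(s, a);
    theta_sum:  (d,) running sum of the fitted parameter vectors.
    """
    logits = eta * (features_s @ theta_sum)
    logits -= logits.max()          # constant shift; cancels in the normalization
    w = np.exp(logits)
    return w / w.sum()
```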

References

The optimality of the final policy presented in the Notes was shown by Tadashi Kozuno when he taught this lecture in Winter 2022.