
23. Tabular MDPs


In this lecture we will analyze an online learning algorithm for the finite-horizon episodic MDP setting. Let $M = (\mathcal{S}, \mathcal{A}, P^*, r, \mu, H)$ be an MDP with finite state and action spaces $\mathcal{S}$ and $\mathcal{A}$ (of cardinalities $S$ and $A$), unknown transition matrix $P^*$, known reward function $r_a(s) \in [0,1]$, an initial state distribution $\mu$, and episode length $H \ge 1$. The star-superscript in $P^*$ is used to distinguish the true environment from other (e.g. estimated) environments that occur in the algorithm and the analysis. The assumption that the reward function $r$ is known is for simplicity. In fact, most of the hardness (in terms of sample complexity and designing the algorithm) comes from the unknown transition probabilities.

We will focus on the finite-horizon setting where the learner interacts with the MDP over $k = 1, \dots, K$ episodes of length $H \ge 1$. Most, but not all, ideas translate to the infinite-horizon discounted or average reward settings.

Recall that the regret is defined as follows:

$$R_K = \sum_{k=1}^K v_0^*(S_0^{(k)}) - V^k$$

where $V^k = \sum_{h=0}^{H-1} r_{A_h^{(k)}}(S_h^{(k)})$.
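For concreteness, the regret can be accumulated directly from logged interaction data, given the optimal values $v_0^*$ for evaluation purposes. A minimal sketch (the function and argument names are illustrative, not part of the lecture):

```python
def total_regret(v_star_0, initial_states, returns):
    """R_K = sum_k ( v*_0(S_0^(k)) - V^k ), computed from logged episodes.

    v_star_0:       optimal values v*_0(s), used for evaluation only.
    initial_states: initial state S_0^(k) of each episode k.
    returns:        realized returns V^k = sum_h r_{A_h^(k)}(S_h^(k)).
    """
    return float(sum(v_star_0[s] - V for s, V in zip(initial_states, returns)))
```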

UCRL: Upper Confidence Reinforcement Learning

The UCRL algorithm implements the optimism principle. For this we need to define a set of plausible models. First, we define the maximum likelihood estimates using data from rounds $1, \dots, k-1$:

$$\hat P_a^{(k)}(s,s') = \frac{N_k(s,a,s')}{1 \vee N_k(s,a)}$$

The definition makes use of the notation $a \vee b = \max(a,b)$ and of the empirical counts:

$$\begin{aligned}N_k(s,a) &= \sum_{k' < k} \sum_{h < H} \mathbb{I}\big(S_h^{(k')} = s,\, A_h^{(k')} = a\big)\\ N_k(s,a,s') &= \sum_{k' < k} \sum_{h < H} \mathbb{I}\big(S_h^{(k')} = s,\, A_h^{(k')} = a,\, S_{h+1}^{(k')} = s'\big)\end{aligned}$$
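To make the estimator concrete, here is a minimal sketch of how the counts and the maximum likelihood estimate could be computed from logged episodes (the array shapes and names are illustrative assumptions, not part of the lecture):

```python
import numpy as np

def empirical_model(episodes, S, A):
    """Counts N(s, a), N(s, a, s') and the estimate P_hat_a(s, .) from logged data.

    episodes: list of trajectories, each a list of (s, a, s_next) transition triples.
    Returns N_sa of shape [S, A], N_sas of shape [S, A, S], P_hat of shape [S, A, S].
    """
    N_sa = np.zeros((S, A))
    N_sas = np.zeros((S, A, S))
    for traj in episodes:
        for (s, a, s_next) in traj:
            N_sa[s, a] += 1
            N_sas[s, a, s_next] += 1
    # P_hat[s, a, s'] = N(s, a, s') / max(1, N(s, a))
    P_hat = N_sas / np.maximum(1, N_sa)[:, :, None]
    return N_sa, N_sas, P_hat
```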

Define the confidence set

$$\mathcal{C}_{k,\delta} = \Big\{P : \forall s,a,\; \big\|\hat P_a^{(k)}(s) - P_a(s)\big\|_1 \le \beta_\delta(N_k(s,a))\Big\}$$

where $\beta_\delta : \mathbb{N} \to (0,\infty)$ is a function that we will choose shortly. Our goal in choosing $\beta_\delta$ is to ensure that

  1. $P^* \in \mathcal{C}_{k,\delta}$ for all $k = 1, \dots, K$ with probability at least $1-\delta$
  2. $\mathcal{C}_{k,\delta}$ is “not too large”

The second point will appear formally in the proof; however, note that from a statistical perspective, we want the confidence set to be as efficient (i.e. as small) as possible.
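As a sketch of how the confidence set would be used in code, the following checks membership of a candidate model in $\mathcal{C}_{k,\delta}$, with $\beta_\delta$ taken from the lemma of Step 1 below (array conventions as in the previous snippet; all names are illustrative):

```python
import numpy as np

def beta(u, S, A, delta):
    """Confidence width beta_delta(u) from the lemma in Step 1 below."""
    if u == 0:
        return np.inf  # no data yet: the constraint on P_a(s) is vacuous
    return 2 * np.sqrt((S * np.log(2) + np.log(u * (u + 1) * S * A / delta)) / (2 * u))

def in_confidence_set(P, P_hat, N_sa, delta):
    """Check whether a model P (shape [S, A, S]) lies in C_{k, delta}."""
    S, A, _ = P.shape
    for s in range(S):
        for a in range(A):
            if np.abs(P_hat[s, a] - P[s, a]).sum() > beta(N_sa[s, a], S, A, delta):
                return False
    return True
```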

With the confidence set, we can now introduce the UCRL algorithm:


UCRL (Upper confidence reinforcement learning):

In episodes $k = 1, \dots, K$:

  1. Compute the confidence set $\mathcal{C}_{k,\delta}$
  2. Use the policy $\pi_k = \arg\max_\pi \max_{P \in \mathcal{C}_{k,\delta}} v_P^\pi$
  3. Observe episode data $S_0^{(k)}, A_0^{(k)}, S_1^{(k)}, \dots, S_{H-1}^{(k)}, A_{H-1}^{(k)}, S_H^{(k)}$

Note that we omitted the rewards from the observation data. Since we made the assumption that the reward function $r_a(s)$ is known, we can always recompute the rewards from the state and action sequence.

For now we also gloss over the question of how to compute the optimistic policy $\pi_k$ efficiently, but we will get back to this point later.

Step 1: Defining the confidence set


Lemma (L1-confidence set): Let $\beta_\delta(u) = 2\sqrt{\frac{S\log(2) + \log\big(u(u+1)SA/\delta\big)}{2u}}$ and define the confidence sets

$$\mathcal{C}_{k,\delta} = \Big\{P : \forall s,a,\; \big\|\hat P_a^{(k)}(s) - P_a(s)\big\|_1 \le \beta_\delta(N_k(s,a))\Big\}$$

Then, with probability at least $1-\delta$,

$$\forall k \ge 1: \quad P^* \in \mathcal{C}_{k,\delta}$$

Proof: Let $(s,a)$ be fixed and denote by $X_v \in \mathcal{S}$ the next state observed upon visiting $(s,a)$ for the $v$-th time. Assume that $(s,a)$ was visited $u$ times in total. Then $\hat P_{u,a}(s,s') = \frac{1}{u}\sum_{v=1}^u \mathbb{I}(X_v = s')$.

The Markov property implies that $(X_v)_{v=1}^u$ is i.i.d. Note that for any vector $p \in \mathbb{R}^S$ we can write the 1-norm as $\|p\|_1 = \sup_{\|x\|_\infty \le 1} \langle p, x \rangle$. Therefore

$$\big\|\hat P_{u,a}(s) - P^*_a(s)\big\|_1 = \max_{x \in \{\pm 1\}^S} \big\langle \hat P_{u,a}(s) - P^*_a(s), x\big\rangle$$

Fix some $x \in \{\pm 1\}^S$. Then

$$\big\langle \hat P_{u,a}(s) - P^*_a(s), x\big\rangle = \frac{1}{u}\sum_{v=1}^u \sum_{s'} x_{s'}\big(\mathbb{I}(X_v = s') - P^*_a(s,s')\big) = \frac{1}{u}\sum_{v=1}^u \Delta_v$$

where in the last line we defined $\Delta_v = \sum_{s' \in \mathcal{S}} x_{s'}\big(\mathbb{I}(X_v = s') - P^*_a(s,s')\big)$. Note that $\mathbb{E}[\Delta_v] = 0$, that $\Delta_v = x_{X_v} - \mathbb{E}[x_{X_v}]$ takes values in an interval of length $2$, and that $(\Delta_v)_{v=1}^u$ is an i.i.d. sequence. Therefore Hoeffding’s inequality implies that with probability at least $1-\delta$,

$$\frac{1}{u}\sum_{v=1}^u \Delta_v \le 2\sqrt{\frac{\log(1/\delta)}{2u}}$$

Next, note that $|\{\pm 1\}^S| = 2^S$. Therefore, taking a union bound over all $x \in \{\pm 1\}^S$, we get that with probability at least $1-\delta$,

$$\big\|\hat P_{u,a}(s) - P^*_a(s)\big\|_1 \le 2\sqrt{\frac{S\log(2) + \log(1/\delta)}{2u}}$$

In the last step, we take a union bound over $s \in \mathcal{S}$, $a \in \mathcal{A}$ and $u \ge 1$. For the union bound over the infinite set of natural numbers, we can use the following simple trick. Note that

$$\sum_{u=1}^\infty \frac{\delta}{u(u+1)} = \delta$$

This follows from the simple observation that $\frac{1}{u(u+1)} = \frac{1}{u} - \frac{1}{u+1}$ and a telescoping sum argument. Therefore, with probability at least $1-\delta$, for all $u \ge 1$, $s \in \mathcal{S}$ and $a \in \mathcal{A}$,

$$\big\|\hat P_{u,a}(s) - P^*_a(s)\big\|_1 \le 2\sqrt{\frac{S\log(2) + \log\big(u(u+1)SA/\delta\big)}{2u}}$$

Lastly, the claim follows by noting that $\hat P_a^{(k)}(s) = \hat P_{N_k(s,a),a}(s)$.
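The lemma can be sanity-checked numerically: draw $u$ i.i.d. next states from a fixed distribution, compare the empirical $L_1$ deviation to $\beta_\delta(u)$, and confirm that violations are rare. A minimal sketch (the distribution, the sizes and the number of trials are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, delta, u = 5, 3, 0.1, 200
P_true = rng.dirichlet(np.ones(S))          # next-state distribution of one fixed (s, a)

def beta(u, S, A, delta):
    return 2 * np.sqrt((S * np.log(2) + np.log(u * (u + 1) * S * A / delta)) / (2 * u))

trials, violations = 2000, 0
for _ in range(trials):
    X = rng.choice(S, size=u, p=P_true)      # u i.i.d. next-state observations
    P_hat = np.bincount(X, minlength=S) / u  # empirical distribution
    if np.abs(P_hat - P_true).sum() > beta(u, S, A, delta):
        violations += 1
print(f"violation rate: {violations / trials:.4f} (should be well below {delta})")
```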

Step 2: Bounding the regret


Theorem (UCRL Regret): The regret of UCRL defined with the confidence sets $\mathcal{C}_{k,\delta}$ satisfies, with probability at least $1-3\delta$,

$$R_K \le 4 c_\delta H\sqrt{SAHK} + 2 c_\delta H^2 SA + 3H\sqrt{2HK\log(1/\delta)}$$

where $c_\delta = 2\sqrt{S\log(2) + \log\big(HK(HK+1)SA/\delta\big)}$. In particular, for large enough $K$, suppressing constants and logarithmic factors, we get

$$R_K \le \tilde O\big(H^{3/2} S \sqrt{AK \log(1/\delta)}\big)$$

Proof: Denote by $\pi_k$ the UCRL policy defined as

$$\pi_k = \arg\max_\pi \max_{P \in \mathcal{C}_{k,\delta}} v^\pi_{0,P}(S_0^{(k)})$$

Further, let $\tilde P^{(k)} = \arg\max_{P \in \mathcal{C}_{k,\delta}} v^*_{0,P}(S_0^{(k)})$ be the optimistic model.

In what follows we assume that we are on the event $\mathcal{E} = \bigcap_{k \ge 1} \{P^* \in \mathcal{C}_{k,\delta}\}$. By the previous lemma, $\mathbb{P}(\mathcal{E}) \ge 1 - \delta$.

Fix $k \ge 1$ and decompose the (instantaneous) regret in round $k$ as follows:

$$v_0^*(S_0^{(k)}) - V^k = \underbrace{v^*_{0,P^*}(S_0^{(k)}) - v^*_{0,\tilde P^{(k)}}(S_0^{(k)})}_{(I)} + \underbrace{v^{\pi_k}_{0,\tilde P^{(k)}}(S_0^{(k)}) - v^{\pi_k}_{0,P^*}(S_0^{(k)})}_{(II)} + \underbrace{v^{\pi_k}_{0,P^*}(S_0^{(k)}) - V^k}_{(III)}$$

Note that we used that $v^*_{0,\tilde P^{(k)}}(S_0^{(k)}) = v^{\pi_k}_{0,\tilde P^{(k)}}(S_0^{(k)})$, which holds because, by definition, $\pi_k$ is an optimal policy for $\tilde P^{(k)}$.

The first term is easily bounded. This is the crucial step that makes use of the optimism principle. By $P^* \in \mathcal{C}_{k,\delta}$ and the choice of $\tilde P^{(k)}$ it follows that $(I) \le 0$. In particular, we have already eliminated the dependence on the (unknown) optimal policy from the regret bound!

The last term is also relatively easy to control. Denote $\xi_k = (III)$. Note that by the definition of the value function we have $\mathbb{E}[\xi_k \,|\, S_0^{(k)}] = 0$ and $|\xi_k| \le H$. Hence $\xi_k$ behaves like noise! If the $\xi_k$ were i.i.d. we could directly apply Hoeffding’s inequality to bound $\sum_{k=1}^K \xi_k$.

The sequence $(\xi_k)_{k \ge 1}$ has a property that allows us to obtain a similar bound. Let

$$\mathcal{F}_k = \big\{S_0^{(l)}, A_0^{(l)}, S_1^{(l)}, \dots, S_{H-1}^{(l)}, A_{H-1}^{(l)}, S_H^{(l)}\big\}_{l=1}^{k-1}$$

be the data available to the learner at the beginning of episode $k$. Then, by the definition of the value function, $\mathbb{E}[\xi_k \,|\, \mathcal{F}_k, S_0^{(k)}] = 0$.

A sequence of random variables $(\xi_k)_{k \ge 1}$ with this property is called a martingale difference sequence. Luckily for us, most properties that hold for (zero-mean) i.i.d. sequences can also be shown for martingale difference sequences. The analogue of Hoeffding’s inequality is the Azuma-Hoeffding inequality. Applied to the sequence $(\xi_k)_{k \ge 1}$, it implies that with probability at least $1-\delta$,

$$\sum_{k=1}^K \xi_k \le H\sqrt{2K\log(1/\delta)}$$

It remains to bound term (II) in the regret decomposition:

$$(II) = v^{\pi_k}_{0,\tilde P^{(k)}}(S_0^{(k)}) - v^{\pi_k}_{0,P^*}(S_0^{(k)})$$

Using the Bellman equation, we can recursively compute the value function for any policy π:

$$v^\pi_{h,P} = r_\pi + M_\pi P v^\pi_{h+1,P}\,, \quad 0 \le h \le H-1\,, \qquad v^\pi_{H,P} = 0$$
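For concreteness, this finite-horizon Bellman recursion can be unrolled directly. The following sketch evaluates the value function of a deterministic policy by backward induction; representing the policy as an array indexed by step and state is an illustrative assumption, not part of the lecture:

```python
import numpy as np

def evaluate_policy(P, r, pi, H):
    """Backward induction v_h^pi = r_pi + M_pi P v_{h+1}^pi with v_H^pi = 0.

    P:  transition probabilities, shape [S, A, S] with P[s, a, s'].
    r:  rewards r_a(s), stored as r[s, a].
    pi: deterministic policy, pi[h, s] = action taken in state s at step h.
    Returns v with shape [H + 1, S], where v[h, s] = v_h^pi(s).
    """
    S = r.shape[0]
    v = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        for s in range(S):
            a = pi[h, s]
            v[h, s] = r[s, a] + P[s, a] @ v[h + 1]
    return v
```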

We introduce the following shorthand for the value difference of the policy $\pi_k$ under the models $P^*$ and $\tilde P^{(k)}$:

$$\delta_h^{(k)} = v^{\pi_k}_{h,\tilde P^{(k)}}(S_h^{(k)}) - v^{\pi_k}_{h,P^*}(S_h^{(k)})$$

Let $\mathcal{F}_{h,k}$ contain all observation data up to episode $k$ and step $h$, including $S_h^{(k)}$. Using the Bellman equation, we can write

$$\begin{aligned}\delta_h^{(k)} &= \big(M_{\pi_k} \tilde P^{(k)} v^{\pi_k}_{h+1,\tilde P^{(k)}}\big)(S_h^{(k)}) - \big(M_{\pi_k} P^* v^{\pi_k}_{h+1,P^*}\big)(S_h^{(k)}) \pm \big(M_{\pi_k} P^* v^{\pi_k}_{h+1,\tilde P^{(k)}}\big)(S_h^{(k)})\\ &= \big(M_{\pi_k} (\tilde P^{(k)} - P^*) v^{\pi_k}_{h+1,\tilde P^{(k)}}\big)(S_h^{(k)}) + \big(M_{\pi_k} P^* (v^{\pi_k}_{h+1,\tilde P^{(k)}} - v^{\pi_k}_{h+1,P^*})\big)(S_h^{(k)})\\ &\le \big\|P^*_{A_h^{(k)}}(S_h^{(k)}) - \tilde P^{(k)}_{A_h^{(k)}}(S_h^{(k)})\big\|_1 H + \delta_{h+1}^{(k)} + \underbrace{\big(\mathbb{E}\big[\delta_{h+1}^{(k)} \,\big|\, \mathcal{F}_{h,k}\big] - \delta_{h+1}^{(k)}\big)}_{=:\, \eta_{h+1}^{(k)}}\\ &\le 2H \beta_\delta\big(N_k(S_h^{(k)}, A_h^{(k)})\big) + \delta_{h+1}^{(k)} + \eta_{h+1}^{(k)}\end{aligned}$$

The first inequality uses that for any two vectors $w, v$ we have $\langle w, v \rangle \le \|w\|_1 \|v\|_\infty$, and that $\|v^{\pi_k}_{h+1,\tilde P^{(k)}}\|_\infty \le H$. Further, we use that $\pi_k$ is a deterministic policy, therefore $(M_{\pi_k} P)(S_h^{(k)}) = P_{A_h^{(k)}}(S_h^{(k)})$. The second inequality follows from the definition of the confidence set in the previous lemma:

$$\big\|P^*_{A_h^{(k)}}(S_h^{(k)}) - \tilde P^{(k)}_{A_h^{(k)}}(S_h^{(k)})\big\|_1 \le \big\|P^*_{A_h^{(k)}}(S_h^{(k)}) - \hat P^{(k)}_{A_h^{(k)}}(S_h^{(k)})\big\|_1 + \big\|\hat P^{(k)}_{A_h^{(k)}}(S_h^{(k)}) - \tilde P^{(k)}_{A_h^{(k)}}(S_h^{(k)})\big\|_1 \le 2\beta_\delta\big(N_k(S_h^{(k)}, A_h^{(k)})\big)$$

Telescoping and using that $\delta_H^{(k)} = 0$ yields

$$\delta_0^{(k)} \le \eta_1^{(k)} + \cdots + \eta_{H-1}^{(k)} + \underbrace{2H \sum_{h=0}^{H-1} \beta_\delta\big(N_k(S_h^{(k)}, A_h^{(k)})\big)}_{(IV)}$$

Note that the variables $(\eta_h^{(k)})_{h=1}^{H-1}$, $k = 1, \dots, K$, form another martingale difference sequence (with $|\eta_h^{(k)}| \le 2H$). By Azuma-Hoeffding, with probability at least $1-\delta$,

$$\sum_{k=1}^K \sum_{h=1}^{H-1} \eta_h^{(k)} \le 2H\sqrt{2HK\log(1/\delta)}$$

It remains to bound term (IV). For this we make use of the following algebraic lemma:


Lemma:

For any sequence of non-negative numbers $m_1, \dots, m_K$:

$$\sum_{k=1}^K \frac{m_k}{\sqrt{1 \vee (m_1 + \cdots + m_k)}} \le 2\sqrt{m_1 + \cdots + m_K}$$

Proof of Lemma: Let $f(x) = \sqrt{x}$, which is concave on $(0,\infty)$ with $f'(x) = \frac{1}{2\sqrt{x}}$. Therefore $f(A+x) \le f(A) + x f'(A)$ for all $A > 0$ and $x$ with $A + x \ge 0$. This translates to

$$\sqrt{A+x} \le \sqrt{A} + \frac{x}{2\sqrt{A}}$$

Applying this with $A = m_1 + \cdots + m_k$ and $x = -m_k$ gives $\frac{m_k}{\sqrt{m_1 + \cdots + m_k}} \le 2\big(\sqrt{m_1 + \cdots + m_k} - \sqrt{m_1 + \cdots + m_{k-1}}\big)$, and the claim follows from telescoping.
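A quick numerical check of the lemma on a random non-negative sequence (a sketch; the particular sequence is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
m = rng.uniform(0.0, 5.0, size=100)            # an arbitrary non-negative sequence
csum = np.cumsum(m)
lhs = np.sum(m / np.sqrt(np.maximum(1.0, csum)))
rhs = 2 * np.sqrt(csum[-1])
print(lhs <= rhs, lhs, rhs)                    # the inequality should hold
```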

We continue the proof of the theorem, where it remains to bound (IV). Denote $c_\delta = 2\sqrt{S\log(2) + \log\big(HK(HK+1)SA/\delta\big)}$, so that $\beta_\delta(u) \le c_\delta/\sqrt{1 \vee u}$ for all $1 \le u \le HK$. Further, let $M_k(s,a) = \sum_{h=0}^{H-1}\mathbb{I}\big(S_h^{(k)} = s, A_h^{(k)} = a\big)$ and note that $N_k(s,a) = M_1(s,a) + \cdots + M_{k-1}(s,a)$. Then

$$\sum_{k=1}^K \sum_{h=0}^{H-1} \beta_\delta\big(N_k(S_h^{(k)}, A_h^{(k)})\big) \le c_\delta \sum_{s,a} \sum_{k=1}^K \sum_{h=0}^{H-1} \frac{\mathbb{I}\big(S_h^{(k)} = s, A_h^{(k)} = a\big)}{\sqrt{1 \vee N_k(s,a)}} = c_\delta \sum_{s,a} \sum_{k=1}^K \frac{M_k(s,a)}{\sqrt{1 \vee \big(M_1(s,a) + \cdots + M_{k-1}(s,a)\big)}}$$

Next, using the algebraic lemma above and the fact that $M_k(s,a) \le H$, we find

$$\begin{aligned}\sum_{k=1}^K \sum_{h=0}^{H-1} \beta_\delta\big(N_k(S_h^{(k)}, A_h^{(k)})\big) &\le c_\delta \sum_{s,a} \sum_{k=1}^K \frac{M_k(s,a)}{\sqrt{1 \vee \big(M_1(s,a) + \cdots + M_{k-1}(s,a)\big)}}\\ &\le c_\delta \sum_{s,a} \sum_{k=1}^K \frac{M_k(s,a)}{\sqrt{1 \vee \big(M_1(s,a) + \cdots + M_k(s,a) - H\big)}}\\ &\le c_\delta \sum_{s,a} \sum_{k=1}^K \frac{M_k(s,a)\, \mathbb{I}\big(M_1(s,a) + \cdots + M_k(s,a) > H\big)}{\sqrt{M_1(s,a) + \cdots + M_k(s,a) - H}} + c_\delta HSA\\ &\le 2 c_\delta \sum_{s,a} \sqrt{N_{K+1}(s,a)} + c_\delta HSA\\ &\le 2 c_\delta SA \sqrt{\sum_{s,a} N_{K+1}(s,a)/(SA)} + c_\delta HSA\\ &= 2 c_\delta \sqrt{SAHK} + c_\delta HSA\end{aligned}$$

The last inequality uses Jensen’s inequality.

Collecting all terms and taking a union bound over the two applications of Azuma-Hoeffding and the event $\mathcal{E}$ completes the proof.

Unknown reward functions

In our analysis of UCRL we assumed that the reward function is known. While this is quite a common assumption in the literature, it is mainly for simplicity. We also don’t expect the bounds to change by much: Estimating the rewards is not harder than estimating the transition kernels.

To modify the analysis and account for unknown rewards, we first consider the case of a deterministic reward function $r_a(s) \in [0, R_{\max}]$, where $R_{\max}$ is some known upper bound on the reward per step.

Embracing the idea of optimism, we define reward estimates

$$\hat r_a^{(k)}(s) = \begin{cases} r_a(s) & \text{if } (s,a) \text{ was visited at some step } h \text{ of an episode } k' < k,\\ R_{\max} & \text{else.}\end{cases}$$

Clearly this defines an optimistic estimate: $\hat r_a^{(k)}(s) \ge r_a(s)$. Moreover, $\hat r^{(k)}_{A_h^{(k)}}(S_h^{(k)}) \neq r_{A_h^{(k)}}(S_h^{(k)})$ happens at most $SA$ times. Therefore the regret in the previous analysis is increased by at most $R_{\max}SA$.
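As a sketch, the optimistic reward estimate for the deterministic-reward case could be maintained as follows (the dictionary of observed pairs is an illustrative representation, not part of the lecture):

```python
import numpy as np

def optimistic_rewards(observed, r_max, S, A):
    """Optimistic reward estimate: the observed (deterministic) reward where
    (s, a) has been visited in an earlier episode, and R_max everywhere else."""
    r_hat = np.full((S, A), float(r_max))
    for (s, a), reward in observed.items():
        r_hat[s, a] = reward
    return r_hat
```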

When the reward is stochastic, we can use a maximum likelihood estimate of the reward and construct confidence bounds around the estimate. This way we can define an optimistic reward. Still, not much changes, as the reward estimates concentrate at the same rate as the estimates of $P^*$.

UCBVI: Upper Confidence Bound Value Iteration

Computing the UCRL policy can be quite challenging. However, we can relax the construction so that we can use backward induction. We define a time-inhomogeneous relaxation of the confidence set:

$$\mathcal{C}_{k,\delta}^H = \underbrace{\mathcal{C}_{k,\delta} \times \cdots \times \mathcal{C}_{k,\delta}}_{H \text{ times}}$$

Let $\tilde P_{1:H,k} := (\tilde P_{1,k}, \dots, \tilde P_{H,k}) = \arg\max_{P \in \mathcal{C}_{k,\delta}^H} v_P^*(S_0^{(k)})$ be the optimistic (time-inhomogeneous) transition matrices and $\pi_k = \arg\max_\pi v^\pi_{\tilde P_{1:H,k}}$ the optimal policy for the optimistic model $\tilde P_{1:H,k}$. Then $v^{\pi_k}_{\tilde P_{1:H,k}} = v^*_{\tilde P_{1:H,k}} =: v^{(k)}$ is given by the following backwards induction:

$$\begin{aligned} v_H^{(k)}(s) &= 0 \quad \forall s \in \mathcal{S}\\ Q_h^{(k)}(s,a) &= r_a(s) + \max_{P \in \mathcal{C}_{k,\delta}} \big\langle P_a(s), v_{h+1}^{(k)}\big\rangle\\ v_h^{(k)}(s) &= \max_a Q_h^{(k)}(s,a)\end{aligned}$$

Note that the maximum in the second line is a linear optimization over a convex constraint set, which can be solved efficiently. Further, the proof of the UCRL regret bound still applies, because we used the same (step-wise) relaxation in the analysis.
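For the $L_1$ confidence set used here, this inner maximization even admits a simple greedy solution that is standard in the UCRL literature: move as much of the $L_1$ budget as possible onto the state with the largest value of $v_{h+1}^{(k)}$ and remove the same amount of mass from the states with the smallest values. A minimal sketch under this assumption (not part of the lecture):

```python
import numpy as np

def optimistic_backup(p_hat, v, beta):
    """max of <q, v> over distributions q with ||q - p_hat||_1 <= beta.

    Greedy construction: add mass (at most beta / 2) to the state with the
    largest value, then remove the same amount from the lowest-value states.
    """
    q = p_hat.copy()
    best = int(np.argmax(v))
    add = min(beta / 2.0, 1.0 - q[best])   # cannot push the entry above 1
    q[best] += add
    to_remove = add                        # keep q on the probability simplex
    for s in np.argsort(v):                # states in increasing order of v
        if s == best or to_remove <= 0:
            continue
        dec = min(q[s], to_remove)
        q[s] -= dec
        to_remove -= dec
    return float(q @ v)
```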

We can further relax the backward induction to avoid the optimization over $\mathcal{C}_{k,\delta}$ completely:

$$\begin{aligned}\max_{P \in \mathcal{C}_{k,\delta}} \big\langle P_a(s), v_{h+1}^{(k)}\big\rangle &= \big\langle \hat P_a^{(k)}(s), v_{h+1}^{(k)}\big\rangle + \max_{P \in \mathcal{C}_{k,\delta}} \big\langle P_a(s) - \hat P_a^{(k)}(s), v_{h+1}^{(k)}\big\rangle\\ &\le \big\langle \hat P_a^{(k)}(s), v_{h+1}^{(k)}\big\rangle + \max_{P \in \mathcal{C}_{k,\delta}} \big\|P_a(s) - \hat P_a^{(k)}(s)\big\|_1 \big\|v_{h+1}^{(k)}\big\|_\infty\\ &\le \big\langle \hat P_a^{(k)}(s), v_{h+1}^{(k)}\big\rangle + \beta_\delta(N_k(s,a))\, H\end{aligned}$$

This leads us to the UCBVI (upper confidence bound value iteration) algorithm. In episode $k$, UCBVI runs value iteration with the estimated transition kernel $\hat P_a^{(k)}(s)$ and the optimistic reward function $r_a(s) + H\beta_\delta(N_k(s,a))$ to compute its policy.


UCBVI (Upper confidence bound value iteration):

In episodes $k = 1, \dots, K$:

  1. Compute the optimistic value function by backwards induction:
$$\begin{aligned} v_H^{(k)}(s) &= 0 \quad \forall s \in \mathcal{S}\\ b_k(s,a) &= H\beta_\delta(N_k(s,a))\\ Q_h^{(k)}(s,a) &= \min\Big(r_a(s) + b_k(s,a) + \big\langle \hat P_a^{(k)}(s), v_{h+1}^{(k)}\big\rangle,\; H\Big)\\ v_h^{(k)}(s) &= \max_a Q_h^{(k)}(s,a)\end{aligned}$$
  2. Follow the greedy policy $A_h^{(k)} = \arg\max_{a} Q_h^{(k)}(S_h^{(k)}, a)$
  3. Observe episode data $S_0^{(k)}, A_0^{(k)}, S_1^{(k)}, \dots, S_{H-1}^{(k)}, A_{H-1}^{(k)}, S_H^{(k)}$

Note that we truncate the $Q_h^{(k)}$-function at $H$; this avoids a blow-up by a factor of $H$ in the regret bound. Carefully checking that the previous analysis still applies shows that UCBVI has regret at most $R_K \le \tilde O\big(H^2\sqrt{SAK}\big)$.
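Putting the pieces together, step 1 of UCBVI is a single backward pass. A minimal sketch, assuming an estimate `P_hat[s, a, s']`, rewards `r[s, a]`, visit counts `N[s, a]` and the confidence width from Step 1 (all names are illustrative):

```python
import numpy as np

def ucbvi_plan(P_hat, r, N, H, delta):
    """One UCBVI planning pass: optimistic Q-values, truncated at H."""
    S, A, _ = P_hat.shape

    def beta(u):
        u = max(1.0, u)  # unvisited pairs get the largest bonus
        return 2 * np.sqrt((S * np.log(2) + np.log(u * (u + 1) * S * A / delta)) / (2 * u))

    v = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    greedy = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        for s in range(S):
            for a in range(A):
                bonus = H * beta(N[s, a])
                Q[h, s, a] = min(r[s, a] + bonus + P_hat[s, a] @ v[h + 1], H)
            greedy[h, s] = int(np.argmax(Q[h, s]))
            v[h, s] = Q[h, s, greedy[h, s]]
    return Q, v, greedy
```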

By more carefully designing the reward bonuses for UCBVI, it is possible to achieve $R_K \le \tilde O\big(H^{3/2}\sqrt{SAK}\big)$, which matches the lower bound up to logarithmic factors in the time-inhomogeneous setting.

Notes

References

The original UCRL paper. Notice that they consider the infinite horizon average reward setting, which is different from the episodic setting we present.

Auer, P., & Ortner, R. (2006). Logarithmic online regret bounds for undiscounted reinforcement learning. Advances in neural information processing systems, 19. [link]

The UCBVI paper. Notice that they consider the time-homogeneous setting, which is different from the time-inhomogeneous setting we present.

Azar, M. G., Osband, I., & Munos, R. (2017, July). Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning (pp. 263-272). PMLR. [link]

The paper that presents the lower bound. Notice that they consider the infinite horizon average reward setting. Thus, their results contain a diameter term $D$ instead of a horizon term $H$.

Auer, P., Jaksch, T., & Ortner, R. (2008). Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21. [link]