
18. Sample complexity in finite MDPs

Let $Z = \mathcal{S}\times\mathcal{A}$ be the set of state-action pairs. A $Z$-design assigns a count to every member of $Z$, that is, to every state-action pair. In the last lecture we saw that

$$n = \tilde O\left( \frac{H^6 SA}{\delta_{\mathrm{trg}}^2} \right)$$

samples are sufficient to obtain a $\delta_{\mathrm{trg}}$-suboptimal policy with high probability, provided that the data is generated from a $Z$-design that assigns the same count to each state-action pair, and that to get a policy one uses the straightforward plug-in approach: estimate the rewards and transitions using empirical estimates and use the policy that is optimal with respect to the estimated model. Above, the dependence on the number of state-action pairs is optimal, but the dependence on the horizon $H = \frac{1}{1-\gamma}$ is suboptimal. In the first half of this lecture, I sketch how the analysis presented in the previous lecture can be improved to get the optimal cubic dependence, together with a sketch that shows that the cubic dependence is indeed optimal.
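
To make the plug-in approach concrete, here is a minimal sketch in Python (the MDP, the function names and the sample sizes are illustrative, not from the lecture): it draws $m$ next states from each state-action pair, as a uniform $Z$-design prescribes, forms the empirical transition kernel, and returns a policy that is optimal in the estimated model, computed by value iteration.

```python
import numpy as np

def plug_in_policy(P, r, gamma, m, rng):
    """Uniform Z-design + plug-in: sample m next states at every (s, a),
    estimate the transition kernel, and return a policy that is optimal
    in the estimated MDP (computed here by value iteration)."""
    S, A = r.shape                              # P has shape (S, A, S)
    P_hat = np.zeros_like(P)
    for s in range(S):
        for a in range(A):
            next_states = rng.choice(S, size=m, p=P[s, a])   # the Z-design data
            P_hat[s, a] = np.bincount(next_states, minlength=S) / m
    v = np.zeros(S)                             # value iteration on (P_hat, r)
    for _ in range(10_000):
        q = r + gamma * P_hat @ v               # shape (S, A)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < 1e-10:
            break
        v = v_new
    return q.argmax(axis=1)                     # deterministic policy, one action per state

# Tiny illustration on a randomly drawn MDP.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))      # true kernel, unknown to the learner
r = rng.uniform(size=(S, A))                    # rewards, assumed known here
print("plug-in policy:", plug_in_policy(P, r, gamma, m=200, rng=rng))
```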

In the second half of the lecture, we consider policy-based data collection, or experimental designs, where the goal is to find a near-optimal policy for an initial state, and where the data consists of trajectories obtained by rolling out the data-collection policy from the said initial state. Here, we will show a lower bound that shows that the sample complexity in this case is at least as large as $\Omega(A^{\min(S,H)})$, which establishes an exponential separation both between $Z$-designs and policy-based designs, and between passive and active learning. To see the latter, note that in the presence of a simulator, with only a reset to an initial state, one can use approximate policy iteration with rollouts, or Politex with rollouts, to get a policy that is near-optimal when started from the initial state that one can reset to, using polynomially many samples in $S$, $A$ and $H$ (cf. Lecture 8 and Lecture 14).

Improved analysis of the plug-in method: First attempt

The improvement in the analysis of the plug-in method comes from two sources:

  1. Using a version of the value-difference identity and avoiding the use of the policy error bound
  2. Using Bernstein’s inequality in place of Hoeffding’s inequality

In this section, we focus on the first aspect. The second aspect will be considered in the next section.

We continue to use the same notation as in the previous lecture. In particular, $M$ denotes the “true” MDP, $\hat M$ denotes the estimated MDP, and we put a hat on quantities related to this second MDP. We further let $\pi^*$ be one of the memoryless optimal policies of $M$. For simplicity, we will assume that the reward function in $\hat M$ is the same as in $M$: as we have seen, the higher-order term in our error bound came from errors in the transition probabilities; the simplifying assumption allows us to focus on reducing this term while minimizing clutter. The arguments are easy to extend to the case when $\hat r \ne r$.

Let $\hat\pi$ be a policy whose suboptimality in $M$ we want to bound. The idea is to bound the suboptimality of $\hat\pi$ by its suboptimality in $\hat M$ and also by how much the value functions of fixed policies differ when we switch from $P$ to $\hat P$. In particular, we have

$$v^* - v^{\hat\pi} = v^* - \hat v^* + \hat v^* - v^{\hat\pi}
\le v^{\pi^*} - \hat v^{\pi^*} + \underbrace{\hat v^* - \hat v^{\hat\pi}}_{\text{opt. error}} + \hat v^{\hat\pi} - v^{\hat\pi}, \qquad (1)$$

where $\hat\pi^*$ denotes an optimal policy in $\hat M$ and the inequality holds because $\hat v^* = \hat v^{\hat\pi^*} \ge \hat v^{\pi^*}$. The term marked as “opt. error” is the optimization error that arises when $\hat\pi$ is not (quite) optimal in $\hat M$. This term is controlled by the choice of $\hat\pi$. For simplicity, assume for now that $\hat\pi$ is an optimal policy in $\hat M$, so that we can drop this term. We further assume that $\hat\pi$ is a deterministic optimal policy of $\hat M$.

It remains to bound the first and last terms. Both of these terms are of the form $v^\pi - \hat v^\pi$, i.e., the difference between the value functions of the same policy $\pi$ in the two MDPs (here, $\pi$ is either $\pi^*$ or $\hat\pi$). This difference, similarly to the value difference identity, can be expressed as a function of the difference $P - \hat P$, as shown in the next result:


Lemma (value difference from transition differences): Let $M$ and $\hat M$ be two MDPs sharing the same state-action space and rewards, but differing in their transition probabilities. Let $\pi$ be a memoryless policy over the shared state-action space of the two MDPs. Then, the following identities hold:

$$v^\pi - \hat v^\pi = \gamma \underbrace{(I-\gamma P_\pi)^{-1} M_\pi (P - \hat P)\, \hat v^\pi}_{=:\,\delta(\hat v^\pi)}, \qquad (2)$$
$$\hat v^\pi - v^\pi = \gamma \underbrace{(I-\gamma \hat P_\pi)^{-1} M_\pi (\hat P - P)\, v^\pi}_{=:\,\hat\delta(v^\pi)}. \qquad (3)$$

Proof: We only need to prove (2) since (3) follows from this identity by symmetry. Concerning the proof of (2), we start with the closed form expression for value functions. From this we get

$$v^\pi - \hat v^\pi = (I - \gamma P_\pi)^{-1} r_\pi - (I - \gamma \hat P_\pi)^{-1} r_\pi.$$

Inspired by the elementary identity $\frac{1}{1-x} - \frac{1}{1-y} = \frac{x-y}{(1-x)(1-y)}$, we calculate

$$\begin{aligned}
v^\pi - \hat v^\pi &= (I-\gamma P_\pi)^{-1}\left[(I-\gamma \hat P_\pi) - (I - \gamma P_\pi)\right](I-\gamma \hat P_\pi)^{-1} r_\pi \\
&= \gamma (I-\gamma P_\pi)^{-1}\left[P_\pi - \hat P_\pi\right](I-\gamma \hat P_\pi)^{-1} r_\pi \\
&= \gamma (I-\gamma P_\pi)^{-1} M_\pi \left[P - \hat P\right] \hat v^\pi,
\end{aligned}$$

finishing the proof.
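
As a quick sanity check, identity (2) can be verified numerically on a randomly generated pair of MDPs sharing the same rewards; the sketch below (all sizes and names are illustrative) builds $P_\pi$, $M_\pi$ and both value functions and compares the two sides of (2).

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.9

P     = rng.dirichlet(np.ones(S), size=S * A)   # true kernel, rows indexed by (s, a)
P_hat = rng.dirichlet(np.ones(S), size=S * A)   # a second ("estimated") kernel
r     = rng.uniform(size=S * A)                 # shared rewards
pi    = rng.dirichlet(np.ones(A), size=S)       # a memoryless policy, pi[s, a]

# M_pi maps q-vectors (indexed by (s, a)) to v-vectors: (M_pi q)(s) = sum_a pi(a|s) q(s, a).
M_pi = np.zeros((S, S * A))
for s in range(S):
    M_pi[s, s * A:(s + 1) * A] = pi[s]

P_pi, P_hat_pi = M_pi @ P, M_pi @ P_hat         # state-to-state kernels of pi
r_pi = M_pi @ r
v     = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)        # v^pi in M
v_hat = np.linalg.solve(np.eye(S) - gamma * P_hat_pi, r_pi)    # v^pi in M-hat

# Right-hand side of (2): gamma (I - gamma P_pi)^{-1} M_pi (P - P_hat) v_hat.
rhs = gamma * np.linalg.solve(np.eye(S) - gamma * P_pi, M_pi @ ((P - P_hat) @ v_hat))
print(np.max(np.abs((v - v_hat) - rhs)))        # ~1e-15: the two sides of (2) agree
```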

Note that in (3) the empirical transition kernel $\hat P$ appears through the inverse $(I-\gamma \hat P_\pi)^{-1}$, which left-multiplies $M_\pi(\hat P - P)$, while in (2) it appears through $\hat v^\pi$, which right-multiplies the corresponding deviation term. In the remainder of this section we use (3), but in the next section we will use (2).

Combining (3) with our previous inequality, we immediately get that

$$v^* - v^{\hat\pi} \le \frac{\gamma}{1-\gamma}\left[ \|(P - \hat P) v^{\pi^*}\|_\infty + \|(P - \hat P) v^{\hat\pi}\|_\infty \right]. \qquad (4)$$

Assume that $\hat P$ is obtained by sampling $m$ next states at each state-action pair. By Hoeffding’s inequality and a union bound over the state-action pairs, for any fixed $v\in[0,H]^{\mathcal S}$ and $0\le\zeta<1$, with probability $1-\zeta$, we have

$$\|(P - \hat P) v\|_\infty \le H\sqrt{\frac{\log(SA/\zeta)}{2m}}, \qquad (5)$$

and in particular with $v = v^{\pi^*}$, we have

$$\|(P - \hat P) v^{\pi^*}\|_\infty = \tilde O\big(H/\sqrt{m}\big).$$

Controlling the second term in (4) requires more care, as $\hat\pi$ is random and depends on the same data that is used to generate $\hat P$. To deal with this term, we use another union bound. Let $\tilde{\mathcal V} = \{ v^\pi \,:\, \pi: \mathcal S \to \mathcal A \}$ be the set of all possible value functions that we can obtain by considering deterministic policies. Since by construction $\hat\pi$ is also a deterministic policy, $v^{\hat\pi}\in\tilde{\mathcal V}$. Hence,

$$\|(P - \hat P) v^{\hat\pi}\|_\infty \le \sup_{v\in\tilde{\mathcal V}} \|(P - \hat P) v\|_\infty,$$

and thus by a union bound over the $|\tilde{\mathcal V}|\le A^S$ functions $v$ in $\tilde{\mathcal V}$, we get that with probability $1-\zeta$,

$$\|(P - \hat P) v^{\hat\pi}\|_\infty \le H\sqrt{\frac{\log(SA|\tilde{\mathcal V}|/\zeta)}{2m}} \le H\sqrt{\frac{\log(SA/\zeta)+S\log(A)}{2m}} = \tilde O\big(H\sqrt{S/m}\big).$$

Putting things together, we see that

$$v^* - v^{\hat\pi} = \tilde O\big(H^2\sqrt{S/m}\big),$$

which reduces the dependence on $H$ of the sample size bound from $H^6$ to $H^4$. As we shall see soon, this is not the best possible dependence on $H$. This method also falls short of giving the best possible dependence on the number of states. In particular, inverting the above bound, we see that with this method we can only guarantee a $\delta$-optimal policy if the total number of samples, $n = SAm$, is at least

$$\tilde O\big(S^2 A H^4/\delta^2\big),$$

while below we will see that the optimal bound is $\tilde O(SAH^3/\delta^2)$.

Improved analysis of the plug-in method: Second attempt

There are two further ideas that help one achieve the sample complexity that will be seen to be optimal. One is to use what is known as Bernstein’s inequality in place of Hoeffding’s inequality, together with a clever observation on the “total variance”; the second is to improve the covering argument. The first idea helps with improving the horizon dependence, the second with improving the dependence on the number of states. In this lecture, we will only cover the first idea and sketch the second.

Bernstein’s inequality is a classic result in probability theory:


Theorem (Bernstein’s inequality): Let $b>0$ and let $X_1,\dots,X_m\in[0,b]$ be an i.i.d. sequence and define $\bar X_m$ as the sample mean of this sequence: $\bar X_m = \frac{1}{m}(X_1+\dots+X_m)$. Then, for any $\zeta\in(0,1)$, with probability at least $1-\zeta$,

$$|\bar X_m - \mathbb E[X_1]| \le \sigma\sqrt{\frac{2\log(2/\zeta)}{m}} + \frac{2b\log(2/\zeta)}{3m},$$

where $\sigma^2 = \mathrm{Var}(X_1)$.


To set expectations, it will be useful to compare this bound to Hoeffding’s inequality. In particular, in the setting of the theorem, Hoeffding’s inequality also applies and gives

$$|\bar X_m - \mathbb E[X_1]| \le b\sqrt{\frac{\log(2/\zeta)}{2m}}.$$

Since in our case $b=H$ (the value functions take values in the $[0,H]$ interval), using $m = H^3/\delta^2$ (which would give rise to the optimal sample size), Hoeffding’s inequality gives a bound of size $H\cdot H^{-3/2}\delta = H^{-1/2}\delta$ (cf. (5)). This is a problem: ideally, we would like to see $H^{-1}\delta$ here, because inequality (4) introduces an additional $H$ factor.

We immediately see that for Bernstein’s inequality to make a difference, focusing just on the first term in Bernstein’s inequality, we need $\sigma = O(H^{1/2})$. In fact, since $b/m = H^{-2}\delta^2 = o(H^{-1}\delta)$, we see that this is also sufficient to take off the extra $H$ factor from the sample complexity bound. It thus remains to be seen whether the variance could indeed be this small.
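
Before answering this, it is worth seeing numerically how much would be gained if the variance were indeed of order $H$: the following back-of-the-envelope computation (a sketch; the constants, the failure probability and the assumption $\sigma^2 = H$ are purely illustrative) evaluates the two deviation bounds at $m = H^3/\delta^2$ against the target deviation $\delta/H$.

```python
import numpy as np

def hoeffding(b, m, zeta):
    # deviation of the sample mean of m i.i.d. variables in [0, b]
    return b * np.sqrt(np.log(2 / zeta) / (2 * m))

def bernstein(sigma2, b, m, zeta):
    # sigma * sqrt(2 log(2/zeta) / m) + (2/3) b log(2/zeta) / m
    return np.sqrt(2 * sigma2 * np.log(2 / zeta) / m) + 2 * b * np.log(2 / zeta) / (3 * m)

zeta, delta = 0.01, 0.1
for H in [10, 100, 1000]:
    m = int(H**3 / delta**2)        # the per-pair sample size we are aiming for
    target = delta / H              # deviation we can afford before the 1/(1-gamma) blow-up
    print(f"H={H:5d}  Hoeffding/target: {hoeffding(H, m, zeta) / target:8.1f}   "
          f"Bernstein(sigma^2=H)/target: {bernstein(H, H, m, zeta) / target:5.2f}")
# Hoeffding overshoots the affordable deviation by a factor that grows like sqrt(H),
# while Bernstein with a variance of order H stays within a constant factor of it.
```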

To find this out, fix a state-action pair $(s,a)$ and let $S_1,\dots,S_m\sim P_a(s)$ be an i.i.d. sequence of next states at $(s,a)$. Then, $((\hat P - P)v^\pi)(s,a) = (\hat P_a(s) - P_a(s))v^\pi$ has the same distribution as

$$\Delta(s,a) = \frac{1}{m}\sum_{i=1}^m v^\pi(S_i) - P_a(s)v^\pi.$$

Defining $X_i = v^\pi(S_i)$ and $\sigma_\pi^2(s,a) = \mathrm{Var}(X_1)$, we see that it is $\sigma_\pi(s,a)$ that appears when Bernstein’s inequality is used to bound $((\hat P - P)v^\pi)(s,a)$. It remains to be seen how large the values $\sigma_\pi(s,a)$ can be. Sadly, one quickly discovers that the range of $\sigma_\pi(s,a)$ is sometimes also as large as $H$. Is Bernstein’s inequality a dead end then?

Of course, it is not, otherwise we would not have introduced it. In particular, a better bound is possible by directly bounding the maximum-norm of

$$\delta(v^\pi) = (I-\gamma P_\pi)^{-1} M_\pi (P - \hat P)\, v^\pi,$$

which is close to the actual term that we need to bound. Indeed, by (2) from the value difference lemma, $v^\pi - \hat v^\pi = \gamma\delta(\hat v^\pi)$ and thus

$$v^\pi - \hat v^\pi = \gamma\delta(v^\pi) + \gamma\big(\delta(\hat v^\pi) - \delta(v^\pi)\big).$$

The second term on the right-hand side is of order $1/m$ (since $(P - \hat P)(\hat v^\pi - v^\pi)$ appears there and both $P - \hat P$ and $\hat v^\pi - v^\pi$ have been seen to be of order $1/\sqrt m$). As we expect $\delta(v^\pi)$ to be of order $1/\sqrt m$, we will focus on this term.

For simplicity, take now the case when $\pi$ is a fixed, nonrandom policy (we need to bound $\delta(v^\pi)$ for $\pi=\pi^*$ and also for $\pi=\hat\pi$, the second of which is random). In this case, by a union bound and Bernstein’s inequality, with probability $1-\zeta$,

$$|(P - \hat P)v^\pi| \le \sqrt{\frac{2\log(2SA/\zeta)}{m}}\,\sigma_\pi + \frac{2H}{3}\,\frac{\log(2SA/\zeta)}{m}\,\mathbf{1}.$$

Multiplying both sides by $(I-\gamma P_\pi)^{-1}M_\pi$, using the triangle inequality and the special properties of $(I-\gamma P_\pi)^{-1}M_\pi$ (its entries are nonnegative and its rows sum to $\frac{1}{1-\gamma}=H$), we get

$$|\delta(v^\pi)| \le (I-\gamma P_\pi)^{-1}M_\pi\,\big|(P - \hat P)v^\pi\big| \le \sqrt{\frac{2\log(2SA/\zeta)}{m}}\,(I-\gamma P_\pi)^{-1}M_\pi\,\sigma_\pi + \frac{2H^2}{3}\,\frac{\log(2SA/\zeta)}{m}\,\mathbf{1}. \qquad (6)$$

The following beautiful result, whose proof is omitted, gives an $O(H^{3/2})$ bound on the first term appearing on the right-hand side of the above display:


Lemma (total discounted variance bound): For any discounted MDP $M$ and policy $\pi$ in $M$, $\left\|(I-\gamma P_\pi)^{-1}M_\pi\,\sigma_\pi\right\|_\infty \le \sqrt{\frac{2}{(1-\gamma)^3}}$.
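
The lemma is easy to check numerically; the sketch below (a random MDP with illustrative sizes and rewards in $[0,1]$) computes the left-hand side for a random memoryless policy and compares it with $\sqrt{2/(1-\gamma)^3}$.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 6, 3, 0.95

P  = rng.dirichlet(np.ones(S), size=S * A)       # rows indexed by (s, a)
r  = rng.uniform(size=S * A)                     # rewards in [0, 1], so v^pi in [0, H]
pi = rng.dirichlet(np.ones(A), size=S)           # a random memoryless policy

M_pi = np.zeros((S, S * A))
for s in range(S):
    M_pi[s, s * A:(s + 1) * A] = pi[s]

P_pi = M_pi @ P
v = np.linalg.solve(np.eye(S) - gamma * P_pi, M_pi @ r)          # v^pi

# sigma_pi(s, a): standard deviation of v^pi(S') for S' drawn from P_a(s).
sigma_pi = np.sqrt(np.maximum(P @ v**2 - (P @ v)**2, 0.0))

lhs = np.max(np.linalg.solve(np.eye(S) - gamma * P_pi, M_pi @ sigma_pi))
print(lhs, "<=", np.sqrt(2 / (1 - gamma) ** 3))  # total discounted variance bound
```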


Since the bound that we get from here is $H^{3/2}$ and not $H^2$, “we are saved”. Indeed, plugging this into (6) gives

$$\|\delta(v^\pi)\|_\infty \le 2\sqrt{\frac{H^3\log(2SA/\zeta)}{m}} + \frac{2H^2}{3}\,\frac{\log(2SA/\zeta)}{m},$$

which holds with probability $1-\zeta$. Choosing $m = H^3/\delta^2$, we see that both terms are $O(\delta)$. It remains to show that a similar result holds for $\pi = \hat\pi$. If we use the union bound that we used before, we introduce an extra $S$ factor. Avoiding this extra $S$ factor requires new ideas, but with these we get the following result:


Theorem (upper bound for Z-designs): Let $\hat\pi$ be an optimal policy in the MDP whose transition kernel is $\hat P$, a kernel estimated based on a sample of $m$ next states from each state-action pair. Letting $0\le\zeta<1$ and $0<\delta\le\sqrt{H}$, if

$$m \ge \frac{c\,\gamma\, H^3\log(SAH/\delta)}{\delta^2},$$

then with probability $1-\zeta$, $\hat\pi$ is $\delta$-optimal, where $c$ is a universal constant. In short, for any $0<\delta\le\sqrt{H}$ there exists an algorithm that produces a $\delta$-optimal policy from a total number of

$$\tilde O\left(\frac{\gamma\, SAH^3}{\delta^2}\right)$$

samples under a uniform Z-design.


It remains to be seen whether the same sample complexity holds for larger values of $\delta$, e.g., for $\delta = H/2$.

Lower bound for Z-designs

A natural question is whether we can improve on the $H^3SA/\delta^2$ upper bound, or whether this can be matched by a lower bound. For this, we have the following result:


Theorem (lower bound for Z-designs): Any algorithm that uses Z-designs and is guaranteed to produce a $\delta$-optimal policy needs at least $\Omega(H^3SA/\delta^2)$ samples.


Proof (sketch): As we have seen in the proof of the upper bound, the key to achieving the cubic dependence was that the sample mean $\bar X_m$ of $m$ i.i.d. bounded random variables is within a distance of $\sigma\sqrt{1/m}$ of the true mean. In a way, the converse of this is also true: it is “quite likely” that the distance between the sample and the true mean is this large. This is not too hard to see for specific distributions, such as when the $X_i$ are normally distributed, or when the $X_i$ are Bernoulli distributed (in a way, this is the essence of the central limit theorem, though the central limit theorem only describes the limit of $m\to\infty$).

So how can we use this to establish the lower bound? In an MDP the randomness comes either from the rewards or from the transitions. But in the upper bound above, the rewards were given, so the only source of randomness is the transitions. Also, the cubic dependence must hold even if the number of states is a constant. What all this implies is that somehow learning the transition structure with a few states is what makes the sample complexity large as $\gamma\to 1$ (or $H\to\infty$). Clearly, this can only happen if the (small) MDP has self-loops. The smallest example of an MDP with a self-loop is when there is a state and an action such that taking that action from that state leads back to the same state with some positive probability, while with the complementary probability the next state is some other state. This leads to the two-state structure described next.

There are two states. The transition at the first state, call it state 1, is stochastic: it leads back to state 1 with probability $p$, while it leads to state 2 with probability $1-p$. The reward associated with both transitions is 1. The second state, call it state 2, has a self-loop; the reward associated with this transition is zero.

There are no actions to choose (alternatively, there is only one action at both states). However, if we can show that, in the absence of knowledge of $p$, estimating the value of state 1 up to a precision of $\delta$ takes $\Omega(H^3/\delta^2)$ samples, the sample complexity result will follow. In particular, if we repeat the transition structure $A$ times (sharing the same two states), one can make the value of $p$ for one of these actions ever so slightly different from the others so that its value differs by (say) $2\delta$ from the others. Then, by construction, a learner who uses fewer than order $AH^3/\delta^2$ total samples at state 1 will not be able to reliably tell the difference between the value of the special action and that of the other actions, hence will not be able to choose the right action, and will thus be unable to produce a $\delta$-optimal policy. To also add the state dependence, the structure can then be repeated $S$ times.

So it remains to be seen whether the said sample complexity result holds for estimating the value of state 1. Rather than giving a formal proof, we give a quick heuristic argument, hoping that readers will find this more intuitive.

The starting point for this heuristic argument is the general observation that sample complexity questions concerning estimation problems are essentially questions about the sensitivity of the quantity to be estimated to the unknown parameters. Here, sensitivity means how much the quantity changes if we change the underlying parameter. This sensitivity for small deviations and a single parameter is exactly the derivative of the quantity of interest with respect to the parameter.

In our special case, the value of state 1, call it $v_p(1)$ (also showing the dependence on $p$), is the quantity to be estimated. Since the value of state 2 is zero, $v_p(1)$ must satisfy $v_p(1) = p(1 + \gamma v_p(1)) + (1-p)\cdot 1$. Solving this we get

$$v_p(1) = \frac{1}{1 - p\gamma}.$$

The derivative of this with respect to $p$ is

$$\frac{d}{dp}v_p(1) = \frac{\gamma}{(1-\gamma p)^2}.$$

To get a $\delta$-accurate estimate of $v_{p_0}(1)$, we need

$$\delta \ge |v_{p_0}(1) - v_{\bar X_m}(1)| \approx \left.\frac{d}{dp}v_p(1)\right|_{p=p_0}\,|p_0 - \bar X_m| = \frac{\gamma}{(1-\gamma p_0)^2}\,|p_0 - \bar X_m| \approx \frac{\gamma}{(1-\gamma p_0)^2}\sqrt{\frac{p_0(1-p_0)}{m}},$$

where $\bar X_m$ denotes the empirical frequency of the self-loop at state 1.

Inverting for $m$, we get that

$$m \gtrsim \frac{\gamma^2\, p_0(1-p_0)}{(1-\gamma p_0)^4\,\delta^2}.$$

It remains to choose $p_0$ as a function of $\gamma$ to show that the above can be lower bounded by (a constant times) $1/(1-\gamma)^3$. If we choose $p_0 = \gamma$, we have $1-\gamma p_0 = 1-\gamma^2 = (1-\gamma)(1+\gamma) \le 2(1-\gamma)$ and hence

$$\frac{\gamma^2 p_0(1-p_0)}{(1-\gamma p_0)^4\delta^2} \ge \frac{\gamma^2\cdot\gamma(1-\gamma)}{2^4(1-\gamma)^4\delta^2} = \frac{\gamma^3}{2^4(1-\gamma)^3\delta^2}.$$

Putting things together finishes the proof sketch.
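
The heuristic can be probed by a small simulation (a sketch; the constants in front of $H^3/\delta^2$ and the number of repetitions are arbitrary choices): estimate $p$ from $m$ Bernoulli observations of the self-loop, plug the estimate into $v_p(1) = 1/(1-p\gamma)$, and record how often the error exceeds $\delta$.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma = 0.99
H = 1 / (1 - gamma)
delta = 0.1
p0 = gamma                                    # the hard choice of p from the sketch

def value(p):
    return 1.0 / (1.0 - p * gamma)            # v_p(1) = 1 / (1 - p*gamma)

for c in [0.01, 0.1, 1.0, 10.0]:
    m = int(c * H**3 / delta**2)              # observations of the transition at state 1
    errs = []
    for _ in range(200):
        p_hat = rng.binomial(m, p0) / m       # empirical self-loop frequency
        errs.append(abs(value(p_hat) - value(p0)))
    frac = np.mean(np.array(errs) > delta)
    print(f"m = {c:5.2f} * H^3/delta^2   P(|error| > delta) ~ {frac:.2f}")
# With m well below H^3/delta^2 the plug-in estimate of v_p(1) misses by more than
# delta most of the time; the failure rate only vanishes once m reaches that order.
```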

A homework problem is included that explains how to fill in the gaps in this last part of the proof, while pointers to the literature are given that one can use to figure out how to fill the remaining gaps.

Policy-based designs

When the data is generated by following some policy, we talk about policy-based designs. Here, the design decision is which policy to use to generate the data. The sample complexity of learning with policy-based designs is the number of observations necessary and sufficient for some algorithm to find a policy of a fixed target suboptimality from a fixed initial state, based on data generated by following the data-collection policy in an MDP that can be any member of the class considered.

Three questions arise then. (i) The first question (the design question) is what policy to follow during data collection. If the policy can use the full history, the problem is not much different from online learning, which we will consider later. From this perspective, the interesting (and perhaps more realistic) case is when the data-collection policy is memoryless and is fixed before the data collection begins. Hence, in what follows, we will restrict our attention to this case. (ii) The second question is what algorithm to use to compute a policy given the data generated. (iii) The final, third question is how large the sample complexity of learning with policy-induced data is for some fixed MDP class.

Learning a good policy from policy-induced data is much closer to reality than the same problem with Z-designs. Practical problems, such as problems in health care, robotics, etc., are such that we can obtain data generated by following some fixed policy, while it is usually not possible to demand sample transitions from arbitrary state-action pairs.

For simplicity, let us still consider the case of finite state-action MDPs, but to further simplify matters, let us now consider (homogeneous kernel) finite-horizon problems with horizon $H$. As it turns out, the plug-in algorithm of the previous section is still a good algorithm in the sense that it achieves the optimal (minimax) sample complexity. However, the minimax sample complexity is much higher than it is for Z-designs:


Theorem (sample complexity lower bound with policy-induced data): For any $S$, $A$, $H$ and $\delta>0$, any (memoryless) data-collection policy $\pi$ over the state-action spaces $[S]$ and $[A]$, any $n \le c A^{\min(S-1,H)}/\delta^2$ and any algorithm $L$ that maps data consisting of $n$ transitions to a policy, there exists an MDP $M$ with state space $[S]$ and action space $[A]$ such that, with constant probability, the policy $\hat\pi$ produced by $L$ is not $\delta$-optimal with respect to the $H$-horizon total reward criterion when the algorithm is fed with data obtained by following $\pi$ in $M$.


Proof (sketch): Without loss of generality assume that $S = H+1$; if there are more states, just ignore them, while if there are fewer states, decrease $H$. Consider an MDP where states $1,\dots,H$ are organized in a chain under the effect of some actions, and state $H+1$ is an absorbing state with zero associated reward. For $1\le i\le H$, let action $a_i$ be the action that is chosen with the smallest probability in state $i$ under the data-generating policy $\pi$: $a_i = \arg\min_{a\in[A]}\pi(a|i)$. We choose action $a_i$ as the action that moves the state from $i$ to $i+1$, deterministically. Any other action leads to state $H+1$, deterministically. All rewards are zero, except when transitioning from state $H$ to state $H+1$ under action $a_H$: there the reward is stochastic, normally distributed with mean $\mu$ equal to either $-2\delta$ or $+2\delta$ and with variance one.

Now, because of the choice of $a_i$, $\pi(a_i|i)\le 1/A$. Hence, the probability that, starting from state 1, following policy $\pi$ for $H$ steps generates the sequence of states $1,2,\dots,H,H+1$, including the critical transition from state $H$ to state $H+1$, is at most $(1/A)^H$. This transition is critical in the sense that only data from this transition decides whether in state 1 it is worth taking action $a_1$ or not. In particular, if $\mu=-2\delta$, taking $a_1$ is a poor choice, while if $\mu=+2\delta$, taking $a_1$ is the optimal choice. The expected number of times this critical transition is seen is at most $m = n(1/A)^H$. With $m$ observations, the value of $\mu$ can only be estimated up to an accuracy of order $1/\sqrt m$. When $2\delta$ is smaller than this, with constant probability the sign of $\mu$ cannot be decided, and thus with constant probability any algorithm will fail to identify whether $a_1$ should be taken in state 1 or not (with probability, say, at least $1/2$). Plugging in the expected value of $m$, we get that the condition on $n$ is that $c\sqrt{A^H/n}\ge 2\delta$, where $c>0$ is some universal constant. Equivalently, the condition is that $n\le c^2A^H/(4\delta^2)$, which is the statement to be proven.
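
The driver of this lower bound is the probability that a single $H$-step episode of the logging policy traverses the whole chain, which is at most $(1/A)^H$; the following small simulation (illustrative sizes; a uniform logging policy, with the special action fixed to action 0) confirms the collapse.

```python
import numpy as np

rng = np.random.default_rng(4)

def traverses_chain(A, H, rng):
    """One H-step episode of the uniform policy on the chain MDP of the proof:
    at state i only the special action a_i (here: action 0) moves to i+1,
    every other action drops into the absorbing state."""
    for _ in range(H):
        if rng.integers(A) != 0:
            return False
    return True

n_episodes = 200_000
for A, H in [(2, 5), (2, 10), (3, 8)]:
    hits = sum(traverses_chain(A, H, rng) for _ in range(n_episodes))
    print(f"A={A} H={H:2d}  empirical traversal rate {hits / n_episodes:.2e}  "
          f"vs (1/A)^H = {(1 / A) ** H:.2e}")
# Any accuracy requirement multiplies on top of this A^H blow-up, which is what
# drives the Omega(A^min(S-1,H)/delta^2) lower bound.
```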

The lower bound construction suggests that, in the absence of extra information about the MDPs, the best policy to use for data collection is the uniform policy. Note that a similar statement holds for the discounted setting. This lower bound stands in stark contrast with the polynomial upper bound of the previous section: data obtained by following policies can be very poor. One may wonder whether the situation can be improved by assuming that the data is obtained from a good policy (say, a $2\delta$-optimal policy), but the proof of the previous result in fact shows that this is not the case.

While the exponential lower bound on the sample complexity of learning from policy-induced data is already bad enough, one may worry that the situation could be even worse. Could it happen that even the best algorithm needs a double exponential number of samples? Or even infinitely many? A moment of thought shows that the latter is the case if we switch to the average reward setting: this is because in the average reward setting the value of an action can depend on the value of a state whose hitting probability within any fixed number of transitions is positive but arbitrarily low. Can something similar happen in the finite-horizon setting, or the discounted setting? As it turns out, the answer is no. The previous lower bound gives the correct order of the sample complexity of finding a near-optimal policy using policy-induced data:


Theorem (sample complexity upper bound with policy-induced data): With $m = \Omega\big(S^3H^4A^{\min(H,S-1)+2}/\delta^2\big)$ episodes of length $H$ collected with the uniform policy from a fixed initial distribution $\mu$, with constant probability, the plug-in algorithm produces a policy that is $\delta$-optimal when started from $\mu$.


Proof (sketch): For simplicity assume that the reward function is known. Let $\pi_{\log}$ be the logging policy, which is uniform. Again, assume that the plug-in algorithm produces a deterministic policy.

The proof is based on the decomposition of the suboptimality gap of the policy $\hat\pi$ produced, which was used before. In particular, by (1),

$$v^*(\mu) - v^{\hat\pi}(\mu) \le v^{\pi^*}(\mu) - \hat v^{\pi^*}(\mu) + \hat v^{\hat\pi}(\mu) - v^{\hat\pi}(\mu), \qquad (7)$$

where, as before, $\hat v^\pi$ denotes the value function of policy $\pi$ in the empirically estimated MDP. Further, we also use $v(\mu)$ as a shorthand for $\sum_s \mu(s)v(s)$ $(=\langle\mu,v\rangle)$, where $v:[S]\to\mathbb R$.

One then needs a counterpart of the value difference lemma. In this case, the following version is convenient: For any policy $\pi$,

$$q_H^\pi - \hat q_H^\pi = \sum_{h=0}^{H-1} (P_\pi)^h (P - \hat P)\, \hat v^\pi_{H-h-1},$$

and

$$\hat q_H^\pi - q_H^\pi = \sum_{h=0}^{H-1} (\hat P_\pi)^h (\hat P - P)\, v^\pi_{H-h-1},$$

where $P_\pi$ and $\hat P_\pi$ are $SA\times SA$ matrices. These can be proved by using the Bellman equation for the action-value functions and a simple recursion, noting that $q_0^\pi = r = \hat q_0^\pi$.
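
These finite-horizon identities can also be checked numerically; the sketch below (illustrative sizes, with the convention $q_0 = r$ and $q_{h+1} = r + P M_\pi q_h$) verifies the first of them on a random pair of kernels.

```python
import numpy as np

rng = np.random.default_rng(6)
S, A, H = 4, 3, 5
P     = rng.dirichlet(np.ones(S), size=S * A)    # rows indexed by (s, a)
P_hat = rng.dirichlet(np.ones(S), size=S * A)
r     = rng.uniform(size=S * A)
pi    = rng.dirichlet(np.ones(A), size=S)

M_pi = np.zeros((S, S * A))
for s in range(S):
    M_pi[s, s * A:(s + 1) * A] = pi[s]

def q_functions(kernel):
    """q_0, ..., q_H with q_0 = r and q_{h+1} = r + kernel M_pi q_h."""
    qs = [r]
    for _ in range(H):
        qs.append(r + kernel @ (M_pi @ qs[-1]))
    return qs

q, q_hat = q_functions(P), q_functions(P_hat)
P_pi = P @ M_pi                                   # the SA x SA matrix P_pi
rhs = sum(np.linalg.matrix_power(P_pi, h) @ ((P - P_hat) @ (M_pi @ q_hat[H - h - 1]))
          for h in range(H))
print(np.max(np.abs((q[H] - q_hat[H]) - rhs)))    # ~1e-15: the identity holds
```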

Next, we can observe that $v^\pi(\mu) = \langle \mu_\pi, q^\pi\rangle$, where $\mu_\pi$ is the distribution over $[S]\times[A]$ that assigns probability $\mu(s)\pi(a|s)$ to $(s,a)\in[S]\times[A]$. This, combined with the value difference identity, makes $\nu_h^\pi := \mu_\pi (P_\pi)^h$ appear in the bounds. This is the probability distribution over the state-action space after using $\pi$ for $h$ steps when the initial distribution is $\mu_\pi$. Now, this is multiplied by $P - \hat P$, and for a given state-action pair $(s,a)$ we have $\|P(s,a) - \hat P(s,a)\|_1 \lesssim 1/\sqrt{N(s,a)} \le 1/\sqrt{N_h(s,a)} \approx 1/\sqrt{m\,\nu_h^{\pi_{\log}}(s,a)}$ (up to $S$-dependent and logarithmic factors). Using that $\nu_h^\pi(s,a)\le\sqrt{\nu_h^\pi(s,a)}$, which holds because $0\le\nu_h^\pi\le 1$, we see that it suffices if the ratios $\rho_h^\pi(s,a) := \nu_h^\pi(s,a)/\nu_h^{\pi_{\log}}(s,a)$ (or their square roots) are controlled. Above, $N(s,a)$ is the number of times $(s,a)$ is seen in the data, and $N_h(s,a)$ is the number of times $(s,a)$ is seen in the data at the $h$th transition. Here we should also mention that we only need to control these terms for state-action pairs $(s,a)$ that satisfy $\nu_h^\pi(s,a)\ge 1/m$, as the total contribution of the other state-action pairs is $O(1/m)$, i.e., small. For these state-action pairs, $\nu_h^{\pi_{\log}}(s,a)$ is also positive and, with high probability, the counts are also positive.

Next, one can show that

$$\rho_h^\pi(s,a) \le A^{\min(h+1,S)}.$$

This is done in two steps. First, show that $\nu_h^\pi(s,a)\le A^{h+1}\nu_h^{\pi_{\log}}(s,a)$. This follows from the law of total probability: write $\nu_h^\pi(s,a)$ as the sum of probabilities of all trajectories that end with $(s,a)$ after $h$ transitions. Then, for a given trajectory, replace each occurrence of $\pi$ with $\pi_{\log}$ at the expense of introducing a factor of $A^{h+1}$ (this comes from $\pi(a|s)\le 1 = A\,\pi_{\log}(a|s)$). The next step is to show that $\nu_h^\pi(s,a)\le A^S\nu_h^{\pi_{\log}}(s,a)$ also holds. This inequality follows by observing that the uniform policy and the uniform mixture of all deterministic (memoryless) policies induce the same distribution over trajectories. Then, letting $\mathrm{DET}$ denote the set of all deterministic policies and using that $\pi\in\mathrm{DET}$, we have $\nu_h^\pi(s,a)\le \sum_{\pi'\in\mathrm{DET}}\nu_h^{\pi'}(s,a) = A^S\,\nu_h^{\pi_{\log}}(s,a)$, where we used that $|\mathrm{DET}| = A^S$.
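
The bound $\rho_h^\pi(s,a)\le A^{\min(h+1,S)}$ can also be checked numerically; the sketch below (illustrative sizes) computes the occupancy measures of a random deterministic policy and of the uniform logging policy on a random MDP and compares their ratios with the bound.

```python
import numpy as np

rng = np.random.default_rng(5)
S, A, H = 4, 3, 6

P  = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a, s']
mu = rng.dirichlet(np.ones(S))                   # initial state distribution

def occupancies(policy):
    """nu_h(s, a) for h = 0, ..., H-1 when following `policy` (shape (S, A)) from mu."""
    nus, nu = [], mu[:, None] * policy           # nu_0(s, a) = mu(s) pi(a|s)
    for _ in range(H):
        nus.append(nu)
        next_state = np.einsum('sa,sat->t', nu, P)
        nu = next_state[:, None] * policy
    return nus

pi_log = np.full((S, A), 1.0 / A)                # uniform logging policy
pi_det = np.zeros((S, A))                        # a random deterministic target policy
pi_det[np.arange(S), rng.integers(A, size=S)] = 1.0

for h, (nu_pi, nu_log) in enumerate(zip(occupancies(pi_det), occupancies(pi_log))):
    mask = nu_log > 0
    ratio = np.max(nu_pi[mask] / nu_log[mask])
    print(f"h={h}  max ratio {ratio:8.2f}  <=  A^min(h+1,S) = {A ** min(h + 1, S)}")
```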

Putting things together, applying a union bound when it comes to arguing for $\hat\pi$, and collecting terms gives the result.

Bibliographic remarks

Finding a good policy from a sample drawn from a Z-design and finding a good policy from a sample given a generative model, or random access simulator of the MDP (which we extensively studied in previous lectures on planning) are almost the same. The random access model however allows the learner to determine which state-action pair the next transition data should be generated at in reaction to the sample collected in a sequential fashion. Thus, computing a good policy with a random access simulator gives more power to the “learner” (or planner). The lower bound presented for Z-design can in fact be shown to hold for the generative setting, as well (the proof in the paper cited below goes through in this case with no changes). This shows that in the tabular case, adaptive random access to the simulator provides no benefits to the planner over non-adaptive random access.

The $\tilde O(H^3SA/\delta^2)$ sample complexity bound for finding a $\delta$-optimal policy with a uniform Z-design using the plug-in method is from the following paper:

  • Agarwal, Alekh, Sham Kakade, and Lin F. Yang. 2020. “Model-Based Reinforcement Learning with a Generative Model Is Minimax Optimal.” COLT, 67–83. arXiv link

This paper also contains a number of pointers to the literature. Interestingly, earlier works often used more complicated approaches, which directly worked with value functions rather than the more natural plug-in approach. The problem of whether the plug-in method is minimax optimal with Z-designs for finite-horizon problems is open.

The result included in this lecture limits the range of $\delta$ to $\sqrt H$. Equivalently, the result is not applicable for a small number of observations $m$ per state-action pair. This limitation has been removed in a follow-up to this work:

  • Li, Gen, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. 2020. “Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model.” NeurIPS

This paper still uses the plug-in method, but adds random noise to the observed rewards to help with tie-breaking.

The variance bound, which is the key to achieving the cubic dependence on the horizon, is from the following paper:

  • Azar, Mohammad Gheshlaghi, Rémi Munos, and Hilbert J. Kappen. 2013. “Minimax PAC Bounds on the Sample Complexity of Reinforcement Learning with a Generative Model.” Machine Learning 91 (3): 325–349.

This paper also has the essential ideas for the matching lower bound. The 2020 paper of Agarwal, Kakade and Yang is notable for some novel proof techniques, which were developed to bound the error terms whose control is not included in this lecture.

The results for learning with policy-induced data are from

  • Xiao, Chenjun, Ilbin Lee, Bo Dai, Dale Schuurmans, and Csaba Szepesvari. 2021. “On the Sample Complexity of Batch Reinforcement Learning with Policy-Induced Data.” arXiv

which also has the details that were omitted in these notes. This paper also gives a modern proof for the Z-design sample complexity lower bound.

One may ask whether the results for Z-design that show cubic dependence on the horizon H extend to the case of large MDPs when value function approximation is used. In a special case, this has been positively resolved in the following paper:

  • Yang, Lin F., and Mengdi Wang. 2019. “Sample-Optimal Parametric Q-Learning Using Linearly Additive Features.” ICML arXiv version

which uses an approach similar to Politex in a more restricted setting, but achieves an optimal dependence on H.