Let be the set of state-action pairs. A -design assigns a count to every member of , that is, to every state-action pair. In the last lecture we saw that
samples are sufficient to obtain a -suboptimal policy with high probability, provided that the data is generated from a -design that assigns the same count to each state-action pair and the policy is obtained with the straightforward plug-in approach: estimate the rewards and transitions by their empirical counterparts and use a policy that is optimal with respect to the estimated model. Above, the dependence on the number of state-action pairs is optimal, but the dependence on the horizon is suboptimal. In the first half of this lecture, I sketch how the analysis presented in the previous lecture can be improved to get the optimal cubic dependence on the horizon, together with an argument showing that the cubic dependence is indeed optimal.
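To make the plug-in approach concrete, here is a minimal sketch in Python; the function name, the sampling oracle `sample_next_state`, and the uniform per-pair count `m` are illustrative assumptions rather than notation from the notes. It forms empirical transition estimates from `m` next-state samples per state-action pair and then acts greedily with respect to the optimal value function of the estimated model.

```python
import numpy as np

def plug_in_policy(sample_next_state, r, S, A, m, gamma, iters=2000):
    """Plug-in method sketch: estimate the transition kernel from m next-state
    samples per (state, action) pair, then return a policy that is greedy with
    respect to the optimal value function of the *estimated* MDP.

    `sample_next_state(s, a)` is a hypothetical sampling oracle returning a
    next-state index; `r` is the (known) S x A reward matrix."""
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(m):
                P_hat[s, a, sample_next_state(s, a)] += 1.0 / m
    v = np.zeros(S)
    for _ in range(iters):                       # value iteration in the estimated MDP
        v = np.max(r + gamma * P_hat @ v, axis=1)
    return np.argmax(r + gamma * P_hat @ v, axis=1)
```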
In the second half of the lecture, we consider policy-based data collection, or experimental designs, where the goal is to find a near-optimal policy from an initial state and the data consists of trajectories obtained by rolling out the data-collection policy from the said initial state. Here, we will prove a lower bound showing that the sample complexity in this case is at least exponentially large, which establishes an exponential separation both between -designs and policy-based designs, and between passive and active learning. To see the latter, note that in the presence of a simulator, even with only a reset to an initial state, one can use approximate policy iteration with rollouts, or Politex with rollouts, to get a policy that is near-optimal when started from the initial state that one can reset to, using only polynomially many samples in and (cf. Lecture 8 and Lecture 14).
Improved analysis of the plug-in method: First attempt
The improvement in the analysis of the plug-in method comes from two sources:
Using a version of the value-difference identity and avoiding the use of the policy error bound
Using Bernstein’s inequality in place of Hoeffding’s inequality
In this section, we focus on the first aspect. The second aspect will be considered in the next section.
We continue to use the same notation as in the previous lecture. In particular, denotes the “true” MDP, denotes the estimated MDP and we put a “hat” on quantities related to this second MDP. We further let be one of the memoryless optimal policies of . For simplicity, we will assume that the reward function in is the same as in : As we have seen, the higher-order term in our error bound came from the errors in the transition probabilities; the simplifying assumption allows us to focus on reducing this term while minimizing clutter. The arguments are easy to extend to the case when .
Let be a policy whose suboptimality in we want to bound. The idea is to bound the suboptimality of by its suboptimality in and also by how much value functions for fixed policies differ when we switch from to . In particular, we have
where denotes an optimal policy in and the inequality holds because . The term marked as “opt. error” is the optimization error that arises when is not (quite) optimal in . This term is controlled by the choice of . For simplicity, assume for now that is an optimal policy in , so that we can drop this term. We further assume that is a deterministic optimal policy of .
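One plausible way to write this decomposition (hats mark quantities of the estimated MDP; this is a reconstruction consistent with the surrounding text rather than the exact display of the original notes) is

```latex
v^{\pi^*} - v^{\pi}
  = (v^{\pi^*} - \hat v^{\pi^*}) + (\hat v^{\pi^*} - \hat v^{\pi}) + (\hat v^{\pi} - v^{\pi})
  \le \|v^{\pi^*} - \hat v^{\pi^*}\|_\infty \mathbf{1}
      + \underbrace{\hat v^{\hat\pi^*} - \hat v^{\pi}}_{\text{opt. error}}
      + \|\hat v^{\pi} - v^{\pi}\|_\infty \mathbf{1}\,,
```

where $\hat\pi^*$ denotes an optimal policy of the estimated MDP and the componentwise inequality uses $\hat v^{\pi^*} \le \hat v^{\hat\pi^*}$.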
It remains to bound the first and last terms. Both of these terms have the form , i.e., the difference between the value functions of the same policy in the two MDPs (here, is either or ). This difference, similar to the value difference identity, can be expressed as a function of the difference , as shown in the next result:
Lemma (value difference from transition differences): Let and be two MDPs sharing the same state-action space and rewards, but differing in their transition probabilities. Let be a memoryless policy over the shared state-action space of the two MDPs. Then, the following identities hold:
Proof: We only need to prove since follows from this identity by symmetry. Concerning the proof of , we start with the closed form expression for value functions. From this we get
Inspired by the elementary identity that states that , we calculate
finishing the proof.
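As a quick numerical sanity check, the following sketch verifies the lemma on a random example, under the assumption that the two (not displayed) identities take the standard form $\hat v^\pi - v^\pi = \gamma(I-\gamma \hat P_\pi)^{-1}(\hat P_\pi - P_\pi)v^\pi = \gamma(I-\gamma P_\pi)^{-1}(\hat P_\pi - P_\pi)\hat v^\pi$:

```python
import numpy as np

rng = np.random.default_rng(0)
S, gamma = 5, 0.9
I = np.eye(S)

def random_kernel():
    P = rng.random((S, S))
    return P / P.sum(axis=1, keepdims=True)

# Transition kernels of the two MDPs under a fixed memoryless policy, shared rewards.
P, P_hat = random_kernel(), random_kernel()
r = rng.random(S)
v = np.linalg.solve(I - gamma * P, r)          # value function in the "true" MDP
v_hat = np.linalg.solve(I - gamma * P_hat, r)  # value function in the "estimated" MDP

diff = v_hat - v
form1 = gamma * np.linalg.solve(I - gamma * P_hat, (P_hat - P) @ v)   # estimated kernel enters via its inverse, on the left
form2 = gamma * np.linalg.solve(I - gamma * P, (P_hat - P) @ v_hat)   # estimated MDP enters on the right, via v_hat
assert np.allclose(diff, form1) and np.allclose(diff, form2)
```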
Note that in , the empirical transition kernel appears through its inverse by left-multiplying , while in , through , it appears by right-multiplying the same deviation term. In the remainder of this section we use , but in the next section we will use .
Combining with our previous inequality, we immediately get that
Assume that is obtained by sampling next states at each state-action pair. By Hoeffding’s inequality and a union bound over the state-action pairs, for any fixed and , with probability , we have
and in particular with , we have
Controlling the second term in requires more care as is random and depends on the same data that is used to generate . To deal with this term, we use another union bound. Let be the set of all possible value functions that we can obtain by considering deterministic policies. Since by construction is also a deterministic policy, . Hence,
and thus by a union bound over the functions in , we get that with probability ,
Putting things together, we see that
which reduces the dependence on of the sample size bound from to . As we shall see soon, this is not the best possible dependence on . This method also falls short of giving the best possible dependence on the number of states. In particular, inverting the above bound, we see that with this method we can only guarantee a -optimal policy if the total number of samples, is at least
while below we will see that the optimal bound is .
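To get a feel for the Hoeffding-plus-union-bound step used above, here is a small Monte-Carlo sketch; the bound is written in the common form $\|v\|_\infty\sqrt{\log(2SA/\delta)/(2m)}$, which may differ in constants from the display in the notes, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, m, delta, gamma = 20, 5, 400, 0.01, 0.9
v = rng.random(S) / (1 - gamma)                # a fixed test function with values in [0, 1/(1-gamma)]

P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)

P_hat = np.zeros_like(P)                       # empirical kernel from m next-state samples per pair
for s in range(S):
    for a in range(A):
        samples = rng.choice(S, size=m, p=P[s, a])
        P_hat[s, a] = np.bincount(samples, minlength=S) / m

deviation = np.max(np.abs((P_hat - P) @ v))    # max over (s,a) of |(P_hat - P) v|(s,a)
bound = np.max(v) * np.sqrt(np.log(2 * S * A / delta) / (2 * m))
print(deviation, bound)                        # the realized deviation falls below the bound w.h.p.
```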
Improved analysis of the plug-in method: Second attempt
There are two further ideas that help one achieve the sample complexity which will be seen to be optimal. One is to use what is known as Bernstein’s inequality in place of Hoeffding’s inequality, together with a clever observation about the “total variance”; the second is to improve the covering argument. The first idea helps with improving the horizon dependence, the second with improving the dependence on the number of states. In this lecture, we will only cover the first idea and sketch the second.
Bernstein’s inequality is a classic result in probability theory:
Theorem (Bernstein’s inequality): Let and let be an i.i.d. sequence and define as the sample mean of this sequence: . Then, for any , with probability at least ,
where .
To set expectations, it will be useful to compare this bound to Hoeffding’s inequality. In particular, in the setting of the theorem, Hoeffding’s inequality also applies and gives
Since in our case (the value functions take values in the interval), using (which would give rise to the optimal sample size), Hoeffding’s inequality gives a bound of size (cf. ). This is a problem: Ideally, we would like to see here, because the value-difference error bound introduces an additional factor.
We immediately see that for Bernstein’s inequality to make a difference, just focusing on the first term in Bernstein’s inequality, we need . In fact, since , we see that this is also sufficient to take off the factor from the sample complexity bound. It thus remains to be seen whether the variance could indeed be this small.
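To see the potential gain numerically, the following sketch compares the two confidence widths for a single state-action pair, assuming common forms of the two inequalities, a range of $1/(1-\gamma)$, a per-pair sample count at the hoped-for optimal rate, and a variance of order $1/(1-\gamma)$; all constants are illustrative:

```python
import numpy as np

gamma, eps, delta = 0.99, 0.1, 0.01
H = 1.0 / (1 - gamma)                      # range of the value functions: [0, H]
m = int(np.ceil(H**3 / eps**2))            # per-pair sample count at the hoped-for optimal rate
target = eps * (1 - gamma)                 # needed per pair, as the resolvent contributes another factor of H

log_term = np.log(2 / delta)
hoeffding = H * np.sqrt(log_term / (2 * m))
var = H                                    # the hoped-for variance level, Var ~ 1/(1-gamma)
bernstein = np.sqrt(2 * var * log_term / m) + 2 * H * log_term / (3 * m)

print(f"target:    {target:.1e}")
print(f"Hoeffding: {hoeffding:.1e}")       # overshoots the target by roughly a sqrt(H) factor
print(f"Bernstein: {bernstein:.1e}")       # same order as the target, up to constants and the log factor
```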
To find this out, fix a state-action pair and let be an i.i.d. sequence of next states at . Then, has the same distribution as
Defining and , we see that it is that appears when Bernstein’s inequality is used to bound . It remains to be seen how large values can take on. Sadly, one can quickly discover that the range of is sometimes also as large as . Is Bernstein’s inequality a dead-end then?
Of course, it is not, otherwise we would not have introduced it. In particular, a better bound is possible by directly bounding the maximum-norm of
which is close to the actual term that we need to bound. Indeed, by the second identity of the value difference lemma, and thus
The second term on the right-hand side is of order (since appears there and both and have been seen to be of order ). As we expect to be of order , we will focus on this term.
For simplicity, take now the case when is a fixed, nonrandom policy (we need such a bound both for and for , the second of which is random). In this case, by a union bound and Bernstein’s inequality, with probability ,
Multiplying both sides by , using a triangle inequality and the special properties of , we get
The following beautiful result, whose proof is omitted, gives an bound on the first term appearing on the right-hand side of the above display:
Lemma (total discounted variance bound): For any discounted MDP and policy in ,
Since the bound that we get from here is and not , “we are saved”. Indeed, plugging this into gives
which holds with probability . Choosing , we see that both terms are . It remains to show that a similar result holds for . If we use the union bound that we used before, we introduce an extra factor. Avoiding this extra factor requires new ideas, but with these we get the following result:
Theorem (upper bound for -designs): Let be an optimal policy in the MDP whose transition kernel is , a kernel estimated based on a sample of next states from each state-action pair. Letting and , if
then with probability , is -optimal, where is a universal constant. In short, for any there exists an algorithm that produces a -optimal policy from a total number of
samples under a uniform -design.
It remains to be seen whether the same sample complexity holds for larger values of , e.g., for .
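As a numerical sanity check of the total discounted variance bound, the sketch below evaluates $\|(I-\gamma P_\pi)^{-1}\sqrt{\mathrm{Var}_{P_\pi}[v^\pi]}\|_\infty$ on random MDPs with rewards in $[0,1]$ and compares it with $\sqrt{2/(1-\gamma)^3}$, the form of the bound given by Azar et al. (2013); the constant is quoted from that paper and may differ from the display in the notes:

```python
import numpy as np

rng = np.random.default_rng(2)
S, gamma = 30, 0.95
I = np.eye(S)

worst = 0.0
for _ in range(200):                        # random transition kernels under a fixed policy
    P = rng.random((S, S)) ** 3             # skewed rows so that next-state variances are non-trivial
    P /= P.sum(axis=1, keepdims=True)
    r = rng.random(S)                       # rewards in [0, 1]
    v = np.linalg.solve(I - gamma * P, r)   # v^pi
    var = np.maximum(P @ v**2 - (P @ v) ** 2, 0.0)   # Var over s' ~ P(.|s) of v^pi(s')
    total = np.linalg.solve(I - gamma * P, np.sqrt(var))
    worst = max(worst, np.max(total))

print(worst, np.sqrt(2 / (1 - gamma) ** 3))  # the left value stays below the right one
```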
Lower bound for -designs
A natural question is whether we can improve on the upper bound, or whether this can be matched by a lower bound. For this, we have the following result:
Theorem (lower bound for -designs): Any algorithm that uses -designs and is guaranteed to produce a -optimal policy needs at least samples.
Proof (sketch): As we have seen in the proof of the upper bound, the key to achieving the cubic dependence was that the sample mean of i.i.d. bounded random variables is within a distance of to the true mean. In a way, the converse of this is also true: It is “quite likely” that the distance between the sample mean and the true mean is this large. This is not too hard to see for specific distributions, such as when the are normally distributed, or when they are Bernoulli distributed (in a way, this is the essence of the central limit theorem, though the central limit theorem is restricted for ).
So how can we use this to establish the lower bound? In an MDP the randomness comes either from the rewards or from the transitions. But in the upper bound above, the rewards were given, so the only source of randomness is the transitions. Also, the cubic dependence must hold even if the number of states is a constant. What all this implies is that somehow learning the transition structure with a few states is what makes the sample complexity large as (or ). Clearly, this can only happen if the (small) MDP has self-loops. The smallest example of an MDP with a self-loop is when there is an action and a state such that taking that action from that state leads back to the same state with some positive probability, while with the complementary probability the next state is some other state. This leads to the structure shown on the figure on the right.
As can be seen, there are two states. The transition at the first state, call it state , is stochastic and leads to itself with probability , while it leads to state with probability . The reward associated with both transitions is . The second state, call it state , has a self-loop. The reward associated with this transition is zero.
There are no actions (alternatively, there is only one action at both states). However, if we can show that, in the absence of knowledge of , estimating the value of state up to a precision of takes samples, the sample complexity result will follow. In particular, if we repeat the transition structure times (sharing the same two states), one can make the value of for one of these actions ever so slightly different from the others so that its value differs by (say) from the others. Then, by construction, a learner who uses fewer than total samples at state will not be able to reliably tell the difference between the value of the special action and that of the other actions, hence will not be able to choose the right action and will thus be unable to produce a -optimal policy. To also add the state dependence, the structure can then be repeated times.
So it remains to be seen whether the said sample complexity result holds for estimating the value of state . Rather than giving a formal proof, we give a quick heuristic argument, hoping that readers will find this more intuitive.
The starting point for this heuristic argument is the general observation that sample complexity questions concerning estimation problems are essentially questions about the sensitivity of the quantity to be estimated to the unknown parameters. Here, sensitivity means how much the quantity changes if we change the underlying parameter. This sensitivity for small deviations and a single parameter is exactly the derivative of the quantity of interest with respect to the parameter.
In our special case, the value of state , call it (also showing the dependence on ) is the quantity to be estimated. Since the value of state is zero, must satisfy . Solving this we get
The derivative of this with respect to is
To get a -accurate estimate of , we need
Inverting for , we get that
It remains to choose as a function of to show that the above can be lower bounded by . If we choose , we have and hence
Putting things together finishes the proof sketch.
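The heuristic calculation above can be reproduced symbolically. The sketch below assumes, as in the construction described above, that the rewards on the transitions out of state 1 equal one (so that the value of state 1 is $v(p)=1/(1-\gamma p)$), that $p$ is estimated from Bernoulli($p$) samples, and that $p=\gamma$ is chosen; it recovers the $1/(\epsilon^2(1-\gamma)^3)$ scaling:

```python
import sympy as sp

p, gamma, eps = sp.symbols("p gamma epsilon", positive=True)

v = 1 / (1 - gamma * p)                  # value of state 1: v = 1 + gamma * p * v
dv = sp.diff(v, p)                       # sensitivity: gamma / (1 - gamma*p)**2

# An eps-accurate estimate of v requires |p_hat - p| <= eps/dv, and estimating a
# Bernoulli(p) mean to accuracy d takes about p*(1-p)/d**2 samples:
n = sp.simplify(p * (1 - p) * dv**2 / eps**2)

n_at = sp.simplify(n.subs(p, gamma))     # the choice p = gamma from the text
# The limit of n * eps^2 * (1-gamma)^3 as gamma -> 1 is a positive constant,
# i.e. n scales as 1/(eps^2 (1-gamma)^3):
print(sp.limit(n_at * eps**2 * (1 - gamma) ** 3, gamma, 1))   # prints 1/16
```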
A homework problem is included which explains how to fill in the gaps in the last part of the proof, while the pointers to the literature given below can be used to figure out how to fill in the remaining gaps.
Policy-based designs
When the data is generated by following some policy, we talk about policy-based designs. Here, the design decision is what policy to use to generate the data. The sample complexity of learning with policy-based designs is the number of observations necessary and sufficient for some algorithm to figure out a policy of a fixed target suboptimality, from a fixed initial state, based on data generated by following a policy, where the MDP in which the policy is followed can be any of the MDPs within the class.
Three questions arise then. (i) The first question (the design question) is what policy to follow during data collection. If the policy can use the full history, the problem is not much different from online learning, which we will consider later. From this perspective, the interesting (and perhaps more realistic) case is when the data-collection policy is memoryless and is fixed before the data collection begins. Hence, in what follows, we will restrict our attention to this case. (ii) The second question is what algorithm to use to compute a policy given the data generated. (iii) The final, third question is how large the sample complexity of learning with policy-induced data is for some fixed MDP class.
Learning a good policy from policy-induced data is much closer to reality than the same problem from -designs. Practical problems, such as problems in health care, robotics, etc., are such that we can obtain data generated by following some fixed policy, while it is usually not possible to demand sample transitions from arbitrary state-action pairs.
For simplicity, let us still consider the case of finite state-action MDPs, but to further simplify matters, let us now consider (homogeneous kernel) finite-horizon problems with a horizon . As it turns out, the plug-in algorithm of the previous section is still a good algorithm in the sense that it achieves the optimal (minimax) sample complexity. However, the minimax sample complexity is much higher than it is for -designs:
Theorem (sample complexity lower bound with policy-induced data): For any , , any (memoryless) data-collection policy over the state-action spaces and , for any and any algorithm that maps data that has transitions to a policy, there exists an MDP with state space and action space such that, with constant probability, the policy produced by is not -optimal with respect to the -horizon total reward criterion when the algorithm is fed with data obtained by following in .
Proof (sketch): Without loss of generality assume that ; if there are more states, just ignore them, while if there are fewer states, then just decrease . Consider an MDP where the states are organized in a chain under the effect of some actions, and state is an absorbing state with zero associated reward. For , let action be the one that gets chosen with the smallest probability in state under the data-generating policy : . We choose action to be the action that moves the state from to , deterministically. Any other action leads to state , deterministically. All rewards are zero, except when transitioning from state to state under action , where the reward is stochastic and normally distributed with mean either or and a variance of one. The structure of the MDP is shown in the figure on the left.
Now, because of the choice of , . Hence, the probability that, starting from state , following policy for steps will generate the sequence of states , including the critical transition from state to state , is at most . This transition is critical in the sense that only data from this transition decides whether in state it is worth taking action or not. In particular, if , taking is a poor choice, while if , taking is the optimal choice. The expected number of times this critical transition is seen is at most . With observations, the value of will be estimated up to an accuracy of . When this is smaller than , the sign of cannot be decided and thus, with constant probability (say, at least ), any algorithm will fail to identify whether should be taken in state or not. Plugging in the expected value of , we get that the condition on is that , where is some universal constant. Equivalently, the condition is that , which is the statement to be proven.
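The heart of the argument, that the logging policy traverses the whole chain only exponentially rarely, is easy to illustrate with a small simulation; in the sketch below the sizes are illustrative, action 0 plays the role of the chain actions, and every other action drops to the absorbing state, as in the construction above:

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, episodes = 6, 4, 400_000             # state S-1 is the absorbing state

hits = 0                                   # episodes that contain the critical transition
for _ in range(episodes):
    s = 0
    while s < S - 1:
        if rng.integers(A) != 0:           # uniform logging policy picks a "wrong" action ...
            break                          # ... and the episode drops to the absorbing state
        s += 1                             # action 0 advances along the chain
        if s == S - 1:                     # critical transition: last chain state -> absorbing state
            hits += 1
print(hits / episodes, A ** -(S - 1))      # both are ~ A^{-(S-1)}
```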
The lower bound construction suggests that, in the absence of extra information about the MDPs, the best policy to use is the uniform policy. Note that a similar statement holds for the discounted setting. This lower bound is in stark contrast with the polynomial upper bound of the previous section: data obtained by following policies can be very poor. One may wonder whether the situation can be improved by assuming that the data is obtained from a good policy (say, an optimal policy), but the proof of the previous result in fact shows that this is not the case.
While the exponential lower bound on the sample complexity of learning from policy-induced data is already bad enough, one may worry that the situation could be even worse. Could it happen that even the best algorithm needs a double exponential number of samples? Or even infinitely many? A moment of thought shows that the latter is the case if we switch to the average reward setting: This is because in the average reward setting the value of an action can depend on the value of a state whose hitting probability within an arbitrary, fixed number of transitions is positive, just arbitrarily low. Can something similar happen perhaps in the finite-horizon setting, or the discounted setting? As it turns out, the answer is no. The previous lower bound gives the correct order of the sample complexity of finding a near-optimal policy using policy-induced data:
Theorem (sample complexity upper bound with policy induced data): With episodes of length collected with the uniform policy from a fixed initial distribution , with a constant probability, the plug-in algorithm produces a policy that is -optimal when started from .
Proof (sketch): For simplicity assume that the reward function is known. Let be the logging policy, which is uniform. Again, assume that the plug-in algorithm produces a deterministic policy.
The proof is based on the decomposition of the suboptimality gap of the policy produced that was used before. In particular, by ,
where as before, denotes the value function of policy in the empirically estimated MDP. Further, we also used as a shorthand for , where ,
One then needs a counterpart of the value difference lemma. In this case, the following version is convenient: For any policy ,
and
where and are matrices. These can be proved by using the Bellman equation for the action-value functions and a simple recursion and noting that .
Next, we can observe that , where is a distribution over which assigns probability to . This, combined with the value difference identity, makes appear in the bounds. This is the probability distribution over the state-action space after using for steps when the initial distribution is . Now, as this is multiplied by , and for a given state-action pair , , using that , which holds because , we see that it suffices if the ratios (or their square roots) are controlled. Above, is the number of times is seen in the data, and is the number of times is seen in the data in the th transition. Here we should also mention that we only control these terms for state-action pairs that satisfy , as the total contribution of the other state-action pairs is , i.e., small. For these state-action pairs, is also positive and, with high probability, the counts are also positive.
Next, one can show that
This is done in two steps. First, show that . This follows from the law of total probability: Write as the sum of probabilities of all trajectories that end with after transitions. Next, for a given trajectory, replace each occurrence of with at the expense of introducing a factor of (this comes from ). The next step is to show that also holds. This inequality follows by observing that the uniform policy and the uniform mixture of all deterministic (memoryless) policies induce the same distribution over the trajectories. Then by letting denote the set of all deterministic policies, using that , we have , where we used that .
Putting things together, applying a union bound when it comes to arguing for , and collecting terms gives the result.
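The first step of the last argument, that the step-h occupancy measure of any memoryless policy is at most $A^h$ times that of the uniform logging policy (one factor of $A$ per step, as in the replacement argument above), can be checked directly; a minimal sketch on a random MDP, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, H = 4, 3, 5
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
mu0 = np.ones(S) / S                        # initial state distribution

def occupancies(pi):
    """Step-h state-action distributions nu_h(s,a) under a memoryless policy pi (S x A)."""
    nus, d = [], mu0.copy()
    for _ in range(H):
        nu = d[:, None] * pi                # nu_h(s,a) = d_h(s) * pi(a|s)
        nus.append(nu)
        d = np.einsum("sa,sat->t", nu, P)   # next-step state distribution
    return nus

uniform = np.ones((S, A)) / A
pi = np.eye(A)[rng.integers(A, size=S)]     # a random deterministic memoryless policy
for h, (nu_pi, nu_mu) in enumerate(zip(occupancies(pi), occupancies(uniform)), start=1):
    print(h, np.max(nu_pi / nu_mu), A**h)   # the ratio never exceeds A**h
```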
Bibliographic remarks
Finding a good policy from a sample drawn from a -design and finding a good policy from a sample given a generative model, or random access simulator, of the MDP (which we extensively studied in previous lectures on planning) are almost the same problem. The random access model, however, allows the learner to choose, sequentially and in reaction to the samples collected so far, the state-action pair at which the next transition should be generated. Thus, computing a good policy with a random access simulator gives more power to the “learner” (or planner). The lower bound presented for -designs can in fact be shown to hold for the generative setting as well (the proof in the paper cited below goes through in this case with no changes). This shows that in the tabular case, adaptive random access to the simulator provides no benefit to the planner over non-adaptive random access.
The sample complexity bound for finding a -optimal policy under a uniform -design using the plug-in method is from the following paper:
Agarwal, Alekh, Sham Kakade, and Lin F. Yang. 2020. “Model-Based Reinforcement Learning with a Generative Model Is Minimax Optimal.” COLT, 67–83. arXiv link
This paper also contains a number of pointers to the literature. Interestingly, earlier works often used more complicated methods which directly worked with value functions rather than the more natural plug-in approach. The problem of whether the plug-in method is minimax optimal in design for finite-horizon problems is open.
The result which was included in this lecture limits the range of to . Equivalently, the result is not applicable for a small number of observations per state-action pair. This limitation has been removed in a follow-up to this work:
Li, Gen, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. 2020. “Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model.” NeurIPS
This paper still uses the plug-in method, but adds random noise to the observed rewards to help with tie-breaking.
The variance bound, which is the key to achieving the cubic dependence on the horizon, is from the following paper:
Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349, 2013.
This paper also has the essential ideas for the matching lower bound. The 2020 paper is notable for some novel proof techniques, which were developed to bound the error terms whose control is not included in this lecture.
The results for learning with policy-induced data are from
Xiao, Chenjun, Ilbin Lee, Bo Dai, Dale Schuurmans, and Csaba Szepesvari. 2021. “On the Sample Complexity of Batch Reinforcement Learning with Policy-Induced Data.” arXiv
which also has the details that were omitted in these notes. This paper also gives a modern proof for the -design sample complexity lower bound.
One may ask whether the results for -design that show cubic dependence on the horizon extend to the case of large MDPs when value function approximation is used. In a special case, this has been positively resolved in the following paper:
Yang, Lin F., and Mengdi Wang. 2019. “Sample-Optimal Parametric Q-Learning Using Linearly Additive Features.” ICML arXiv version
which uses an approach similar to Politex in a more restricted setting, but achieves an optimal dependence on .