6. Online planning - Part II.
In the previous lecture online planning was introduced. The main idea is to amortize the cost of planning: the planner is asked to produce an action to be taken at the state just visited, and the requirement is that the policy induced by repeatedly calling the planner at the visited states and following the returned actions be near-optimal. We have seen that with this, the cost of planning can be made independent of the size of the state space – at least for deterministic MDPs. For this, one can use just a recursive implementation of value iteration, which, for convenience, we wrote using action-value functions and the corresponding Bellman optimality operator,

$$
(Tq)(s,a) = r(s,a) + \gamma \sum_{s'} P(s,a,s') \max_{a'} q(s',a')
$$
(in the previous lecture we used
We have also seen that no procedure can do significantly better in terms of its runtime (or query cost) than this simple recursive procedure. In this lecture we show that these ideas also extend to the stochastic case.
Sampling May Save the Day?
Assume now that the MDP is stochastic. Recall the pseudocode of the recursive form of value iteration from the last lecture, which, when called with a state $s$, computes the vector of $k$-step optimal action values at $s$:
1. define q(k,s):
2. if k = 0 return [0 for a in A] # base case
3. return [ r(s,a) + gamma * sum( [P(s,a,s') * max(q(k-1,s')) for s' in S] ) for a in A ]  # exact expectation over next states
4. end
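For concreteness, here is a minimal runnable rendering of this pseudocode in Python; the toy three-state MDP (the tables `r` and `P`) is made up purely for illustration.

```python
import numpy as np

# A made-up toy MDP: 3 states, 2 actions (for illustration only).
gamma = 0.9
S = [0, 1, 2]                                   # state space
A = [0, 1]                                      # action space
rng = np.random.default_rng(0)
r = rng.uniform(size=(len(S), len(A)))          # r[s][a]: immediate reward
P = rng.dirichlet(np.ones(len(S)), size=(len(S), len(A)))  # P[s][a][s']: next-state probabilities

def q(k, s):
    """Return the vector of k-step optimal action values at state s,
    exactly as in the pseudocode: the expectation over next states is exact."""
    if k == 0:
        return [0.0 for a in A]                 # base case
    return [r[s][a] + gamma * sum(P[s][a][s2] * max(q(k - 1, s2)) for s2 in S)
            for a in A]

values = q(5, 0)                                # plan at state 0 with lookahead 5
print(values, "-> action", int(np.argmax(values)))
```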
Obviously, the size of the state space creeps in because in line 3 we need to calculate an expected value over the next-state distribution $P(\cdot\mid s,a)$, which in general means touching every state. A natural remedy is to replace this expectation with an average over a few next states sampled from $P(\cdot\mid s,a)$ with the help of the simulator; the price to pay is that the backed-up values are then subject to random errors.
To quantify the size of these errors, we recall Hoeffding’s inequality:
Lemma (Hoeffding’s Inequality): Given $m$ independent, identically distributed random variables $X_1,\dots,X_m$ taking values in the interval $[0,b]$ and with common mean $\mu$, for any $0<\delta<1$, with probability at least $1-\delta$,

$$
\left|\frac{1}{m}\sum_{i=1}^m X_i - \mu\right| \le b\sqrt{\frac{\log(2/\delta)}{2m}}\,.
$$
Letting $b$ denote the range of the random values being averaged (with rewards in $[0,1]$, all action values lie in $[0,1/(1-\gamma)]$, so one may take $b=1/(1-\gamma)$), the lemma tells us that the average of $m$ independent samples approximates the corresponding expectation up to an error of order $b\sqrt{\log(2/\delta)/m}$, with probability at least $1-\delta$.
This suggests the following approach: For each state-action pair $(s,a)$, draw $m$ next states independently from $P(\cdot\mid s,a)$ using the simulator, collect them in a list $C(s,a)$, and replace the expectation in line 3 with the sample average over $C(s,a)$.
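For a rough sense of scale (assuming, as later in the lecture, that the rewards lie in $[0,1]$, so that the quantities being averaged lie in $[0,1/(1-\gamma)]$), solving the Hoeffding bound for $m$ shows how many samples suffice for an error of at most $\epsilon$ with probability $1-\delta$:

$$
b\sqrt{\frac{\log(2/\delta)}{2m}} \le \epsilon
\quad\Longleftrightarrow\quad
m \ge \frac{b^2\log(2/\delta)}{2\epsilon^2}\,,
\qquad \text{with } b = \frac{1}{1-\gamma}\,.
$$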
Plugging this approximation into our previous pseudocode gives the following new code:
1. define q(k,s):
2. if k = 0 return [0 for a in A] # base case
3. return [ r(s,a) + gamma/m * sum( [max(q(k-1,s')) for s' in C(s,a)] ) for a in A ]  # sample average over the m next states in C(s,a)
4. end
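The following is a minimal runnable sketch of this sampled variant; the generative simulator `sample_next` and the toy MDP are illustrative assumptions (any simulator with the same interface would do), and the sets $C(s,a)$ are created on demand and reused, as discussed below.

```python
import numpy as np

gamma, m, K = 0.9, 10, 3                 # discount, samples per (s, a), planning depth
S, A = [0, 1, 2], [0, 1]                 # the same made-up toy MDP as before
rng = np.random.default_rng(1)
r = rng.uniform(size=(len(S), len(A)))
P = rng.dirichlet(np.ones(len(S)), size=(len(S), len(A)))

def sample_next(s, a):
    """Generative simulator: draw one next state from P(.|s, a)."""
    return int(rng.choice(len(S), p=P[s][a]))

C = {}                                    # C[(s, a)]: list of m sampled next states

def get_C(s, a):
    if (s, a) not in C:                   # create on first encounter, then reuse
        C[(s, a)] = [sample_next(s, a) for _ in range(m)]
    return C[(s, a)]

def q(k, s):
    """Sampled k-step action values: the expectation is replaced by a sample average."""
    if k == 0:
        return [0.0 for a in A]
    return [r[s][a] + gamma / m * sum(max(q(k - 1, s2)) for s2 in get_C(s, a))
            for a in A]

values = q(K, 0)
print(values, "-> action", int(np.argmax(values)))
```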
The total runtime of this function is now of order $(m\mathrm{A})^{k+1}$, where $\mathrm{A}$ denotes the number of actions: each call with $k\ge 1$ issues $m\mathrm{A}$ recursive calls, so the recursion tree has $O((m\mathrm{A})^k)$ nodes, each requiring $O(m\mathrm{A})$ work. Crucially, this is independent of the size of the state space.
This pseudocode sweeps under the rug who creates the lists $C(s,a)$ and when. A natural choice is to create $C(s,a)$ with the help of the simulator the first time the pair $(s,a)$ is encountered, by drawing $m$ independent next states from $P(\cdot\mid s,a)$, and to reuse the same list if the pair is encountered again during the same call of the planner.
Good Action-Value Approximations Suffice
As a first step towards understanding the strengths and weaknesses of this approach, let us define
With the help of this definition, when called with state
The conciseness of this formula, if anything, must please everyone!
Let us now turn to the question of whether the policy induced by calling this randomized planner at every visited state is near-optimal.
Suboptimality of $\epsilon$-optimizing policies
Define
We call this function
Lemma (Policy error bound - I.): Fix $\epsilon\ge 0$ and let $\pi$ be a memoryless policy.
- If $\pi$ is $\epsilon$-optimizing in the sense that $\sum_a \pi(a\mid s)\, q^*(s,a) \ge v^*(s)-\epsilon$ holds for every state $s$, then $\pi$ is $\frac{\epsilon}{1-\gamma}$-suboptimal: $v^\pi \ge v^* - \frac{\epsilon}{1-\gamma}\mathbf{1}$.
- If $\pi$ is greedy with respect to a function $q:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$, then $\pi$ is $\epsilon$-optimizing with $\epsilon = 2\|q-q^*\|_\infty$, and thus $v^\pi \ge v^* - \frac{2\|q-q^*\|_\infty}{1-\gamma}\mathbf{1}$.
For the proof, which is partially left to the reader, we need to introduce a bit more notation. In particular, for a memoryless policy $\pi$, define the operator $M_\pi$ that maps action-value functions to state-value functions via $(M_\pi q)(s) = \sum_a \pi(a\mid s)\, q(s,a)$.
With the help of this operator, the condition that $\pi$ is $\epsilon$-optimizing can be written concisely as $M_\pi q^* \ge v^* - \epsilon\mathbf{1}$.
Further, the second claim of the lemma can be stated in the more concise form that whenever $\pi$ is greedy with respect to $q$, $v^\pi \ge v^* - \frac{2\|q-q^*\|_\infty}{1-\gamma}\mathbf{1}$.
For future reference, we will also find it useful to define
Note that here we abused notation as
Proof: The first part of the proof is standard and is left to the reader. For the second part note that if $\pi$ is greedy with respect to $q$ then, for any state $s$ and any action $a$,

$$
\sum_{a'}\pi(a'\mid s)\, q^*(s,a') \;\ge\; \sum_{a'}\pi(a'\mid s)\, q(s,a') - \|q-q^*\|_\infty \;\ge\; q(s,a) - \|q-q^*\|_\infty \;\ge\; q^*(s,a) - 2\|q-q^*\|_\infty\,,
$$

where the middle inequality uses that $\pi$ puts all its mass on actions maximizing $q(s,\cdot)$. Taking the maximum over $a$ shows that $\pi$ is $2\|q-q^*\|_\infty$-optimizing.
Then use the first part.
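As a quick numerical sanity check of the lemma (not part of the lecture), one can build a small random MDP, perturb $q^*$ by at most $\epsilon$ in each entry, act greedily with respect to the perturbed function, and verify that $v^\pi \ge v^* - \frac{2\epsilon}{1-\gamma}\mathbf{1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, eps = 5, 3, 0.9, 0.05
r = rng.uniform(size=(nS, nA))
P = rng.dirichlet(np.ones(nS), size=(nS, nA))        # P[s, a, s']

q = np.zeros((nS, nA))                               # q* via value iteration
for _ in range(2000):
    q = r + gamma * (P @ q.max(axis=1))
v_star = q.max(axis=1)

q_hat = q + rng.uniform(-eps, eps, size=q.shape)     # eps-accurate approximation of q*
pi = q_hat.argmax(axis=1)                            # greedy policy w.r.t. q_hat

P_pi = P[np.arange(nS), pi]                          # transition matrix of pi
r_pi = r[np.arange(nS), pi]
v_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)   # exact policy evaluation

print(bool(np.all(v_pi >= v_star - 2 * eps / (1 - gamma) - 1e-9)))   # expected: True
```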
Suboptimality of almost $\epsilon$-optimizing policies
There are two issues that need to be taken care of. One is that the planner is randomizing when computing the values
All in all, the best we can hope for is that with each call,
Here,
Then, on
That is, on the good event
Let
In words, with probability at least
Lemma (Policy error bound II): Let
Proof: By Part 1 of the previous lemma, it suffices to show that
Error control
What remains is to show that with high probability, the error
where, for brevity, we introduced
Union bounds
So we know that for any fixed state-action pair
To answer this question, it will be easier to turn it around and just try to come up with some event that, on the one hand, has low probability, while, on the other hand, outside of this event,
Denoting by
it is clear that if
But how large can the probability of
Lemma (Union Bound): For any probability measure $\mathbb{P}$ and any countable sequence of events $B_1, B_2, \dots$, $\mathbb{P}\big(\bigcup_i B_i\big) \le \sum_i \mathbb{P}(B_i)$.
By this result, using that each individual deviation event has probability at most $\delta$, the probability that the Hoeffding bound fails for at least one state-action pair is at most $\mathrm{S}\mathrm{A}\delta$, where $\mathrm{S}$ and $\mathrm{A}$ denote the number of states and actions.
If we want this total failure probability to be at most some target value, it suffices to apply the Hoeffding bound for each pair with $\delta$ divided by the number of pairs.
The following diagram summarizes the idea of union bounds:
To control the probability of some bad event happening, we can break the bad event into a number of elementary parts. By controlling the probability of each such part, we can control the probability of the bad event, or, alternatively, control the probability of the complementary “good” event. The worst case for controlling the probability of the bad event is when the elementary parts do not overlap, but the argument of course works in every case.
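Spelled out for our setting: if each of the $N$ elementary “bad” events (one per state-action pair, so $N$ is the number of such pairs) is allotted failure probability $\delta/N$, then

$$
\mathbb{P}\Big(\bigcup_{i=1}^{N} B_i\Big) \;\le\; \sum_{i=1}^{N} \mathbb{P}(B_i) \;\le\; N\cdot\frac{\delta}{N} \;=\; \delta\,,
$$

and on the complement of this union all $N$ Hoeffding bounds hold simultaneously, each with radius $b\sqrt{\log(2N/\delta)/(2m)}$: the price of uniformity over the pairs is only a logarithmic factor.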
Returning to our calculations, from the last formula we see that the errors grew a little compared to the bound we had for a single, fixed state-action pair, but only by a logarithmic factor. The trouble is that this logarithmic factor involves the number of states, which is exactly what we set out to avoid.
Avoiding dependence on state space cardinality
The key to avoiding the dependence on the cardinality of the state space is to avoid taking union bounds over the whole state-action set. That this may be possible follows from the observation that, thinking back to the recursive implementation of the planner, the planner does not necessarily rely on all the sets $C(s,a)$: it only ever uses the sets belonging to the state-action pairs it actually encounters during its recursive calls, and the number of these is bounded by the number of recursive calls, regardless of the size of the state space.
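To see this effect concretely, here is a variant of the earlier sketch with a deliberately huge (made-up) state space, accessed only through a simulator; counting the sets $C(s,a)$ that a single call creates shows that the count is governed by the number of recursive calls, not by the number of states.

```python
import numpy as np

# A made-up MDP with a huge state space, accessible only through a simulator.
n_states, A, gamma, m, K = 10**6, [0, 1], 0.9, 10, 3
rng = np.random.default_rng(2)

def reward(s, a):
    return ((s % 7) + a) / 8.0                              # arbitrary bounded reward

def sample_next(s, a):
    return int((s + a + rng.integers(1, 100)) % n_states)   # arbitrary stochastic dynamics

C = {}

def get_C(s, a):
    if (s, a) not in C:
        C[(s, a)] = [sample_next(s, a) for _ in range(m)]
    return C[(s, a)]

def q(k, s):
    if k == 0:
        return [0.0 for a in A]
    return [reward(s, a) + gamma / m * sum(max(q(k - 1, s2)) for s2 in get_C(s, a))
            for a in A]

q(K, 0)                                                     # one call of the planner at state 0
# Only the pairs actually encountered get a set C(s, a); their number is at most len(A)
# times the number of calls with k >= 1 (about (m * len(A)) ** (K - 1) of them),
# no matter how large n_states is.
print(len(C), "sets created, out of", n_states * len(A), "state-action pairs")
```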
To get a handle on this, it will be useful to introduce a notion of a distance induced by the set
With this, for
as the set of states accessible from
We may now observe that in the calculation of
This can be proved by induction on
Proof:
The base case follows because when
Taking into account that when
and in particular,
The plan is to take advantage of this to avoid a union bound over all possible state-action pairs. We start with a recursive expression for the errors.
Recall that
Now, observing that
we see that
In particular, defining
we see that
where we use the notation
while
where the last inequality uses that
we get
We see that the first term in the sum on the right-hand side (in the parenthesis) is controlled by
In fact, notice that
Controlling
Since
To exploit that
This suggests that we should consider the chronological order in which in the recursive call of function
That
Lemma: Assume that the immediate rewards belong to the
where
Proof: Recall that
For
(as earlier,
Fix
Fix
Note that given this, for any
for some binary valued functions
holds if and only if
Now, notice that by our assumptions, for
Plugging this back into the previous displayed equation, “unrolling” the expansion done using the law of total probability, we find that
Now, choose
The claim then follows by a union bound over all actions and all
Final error bound
Putting everything together, we get that for any
Thus, to obtain a planner that induces a
For
Proposition: Let
From this, defining
and
if
Theorem: Assume that the immediate rewards belong to the
Overall, we see that the runtime did increase compared to the deterministic case (apart from logarithmic factors, in the above result
Notes
Sparse lookahead trees
The idea of the algorithm that we analyzed comes from a paper by Kearns, Mansour and Ng from 2002. In their paper they consider the version of the algorithm which creates a fresh “new” random set $C(s,a)$ each time a state-action pair is encountered in the recursive calls, rather than reusing the set created at the first encounter.
It is not hard to modify the analysis given here to accommodate this change. With this, one can also interpret the calculations done by the algorithm as backing up values in a “sparse lookahead tree” built recursively from the state the planner is called with.
Much work has been devoted to improving these basic ideas, and eventually they led to various Monte-Carlo tree search algorithms, including yours truly’s UCT. In general, these algorithms attempt to improve on the runtime by building the lookahead trees only where they need to be built. As it turns out, a useful strategy here is to expand the nodes which, in a sense, hold the greatest promise to improve the value at the “root”. This is known as “optimism in planning”. Note that A* (and its MDP relative, AO*) are also based on optimism: A*’s admissible heuristic functions in our language correspond to functions that upper bound the optimal value. The definitive source on MCTS theory as of today is Rémi Munos’s monograph.
Measure concentration
Hoeffding’s inequality is a special case of what is known as measure concentration. This phrase refers to the phenomenon that the empirical measure induced by a sample is a good approximation to the whole measure. The simplest case is when one just compares the means of the measures (the empirical and the sample-generating one), giving rise to concentration inequalities around the mean. Hoeffding’s inequality is an example. What we like about Hoeffding’s inequality (besides that it is simple) is that the failure probability, $\delta$, enters the deviation bound only logarithmically, so even very small failure probabilities are cheap to achieve.
The comparison inequality
The comparison inequality between the logarithm and the linear function is given as Proposition 4 here. The proof is based on two observations: First, it is enough to consider the case when
A model-centered view and random operators
A key idea of this lecture is that
It may seem quite miraculous that with only a few elements in
A bigger point is that for a model to be a “good” approximation to the “true MDP”, it suffices that the Bellman optimality operator that it induces is a “close” approximation to the Bellman optimality operator of the true MDP.
This in fact brings us to our next topic: what happens when the simulator is imperfect?
Imperfect simulation model?
We can rarely expect simulators to be perfect. Luckily, not all is lost in this case. As noted above, if the simulator induces an MDP whose Bellman optimality operator is close (in a suitable sense) to the Bellman optimality operator of the true MDP, we expect the outcome of planning to still be a good policy in the true MDP.
In fact, the above proof has already all the key elements in place to show this. In particular, it is not hard to show that if
which, combined with our first lemma of this lecture on the policy error bound, gives that the policy that is greedy with respect to
optimal in the MDP underlying
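To make this quantitative, here is the standard contraction argument, sketched under the assumption that both operators are $\gamma$-contractions in the maximum norm: write $T$ and $\tilde T$ for the Bellman optimality operators of the true MDP and of the MDP induced by the simulator, $q^*$ and $\tilde q^*$ for their respective fixed points, and suppose that $\|\tilde T q - T q\|_\infty \le \epsilon$ for every $q$. Then

$$
\|\tilde q^* - q^*\|_\infty
= \|\tilde T \tilde q^* - T q^*\|_\infty
\le \|\tilde T \tilde q^* - T \tilde q^*\|_\infty + \|T \tilde q^* - T q^*\|_\infty
\le \epsilon + \gamma\,\|\tilde q^* - q^*\|_\infty\,,
$$

so $\|\tilde q^* - q^*\|_\infty \le \frac{\epsilon}{1-\gamma}$, and by the policy error bound a policy that is greedy with respect to $\tilde q^*$ is at most $\frac{2\epsilon}{(1-\gamma)^2}$-suboptimal in the true MDP.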
Monte-Carlo methods
We saw in homework 0 that randomization may help a little, and today we saw that it can help in a more significant way. A major lesson again is that representations do matter: If the MDP is not given with a “generative simulator”, getting such a simulator may be really hard. This is good to remember when it comes to learning models:
One should insist on learning models that make the job of planners easier.
Generative models are one such case, provably, as we have seen by putting together today’s lecture with our previous lower bound that involved the number of states. Randomization, more generally, is a powerful tool in computing science, which brings us to a somewhat philosophical question: What is randomness? Does “true randomness” exist? Can we really build computers to harness this?
True randomness?
What is the meaning of “true” randomness? The margin is definitely not big enough to explain this. Hence, we just leave this there, hanging, for everyone to ponder. But let’s also note that this is a thoroughly studied question in theoretical computing science, with many beautiful results and even books. Arora and Barak’s book on computational complexity (Chapters 7, 20 and 21) is a good start for exploring this.
Can we recycle the sets between the calls?
If simulation is expensive, it may be tempting to recycle the sets between calls of the planner. After all, even if we recycle these sets,
The ubiquity of continuity arguments in the MDP literature
All the computations that we do with MDPs tend to be approximate. We evaluate policies approximately. We compute Bellman backups approximately. We have approximate models. We greedify approximately. If any of these operations could blow up small errors, none of the approximate methods would work. The study of approximate computations (which is a necessity if one faces large MDPs) is thus a study of the sensitivity of the values of the resulting policies to the errors introduced in the computations. In numerical analysis this would be called error analysis; in other areas of mathematics it is called sensitivity analysis. In fact, sensitivity analysis often involves computing derivatives to see how fast outputs change as the inputs change (the inputs here being the data that gets approximated). What should we be taking derivatives with respect to here? Well, it is always the data that is being changed. One can in fact use differentiation-based sensitivity analysis everywhere. This has been tried a little in the “older” MDP literature and is also related to policy gradient theorems (which we will learn about later). However, perhaps there are more nice things to be discovered about this approach.
From local to online access
The algorithm that is analyzed in this lecture requires local access simulators. This is better than requiring global access, but worse than requiring online access. It remains an open question whether, with online access, one can obtain a result similar to the one shown in the present lecture and, if not, whether the sample complexity of planning remains finite in this setting.
When the state space is small
For finite state-action MDPs where the rewards and transition probabilities are represented using tables, a previous lecture’s main result established that an optimal policy of the MDP can be calculated using a number of elementary arithmetic and logical operations that is polynomial in the number of states and actions and in $1/(1-\gamma)$.
In the case of online planning with global access, the sample complexity cannot be worse, but it is unclear whether it can be improved. Similarly, it is unclear what the complexity is in the case of either local or online access.
References
- Kearns, M., Mansour, Y., & Ng, A. Y. (2002). A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine learning, 49(2), 193-208. [link]
- David Pollard (2015). A few good inequalities. Chapter 2 of a book under preparation with working title “MiniEmpirical”. [link]
- Stephane Boucheron, Gabor Lugosi and Pascal Massart (2012). Concentration inequalities: A nonasymptotic theory of independence. Clarendon Press – Oxford. [link]
- Roman Vershynin (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. [link]
- M. J. Wainwright (2019) High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press.
- Lafferty J., Liu H., & Wasserman L. (2010). Concentration of Measure. [link]
- Lattimore, T., & Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
- William B. Haskell, Rahul Jain, and Dileep Kalathil. Empirical dynamic programming. Mathematics of Operations Research, 2016.
- Sanjeev Arora and Boaz Barak (2009). Computational Complexity: A Modern Approach. Cambridge University Press.
- Remi Munos (2014). From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning. Foundations and Trends in Machine Learning: Vol. 7: No. 1, pp 1-129.