17. Introduction
Batch learning is concerned with problems where a learning algorithm must work with data collected in some manner that is not under the control of the learning algorithm: on a batch of data. In batch RL the data is given in the form of a sequence of trajectories of varying length, where each trajectory is of the form $S_0, A_0, R_0, S_1, A_1, R_1, \dots$, with the length possibly differing from trajectory to trajectory.
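As a concrete (purely illustrative) picture of such a batch, one might represent the data as follows; the names below are chosen for this sketch and are not part of any standard API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    state: int       # S_t (here: a tabular state index)
    action: int      # A_t
    reward: float    # R_t, received after taking A_t in S_t
    next_state: int  # S_{t+1}

# A trajectory is a list of consecutive transitions; a batch is a list of
# trajectories of possibly different lengths.
Trajectory = List[Transition]
Batch = List[Trajectory]
```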
Batch RL problems fall into two basic categories:
- Value prediction: Predict the value $v^\pi(\mu)$ of using a policy $\pi$ from the initial distribution $\mu$, where both $\pi$ and $\mu$ are given in an explicit form.
- Policy optimization: Find a good (ideally, near optimal) policy given the batch of data from an MDP.
These two problems are intimately related. On the one hand, a good value predictor can potentially be used to find good policies. On the other hand, a good policy optimizer can also be used to decide whether the value of some policy is above or below some fixed threshold by appropriately manipulating the data fed to the policy optimizer. One can then wrap a binary search procedure around this decision routine to find the value of the policy.
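To make the last reduction concrete, here is a minimal sketch of the binary search wrapper. The decision routine `value_at_least` is a hypothetical oracle, assumed to be built on top of a policy optimizer fed with suitably transformed data; its construction is not specified here.

```python
def estimate_value(value_at_least, lo, hi, tol=1e-3):
    """Binary search for the value of a policy, given a decision oracle.

    `value_at_least(threshold)` is a (hypothetical) routine that returns
    True iff the value of the policy is at least `threshold`.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if value_at_least(mid):
            lo = mid   # the value is at least mid: search the upper half
        else:
            hi = mid   # the value is below mid: search the lower half
    return (lo + hi) / 2.0

# Usage with a toy oracle whose hidden "true value" is 0.37:
print(estimate_value(lambda theta: 0.37 >= theta, lo=0.0, hi=1.0))
```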
Value prediction problems have some common variations. In policy evaluation, rather than evaluating a policy for some fixed initial distribution, the goal is to estimate the entire value function of the policy. Of course, this is at least as hard as the simpler, initial value estimation problem. However, much of the hardness of the problem is already captured by the initial value estimation problem. In initial value prediction, oftentimes the goal is to predict an interval that contains the true unknown value with a prescribed probability, rather than just producing a “point estimate”. In the case of policy evaluation, the analogue is to predict a set that contains the true unknown value function with a prescribed probability. Here, a simpler goal is to estimate confidence intervals for each potential input (state), which when “pasted together” can be visualized as forming a confidence band.
There is also the question of how to collect data. In statistics, the problem of designing a “good way” of collecting the data is called the experimental design problem. The best case is, of course, when data can be collected in an active manner: this is when the data collection strategy changes in response to the data that has been collected so far.
The problem of designing good active data collection strategies belongs to the bigger group of designing online learning algorithms. These are defined precisely by the property that the data is collected in a way that depends on the data collected previously. The last segment of this part will be devoted solely to these online learning strategies.
In many applications, active data collection is not an option. There can be many reasons for this: active data collection may be deemed to be risky, expensive, or just technically challenging. When data is collected in a passive fashion, it may simply miss key information that would allow for good solutions. Still, even in this case, there may be better and worse ways of collecting data. Optimizing experimental designs is the problem of choosing good passive data collection strategies that lead to good learning outcomes. This topic already came up in the context of planning algorithms: they also need to create value function estimates, and for this it is better to plan the data collection so that learning can succeed.
Oftentimes though, there is no control over how data is collected. Even worse, the method that was used to collect data may be unknown. When this is the case, not much can be done, as the following example shows:
Consider a bandit problem with two actions, denoted by $a_1$ and $a_2$, and say that the rewards are binary. Suppose that the logging policy that generated the data chose its actions based on a variable that is not recorded in the data, and that this variable also controls the rewards: whenever the rewards of both actions would be high, the logger happened to choose $a_1$, and whenever they would be low, it chose $a_2$. In the logged (action, reward) pairs, $a_1$ is then always paired with a reward of one and $a_2$ with a reward of zero. This is indeed consistent with $a_1$ having a higher mean reward than $a_2$, but it is equally consistent with the two actions having identical mean rewards, as in the scenario just described.
This is an example where the correct model cannot be estimated because of the way the data is collected: the presence of a spurious correlation with a variable that controls outcomes but is not recorded can easily make the collected data useless, regardless of how much of it is available. This is an instance when the model is unidentifiable even with an infinite amount of data.
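The phenomenon can be checked with a short simulation. The scenario below (a hidden binary context, two actions, deterministic rewards) is only an illustrative instance of the point just made, not an example taken from the text:

```python
import random

def reward(world, action, u):
    """Deterministic rewards in two hypothetical worlds.

    World "A": both actions pay 1 when the hidden context u holds, 0 otherwise.
    World "B": action 1 always pays 1; action 2 pays 1 only when u holds.
    In world "B" action 1 is strictly better; in world "A" they are identical.
    """
    if world == "B" and action == 1:
        return 1.0
    return 1.0 if u else 0.0

def log_data(world, n=10_000, seed=0):
    """Logged (action, reward) pairs; the hidden context u is never recorded.
    The logger picks action 1 exactly when u holds."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u = rng.random() < 0.5   # hidden context
        a = 1 if u else 2        # the logger's choice is driven by u
        data.append((a, reward(world, a, u)))
    return data

for world in ("A", "B"):
    data = log_data(world)
    m1 = sum(r for a, r in data if a == 1) / sum(a == 1 for a, _ in data)
    m2 = sum(r for a, r in data if a == 2) / sum(a == 2 for a, _ in data)
    print(f"world {world}: logged mean rewards: a1={m1:.2f}, a2={m2:.2f}")
# Both worlds produce the same logged data distribution, so no amount of such
# data can distinguish "the actions are identical" from "action 1 is better".
```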
When data collection is as arbitrary as in the above example, only a very careful study of the domain can tell us whether the model is identifiable from the data. Note that this is an activity that involves thinking about the structure of the problem at hand. It is best, of course, if data collection can be influenced so as to avoid building up spurious correlations. When data is collected in a causal way (following a policy, while recording both the decisions made and the data used to make those decisions), spurious correlations are avoided and the remaining problem is to guarantee sufficient “coverage” to achieve statistical efficiency.
How good is the plug-in method?
The plug-in method estimates a model and uses the estimated model in place of the real one to solve the problem at hand. Let $M = (\mathcal{S}, \mathcal{A}, P, r)$ denote the true MDP and let $\hat M = (\mathcal{S}, \mathcal{A}, \hat P, \hat r)$ denote the estimated MDP, which shares its state and action spaces with $M$ but has estimated transition probabilities $\hat P$ and rewards $\hat r$.
We consider the discounted case with a discount factor $0 \le \gamma < 1$.
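As a rough sketch (with all names chosen for illustration), the plug-in approach in the tabular discounted case might look as follows: given estimates $\hat P$ and $\hat r$, run value iteration on the estimated model and act greedily with respect to the resulting action-value function.

```python
import numpy as np

def plug_in_policy(P_hat, r_hat, gamma, num_iters=1000):
    """Plug-in (certainty equivalence) policy for a tabular MDP.

    P_hat: array of shape (S, A, S), estimated transition probabilities
    r_hat: array of shape (S, A), estimated mean rewards
    Returns a deterministic policy (array of shape (S,)) that is greedy with
    respect to the optimal action-value function of the *estimated* MDP.
    """
    S, A, _ = P_hat.shape
    q = np.zeros((S, A))
    for _ in range(num_iters):
        v = q.max(axis=1)              # \hat v(s) = max_a \hat q(s, a)
        q = r_hat + gamma * P_hat @ v  # Bellman optimality update in \hat M
    return q.argmax(axis=1)
```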
We start with a generic result about contraction mappings:

Proposition (residual bound): Let $T$ be a $\gamma$-contraction over a normed vector space $(V, \|\cdot\|)$ and let $v^\ast \in V$ be its unique fixed point. Then, for any $v \in V$,
$$
\|v - v^\ast\| \le \frac{\|v - Tv\|}{1-\gamma}\,.
$$

Proof: By the triangle inequality, $\|v - v^\ast\| \le \|v - Tv\| + \|Tv - v^\ast\| = \|v - Tv\| + \|Tv - Tv^\ast\| \le \|v - Tv\| + \gamma \|v - v^\ast\|$. Reordering and solving for $\|v - v^\ast\|$ gives the result. $\blacksquare$
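A quick numerical sanity check of the proposition (a sketch on randomly generated data) can be done by taking $T$ to be the policy evaluation operator $Tv = r^\pi + \gamma P^\pi v$ of a random Markov reward process, whose fixed point can be computed exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
S, gamma = 20, 0.9

# Random Markov reward process: row-stochastic P_pi and rewards r_pi.
P_pi = rng.random((S, S)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(S)

T = lambda v: r_pi + gamma * P_pi @ v                      # gamma-contraction in the max-norm
v_star = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # its fixed point

v = rng.random(S) * 10                                     # arbitrary test point
lhs = np.max(np.abs(v - v_star))
rhs = np.max(np.abs(v - T(v))) / (1 - gamma)
assert lhs <= rhs + 1e-9
print(f"||v - v*|| = {lhs:.3f} <= residual bound = {rhs:.3f}")
```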
An immediate implication is that good model estimates are guaranteed to give rise to (relatively) good value estimates.
Proposition (value estimation error): Fix a memoryless policy $\pi$ and let $v^\pi$ and $\hat v^\pi$ denote its value function in $M$ and in $\hat M$, respectively. Writing $r^\pi$, $P^\pi$ (respectively, $\hat r^\pi$, $\hat P^\pi$) for the reward vector and state transition matrix induced by $\pi$ in $M$ (respectively, in $\hat M$), we have
$$
\|\hat v^\pi - v^\pi\|_\infty \le \frac{\|\hat r^\pi - r^\pi\|_\infty + \gamma\, \|(\hat P^\pi - P^\pi) v^\pi\|_\infty}{1-\gamma}\,.
$$
Also,
$$
\|\hat v^\pi - v^\pi\|_\infty \le \frac{\|\hat r^\pi - r^\pi\|_\infty + \gamma\, \|(\hat P^\pi - P^\pi) \hat v^\pi\|_\infty}{1-\gamma}\,.
$$
Similarly, the analogous bounds hold for the action-value functions $q^\pi$ and $\hat q^\pi$,
and, using $\|(\hat P^\pi - P^\pi) v^\pi\|_\infty \le \|\hat P^\pi - P^\pi\|_\infty \|v^\pi\|_\infty$ together with $\|v^\pi\|_\infty \le \|r\|_\infty/(1-\gamma)$,
$$
\|\hat v^\pi - v^\pi\|_\infty
\le \frac{\|\hat r^\pi - r^\pi\|_\infty}{1-\gamma} + \frac{\gamma\, \|\hat P^\pi - P^\pi\|_\infty\, \|r\|_\infty}{(1-\gamma)^2}\,,
$$
where $\|\hat P^\pi - P^\pi\|_\infty = \max_s \|\hat P^\pi(\cdot \mid s) - P^\pi(\cdot \mid s)\|_1$ denotes the maximum row-wise $\ell_1$ norm and $\|r\|_\infty = \max_{s,a} |r(s,a)|$.
Note that in general the value estimates are more sensitive to errors in the transition probabilities than to errors in the rewards. In particular, the transition errors can be magnified by a factor as large as $\gamma \|r\|_\infty/(1-\gamma)^2$, while the reward errors are magnified by a factor of at most $1/(1-\gamma)$.
Proof: To reduce clutter, we write $v = v^\pi$, $\hat v = \hat v^\pi$, and let $T^\pi v = r^\pi + \gamma P^\pi v$ and $\hat T^\pi v = \hat r^\pi + \gamma \hat P^\pi v$ denote the policy evaluation operators of $\pi$ in $M$ and in $\hat M$, respectively. The operator $\hat T^\pi$ is a $\gamma$-contraction in the maximum norm with fixed point $\hat v$. Hence, by the residual bound applied at $v$,
$$
\|\hat v - v\|_\infty \le \frac{\|v - \hat T^\pi v\|_\infty}{1-\gamma}
= \frac{\|T^\pi v - \hat T^\pi v\|_\infty}{1-\gamma}
\le \frac{\|\hat r^\pi - r^\pi\|_\infty + \gamma\, \|(\hat P^\pi - P^\pi) v\|_\infty}{1-\gamma}\,,
$$
where the equality used $v = T^\pi v$. The second inequality follows from separating the reward and transition terms in $T^\pi v - \hat T^\pi v = (r^\pi - \hat r^\pi) + \gamma (P^\pi - \hat P^\pi) v$ by the triangle inequality. The second bound of the proposition is obtained by exchanging the roles of $M$ and $\hat M$, the bounds for the action-value functions follow the same way, and the last bound follows from the first by the argument given just before its statement. $\blacksquare$
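The first inequality of the proposition is easy to check numerically. The following sketch (random data, purely for illustration) compares the value error of a perturbed model against the bound:

```python
import numpy as np

rng = np.random.default_rng(1)
S, gamma = 10, 0.9

def random_stochastic(shape):
    M = rng.random(shape)
    return M / M.sum(axis=-1, keepdims=True)

# True and perturbed ("estimated") models for a fixed memoryless policy pi,
# represented directly by the induced reward vector and transition matrix.
P = random_stochastic((S, S));      r = rng.random(S)
P_hat = random_stochastic((S, S));  r_hat = r + 0.05 * rng.standard_normal(S)

value = lambda P_, r_: np.linalg.solve(np.eye(S) - gamma * P_, r_)
v, v_hat = value(P, r), value(P_hat, r_hat)

err = np.max(np.abs(v_hat - v))
bound = (np.max(np.abs(r_hat - r))
         + gamma * np.max(np.abs((P_hat - P) @ v))) / (1 - gamma)
assert err <= bound + 1e-9
print(f"value error {err:.3f} <= bound {bound:.3f}")
```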
The result just shown suffices to quantify the size of the value errors. For quantifying the policy optimization error that results from finding an optimal (or near optimal) policy of $\hat M$ and then using this policy in $M$, a little more work is needed.
Lemma (Policy error bound - I.): Let $\varepsilon \ge 0$ and let $\pi$ be a memoryless policy. Then the following hold:

- If $\pi$ is $\varepsilon$-optimizing in the sense that
$$
\sum_a \pi(a \mid s)\, q^\ast(s,a) \ge v^\ast(s) - \varepsilon
$$
holds for every state $s$, then $\pi$ is $\frac{\varepsilon}{1-\gamma}$-suboptimal:
$$
v^\pi \ge v^\ast - \frac{\varepsilon}{1-\gamma}\,\mathbf{1}\,.
$$
- If $\pi$ is greedy with respect to some function $q: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, then $\pi$ is $\varepsilon$-optimizing with $\varepsilon = 2\|q - q^\ast\|_\infty$ and thus
$$
v^\pi \ge v^\ast - \frac{2\|q - q^\ast\|_\infty}{1-\gamma}\,\mathbf{1}\,.
$$
This leads to the following result:
Theorem (bound on policy optimization error): Assume that the rewards both in $M$ and in $\hat M$ take values in the interval $[0,1]$. Let $\varepsilon_r = \max_{s,a} |\hat r(s,a) - r(s,a)|$, $\varepsilon_P = \max_{s,a} \|\hat P(\cdot \mid s,a) - P(\cdot \mid s,a)\|_1$, and let $\hat\pi$ be a policy that is optimal in $\hat M$. Then $\hat\pi$ is $\delta$-optimal in $M$:
$$
v^{\hat\pi} \ge v^\ast - \delta\,\mathbf{1}\,, \qquad \text{where} \quad
\delta = \frac{2}{(1-\gamma)^2} \left( \varepsilon_r + \frac{\gamma\, \varepsilon_P}{1-\gamma} \right)\,.
$$
Note that, up to a small constant factor, the optimization error bound is larger by a factor of $1/(1-\gamma)$ than the value estimation error bound of the previous proposition: the price of optimizing with an approximate model is an extra $1/(1-\gamma)$ magnification of the model errors.
Proof: Let $q^\ast$ and $\hat q^\ast$ denote the optimal action-value functions of $M$ and $\hat M$, respectively, and recall that $\hat\pi$, being optimal in $\hat M$, is greedy with respect to $\hat q^\ast$. On the one hand, applying the residual bound to the Bellman optimality operator $\hat T$ of $\hat M$ (defined by $(\hat T q)(s,a) = \hat r(s,a) + \gamma \sum_{s'} \hat P(s' \mid s,a) \max_{a'} q(s',a')$, a $\gamma$-contraction whose fixed point is $\hat q^\ast$) at $q^\ast$, and using $q^\ast = T q^\ast$, we have
$$
\|\hat q^\ast - q^\ast\|_\infty \le \frac{\|q^\ast - \hat T q^\ast\|_\infty}{1-\gamma} = \frac{\|T q^\ast - \hat T q^\ast\|_\infty}{1-\gamma}\,. \tag{1}
$$
Let $\varepsilon = 2\|\hat q^\ast - q^\ast\|_\infty$. Since $\hat\pi$ is greedy with respect to $\hat q^\ast$, for every state $s$,
$$
\sum_a \hat\pi(a \mid s)\, q^\ast(s,a)
\ge \max_a \hat q^\ast(s,a) - \|\hat q^\ast - q^\ast\|_\infty
\ge \max_a q^\ast(s,a) - 2\|\hat q^\ast - q^\ast\|_\infty
= v^\ast(s) - \varepsilon\,,
$$
that is, $\hat\pi$ is $\varepsilon$-optimizing in $M$. Hence, by Part 1. of the “Policy Error Bound I.” lemma from above,
$$
v^{\hat\pi} \ge v^\ast - \frac{\varepsilon}{1-\gamma}\,\mathbf{1} = v^\ast - \frac{2\|\hat q^\ast - q^\ast\|_\infty}{1-\gamma}\,\mathbf{1}\,.
$$
By the triangle inequality and the assumption on the rewards (which implies $\|v^\ast\|_\infty \le 1/(1-\gamma)$),
$$
\|T q^\ast - \hat T q^\ast\|_\infty
\le \|\hat r - r\|_\infty + \gamma \max_{s,a} \Big| \sum_{s'} \big( \hat P(s' \mid s,a) - P(s' \mid s,a) \big)\, v^\ast(s') \Big|
\le \varepsilon_r + \frac{\gamma\, \varepsilon_P}{1-\gamma}\,.
$$
By Eq. (1),
$$
\|\hat q^\ast - q^\ast\|_\infty \le \frac{1}{1-\gamma} \left( \varepsilon_r + \frac{\gamma\, \varepsilon_P}{1-\gamma} \right)\,.
$$
The result is obtained by chaining the inequalities:
$$
v^\ast - v^{\hat\pi} \le \frac{2\|\hat q^\ast - q^\ast\|_\infty}{1-\gamma}\,\mathbf{1}
\le \frac{2}{(1-\gamma)^2} \left( \varepsilon_r + \frac{\gamma\, \varepsilon_P}{1-\gamma} \right) \mathbf{1}\,. \qquad \blacksquare
$$
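A small simulation (random tabular MDPs, illustrative only) can be used to compare the actual suboptimality of the plug-in policy with the bound of the theorem; unsurprisingly, the bound is typically very conservative:

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 5, 3, 0.9

def solve(P, r):
    """Optimal state values and a greedy optimal policy via value iteration."""
    q = np.zeros((S, A))
    for _ in range(2000):
        q = r + gamma * P @ q.max(axis=1)
    return q.max(axis=1), q.argmax(axis=1)

def policy_value(P, r, pi):
    """Value of a deterministic policy pi in the MDP (P, r)."""
    P_pi = P[np.arange(S), pi]    # (S, S) transition matrix under pi
    r_pi = r[np.arange(S), pi]    # (S,) reward vector under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

# True model with rewards in [0, 1].
P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))

# Perturbed ("estimated") model, renormalized so rows remain distributions.
P_hat = P + 0.05 * rng.random((S, A, S)); P_hat /= P_hat.sum(axis=-1, keepdims=True)
r_hat = np.clip(r + 0.05 * rng.standard_normal((S, A)), 0.0, 1.0)

v_star, _ = solve(P, r)
_, pi_hat = solve(P_hat, r_hat)                       # plug-in policy
subopt = np.max(v_star - policy_value(P, r, pi_hat))  # its suboptimality in M

eps_r = np.max(np.abs(r_hat - r))
eps_P = np.max(np.abs(P_hat - P).sum(axis=-1))
bound = 2 / (1 - gamma) ** 2 * (eps_r + gamma * eps_P / (1 - gamma))
print(f"suboptimality {subopt:.4f} <= bound {bound:.2f}")
```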
Model estimation error: Tabular case
As usual, it is worthwhile to clean up the foundations by considering the tabular case. In this case, the model can be estimated by using sample means. To allow for a unified presentation, let the data available be given in the form of triplets of the form $(Z_i, R_i, S_i')$, $i = 1, \dots, n$, where $Z_i = (S_i, A_i)$ is a state-action pair, $R_i$ is the reward incurred, and $S_i' \sim P(\cdot \mid S_i, A_i)$ is the corresponding next state. Defining the counts
$$
N(s,a) = \sum_{i=1}^n \mathbb{I}\{ (S_i, A_i) = (s,a) \}
\qquad \text{and} \qquad
N(s,a,s') = \sum_{i=1}^n \mathbb{I}\{ (S_i, A_i, S_i') = (s,a,s') \}\,,
$$
the transition probability estimate is
$$
\hat P(s' \mid s,a) = \frac{N(s,a,s')}{N(s,a)}\,,
$$
and for the reward estimate we have
$$
\hat r(s,a) = \frac{1}{N(s,a)} \sum_{i=1}^n \mathbb{I}\{ (S_i, A_i) = (s,a) \}\, R_i\,.
$$
For ensuring that these are always defined, let $\hat P(\cdot \mid s,a)$ be an arbitrary fixed distribution (say, the uniform one) and let $\hat r(s,a) = 0$ whenever $N(s,a) = 0$.
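A direct (illustrative) implementation of these sample-mean estimators, with the same convention for unvisited pairs, might look like this:

```python
import numpy as np

def estimate_model(transitions, S, A):
    """Sample-mean model estimate from (s, a, r, s') tuples.

    Returns (P_hat, r_hat) with shapes (S, A, S) and (S, A).  For state-action
    pairs that never appear in the data, P_hat defaults to the uniform
    distribution and r_hat to zero.
    """
    counts = np.zeros((S, A, S))
    reward_sums = np.zeros((S, A))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r

    n_sa = counts.sum(axis=-1)              # N(s, a)
    P_hat = np.full((S, A, S), 1.0 / S)     # uniform default for unvisited pairs
    r_hat = np.zeros((S, A))
    visited = n_sa > 0
    P_hat[visited] = counts[visited] / n_sa[visited][:, None]
    r_hat[visited] = reward_sums[visited] / n_sa[visited]
    return P_hat, r_hat
```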
Consider now the simple case when the above triplets are such that for each state-action pair $(s,a)$ the data contains exactly $m$ triplets with $Z_i = (s,a)$, and the corresponding next states and rewards are drawn independently of each other (the setting of a generative model). In particular, defining $\varepsilon_r = \max_{s,a} |\hat r(s,a) - r(s,a)|$ and $\varepsilon_P = \max_{s,a} \|\hat P(\cdot \mid s,a) - P(\cdot \mid s,a)\|_1$, Hoeffding's inequality (for the rewards) and a standard concentration inequality for the $\ell_1$-error of empirical distributions (for the transitions), combined with a union bound over the state-action pairs, give that for any $0 < \zeta < 1$, with probability at least $1-\zeta$,
$$
\varepsilon_r \le \sqrt{\frac{\ln(4 \mathrm{S} \mathrm{A} / \zeta)}{2m}}
\qquad \text{and} \qquad
\varepsilon_P \le \sqrt{\frac{2\big( \mathrm{S} \ln 2 + \ln(2 \mathrm{S} \mathrm{A} / \zeta) \big)}{m}}\,,
$$
provided that the rewards take values in the $[0,1]$ interval; here $\mathrm{S} = |\mathcal{S}|$ and $\mathrm{A} = |\mathcal{A}|$ denote the number of states and actions, respectively. Together with the theorem of the previous section, it follows that with probability at least $1-\zeta$ the optimal policy $\hat\pi$ of the estimated model $\hat M$ is $\delta$-optimal in $M$ with
$$
\delta \le \frac{c}{(1-\gamma)^3}\, \sqrt{\frac{\mathrm{S} + \ln(\mathrm{S} \mathrm{A} / \zeta)}{m}}\,,
$$
where $c > 0$ is a universal constant.
One can alternatively write this in terms of the total number of observations, $n = m\,\mathrm{S}\mathrm{A}$: with probability at least $1-\zeta$,
$$
\delta \le \frac{c}{(1-\gamma)^3}\, \sqrt{\frac{\mathrm{S} \mathrm{A} \big( \mathrm{S} + \ln(\mathrm{S} \mathrm{A} / \zeta) \big)}{n}}\,.
$$
It follows that for any target suboptimality $\delta_{\mathrm{target}} > 0$, as soon as
$$
m \ge \frac{c^2 \big( \mathrm{S} + \ln(\mathrm{S} \mathrm{A} / \zeta) \big)}{(1-\gamma)^6\, \delta_{\mathrm{target}}^2}\,,
$$
we are guaranteed that, with probability at least $1-\zeta$, the optimal policy of the estimated model is at most $\delta_{\mathrm{target}}$-suboptimal in the true MDP.
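For concreteness, the (very conservative) per-pair sample size suggested by the last display can be computed as follows; the constant `c` is the unspecified universal constant from the bound, set here to 1 purely for illustration:

```python
import math

def samples_per_pair(num_states, num_actions, gamma, target_subopt,
                     failure_prob=0.05, c=1.0):
    """Per state-action pair sample size m suggested by the bound
    m >= c^2 (S + ln(S A / zeta)) / ((1 - gamma)^6 * delta^2)."""
    log_term = num_states + math.log(num_states * num_actions / failure_prob)
    return math.ceil(c**2 * log_term / ((1 - gamma)**6 * target_subopt**2))

# Example: even a tiny MDP needs a huge sample size when gamma = 0.99.
print(samples_per_pair(num_states=10, num_actions=2, gamma=0.99,
                       target_subopt=0.1))
```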
Notes
Between batch and online learning
In applications it may happen that one can change the data collection strategy only a limited number of times. This creates a scenario that is in between batch and online learning. From the perspective of online learning, this is learning in the presence of constraints on the data collection strategy. One such widely studied constraint is the number of switches of the data collection strategy. As it happens, only very few switches are necessary to get the full power of online learning, and this is not really specific to reinforcement learning: it follows because the empirical distribution converges at a slow rate to the true distribution. For parametric problems, the rate is $O(1/\sqrt{n})$ with $n$ observations, so the data collection strategy needs to be updated only when the sample size grows by a constant factor, which happens only logarithmically many times.
Batch RL with no access to state information
For simplicity, we stated the batch learning problem in a way that assumes that the states in the transitions are observed. This may be seen as problematic. One “escape” is to treat the whole history as the state: indeed, in a causal, controlled stochastic process, the history can always be used as a Markov state. Because of this, the assumption that the state is observed is not restrictive, though the state space becomes exponential in the length of the trajectories. This reduces the problem to learning in large state-space MDPs. Of course, even the lower bounds for planning tell us that in the absence of extra structure, all algorithms need a sample size proportional to the size of the state-action space; hence, one needs to add extra structure to deal with this case, such as function approximation. It also holds that if one uses, say, linear function approximation, then only the features of the states (or state-action pairs) need to be recorded in the data.
Causal reasoning and batch RL
Whether a causal effect can be learned from a batch of data (to be more precise, from data drawn from a specific distribution) is the topic of causal reasoning. In batch RL, the “effect” is the value of a policy, which, in the language of causal reasoning, would be called a multistage treatment. As the example in the text shows, in batch RL the identifiability problem is simply “assumed away” through our assumptions on how the data is collected. When the assumption on how the data is generated/collected is not met, the tools of causal reasoning can potentially still be used. It is important to emphasize, though, that there is no causality without assuming causality: the statements that causal reasoning can make are conditional on the data sampling assumptions being met. Even “causal discovery” is contingent on these assumptions. However, with care, oftentimes it is possible to argue that some suitable assumptions are met (e.g., arguing based on what information is available at what time in a process), in which case the nontrivial tools of causal reasoning may be very useful.
Nevertheless, especially in engineered systems, our standard data collection assumptions are reasonable and can be arranged for, though in large engineered systems mistakes, such as not logging critical quantities, may happen. One example of this is when the action to be taken is overridden by some part of the system, a part which, say, will later be turned off. Clearly, if no one logs the actual actions taken, the effects of actions become unidentifiable. As we shall see later, batch RL and the causality literature share some of their vocabulary, such as “instrumental variables”, “propensity scores”, etc.
Plug-in or certainty equivalence
Plug-in generally means that a model is estimated and then used as if it were the “true” model. In control, when a controller (policy) is derived with this approach, it is known as the “certainty equivalence” controller. The “certainty equivalence principle” states that the “random” errors can be neglected. The principle originates from the observation that in various scenarios the optimal controller (optimal policy) has a special form that confirms this principle. In particular, this was first observed in linear quadratic Gaussian (LQG) control, where the optimal controller can be obtained by solving for the optimal control under perfect state information and then substituting the optimal state estimate for the perfect state information. This strict optimality result is quite brittle. Nevertheless, as we shall see soon, from the perspective of minimax optimality, certainty equivalent policies are not a bad choice.
Bibliographic remarks
In the early RL literature, online learning was dominant. When people tried to apply RL to various “industrial”/“applied” settings, they were forced to think about how to learn from data collected before learning starts. One of the first papers to push this agenda is the following one:
- D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
Earlier mentions of “batch-mode RL” include
- X. Wang and T. G. Dietterich. Efficient value function approximation using regression trees. In Proceedings of the IJCAI Workshop on Statistical Machine Learning for Large-Scale Optimization, 1999.
Even in online learning, efficient learning may force one to save all the data to be used for learning. The so-called LSTD algorithm, and later the LSPI algorithm, were explicitly proposed to address this challenge:
- J. A. Boyan. Technical update: least-squares temporal difference learning. Machine Learning, 49 (2-3):233–246, 2002.
- M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
Off-policy learning refers to the case when an algorithm needs to produce value function (or action-value function) estimates for some policy and the data available is not generated by the policy to be evaluated. In all the above examples, we are thus in the setting of off-policy learning. The policy evaluation problem, accordingly, is often called the off-policy policy evaluation (OPPE) problem, while the problem of finding a good policy is called the off-policy policy optimization (OPPO) problem.
For a review of the literature of around 2012, consult the following paper:
- S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In: M. Wiering and M. van Otterlo (eds), Reinforcement Learning. Adaptation, Learning, and Optimization, vol 12. Springer, Berlin, Heidelberg, 2012.