22. Introduction
Online learning in reinforcement learning refers to the setting where a learner is placed in an (initially) unknown MDP. By interacting with the MDP, the learner collects data about the unknown transition and reward functions. The learner’s goal is to collect as much reward as possible, or to output a near-optimal policy. The difference to planning is that the learner does not have access to the true MDP. Unlike in batch RL, the learner gets to decide which actions to play. Importantly, this means the learner’s actions affect the data that becomes available to it (sometimes referred to as the “closed loop” setting).
The fact that the learner needs to create its own data leads to an important decision: Should the learner sacrifice reward to collect more data that will improve its decision making in the future? Or should it act according to what currently seems best? Clearly, too much exploration is costly if the learner chooses low-reward actions too often. On the other hand, playing actions that appear optimal based on limited data comes at the risk of missing out on even better rewards. In the literature, this is commonly known as the exploration-exploitation dilemma.
The exploration-exploitation dilemma is not specific to the MDP setting. It already arises in the simpler (multi-armed) bandit setting (i.e. an MDP with only one state and stochastic reward).
In the following, we focus on finite-horizon episodic (undiscounted) MDPs $M = (\mathcal{S}, \mathcal{A}, P, r, \mu, H)$, where $\mathcal{S}$ is a finite state space, $\mathcal{A}$ a finite action space, $P$ the transition kernel, $r$ the reward function, $\mu$ the initial state distribution and $H$ the horizon. The learner interacts with $M$ in episodes: at the start of episode $k$ an initial state $S_1^k \sim \mu$ is drawn, the learner then plays actions for $H$ steps while observing the resulting rewards and state transitions, after which the next episode begins.
This model contains some important settings as a special case. Most notably, $H = 1$ recovers the contextual bandit setting, where the “context” is sampled from the distribution $\mu$, and $H = 1$ with a single state is the finite multi-armed bandit setting.
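For later reference, the value of a policy $\pi$ in this model is its expected total reward over one episode (written here in the notation introduced above):

$$v^{\pi}(s) = \mathbb{E}^{\pi}\Big[\sum_{h=1}^{H} r(S_h, A_h) \,\Big|\, S_1 = s\Big], \qquad v^{*}(s) = \max_{\pi} v^{\pi}(s)\,,$$

and $v^{*}$ is the benchmark against which the learner is measured below.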
Sample complexity and regret: How good is the learner?
The goal of the learner is to collect as much reward as possible. We denote by $R_K$ the regret of the learner after $K$ episodes,

$$R_K = \sum_{k=1}^{K} \Big( v^{*}(S_1^k) - \sum_{h=1}^{H} r(S_h^k, A_h^k) \Big)\,,$$

where $S_1^k$ is the initial state of episode $k$ and $(S_h^k, A_h^k)_{h=1}^{H}$ is the trajectory generated by the learner in that episode. The regret compares the reward the learner actually collects to the reward an optimal policy would collect in expectation.
A learner has sublinear expected regret if $\mathbb{E}[R_K] = o(K)$, i.e. the average per-episode shortfall relative to the optimal policy vanishes as $K \to \infty$.
Before we go on to construct learners with small regret, we briefly note that there are also other objectives. The most common alternative is PAC, which stands for probably approximately correct. A learner is said to be $(\varepsilon, \delta)$-PAC if, with probability at least $1 - \delta$, it outputs a policy whose value is within $\varepsilon$ of the optimal value, after interacting with the MDP for a number of episodes that is bounded polynomially in the relevant problem parameters (its sample complexity).
The difference to bounding regret is that in the PAC setting the learner is not charged for the reward it forgoes while collecting data; only the quality of the policy returned at the end matters.
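The two objectives are closely related. As a simple illustration of one direction, assume the learner follows a policy $\pi_k$ in episode $k$ that is chosen based only on data from earlier episodes, and that its expected regret satisfies $\mathbb{E}[R_K] \le C\sqrt{K}$ for some constant $C > 0$. Then

$$\mathbb{E}\Big[\frac{1}{K}\sum_{k=1}^{K}\big(v^{*}(\mu) - v^{\pi_k}(\mu)\big)\Big] = \frac{\mathbb{E}[R_K]}{K} \le \frac{C}{\sqrt{K}} \le \varepsilon \qquad \text{whenever } K \ge (C/\varepsilon)^2\,,$$

where $v^{\pi}(\mu) = \mathbb{E}_{S_1 \sim \mu}[v^{\pi}(S_1)]$. Hence the policy that picks one of $\pi_1, \dots, \pi_K$ uniformly at random and follows it for the episode is $\varepsilon$-optimal in expectation. The reference at the end of these notes treats the relationship between regret and PAC bounds in detail.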
$\varepsilon$-greedy
There exist many ideas on how to design algorithms with small regret. We first note that a “greedy” agent can easily fail: Always following the actions that look best according to some empirical estimate can easily trap the learner in a suboptimal policy (think of some examples where this can happen!).
A simple remedy is to add a small amount of “forced” exploration: With (small) probability $\varepsilon$ the learner plays an action chosen uniformly at random, and with probability $1 - \varepsilon$ it plays the greedy action according to its current estimates. This is known as the $\varepsilon$-greedy strategy; in practice the exploration probability is often decayed over time.
It is often possible to show that, with an appropriately chosen (decaying) exploration schedule, $\varepsilon$-greedy achieves sublinear regret.
Not unexpectedly, this kind of undirected exploration can be quite sub-optimal. It is easy to construct examples (for instance, MDPs where reward is only obtained after a long, specific sequence of actions) where uniformly random exploration takes time exponential in the size of the MDP to even discover the rewarding states, so the regret of $\varepsilon$-greedy is far larger than necessary.
On the upside, $\varepsilon$-greedy is very simple to implement and often performs reasonably well in practice, which explains its popularity.
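To make this concrete, here is a minimal sketch of $\varepsilon$-greedy action selection on top of tabular action-value estimates; the array layout, tie-breaking rule and decay schedule are illustrative choices rather than part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, state, epsilon):
    """Play a uniformly random action with probability epsilon,
    otherwise a greedy action with respect to the current estimates.

    Q is an array of shape (num_states, num_actions) holding empirical
    action-value estimates; greedy ties are broken uniformly at random.
    """
    num_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))      # forced exploration
    best = np.flatnonzero(Q[state] == Q[state].max())
    return int(rng.choice(best))                   # exploit current estimates

def epsilon_schedule(episode, eps0=1.0, decay=0.01, eps_min=0.05):
    """One of many possible ways to decay the exploration probability."""
    return max(eps_min, eps0 / (1.0 + decay * episode))
```

In a complete agent, the estimates in Q would be refined between (or within) episodes by whatever estimation procedure the learner uses, for example empirical model estimates or Q-learning-style updates.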
Optimism Principle
A popular technique to construct regret minimizing algorithms is based on optimism in the face of uncertainty. To formally define the idea, let $\mathcal{M}_k$ denote a set of MDP models that are plausible given the data collected in the first $k-1$ episodes (a confidence set), constructed so that the true MDP is contained in $\mathcal{M}_k$ with high probability.
The optimism principle is to act according to the policy that achieves the highest reward among all plausible models, i.e. in episode $k$ the learner plays

$$\pi_k \in \arg\max_{\pi}\, \max_{M \in \mathcal{M}_k} v_M^{\pi}(\mu)\,,$$

where $v_M^{\pi}$ denotes the value of policy $\pi$ in model $M$. Intuitively, if the optimistic model is far from the truth, playing $\pi_k$ generates data that shrinks the set of plausible models. On the other hand, the learner ensures that the value it anticipates is never below the optimal value of the true MDP, as long as the true MDP remains in $\mathcal{M}_k$.
One should also ask if the optimization problem above can be solved efficiently. Depending on how the set of plausible models is represented, computing the optimistic policy can itself be a non-trivial computational problem.
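To see why acting optimistically helps, note the following elementary consequence of the definition of $\pi_k$: if the true MDP lies in $\mathcal{M}_k$, then $v^{*}(\mu) \le \max_{\pi}\max_{M \in \mathcal{M}_k} v_M^{\pi}(\mu) = \max_{M \in \mathcal{M}_k} v_M^{\pi_k}(\mu)$, and therefore

$$v^{*}(\mu) - v^{\pi_k}(\mu) \;\le\; \max_{M \in \mathcal{M}_k} v_M^{\pi_k}(\mu) - v^{\pi_k}(\mu)\,,$$

where values without a model subscript refer to the true MDP. The (expected) regret in episode $k$ is thus controlled by the gap between the optimistic value and the true value of the same policy $\pi_k$, a gap that shrinks as data accumulates and the set of plausible models concentrates around the truth.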
How much regret the learner incurs of course depends on the concrete setting at hand. In the next lecture we will see how to use optimism to design (and analyze) an online learning algorithm for finite MDPs. The literature contains a large number of papers with algorithms that apply the optimism principle in many settings. This does not mean, however, that optimism is a universal tool. More recent literature has also pointed out limitations of the optimism principle and proposed other design ideas in its place.
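As a rough preview of how such an optimistic algorithm might look for finite MDPs, the sketch below runs value iteration on the empirical model while adding a count-based exploration bonus (in the spirit of UCB-VI-style methods). The form of the bonus, the constant c and all names are illustrative assumptions, not the construction analyzed in these notes.

```python
import numpy as np

def optimistic_q_values(counts, reward_sums, trans_counts, H, c=1.0):
    """Optimistic value iteration on an empirical model with count-based bonuses.

    counts[s, a]       -- number of visits to the state-action pair (s, a)
    reward_sums[s, a]  -- sum of rewards observed at (s, a); rewards assumed in [0, 1]
    trans_counts       -- array of shape (S, A, S) with next-state visit counts
    Returns optimistic Q-values of shape (H, S, A).
    """
    S, A = counts.shape
    n = np.maximum(counts, 1)                            # avoid division by zero
    r_hat = reward_sums / n                              # empirical mean rewards
    p_hat = trans_counts / n[:, :, None]                 # empirical transition probabilities
    bonus = c * H * np.sqrt(np.log(np.e * n.sum()) / n)  # illustrative exploration bonus

    Q = np.zeros((H, S, A))
    V = np.zeros(S)                                      # value after the final step is zero
    for h in reversed(range(H)):
        # optimistic Bellman backup, clipped to the largest achievable return H - h
        Q[h] = np.minimum(r_hat + bonus + p_hat @ V, H - h)
        V = Q[h].max(axis=1)
    return Q

def optimistic_policy(Q):
    """Greedy policy with respect to the optimistic Q-values; shape (H, S)."""
    return Q.argmax(axis=2)
```

In an online loop, the learner would recompute these quantities from the updated counts at the start of every episode and act greedily within the episode; the precise construction and its regret analysis are the subject of the next lecture.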
Notes
Other Exploration Techniques
Some other notable exploration strategies are:
- Phased-Elimination and Experimental Design
- Thompson Sampling
- Information-Directed Sampling (IDS) and Estimation-To-Decisions (E2D)
References
The details behind how to convert between regret and PAC bounds are given in:
- Dann, C., Lattimore, T., & Brunskill, E. (2017). Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30. [link]