23. Tabular MDPs
In this lecture we will analyze an online learning algorithm for the finite-horizon episodic MDP setting. Let
We will focus on the finite-horizon setting where the learner interacts with the MDP over
Recall that the regret is defined as follows:
where
UCRL: Upper Confidence Reinforcement Learning
The UCRL algorithm implements the optimism principle. For this we need to define a set of plausible models. First, we define the maximum likelihood estimates using data from rounds
The definition makes use of the notation
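As a concrete illustration, here is a minimal Python sketch of the count-based maximum-likelihood estimate; the array layout, the function name, and the uniform fallback for unvisited state-action pairs are choices made for this sketch only.

```python
import numpy as np

def ml_transition_estimate(transitions, n_states, n_actions):
    """Count-based maximum-likelihood estimate of the transition kernel.

    transitions: iterable of (s, a, s_next) triples observed so far.
    Returns (P_hat, N), where P_hat[s, a] is the estimated next-state
    distribution and N[s, a] is the visit count of the pair (s, a).
    """
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    N = counts.sum(axis=-1)                           # visits to each (s, a)
    P_hat = np.full_like(counts, 1.0 / n_states)      # arbitrary fallback when N == 0
    visited = N > 0
    P_hat[visited] = counts[visited] / N[visited, None]
    return P_hat, N
```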
Define the confidence set
where
We want the confidence set to satisfy two points:
- the true transition kernel is contained in the confidence set for all episodes, with probability at least $1-\delta$;
- the confidence set is “not too large”.
The second point will appear formally in the proof; note, however, that from a statistical perspective we want the confidence set to be as efficient as possible.
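For illustration, here is a small sketch of how membership in such an L1 confidence set could be checked; the radius formula below is one standard choice of order $\sqrt{S/N}$, and the constants may differ from those used in the proof below.

```python
import numpy as np

def l1_radius(N, n_states, delta):
    """One standard choice of L1 radius, of order sqrt(S log(1/delta) / N);
    the exact constants in the lecture may differ."""
    return np.sqrt(2.0 * (n_states * np.log(2.0) + np.log(1.0 / delta))
                   / np.maximum(N, 1))

def in_confidence_set(P, P_hat, N, n_states, delta):
    """Check ||P(.|s,a) - P_hat(.|s,a)||_1 <= beta(s,a) for all pairs (s, a)."""
    beta = l1_radius(N, n_states, delta)
    return bool(np.all(np.abs(P - P_hat).sum(axis=-1) <= beta))
```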
With the confidence set, we can now introduce the UCRL algorithm:
UCRL (Upper confidence reinforcement learning):
In episodes
- Compute confidence set
- Use policy
- Observe episode data
Note that we omitted the rewards from the observation data. Since we made the assumption that the reward vector
For now we also gloss over how to compute the optimistic policy
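Schematically, the episode loop of UCRL can be sketched as follows; `env.reset()`, `env.step(a)` and `optimistic_planner` are assumed interfaces, and the planner stands in for the optimistic policy computation that we gloss over for now.

```python
import numpy as np

def ucrl(env, n_states, n_actions, horizon, n_episodes, delta, optimistic_planner):
    """Schematic UCRL loop. The planner is expected to return a policy
    (a map (h, s) -> action) that is optimistic with respect to the
    confidence set built from the current counts."""
    counts = np.zeros((n_states, n_actions, n_states))   # transition counts
    for k in range(n_episodes):
        N = counts.sum(axis=-1)
        P_hat = np.where(N[..., None] > 0,
                         counts / np.maximum(N, 1)[..., None],
                         1.0 / n_states)                 # maximum-likelihood estimate
        policy = optimistic_planner(P_hat, N, delta)
        s = env.reset()
        for h in range(horizon):
            a = policy(h, s)
            s_next = env.step(a)                         # rewards assumed known
            counts[s, a, s_next] += 1
            s = s_next
```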
Step 1: Defining the confidence set
Lemma (L1-confidence set): Let
Then, with probability at least
Proof: Let
The Markov property implies that
Fix some
where in the last line we defined
Next note that
In the last step, we take a union bound over
This follows from the simple observation that
Lastly, the claim follows by noting that
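To get a feeling for the rate in the lemma, the following purely illustrative simulation compares the empirical L1 deviation of a maximum-likelihood estimate of a fixed distribution with a $\sqrt{S/n}$ scaling; the distribution, sample sizes, and comparison constant are choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
S, n, trials = 10, 1000, 2000

p = rng.dirichlet(np.ones(S))             # a fixed next-state distribution
deviations = []
for _ in range(trials):
    samples = rng.choice(S, size=n, p=p)
    p_hat = np.bincount(samples, minlength=S) / n
    deviations.append(np.abs(p_hat - p).sum())

# Compare a high quantile of the L1 deviation to the sqrt(S / n) scaling.
print(np.quantile(deviations, 0.95), np.sqrt(2 * S / n))
```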
Step 2: Bounding the regret
Theorem (UCRL Regret): The regret of UCRL defined with confidence sets
where
Proof: Denote by
Further, let
In what follows we assume that we are on the event
Fix
Note that we used that
The first term is easily bounded. This is the crucial step that makes use of the optimism principle. By
The last term is also relatively easy to control. Denote
The sequence
be the data available to the learner at the beginning of the episode
A sequence of random variables
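For reference, the concentration inequality invoked here is the Azuma–Hoeffding inequality: if $(X_t)_{t=1}^n$ is a martingale difference sequence with $|X_t| \leq c_t$ almost surely, then for any $\varepsilon > 0$,

$$\mathbb{P}\left(\sum_{t=1}^{n} X_t \geq \varepsilon\right) \leq \exp\left(-\frac{\varepsilon^2}{2\sum_{t=1}^{n} c_t^2}\right).$$

The constants may be arranged slightly differently in the form used in this proof.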
It remains to bound term (II) in the regret decomposition:
Using the Bellman equation, we can recursively compute the value function for any policy
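As a concrete illustration of this backward recursion, here is a minimal Python sketch of finite-horizon policy evaluation with a known reward function; the shapes and indexing conventions are choices made for this sketch.

```python
import numpy as np

def evaluate_policy(P, r, pi, horizon):
    """Finite-horizon policy evaluation by backward induction.

    P:  (S, A, S) transition kernel, r: (S, A) known rewards,
    pi: (H, S) deterministic policy. Returns V of shape (H + 1, S),
    with V[horizon] = 0 by convention.
    """
    n_states = P.shape[0]
    V = np.zeros((horizon + 1, n_states))
    for h in range(horizon - 1, -1, -1):
        for s in range(n_states):
            a = pi[h, s]
            V[h, s] = r[s, a] + P[s, a] @ V[h + 1]
    return V
```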
We introduce the following shorthand for the value difference of policy
Let
The first inequality uses that for any two vectors
Telescoping and using that
Note that
It remains to bound term
Lemma:
For any sequence
Proof of Lemma: Let
The claim follows from telescoping.
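As a sanity check, the following snippet verifies one standard form of this lemma numerically, namely $\sum_t a_t / \sqrt{A_t} \leq 2\sqrt{A_n}$ with $A_t = \sum_{s \leq t} a_s$; the variant used in the regret proof may normalize by $A_{t-1}$ instead and carry an extra additive constant.

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(5):
    a = rng.uniform(0.0, 1.0, size=1000)      # arbitrary sequence in [0, 1]
    A = np.cumsum(a)                          # running sums A_t
    lhs = np.sum(a / np.sqrt(A))              # sum_t a_t / sqrt(A_t)
    print(lhs <= 2.0 * np.sqrt(A[-1]))        # telescoping bound, always True
```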
We continue the proof of the theorem, where we need to bound
Next, using the algebraic lemma above and the fact that
The last inequality uses Jensen’s inequality.
Collecting all terms and taking the union bound over two applications of the Azuma–Hoeffding inequality and the event
Unknown reward functions
In our analysis of UCRL we assumed that the reward function is known. While this is quite a common assumption in the literature, it is mainly for simplicity. We also don’t expect the bounds to change by much: Estimating the rewards is not harder than estimating the transition kernels.
To modify the analysis and account for unknown rewards, we first consider the case with deterministic reward function
Embracing the idea of optimism, we define reward estimates
Clearly this defines an optimistic estimate,
When the reward is stochastic, we can use a maximum likelihood estimate of the reward and construct confidence bounds around the estimate. This way we can define an optimistic reward. Still not much changes, as the reward estimates concentrate at the same rate as the estimates of
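A minimal sketch of such an optimistic reward estimate, using a Hoeffding-style bonus on top of the empirical mean; the exact radius and constants are placeholders for this sketch.

```python
import numpy as np

def optimistic_rewards(reward_sums, N, delta):
    """Empirical mean reward plus a Hoeffding-style bonus, clipped to [0, 1].

    reward_sums[s, a]: sum of observed rewards at (s, a); N[s, a]: visit count.
    The bonus width is one standard choice; the constants may differ.
    """
    r_hat = reward_sums / np.maximum(N, 1)
    bonus = np.sqrt(np.log(1.0 / delta) / (2.0 * np.maximum(N, 1)))
    return np.minimum(r_hat + bonus, 1.0)
```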
UCBVI: Upper Confidence Bound Value Iteration
Computing the UCRL policy can be quite challenging. However, we can relax the construction so that we can use backward induction. We define a time-inhomogeneous relaxation of the confidence set:
Let
Note that the maximum in the second line is a linear optimization with convex constraints that can be solved efficiently. Further, the proof of the UCRL regret still applies, because we used the same (step-wise) relaxation in the analysis.
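If the constraint set is the L1 ball from before, this inner maximization even has a simple greedy solution (the same construction that appears in extended value iteration); here is a sketch, where `p_hat` is the estimated next-state distribution and `beta` the L1 radius.

```python
import numpy as np

def max_expected_value_l1(p_hat, V, beta):
    """Maximize <p, V> over probability vectors p with ||p - p_hat||_1 <= beta.

    Greedy solution: shift as much mass as allowed onto the highest-value
    state, then remove the surplus from the lowest-value states.
    """
    p = p_hat.astype(float).copy()
    order = np.argsort(V)                     # states sorted by value, ascending
    best = order[-1]
    p[best] = min(1.0, p[best] + beta / 2.0)
    for s in order[:-1]:                      # remove surplus, lowest values first
        surplus = p.sum() - 1.0
        if surplus <= 0:
            break
        p[s] = max(0.0, p[s] - surplus)
    return float(p @ V), p
```

The total L1 change is at most $\beta/2 + \beta/2 = \beta$, so the returned distribution stays inside the confidence ball.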
We can further relax the backward induction to avoid the optimization over
This leads us to the UCBVI (upper confidence bound value iteration) algorithm. In episode
UCBVI (Upper confidence bound value iteration):
In episodes
- Compute optimistic value function:
- Follow greedy policy
- Observe episode data
Note that we truncate the
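For concreteness, here is a Python sketch of the optimistic backward induction in UCBVI with value truncation; the exploration bonus is left as an input, since its exact form depends on the constants in the confidence bounds.

```python
import numpy as np

def ucbvi_backward_induction(P_hat, r, bonus, horizon):
    """Optimistic value iteration with exploration bonuses.

    P_hat: (S, A, S) estimated kernel, r: (S, A) rewards in [0, 1],
    bonus: (S, A) exploration bonuses. Values are truncated at the horizon,
    since the true optimal value can never exceed it.
    """
    n_states, n_actions = r.shape
    V = np.zeros((horizon + 1, n_states))
    pi = np.zeros((horizon, n_states), dtype=int)
    for h in range(horizon - 1, -1, -1):
        Q = r + bonus + P_hat @ V[h + 1]          # optimistic Q-values, shape (S, A)
        Q = np.minimum(Q, horizon)                # truncate the value
        pi[h] = np.argmax(Q, axis=1)              # greedy policy
        V[h] = np.max(Q, axis=1)
    return V, pi
```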
By more carefully designing the reward bonuses for UCBVI, it is possible to achieve
Notes
References
The original UCRL paper. Notice that they consider the infinite horizon average reward setting, which is different from the episodic setting we present.
Auer, P., & Ortner, R. (2006). Logarithmic online regret bounds for undiscounted reinforcement learning. Advances in Neural Information Processing Systems, 19. [link]
The UCBVI paper. Notice that they consider the homogeneous setting, which is different from the inhomogeneous setting we present.
Azar, M. G., Osband, I., & Munos, R. (2017, July). Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning (pp. 263-272). PMLR. [link]
The paper that presents the lower bound. Notice that they consider the infinite horizon average reward setting. Thus, their results contain a diameter term
Auer, P., Jaksch, T., & Ortner, R. (2008). Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems, 21. [link]