Reinforcement Learning
Overview
Reinforcement Learning is a learning framework which helps us account for future reward by "encoding" the value of the next state (and / or actions) in the value of the current state (and / or actions), thus turning the greedy "strategy" wrt. this value function we're creating into the optimal "strategy".
Notation
- Capital letter = random variable, e.g. $S_t$ is the random variable which can take on all possible states at step $t$
- Weird capital letters = spaces of values that the random variables can take on, e.g. $\mathcal{R}$ is the reward space / function
- Lowercase = specific values the random variables take on, e.g. $s$, $a$, $r$
- $*$ denotes optimality for some function, e.g. $v_*$, $q_*$
- $\pi$ - policy
- $a$ - action
- $s$ - state
- $r$ - reward
- $V(s)$ - value function for some state $s$
- $G_t$ - accumulated reward (return) at step $t$
Markov Decision Process
Markov Reward Process (MRP)
See p. 10 in Lecture 2: MDP.
Markov Decision Process (MDP)
See p. 24 in Lecture 2: MDP.
Optimality in MDP
Learning performance
Asymptotic convergence: $\pi_t \to \pi^*$ as $t \to \infty$.
PAC learning: with high probability ($\geq 1 - \delta$), the number of steps on which the policy is more than $\epsilon$ from optimal is bounded by a polynomial,
for some $\epsilon, \delta > 0$.
Regret bound:
$$\text{Regret}(T) = \sum_{t=1}^{T}\big(V^* - V^{\pi_t}\big) \leq f(T)$$
for some sublinear $f(T)$.
- Assumes oracle exists
- Cumulative regret doesn't tell you anything about each of the individual per-step regrets
Partially Observable Markov Decision Process (POMDP)
See p. 51 in Lecture 2: MDP.
Planning by Dynamic Programming
Optimal state-value function:
$$v_*(s) = \max_\pi v_\pi(s)$$
The Bellman equation defines $v_\pi$ recursively:
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\Big(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s')\Big)$$
and then the Bellman optimality equation:
$$v_*(s) = \max_{a}\Big(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s')\Big)$$
Policy evaluation
Problem: evaluate a given policy $\pi$.
Solution: iterative application of the Bellman expectation backup.
- Initial value function is 0 everywhere
- Using synchronous backups
- Each iteration
- For all states
- Update $v_{k+1}(s)$ from $v_k(s')$
In short, when updating the value function for state $s$, we use the Bellman equation and the previous value function for all "neighboring" states (the states we can end up in with all the actions we can take).
$v_k$ refers to the $k$-th iteration of the iterative policy evaluation; a minimal sketch follows below.
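To make the backup concrete, here is a minimal sketch of iterative policy evaluation on a small tabular MDP; the transition tensor `P[a, s, s']`, the reward matrix `R[a, s]` and the uniform policy `pi` are made-up placeholders, not anything from the lectures.

```python
import numpy as np

# Hypothetical tabular MDP: P[a, s, s'] = transition prob., R[a, s] = expected reward.
n_states, n_actions, gamma = 4, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = np.random.randn(n_actions, n_states)
pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform random policy

def policy_evaluation(pi, P, R, gamma, tol=1e-8):
    v = np.zeros(n_states)                  # initial value function is 0 everywhere
    while True:
        # Bellman expectation backup for all states (synchronous)
        q = R + gamma * P @ v               # q[a, s]
        v_new = np.einsum('sa,as->s', pi, q)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

v_pi = policy_evaluation(pi, P, R, gamma)
```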
Policy iteration
Problem: given a policy $\pi$, find a better policy $\pi'$.
- Evaluate the policy $\pi$, obtaining $v_\pi$
- Improve the policy by acting greedily wrt. $v_\pi$: $\pi' = \text{greedy}(v_\pi)$
This will eventually converge to the optimal policy due to the partial ordering of policies, $\pi' \geq \pi$ (a rough sketch follows below).
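A rough sketch of the full loop, reusing the hypothetical `policy_evaluation`, `P` and `R` from the snippet above; deterministic policies are encoded as one-hot rows.

```python
def policy_iteration(P, R, gamma):
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from the random policy
    while True:
        v = policy_evaluation(pi, P, R, gamma)     # evaluate the current policy
        q = R + gamma * P @ v                      # q[a, s]
        greedy = np.argmax(q, axis=0)              # act greedily wrt. q
        pi_new = np.eye(n_actions)[greedy]         # one-hot rows: pi_new[s, a]
        if np.array_equal(pi_new, pi):             # policy stable => optimal
            return pi, v
        pi = pi_new

pi_star, v_star = policy_iteration(P, R, gamma)
```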
Value iteration
Problem
Find the optimal policy given the dynamics of the system (rewards and environment state transitions)
Solution
Iterative application of Bellman optimality backup.
- Initialize the value function, e.g. $v_1(s) = 0$ for all $s$
- For each iteration $k + 1$
- For all states $s$
- Update $v_{k+1}(s)$ from $v_k(s')$
Unlike policy iteration, there is no explicit policy here.
At each iteration, for all states we take the currently optimal action (i.e. optimal wrt. the current iteration's value function), and then update the value for this state.
Eventually $v_k$ will converge to the true optimal value function $v_*$ for some states, and since all states are coupled through the recursive relationship from the Bellman equation, the other states will also start to converge.
In the form of the Bellman equation we can write the iterative process as:
$$v_{k+1}(s) = \max_{a \in \mathcal{A}}\Big(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s')\Big)$$
Or in matrix notation:
$$\mathbf{v}_{k+1} = \max_{a \in \mathcal{A}}\big(\mathcal{R}^a + \gamma \mathcal{P}^a \mathbf{v}_k\big)$$
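A corresponding value-iteration sketch, again reusing the made-up `P`, `R`, `n_states` and `gamma` from the policy-evaluation snippet; note there is no explicit policy until we extract one at the very end.

```python
def value_iteration(P, R, gamma, tol=1e-8):
    v = np.zeros(n_states)
    while True:
        # Bellman optimality backup: v_{k+1}(s) = max_a (R + gamma * P v_k)
        v_new = np.max(R + gamma * P @ v, axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    # extract a greedy policy from the converged values
    greedy_actions = np.argmax(R + gamma * P @ v, axis=0)
    return v, greedy_actions

v_opt, greedy_actions = value_iteration(P, R, gamma)
```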
Summary
Problem | Bellman equation | Algorithm
---|---|---
Prediction | Bellman expectation equation | Iterative policy evaluation
Control (maximize reward) | Bellman expectation equation + greedy policy improvement | Policy iteration
Control | Bellman optimality equation | Value iteration
Extensions
Asynchronous Dynamic Programming
Prioritized sweeping
- Decide which states to update
- Use the absolute Bellman error $\big|v_{k+1}(s) - v_k(s)\big|$ (difference between the Bellman-equation update and the current value) to guide our selection of which states to update
Real-time dynamic programming
- Idea: only update the states that are relevant to the agent
- Randomly sample trajectories by letting the agent act in the environment
- After each time-step $(S_t, A_t, R_{t+1})$
- Backup the state $S_t$
Model-free Prediction - policy-evaluation
What if we don't know the Markov Decision Process, i.e. how the environment works?
This section focuses only on evaluating the value function $v_\pi$ for some given policy $\pi$. This policy does not have to be the optimal policy. ONE STEP AT A TIME BOIS!
Monte-Carlo Learning
Notation
- $N(s)$ - # of times we've seen state $s$ in this episode
- $S(s)$ - sum of the returns received from the times we've been in state $s$ in this episode
Overview
Value is equal to the avg. return.
- Learns from complete episodes, i.e. we need to reach some termination
- Caveat: can only apply MC to episodic MDPs (all episodes must terminate)
We assume $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$, and so all we try to do is find the avg. accumulated reward starting from our current state $s$ at step $t$.
To obtain this average, we sample "mini-episodes" starting from our current state and average the accumulated reward over these samples.
First-Visit MC policy evaluation
The first time-step $t$ at which state $s$ is visited in a (sampled) episode:
- Increment counter $N(s) \leftarrow N(s) + 1$
- Increment total return $S(s) \leftarrow S(s) + G_t$ ($S$ being the sum in this case, not a state)
- Value is estimated by the mean return, $V(s) = S(s) / N(s)$
- By the law of large numbers, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$
Instead of using the total sums, we can take advantage of the recursive mean formula for online-learning.
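For reference, this is the incremental mean identity being referred to (just algebra, nothing RL-specific):
$$\mu_k = \frac{1}{k}\sum_{i=1}^{k} x_i = \mu_{k-1} + \frac{1}{k}\big(x_k - \mu_{k-1}\big)$$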
Every-Visit MC policy evaluation
Same as First-Visit MC policy evaluation, but if we have loops and end up in the same state multiple times in the same episode, we perform the steps above on every visit (not just the first).
Incremental MC
See p. 11 in Lecture 4.
Update is now:
$$V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)$$
where $\alpha = \frac{1}{N(S_t)}$ in this case. The above is basically a general update rule which we will see recurring in other methods.
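A small sketch of first-visit MC prediction using that incremental update; the episode format (a list of `(state, reward)` pairs per complete rollout) is just an assumed convention.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """episodes: iterable of [(state, reward), ...] lists from complete rollouts."""
    N = defaultdict(int)       # visit counts N(s)
    V = defaultdict(float)     # value estimates V(s)
    for episode in episodes:
        # compute the return G_t for every step by summing backwards
        G, returns = 0.0, []
        for _, r in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        seen = set()
        for (s, _), G_t in zip(episode, returns):
            if s in seen:                    # first-visit: only the first occurrence counts
                continue
            seen.add(s)
            N[s] += 1
            V[s] += (G_t - V[s]) / N[s]      # incremental mean update
    return V
```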
Temporal-Difference Learning
Notation
- TD-target - $R_{t+1} + \gamma V(S_{t+1})$, i.e. the value we want to update $V(S_t)$ towards. Thus we update $V(S_t)$ to better account for the value of the next state.
- TD-error - $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$, i.e. the difference between the TD-target and the current estimate.
Overview
- Learn directly from experience
- No knowledge of MDP transitions / rewards, i.e. model-free
- Learns from incomplete episodes, by bootstrapping (different from MC)
- Updates a guess towards a guess
- TD attempts to "encode" a fraction of the value of the next state into the value of the current state $V(S_t)$ incrementally
Update value $V(S_t)$ towards the estimated return $R_{t+1} + \gamma V(S_{t+1})$:
$$V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big)$$
- TD-target: $R_{t+1} + \gamma V(S_{t+1})$
- TD-error: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
The difference between the MC and TD approaches is that MC considers the actual return obtained from a (complete) episode, while TD uses some estimated return at the next step to obtain the value for the current step.
TD is thus a much more flexible approach than MC.
Check out p.15 in Lecture 4 for a really nice example of the difference!
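For comparison, a TD(0) prediction sketch that learns from individual transitions instead of complete episodes; the `(s, r, s_next, done)` tuple format is again an assumed convention.

```python
from collections import defaultdict

def td0(transitions, alpha=0.1, gamma=1.0):
    """transitions: iterable of (s, r, s_next, done) tuples, in the order experienced."""
    V = defaultdict(float)
    for s, r, s_next, done in transitions:
        target = r if done else r + gamma * V[s_next]   # TD-target
        V[s] += alpha * (target - V[s])                 # move V(s) towards the target
    return V
```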
Bias / Variance between MC and TD
- The return $G_t$ is an unbiased estimate of $v_\pi(S_t)$
- If we had the true value function, the TD-target $R_{t+1} + \gamma v_\pi(S_{t+1})$ would also be unbiased, BUT it ain't so!
- The TD-target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_\pi(S_t)$, but has lower variance
- MC's evaluation depends on many random actions, transitions, rewards
- TD's evaluation depends only on one random action, transition, reward
Pros and cons
- MC has high variance, zero bias
- Good convergence properties
- (also for function-approximations)
- Not very sensitive to initial value
- Minimizes the mean-squared error (p. 24 Lecture 4)
- Ignores the Markov property
- Slower when Markov property is present
- Works even though there is no Markov property
- Uses sampling
- Good convergence properties
- TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to $v_\pi(s)$
- (not always for function-approximations)
- More sensitive to initial value
- Converges to a solution of maximum likelihood Markov model (p. 24 Lecture 4)
- Takes advantage of the Markov property
- Faster when this is present
- Does not work as well when Markov property is absent
- Uses bootstrapping AND sampling
DO check out p. 26-28 in Lecture 4! It shows some great visualizations of the different methods and how they do look-aheads / backups.
TD(λ)
- Sweet-spot between TD- and MC-learning
- What if, instead of just considering the very next reward, we consider the next $n$ rewards!
- But we would also like to weight the more immediate reward estimates more heavily, compared to ones found $n$ steps into the future.
Let's define the n-step return as:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$
The update is then:
$$V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^{(n)} - V(S_t)\big)$$
And we get, as Hannah Montana would say, the best of both worlds! (TM)
The issue now is to find the optimal $n$.
To handle this, we would like to somehow take the average of the returns for the different $n$. And so TD(λ) is born!
Forward-view TD(λ)
- The λ-return $G_t^\lambda$ combines all n-step returns $G_t^{(n)}$
- Using weight $(1 - \lambda)\lambda^{n-1}$:
$$G_t^\lambda = (1 - \lambda)\sum_{n=1}^{\infty}\lambda^{n-1} G_t^{(n)}, \qquad V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^\lambda - V(S_t)\big)$$
- Why geometric weighting, and not some other weighting?
- Computationally efficient
- Stateless => no more expensive than TD(0)
- Issue is now that we have to wait until the end of the episode (since we're summing to infinity), which is the same problem we had with MC
- Luckily, by taking on an additional view of the problem, backward-view TD(λ), we can update from incomplete episodes
Backward-view TD(λ)
- Keep an eligibility trace $E_t(s)$ for every state $s$
- Update value $V(s)$ for every state $s$
Update happens in proportion to the TD-error $\delta_t$ and the eligibility trace $E_t(s)$:
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
$$V(s) \leftarrow V(s) + \alpha\,\delta_t\,E_t(s)$$
Eligibility traces
Eligibility traces combine the most frequent and the most recent states:
$$E_0(s) = 0, \qquad E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$$
My intuition regarding this is as follows:
- Gives us a way of measuring "responsibility" for the TD-error
- If we at some time-step $t$ end up in state $s$, and then at some later time-step $t + k$ end up in the same state $s$, we can view the eligibility trace of $s$ at $t + k$ as measuring how much of the error in the value function is due to what we saw $k$ steps back.
See p.40 in Lecture 4 for an example.
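A sketch of backward-view TD(λ) with accumulating eligibility traces, using the same assumed transition format as the TD(0) sketch above.

```python
from collections import defaultdict

def td_lambda_backward(transitions, alpha=0.1, gamma=1.0, lam=0.9):
    V = defaultdict(float)
    E = defaultdict(float)                     # eligibility traces
    for s, r, s_next, done in transitions:
        E[s] += 1.0                            # accumulating trace (E[s] = 1.0 gives a replacing trace)
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]                  # TD-error
        for state in list(E):
            V[state] += alpha * delta * E[state]   # every state updated in proportion to its trace
            E[state] *= gamma * lam                # decay all traces
        if done:
            E.clear()                          # traces don't carry over between episodes
    return V
```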
Replacing Eligibility Trace method
The concept is simple; instead of adding the weighted occurrence to the trace, we simply replace it. This way the largest possible contribution we can have is 1.
This is especially useful for cases where we have the possibility of cyclic state-transitions, as multiple occurrences of the same state can lead to incredibly large contributions to this state.
Take this example:
- Start at position 0
- Reward is at position 1
- Try to move left (opposite of reward) 100000 times, staying in position 0 since there is a wall to our left
- Finally take a right
- Receive reward
In this case position 0 would get large amounts of credit even though it might not be the best state to be in.
This can also often lead to numerical (e.g. integer) overflow unless other precautionary measures are taken.
For completeness:
$$E_t(s) = \begin{cases} 1 & \text{if } S_t = s \\ \gamma\lambda E_{t-1}(s) & \text{otherwise} \end{cases}$$
is used instead of $E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$.
Backward-view and forward-view equivalence
If you have a look at p.43 in Lecture 4 and on, there is some stuff about the fact that these views are equivalent.
TD(0) is simply the simple TD-learning we started out with:
$$V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big)$$
TD(1) defers updates until the episode has terminated. This is not actually equivalent to the (online) MC view, just sayin' — only the sum of offline TD(1) updates over an episode matches every-visit MC.
Conclusion
Forward-view provides us with a "theoretical" view of TD(λ), providing us with the intuition.
Backward-view provides us with a practical way to update $V(s)$ for some state $s$, where the error is now a function of the previous visits to $s$ instead of the future updates, as in the forward view.
- Backward-view provides us with a practical way to update any state $s$ at any time-step, by saying that the update for this state depends on a combined weight of the (discounted) number of steps since we last visited $s$ and the frequency of $s$. If we sum the (offline) updates over a complete episode, it turns out that the forward-view and backward-view are equivalent.
I must admit that I'm sort of having a hard time understanding how the forward-view and backward-view are in fact equivalent.
I can understand the intuition behind both views, but it does not seem to be the same intuition. Forward-view:
- TD-error is the difference between the current value and the accumulated (discounted) rewards at all steps in the future
Backward-view:
- Considers all previous updates to $V(s)$ for the state $s$
Both:
- The error between now and some point in the future is of course zero if we never actually see state $s$ again. This also makes sense in the backward-view, where at some "future" time-step the eligibility trace of $s$ has decayed away if we haven't seen $s$ since some time in the past.
Model-free Control - policy-optimization
On-policy
Sarsa
- Uses TD-learning to learn $Q(s, a)$
- Updates $Q(S, A)$ towards $R + \gamma Q(S', A')$ (a minimal sketch follows below):
$$Q(S, A) \leftarrow Q(S, A) + \alpha\big(R + \gamma Q(S', A') - Q(S, A)\big)$$
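A minimal tabular Sarsa sketch; the `env` object with Gym-style `reset()` / `step()` methods and the `n_actions` argument are assumptions, not part of the lecture.

```python
import numpy as np
from collections import defaultdict

def sarsa(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, n_actions=4):
    """env is assumed to expose reset() -> s and step(a) -> (s_next, r, done)."""
    Q = defaultdict(lambda: np.zeros(n_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)          # A' is actually taken (on-policy)
            target = r if done else r + gamma * Q[s_next][a_next]
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q
```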
Sarsa(λ)
Sarsa(1)
- Eligibility trace runs all the way back => how do we deal with loops?
- Cycles of identical state-action pairs will cause updates which lower $Q(s, a)$ until it is no longer the preferred pair.
- But what if $(s, a)$ is in fact the best state-action pair, but also has the largest probability of ending up back in $s$? Then the above point would be highly undesirable!
- This is especially a problem in environments where we don't get a reward until the episode terminates
- In this case, lowering λ (the eligibility decay-factor, or whatever you call it) might help to mitigate this (I think o.O)
Off-policy
Q-Learning
Attempts to estimate a (greedy) target policy while following a different behaviour policy, and uses this to estimate $q_\pi$ or $q_*$.
Notice the difference between Sarsa and Q-learning! In Sarsa the next action $A'$ is observed (actually taken), while in Q-learning the next action is predicted from the greedy target policy, i.e. we bootstrap towards $\max_{a'} Q(S', a')$.
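And the corresponding Q-learning sketch, identical to Sarsa except that the bootstrap target uses the greedy action instead of the action actually taken (same assumed `env` interface as above).

```python
import numpy as np
from collections import defaultdict

def q_learning(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, n_actions=4):
    """Same assumed env interface as the Sarsa sketch above."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # behaviour policy: epsilon-greedy
            a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # target policy: greedy -- bootstrap towards max_a' Q(s', a'), not the action we'll take
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```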
Properties
- Provably converges in the infinite-data limit, assuming every state is visited infinitely often
- In the tabular context, Q-learning is provably sample efficient jin18_is_q_learn_provab_effic
Value-function approximation
Incremental methods
Value function approximation
Monte-Carlo: use the return $G_t$ as the target, i.e. "training data" $\langle S_t,\ G_t\rangle$
TD: use the TD-target, i.e. $\langle S_t,\ R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})\rangle$
TD(λ): use the λ-return, i.e. $\langle S_t,\ G_t^\lambda\rangle$
- λ-return is a biased sample of the true value $v_\pi(s)$
Apply supervised learning to this "training data"
Action-value function approximation
Evaluation / prediction
Want to minimise the mean-squared error (MSE):
$$J(\mathbf{w}) = \mathbb{E}_\pi\big[\big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big)^2\big]$$
Using SGD to find a local minimum:
$$\Delta\mathbf{w} = \alpha\big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big)\nabla_\mathbf{w}\hat{v}(S, \mathbf{w})$$
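A sketch of semi-gradient TD(0) with a linear value function, where the gradient of $\hat{v}(s, \mathbf{w}) = \phi(s)^\top\mathbf{w}$ is simply $\phi(s)$; the feature map `phi` and the transition format are assumptions.

```python
import numpy as np

def linear_td0(transitions, phi, n_features, alpha=0.01, gamma=1.0):
    """Semi-gradient TD(0) with a linear value function v_hat(s, w) = phi(s) @ w.
    `phi` is an assumed feature map: state -> np.ndarray of length n_features."""
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        v_s = phi(s) @ w
        v_next = 0.0 if done else phi(s_next) @ w
        target = r + gamma * v_next               # TD-target substituted for v_pi(S)
        # gradient of the linear v_hat is phi(s); we do NOT differentiate the target (semi-gradient)
        w += alpha * (target - v_s) * phi(s)
    return w
```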
Control
Simply take what we found in Evaluation / prediction and substitute a target (e.g. the return $G_t$, the TD-target, or the λ-return) for the true $q_\pi(S, A)$.
Batch methods
The idea is to "re-use" data, or rather, make the most use of the data we have, instead of updating once using each point as previously described.
Experience replay
See p.35 in Lecture 6.
But it's basically just this:
- Sample a transition (or mini-batch) from our stored observations
- Apply SGD update
- Converges to least squares solution
Deep Q-Networks (DQN)
See p.40 or so in Lecture 6 for this awesomesauce. It uses:
- Experience replay
- Randomization decorrelates the data => good!
- Fixed Q-targets
- Freeze network, and then during the next training-cycle we bootstrap towards the frozen network (just old parameters)
- Every `n` steps we freeze the parameters; for the next `n` steps we use the frozen parameters' predictions as targets. After those `n` steps we freeze our new parameters, and repeat.
- More stable updates
The points above are what make this non-linear function approximation stable.
What they did in the Atari paper is to have the outputs of the ConvNet be one Q-value per possible action. This allows easy maximization over the actions, and is not too hard to model as long as the action-space isn't too large.
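A toy numpy sketch of the two mechanisms (replay buffer + frozen target parameters), using a linear Q-function and a random made-up environment rather than the actual Atari setup:

```python
import numpy as np
import random
from collections import deque

n_features, n_actions = 4, 2
W = np.zeros((n_actions, n_features))          # online parameters
W_target = W.copy()                            # frozen target parameters
replay = deque(maxlen=10_000)                  # experience replay buffer
gamma, alpha, sync_every, batch_size = 0.99, 0.01, 100, 32

def q_values(W, s):
    return W @ s                               # Q(s, .) for all actions

s = np.random.randn(n_features)
for step in range(1, 1001):
    # epsilon-greedy behaviour policy
    a = np.random.randint(n_actions) if np.random.rand() < 0.1 else int(np.argmax(q_values(W, s)))
    # toy "environment": random reward and next state
    r, s_next, done = np.random.randn(), np.random.randn(n_features), np.random.rand() < 0.05
    replay.append((s, a, r, s_next, done))
    s = np.random.randn(n_features) if done else s_next

    if len(replay) >= batch_size:
        for (si, ai, ri, sni, di) in random.sample(list(replay), batch_size):
            # bootstrap towards the *frozen* network's estimate
            target = ri if di else ri + gamma * np.max(q_values(W_target, sni))
            td_error = target - q_values(W, si)[ai]
            W[ai] += alpha * td_error * si     # SGD step on the online parameters only

    if step % sync_every == 0:
        W_target = W.copy()                    # freeze current parameters as the new targets
```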
Linear Least Squares prediction
See p. 43 or so for a closed-form solution to the linear case. This is for prediction, but something similar exists for control too.
Policy gradient methods
Notation
- $J(\theta)$ - any policy objective function
- $\Delta\theta = \alpha\nabla_\theta J(\theta)$ - the policy update
Introduction
Previously we've considered the prediction and control of the value function, but now we're instead going to directly model the optimal policy! That is, we directly parametrise the policy $\pi_\theta(s, a) = \mathbb{P}[a \mid s; \theta]$.
To be able to optimize the policy, we first need to come up with some policy objective function.
Policy Object Functions
Goal: given a policy $\pi_\theta(s, a)$ with parameters $\theta$, find the best $\theta$.
Need to measure the quality of $\pi_\theta$:
In an episodic environment with some initial state $s_1$, we use the start value:
$$J_1(\theta) = V^{\pi_\theta}(s_1)$$
i.e. judge the policy by the expected reward received when starting in state $s_1$.
In continuing environments we can use the average value:
$$J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)$$
Or the average reward per time-step:
$$J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s)\sum_a \pi_\theta(s, a)\,\mathcal{R}_s^a$$
which says "there is some probability of being in state $s$, under our policy there is some probability I choose action $a$ in this state, and then I receive reward $\mathcal{R}_s^a$ at this step". So the above is the expected reward for each time-step.
Where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain induced by $\pi_\theta$, i.e. the probability of being in state $s$.
Comparison with Value-based methods
Advantages
- Better convergence properties
- Effective in high-dimensional or continuous action space
- This is due to the fact that in value-based methods we need to maximize over the actions to obtain the "value" for a next state. With high-dimensional or continuous action space it might be very hard to perform this maximization
- Can learn stochastic policies
- Stochastic policies simply mean policies which define a probability distribution over the actions rather than a deterministic choice. Very useful when the environment itself is stochastic (i.e. not deterministic).
- An example is "Rock-Paper-Scissors", where the optimal behavior is a uniform distribution over the actions.
Disadvantages
- Typically converge to local optimum rather than global
- Policy evaluation is usually inefficient using Policy Gradient Methods, compared to using Value-function methods.
Finite Difference Policy Gradient
- To evaluate the policy gradient of $\pi_\theta(s, a)$
- Does so through the definition of the derivative, for each dimension of $\theta$ (see p. 13 in Lecture 7 for details)
- Noisy and inefficient in high dimensions
Monte-Carlo Policy Gradient
- Allows computing the gradient analytically
- Assume the policy $\pi_\theta$ is differentiable whenever it is non-zero
- And that we know the gradient $\nabla_\theta \pi_\theta(s, a)$
Score function: $\nabla_\theta \log \pi_\theta(s, a)$,
where we simply ignore the $\pi_\theta(s, a)$ term from the likelihood ratio (it becomes part of the expectation over $\pi_\theta$), due to the fact that optimizing wrt. $\log \pi_\theta(s, a)$ also optimizes $\pi_\theta(s, a)$.
Likelihood ratios
$$\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a)\,\frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)} = \pi_\theta(s, a)\,\nabla_\theta \log \pi_\theta(s, a)$$
Simply due to the identity $\nabla \log f = \frac{\nabla f}{f}$.
Example score-function: Softmax Policy
- Actions weighted by a linear combination of features $\phi(s, a)^\top\theta$
- Assume the policy is softmax: $\pi_\theta(s, a) \propto e^{\phi(s, a)^\top\theta}$
Score function is then
$$\nabla_\theta \log \pi_\theta(s, a) = \phi(s, a) - \mathbb{E}_{\pi_\theta}[\phi(s, \cdot)]$$
Example score-function: Gaussian Policy
- Policy is Gaussian, $a \sim \mathcal{N}(\mu(s), \sigma^2)$, with mean a linear combination of features, $\mu(s) = \phi(s)^\top\theta$
Score function is then
$$\nabla_\theta \log \pi_\theta(s, a) = \frac{\big(a - \mu(s)\big)\,\phi(s)}{\sigma^2}$$
One-Step MDPs
- Starting in state $s \sim d(s)$
- Terminating after one time-step with reward $r = \mathcal{R}_{s, a}$
See p. 19 in Lecture 7 for how to compute the policy gradient, OR try it yourself from the objective functions described earlier!
Policy Gradient Theorem
Simply replaces the immediate reward $r$ in the one-step MDP with the long-term action-value $Q^{\pi_\theta}(s, a)$.
For any differentiable policy $\pi_\theta(s, a)$ and for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1-\gamma}J_{avV}$, the policy gradient is
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a)\; Q^{\pi_\theta}(s, a)\big]$$
The algorithm
- Update parameters $\theta$ by stochastic gradient ascent
- Using the policy gradient theorem
Using the return $v_t$ ($= G_t$) as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$:
$$\Delta\theta_t = \alpha\,\nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$$
Remember this is an MC method => we need to sample an entire episode before we can learn.
See p. 21 in Lecture 7 for pseudo-code; a rough sketch follows below.
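A rough REINFORCE sketch with a linear softmax policy; the feature / action dimensions and the episode data are made up.

```python
import numpy as np

n_features, n_actions = 4, 3
theta = np.zeros((n_actions, n_features))
alpha, gamma = 0.01, 0.99

def policy(s):
    prefs = theta @ s                      # one preference per action
    prefs -= prefs.max()                   # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def grad_log_pi(s, a):
    # score function for a softmax policy: phi(s,a) - E_pi[phi(s,.)]
    p = policy(s)
    grad = -np.outer(p, s)
    grad[a] += s
    return grad

def reinforce_update(episode):
    """episode: list of (state, action, reward) tuples from one complete rollout."""
    global theta
    G = 0.0
    # iterate backwards to accumulate the return G_t
    for s, a, r in reversed(episode):
        G = r + gamma * G
        theta += alpha * grad_log_pi(s, a) * G

# usage with made-up data:
episode = [(np.random.randn(n_features), np.random.randint(n_actions), np.random.randn())
           for _ in range(20)]
reinforce_update(episode)
```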
Actor-Critic Policy Gradient
- Actor-Critic learns both value function and policy
- MC policy gradient still has high variance
- Use a critic to estimate the action-value function, $Q_w(s, a) \approx Q^{\pi_\theta}(s, a)$
- Actor-critic algorithms maintain two sets of parameters
- Critic
- updates action-value function parameters $w$
- Actor
- updates policy parameters $\theta$, in the direction suggested by the critic
That is: the critic handles estimating $Q_w(s, a)$, and the actor then asks the critic what the value of the state-action pair it just observed / took is, and updates the policy parameters according to this value:
$$\Delta\theta = \alpha\,\nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)$$
So we have exactly the same update, but now replacing the true long-term action-value function $Q^{\pi_\theta}(s, a)$ with an approximation $Q_w(s, a)$.
Estimating the action-value function
Want to know how good the current policy is for the current parameters $\theta$, hence we need to do policy evaluation
- Monte-Carlo policy evaluation
- Temporal-Difference evaluation
- TD(λ)
Algorithm
See the pseudo-code on p. 25 in Lecture 7.
But sort of what it does is (at each step; a rough sketch in code follows the list):
- (actor) Sample reward $r = \mathcal{R}_s^a$; sample transition $s' \sim \mathcal{P}_s^a$
- (actor) Sample action $a' \sim \pi_\theta(s', \cdot)$
- (critic) Compute TD-error $\delta = r + \gamma Q_w(s', a') - Q_w(s, a)$
- (actor) Update policy parameters $\theta \leftarrow \theta + \alpha\,\nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)$
- (critic) Update weights $w$ for $Q_w$, e.g. $w \leftarrow w + \beta\,\delta\,\phi(s, a)$ for a linear critic
- (actor) $a \leftarrow a'$, $s \leftarrow s'$
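The promised rough sketch of one such step (QAC), with a linear critic and a linear softmax actor; all dimensions and learning rates are made-up placeholders.

```python
import numpy as np

n_features, n_actions = 4, 3
theta = np.zeros((n_actions, n_features))   # actor parameters
w = np.zeros((n_actions, n_features))       # critic parameters: Q_w(s, a) = w[a] @ s
alpha_actor, alpha_critic, gamma = 0.01, 0.05, 0.99

def policy(s):
    prefs = theta @ s
    prefs -= prefs.max()
    p = np.exp(prefs)
    return p / p.sum()

def grad_log_pi(s, a):
    p = policy(s)
    g = -np.outer(p, s)
    g[a] += s
    return g

def qac_step(s, a, r, s_next, done):
    """One QAC update for a single observed transition; returns a' to take next."""
    global theta, w
    a_next = np.random.choice(n_actions, p=policy(s_next))
    q_sa = w[a] @ s
    q_next = 0.0 if done else w[a_next] @ s_next
    delta = r + gamma * q_next - q_sa                   # critic TD-error
    theta += alpha_actor * grad_log_pi(s, a) * q_sa     # actor: follow the critic's estimate
    w[a] += alpha_critic * delta * s                    # critic: TD(0) update
    return a_next
```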
Compatible Function Approximation
If the following two conditions are satisfied:
Value function approximator is compatible with the policy:
$$\nabla_w Q_w(s, a) = \nabla_\theta \log \pi_\theta(s, a)$$
Value function parameters $w$ minimise the MSE:
$$\varepsilon = \mathbb{E}_{\pi_\theta}\big[\big(Q^{\pi_\theta}(s, a) - Q_w(s, a)\big)^2\big]$$
Then the policy gradient is exact:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)\big]$$
Reducing variance using a Baseline
- Subtract a baseline function $B(s)$ from the policy gradient
- Can reduce variance without changing the expectation
- Basically "centering" (subtracting the mean)
See p. 30 in Lecture 7 for a proof of why it doesn't change the expectation. Basically, we can add or subtract anything we want from $Q^{\pi_\theta}(s, a)$ as long as it only depends on the state $s$.
Using $B(s) = V^{\pi_\theta}(s)$ we are left with something called the advantage function:
$$A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$$
Estimating Advantage function
There are multiple ways of doing this.
See p. 30 in Lecture 7 and onwards for some of them. The most used one is nr. 2.
Summary
Policy gradient has many equivalent forms:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a)\; v_t\big] \quad \text{(REINFORCE)}$$
$$= \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a)\; Q_w(s, a)\big] \quad \text{(Q Actor-Critic)}$$
$$= \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a)\; A_w(s, a)\big] \quad \text{(Advantage Actor-Critic)}$$
$$= \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a)\; \delta\big] \quad \text{(TD Actor-Critic)}$$
All these methods are basically saying: move the policy in the direction that gets us more value.
- Each leads to a stochastic gradient ascent algorithm
- Critic uses policy evaluation (e.g. MC or TD learning) to estimate $Q^\pi(s, a)$, $A^\pi(s, a)$, or $V^\pi(s)$
Thompson sampling
Notation
- $1, \dots, K$ - actions (arms)
- $\theta_1, \dots, \theta_K$ - the probabilities of success for the corresponding action / arm
- $G = (V, E)$ - denotes a graph with vertices $V$ and edges $E$
Historical data
- $r_t$ - denotes the reward obtained by taking action $x_t$ leading to the observed outcome $y_t$
Per-period regret is the difference between the expected reward using the optimal action and the reward received by the current action. In the armed bandit problem: $\text{regret}_t = \max_k \theta_k - \theta_{x_t}$.
- $p(y_t \mid x_t, \theta)$ - the conditional distribution of the system (i.e. what we want to learn)
Armed bandit
- Suppose there are $K$ actions, and when played, any action yields either a success or a failure.
- Action $k \in \{1, \dots, K\}$ produces a success with probability $\theta_k$.
- The success probabilities $\theta_1, \dots, \theta_K$ are unknown to the agent, but are fixed over time, and can therefore be learned by experimentation
- Objective: maximize the number of successes over $T$ periods, where $T$ is relatively large compared to the number of arms $K$
Stuff
- Posterior sampling and probability matching
- Methods such as ε-greedy attempt to balance exploration and exploitation by choosing a random action with probability ε and a greedy step with probability 1 − ε
- Inefficient since it will sample actions regardless of how unlikely they are to be optimal; Thompson sampling attempts to address this
- Useful for online decision problems
The difference between a greedy algorithm and Thompson sampling is that a greedy algorithm would use the expectation of $\theta$ under the current (posterior) estimate to choose its action, while Thompson sampling samples a new estimate of $\theta$ from the posterior and acts greedily wrt. that sample.
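A minimal Beta-Bernoulli Thompson sampling sketch for the K-armed bandit described above; the true success probabilities are random placeholders.

```python
import numpy as np

K, T = 10, 1000
true_theta = np.random.rand(K)      # unknown success probabilities (placeholder)
alpha = np.ones(K)                  # Beta posterior parameters: successes + 1
beta = np.ones(K)                   # Beta posterior parameters: failures + 1

successes = 0
for t in range(T):
    theta_hat = np.random.beta(alpha, beta)    # sample one estimate per arm
    k = np.argmax(theta_hat)                   # act greedily wrt. the sample
    reward = np.random.rand() < true_theta[k]  # Bernoulli outcome
    alpha[k] += reward                         # update the posterior of arm k
    beta[k] += 1 - reward
    successes += reward
```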
Book
Chapter 2
Stationary Problem
- N-armed bandit problem
- Action-rewards do not change over time but stay constant
- All we've got to do then is to model the average reward for each action, i.e.
$$Q_t(a) = \frac{R_1 + R_2 + \dots + R_{N_t(a)}}{N_t(a)}$$
where $N_t(a)$ is the number of times we've chosen action $a$ up until step $t$.
And of course, practically it's better to compute this using the recursive formula for the mean:
$$Q_{k+1}(a) = Q_k(a) + \frac{1}{k+1}\big(R_{k+1} - Q_k(a)\big)$$
where $Q_k(a)$ is the value for some action $a$ after choosing and receiving reward from this action $k$ times.
Non-stationary Problem
- Makes sense to weight the most recent observations more heavily than "old" ones
- We write $\alpha \in (0, 1]$ instead of $\frac{1}{k}$ to be a bit more general: $Q_{k+1} = Q_k + \alpha\big(R_{k+1} - Q_k\big)$
- By recursively expanding $Q_k$ we obtain
$$Q_{k+1} = (1 - \alpha)^k Q_1 + \sum_{i=1}^{k} \alpha(1 - \alpha)^{k - i} R_i$$
Which is called an exponential recency-weighted average, due to the fact that the weights sum to one, $(1 - \alpha)^k + \sum_{i=1}^{k}\alpha(1 - \alpha)^{k - i} = 1$ (can be proven by induction).
Exercises
2.3 - N-Armed bandit ɛ-greedy comparison
Consider the following learning problem. You are faced repeatedly with a choice among `n` different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.
The problem is as follows:
- Choose between `n` actions at each step `t`
- Action-rewards are constant wrt. `t`
- Maximize expected / avg. reward over 1000 steps
To get an idea of how well the algorithms perform in general, we run the above for 2000 different action-rewards, i.e. 2000 differently parametrized environments.
```python
# %matplotlib inline  (notebook-only; don't want this tangled)
import numpy as np
import matplotlib.pyplot as plt

# n-Armed Bandit
# each environment has 10 actions (bandits), and we have 2000 such problems
# allows us to see how algorithms perform over different hidden action-values
n_steps = 1000
n_problems = 2000
armed_testbed = np.random.randn(10, n_problems)
actions = np.arange(10)
steps = np.arange(n_steps)

# we compute the value of action `a` at step `t`
# by the accumulated reward from the times we've
# chosen `a` up until (incl.) `t` and then divided
# by the number of times we've chosen this `a`.
# Basically, the average reward of `a` at `t`
# wrt. times we've chosen `a`.

def epsilon_greedy(Q, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(actions.shape[0])
    return np.argmax(Q)

def run_testbed(choose_action, *args, **kwargs):
    rewards = np.zeros((n_steps, armed_testbed.shape[1]))
    for j in range(armed_testbed.shape[1]):
        environment = armed_testbed[:, j]
        # initialization
        Q = np.zeros(10)
        action_freq = np.zeros_like(actions, dtype=int)   # was np.int (deprecated)
        action_reward = np.zeros(10)                      # float, so rewards aren't truncated
        reward = 0.0
        for t in steps:
            a = choose_action(Q, *args, **kwargs)
            # reward with some noise
            r = environment[a] + np.random.randn()
            # update count and reward for selected action `a`
            action_freq[a] += 1
            action_reward[a] += r
            # update value for action `a`
            Q = action_reward / (action_freq + 1e-5)
            # accumulate reward
            reward += r
            # expected (average) reward up to and including step `t`
            rewards[t, j] = reward / (t + 1)
    # take the avg for each time-step over all problems
    return np.mean(rewards, axis=1)
```
And let's run the testbed for different ε in parallel.
```python
import multiprocessing as mp
from functools import partial

f = partial(run_testbed, epsilon_greedy)
with mp.Pool(3) as p:
    results = p.map(f, [0, 0.01, 0.1])
```
Finally we create the plot!
```python
steps = np.arange(1000)
fig = plt.figure(figsize=(14, 5))
plt.plot(steps, results[0], label="greedy")
plt.plot(steps, results[1], label="ε-greedy 0.01")
plt.plot(steps, results[2], label="ε-greedy 0.1")
plt.title("2000 different 10-bandit problems")
plt.ylabel("Expected $R_t$")
plt.xlabel("$t$")
plt.legend()
plt.show()
```
Definitions
- synchronous updates
- update all states / values / whatever one after the other
- prediction
- given a full MDP and some policy $\pi$, find the value function $v_\pi$ for this policy
- control
- given a full MDP, find optimal (wrt. reward) value function and policy
- backup
- simply the state(s) we "base" our update on for some state; more generally, updating the value of a state using the values of future states
- full backup
- we perform backup for all states
- sweep
- consists of applying a backup operation to each state
- TD-target
- $R_{t+1} + \gamma V(S_{t+1})$, the value we update $V(S_t)$ towards
- TD-error
- $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
- bootstrapping
- update involves an estimate (not the real returns / rewards)
- sampling
- update samples an expectation
- stochastic policy
- policy which uses a distribution over actions rather than deterministic behavior.
Resources
- Karpathy's "Pong from pixels". Details how to implement a policy-gradient RL-algorithm using a ConvNet for function-approximation to learn to play Pong from the raw image input.