Reinforcement Learning

Table of Contents


Reinforcement Learning is a learning framework where which helps us to account for future reward by "encoding" the value of the next state $S_{t+1}$ (and / or actions) in the value of the current state $S_t$ (and / or actions), thus turning the greedy "strategy", wrt. this value function we're creating, the optimal "strategy".


  • Capital letter = random variable, e.g. $S$ is the random variable which can take on all $s \in \mathcal{S}$
  • Wierd capital letters = spaces of values that the random variables can take on, e.g. reward space / function $\mathcal{R}$
  • Lowercase = specific values the random variables take on, e.g. $\mathbb{E}[S = s]$
  • $f^*$ denotes optimality for some function $f$
  • $\pi$ - policy
  • $a \in \mathcal{A}$ - action
  • $s \in \mathcal{S}$ - state
  • $r \in \mathcal{R}$ - reward
  • $v(s)$ - value function for some state $s$
  • $G_t = \sum_{i=0}^t R_i$ - accumulated reward at step $t$

Markov Decision Process

Markov Reward Process (MRP)

See p. 10 in the Lecture 2: MDP.

Markov Decision Process (MDP)

See p. 24 in the Lecture 2: MDP.

Optimality in MDP

Different settings
  • Finite horizon
    • Suppose we have a finite horizon $h$, i.e. the problem is now

\max \mathbb{E} \bigg[ \sum_{t = 0}^{h} r_t \bigg]
  • Infinite horizon

    In the case of an infinite horizon, we add a discount factor

\max \mathbb{E} \bigg( \sum_{t=0}^{\infty} \gamma^t r_t \bigg)
  • Average-reward
    • Finiite horizon

\max \mathbb{E} \bigg( \frac{1}{h} \sum_{t = 0}^{h} r_t \bigg)
    • Infinite horizon

 \lim_{h \to \infty} \max \mathbb{E} \bigg( \frac{1}{h} \sum_{t = 0}^{h} r_t \bigg)

Learning performance

  • Asymptotic convergence

\pi_n \to \pi^* \quad \text{as} \quad n \to \infty
  • PAC learning:

P \big( N_{\text{errors}} > F(\cdot, \epsilon, \delta) \big) \le \delta

    for some $\epsilon, \delta > 0$.

  • Regret bound

\max_j \sum_{t = 0}^{T} r_{tj} - r_t < B

    for some $B$.

    • Assumes oracle $r_{t_j}$ exists
    • Regret doesn't tell you anything about each of the individual regrets $r_t$

Partial Observable Markov Decision Process (POMDP)

See p. 51 in the Lecture 2: MDP.

Planning by Dynamic Programming

Optimal state-value function:

V^*(s_t) = \max_{\pi} \mathbb{E} \bigg( \sum_{t = 0}^{\infty} \gamma^t r_t \bigg)

The Bellman equation defines recursively:

V^{\pi}(s_t) = R \big( s_t, \pi(s_t) \big) + \gamma \sum_{s_{t + 1}}^{} T \big( s_{t + 1} \mid s_t, \pi(s_t) \big) V^{\pi} (s_{t + 1})

and then the Bellman optimality equation

Policy evaluation

Problem - evaluate a given policy $\pi$ Solution - iterative application of Bellman expectation backup

  1. Initial value function $v_0$ is 0 everywhere
  2. Using synchronous backups
    • Each iteration $k + 1$
    • For all states $s \in S$
    • Update $v_{k+1}(s)$ from $v(s')$

In short, when updating the value function $v_{k+1}$ for state $s$, we use the Bellman equation and the previous value function $v_k$ for all "neighboring" states (the states we can end up it with all the actions we can take).

$v_i$ refers to the $i^{\text{th}}$ iteration of the iterative policy evaluation


Policy iteration

Problem: given a policy $\pi$, find a better policy

  1. Evaluate the polciy $\pi$
  2. Improve the policy by acting greedily wrt. to $v_{\pi}$,

\pi' = \text{greedy} (v_{\pi})

This will eventually converge to the optimal policy $\pi^*$ due to the partial ordering of policies $\pi'(s) \ge \pi, \forall s \in \mathcal{S}$.

Value iteration


Find optimal policy $\pi$ given the dynamics of the systems (rewards and environmental state)


Iterative application of Bellman optimality backup.

  1. Initialize value function $v_0$
  2. For each iteration $k+1$
    • For all states $s \in \mathcal{S}$
    • Update $v_{k+1} (s)$ from $v_k (s')$

Unlike policy iteration, there is no explicity policy here.

At each iteration, for all states $s \in \mathcal{S}$ we make the current optimal action $a$ (i.e. optimal wrt. the current iteration), and then update the value $v$ for this state.

Eventually $v_k(s)$ will converge to the true value function for some $s$, and since all states are dependent through the recursive relationship from the Bellman equation, the other states will also start to converge.

In the form of the Bellman equation we can write the iterative process as:

v_{k+1} (s) = \max_{a \in \mathcal(A)} \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_ {ss'}^a v_k(s') \Big)

Or in matrix notation:

\mathbf{v}_{k+1} = \max_{a \in \mathcal{A}} \Big( \mathbf{\mathcal{R}}^a + \gamma \mathbf{\mathcal{P}}^a \mathbf{v}_k \Big)


Problem Bellman equation Algorithm
Prediction Bellman Expectation equation Iterative, Policy evaluation
Control (maximize reward) Bellman Expectation equation + Greedy Policy improvement Policy iteration
Control Bellman Optimality equation Value iteration


Asynchronous Dynamic Programming

Prioritized sweeping
  • Decide which states to update
  • Use absolute value between Bellman equation update and current $v(s)$ to guide our selection for which states to update
Real-time dynamic programming
  • Idea: only the states that are relevant to agent
  • Randomly sample
  • After each time-step $S_t, A_t, R_t$
  • Backup the state $S_t$

Model-free Prediction - policy-evaluation

What if we don't know the Markov Decision Process, i.e. how the environment works?

This sections focuses only on evaluating the value function $v_\pi$ for some given policy $\pi$. This policy does not have to be the optimal policy. ONE STEP AT THE TIME BOIS!

Monte-Carlo Learning


  • $N(s)$ - # of times we've seen state $s$ in this episode
  • $S(s)$ - sum of values received from the times we've been in state $s$ in this episode
  • value is equal to avg. return
  • Learns from complete episodes, i.e. we need to reach some termination
  • Caveat can only apply MC to episodic

We assume $v_\pi = \mathbb{E} [G_t | S_t = s]$, and so all we try to do is find the avg. accumulated reward starting from our current state $s$ at step $t$.

To obtain this average, we sample "mini-episodes" starting from our current state and average the accumulated reward $G_t$ over these samples.


First-Visit MC policy evaluation

The first time-step $t$ in that state $s$ is visited in an (sampled) episode:

  1. Increment counter $V(s) \leftarrow N(s) + 1$
  2. Increment total return $S(s) \leftarrow S(s) + G_t$, ( $S$ being the sum in this case)
  3. Value is estimated by mean return $V(s) = \frac{S(s)}{N(s)}$
  4. By law of large numbers, $V(s) \rightarrow v_\pi (s) \text{ as } N(s) \rightarrow \infty$

Instead of using the total sums, we can take advantage of the recursive mean formula for online-learning.

Every-Visit MC policy evaluation

Same as First-Visit MC policy evaluation, but if we have loops and end up in the same state $s$ multiple times in the same episode, we perform the steps above.

Incremental MC

See p. 11 in Lecture 4.

Update is now:

     V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))

where $\alpha = \frac{1}{N(S_t)}$ in this case. The above is basically a general approach which we will see recurring in other methods.

Temporal-Difference Learning


  • TD-target - $R_{t+1} + \gamma V(S_{t+1})$, i.e. the value of state $S_t$ we we want to update towards. Thus we update $V(S_t)$ to better account for the value of the next state.
  • TD-error - $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$, i.e.


  • Learn directly from experience
  • No knowledge of MDP transitions / rewards, i.e. model-free
  • Learns from incomplete episodes, by bootstrapping (different from MC)
  • Updates a guess towards a guess
  • TD attempts to "encode" a fraction of the value in the next state $S_{t+1}$ into the value of $S_t$ incrementally

Update value $V(S_t)$ towards estimated return $R_{t+1} + \gamma V(S_{t+1})$

     V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))
$R_{t+1} + \gamma V(S_{t+1})$
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$

The difference between the MC and TD approaches is that the MC considers the actual return $G_t$ obtained from an (complete) episode, while the TD uses some estimated return at the next step to obtain value for current step.

TD is thus a much more flexible approach than MC.

Check out p.15 in Lecture 4 for a really nice example of the difference!


Bias / Variance between MC and TD

  • Return $G_t = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{T-1} R_T$ is unbiased estimate of $v_\pi(S_t)$
  • If we had the true value function, TD would also be unbiased, BUT it ain't so!
  • TD target $R_{t+1} + \gamma V(S_{t+1})$ is biased estimate of $v_\pi(S_t)$, but has lower variance
  • MC's evaluation depends on many random actions, transitions, rewards
  • TD's evaluation depends only on one random action, transition, reward

Pros and cons

  • MC has high variance, zerio bias
    • Good convergence properties
      • (also for function-approximations)
    • Not very sensitive to initial value
    • Minimizes the mean-squared error (p. 24 Lecture 4)
    • Ignores the Markov property
      • Slower when Markov property is present
      • Works even though there is no Markov property
    • Uses sampling
  • TD has low variance, some bias
    • Usually more efficient than MC
    • TD(0) converges to $v_\pi (s)$
      • (not always for function-approximations)
    • More sensitive to initial value
    • Converges to a solution of maximum likelihood Markov model (p. 24 Lecture 4)
    • Takes advantage of the Markov property
      • Faster when this is present
      • Does not work as well when Markov property is absent
    • Usues bootstrapping AND sampling

DO checkout p. 26-28 in Lecture 4! It shows some great visualizations of the different methods and how they do look-aheads / backups.


  • Sweet-spot between TD- and MC-learning
  • What if we instead of just considering the very next reward, we consider the $n$ next rewards!
  • But we would also like to more heavily weight the more immediate reward estimates, compared to one found $n$ steps into the future.

Let's define the n-step return as :

 G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})

The update is then:

V(S_t) \leftarrow V(S_t) + \alpha \big( G_t^{(n)} - V(S_t) \big)

And we get, as Hannah Montana would say, the best of both worlds! (TM)

The issue now is to find the optimal $n$.

To handle this, we would like to somehow take the average reward for the different $n$. And so TD(λ) is born!

Forward-view TD(λ)

  • The λ-return $G_t^\lambda$ combines all n-step returns $G_t^{(n)}$
  • Using weight $(1 - \lambda)\lambda^{n - 1}$

	   G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}
  • Forward-view TD(λ)

	   V(S_t) \leftarrow V(S_t) + \alpha (G_t^\lambda - V(S_t))
  • Why geometric weighting, and not some other weighting?
    • Computationally efficient
    • Stateless => no more expensive than TD(0)
  • Issue is now that we have to wait until the end of the episode (since we're summing to infinity), which is the same problem we had with MC
    • Luckily, by taking on an additional view of the problem, backward-view TD(λ), we can update from incomplete episodes

Backward-view TD(λ)

  • Keep eligibility trace for every state $s$
  • Update value for $V(s)$ for every state $s$
  • Update happens in proportion to the TD-error $\delta_t$ and the eligibility-trace $E_t(s)$

  \delta_t &= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\
      V(s) &\leftarrow V(s) + \alpha \delta_t E_t(s) \\

Eligibility traces

Eligibility traces combines most frequent and most recent states:

       E_0(s) &= 0 \\
       E_t(s) &= \lambda \gamma E_{t-1}(s) + \mathbf{1}(S_t = s)

My intuition regarding this is as follows:

  • Gives us a way of measuring "responsibility" for the TD-error
  • If we at some time-step $t'$ end up in state $s$, then at some $t$ we end up in the same state $s$, we can view the eligibility trace of $s$ at $t - t'$ to measure how much the error in the value function is due to what we saw $t - t'$ steps back.

See p.40 in Lecture 4 for an example.

Replacing Eligibility Trace method

The concept is simple; instead of adding the weighted occurence to the trace, we simply replace it. This way the largest possible contribution we can have is 1.

This is especially useful for cases where we have possibilities of cyclic state-transitions, as the multiple occurences of the same state can lead to incredibly contributions to this state.

Take this example:

  • Start at position 0
  • Reward is at position 1
  • Try to move left (opposite of reward) 100000 times, staying in position 0 since there is a wall to our left
  • Finally take a right
  • Receive reward

In this case position 0 would get large amounts of credit even though it might not be the best state to be in.

This can also often lead to integer-overflow unless other precautionary measures are taken.

For completeness:

E_t[s, a] \leftarrow 1

is used instead of

E_t[s, a] \leftarrow E_{t-1}[s, a] + 1

Backward-view and forward-view equivalence

If you have a look at p.43 in Lecture 4 and on, there is some stuff about the fact that these views are equivalent.

  • TD(0) is simply the Simple TD-learning we started out with:

  E_t(s) &= \mathbf{1}(S_t = s), \quad \lambda = 0 \\
      \delta_t &= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\
      V(s) &\leftarrow V(s) + \alpha \delta_t \\
  • TD(1) defers updates until the episode has terminated:

  E_t(s) &= \gamma E_{t-1}(s) + \mathbf{1}(S_t = s), \quad \lambda = 1 \\
      \delta_t &= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\
      V(s) &\leftarrow V(s) + \alpha \delta_t \\

    This is not actually equivalent to the MC view, just sayin'.


  • Forward-view provides us with a "theoretical", providing us with an intuition

       G_t^\lambda &= (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)} \\
       V(S_t) &\leftarrow V(S_t) + \alpha (G_t^\lambda - V(S_t)) \\
  • Backward-view provides us with a practical way to update $V(s)$ for some state $s$ where the error is now a function of the previous updates of $V(s)$ instead of the future updates, as in the forward view.

  \delta_t &= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\
      V(s) &\leftarrow V(s) + \alpha \delta_t E_t(s) \\
  • Backward-view provides us with a practical way to update some state $s$ a any time-step $t$ by saying that this state depends on a combined weight of the steps since we last updated $V(s)$ and the frequency of state $s$. If we let $\lambda = 1$, i.e. consider $T \rightarrow \infty$, it turns out that the forward-view and backward-view are equivalent.

I must admit that I'm sort of having a hard time understanding how the forward-view and backward-view are in fact equivalent.

I can understand the intuition behind both views, but it does not seem to be the same intuition. Forward-view:

  • TD-error is difference between current $V(s)$ and accumulated rewards $G_t^{(n)}$ at all steps $n$ in the future


  • Considers all previous updates to $V(s)$ for the state $s$


  • The error between $V(S_t = s)$ now and in some future is of course zero if we never actually see state $s$ again. This also makes sense in the backward-view, where at some "future" we haven't seen $s$ since some time $t$ in the past.

Model-free Control - policy-optimization



  • Uses TD-learning to learn $Q(s, a)$
  • Updates to


  • Elegibility trace runs all the way back => how do we deal with loops?
    • Cycles of identical state-action pair $(s, a)$ will cause updates which lowers $q(s, a)$ until it is not any longer the preferred pair.
    • But what if $(s, a)$ is in fact the best state-action pair, but also has the largest probability of ending up in $s$, then the above point would be highly indesirable!
      • This is especially a problem in environemnts where we dont get a reward until the episode terminates
    • In this case, lowering the eligility-factor (or whatever you call it) $\lambda$ might help to mitigate this (I think o.O)



  • Attempts to estimate a target policy $\pi(a | s)$ and use this to then estimate $v(s)$ or $q(s, a)$

  Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t)

Notice the difference between Sarsa and Q-learning! $S_{t + 1}$ is observed, $A'$ is predicted from $s_{t + 1}$

  • Probably converge in infinite-data limit, assuming every state is visited infintely often
  • In the tabular context, Q-learning is provably sample efficient jin18_is_q_learn_provab_effic

Value-function approximation

Incremental methods

Value function approximation

  • λ-return $G_t^\lambda$ is a biased sample of true value $v_\pi (s)$
  • Apply supervised learning to "training data"

  \langle S_1, G_1^\lambda \rangle, \langle S_2, G_2^\lambda \rangle, ..., \langle S_{T-1}, G_{T-1}^\lambda
  • Forward-view for a linear TD(λ)
\Delta \mathbf{w} &= \alpha \Big( G_t^\lambda - \hat{v} (S_t, \mathbf{w}) \Big) \nabla_{\mathbf{w}} \hat{v} (S_t, \mathbf{w}) \\
&= \alpha \Big( G_t^\lambda - \hat{v} ( S_t, \mathbf{w}) \Big) \mathbf{x} (S_t)
  • Backward-view for a linear TD(λ)
    • $E_t$ is now be of same dimension as our feature-space!
    • That is, instead of "counting" the occurences of the states, we now "count" the features of the states
   \delta_{t+1} &= R_{t+1} + \gamma \hat{v} (S_{t+1}, \mathbf{w} ) - \hat{v} (S_t, \mathbf{w}) \\
   E_t &= \gamma \lambda E_{t-1} + \mathbf{x} (S_t) \\
   \Delta \mathbf{w} &= \alpha \delta_t E_t

Action-value function approximation

Evaluation / prediction
\hat{q} (S, A \mathbf{w}) \sim q_\pi (S, A)

Want to minimise the mean-squared error (MSE):

\underset{\mathbf{w}}{\text{argmin }} \mathbb{E}_\pi \Big[ \Big( q_\pi (S, A) - \hat{q} (S, A, \mathbf{w}) \Big)^2 \Big]

Using SGD to find local minimum:

-\frac{1}{2} \nabla_\mathbf{w} J(\mathbf{w}) &= \Big( q_\pi (S, A) - \hat{q}(S, A, \mathbf{w}) \Big) \nabla_\mathbf{w} \hat{q} (S, A, \mathbf{w}) \\
\Delta \mathbf{w} &= \alpha \Big( q_\pi (S, A) - \hat{q} (S, A, \mathbf{w}) \Big) \nabla_\mathbf{w} \hat{q} (S, A, \mathbf{w}) 
  • When to use what for prediction

    See p.30 in Lecture 6 for a sweet table regarding when we have guaranteed convergence when using function approx. for policy evaluation.


Simply take what we found in Evaluation / prediction and substitute a target for $q_\pi (S, A)$

  • Monte-Carlo

    Target is return at time $t$, $G_t$:

   \Delta \mathbf{w} = \alpha \Big( G_t - \hat{q} (S, A, \mathbf{w}) \Big) \nabla_\mathbf{w} \hat{q} (S, A, \mathbf{w}) 
  • TD(0)

    Target is reward + estimated reward given we take optimal actions in the future:

\Delta \mathbf{w} = \alpha \Big(R_{t+1} + \gamma \hat{q} (S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q} (S_t, A_t, \mathbf{w}) \Big) \nabla_\mathbf{w} \hat{q} (S_t, A_t, \mathbf{w})
  • TD(λ)
    • Forward-view

      Target is simply $R_{t+1} + \hat{q}_t^\gamma$, thus:

 \Delta \mathbf{w} = \Big( \hat{q}_t^\gamma - \hat{q} (S_t, A_t, \mathbf{w}) \Big) \nabla_\mathbf{w} \hat{q} (S_t, A_t, \mathbf{w})

      where $q_t^\lambda = (1 - \lambda) \sum_{n=1}^\infty \lambda^{n - 1} q_t^{(n)}$ and $q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n q(S_{t+n}, A_{t+n})$.

    • Backward-view
\delta_t &= R_{t+1} + \gamma \hat{q} (S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q} (S_t, A_t, \mathbf{w}) \\
E_t &= \gamma \lambda E_{t-1} + \nabla_\mathbf{w} \hat{q} (S_t, A_t, \mathbf{w}) \\
\Delta \mathbf{w} &= \alpha \delta_t E_t

Batch methods

Thought is to "re-use" data, or rather, make the most use of the data we have, instead of updating once using each point as previously described.

Experience replay

See p.35 in Lecture 6.

But it's basically just this:

  1. Sample $(s, v^\pi)$ from our observations $\mathcal{D} = \{ \langle s_1, v_1^\pi \rangle, ..., \langle s_T, v_T^\pi \rangle \}$
  2. Apply SGD update $\Delta \mathbf{w} = \alpha \Big( v^\pi - \hat{v} (s, \mathbf{w}) \Big) \nabla_\mathbf{w} \hat{v}(s)$
  3. Converges to least squares solution $\mathbf{w}^\pi = \underset{\mathbf{w}}{\text{argmin }} LS(\mathbf{w})$

Deep Q-Networks (DQN)

See p.40 or so in Lecture 6 for this awesomesauce. Uses

  • Experience replay
    • Randomization decorrelates the data => good!
  • Fixed Q-targets
    • Freeze network, and then during the next training-cycle we bootstrap towards the frozen network (just old parameters)
    • Every some n steps we freeze the parameters, and then the next n steps we use these parameters' prediction as targets. Then after those n targets we freeze our new parameters, and repeat.
    • More stable update

The points above is what makes this non-linear function approx. stable.

What they did in the Atari-paper is to have the outputs of the ConvNet be a softmax over all possible actions. This allows easy maximization over the actions, and is not too hard to model as long as the action-space $\mathcal{A}$ isn't too large.

Linear Least Squares prediction

See p. 43 or so for a closed form solution to the linear case. For prediction, but something similar exists for control too.

Policy gradient methods


  • $J(\theta)$ is any policy objective function
  • $\Delta \theta = \alpha \nabla_\theta J(\theta)$ is the policy update


Previously we've considered the prediction and control of the value function, but now we're instead going to directly model the optimal policy! That is, we directly parametrise

  \pi_\theta = \mathcal{P} [a, | s, \theta]

To be able to optimize the policy, we first need to come up with some policy objective function.

Policy Object Functions

Goal: given a policy $\pi_\theta(s, a)$ with parameters $\theta$, find best $\theta$

Need to measure quality of $\pi_\theta$:

  • Episodic environment with some initial state, we use the start-value:

	   J_1 (\theta) = V^{\pi_\theta} (s_1) = \mathbb{E_{\pi_\theta}}[v_1]

    i.e. judge the policy by the expected reward received when starting in state $v_1$

  • Continuing environments we can use the average value:

	   J_{\text{avg } V}(\theta) = \sum_s d^{\pi_\theta}(s) V^{\pi_\theta}(s)
  • Or average reward per time-step

	   J_{\text{avg } R}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_{\theta} (s, a) \mathcal{R}_s^a

    which says "there is some probability of being in state $s$, under our policy there is some probability I choose action $a$ in this state, and then I receive reward at this step". So the above is for each time-step.

    Where $d^{\pi_\theta}$ is a stationary distribution of a Markov Chain, i.e. what is the probability of being in state $s$.

Comparison with Value-based methods


  • Better convergence propertiers
  • Effective in high-dimensional or continuous action space
    • This is due to the fact that in value-based methods we need to maximize over the actions to obtain the "value" for a next state. With high-dimensional or continuous action space it might be very hard to perform this maximization
  • Can learn stochastic policies
    • stochastic policies simply mean policies which take on a probability distribution over the actions rather than a discrete choice. Very useful when the environment itself is stochastic (i.e. not deterministic).
    • Example is "Rock-Paper-Scissor", where a the optimal behavior will be a uniform distribution over the actions.


  • Typically converge to local optimum rather than global
  • Policy evaluation is usually inefficient using Policy Gradient Methods, compared to using Value-function methods.

Finite Difference Policy Gradient

  • To evaluate policy gradient of $\pi_\theta(s, a)$
  • Does so through the definition of the derivative for each dimension (see p. 13 in Lecture 7 for details)
  • Noisy and inefficent in high dimensions

Monte-Carlo Policy Gradient

  • Allows analytic gradients
  • Assume policy $\pi_\theta$ is differnetiable whenever it is non-zero
  • And know the gradient
  • Score function:

\nabla_\theta \log \pi_\theta (s, a)

    where we simply ignore the $\pi_\theta(s, a)$ term from the Likelihood ratios due to the fact that optimizing wrt. $\log f(x)$ also optimizes $f(x)$.

Likelihood ratios

\nabla_\theta \pi_\theta (s, a) &= \pi_\theta (s, a) \frac{\nabla_\theta \pi_\theta (s, a)}{\pi_\theta(s, a)} \\
&= \pi_\theta (s, a) \nabla_{\theta} \log \pi_\theta(s, a)

Simply due to the $\nabla \log (x) = \frac{\nabla x}{x}$

Example score-function: Softmax Policy

  • Actions weighted by linear combination of features $\phi(s, a)^T \theta$
  • Assume $\pi_\theta (s, a) \propto \exp \Big( {\phi(s,a)^T \theta} \Big)$
  • Score function is then

 \nabla_\theta \log \pi_\theta (s, a) = \phi(s, a) - \mathbb{E} [\phi(s, \cdot)]

Example score-function: Gaussian Policy

  • $\mu(s) = \phi(s)^T \theta$
  • Policy is Guassian, $a \sim \mathcal{N} (\mu(s), \sigma^2)$
  • Score function is then

\nabla_\theta \log \pi_\theta (s, a) = \frac{(a - \mu(s)) \phi(s)}{\sigma^2}

One-Step MDPs

  • Starting in state $s \sim d(s)$
  • Terminating after one time-step with reward $r = \mathcal{R}_{s, a}$

See p. 19 in Lecture 7 for how to compute the policy gradient, OR try it yourself from the objective functions described earlier!

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} [\nabla_\theta \log \pi_\theta (s, a) r]

Policy Gradient Theorem

Simply changes the immediate reward $r$ in the one-step MDPs with the long-term value $Q^\pi (s, a)$.

For any differentiable policy $\pi_\theta (s, a)$, for any of the policy objective functions $J = J_1, J_{\text{avg } R}, \quad \text{or  } \frac{1}{1 - \gamma} J_{\text{avg } V}$, the policy gradient is

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} [\nabla_\theta \log \pi_\theta (s, a) Q^{\pi_\theta}(s, a)]

The algorithm

  • Update parameters by SGD
  • Using policy gradient theorem
  • Using return $v_t$ as an unbiased sample of $Q^{\pi_\theta} (s_t, a_t)$

\Delta \theta_t = \alpha \nabla_\theta \log \pi_\theta (s_t, a_t) v_t

Remember this is a MC => need to sample entire episode to learn

See p.21 from Lecture 7 for psuedo-code.

Actor-Critic Policy Gradient

  • Actor-Critic learns both value function and policy
  • MC policy gradient still has high variance
  • Use critic to estimate the action-value function $Q_w(s, a) \sim Q^{\pi_\theta} (s, a)$
  • Actor-critic algorithms maintain two sets of parameters
    updates action-value function parameters $w$
    updates policy parameters $\theta$ in direction suggested by critic
  • That is; critic handles estimating $Q^{\pi_\theta}$, and the actor then asks the critic what the value of the state-action pair the critic just observed / took is, and updates the policy-parameters $\theta$ according to this value

\nabla_\theta J(\theta) &\sim \mathbb{E}_{\pi_\theta} [\nabla_\theta \log \pi_\theta (s, a) Q_w (s, a) ] \\
\Delta \theta &= \alpha \nabla_\theta \log \pi_\theta (s, a) Q_w(s, a)

    So we have exactly the same update, but now replacing the long-term action-value function with an approximation.

Estimating the action-value function

Want to know how good the current policy $\pi_\theta$ is for current params $\theta$, hence we need to do policy evaluation

  • Monte-Carlo policy evaluation
  • Temporal-Difference evaluation
  • TD(λ)


See the psuedo-code on p. 25 in Lecture 7.

But sort of what it does is (at each step):

  1. (actor) Sample $r = \mathcal{R}_s^a$; sample transition $s' \sim \mathcal{P}_s^a$
  2. (actor) Sample action $a' \sim \pi_\theta (s', a')$
  3. (critic) Compute TD-error $\delta_t$
  4. (actor) Update policy-parameters $\theta$
  5. (critic) Update weights for $Q_w(s, a)$
  6. (actor) $a \leftarrow a', \quad s \leftarrow s'$

Compatible Function Approximation

If the following two conditions are satifisfied:

  1. Value function approximator is compatible to the policy

\nabla_w Q_w(s, a) = \nabla_\theta \log \pi_\theta (s, a)
  2. Value function parameters $w$ minimise the MSE

\varepsilon = \mathbb{E} \Big[ \big( Q^{\pi_\theta}(s, a) - Q_w(s, a) \big)^2 \Big]

Then the policy gradient is exact,

\nabla_\theta J(\theta) = \mathbb{E} \big[ \nabla_\theta \log \pi_\theta (s, a) Q_w(s,a) \big]

Reducing variance using a Baseline

  • Subtract a baseline function $B(s)$ from the policy gradient
  • Can reduce variance without changing expectation
  • Basically "centering" (subtracting mean and dividing by variance)

See p. 30 in Lecture 7 for a proof of why it doesn't change the expectation. Basically, we can add or subtract anything we want from $Q^{\pi_\theta}$ as long as it only depends on the state $s$.

Left with something called the advantage function

A^{\pi_\theta} (s, a) &= Q^{\pi_\theta} (s, a) - V^{\pi_\theta}(s) \\
\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta (s, a) A^{\pi_\theta} (s, a)] \\
Estimating Advantage function

There multiple ways of doing this.

See p. 30 in Lecture 7 and on for some ways. The most used one is nr. 2.


Policy gradient has many equivalent forms:

\nabla_\theta J(\theta) &=& \mathbb{E} [ \nabla_\theta \log \pi_\theta (s, a) v(s)] & \text{REINFORCE} \\
&=& \mathbb{E} [ \nabla_\theta \log \pi_\theta (s, a) Q^w (s, a) ] & \text{Q Actor-Critic} \\
&=& \mathbb{E} [ \nabla_\theta \log \pi_\theta (s, a) A^{\pi_\theta}(s, a) ] & \text{Advantage Actor-Critic} \\
&=& \mathbb{E} [ \nabla_\theta \log \pi_\theta (s, a) \delta ] & \text{TD Actor-Critic} \\
&=& \mathbb{E} [ \nabla_\theta \log \pi_\theta (s, a) \delta e ] & \text{TD(}\lambda \text{) Actor-Critic} \\
G_\theta^{-1} \nabla_\theta J(\theta) &=& w & \text{Natural Actor-Critic}

All these methods are basically saying: move the policy in the direction that gets us more value.

  • Each leads a SGD algorithm
  • Critic uses policy evaluation (e.g. MC or TD learning) to estimate $Q^{\pi}(s, a)$, $A^\pi(s, a)$ or $V^\pi(s)$

Thompson sampling


  • $k \in \left\{ 1, \dots, K \right\}$ actions
  • $(\theta_1, \dots, \theta_K)$ are the probabilities of success for the corresponding action / arm
  • $G = (V, E)$ denotes a graph with vertices $V = \left\{ 1 , \dots, N \right\}$ and edges $(i, j) \in E$
  • Historical data

\mathbb{H}_{t} = \Big( (x_1, y_1), \dots, (x_t, y_t) \Big)
  • $r_t = r(y_t)$ denotes reward obtained by taking action $x_t \in \mathcal{X}$ leading to observed outcome $y_t$
  • Per-period regret is the difference between exptected reward using optimal action and the reward received by current action. In the Armed bandit problem:

\text{regret}_t(\theta) = \Big( \max_k \theta_k \Big) - \theta_{x_t}
  • $y_t \sim q_{\theta}(\cdot \mid x_t)$ is the conditional distribution of the system (i.e. what we want to learn)

Armed bandit

  • Suppose there are $K$ actions, and when played, any action yields either a success or failure.
  • Action $k \in \left\{ 1, \dots, K \right\}$ produces a success with probability $\theta_k \in [0, 1]$.
  • The success probabilities $(\theta_1, \dots, \theta_K)$ are unknown to the agent, but are fixed over time, and can therefore be learned by experimentation
  • Objective: maximize number of successes over $T$ periods, where $T$ is relatively large compared to the number of arms $K$


  • Posterior sampling and probability matching
  • Methods such as $\varepsilon \text{-greedy}$ attempts to balance exploration by choosing a random action with probability $\varepsilon$ and greedy step with probability $1 - \varepsilon$
    • Inefficient since it will sample actions regardless of how unlikely they are to be optimal; Thompson sampling attempts to address this
  • Useful for online decision problems
  \KwData{action space $\mathcal{X}$, prior $p$, enviroment distribution $q$, reward function $r$}
  \KwResult{posterior distribution $\hat{p}$}
  \For{$t = 1, 2, \dots$} {
    \tcp{sample model; a greedy algorithm would use $\hat{\theta} = \mathbb{E}_p[\theta]$}
    $\hat{\theta} \sim p$ \\
    \tcp{select and apply action}
    $x_t \leftarrow \underset{x \in \mathcal{X}}{\text{argmax}}\ \mathbb{E}_{q_{\hat{\theta}}} \big[ r(y_t) \mid x_t = x \big]$ \\
    Apply $x_t$ and observe $y_t$ \\
    \tcp{update distribution}
    $p \leftarrow \mathbb{P}_{p, q}(\theta \in \cdot \mid x_t, y_t)$
  \caption{Thompson Sampling. Estimate the distribution of the environment. }

The difference between a greedy algorithm and Thompson sampling is that a greedy algorithm would use the expectation over the current estimate of $p$ to obtain the new estimate of $\theta$, while Thompson sampling samples the new estimate of $\theta$.


Chapter 2

Stationary Problem

  • N-armed bandit problem
  • Action-rewards does not change over time but stay constant
  • All we got to do then is to model the average reward for some action, i.e.
  Q_t(a) = \frac{R_1 + R_2 + ... + R_{K_a}}{K_a}

where $K_a$ is the number times we've chosen action $a$ up until step $t$

And of course, practically it's better to compute this using the recursive formula for the mean:

Q_{k+1} &= \frac{1}{k} \sum_{i=1}^k R_i \\
&= \frac{1}{k} (R_{k} + \sum_{i=1}^{k-1} R_i) \\
&= \frac{1}{k} \Big( R_k + (k - 1) \sum_{i=1}^{k-1} R_i \Big) \\
&= \frac{1}{k} \Big( R_k + (k - 1) Q_k ) \\
&= Q_k + \frac{1}{k} \Big( R_k - Q_k \Big) \\

where $Q_k$ is the value for some action $a$ after choosing and receiving reward from this action $k$ times.

Non-stationary Problem

  • Makes sense to more heavily weight the most recent observations higher than "old" ones
  • We write $\alpha$ in instead of $\frac{1}{k}$ to be a bit more general
  • By recursively expanding we obtain
Q_{k+1} = (1 - \alpha)^k Q_1 + \sum_{i=1}^k \alpha (1 - \alpha)^{k-i} R_i

Which is called the weighted average, due to the fact that $(1-\alpha)^k + \sum_{i=1}^k \alpha (1-\alpha)^{k-i} = 1$ (can be proven by induction).


2.3 - N-Armed bandit ɛ-greedy comparison

Consider the following learning problem. You are faced repeatedly with a choice among n different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections,or time steps.

The problem is as follows:

  • Choose between n actions at each step t
  • Action-rewards are constant wrt. t
  • Maximize expected / avg. reward over 1000 steps t

To get an idea of how well the algorithms perform in general, we run the above for 2000 different action-rewards, i.e. 2000 differently parametrized environments.

# don't want this tangled
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# n-Armed Bandit
# each environment has 10 actions (bandits), and we have 2000 such problems
# allows us to see how algorithms perform over different hidden action-values
n_steps = 1000
n_problems = 2000
armed_testbed = np.random.randn(10, n_problems)
actions = np.arange(10)
steps = np.arange(n_steps)

# we compute the value of action `a` at step `t`
# by the accumulated reward from the times we've
# chosen `a` up until (incl.) `t` and then divided
# by the number of times we've chosen this `a`.
# Basically, the average reward of `a` at `t`
# wrt. times we've chosen `a`.

def epsilon_greedy(Q, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(actions.shape[0])
    return np.argmax(Q)

def run_testbed(choose_action, *args, **kwargs):
    rewards = np.zeros((n_steps, armed_testbed.shape[1]))

    for j in range(armed_testbed.shape[1]):
        environment = armed_testbed[:, j]

        # intialization
        Q = np.zeros(10)
        action_freq = np.zeros_like(actions,
        action_reward = np.zeros_like(actions)
        reward = 0

        for t in steps:
            a = choose_action(Q, *args, **kwargs)
            # reward with some noise
            r = environment[a] + np.random.randn()
            # update count and reward for selection action `a`
            action_freq[a] += 1
            action_reward[a] += r
            # update value for action `a`
            Q = action_reward / (action_freq + 1e-5)

            # accumulate reward
            reward += r
            # expected reward wrt. t
            rewards[t, j] = reward / (t or 1)

    # take the avg for each time-step over all problems
    return np.mean(rewards, axis=1)

And let's run the testbed for different ε in parallel.

import multiprocessing as mp
from functools import partial

f = partial(run_testbed, epsilon_greedy)

with mp.Pool(3) as p:
    results =, [0, 0.01, 0.1])

Finally we create the plot!

steps = np.arange(1000)

fig = plt.figure(figsize=(14, 5))
plt.plot(steps, results[0], label="greedy")
plt.plot(steps, results[1], label="ε-greedy 0.01")
plt.plot(steps, results[2], label="ε-greedy 0.1")
plt.title("2000 different 10-bandit problems")
plt.ylabel("Expected $R_t$")


synchronous updates
update all states / values / whatever one after the other
given a full MDP and some policy $\pi$, find value function $V^\pi$ for this policy
given a full MDP, find optimal (wrt. reward) value function $V*$ and policy $\pi*$
simply the state $s$ we "base" our update on for some state $s'$
updating the value of a state using values of future states
full backup
we perform backup for all states
consists of applying a backup operation to each state
$R_{t+1} + \gamma V(S_{t+1})$
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
update involves an estimate (not the real returns / rewards)
update samples an expectation
stochastic policy
policy which uses a distribution over actions rather than deterministic behavior.


  • Karpathy's "Pong from pixels". Details how to implement a policy-gradient RL-algorithm using a ConvNet for function-approximation to learn to play Pong from the raw image input.