Reinforcement Learning

Reinforcement Learning is a learning framework which helps us account for future reward by "encoding" the value of the next state s_{t+1} (and / or actions) into the value of the current state s_t (and / or actions), thus making the greedy "strategy", wrt. this value function we're constructing, the optimal "strategy".


  • Capital letter = random variable, e.g. S is the random variable which can take on all states s
  • Calligraphic capital letters = spaces of values that the random variables can take on, e.g. the reward space / function R
  • Lowercase = specific values the random variables take on, e.g. s
  • * denotes optimality for some function, e.g. f*
  • π - policy
  • a - action
  • s - state
  • r - reward
  • V(s) - value function for some state s
  • G_t - accumulated reward at step t

Markov Decision Process

Markov Reward Process (MRP)

See p. 10 in the Lecture 2: MDP.

Markov Decision Process (MDP)

See p. 24 in the Lecture 2: MDP.

Partial Observable Markov Decision Process (POMDP)

See p. 51 in the Lecture 2: MDP.

Planning by Dynamic Programming

Policy evaluation

Problem - evaluate a given policy π. Solution - iterative application of the Bellman expectation backup.

  1. Initial value function v_1 is 0 everywhere
  2. Using synchronous backups
    • Each iteration k+1
    • For all states s ∈ S
    • Update v_{k+1}(s) from v_k(s')

In short, when updating the value function v_{k+1}(s) for state s, we use the Bellman equation and the previous value function v_k for all "neighboring" states (the states we can end up in with all the actions we can take).

v_k refers to the k-th iteration of the iterative policy evaluation.
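Numerically, the backup above can be sketched in a few lines. This is my own toy example (the two-state chain, rewards, and discount are invented, not from the lecture):

```python
import numpy as np

# Toy 2-state MRP induced by a fixed policy: P[s, s'] are transition
# probabilities under the policy, R[s] the expected reward in each state.
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])   # state 1 is absorbing
R = np.array([1.0, 0.0])
gamma = 0.9

def iterative_policy_evaluation(P, R, gamma, iters=500):
    v = np.zeros(len(R))              # v_1 = 0 everywhere
    for _ in range(iters):
        v = R + gamma * P @ v         # synchronous Bellman expectation backup
    return v

v = iterative_policy_evaluation(P, R, gamma)
# v[1] stays 0 (absorbing, no reward); v[0] solves v0 = 1 + 0.9 * 0.5 * v0
```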


Policy iteration

Problem - given a policy π, find a better policy

  1. Evaluate the policy π
  2. Improve the policy by acting greedily wrt. v_π,


This will eventually converge to the optimal policy π* due to the partial ordering of policies (π' ≥ π if v_{π'}(s) ≥ v_π(s) for all s).
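The evaluate-then-improve loop might be sketched like this; the 2-state, 2-action MDP and every name in it are invented for illustration:

```python
import numpy as np

# Hypothetical MDP: P[a][s, s'] transition probabilities, R[a][s] rewards.
P = {0: np.array([[1.0, 0.0], [1.0, 0.0]]),   # action 0: go to state 0
     1: np.array([[0.0, 1.0], [0.0, 1.0]])}   # action 1: go to state 1
R = {0: np.array([0.0, 0.0]),
     1: np.array([1.0, 1.0])}
gamma = 0.5

def policy_iteration(P, R, gamma, n_states=2):
    policy = np.zeros(n_states, dtype=int)            # start with action 0 everywhere
    while True:
        # 1. evaluate the policy exactly: solve v = R_pi + gamma * P_pi v
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        R_pi = np.array([R[policy[s]][s] for s in range(n_states)])
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # 2. improve: act greedily wrt. q(s, a) = R + gamma * P v
        q = np.array([[R[a][s] + gamma * P[a][s] @ v for a in (0, 1)]
                      for s in range(n_states)])
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy

policy, v = policy_iteration(P, R, gamma)
```

Here action 1 pays 1 every step, so the stable greedy policy picks it everywhere and v = 1 / (1 - γ) = 2 in both states.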

Value iteration


Find the optimal policy π given the dynamics of the system (rewards and environment transitions)


Iterative application of Bellman optimality backup.

  1. Initialize value function v_1
  2. For each iteration k+1
    • For all states s ∈ S
    • Update v_{k+1}(s) from v_k(s')

Unlike policy iteration, there is no explicit policy here.

At each iteration, for all states s we take the current optimal action a (i.e. optimal wrt. the current iteration), and then update the value v(s) for this state.

Eventually v_k(s) will converge to the true value function for some s, and since all states are coupled through the recursive relationship of the Bellman equation, the other states will then also start to converge.
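A sketch of the loop above on the same style of toy MDP (numbers invented); note that the max over actions replaces any explicit policy:

```python
import numpy as np

# Invented 2-state, 2-action MDP: P[a][s, s'] transitions, R[a][s] rewards.
P = {0: np.array([[1.0, 0.0], [1.0, 0.0]]),
     1: np.array([[0.0, 1.0], [0.0, 1.0]])}
R = {0: np.array([0.0, 0.0]),
     1: np.array([1.0, 1.0])}
gamma = 0.5

def value_iteration(P, R, gamma, iters=100):
    v = np.zeros(2)
    for _ in range(iters):
        # Bellman optimality backup: max over actions, no explicit policy
        v = np.max([R[a] + gamma * P[a] @ v for a in (0, 1)], axis=0)
    return v

v = value_iteration(P, R, gamma)
```

Since always taking action 1 yields reward 1 per step, v converges to 1 / (1 - γ) = 2 in both states, matching what policy iteration finds.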

In the form of the Bellman equation we can write the iterative process as:


Or in matrix notation:



| Problem                   | Bellman equation                                         | Algorithm                    |
|---------------------------|----------------------------------------------------------|------------------------------|
| Prediction                | Bellman Expectation equation                             | Iterative Policy evaluation  |
| Control (maximize reward) | Bellman Expectation equation + Greedy Policy improvement | Policy iteration             |
| Control                   | Bellman Optimality equation                              | Value iteration              |


Asynchronous Dynamic Programming

Prioritized sweeping
  • Decide which states to update first
  • Use the absolute difference between the Bellman backup and the current V(s) to guide our selection of which states to update
Real-time dynamic programming
  • Idea: only update the states that are relevant to the agent
  • Randomly sample
  • After each time-step S_t, A_t, R_{t+1}
  • Backup the state S_t

Model-free Prediction - policy-evaluation

What if we don't know the Markov Decision Process, i.e. how the environment works?

This section focuses only on evaluating the value function v_π for some given policy π. This policy does not have to be the optimal policy. ONE STEP AT A TIME, BOIS!

Monte-Carlo Learning


  • N(s) - # of times we've seen state s in this episode
  • S(s) - sum of returns received from the times we've been in state s in this episode
  • Value is equal to the avg. return
  • Learns from complete episodes, i.e. we need to reach some termination
  • Caveat: can only apply MC to episodic MDPs

We assume v_π(s) = E[G_t | S_t = s], and so all we try to do is find the avg. accumulated reward starting from our current state s at step t.

To obtain this average, we sample "mini-episodes" starting from our current state and average the accumulated reward G_t over these samples.


First-Visit MC policy evaluation

The first time-step t at which state s is visited in a (sampled) episode:

  1. Increment counter N(s) ← N(s) + 1
  2. Increment total return S(s) ← S(s) + G_t (S being the sum in this case)
  3. Value is estimated by the mean return V(s) = S(s) / N(s)
  4. By the law of large numbers, V(s) → v_π(s) as N(s) → ∞

Instead of using the total sums, we can take advantage of the recursive mean formula for online-learning.
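The first-visit procedure with the incremental-mean update might look like this; the episode data is made up:

```python
from collections import defaultdict

# First-visit MC evaluation sketch. Each episode is a list of
# (state, reward) pairs; the value estimate is the running mean of the
# first-visit returns, via the incremental form V <- V + (G - V) / N.

def first_visit_mc(episodes, gamma=1.0):
    N = defaultdict(int)      # visit counts N(s)
    V = defaultdict(float)    # value estimates V(s)
    for episode in episodes:
        G = 0.0
        first_returns = {}    # return from the FIRST visit of each state
        for state, reward in reversed(episode):
            G = gamma * G + reward
            first_returns[state] = G   # earlier visits overwrite later ones
        for state, G_s in first_returns.items():
            N[state] += 1
            V[state] += (G_s - V[state]) / N[state]   # incremental mean
    return V

episodes = [[('s0', 1.0), ('s1', 0.0)],
            [('s0', 3.0), ('s1', 0.0)]]
V = first_visit_mc(episodes)
# returns from s0 are 1.0 and 3.0, so V['s0'] averages to 2.0
```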

Every-Visit MC policy evaluation

Same as First-Visit MC policy evaluation, but if we have loops and end up in the same state s multiple times in the same episode, we perform the steps above on every visit.

Incremental MC

See p. 11 in Lecture 4.

Update is now:


where α = 1/N(S_t) in this case. The above is a general update form which we will see recurring in other methods.

Temporal-Difference Learning


  • TD-target - R_{t+1} + γV(S_{t+1}), i.e. the value we want to update the value of state S_t towards. Thus we update V(S_t) to better account for the value of the next state.
  • TD-error - δ_t = R_{t+1} + γV(S_{t+1}) − V(S_t)


  • Learn directly from experience
  • No knowledge of MDP transitions / rewards, i.e. model-free
  • Learns from incomplete episodes, by bootstrapping (different from MC)
  • Updates a guess towards a guess
  • TD attempts to "encode" a fraction of the value of the next state S_{t+1} into the value of S_t incrementally

Update value V(S_t) towards the estimated return R_{t+1} + γV(S_{t+1})



The difference between the MC and TD approaches is that MC considers the actual return G_t obtained from a (complete) episode, while TD uses an estimated return at the next step to obtain the value for the current step.

TD is thus a much more flexible approach than MC.

Check out p.15 in Lecture 4 for a really nice example of the difference!
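A single TD(0) backup can be sketched as below; the states, reward, and step-size are invented:

```python
# TD(0) update sketch: move V(S_t) a step alpha towards the TD-target
# R_{t+1} + gamma * V(S_{t+1}); the step is alpha times the TD-error.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return V

V = {'A': 0.0, 'B': 0.5}
td0_update(V, 'A', r=1.0, s_next='B')
# V['A'] moved from 0 towards the target 1.0 + 0.5 = 1.5, landing at 0.15
```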


Bias / Variance between MC and TD

  • Return G_t is an unbiased estimate of v_π(S_t)
  • If we had the true value function, the TD-target would also be unbiased, BUT it ain't so!
  • TD-target R_{t+1} + γV(S_{t+1}) is a biased estimate of v_π(S_t), but has lower variance
  • MC's evaluation depends on many random actions, transitions, and rewards
  • TD's evaluation depends only on one random action, transition, and reward

Pros and cons

  • MC has high variance, zero bias
    • Good convergence properties
      • (also for function-approximations)
    • Not very sensitive to initial value
    • Minimizes the mean-squared error (p. 24 Lecture 4)
    • Ignores the Markov property
      • Slower when Markov property is present
      • Works even though there is no Markov property
    • Uses sampling
  • TD has low variance, some bias
    • Usually more efficient than MC
    • TD(0) converges to v_π(s)
      • (not always for function-approximations)
    • More sensitive to initial value
    • Converges to a solution of maximum likelihood Markov model (p. 24 Lecture 4)
    • Takes advantage of the Markov property
      • Faster when this is present
      • Does not work as well when Markov property is absent
    • Uses bootstrapping AND sampling

DO checkout p. 26-28 in Lecture 4! It shows some great visualizations of the different methods and how they do look-aheads / backups.


  • Sweet-spot between TD- and MC-learning
  • What if, instead of just considering the very next reward, we consider the n next rewards!
  • But we would also like to weight the more immediate reward estimates more heavily, compared to one found n steps into the future.

Let's define the n-step return as :


The update is then:


And we get, as Hannah Montana would say, the best of both worlds! (TM)

The issue now is to find the optimal n.

To handle this, we would like to somehow take the average reward over the different n. And so TD(λ) is born!

Forward-view TD(λ)

  • The λ-return G_t^λ combines all n-step returns G_t^(n)
  • Using weight (1 − λ)λ^(n−1)


  • Forward-view TD(λ)


  • Why geometric weighting, and not some other weighting?
    • Computationally efficient
    • Stateless => no more expensive than TD(0)
  • Issue is now that we have to wait until the end of the episode (since we're summing to infinity), which is the same problem we had with MC
    • Luckily, by taking on an additional view of the problem, backward-view TD(λ), we can update from incomplete episodes

Backward-view TD(λ)

  • Keep an eligibility trace E_t(s) for every state s
  • Update the value V(s) for every state s
  • The update happens in proportion to the TD-error δ_t and the eligibility trace E_t(s)


Eligibility traces

Eligibility traces combines most frequent and most recent states:


My intuition regarding this is as follows:

  • Gives us a way of measuring "responsibility" for the TD-error
  • If we at some time-step t end up in state s, and at a later time-step we end up in the same state s again, we can view the eligibility trace of s as measuring how much of the error in the value function is due to what we saw those steps back.

See p.40 in Lecture 4 for an example.
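The backward-view machinery might be sketched like this, using accumulating traces (states, rewards, and parameters invented):

```python
# Backward-view TD(lambda) sketch: each step, decay every trace by
# gamma * lambda, bump the trace of the visited state, then update ALL
# states in proportion to trace * TD-error (accumulating traces).

def td_lambda_step(V, E, s, r, s_next, alpha=0.1, gamma=1.0, lam=0.9):
    for state in E:
        E[state] *= gamma * lam           # decay all traces
    E[s] += 1.0                           # accumulate for the visited state
    delta = r + gamma * V[s_next] - V[s]  # TD-error
    for state in E:
        V[state] += alpha * delta * E[state]
    return delta

V = {'A': 0.0, 'B': 0.0, 'C': 0.0}
E = {'A': 0.0, 'B': 0.0, 'C': 0.0}
td_lambda_step(V, E, 'A', r=0.0, s_next='B')  # delta = 0, nothing changes
td_lambda_step(V, E, 'B', r=1.0, s_next='C')  # delta = 1, A gets credit too
# V['B'] moves by alpha * delta = 0.1; V['A'] by alpha * delta * 0.9 = 0.09
```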

Replacing Eligibility Trace method

The concept is simple; instead of adding the weighted occurrence to the trace, we simply replace it. This way the largest possible contribution we can have is 1.

This is especially useful for cases where cyclic state-transitions are possible, as multiple occurrences of the same state can otherwise lead to excessively large contributions to this state.

Take this example:

  • Start at position 0
  • Reward is at position 1
  • Try to move left (opposite of reward) 100000 times, staying in position 0 since there is a wall to our left
  • Finally take a right
  • Receive reward

In this case position 0 would get large amounts of credit even though it might not be the best state to be in.

This can also often lead to integer-overflow unless other precautionary measures are taken.

For completeness:


is used instead of


Backward-view and forward-view equivalence

If you have a look at p.43 in Lecture 4 and on, there is some stuff about the fact that these views are equivalent.

  • TD(0) is simply the Simple TD-learning we started out with:


  • TD(1) defers updates until the episode has terminated:


    This is not actually equivalent to the MC view, just sayin'.


  • Forward-view provides us with a "theoretical" view, giving us an intuition


  • Backward-view provides us with a practical way to update reinforcement_learning_146ab3f3ee4de25cf1575892b075d09e1b9836da.png for some state reinforcement_learning_aa1698cb8ee1665238ec3e91824191643c62ee93.png where the error is now a function of the previous updates of reinforcement_learning_146ab3f3ee4de25cf1575892b075d09e1b9836da.png instead of the future updates, as in the forward view.


  • Backward-view provides us with a practical way to update some state s at any time-step t by saying that this state depends on a combined weight of the steps since we last updated V(s) and the frequency of state s. If we let reinforcement_learning_3cd0a0b6c3e431c3b0a946f3e56f53429ea0af45.png, i.e. consider reinforcement_learning_d16449a0e660caed498740138ecec73651f262bd.png, it turns out that the forward-view and backward-view are equivalent.

I must admit that I'm sort of having a hard time understanding how the forward-view and backward-view are in fact equivalent.

I can understand the intuition behind both views, but it does not seem to be the same intuition. Forward-view:

  • TD-error is the difference between the current V(s) and the accumulated rewards G_t^(n) at all steps n in the future


  • Considers all previous updates to V(s) for the state s


  • The error between V(s) now and at some point in the future is of course zero if we never actually see state s again. This also makes sense in the backward-view, where at some "future" point we haven't seen s since some time t in the past.

Model-free Control - policy-optimization



  • Uses TD-learning to learn Q(S, A)
  • Updates to


  • Eligibility trace runs all the way back => how do we deal with loops?
    • Cycles of an identical state-action pair (s, a) will cause updates which lower Q(s, a) until it is no longer the preferred pair.
    • But what if (s, a) is in fact the best state-action pair, but also has the largest probability of ending up in s again; then the above point would be highly undesirable!
      • This is especially a problem in environments where we don't get a reward until the episode terminates
    • In this case, lowering the trace-decay parameter λ might help to mitigate this (I think o.O)



  • Attempts to estimate a target policy π and use it to then estimate V(s) or Q(s, a)


Notice the difference between Sarsa and Q-learning! S_{t+1} is observed in both; in Sarsa, A_{t+1} is the action actually taken, while in Q-learning A_{t+1} is predicted from Q (the maximizing action).
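To make the distinction concrete, here is a minimal sketch of the two updates on a dict-based Q-table (all names and numbers are mine):

```python
# Sarsa vs. Q-learning update sketch; Q is a dict keyed by (state, action).

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    # on-policy: bootstrap from the action A' actually taken next
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=1.0):
    # off-policy: bootstrap from the maximizing action at S'
    best = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

Q_sarsa = {('s', 'x'): 0.0, ('t', 'x'): 1.0, ('t', 'y'): 2.0}
Q_qlearn = dict(Q_sarsa)
# same transition, same reward; Sarsa happened to take 'x' next
sarsa_update(Q_sarsa, 's', 'x', 1.0, 't', 'x')           # target 1 + 1 = 2
q_learning_update(Q_qlearn, 's', 'x', 1.0, 't', ['x', 'y'])  # target 1 + 2 = 3
```

Sarsa moves halfway towards 2 (to 1.0), Q-learning halfway towards 3 (to 1.5): same data, different bootstrap.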

Value-function approximation

Incremental methods

Value function approximation

  • λ-return G_t^λ is a biased sample of the true value v_π(s)
  • Apply supervised learning to "training data"


  • Forward-view for a linear TD(λ)


  • Backward-view for a linear TD(λ)
    • The eligibility trace is now of the same dimension as our feature-space!
    • That is, instead of "counting" the occurrences of the states, we now "count" the features of the states


Action-value function approximation

Evaluation / prediction


Want to minimise the mean-squared error (MSE):


Using SGD to find local minimum:


  • When to use what for prediction

    See p.30 in Lecture 6 for a sweet table regarding when we have guaranteed convergence when using function approx. for policy evaluation.


Simply take what we found in Evaluation / prediction and substitute a target for reinforcement_learning_17890a8637e6d6584733b472fd0eb105cbd65f27.png

  • Monte-Carlo

    Target is return at time reinforcement_learning_a270d2f0aac3268943aea13ffd8c0dd862578f2e.png, reinforcement_learning_0fef2aab52a647b93cd0f0f9432ddb6e293a9b02.png:


  • TD(0)

    Target is reward + estimated reward given we take optimal actions in the future:


  • TD(λ)
    • Forward-view

      Target is simply reinforcement_learning_63ce674ae757a2e2d5a6b34c78d843e550684d40.png, thus:


      where reinforcement_learning_5df8498859c7796c53b127616e4054cc1b6cdab5.png and reinforcement_learning_c47c74204a1833f5e1190e8c1df8363286e9511a.png.

    • Backward-view


Batch methods

The idea is to "re-use" data, or rather, make the most of the data we have, instead of updating once with each data point as previously described.

Experience replay

See p.35 in Lecture 6.

But it's basically just this:

  1. Sample reinforcement_learning_9be9f08dfc4043e198aa2a1266ccc4142a552ffb.png from our observations reinforcement_learning_c52f560606936609b5cf850b7eea2582f5087c10.png
  2. Apply SGD update reinforcement_learning_6e848f284a644a06c0bf580d77aa80fbe23bdfa0.png
  3. Converges to least squares solution reinforcement_learning_2c7ca9bb7693e7e63d7f1301eb0cf0d86cc30ee8.png
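The replay loop above might be sketched as follows (buffer API and data invented; the SGD step itself is omitted):

```python
import random

# Experience-replay sketch: store transitions as they arrive, then sample
# minibatches uniformly at random, which decorrelates consecutive samples.

class ReplayBuffer:
    def __init__(self, capacity=1000):
        self.data = []
        self.capacity = capacity

    def push(self, transition):                  # e.g. (s, r, s_next)
        self.data.append(transition)
        self.data = self.data[-self.capacity:]   # drop oldest when full

    def sample(self, k):
        return random.sample(self.data, k)       # uniform, without replacement

buf = ReplayBuffer()
for i in range(10):
    buf.push((i, float(i), i + 1))
batch = buf.sample(4)   # feed each sampled transition to an SGD update
```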

Deep Q-Networks (DQN)

See p.40 or so in Lecture 6 for this awesomesauce. Uses

  • Experience replay
    • Randomization decorrelates the data => good!
  • Fixed Q-targets
    • Freeze the network, and then during the next training-cycle bootstrap towards the frozen network (just the old parameters)
    • Every n steps we freeze the parameters, and for the next n steps we use these frozen parameters' predictions as targets. Then we freeze our new parameters, and repeat.
    • More stable updates

The points above are what make this non-linear function approximation stable.

What they did in the Atari paper is to have the ConvNet output one value per possible action. This allows easy maximization over the actions, and is not too hard to model as long as the action-space isn't too large.

Linear Least Squares prediction

See p. 43 or so for a closed form solution to the linear case. For prediction, but something similar exists for control too.

Policy gradient methods


  • reinforcement_learning_4c7d29c87bbf1548950bf3f61aa46a8af1b14186.png is any policy objective function
  • reinforcement_learning_29f8e362292ee673a556b0da94863ef784869b93.png is the policy update


Previously we've considered the prediction and control of the value function, but now we're instead going to directly model the optimal policy! That is, we directly parametrise


To be able to optimize the policy, we first need to come up with some policy objective function.

Policy Objective Functions

Goal: given a policy π_θ with parameters θ, find the best θ

Need to measure quality of reinforcement_learning_db2ebf6a22a442cb4166924a51473fb77fe82929.png:

  • Episodic environment with some initial state, we use the start-value:


    i.e. judge the policy by the expected reward received when starting in state reinforcement_learning_52d25ed38af35e1d6acd665dc3da709e5ffd6451.png

  • Continuing environments we can use the average value:


  • Or average reward per time-step


    which says "there is some probability of being in state reinforcement_learning_aa1698cb8ee1665238ec3e91824191643c62ee93.png, under our policy there is some probability I choose action reinforcement_learning_3778a663528a3e4db5ce811ed0c6cd9f5831579f.png in this state, and then I receive reward at this step". So the above is for each time-step.

    Where reinforcement_learning_5d29ad9f8bc56ab71c17f04f8e29f27b15fc8338.png is a stationary distribution of a Markov Chain, i.e. what is the probability of being in state reinforcement_learning_aa1698cb8ee1665238ec3e91824191643c62ee93.png.

Comparison with Value-based methods


  • Better convergence properties
  • Effective in high-dimensional or continuous action space
    • This is due to the fact that in value-based methods we need to maximize over the actions to obtain the "value" for a next state. With high-dimensional or continuous action space it might be very hard to perform this maximization
  • Can learn stochastic policies
    • Stochastic policies simply mean policies which define a probability distribution over the actions rather than a deterministic choice. Very useful when the environment itself is stochastic (i.e. not deterministic).
    • An example is "Rock-Paper-Scissors", where the optimal behavior is a uniform distribution over the actions.


  • Typically converge to local optimum rather than global
  • Policy evaluation is usually inefficient using Policy Gradient Methods, compared to using Value-function methods.

Finite Difference Policy Gradient

  • To evaluate policy gradient of reinforcement_learning_82d180ddc83cfa10a3c9c644727b0df695b451a1.png
  • Does so through the definition of the derivative for each dimension (see p. 13 in Lecture 7 for details)
  • Noisy and inefficient in high dimensions

Monte-Carlo Policy Gradient

  • Allows analytic gradients
  • Assume policy π_θ is differentiable whenever it is non-zero
  • And know the gradient
  • Score function:


    where we simply ignore the reinforcement_learning_82d180ddc83cfa10a3c9c644727b0df695b451a1.png term from the Likelihood ratios due to the fact that optimizing wrt. reinforcement_learning_db6b1dd2e5cb7a8068505a3b5e8d59c58ac1710d.png also optimizes reinforcement_learning_b4625aa4535b6123d93c130d4dc4772a395916cd.png.

Likelihood ratios


Simply due to the reinforcement_learning_7ae62577598decb42b9a5a72d54c98369cd4b30b.png

Example score-function: Softmax Policy

  • Actions weighted by linear combination of features reinforcement_learning_295be6daff4d9f306c67c6bcf1a895c3d47c2ed2.png
  • Assume reinforcement_learning_e6b17958d2f42bafa08b740b46c41c43c77470c7.png
  • Score function is then


Example score-function: Gaussian Policy

  • reinforcement_learning_376a156afa46032a57dfbbb499c8eab66d76b7e3.png
  • Policy is Gaussian, reinforcement_learning_6ea6dfeab062c7729bcbdea66abe68af77f176f2.png
  • Score function is then


One-Step MDPs

  • Starting in state reinforcement_learning_11fcb81546c9896f8202ffe1618ac5c26a97faa7.png
  • Terminating after one time-step with reward reinforcement_learning_9de99b1439c1e66ce9beb909c69f2d7403d8dbcb.png

See p. 19 in Lecture 7 for how to compute the policy gradient, OR try it yourself from the objective functions described earlier!


Policy Gradient Theorem

Simply changes the immediate reward reinforcement_learning_17bffe02d589bbbc96f6efbdbbdd12efb5d9607b.png in the one-step MDPs with the long-term value reinforcement_learning_b773b07d1f344a9cd4ac97798812df0fb7e54f91.png.

For any differentiable policy reinforcement_learning_bf1a9a8b11db54ab1f9f579b351acbd80caae863.png, for any of the policy objective functions reinforcement_learning_62c8f65fdcb50de7b62a7c8bd036b9ed2accabe1.png, the policy gradient is


The algorithm

  • Update parameters by SGD
  • Using policy gradient theorem
  • Using return reinforcement_learning_72acfca838ae0e086c21cdd438965d0bfef58825.png as an unbiased sample of reinforcement_learning_84ae5b874b2bda3a95d2c59f8431fc16a6b35d3b.png


Remember this is a MC => need to sample entire episode to learn

See p.21 from Lecture 7 for pseudo-code.
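On a one-step problem the whole algorithm fits in a few lines. A sketch on a 2-armed bandit with a softmax policy (the arms, payoffs, and step-size are invented); the score function is ∇log π(a) = one_hot(a) − π:

```python
import math, random

# REINFORCE sketch: sample an episode (here a single pull), then update
# theta += alpha * G * grad log pi(a).  Arm 1 pays 1, arm 0 pays 0.
random.seed(0)
theta = [0.0, 0.0]

def pi(theta):
    z = [math.exp(t) for t in theta]      # softmax over action preferences
    total = sum(z)
    return [p / total for p in z]

def reinforce_step(theta, alpha=0.1):
    probs = pi(theta)
    a = 0 if random.random() < probs[0] else 1
    G = 1.0 if a == 1 else 0.0            # return of the one-step episode
    for i in range(2):                    # score function: one_hot(a) - probs
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * G * grad
    return theta

for _ in range(500):
    reinforce_step(theta)
# the policy should now strongly prefer the rewarded arm
```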

Actor-Critic Policy Gradient

  • Actor-Critic learns both value function and policy
  • MC policy gradient still has high variance
  • Use critic to estimate the action-value function reinforcement_learning_394811fb6918c94e3a4d97f9fcd8510ac7064572.png
  • Actor-critic algorithms maintain two sets of parameters
    updates action-value function parameters reinforcement_learning_c5cd2e111c21decdec253f54b39b5789e0778151.png
    updates policy parameters reinforcement_learning_7bec91615d02a5342b34270ae4a669947491ecaf.png in direction suggested by critic
  • That is; the critic handles estimating Q_w(s, a), and the actor then asks the critic for the value of the state-action pair just observed / taken, and updates the policy-parameters θ according to this value


    So we have exactly the same update, but now replacing the long-term action-value function with an approximation.

Estimating the action-value function

Want to know how good the current policy π_θ is for the current params θ, hence we need to do policy evaluation

  • Monte-Carlo policy evaluation
  • Temporal-Difference evaluation
  • TD(λ)


See the pseudo-code on p. 25 in Lecture 7.

But sort of what it does is (at each step):

  1. (actor) Sample reinforcement_learning_e243eece2a51dbd6a60419dce2956dc12028bc59.png; sample transition reinforcement_learning_a7ef13697ea599ba95d2a1b31a2a38101842a352.png
  2. (actor) Sample action reinforcement_learning_6f29ca9d09a393f09aa2f84adcfccf1c1181bc2c.png
  3. (critic) Compute TD-error reinforcement_learning_289710aa65821156a171102b1f4e654fa76a8a3a.png
  4. (actor) Update policy-parameters reinforcement_learning_7bec91615d02a5342b34270ae4a669947491ecaf.png
  5. (critic) Update weights for reinforcement_learning_e6408d491fd517e54e1c24726caaf4c466ed70ad.png
  6. (actor) reinforcement_learning_c8cb52e455323a960d02ae5bc165355907beafe6.png
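The per-step loop above might be sketched as follows, with a tabular critic and a softmax actor (all names and numbers are my own invention):

```python
import math

# One actor-critic step: critic computes the TD-error and updates its
# weights; actor nudges its preferences along delta * grad log pi(a|s).
# Critic here is a simple table w[s]; actor keeps per-state preferences.

def softmax_probs(prefs):
    z = [math.exp(p) for p in prefs]
    total = sum(z)
    return [x / total for x in z]

def actor_critic_step(theta, w, s, a, r, s_next,
                      alpha_w=0.1, alpha_t=0.1, gamma=1.0):
    delta = r + gamma * w[s_next] - w[s]     # (critic) TD-error
    w[s] += alpha_w * delta                  # (critic) update value weights
    probs = softmax_probs(theta[s])
    for i in range(len(theta[s])):           # (actor) theta += a*delta*score
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[s][i] += alpha_t * delta * grad
    return delta

w = {'A': 0.0, 'B': 1.0}
theta = {'A': [0.0, 0.0]}
delta = actor_critic_step(theta, w, 'A', 0, 1.0, 'B')
# delta = 1 + 1 - 0 = 2: the critic raises V(A), the actor raises
# the preference for the action it just took (and lowers the other)
```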

Compatible Function Approximation

If the following two conditions are satisfied:

  1. Value function approximator is compatible to the policy


  2. Value function parameters reinforcement_learning_c5cd2e111c21decdec253f54b39b5789e0778151.png minimise the MSE


Then the policy gradient is exact,


Reducing variance using a Baseline

  • Subtract a baseline function reinforcement_learning_9b565a8de83df7a5c79da99f07430bd9d97dd50b.png from the policy gradient
  • Can reduce variance without changing expectation
  • Basically "centering" the estimate (subtracting the mean)

See p. 30 in Lecture 7 for a proof of why it doesn't change the expectation. Basically, we can add or subtract anything we want from reinforcement_learning_4311054e79bbf5d968c08b2cac52b7cdcce74e03.png as long as it only depends on the state reinforcement_learning_aa1698cb8ee1665238ec3e91824191643c62ee93.png.

We're left with something called the advantage function:
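A quick Monte-Carlo check, with a made-up softmax policy and made-up action values, that subtracting a state-dependent baseline (here b(s) = V(s)) leaves the expected policy-gradient term unchanged while reducing its variance:

```python
import numpy as np

rng = np.random.default_rng(1)

# fixed softmax policy over 3 actions in a single state
prefs = np.array([0.5, 0.1, -0.2])
p = np.exp(prefs) / np.exp(prefs).sum()

# "true" action values Q(s, a) and a baseline b(s) = V(s) = E_pi[Q(s, a)]
Q = np.array([1.0, 2.0, 3.0])
b = p @ Q

def grad_log_pi(a):
    # gradient of log softmax wrt. preferences: one-hot(a) - p
    g = -p.copy()
    g[a] += 1.0
    return g

# Monte-Carlo samples of grad log pi(a|s) * (Q(s, a) - baseline)
samples_no_b, samples_b = [], []
for _ in range(20000):
    a = rng.choice(3, p=p)
    g = grad_log_pi(a)
    samples_no_b.append(g * Q[a])
    samples_b.append(g * (Q[a] - b))

samples_no_b, samples_b = np.array(samples_no_b), np.array(samples_b)

# same mean (up to sampling noise), smaller variance with the baseline,
# since E[grad log pi(a|s)] = 0 for anything that only depends on s
print(samples_no_b.mean(axis=0), samples_b.mean(axis=0))
print(samples_no_b.var(axis=0).sum(), samples_b.var(axis=0).sum())
```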


Estimating Advantage function

There are multiple ways of doing this.

See p. 30 in Lecture 7 and on for some ways. The most used one is nr. 2.


Policy gradient has many equivalent forms:


All these methods are basically saying: move the policy in the direction that gets us more value.

  • Each leads to an SGD algorithm
  • Critic uses policy evaluation (e.g. MC or TD learning) to estimate reinforcement_learning_409cec3a68dab8e7c74f3b760b0847cef1b69ef7.png, reinforcement_learning_9957d18e326b1fa2f0df1ab7e703ef096572fc68.png or reinforcement_learning_e6d28c027d6534a5db0a57d37d4a2f9c61c1822d.png

Thompson sampling


  • reinforcement_learning_af965b7c370799d1ad98bfb7656d4f5de52ff557.png actions
  • reinforcement_learning_20bf8c8f9aa17f4ae1669c8f584b1d5121692223.png are the probabilities of success for the corresponding action / arm
  • reinforcement_learning_0305070071dcf619cd3cf4a112256568ac67bdc9.png denotes a graph with vertices reinforcement_learning_d8c37e5ba495937f7b39eb579d9b3e9149a5f8d2.png and edges reinforcement_learning_8c315cc84bce367469e49139a879ed40c453d0e6.png
  • Historical data


  • reinforcement_learning_e852e95650b13ec741f35a87f9acddee57eda6d9.png denotes reward obtained by taking action reinforcement_learning_578e511877b566bc3a636bfdf9bd0c876fd18085.png leading to observed outcome reinforcement_learning_f77ccdb170b4f71364cc9ec3ff29643df7cea01e.png
  • Per-period regret is the difference between the expected reward of the optimal action and the reward received from the current action. In the armed bandit problem:


  • reinforcement_learning_695fc0b25bd7fdfbcf525dde41cfc8c12f4bb5e1.png is the conditional distribution of the system (i.e. what we want to learn)
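As a concrete sketch, per-period regret in a Bernoulli bandit with made-up success probabilities:

```python
import numpy as np

# true success probabilities (unknown to the agent; made up for illustration)
theta = np.array([0.9, 0.8, 0.7])

def per_period_regret(action):
    # expected reward of the optimal action minus the
    # expected reward of the action actually taken
    return theta.max() - theta[action]

# playing arm 1 instead of the optimal arm 0 costs 0.1 in expectation
print(per_period_regret(1))
```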

Armed bandit

  • Suppose there are reinforcement_learning_1641d18cc980f8db14cdff95d7417a8526eef446.png actions, and when played, any action yields either a success or failure.
  • Action reinforcement_learning_af965b7c370799d1ad98bfb7656d4f5de52ff557.png produces a success with probability reinforcement_learning_14d6fc937107eb0bf4de6740e39129e6f9115f34.png.
  • The success probabilities reinforcement_learning_20bf8c8f9aa17f4ae1669c8f584b1d5121692223.png are unknown to the agent, but are fixed over time, and can therefore be learned by experimentation
  • Objective: maximize number of successes over reinforcement_learning_d1f6bb1b651fd7c16aa8c86a437c709203aeb4a3.png periods, where reinforcement_learning_d1f6bb1b651fd7c16aa8c86a437c709203aeb4a3.png is relatively large compared to the number of arms reinforcement_learning_1641d18cc980f8db14cdff95d7417a8526eef446.png


  • Posterior sampling and probability matching
  • Methods such as reinforcement_learning_fb721c565cefdc5f2cf5e7f3c1c4b2f103f8dcd0.png attempt to balance exploration by choosing a random action with probability reinforcement_learning_224ae917d74dc2133d4403064c971bf562d4db50.png and a greedy action with probability reinforcement_learning_5bb9f53fbd2b2d8f3ef0de568e7d5b0e413cd26d.png
    • Inefficient since it will sample actions regardless of how unlikely they are to be optimal; Thompson sampling attempts to address this
  • Useful for online decision problems


The difference between a greedy algorithm and Thompson sampling is that a greedy algorithm would use the expectation over the current estimate of reinforcement_learning_7225b076f6e6326f1636b11d1aad8de58bcc4761.png to obtain the new estimate of reinforcement_learning_7bec91615d02a5342b34270ae4a669947491ecaf.png, while Thompson sampling samples the new estimate of reinforcement_learning_7bec91615d02a5342b34270ae4a669947491ecaf.png.
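A minimal Beta-Bernoulli Thompson sampling sketch (arm probabilities, prior and horizon are made up). A greedy agent would replace the posterior sample with the posterior mean `alpha / (alpha + beta)`:

```python
import numpy as np

rng = np.random.default_rng(2)

# true success probabilities (hidden from the agent; made up for illustration)
theta = np.array([0.9, 0.8, 0.7])
K, T = len(theta), 2000

# Beta(1, 1) prior over each arm's success probability
alpha, beta = np.ones(K), np.ones(K)
counts = np.zeros(K, dtype=int)

for _ in range(T):
    # Thompson step: sample theta_hat from the posterior
    # and act greedily wrt. the *sample*, not the mean
    theta_hat = rng.beta(alpha, beta)
    a = int(np.argmax(theta_hat))

    r = int(rng.random() < theta[a])  # Bernoulli reward
    counts[a] += 1

    # conjugate posterior update
    alpha[a] += r
    beta[a] += 1 - r
```

As the posteriors concentrate, the sampling almost always picks the best arm, so exploration shrinks automatically for arms that are unlikely to be optimal.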


Chapter 2

Stationary Problem

  • N-armed bandit problem
  • Action-rewards do not change over time but stay constant
  • All we have to do then is model the average reward for some action, i.e.


where reinforcement_learning_3811540d06d2ee3c1c0a193cd54776b1509172aa.png is the number of times we've chosen action reinforcement_learning_3778a663528a3e4db5ce811ed0c6cd9f5831579f.png up until step reinforcement_learning_a270d2f0aac3268943aea13ffd8c0dd862578f2e.png

And of course, practically it's better to compute this using the recursive formula for the mean:


where reinforcement_learning_36ac9621ccbe6c1380dc45f1bb3cc0100683a5f8.png is the value for some action reinforcement_learning_3778a663528a3e4db5ce811ed0c6cd9f5831579f.png after choosing and receiving reward from this action reinforcement_learning_ad63c81b09a45c2afd6c497ff38a542e26c1423a.png times.
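As a quick sanity check that the recursive form gives exactly the batch mean without storing past rewards (the rewards here are made up):

```python
import numpy as np

rewards = np.array([1.0, 0.0, 2.0, 1.5, 0.5])  # made-up rewards for one action

# Q_{k+1} = Q_k + (1 / (k + 1)) * (r_{k+1} - Q_k)
Q = 0.0
for k, r in enumerate(rewards):
    Q += (r - Q) / (k + 1)

print(Q, rewards.mean())  # identical (up to float rounding)
```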

Non-stationary Problem

  • Makes sense to weight the most recent observations more heavily than "old" ones
  • We write reinforcement_learning_216baa1dc1da1eb3f1765b89622830a7cef9cccc.png instead of reinforcement_learning_7fdcc6f20ac03aacd98a98ec088da7446dfadc98.png to be a bit more general
  • By recursively expanding we obtain


This is called the weighted average, due to the fact that reinforcement_learning_943ffa45daa3bd2bebfcd559a0fd1c73540f0510.png (can be proven by induction).
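A small numeric check that the recursive constant-α update matches the unrolled weighted average, and that the weights (including the one on the initial value) sum to 1. Values here are made up:

```python
import numpy as np

alpha = 0.1
rewards = np.array([1.0, 0.0, 2.0, 1.5, 0.5])  # made-up rewards
k = len(rewards)

# recursive update with constant step size, starting from Q_0 = 0
Q = 0.0
for r in rewards:
    Q += alpha * (r - Q)

# unrolled form: Q_k = (1 - alpha)^k Q_0 + sum_i alpha (1 - alpha)^(k - i) r_i
weights = alpha * (1 - alpha) ** (k - 1 - np.arange(k))
Q_closed = weights @ rewards  # the Q_0 term vanishes since Q_0 = 0

print(Q, Q_closed)
# the reward weights plus the (1 - alpha)^k weight on Q_0 sum to exactly 1
print(weights.sum() + (1 - alpha) ** k)
```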


2.3 - N-Armed bandit ɛ-greedy comparison

Consider the following learning problem. You are faced repeatedly with a choice among n different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.

The problem is as follows:

  • Choose between n actions at each step t
  • Action-rewards are constant wrt. t
  • Maximize expected / avg. reward over 1000 steps t

To get an idea of how well the algorithms perform in general, we run the above for 2000 different action-rewards, i.e. 2000 differently parametrized environments.

# don't want this tangled
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# n-Armed Bandit
# each environment has 10 actions (bandits), and we have 2000 such problems
# allows us to see how algorithms perform over different hidden action-values
n_steps = 1000
n_problems = 2000
armed_testbed = np.random.randn(10, n_problems)
actions = np.arange(10)
steps = np.arange(n_steps)

# we compute the value of action `a` at step `t`
# by the accumulated reward from the times we've
# chosen `a` up until (incl.) `t` and then divided
# by the number of times we've chosen this `a`.
# Basically, the average reward of `a` at `t`
# wrt. times we've chosen `a`.

def epsilon_greedy(Q, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(actions.shape[0])
    return np.argmax(Q)

def run_testbed(choose_action, *args, **kwargs):
    rewards = np.zeros((n_steps, armed_testbed.shape[1]))

    for j in range(armed_testbed.shape[1]):
        environment = armed_testbed[:, j]

        # initialization
        Q = np.zeros(10)
        action_freq = np.zeros_like(actions, dtype=np.float64)
        action_reward = np.zeros_like(actions, dtype=np.float64)
        reward = 0

        for t in steps:
            a = choose_action(Q, *args, **kwargs)
            # reward with some noise
            r = environment[a] + np.random.randn()
            # update count and reward for the selected action `a`
            action_freq[a] += 1
            action_reward[a] += r
            # update value for action `a`
            Q = action_reward / (action_freq + 1e-5)

            # accumulate reward
            reward += r
            # average reward up to and including step `t`
            rewards[t, j] = reward / (t + 1)

    # take the avg for each time-step over all problems
    return np.mean(rewards, axis=1)

And let's run the testbed for different ε in parallel.

import multiprocessing as mp
from functools import partial

f = partial(run_testbed, epsilon_greedy)

with mp.Pool(3) as p:
    results = p.map(f, [0, 0.01, 0.1])

Finally we create the plot!

steps = np.arange(1000)

fig = plt.figure(figsize=(14, 5))
plt.plot(steps, results[0], label="greedy")
plt.plot(steps, results[1], label="ε-greedy 0.01")
plt.plot(steps, results[2], label="ε-greedy 0.1")
plt.title("2000 different 10-bandit problems")
plt.xlabel("Steps")
plt.ylabel("Expected $R_t$")
plt.legend()


synchronous updates
update all states / values / whatever one after the other
policy evaluation (prediction)
given a full MDP and some policy reinforcement_learning_9e283f0a770977b9c5776238e85e4b1f2406cdd4.png, find value function reinforcement_learning_cc16ef065f79b334a22ca4b4f1fffe686a371aab.png for this policy
control
given a full MDP, find optimal (wrt. reward) value function reinforcement_learning_bcd7085e11dafe2b3719b51098645ad986aeecaf.png and policy reinforcement_learning_f9e4ccd79ed953db8a53115ecb9d892df620d5d6.png
backup
updating the value of a state using values of future states; the state reinforcement_learning_aa1698cb8ee1665238ec3e91824191643c62ee93.png is simply the state we "base" our update on for some state reinforcement_learning_af32ed5b9ee28158c363bc90a03b398ddb2da9a3.png
full backup
we perform backup for all states
sweep
consists of applying a backup operation to each state
bootstrapping
update involves an estimate (not the real returns / rewards)
sampling
update samples an expectation
stochastic policy
policy which uses a distribution over actions rather than deterministic behavior.


  • Karpathy's "Pong from pixels". Details how to implement a policy-gradient RL-algorithm using a ConvNet for function-approximation to learn to play Pong from the raw image input.