Notes on: Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., … (2015): Continuous Control With Deep Reinforcement Learning

Notation

  • $E$ denotes the environment
  • $x_t$ denotes the observation at timestep $t$
  • $a_t$ denotes the action taken at timestep $t$
  • $r_t$ denotes the reward received at timestep $t$
  • $s_t = (x_1, a_1, \dots, a_{t-1}, x_t)$ represents the history, i.e. the sequence of observation-action pairs
  • $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$ represents the agent's (stochastic) behaviour policy
  • $\mu : \mathcal{S} \to \mathcal{A}$ denotes a deterministic policy, often written $\mu(s \mid \theta^\mu)$ when it is a policy function with learnable parameters $\theta^\mu$
  • $J = \mathbb{E}_{r_i, s_i \sim E,\, a_i \sim \pi}\left[ R_1 \right]$ is the expected return from the start distribution
  • $\rho^\pi$ denotes the discounted state visitation distribution for a policy $\pi$
  • Action-value function, which describes the expected return after taking an action $a_t$ in state $s_t$ and thereafter following policy $\pi$ (the return $R_t$ is defined after this list):

    $Q^\pi(s_t, a_t) = \mathbb{E}_{r_{i \ge t},\, s_{i > t} \sim E,\, a_{i > t} \sim \pi}\left[ R_t \mid s_t, a_t \right]$

  • Bellman equation:

    $Q^\pi(s_t, a_t) = \mathbb{E}_{r_t,\, s_{t+1} \sim E}\left[ r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\left[ Q^\pi(s_{t+1}, a_{t+1}) \right] \right]$

  • $\mu'$ and $Q'$ are copies of the actor and critic networks, respectively (the target networks)
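
The return referenced above is, following the paper, the discounted sum of future rewards with discounting factor $\gamma \in [0, 1]$:

$R_t = \sum_{i=t}^{T} \gamma^{(i-t)}\, r(s_i, a_i)$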

Background

Q-learning

Q-learning uses the greedy policy

$\mu(s) = \arg\max_{a} Q(s, a)$

where we consider function approximators parametrized by $\theta^Q$, which we optimize by minimizing the loss

$L(\theta^Q) = \mathbb{E}_{s_t \sim \rho^\beta,\, a_t \sim \beta,\, r_t \sim E}\left[ \left( Q(s_t, a_t \mid \theta^Q) - y_t \right)^2 \right]$

where

$y_t = r(s_t, a_t) + \gamma\, Q\big(s_{t+1}, \mu(s_{t+1}) \mid \theta^Q\big)$

Note that $y_t$ itself also depends on $\theta^Q$; this dependence is typically ignored.
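
As a rough illustration (not the paper's code: q_learning_loss, q_net, policy and the batch layout are assumed placeholders), a PyTorch-style sketch of this loss in which the target $y_t$ is computed without gradients, so that its dependence on $\theta^Q$ is ignored:

import torch
import torch.nn.functional as F

def q_learning_loss(q_net, policy, batch, gamma=0.99):
    """Sample-based estimate of the loss L(theta^Q) above.

    q_net(s, a) returns Q(s, a | theta^Q) and policy(s) returns mu(s);
    both are assumed placeholder callables. batch holds tensors (s, a, r, s_next).
    """
    s, a, r, s_next = batch
    with torch.no_grad():  # treat y_t as a constant: its dependence on theta^Q is ignored
        y = r + gamma * q_net(s_next, policy(s_next))
    return F.mse_loss(q_net(s, a), y)  # mean of (Q(s_t, a_t | theta^Q) - y_t)^2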

Deterministic policy

If the target policy is deterministic we can describe it as a function $\mu : \mathcal{S} \to \mathcal{A}$ and avoid the inner expectation:

$Q^\mu(s_t, a_t) = \mathbb{E}_{r_t,\, s_{t+1} \sim E}\left[ r(s_t, a_t) + \gamma\, Q^\mu\big(s_{t+1}, \mu(s_{t+1})\big) \right]$

Here the expectation depends only on the environment. This means that it is possible to learn $Q^\mu$ off-policy, using transitions generated from a different stochastic behaviour policy $\beta$.

  • A parametrized actor function $\mu(s \mid \theta^\mu)$ specifies the current policy by deterministically mapping states to a specific action (a minimal sketch follows below)
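
As a sketch only (the layer sizes, hidden width, and use of torch.nn are illustrative assumptions, not the architecture from the paper), such a deterministic actor could look like:

import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s | theta^mu): maps a state to one specific action."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bound actions to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)  # no sampling: the mapping is deterministic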

Replay buffer

The replay buffer is a finite-sized cache $\mathcal{R}$.

  • Transitions were sampled from the environment according to the exploration policy, and the tuple

    $(s_t, a_t, r_t, s_{t+1})$

    was stored in the replay buffer.

  • When the replay buffer was full, the oldest samples were discarded.
  • At each timestep $t$ the actor and critic are updated by sampling a minibatch uniformly from the buffer.
  • Because DDPG is an off-policy algorithm, the replay buffer can be large, allowing the algorithm to benefit from learning across a set of uncorrelated transitions (a minimal buffer sketch follows this list).
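
A minimal sketch of such a buffer (plain Python; the default capacity and batch size are illustrative assumptions, not the paper's settings):

import random
from collections import deque

class ReplayBuffer:
    """Finite-sized cache storing (s_t, a_t, r_t, s_{t+1}) transitions."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # when full, the oldest samples are discarded

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        # uniform minibatch sampling without replacement
        return random.sample(list(self.buffer), batch_size)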

Target network

Algorithm