# Notes on: Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., … (2015): Continuous Control With Deep Reinforcement Learning

## Notation

- $E$: the environment
- $x_t$: observation at timestep $t$
- $a_t$: action at timestep $t$
- $r_t$: reward at timestep $t$
- $s_t = (x_1, a_1, \dots, a_{t-1}, x_t)$: represents the *history*, i.e. state-action pairs
- $\pi$: represents the agent's behaviour, a **(stochastic) policy**
- $\mu$: denotes a **deterministic policy**, often written $\mu(s \mid \theta^\mu)$ if it's a policy-function with learnable parameters $\theta^\mu$
- $J$: the *expected return from the start distribution*
- $\rho^\pi$: denotes the *discounted state visitation distribution* for policy $\pi$
- $Q^\pi(s_t, a_t)$: action-value function, which describes the expected return after taking an action $a_t$ in state $s_t$ and thereafter following policy $\pi$:

  $$Q^\pi(s_t, a_t) = \mathbb{E}_{r_{i \ge t},\, s_{i > t} \sim E,\, a_{i > t} \sim \pi}\left[ R_t \mid s_t, a_t \right]$$

- **Bellman equation**:

  $$Q^\pi(s_t, a_t) = \mathbb{E}_{r_t,\, s_{t+1} \sim E}\left[ r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\left[ Q^\pi(s_{t+1}, a_{t+1}) \right] \right]$$

- $\mu'$ and $Q'$: *copies* of the actor and critic networks, respectively (the target networks)

## Background

### Q-learning

**Q-learning** uses the greedy policy

$$\mu(s) = \arg\max_a Q(s, a),$$

where we consider function approximators parametrized by $\theta^Q$, which we optimize by minimizing the loss

$$L(\theta^Q) = \mathbb{E}_{s_t \sim \rho^\beta,\, a_t \sim \beta,\, r_t \sim E}\left[ \left( Q(s_t, a_t \mid \theta^Q) - y_t \right)^2 \right]$$

where

$$y_t = r(s_t, a_t) + \gamma\, Q(s_{t+1}, \mu(s_{t+1}) \mid \theta^Q).$$

Note that $y_t$ also depends on $\theta^Q$, but this dependency is typically ignored.
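As an illustration, the target $y_t$ and the squared-error loss can be sketched for a discrete-action critic (a minimal sketch; the function names and array shapes are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

def q_learning_targets(rewards, next_q_values, gamma=0.99):
    """y_t = r(s_t, a_t) + gamma * max_a Q(s_{t+1}, a | theta^Q).

    next_q_values has shape (batch, n_actions); taking the max over the
    action axis implements the greedy policy mu(s) = argmax_a Q(s, a).
    """
    return rewards + gamma * next_q_values.max(axis=1)

def q_loss(q_values, targets):
    """Mean squared error between current Q estimates and fixed targets y_t."""
    return np.mean((q_values - targets) ** 2)
```

Treating `targets` as a constant when differentiating the loss is exactly the "dependency of $y_t$ on $\theta^Q$ is ignored" point above.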

### Deterministic policy

If the target policy is **deterministic**, we can describe it as a function $\mu : \mathcal{S} \to \mathcal{A}$ and avoid the *inner* expectation:

$$Q^\mu(s_t, a_t) = \mathbb{E}_{r_t,\, s_{t+1} \sim E}\left[ r(s_t, a_t) + \gamma\, Q^\mu(s_{t+1}, \mu(s_{t+1})) \right]$$

Here the expectation depends only on the environment. This means that it is possible to learn *off-policy*, using transitions generated from a different stochastic behaviour policy $\beta$.

- $\mu(s \mid \theta^\mu)$: the parametrized actor function, which specifies the current policy by deterministically mapping states to a specific *action*
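The deterministic backup above can be sketched with hypothetical `actor` and `critic` callables (the names and signatures are assumptions for illustration):

```python
def deterministic_q_target(reward, next_state, actor, critic, gamma=0.99):
    """y = r(s_t, a_t) + gamma * Q^mu(s_{t+1}, mu(s_{t+1})).

    Because mu is deterministic, there is no inner expectation over
    actions: the actor returns the single action mu(s_{t+1}).
    """
    next_action = actor(next_state)          # mu(s_{t+1})
    return reward + gamma * critic(next_state, next_action)
```

Since the expectation is only over the environment's transitions, this target is valid no matter which behaviour policy generated the transition.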

### Replay buffer

The **replay buffer** is a finite-sized cache $\mathcal{R}$.

Transitions are sampled from the environment according to the exploration policy, and the tuple

$$(s_t, a_t, r_t, s_{t+1})$$

is stored in the replay buffer.

- When the replay buffer is full, the oldest samples are discarded.
- At each timestep the actor and critic are updated by sampling a minibatch uniformly from the buffer.
- Because DDPG is an off-policy algorithm, the replay buffer can be large, allowing the algorithm to benefit from learning across a set of uncorrelated transitions.
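A minimal replay buffer along these lines (a sketch; handling capacity via `collections.deque` is an implementation choice, not something the paper prescribes):

```python
import random
from collections import deque

class ReplayBuffer:
    """Finite-sized cache of transitions; oldest entries are discarded when full."""

    def __init__(self, capacity):
        # deque with maxlen automatically drops the oldest item once full
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling decorrelates the minibatch from trajectory order
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly rather than replaying recent trajectories in order is what breaks the temporal correlation between consecutive transitions.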