# Notes on: Papamakarios, G., Pavlakou, T., & Murray, I. (2017): Masked autoregressive flow for density estimation

## Main idea

## Notation

- **AR** is *autoregressive models*
- **NF** is *normalizing flows*
- **IAF** is *inverse autoregressive flow*
- **MAF** is *Masked Autoregressive Flow*
- **MADE** is *Masked Autoencoder for Distribution Estimation*
- $\pi_u(\mathbf{u})$ is the **base density**, to which the invertible transformation (with its Jacobian correction) is applied

## Overview

- Neural networks for density estimation
- Useful for non-generative cases; compared to variational autoencoders and GANs, these can readily provide exact density evaluations
- Can provide priors in Bayesian settings

## Problem

- Hard to construct models that are flexible enough to represent complex densities, but have tractable density functions and learning algorithms

## Autoregressive models

- Decompose the joint density as a product of conditionals, and model each conditional in turn

### Background

#### Overview

- Any joint density can be decomposed into a product of one-dimensional conditionals as $p(\mathbf{x}) = \prod_i p(x_i \mid \mathbf{x}_{1:i-1})$
- Models each conditional as a parametric density, whose parameters are a function of a hidden state

#### Examples

- Real-valued Neural Autoregressive Density Estimator (RNADE) uses mixtures of Gaussian or Laplace densities for modelling the conditionals, and a simple linear rule for updating the hidden state
- More flexible approaches use recurrent models such as LSTMs to update the hidden state

#### Drawbacks

- Sensitive to order of the variables

### Correspondence between autoregressive models and normalizing flows

When used to generate data, AR models correspond to a differentiable transformation of an external source of randomness (typically obtained by random number generation). This transformation has a tractable Jacobian by design, and for certain autoregressive models it is also *invertible*, so it directly corresponds to a **normalizing flow**.

Viewing AR models as NFs opens the possibility of increasing their flexibility by stacking multiple models of the same type, having each model provide the source of randomness for the next model in the stack. The resulting stack of models is an NF that is more flexible than the original model but still tractable.

### Masked Autoencoder for Distribution Estimation (MADE)

- Enables density evaluations without the sequential loop that is typical for AR => fast to evaluate and train on parallel computing architectures

## Normalizing flows

- Transform a base density (e.g. standard Gaussian) into the target density by an invertible transformation with tractable Jacobian

### Background

Represents $p(\mathbf{x})$ as an invertible differentiable transformation $f$ of a base density $\pi_u(\mathbf{u})$, that is

$$\mathbf{x} = f(\mathbf{u}) \quad \text{where} \quad \mathbf{u} \sim \pi_u(\mathbf{u})$$

where the *base density* $\pi_u(\mathbf{u})$ is chosen s.t. it can be easily evaluated for any input (e.g. standard Gaussian).

Under the invertibility assumption on $f$, the density $p(\mathbf{x})$ can be calculated as

$$p(\mathbf{x}) = \pi_u\!\left(f^{-1}(\mathbf{x})\right) \left| \det\!\left( \frac{\partial f^{-1}}{\partial \mathbf{x}} \right) \right|$$

For the above equation to be tractable, the transformation must be constructed such that:

- it's easy to invert
- determinant of Jacobian is easy to compute

These properties are actually conserved under composition, thus if $f_1$ and $f_2$ have the above properties, then so does $f_2 \circ f_1$.

Hence, we can make $f$ "deeper" by composition, and the result is still a valid NF.

#### DONE Why scale by the determinant?

I'm guessing it's because when you integrate over the probability density and integrate over $\mathbf{u}$ instead of $\mathbf{x}$, you're making a substitution of variables and hence need to multiply by the Jacobian of $f^{-1}$ wrt. $\mathbf{x}$, which defines the relationship from $\mathbf{x}$ to $\mathbf{u}$.
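
This intuition can be written out (my own derivation, not from the paper): substituting $\mathbf{u} = f^{-1}(\mathbf{x})$, i.e. $d\mathbf{u} = \left|\det\!\left(\partial f^{-1} / \partial \mathbf{x}\right)\right| d\mathbf{x}$, in the normalization integral shows that the determinant is exactly the volume factor needed for $p$ to integrate to one:

$$\int p(\mathbf{x}) \, d\mathbf{x} = \int \pi_u\!\left(f^{-1}(\mathbf{x})\right) \left| \det\!\left( \frac{\partial f^{-1}}{\partial \mathbf{x}} \right) \right| d\mathbf{x} = \int \pi_u(\mathbf{u}) \, d\mathbf{u} = 1$$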

### "Training" the normalizing flows

- Given a data point $\mathbf{x}$, compute $\mathbf{u} = f^{-1}(\mathbf{x})$, i.e. the "random number which generated $\mathbf{x}$" (this data wasn't *actually* generated that way, since it's a sample from the outside world)
- Repeat step 1 for multiple samples
- If the *distribution* of the random numbers computed *from the observations* is close to the base density (i.e. the KL divergence is small), then we have a **good fit**!