Notes on: Papamakarios, G., Pavlakou, T., & Murray, I. (2017): Masked autoregressive flow for density estimation

Table of Contents

Main idea

Notation

  • AR is autoregressive models
  • NF is normalizing flows
  • IAF is inverse autoregressive flow
  • MAF is Masked Autoregressive Flow
  • MADE is Masked Autoencoder for Distribution Estimation
  • π_u(u) is the base density to which the invertible transformation (with tractable Jacobian) is applied

Overview

  • Neural networks for density estimation
  • Useful for non-generative cases; compared to variational autoencoders and GANs, these can readily provide exact density evaluations
  • Can provide priors in Bayesian settings

Problem

  • Hard to construct models that are flexible enough to represent complex densities, but have tractable density functions and learning algorithms

Autoregressive models

  • Decompose the joint density as a product of conditionals, and model each conditional in turn

Background

Overview

  • Any joint density p(x) can be decomposed into a product of one-dimensional conditionals as p(x) = prod_i p(x_i | x_{1:i-1})
  • Models each conditional p(x_i | x_{1:i-1}) as a parametric density, whose parameters are a function of a hidden state h_i
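
As a toy illustration of the decomposition above (not from the paper), a 3-dimensional density with hand-coded Gaussian conditionals can be evaluated by summing the conditional log-densities; all parameter choices here are illustrative:

```python
import math

def log_normal(x, mu, log_sigma):
    """Log-density of N(mu, sigma^2) evaluated at x."""
    return (-0.5 * math.log(2 * math.pi) - log_sigma
            - 0.5 * ((x - mu) / math.exp(log_sigma)) ** 2)

def ar_log_density(x):
    """log p(x) = sum_i log p(x_i | x_{1:i-1}) for a toy model
    whose conditional means depend linearly on the preceding values."""
    total = 0.0
    for i in range(len(x)):
        mu = 0.5 * sum(x[:i])   # toy rule: mean from past values
        log_sigma = 0.0         # unit variance for simplicity
        total += log_normal(x[i], mu, log_sigma)
    return total
```

In a real model such as RNADE, the conditional parameters would instead come from a learned function of a hidden state.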

Examples

  • Real-valued Neural Autoregressive Density Estimator (RNADE) uses mixtures of Gaussian or Laplace densities for modelling the conditionals, and a simple linear rule for updating the hidden state
  • More flexible approaches use recurrent models such as LSTMs for the hidden-state update

Drawbacks

  • Sensitive to order of the variables

Correspondence between autoregressive models and normalizing flows

When used to generate data, autoregressive models correspond to a differentiable transformation of an external source of randomness (typically obtained by random number generation) => the transformation has a tractable Jacobian by design, and for certain autoregressive models it is also invertible => directly corresponds to a normalizing flow.

Viewing AR models as NFs opens the possibility of increasing their flexibility by stacking multiple models of the same type, having each model provide the source of randomness for the next model in the stack. The resulting stack of models is a NF that is more flexible than the original model but still tractable.

Masked Autoencoder for Distribution Estimation (MADE)

  • Enables density evaluations without the sequential loop that is typical for AR => fast to evaluate and train on parallel computing architectures
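
A rough sketch of MADE's mask construction, under my reading of the scheme (function and variable names are my own): each unit is assigned a "degree", and connections are kept only when they cannot leak information from inputs with an equal or higher index.

```python
import numpy as np

def made_masks(d_in, d_hidden, seed=0):
    """Build binary masks so that output i depends only on inputs 1..i-1,
    giving the autoregressive property in a single forward pass."""
    rng = np.random.default_rng(seed)
    deg_in = np.arange(1, d_in + 1)               # input degrees 1..D
    deg_h = rng.integers(1, d_in, size=d_hidden)  # hidden degrees in [1, D-1]
    # hidden unit k may see input d iff deg_h[k] >= deg_in[d]
    mask_in_to_h = (deg_h[:, None] >= deg_in[None, :]).astype(float)
    # output i may see hidden k iff deg_in[i] > deg_h[k] (strictly)
    mask_h_to_out = (deg_in[:, None] > deg_h[None, :]).astype(float)
    return mask_in_to_h, mask_h_to_out
```

Multiplying the weight matrices elementwise by these masks leaves no path from input d to output i unless d < i, so all conditionals can be evaluated in one parallel pass instead of a sequential loop.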

Normalizing flows

  • Transform a base density (e.g. standard Gaussian) into the target density by an invertible transformation with tractable Jacobian

Background

Represents p(x) as an invertible differentiable transformation f of a base density π_u(u), that is

x = f(u)  where  u ~ π_u(u)

where the base density π_u(u) is chosen s.t. it can be easily evaluated for any input u (e.g. standard Gaussian).

Under the invertibility assumption of f, the density p(x) can be calculated as

p(x) = π_u(f⁻¹(x)) |det( ∂f⁻¹ / ∂x )|

For the above equation to be tractable, the transformation f must be constructed such that:

  • it's easy to invert
  • determinant of Jacobian is easy to compute

These properties are actually preserved under composition, thus if f1 and f2 have the above properties, then so does f2 ∘ f1.

Hence, we can make f "deeper" by composition, and the result is still a valid NF.
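
A minimal numeric sketch of the change-of-variables formula and of composition, using affine transforms x = a·u + b with a standard-Gaussian base density (a toy example, not the paper's transforms):

```python
import math

def standard_normal_pdf(u):
    """Base density pi_u: standard Gaussian."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def affine_flow_density(x, a, b):
    """p(x) = pi_u(f^{-1}(x)) * |det d f^{-1}/dx| for x = a*u + b,
    whose inverse Jacobian determinant is the constant 1/|a|."""
    u = (x - b) / a                 # f^{-1}(x)
    return standard_normal_pdf(u) / abs(a)

def composed_density(x, a1, b1, a2, b2):
    """Density under f2 ∘ f1: chain the inverses and multiply
    the absolute Jacobian determinants."""
    u1 = (x - b2) / a2              # invert f2
    u = (u1 - b1) / a1              # invert f1
    return standard_normal_pdf(u) / (abs(a1) * abs(a2))
```

Since the composition of two affine maps is again affine (x = (a2·a1)·u + a2·b1 + b2), the composed density agrees with the single-transform formula, illustrating that invertibility and a tractable Jacobian determinant survive composition.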

DONE Why scale by the determinant?

I'm guessing it's because when you integrate the probability density p(x) but substitute u for x, you're making a change of variables and hence need to multiply by the Jacobian of f⁻¹ wrt. x, which defines the relationship from x to u.

"Training" the normalizing flows

  1. Given a data point x, compute u = f⁻¹(x), i.e. the "random number u which generated x" (this data wasn't actually generated by the model, since it's a sample from the outside world)
  2. Repeat step 1 for multiple samples x
  3. If the distribution of the random numbers u obtained from the observations is close to the base density π_u(u) (i.e. the KL divergence is small), then we have a good fit!
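
The procedure above amounts to maximum likelihood: maximising log p(x) = log π_u(f⁻¹(x)) + log |det ∂f⁻¹/∂x| over the data. A toy sketch with a one-parameter flow x = a·u and standard-Gaussian base density (all names and the gradient-ascent setup are illustrative):

```python
import math

def log_likelihood(xs, a):
    """Sum of log p(x) = log N(x/a; 0, 1) - log|a| over the data."""
    ll = 0.0
    for x in xs:
        u = x / a                                          # f^{-1}(x)
        ll += -0.5 * math.log(2 * math.pi) - 0.5 * u * u   # log pi_u(u)
        ll += -math.log(abs(a))                            # log |det d f^{-1}/dx|
    return ll

def fit_scale(xs, steps=200, lr=0.05):
    """Fit the scale a by gradient ascent on the log-likelihood.
    d/da [ -x^2/(2 a^2) - log a ] = x^2/a^3 - 1/a."""
    a = 1.0
    n = len(xs)
    for _ in range(steps):
        grad = sum(x * x / a**3 - 1.0 / a for x in xs)
        a += lr * grad / n
    return a
```

For this flow the optimum has a closed form, a² = mean(x²), so gradient ascent should recover the data's scale; real flows instead backpropagate the same objective through neural-network parameters.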