# Normalizing flows

## Table of Contents

## Literature

- wu20_stoch_normal_flows: Stochastic Normalizing Flows
- Considering a sequence of flows
- Instead of composing these determinstically, they interleave
*sampling steps*from some transition kernel

- huang20_augmen_normal_flows

## Results

- There exists normalizing flows which are "universal" in the sense that they can express any
*distribution*(i.e. "weak" convergence). huang18_neural_autor_flows,papamakarios19_normal_flows_probab_model_infer

## "Why the name?!"

I've received this question or a similar one on more than one occation. **Normalizing flows** is really not a good name for what it is supposed to represent. It is also unclear whether or not it refers to the base distribution *together* with the transformation, or *just* the transformation itself, ignoring the base distribution which is transformed.

I've received questions such as

- "Is this related to gradient flows in differential equations and manifolds?" which is a completely valid question because that is indeed what it sounds like it is! I'd say something like "Well, kind of, but also not really. A gradient flow could technically be used in a normalizing flow but it's way too strict of an requirement; you really just need a differentiable bijection with a differentiable inverse. A gradient flow is indeed this (at least locally), but yeah, you can also just have, say, addition by a constant factor."

Today, I'd say a normalizing flow is a *piecewise [[file:..*mathematics*geometry.org::def:diffeomorphism][diffeomorphic]] [[file:..*mathematics*measure _{theory.org}::def:push-forward-measure][push-forward]]*, or in simpler terms, it's a differentiable function with a differentiable inverse together with a base distribution on with a density .

So, why isn't it just called a this? It seems like the term **normalizing flows** was popularized in 2014 by rezende15_variat_infer_with_normal_flows. This paper refers to the "method of normalizing flows" from tabak2010density in which we can probably find the first use of the term. In this work they do the following

Define

where is a

*known*density and denotes the Jacobian of the map .Define the mapping as an "infinite composition of infinitesimal transformations", i.e. a (gradient) flow s.t.

Define

then

Given a set of samples , we can measure the quality of the estimated density by the

*log-likelihood*, treating it as a functional on :This suggests constructing the flow by following the

*direction of ascent of*, i.e. s.t.and such that is the (local) minimizer.

where

Note that

In this work they describe a method of defining a (gradient) flow

Basically, in this work they define a gradient flow using the log-likelihood which transforms from a known density to the unknown density . They also consider the *dual* of the flow, which transforms from to , which they refer to as "transforming to normality", i.e. a normalizing flow in the sense that it's a flow which transforms a density to normality / a normal distribution. For practical purposes they consider "infinitesimal" small additive changes, which is basically what we today refer to as *residual normalizing flows* behrmann18_inver_resid_networ. They also point out that the work of "Gaussianization" which was done in 2002 follow a similar idea, though not using flows.