# Variational Inference (Part 1): Variational Inference & CAVI

## Overview

In this post I will present to you what's known as **variational inference (VI)**, a family of *approximate* Bayesian inference methods. In particular, we will focus on one of the more basic VI methods called **coordinate ascent variational inference (CAVI)**. We will go through the theory for VI, and apply CAVI to the case of a Gaussian mixture, following and adding to what's presented in Blei et al. (2016). With the theory clear we will go through an implementation in `Turing.jl` as a simple (and slightly contrived) example of how to implement "custom" variational inference methods. Hopefully you'll learn *something*. The goal is that you walk away with (in descending order of importance):

- A basic understanding of VI and why we bother
- A general idea of how to use the (work in progress) VI interface in `Turing.jl`

- A vague familiarity with CAVI

Just a heads up: in contrast to the previous post, this one is less focused on memes and involves a fair bit of mathematics and code. At the ripe age of 24, there's certainly one thing I've learned: memes can only get you so far in life.

## Notation

- **VI** abbreviates **variational inference**
- **CAVI** abbreviates **coordinate ascent variational inference**
- $n$ denotes the number of data points
- $\{x_i\}_{i=1}^n$ denotes a set of data points

Evidence lower bound (ELBO):

$$\mathrm{ELBO}(q) = \mathbb{E}_{q(z)}\big[\log p(x, z)\big] - \mathbb{E}_{q(z)}\big[\log q(z)\big]$$

and its empirical estimate

$$\widehat{\mathrm{ELBO}}(q) = \frac{1}{m} \sum_{\ell = 1}^{m} \Big( \log p(x, z_\ell) - \log q(z_\ell) \Big), \quad z_\ell \sim q(z)$$

## Motivation

In Bayesian inference one usually specifies a model as follows: given data $\{x_i\}_{i=1}^n$,

$$z \sim p(z), \qquad x_i \overset{\text{iid}}{\sim} p(x \mid z), \quad i = 1, \dots, n$$

where "iid" denotes that the samples are independently and identically distributed. Our goal in Bayesian inference is then to find the *posterior*

$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)} = \frac{p(x \mid z)\, p(z)}{\int p(x \mid z)\, p(z)\, \mathrm{d}z}$$

In general one cannot obtain a closed form expression for $p(z \mid x)$, but one might still be able to *sample* from $p(z \mid x)$ with guarantees of converging to the target posterior as the number of samples goes to $\infty$, e.g. MCMC.

"So what's the problem? We have these nice methods which we know converge to the true posterior in the infinite limit. What more do we really need?!"

Weeell, things aren't always that easy.

- As the model becomes increasingly complex, e.g. more variables or multi-modality (i.e. separate "peaks"), convergence of these unbiased samplers can slow down dramatically. Still, in the *infinite* limit these methods should converge to the true posterior, but infinity is fairly large, like, *at least* more than 12!
- Large amounts of data means more evaluations of the log-likelihood which, if it's sufficiently expensive to compute, can make direct sampling infeasible.

In fact, there is also an application of VI in the context of Bayesian *model selection*, but we will not dig into that here (Chen et al., 2018).

Therefore it might at times be a good idea to use an *approximate* posterior, which we'll denote $q(z)$. Clearly we'd like it to be close to the target $p(z \mid x)$, but what does "close" even mean in the context of probability densities?

This question really deserves much more than a paragraph and is indeed quite an interesting topic. There are plenty of approaches to take here: some come down to putting actual metrics (defining a "distance") on spaces of probability densities, e.g. the Wasserstein / earth-mover's distance, while other approaches are less strict, e.g. the Kullback-Leibler (KL) divergence which, unlike a metric, is *not* symmetric in its arguments.
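To make the asymmetry concrete: for two univariate Gaussians the KL-divergence has a closed form (a standard identity, stated here purely for illustration),

$$\mathrm{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)\big) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}$$

so e.g. $\mathrm{KL}(\mathcal{N}(0, 1) \,\|\, \mathcal{N}(0, 4)) = \log 2 + \tfrac{1}{8} - \tfrac{1}{2} \approx 0.32$, while swapping the arguments gives $-\log 2 + 2 - \tfrac{1}{2} \approx 0.81$.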

## Objective

For now we will consider the KL-divergence, but as we will probably see in later posts, there are several viable alternatives which can also be employed in VI. The KL-divergence has the nice property that minimizing it (with the data distribution in the first argument) is equivalent to maximizing the likelihood.
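To see why (a standard argument, spelled out for completeness): let $\hat{p}$ denote the empirical distribution of observed data $\{x_i\}_{i=1}^n$ and $p_\theta$ a model. Then

$$\mathrm{KL}(\hat{p} \,\|\, p_\theta) = \mathbb{E}_{\hat{p}}\big[\log \hat{p}(x)\big] - \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i)$$

The first term doesn't depend on $\theta$, so minimizing this KL-divergence over $\theta$ is exactly maximizing the (log-)likelihood.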

With the KL-divergence as this measure of how "good" an approximation our approximate posterior $q(z)$ is to $p(z \mid x)$, the *objective* is to find the optimal

$$q^*(z) = \underset{q \in \mathscr{Q}}{\operatorname{arg\,min}}\ \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)$$

where $\mathscr{Q}$ is the space of densities we have chosen to consider, and where the KL-divergence is defined

$$\mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big) = \mathbb{E}_{q(z)}\big[\log q(z)\big] - \mathbb{E}_{q(z)}\big[\log p(z \mid x)\big]$$

So the idea is to construct some approximate posterior from which it's feasible to sample, allowing us to do *approximate* Bayesian inference. But there are a number of practical issues with the above objective:

1. We can't evaluate $p(z \mid x)$! Evaluating $p(z \mid x)$ is the entire goal of Bayesian inference; if we could simply evaluate this, we would be done.
2. Integrating over all $z$ is of course in general intractable.

In the case of KL-divergence, the standard approach to address (1) is to instead optimize what's known as the *evidence lower bound (ELBO)*.

## The famous ELBO^{1}

Using the fact that $p(z \mid x) = p(x, z) / p(x)$, the RHS of the KL-divergence becomes

$$\mathbb{E}_{q(z)}\big[\log q(z)\big] - \mathbb{E}_{q(z)}\big[\log p(x, z)\big] + \log p(x)$$

where we've simply brought the $\log p(x)$ term outside of the expectation since it is independent of $z$. We can therefore re-arrange the original expression to

$$\log p(x) = \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big) + \Big( \mathbb{E}_{q(z)}\big[\log p(x, z)\big] - \mathbb{E}_{q(z)}\big[\log q(z)\big] \Big)$$

Since $\log p(x)$ does not depend on $q$, this implies that if we maximize the second term on the RHS we're implicitly minimizing the KL-divergence between $q(z)$ and $p(z \mid x)$! This term is called the **evidence lower bound (ELBO)**, stemming from the fact that $p(x)$ is often referred to as the "evidence" in the Bayesian literature, and it turns out that the KL-divergence is indeed non-negative, i.e. $\mathrm{ELBO}(q) \le \log p(x)$. We therefore define

$$\mathrm{ELBO}(q) := \mathbb{E}_{q(z)}\big[\log p(x, z)\big] - \mathbb{E}_{q(z)}\big[\log q(z)\big]$$

This still requires us to feasibly compute $\mathbb{E}_{q(z)}[\log p(x, z)]$ and $\mathbb{E}_{q(z)}[\log q(z)]$!

Also, the fact that this lower-bounds the evidence is how one can use VI to do model selection, but again, we're not getting into that here.

Since the expectation is linear and we're assuming the observations are independent, if we want to take into account all the samples in $\{x_i\}_{i=1}^n$, we simply sum these contributions to get

$$\mathbb{E}_{q(z)}\big[\log p(x, z)\big] = \mathbb{E}_{q(z)}\big[\log p(z)\big] + \sum_{i=1}^{n} \mathbb{E}_{q(z)}\big[\log p(x_i \mid z)\big]$$

in the case where we use the same $q(z)$ for the different samples.^{2}

Awesome. We now have an *objective* to maximize, whose maximizer then gives us the "optimal" approximate posterior from which we can sample.^{3}

*But* computing $\mathrm{ELBO}(q)$ involves computing an expectation, which of course, again, requires evaluation of a (usually) intractable integral, i.e. issue (2) noted in the previous section. Luckily, in this case we can simply substitute the intractable expectations with the *empirical expectation*: if we want to estimate the expectation of some function $f$ wrt. a density $q(z)$, then

$$\mathbb{E}_{q(z)}\big[f(z)\big] \approx \frac{1}{m} \sum_{\ell = 1}^{m} f(z_\ell), \quad z_\ell \sim q(z)$$

which, by the Strong Law of Large Numbers (SLLN), converges to the true expectation in the infinite limit. We can apply this to the ELBO:

$$\widehat{\mathrm{ELBO}}(q) = \frac{1}{m} \sum_{\ell = 1}^{m} \Big( \log p(x, z_\ell) - \log q(z_\ell) \Big), \quad z_\ell \sim q(z)$$

which is a much more approachable problem. Granted, this makes our objective non-deterministic / stochastic / random / wibbly-wobbly, which in turn means that our posterior estimate will have much higher variance. This usually means slower convergence, and possibly even complete failure to converge.
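As a quick sanity check of the empirical-expectation idea, here's a standalone sketch (in Python/NumPy rather than Julia, purely for illustration; `empirical_expectation` is a made-up helper, not part of any library): estimating $\mathbb{E}_{z \sim \mathcal{N}(0,1)}[z^2]$, whose true value is exactly $1$.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_expectation(f, sample, m):
    """Monte Carlo estimate of E_q[f(z)] using m draws z_l ~ q."""
    return np.mean([f(z) for z in sample(m)])

# estimate E[z^2] for z ~ N(0, 1); the true value is 1
for m in (10, 1_000, 100_000):
    est = empirical_expectation(lambda z: z**2, lambda k: rng.normal(size=k), m)
    print(m, est)  # by the SLLN these concentrate around 1 as m grows
```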

In fact, the approach to estimating the ELBO is often what distinguishes VI approaches. We will discuss this further in the future; it's even likely that we'll dedicate one entire post only to the discussion / comparison of different approaches to this estimation.

In the particular example we will consider now, we actually have access to a closed form expression for $\mathbb{E}_{q(z)}[\log q(z)]$, and therefore only need to make use of the empirical estimate for the first term of the ELBO: $\mathbb{E}_{q(z)}[\log p(x, z)]$. As mentioned above, this will lead to reduced variance in our updates and faster convergence.

## How do I use this in `Turing.jl`?

Before diving into a particular example, let's just quickly go through the VI interface in `Turing.jl`.

Unfortunately VI is not officially supported *yet*, but you can get the WIP version if you check out my VI branch tor/vi-v3.^{4} All the code that follows assumes you have indeed done that, and that you've activated the local environment by doing `] activate .` in the repository directory.

```
using Turing
using Turing: Variational
```

Just one more time: **this is work in progress**! What I present to you here is basically the first draft and thus unlikely to survive for long. But I put it here for you to see; maybe you even have ideas on how to improve it! If so, hit me up on the Julia slack channel @torfjelde :)

VI in `Turing.jl` is mainly supposed to be used as an approximate substitute for the `sample(model, sampler)` interface:

```
?Variational.vi
```

```
vi(model::Model, alg::VariationalInference)
vi(model::Model, alg::VariationalInference, q::VariationalPosterior)

Constructs the variational posterior from the model and performs the
optimization following the configuration of the given VariationalInference
instance.
```

Therefore, in general we'll have the following pseudo-code:

```
# 1. Instantiate `Turing.Model` with observations
m = model(data)

# 2. Construct optimizer and VI algo.
opt = Variational.ADAGrad()  # proxy for `Flux.Optimise`
alg = Alg(args...)           # <= `Alg <: VariationalInference`

# 3. Perform variational inference
q = vi(m, alg; optimizer = opt)               # => <: `VariationalPosterior`
q = vi(m, alg, q_before_opt; optimizer = opt) # => optimized `q <: VariationalPosterior`

# 4. Sample from the variational posterior
rand(q, 1000)
```

If `q` is *not* provided, `Turing` will extract the latent variables in your `model` and construct a `VariationalPosterior` (i.e. a `Distribution{Multivariate, Continuous}`) and optimize the corresponding `objective`, *if* the `VariationalInference` method indeed has a default family.

The most significant difference from the existing `sample(...)` interface is that `vi(...)` returns a `VariationalPosterior`, which we can call `rand` on to obtain approximate posterior samples, while `sample` returns a `Chains` object from `MCMCChains` containing the posterior samples. Currently `VariationalPosterior` is simply an alias for `Distribution{Multivariate, Continuous}` from `Distributions.jl`. The idea is that the variational posteriors should act as standard distributions, implementing both `rand` and `logpdf`. This way, any multivariate distribution from `Distributions.jl` will work (if supported by the method).

A `VariationalInference` instance is supposed to hold the "configuration" for a VI method, e.g. the number of samples used in the empirical expectation for the estimate of the ELBO. Additionally, even though two `VariationalInference` types might have exactly the same struct fields, they might still have very different behaviour.

The following methods might have different behaviour based on the `VariationalInference` type:

- `optimize!`:
  - If the VI method is *not* gradient-based (e.g. CAVI, as we'll soon see), then this is where the main chunk of work occurs, and thus is the method to implement.
  - If the VI method *is* gradient-based, `optimize!` will usually contain the "training loop", where the gradients are obtained by calling `grad!`. The parameters are then updated by applying the update rule of the chosen optimizer, e.g. `ADAGrad`.
    - There are also certain cases where we need to do more than just update the variational variables every iteration, e.g. the min-max steps needed for certain objectives (Liu & Wang, 2016).
  - A default implementation is provided which simply takes a given number of optimization steps by calling `grad!` and `apply!(optimizer, ...)`.
- `grad!`: if the VI method is gradient-based, then this is usually where the heavy lifting is done. As mentioned earlier, the gradient estimator is often the distinguishing factor between different VI methods, e.g. the reparameterization trick (next posts, guys; calm down).
  - Default implementations are provided using `ForwardDiff` and `Flux.Tracker` to differentiate the `objective` call, depending on which AD-type is specified for the `VariationalInference{AD}` instance.

And of course we need to implement some *objective* function, e.g. the ELBO. Internally in `Turing.jl` we define types and implement calling an instance, e.g. `struct ELBO end` and `(elbo::ELBO)(...)`. This allows different `VariationalInference` methods to have different behaviour for different objectives. If you're interested in implementing your own VI objective using a simple function, you can just specialize the `optimize!` call with an `obj::typeof(objective_function)` argument.

```
?Variational.optimize!
```

```
optimize!(vo, alg::VariationalInference{AD}, q::VariationalPosterior, model::Model, θ; optimizer = ADAGrad())

Iteratively updates parameters by calling grad! and using the given optimizer
to compute the steps.
```

```
?Variational.grad!
```

```
grad!(vo, vi::VariationalInference, q::VariationalPosterior, model::Model, θ, out, args...)

Computes the gradients used in optimize!. Default implementation is provided
for VariationalInference{AD} where AD is either ForwardDiffAD or TrackerAD.
This implicitly also gives a default implementation of optimize!.

Variance reduction techniques, e.g. control variates, should be implemented
in this function.
```

Aight, aight; that's a lot of information. Let's do an example!

## Example: 1D Gaussian Mixture

### Notation

1D Gaussian mixture:

$$p(\mu, c, x) = p(\mu)\, p(c)\, p(x \mid \mu, c)$$

where

$$\mu_k \sim \mathcal{N}(0, \sigma^2), \qquad c_i \sim \mathrm{Categorical}(\pi), \qquad x_i \mid c_i, \mu \sim \mathcal{N}(\mu_{c_i}, 1)$$

Variational distribution:

$$q(\mu, c) = \prod_{k=1}^{K} q(\mu_k; m_k, s_k^2) \prod_{i=1}^{n} q(c_i; \varphi_i)$$

where

- $q(\mu_k; m_k, s_k^2)$ is a Gaussian with mean $m_k$ and variance $s_k^2$ for $k = 1, \dots, K$
- $q(c_i; \varphi_i)$ is a Categorical distribution where $\varphi_i$ (a $K$-dim vector) is s.t. $\sum_{k=1}^{K} \varphi_{ik} = 1$
- $q(z; \theta)$ denotes a *parameterized* density over $z$ with parameter $\theta$
- $\mu_k \sim \mathcal{N}(0, \sigma^2)$ means we assume $\mu_k$ is also a random variable itself, i.e. a latent variable in the Bayesian inference setting.

Expectation of some function $f$ wrt. a density $q$ is denoted

$$\mathbb{E}_{q}\big[f(z)\big] = \int f(z)\, q(z)\, \mathrm{d}z$$

### Definition

Let's apply VI to a simple Gaussian mixture problem, because why not? We will closely follow Blei et al. (2016) with a slight variation (other than notational differences): we allow non-uniform mixture weights, i.e. being assigned to cluster $k$ might have a different probability than being assigned to cluster $k'$.

This example lends itself nicely as a base case for VI, as we can derive closed form expressions for the updates. Though it takes a bit of work, it nicely demonstrates the underlying mechanisms necessary for VI and shows why we really, *really* want more general VI approaches :) Aight, let's get started.

For a 1D Gaussian mixture, one particular generative model can be

$$\mu_k \sim \mathcal{N}(0, \sigma^2), \quad k = 1, \dots, K$$

$$c_i \sim \mathrm{Categorical}(\pi), \qquad x_i \mid c_i, \mu \sim \mathcal{N}(\mu_{c_i}, 1), \quad i = 1, \dots, n$$

In this case we then have

$$p(x) = \int \sum_{c} p(x, c, \mu)\, \mathrm{d}\mu$$

where $\sum_c$ denotes summing over all possible values $c = (c_1, \dots, c_n)$ can take on. Following Blei et al. (2016) we can actually rewrite this to

$$p(x) = \sum_{c} p(c) \int p(\mu) \prod_{i=1}^{n} p(x_i \mid c_i, \mu)\, \mathrm{d}\mu$$

By the conjugacy of the Gaussian prior $p(\mu_k)$ and the Gaussian likelihood $p(x_i \mid c_i, \mu)$ — i.e. that a Gaussian prior and Gaussian likelihood gives a Gaussian posterior — we can indeed obtain a closed form expression for each of these integrals. Even so, there are $K^n$ of these terms, since we're summing over all the different cluster assignments: $n$ data points with $K$ choices each gives $K^n$ terms. Thus, for the sake of efficiency, we reach for VI!
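To get a feel for how hopeless the direct marginalization is, a quick back-of-the-envelope check (Python here, purely for illustration) with the $K = 3$, $n = 1000$ we'll use later in this post:

```python
K, n = 3, 1000
num_terms = K**n  # one conjugate-Gaussian integral per cluster-assignment configuration
print(len(str(num_terms)))  # → 478, i.e. K^n has 478 digits
```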

### Coordinate ascent (mean-field) variational inference (CAVI)

As noted earlier, to perform VI we need to specify a variational family $\mathscr{Q}$. To make life easy for ourselves, we will make what's known as the *mean-field assumption*, i.e. that the latent variables are mutually independent, which here means that

$$q(\mu, c) = \prod_{k=1}^{K} q(\mu_k) \prod_{i=1}^{n} q(c_i)$$

for some independent densities $q(\mu_k)$ and $q(c_i)$. To choose these, a natural thing to do is to take the densities used in the original model and apply the mean-field assumption to this model, i.e.

$$q(\mu, c; m, s^2, \varphi) = \prod_{k=1}^{K} q(\mu_k; m_k, s_k^2) \prod_{i=1}^{n} q(c_i; \varphi_i)$$

where

- $q(\mu_k; m_k, s_k^2)$ is a Gaussian with mean $m_k$ and variance $s_k^2$ for $k = 1, \dots, K$
- $q(c_i; \varphi_i)$ is a Categorical distribution where $\varphi_i$ (a $K$-dim vector) is s.t. $\sum_{k=1}^{K} \varphi_{ik} = 1$

Recall we want to find $q$ s.t. we maximize the ELBO. To do this, we're going to use one of the most basic VI methods, called **coordinate ascent variational inference (CAVI)**. This works by iteratively optimizing each factor of the latent variables *conditioned* on the rest of the latent variables, similar to the iterative process of Gibbs sampling (which, of course, doesn't involve any optimization).

Observe that, viewed as a function of a single factor $q_j$ with the other factors held fixed, we can write the ELBO under the mean-field assumption as

$$\mathrm{ELBO}(q_j) = \mathbb{E}_{j}\Big[ \mathbb{E}_{-j}\big[\log p(z_j, z_{-j}, x)\big] \Big] - \mathbb{E}_{j}\big[\log q_j(z_j)\big] + \text{const}$$

where $\mathbb{E}_{-j}$ denotes the expectation wrt. all factors *except* $q_j$, and the constant absorbs the entropy terms of the other factors. Then observe that

$$\log p(z_j, z_{-j}, x) = \log p(z_j \mid z_{-j}, x) + \log p(z_{-j}, x)$$

Therefore we can also absorb the second term above into the constant wrt. $q_j$ term:

$$\mathrm{ELBO}(q_j) = \mathbb{E}_{j}\Big[ \mathbb{E}_{-j}\big[\log p(z_j \mid z_{-j}, x)\big] \Big] - \mathbb{E}_{j}\big[\log q_j(z_j)\big] + \text{const}$$

We therefore make the justified guess

$$q_j^*(z_j) \propto \exp\Big( \mathbb{E}_{-j}\big[\log p(z_j \mid z_{-j}, x)\big] \Big)$$

(up to a constant normalization factor), which means that if we substitute the definition of $q_j^*$ into the above expression for the ELBO, the first two terms combine into a negative KL-divergence and we're left with

$$\mathrm{ELBO}(q_j) = -\mathrm{KL}\big(q_j \,\|\, q_j^*\big) + \text{const}$$

If $q_j$ was parameterized by some parameter $\theta_j$ (and differentiable wrt. $\theta_j$), we could differentiate this wrt. $\theta_j$ and see that the gradient vanishes completely when $q_j$ is set to satisfy $q_j = q_j^*$ (assuming such a $\theta_j$ exists, of course). At the very least, this suggests an iterative process to maximize the ELBO: cycle through the factors, each time setting

$$q_j(z_j) \leftarrow q_j^*(z_j) \propto \exp\Big( \mathbb{E}_{-j}\big[\log p(z_j \mid z_{-j}, x)\big] \Big)$$

with the other factors held fixed.

And indeed this converges to a local optimum (Blei et al., 2016).

Now, we still need to compute these $q_j^*$. But of course, the entire reason for choosing this model and variational family is that in this case these expressions are available in closed form:

$$\varphi_{ik} \propto \pi_k \exp\Big( m_k x_i - \frac{1}{2}\big(s_k^2 + m_k^2\big) \Big)$$

$$m_k = \frac{\sum_{i=1}^{n} \varphi_{ik}\, x_i}{1/\sigma^2 + \sum_{i=1}^{n} \varphi_{ik}}, \qquad s_k^2 = \frac{1}{1/\sigma^2 + \sum_{i=1}^{n} \varphi_{ik}}$$

where we have used the fact that if $\mu_k \sim \mathcal{N}(m_k, s_k^2)$ then

$$\mathbb{E}[\mu_k] = m_k, \qquad \mathbb{E}[\mu_k^2] = s_k^2 + m_k^2$$

The derivations of these can be found in the appendix (or, without the non-uniform weights, in Blei et al. (2016)).
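Since the updates above are just a few lines of arithmetic, here's a self-contained sketch of them on synthetic data (NumPy rather than the Turing implementation below, purely as a sanity check of the formulas; names like `phi`, `m`, `s2` mirror $\varphi$, $m$, $s^2$):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic 1D mixture: known means, unit observation variance
true_mu = np.array([-5.0, 0.0, 5.0])
pi = np.array([0.1, 0.2, 0.7])   # cluster weights
sigma2 = 1.0                     # prior variance of the cluster means
n, K = 1000, 3
c = rng.choice(K, size=n, p=pi)
x = rng.normal(true_mu[c], 1.0)

# initialize variational parameters
phi = rng.random((n, K)); phi /= phi.sum(axis=1, keepdims=True)
m = rng.normal(size=K)
s2 = np.ones(K)

for _ in range(10):
    # local update: phi_{ik} ∝ pi_k exp(m_k x_i - (s2_k + m_k^2) / 2),
    # computed in log-space for numerical stability
    log_phi = np.log(pi) + np.outer(x, m) - 0.5 * (s2 + m**2)
    phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
    phi /= phi.sum(axis=1, keepdims=True)

    # global update: Gaussian q(mu_k) with mean m_k and variance s2_k
    denom = 1.0 / sigma2 + phi.sum(axis=0)
    m = (phi * x[:, None]).sum(axis=0) / denom
    s2 = 1.0 / denom

print(np.sort(m))  # typically close to the true means, up to label permutation
```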

This example is a special case of what's known as *conditionally conjugate models* with "local" and "global" variables.

- A variable is *local* if only a single data point depends on it
- A variable is *global* if multiple data points depend on it

We use this terminology of "global" and "local" for both the *latent* variables and the *variational* variables. E.g.

- Latent:
  - Local: cluster assignments $c_i$
  - Global: cluster means $\mu_k$
- Variational:
  - Local: parameters $\varphi_i$ of the cluster assignments
  - Global: means $m_k$ and variances $s_k^2$ for the different clusters

In a general **conditionally conjugate model**, say $\beta$ denotes the *global latent variables* and $z = (z_1, \dots, z_n)$ the *local latent variables*, the joint density is given by

$$p(\beta, z, x) = p(\beta) \prod_{i=1}^{n} p(z_i, x_i \mid \beta)$$

In the case where these factors are also in the exponential family, closed form updates for CAVI can be derived in a similar way as we did for the Gaussian mixture model above (Blei et al., 2016).
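For completeness, the exponential-family mechanics behind such closed forms (standard material, sketched here): if the complete conditional of a latent variable $z_j$ is in the exponential family,

$$p(z_j \mid z_{-j}, x) = h(z_j) \exp\Big( \eta(z_{-j}, x)^\top t(z_j) - A\big(\eta(z_{-j}, x)\big) \Big)$$

then the CAVI update $q_j^*(z_j) \propto \exp\big(\mathbb{E}_{-j}[\log p(z_j \mid z_{-j}, x)]\big)$ lands in the *same* exponential family, with natural parameter

$$\eta_j^* = \mathbb{E}_{-j}\big[\eta(z_{-j}, x)\big]$$

since the $\mathbb{E}_{-j}[A(\eta)]$ term is constant wrt. $z_j$ and gets absorbed into the normalization. This is exactly why the Gaussian and Categorical updates for the mixture come out in closed form.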

### Implementation

Now, before we start, remember that we're trying to get a more efficient way of sampling from our `model`, right? We're *not* trying to infer the parameters that generated the data; instead we're working under the assumption that the data came from the given model with known parameters (which in this case is of course true), and we want to produce *approximate* samples from this model. With that in mind, let's begin!

First we import some necessary packages and set the random seed for reproducibility.

```
using Random
Random.seed!(1)

using Plots, StatsPlots, LaTeXStrings
```

The `Turing.Model` can then be defined:

```
σ² = 1.0
K = 3     # number of clusters
n = 1000  # number of samples

@model mixture(x = Vector{Float64}(undef, 1),
               μ = Vector{Float64}(undef, K),
               π = (1 / K) .* ones(K)) = begin
    for i = 1:K
        μ[i] ~ Normal(0.0, σ²)
    end

    c = tzeros(Int, length(x))
    for i ∈ eachindex(x)
        c[i] ~ Categorical(π)
        x[i] ~ Normal(μ[Int(c[i])], 1)
    end
end
```

```
mixture (generic function with 4 methods)
```

This might look a bit complicated, but it really isn't.

- `K` is the number of clusters.
- `x = Vector{Float64}(undef, 1)` means that *if* we don't specify any data `x` when instantiating the `Model`, `x` will be treated as a random variable to be sampled rather than as an observation. As we will see in a bit, this allows us to generate samples for `x` from the `Model`!
- `μ = Vector{Float64}(undef, K)` will hold the means of the different clusters; again, the `undef` will simply allow us to choose between fixing or sampling `μ`.
- `π` will denote the weights of the clusters, i.e. `π[k]` is the probability of assignment to cluster `k`. The default value `(1 / K) .* ones(K)` assumes uniform cluster weights.
- `tzeros` is simply an initialization procedure for latent random variables which is compatible with Turing's particle samplers (we're going to use Particle Gibbs, `PG`). See Turing's documentation for more info.

Let's produce some samples from the model with $\mu_1 = -5.0$, $\mu_2 = 0.0$ and $\mu_3 = 5.0$, together with the cluster weights $\pi_1 = 0.1$, $\pi_2 = 0.2$ and $\pi_3 = 0.7$, e.g. there's a $0.7$ probability of being assigned cluster $3$, which is normally distributed with mean $5.0$.

```
# sample from the model
π = [0.1, 0.2, 0.7]  # cluster weights
mix = mixture(μ = [-5.0, 0.0, 5.0], π = π)
samples = sample(mix, PG(200, n));
```

```
[PG] Sampling...100% Time: 0:00:29
```

```
plot(samples[:x], size = (1000, 400))
```

Looks nice! Let's extract the cluster assignments and the observations:

```
cluster_assignments = vec([v for v ∈ samples[:c].value])
x = vec([e for e ∈ samples[:x].value])
```

```
1000-element Array{Float64,1}:
  6.147820300612501
  0.4663086943131637
 -0.6258083879904177
  5.724759360321863
  4.852150430501788
  8.069098838605914
  6.02043226784142
  5.25245793098362
  5.236303942371799
  4.918764327220161
 -0.40921684006032205
  3.727671136403713
  5.121310585284786
  ⋮
  6.344736353801734
  0.4270545469128376
  4.834727745113844
 -4.668084240525111
  4.090751518471361
  4.557843176639102
  3.9554058940625487
  0.4796996702606038
 -4.078674006599593
 -5.560181598563516
  5.773778693804238
 -4.9926860296827575
```

Now we implement the `VariationalInference` interface for `Turing`. First we define the `VariationalInference` type `CAVI`:

```
struct CAVI
    max_iter  # maximum number of optimization steps
end
```

And then we define how the ELBO is evaluated for this particular VI algorithm and this `MeanField` model:

```
function (elbo::Variational.ELBO)(alg::CAVI, q::MeanField, model, num_samples)
    # empirical estimate of E_q[log p(x, z)]
    res = 0.0
    for i = 1:num_samples
        z = rand(q)
        res += logdensity(model, z) / num_samples
    end

    # ELBO = E_q[log p(x, z)] + H(q)
    return res + entropy(q)
end
```

Observe that here we are using an empirical estimate of the expectation $\mathbb{E}_{q(z)}[\log p(x, z)]$, while we can compute $\mathbb{E}_{q(z)}[\log q(z)]$ in closed form. `entropy` is provided by `StatsBase` for both `Categorical` and `Normal`, so we don't even have to implement those. Under a mean-field assumption, the entropy of `q` is simply the sum of the entropies of the independent factors.
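That last claim is immediate from the mean-field factorization:

$$\mathbb{H}[q] = -\mathbb{E}_{q}\Big[ \sum_j \log q_j(z_j) \Big] = -\sum_j \mathbb{E}_{q_j}\big[\log q_j(z_j)\big] = \sum_j \mathbb{H}[q_j]$$

so for our model, `entropy(q)` is just the sum of $K$ Gaussian entropies and $n$ Categorical entropies.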

Both the `logdensity` and `MeanField` definitions can be found in the appendix.

As mentioned before, we now need to at the very least define `optimize!` and `vi`. This is kind of a contrived example, since the update derived before is only applicable to a model of this particular form. Because of this, we're going to instantiate the `mixture` with the generated data and then dispatch on `typeof(model)`.

```
model = mixture(x = x, π = π)
```

```
Turing.Model{Tuple{:c,:μ},Tuple{:x},getfield(Main, Symbol("###inner_function#375#11")){Array{Float64,1}},NamedTuple{(:x,),Tuple{Array{Float64,1}}},NamedTuple{(:μ, :x),Tuple{Array{Float64,1},Array{Float64,1}}}}(getfield(Main, Symbol("###inner_function#375#11")){Array{Float64,1}}(Core.Box(getfield(Main, Symbol("###inner_function#375#11")){Array{Float64,1}}(#= circular reference @-2 =#)), [0.1, 0.2, 0.7]), (x = [6.14782, 0.466309, -0.625808, 5.72476, 4.85215, 8.0691, 6.02043, 5.25246, 5.2363, 4.91876 … 4.83473, -4.66808, 4.09075, 4.55784, 3.95541, 0.4797, -4.07867, -5.56018, 5.77378, -4.99269],), (μ = [6.93058e-310, 6.93058e-310, 6.93058e-310], x = [6.93058e-310]))
```

```
function optimize!(alg::CAVI, elbo::Variational.ELBO, q, model::typeof(model), m, s², φ)
    # ideally we'd pack and unpack the params `m`, `s²` and `φ`,
    # but for the sake of exposition we skip that bit here
    K = length(m)
    n = size(φ, 1)

    for step_idx = 1:alg.max_iter
        # update cluster assignments
        for i = 1:n
            for k = 1:K
                # If the cluster weights were uniform, we could do without accessing π:
                # φ[i, k] = exp(m[k] * x[i] - 0.5 * (s²[k] + m[k]^2))
                # HACK: π is a global variable :/
                φ[i, k] = π[k] * exp(m[k] * x[i] - 0.5 * (s²[k] + m[k]^2))
            end

            φ[i, :] ./= sum(φ[i, :])  # normalize
        end

        # HACK: σ² here is a global variable :/
        # update Gaussians: closed-form update s²_k = 1 / (1/σ² + Σᵢ φ_ik)
        for k = 1:K
            φ_k = φ[:, k]
            denominator = (1 / σ²) + sum(φ_k)
            m[k] = sum(φ_k .* x) / denominator
            s²[k] = 1 / denominator
        end
    end

    m, s², φ
end
```

```
optimize! (generic function with 1 method)
```

```
function Variational.vi(model::typeof(model), alg::CAVI, q::MeanField)
    discrete_idx = findfirst(d -> d isa DiscreteNonParametric, q.dists)
    K = length(q.dists[discrete_idx].p)            # extract the number of clusters
    n = sum(isa.(q.dists, DiscreteNonParametric))  # each data point has a cluster assignment

    # initialization of variables
    φ = rand(n, K)
    φ ./= sum(φ; dims = 2)

    m = randn(K)
    s² = ones(K)

    # objective
    elbo = Variational.ELBO()

    # optimize
    optimize!(alg, elbo, q, model, m, s², φ)

    # create new distribution
    q_new = MeanField(copy(q.dists), q.ranges)
    for k = 1:K
        q_new.dists[k] = Normal(m[k], sqrt(s²[k]))
    end

    for i ∈ eachindex(x)
        q_new.dists[i + K] = Categorical(φ[i, :])
    end

    q_new
end
```

Now we construct our mean-field approximation from `model`.

```
# extract priors into mean-field
var_info = Turing.VarInfo()
model(var_info, Turing.SampleFromUniform())

q = MeanField(deepcopy(var_info.dists), var_info.ranges)
```

```
MeanField{Array{Distribution,1}}( dists: Distribution[Normal{Float64}(μ=0.0, σ=1.0), Normal{Float64}(μ=0.0, σ=1.0), Normal{Float64}(μ=0.0, σ=1.0), DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[0.1, 0.2, 0.7]), DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[0.1, 0.2, 0.7]), DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[0.1, 0.2, 0.7]), DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[0.1, 0.2, 0.7]), DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[0.1, 0.2, 0.7]), DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[0.1, 0.2, 0.7]), DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[0.1, 0.2, 0.7]) … DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[0.1, 0.2, 0.7])]
 ranges: UnitRange{Int64}[1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:9, 10:10 … 994:994, 995:995, 996:996, 997:997, 998:998, 999:999, 1000:1000, 1001:1001, 1002:1002, 1003:1003] )
```

Let's check that it indeed produces the expected samples:

```
rand(q)
```

```
1003-element Array{Float64,1}:
 -0.7810409728512453
 -0.7015975666891903
 -1.2133827111882445
  3.0
  3.0
  3.0
  3.0
  2.0
  3.0
  1.0
  3.0
  3.0
  3.0
  ⋮
  3.0
  3.0
  3.0
  3.0
  2.0
  3.0
  1.0
  2.0
  3.0
  2.0
  1.0
  2.0
```

The only issue with this `MeanField` implementation is that, to satisfy the `<: Distribution{Multivariate, Continuous}` type constraint, we need to return an `AbstractArray{<: Real}`. This means that the cluster assignments, which ideally would be integers, are now returned as floats.

This is the reason why, in the definition of `mixture`, we added an `Int(c[i])`. It's kind of ugly, but, yah know, it works.

Finally we can perform CAVI on this mixture model:

```
# perform VI
cavi = CAVI(10)
q_new = vi(model, cavi, q)
```

```
MeanField{Array{Distribution,1}}( dists: Distribution[Normal{Float64}(μ=-0.0720897, σ=1.42392e-11), Normal{Float64}(μ=-4.91145, σ=5.46488e-11), Normal{Float64}(μ=4.92814, σ=4.9844e-15), DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[1.1981e-9, 1.67617e-27, 1.0]), DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[0.999615, 1.21824e-6, 0.000384242]), DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[0.999758, 0.000240409, 1.63404e-6]), DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[9.93394e-9, 1.07655e-25, 1.0]), DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[7.7963e-7, 5.76301e-22, 0.999999]), DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[8.06611e-14, 1.03465e-35, 1.0]), DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[2.26518e-9, 5.87004e-27, 1.0]) … DiscreteNonParametric{Int64,Float64,Base.OneTo{Int64},Array{Float64,1}}(support=Base.OneTo(3), p=[2.76477e-6, 0.999997, 1.4889e-21])]
 ranges: UnitRange{Int64}[1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:9, 10:10 … 994:994, 995:995, 996:996, 997:997, 998:998, 999:999, 1000:1000, 1001:1001, 1002:1002, 1003:1003] )
```

Let's compare the ELBO of `q_new` to that of `q`:

```
elbo = Variational.ELBO()
elbo(cavi, q, model, 100), elbo(cavi, q_new, model, 100)
```

```
(-13270.643508809633, -2240.377147461269)
```

Hey, it improved! Neat. Let's inspect the obtained parameters.

```
q_new.dists[1:K]  # <= distributions for the cluster means
```

```
3-element Array{Distribution,1}:
 Normal{Float64}(μ=-0.07208966158030183, σ=1.4239202985246593e-11)
 Normal{Float64}(μ=-4.911453128528145, σ=5.464878645528e-11)
 Normal{Float64}(μ=4.928144006028532, σ=4.984398204586815e-15)
```

As we can see, the means are fairly close to the ones used to generate the data, and the variances are small. This is what we expected (or rather *hoped for*, but since it worked out, we just say it was expected, right guys?). In the data generation process we had the means *fixed*, i.e. the "variance was zero", thus the variance of the estimates should also be small.

This further displays how this is a "contrived" example in Bayesian inference. Since we fixed the means during the generation process, and the likelihood had unit variance, the posterior distributions we get *should* (and do) converge to, basically, point estimates. As a result, sampling from the resulting variational posterior `q_new` will almost always give the same result.

This kind of defeats the purpose of doing Bayesian inference in the first place, since the "true" distribution of the latent variables is a discrete distribution with a single outcome, i.e. a fixed distribution. At the same time, most interesting examples will probably not be in the conditionally conjugate family, thus CAVI won't be applicable. As mentioned before, there are more general approaches to come!

We can also inspect the learned cluster weights:

```
vec(mean(transpose(hcat([d.p for d ∈ q_new.dists[K + 1:end]]...)); dims = 1))
```

```
3-element Array{Float64,1}:
 0.19451210528824994
 0.08900147050764164
 0.7164864242041084
```

Comparing to the cluster assignments in the dataset:

```
[sum(cluster_assignments .== v) / length(cluster_assignments) for v ∈ sort(unique(cluster_assignments))]
```

```
3-element Array{Float64,1}:
 0.089
 0.199
 0.712
```

Notice that the cluster weights indeed correspond to the correct means, e.g. the cluster with mean ≈ -5 indeed had weight ≈ 0.089. Neat stuff, ain't it?

Observe that the *order* of the cluster means is different from what we used in the data generation; we used the ordering `[-5.0, 0.0, 5.0]`. This is because the clusters are *exchangeable*, i.e. any permutation of the cluster indices will result in the same approximate posterior (modulo stochasticity). Which order we get these in simply comes down to the initialization values: if the initialized m₂ is closer to the true mean μ₁ than m₁ is, then it's very likely that the resulting variational posterior will have an ordering with index 2 corresponding to the "true" index 1, if yah feel me.

Let's finish off with a neat GIF of the cluster assignment. In creating this GIF, we'll also see how the user can write VI for Turing as a "training" loop, more similar to other machine learning techniques. The general idea is to

- Initialize the parameters `m`, `s²`, and `φ`, together with the initial mean-field distribution `q`.
- Perform an optimization step using `optimize!`.
- Plot the inferred cluster assignments and ELBO.

```
# Let's do one where we plot the progress
q = MeanField(deepcopy(var_info.dists), var_info.ranges)
cavi = CAVI(1)  # perform 1 step for each `optimize!` call

φ = rand(n, K)  # cluster assignments
φ ./= sum(φ; dims = 2)

m = randn(K)
s² = ones(K)

# update distributions
for k = 1:K
    q.dists[k] = Normal(m[k], sqrt(s²[k]))
end

for i ∈ eachindex(x)
    q.dists[i + K] = Categorical(φ[i, :])
end

objectives = []  # history of objective evaluations

anim = @animate for step_idx = 1:20
    # compute empirical estimate of ELBO
    step_elbo = elbo(cavi, q, model, 100)
    push!(objectives, step_elbo)
    @info "[$step_idx]" step_elbo

    # plotting of ELBO
    # sort so that the cluster in the middle is coloured red for visibility
    mean_idx_sort = sortperm(m)
    color_map = Dict(zip(sortperm(m), [colorant"#41b8f4", colorant"#f44155", colorant"#13c142"]))
    colors = map(c -> color_map[c], Int.(rand(q)[K + 1:end]))

    plt1 = scatter(x, color = colors, label = "")
    title!("Step $step_idx with ELBO $step_elbo")
    xlabel!("Index of data point")

    plt2 = plot(max(1, step_idx - 10):step_idx, objectives[max(1, step_idx - 10):step_idx], label = "ELBO")
    xlabel!("Step index")

    # run CAVI for 1 iteration to update parameters m, s², φ
    optimize!(cavi, elbo, q, model, m, s², φ)

    # update distributions so we can evaluate the ELBO with updated parameters
    for k = 1:K
        q.dists[k] = Normal(m[k], sqrt(s²[k]))
    end

    for i ∈ eachindex(x)
        q.dists[i + K] = Categorical(φ[i, :])
    end

    plot(plt1, plt2, layout = grid(2, 1, heights = [0.7, 0.3]), size = (1000, 500))
end

gif(anim, ".2019-06-25-variational-inference-part-1-cavi/figures/gaussian_mixture_cluster_assignments.gif", fps = 5);
```

```
┌ Info: [1]
│   step_elbo = -14505.608414741417
└ @ Main In[35]:28
┌ Info: [2]
│   step_elbo = -4295.3039673855865
└ @ Main In[35]:28
┌ Info: [3]
│   step_elbo = -3322.581925954455
└ @ Main In[35]:28
┌ Info: [4]
│   step_elbo = -3302.4541140390593
└ @ Main In[35]:28
┌ Info: [5]
│   step_elbo = -3300.152447964722
└ @ Main In[35]:28
┌ Info: [6]
│   step_elbo = -3295.6891878053775
└ @ Main In[35]:28
┌ Info: [7]
│   step_elbo = -3289.9682776026852
└ @ Main In[35]:28
┌ Info: [8]
│   step_elbo = -3284.6262893228004
└ @ Main In[35]:28
┌ Info: [9]
│   step_elbo = -3273.6817405178094
└ @ Main In[35]:28
┌ Info: [10]
│   step_elbo = -3266.9336190510426
└ @ Main In[35]:28
┌ Info: [11]
│   step_elbo = -3259.8310113380035
└ @ Main In[35]:28
┌ Info: [12]
│   step_elbo = -3248.8725276001446
└ @ Main In[35]:28
┌ Info: [13]
│   step_elbo = -3243.003893619992
└ @ Main In[35]:28
┌ Info: [14]
│   step_elbo = -3235.485410286028
└ @ Main In[35]:28
┌ Info: [15]
│   step_elbo = -3226.4825612330606
└ @ Main In[35]:28
┌ Info: [16]
│   step_elbo = -3218.4085612846575
└ @ Main In[35]:28
┌ Info: [17]
│   step_elbo = -3209.639842972914
└ @ Main In[35]:28
┌ Info: [18]
│   step_elbo = -3202.432054017625
└ @ Main In[35]:28
┌ Info: [19]
│   step_elbo = -3193.836909448166
└ @ Main In[35]:28
┌ Info: [20]
│   step_elbo = -3184.2329537025767
└ @ Main In[35]:28
┌ Info: Saved animation to
│   fn = /home/tor/org-blog/posts/.2019-06-25-variational-inference-part-1-cavi/figures/gaussian_mixture_cluster_assignments.gif
└ @ Plots /home/tor/.julia/packages/Plots/oiirH/src/animation.jl:90
```

## In the future

Though we have indeed derived the algorithm for a particular example, and the current implementation is specialized to an *instance* of a particular mixture model, one could imagine generalizing this approach. We noted earlier that this is indeed an instance of a more general family of distributions on which we can perform CAVI with closed-form updates blei16_variat_infer. One *could* imagine automatically detecting whether a `Model` consists of exponential-family distributions and whether the entire model is conditionally conjugate. Then we could simply construct a suitable mean-field approximation and run `CAVI` with these more general updates!

That would be neat.

## Final remarks

Aight, so this was interesting and all, but we had to do a lot of manual labour to obtain the variational updates. In an ideal world you have defined some `Model` and you just want to call `vi(model, alg)`, maybe even make the choice of the variational family yourself by calling `vi(model, alg, q)`, and then Turing does the rest: no deriving closed-form updates, no worrying about whether the densities in your model are in the exponential family, etc. In the next post we will have a look at a variational inference algorithm which gets us further in that direction: **automatic differentiation variational inference (ADVI)**.

The animation below is obtained simply by calling `vi(model, ADVI(10, 1000))` on a simple generative model:

It's going to be *neat*.

## Bibliography

- [blei16_variat_infer] Blei, Kucukelbir & McAuliffe, Variational Inference: A Review for Statisticians, *CoRR*, (2016). link.
- [pmlr-v80-chen18k] Chen, Tao, Zhang, Henao & Duke, Variational Inference and Model Selection with Generalized Evidence Bounds, 893-902, in: Proceedings of the 35th International Conference on Machine Learning, edited by Dy & Krause, PMLR (2018).
- [liu16_stein_variat_gradien_descen] Liu & Wang, Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm, *CoRR*, (2016). link.

## Appendix

### Code

```
# Mean-field for arbitrary distributions
import Distributions: _rand!, _logpdf

struct MeanField{TDists <: AbstractVector{<: Distribution}} <: Distribution{Multivariate, Continuous}
    dists::TDists
    ranges::Vector{UnitRange{Int}}
end

MeanField(dists) = begin
    ranges::Vector{UnitRange{Int}} = []
    idx = 1
    for d ∈ dists
        push!(ranges, idx:idx + length(d) - 1)
        idx += length(d)
    end

    MeanField(dists, ranges)
end

# Base.length(mf::MeanField) = sum(length(d) for d ∈ mf.dists)
Base.length(mf::MeanField) = mf.ranges[end][end]

_rand!(rng::AbstractRNG, mf::MeanField, x::AbstractVector{T} where T <: Real) = begin
    for i ∈ eachindex(mf.ranges)
        d = mf.dists[i]
        r = mf.ranges[i]
        x[r] = rand(rng, d, 1)
    end
    x
end

_logpdf(mf::MeanField, x::AbstractVector{T} where T <: Real) = begin
    tot = 0.0
    for i ∈ eachindex(mf.ranges)
        r = mf.ranges[i]
        if length(r) == 1
            tot += logpdf(mf.dists[i], x[r[1]])
        else
            tot += logpdf(mf.dists[i], x[r])
        end
    end
    return tot
end

import StatsBase: params, entropy
entropy(mf::MeanField) = begin
    sum(entropy.(mf.dists))
end
```
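For readers less familiar with Julia, the idea behind `MeanField` above — a product distribution that samples and evaluates each independent factor over its own index range — can be sketched in Python roughly as follows (all names here are hypothetical, for illustration only):

```python
import math
import random

class MeanFieldProduct:
    """Rough analogue of the MeanField struct: a product of independent
    factors, each represented as a (sample, logpdf, length) triple."""
    def __init__(self, factors):
        self.factors = factors
        self.ranges = []
        idx = 0
        for _, _, length in factors:
            self.ranges.append(range(idx, idx + length))
            idx += length

    def rand(self):
        # concatenate one draw from each factor
        x = []
        for sample, _, _ in self.factors:
            x.extend(sample())
        return x

    def logpdf(self, x):
        # mean-field: log q(x) is the sum of the per-factor log-densities
        return sum(lp([x[i] for i in r])
                   for (_, lp, _), r in zip(self.factors, self.ranges))

def normal_factor(mu, sigma=1.0):
    sample = lambda: [random.gauss(mu, sigma)]
    lp = lambda v: (-0.5 * ((v[0] - mu) / sigma)**2
                    - math.log(sigma * math.sqrt(2 * math.pi)))
    return (sample, lp, 1)

# q(z1, z2) = N(z1; 0, 1) * N(z2; 2, 1)
q = MeanFieldProduct([normal_factor(0.0), normal_factor(2.0)])
```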

```
logdensity(model::Turing.Model, z) = begin
    # kind of hacky; improved impl coming to Turing soon:
    # https://github.com/TuringLang/Turing.jl/issues/817

    # initialize VarInfo
    var_info = Turing.VarInfo()
    model(var_info)

    var_info.vals .= z
    model(var_info)

    var_info.logp
end
```

### Derivation of variational updates

Recall the equation we found earlier for updates when the model is *conditionally conjugate*:
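For reference, the general coordinate update of blei16_variat_infer, which the derivations below instantiate, can be written as:

```latex
q_j^*(z_j) \propto \exp\left( \mathbb{E}_{-j}\left[ \log p(z_j \mid z_{-j}, \mathbf{x}) \right] \right)
```

where the expectation is taken over all mean-field factors except the j-th.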

Now we consider the different mean-field factors.

#### Mixture assignments

In this case we simply have

since the cluster assignment cᵢ for observation xᵢ is (conditionally) independent of the cluster assignments of the rest of the observations, denoted c₋ᵢ. The first term is simply

where πᵢ is the cluster weight of the i-th cluster, or equivalently, the probability of assignment to the i-th cluster. Thus the interesting term is the expectation of the log-likelihood. Though in our *implementation* of this Gaussian mixture cᵢ takes values in {1, …, K}, we can equivalently consider cᵢ as an indicator vector, with the k-th component given by

where

Then we can write the log-likelihood

Then

where we have used the fact that the components of the indicator vector cᵢ sum to one to "move" the constant expression out of the sum, and, recalling that the likelihood is simply a unit-variance Gaussian with mean μₖ,

Moreover, observe that

and that

Substituting back into the above expression

where we have absorbed the constant expression into the proportionality. This is indeed equivalent to the expression derived in blei16_variat_infer by simply observing that

This *finally* gives us the full update for the assignment probabilities φᵢ:
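Concretely, the assignment update can be sketched in Python as below (hypothetical names; assumes the unit-variance likelihood and a uniform prior over clusters from the example model, with `m[k]` and `s2[k]` being the variational mean and variance of the k-th cluster mean):

```python
import math

def update_assignments(x, m, s2):
    """CAVI update for the cluster-assignment probabilities:

        phi[i][k] ∝ exp(m[k] * x[i] - (s2[k] + m[k]**2) / 2)

    where m[k] = E[mu_k] and s2[k] + m[k]**2 = E[mu_k^2] under q."""
    phi = []
    for xi in x:
        logits = [mk * xi - (s2k + mk**2) / 2 for mk, s2k in zip(m, s2)]
        mx = max(logits)  # subtract the max for numerical stability
        w = [math.exp(l - mx) for l in logits]
        z = sum(w)
        phi.append([wi / z for wi in w])
    return phi

# Data points lying near each of three well-separated cluster means
phi = update_assignments([-5.1, 0.2, 5.3], m=[-5.0, 0.0, 5.0], s2=[0.1, 0.1, 0.1])
```

The max-subtraction is just the usual log-sum-exp trick so that the normalization does not overflow when the (unnormalized) log-probabilities are large.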

#### Means of mixture components

For the means we have

Observe then that we can complete the square

Exponentiating this and absorbing the term in red into a multiplicative constant (wrt. μₖ) leaves us with

where

Congratulations! It's a bo…Gaussian!
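In code the resulting Gaussian update is a couple of lines per cluster. A hypothetical Python sketch, assuming a zero-mean Gaussian prior on the means and the unit-variance likelihood of the example model (names made up for illustration):

```python
def update_means(x, phi, prior_var=1.0):
    """CAVI update for the cluster means: q(mu_k) = N(m[k], s2[k]) with

        s2[k] = 1 / (1 / prior_var + sum_i phi[i][k])
        m[k]  = s2[k] * sum_i phi[i][k] * x[i]

    given responsibilities phi, prior mu_k ~ N(0, prior_var), and a
    unit-variance likelihood."""
    K = len(phi[0])
    m, s2 = [0.0] * K, [0.0] * K
    for k in range(K):
        nk = sum(phi[i][k] for i in range(len(x)))         # effective cluster size
        wx = sum(phi[i][k] * x[i] for i in range(len(x)))  # responsibility-weighted data
        s2[k] = 1.0 / (1.0 / prior_var + nk)
        m[k] = s2[k] * wx
    return m, s2

# Hard assignments: points 1.0, 3.0 in cluster 0 and point 10.0 in cluster 1.
m, s2 = update_means([1.0, 3.0, 10.0], [[1, 0], [1, 0], [0, 1]], prior_var=1e6)
```

With a nearly flat prior and hard assignments this reduces to the within-cluster sample mean, with variance shrinking like 1/nₖ — consistent with the tiny variances we saw in `q_new` earlier.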

Actually, in blei16_variat_infer they insist on using instead of just writing , etc. At first I didn't understand why, but now I realize: the above is basically *deriving* that the conditionally conjugate density is a Gaussian, rather than assuming so! Neato.

## Footnotes:

^{1}

It's at least famous in certain circles. Maybe few circles, but still, they exist… There are dozens of us! Dozens!

^{2}

As we will see in later posts in this series, this is not always the case. E.g. the class of variational inference methods called *Amortized VI* uses a *different* latent variable for each observation.

^{3}

Usually the *variational family* is chosen s.t. it is cheap to sample from.

^{4}

You noticed the `v3`? What happened to `v1` and `v2`? Don't worry about it, they're fine.