Variational Inference

Overview

In variational inference, the posterior distribution $p(\mathbf{z} \mid \mathbf{x})$ over a set of unobserved variables $\mathbf{z} = \{z_1, \dots, z_m\}$ given some data $\mathbf{x}$ is approximated by a variational distribution, $q(\mathbf{z})$:

$$p(\mathbf{z} \mid \mathbf{x}) \approx q(\mathbf{z})$$

The distribution $q(\mathbf{z})$ is restricted to belong to a family of distributions of simpler form than $p(\mathbf{z} \mid \mathbf{x})$, selected with the intention of making $q(\mathbf{z})$ similar to the true posterior, $p(\mathbf{z} \mid \mathbf{x})$.

We're basically making life simpler for ourselves by casting approximate conditional inference as an optimization problem.

The evidence lower bound (ELBO)

• Specify a family $\mathcal{Q}$ of densities over the latent variables $\mathbf{z}$
• Each $q(\mathbf{z}) \in \mathcal{Q}$ is a candidate approximation to the exact conditional distribution
• Goal is then to find the best candidate, i.e. the one closest in KL divergence to the exact conditional distribution

$$q^*(\mathbf{z}) = \arg\min_{q(\mathbf{z}) \in \mathcal{Q}} \mathrm{KL}\big(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\big) \tag{1}$$

• $q^*(\mathbf{z})$ is the best approx. to the conditional distribution within that family
• However, the equation above requires us to compute the log evidence $\log p(\mathbf{x})$, which may be intractable, OR DOES IT?!

$$\mathrm{KL}\big(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\big) = \mathbb{E}_q\big[\log q(\mathbf{z})\big] - \mathbb{E}_q\big[\log p(\mathbf{z}, \mathbf{x})\big] + \log p(\mathbf{x}) \tag{2}$$

• Buuut, since we want to minimize wrt. $q(\mathbf{z})$, and $\log p(\mathbf{x})$ is constant wrt. $q$, we can simply minimize the above without having to worry about $\log p(\mathbf{x})$!
• By dropping the constant term (wrt. $q$) and moving the sign outside, we get

$$\mathrm{ELBO}(q) = \mathbb{E}_q\big[\log p(\mathbf{z}, \mathbf{x})\big] - \mathbb{E}_q\big[\log q(\mathbf{z})\big] \tag{3}$$

• Thus, maximizing the $\mathrm{ELBO}$ is equivalent to minimizing the KL divergence

Why use the above representation of the ELBO as our objective function? Because $p(\mathbf{z}, \mathbf{x})$ can be rewritten as $p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})$, thus we simply need to come up with:

• A model for the likelihood $p(\mathbf{x} \mid \mathbf{z})$ of the data given some latent variables
• A prior probability $p(\mathbf{z})$ for the latent variables
• We can rewrite the $\mathrm{ELBO}$ to give us some intuition about the optimal variational density

$$\mathrm{ELBO}(q) = \mathbb{E}_q\big[\log p(\mathbf{x} \mid \mathbf{z})\big] - \mathrm{KL}\big(q(\mathbf{z}) \,\|\, p(\mathbf{z})\big) \tag{4}$$

• Basically says, "the evidence lower bound combines maximization of the expected log-likelihood and minimization of the KL divergence from the prior"
• $\mathbb{E}_q\big[\log p(\mathbf{x} \mid \mathbf{z})\big]$ encourages densities that increase the likelihood of the data
• $-\mathrm{KL}\big(q(\mathbf{z}) \,\|\, p(\mathbf{z})\big)$ encourages densities close to the prior
• Another property (and the reason for the name) is that it puts a lower bound on the (log) evidence, $\log p(\mathbf{x})$:

$$\log p(\mathbf{x}) = \mathrm{KL}\big(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\big) + \mathrm{ELBO}(q) \;\ge\; \mathrm{ELBO}(q) \tag{5}$$

• Which means that, for fixed data, if we increase the $\mathrm{ELBO}(q)$ => the $\mathrm{KL}\big(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\big)$ must decrease
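As a quick numerical sanity check of eq. (5), here is a Monte Carlo computation of the ELBO for a toy conjugate model (a hypothetical example, not from the notes above): prior $z \sim \mathcal{N}(0, 1)$, likelihood $x \mid z \sim \mathcal{N}(z, 1)$, so the exact posterior is $\mathcal{N}(x/2, 1/2)$ and the exact evidence is $p(x) = \mathcal{N}(x; 0, 2)$ (variance 2).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3                 # a single observed data point
n_samples = 200_000

def elbo(q_mu, q_sigma):
    """Monte Carlo estimate of E_q[log p(x, z)] - E_q[log q(z)]."""
    z = rng.normal(q_mu, q_sigma, size=n_samples)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)
    log_q = norm.logpdf(z, q_mu, q_sigma)
    return np.mean(log_joint - log_q)

print(norm.logpdf(x, 0.0, np.sqrt(2.0)))   # exact log p(x)
print(elbo(x / 2, np.sqrt(0.5)))           # q = exact posterior -> ELBO equals log p(x)
print(elbo(0.0, 1.0))                      # any other q -> strictly smaller ELBO

With $q$ equal to the exact posterior the first two numbers agree (up to MC error); the third is strictly smaller, as eq. (5) says.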

Examples of families

Mean-field variational family

• Assumes the latent variables to be mutually independent, i.e.

$$q(\mathbf{z}) = \prod_{j=1}^{m} q_j(z_j) \tag{6}$$

• Does not model the observed data: $\mathbf{x}$ does not appear in eq. (6) => it's the $\mathrm{ELBO}$ which connects the fitted variational density to the data
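As a tiny sketch of what eq. (6) means in code (purely illustrative; the univariate Gaussian factors are chosen arbitrarily): under the mean-field assumption, $\log q(\mathbf{z})$ is just a sum of independent per-factor log-densities.

import numpy as np
from scipy.stats import norm

# one variational factor q_j per latent variable z_j
mus = np.array([0.0, 1.0, -2.0])
sigmas = np.array([1.0, 0.5, 2.0])

def log_q(z):
    # log q(z) = sum_j log q_j(z_j)
    return np.sum(norm.logpdf(z, mus, sigmas))

print(log_q(np.array([0.1, 0.9, -1.5])))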

Optimization

Coordinate ascent mean-field variational inference (CAVI)

Notation

• $\mathbf{x} = x_{1:n}$ is the data
• $\mathbf{z} = z_{1:m}$ are the latent variables of our model for $\mathbf{x}$
• $q(\mathbf{z}) = \prod_{j=1}^{m} q_j(z_j)$ is the variational density
• $q_j(z_j)$ is a variational factor
• $\mathbb{E}_{-j}[\,\cdot\,] = \mathbb{E}_{\prod_{l \neq j} q_l}[\,\cdot\,]$, i.e. the expectation over all factors of $q$ keeping the $j$-th factor fixed

Overview

Based on coordinate ascent, thus no gradient information required.

Iteratively optimizes each factor of the mean-field variational density, while holding others fixed. Thus, climbing the ELBO to a local optimum.

Algorithm

• Input: Model $p(\mathbf{z}, \mathbf{x})$, data set $\mathbf{x}$
• Output: Variational density $q(\mathbf{z}) = \prod_{j=1}^{m} q_j(z_j)$

Then the actual algorithm goes as follows:
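• Initialize: variational factors $q_j(z_j)$, $j = 1, \dots, m$
• While the $\mathrm{ELBO}$ has not converged:
  • For each $j = 1, \dots, m$, set $q_j(z_j) \propto \exp\big\{\mathbb{E}_{-j}\big[\log p(z_j \mid \mathbf{z}_{-j}, \mathbf{x})\big]\big\}$
  • Compute $\mathrm{ELBO}(q) = \mathbb{E}_q\big[\log p(\mathbf{z}, \mathbf{x})\big] - \mathbb{E}_q\big[\log q(\mathbf{z})\big]$
• Return $q(\mathbf{z}) = \prod_{j=1}^{m} q_j(z_j)$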

Pseudo-implementation in Python

Which I believe could look something like this when written in Python:

import numpy as np


class UpdateableModel:
    """A single variational factor q_j(z_j) (placeholder)."""

    def __init__(self):
        self.latent_var = np.random.rand()

    def __call__(self, z):
        # compute q_j(z_j), or equivalently q(z_j | z_{l != j}),
        # since the mean-field approx. allows us to assume independence
        pass

    @staticmethod
    def proba_product(qs, z):
        # compute q_1(z_1) * q_2(z_2) * ... * q_m(z_m)
        res = 1.0
        for q_j, z_j in zip(qs, z):
            res *= q_j(z_j)
        return res


def compute_elbo(q, p_joint, z, x):
    """ELBO(q) = E_q[log p(z, x)] - E_q[log q(z)]."""
    joint_expect = 0.0
    latent_expect = 0.0

    for q_j, z_j in zip(q, z):
        joint_expect += q_j(z_j) * np.log(p_joint(z=z, x=x))
        latent_expect += q_j(z_j) * np.log(q_j(z_j))

    # note the minus sign: ELBO = E[log p(z, x)] - E[log q(z)]
    return joint_expect - latent_expect


def cavi(model, z, x, epsilon=1e-4):
    """
    Coordinate ascent variational inference (pseudo-implementation).

    Parameters
    ----------
    model : callable
        Computes our model probability $p(z, x)$ or $p(z_j | z_{l != j}, x)$
        when one coordinate is held fixed.
    z : array-like
        Initialized values for latent variables, e.g. for a GMM we would have
        mu = z[0], sigma = z[1].
    x : array-like
        Represents the data.
    """
    m = len(z)
    q = [UpdateableModel() for _ in range(m)]

    elbo_prev = -np.inf
    elbo = compute_elbo(q, model, z=z, x=x)

    # iterate until the *change* in the ELBO falls below epsilon
    while abs(elbo - elbo_prev) > epsilon:
        for j in range(m):
            # E_{-j}[log p(z_j | z_{-j}, x)]: expectation over all other factors
            expect_log = 0.0
            for l in range(m):
                if l == j:
                    continue
                expect_log += q[l](z[l]) * np.log(model(fixed=j, z=z, x=x))

            # CAVI update: q_j(z_j) ∝ exp(E_{-j}[log p(z_j | z_{-j}, x)])
            q[j].latent_var = np.exp(expect_log)

        elbo_prev, elbo = elbo, compute_elbo(q, model, z=z, x=x)

    return q


Mean-field approx. → assume latent variables are independent of each other → $q$ can simply be represented as a vector of factors, with the $j$-th entry being an independent model $q_j$ corresponding to $z_j$

Derivation of update

We simply rewrite the ELBO as a function of $q_j(z_j)$, since the independence between the variational factors implies that we can maximize the ELBO wrt. each of them separately.

$$\mathrm{ELBO}(q_j) = \mathbb{E}_j\Big[\mathbb{E}_{-j}\big[\log p(z_j, \mathbf{z}_{-j}, \mathbf{x})\big]\Big] - \mathbb{E}_j\big[\log q_j(z_j)\big] + \mathrm{const} \tag{7}$$

where we have written the expectation wrt. $q(\mathbf{z})$ using iterated expectation.
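Maximizing this wrt. $q_j$ then gives the usual coordinate update,

$$q_j^*(z_j) \propto \exp\Big\{\mathbb{E}_{-j}\big[\log p(z_j \mid \mathbf{z}_{-j}, \mathbf{x})\big]\Big\}$$

where the proportionality follows since $\log p(z_j, \mathbf{z}_{-j}, \mathbf{x}) = \log p(z_j \mid \mathbf{z}_{-j}, \mathbf{x}) + \log p(\mathbf{z}_{-j}, \mathbf{x})$ and the second term is constant wrt. $z_j$.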

∇ ELBO

Used in BBVI (REINFORCE)

• Can obtain an unbiased gradient estimator by sampling from the variational distribution, without having to compute the ELBO analytically
• Only requires computation of the score function of the variational posterior: $\nabla_\lambda \log q_\lambda(\mathbf{z})$
• Given by

$$\nabla_\lambda \mathcal{L} = \mathbb{E}_{q_\lambda}\Big[\nabla_\lambda \log q_\lambda(\mathbf{z})\, \big(\log p(\mathbf{x}, \mathbf{z}) - \log q_\lambda(\mathbf{z})\big)\Big]$$

• Unbiased estimator obtained by sampling from $q_\lambda$:

$$\hat{\nabla}_\lambda \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \nabla_\lambda \log q_\lambda(z_s)\, \big(\log p(\mathbf{x}, z_s) - \log q_\lambda(z_s)\big), \quad z_s \sim q_\lambda$$
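A minimal sketch of this estimator for the same toy conjugate Gaussian model as in the ELBO sketch above (prior $\mathcal{N}(0, 1)$, likelihood $\mathcal{N}(z, 1)$), with $q_\lambda = \mathcal{N}(\mu, \sigma^2)$ and $\lambda = (\mu, \log\sigma)$; the Gaussian score-function derivatives are written out by hand.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3

def log_joint(z):
    return norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)

def score_function_grad(mu, log_sigma, n_samples=1_000):
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=n_samples)

    # score function (grad of log q) wrt. (mu, log_sigma) for a Gaussian
    score_mu = (z - mu) / sigma**2
    score_log_sigma = (z - mu)**2 / sigma**2 - 1.0

    f = log_joint(z) - norm.logpdf(z, mu, sigma)   # log p(x, z) - log q(z)
    return np.mean(score_mu * f), np.mean(score_log_sigma * f)

print(score_function_grad(0.0, 0.0))   # noisy but unbiased gradient estimate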

Reparametrization trick

• $q_\lambda(\mathbf{z})$ needs to be reparametrizable
• E.g. $q_\lambda(z) = \mathcal{N}(z; \mu_\lambda, \sigma_\lambda^2)$ and $z = \mu_\lambda + \sigma_\lambda \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$,

where $\mu_\lambda$ and $\sigma_\lambda$ are parametrized means and std. dev.

• Then we instead do the following:

$$\nabla_\lambda \mathbb{E}_{q_\lambda}\big[\log p(\mathbf{x}, \mathbf{z}) - \log q_\lambda(\mathbf{z})\big] = \mathbb{E}_{\epsilon \sim p(\epsilon)}\Big[\nabla_\lambda \Big(\log p\big(\mathbf{x}, g_\lambda(\epsilon)\big) - \log q_\lambda\big(g_\lambda(\epsilon)\big)\Big)\Big]$$

where $g_\lambda$ is some deterministic function (e.g. $g_\lambda(\epsilon) = \mu_\lambda + \sigma_\lambda \epsilon$) and $p(\epsilon)$ is the "noise" distribution

• Observe that this then also takes advantage of the structure of the joint distribution, since $\nabla_{\mathbf{z}} \log p(\mathbf{x}, \mathbf{z})$ enters the estimator
• But it also requires the joint distribution to be differentiable wrt. $\mathbf{z}$
• Often the entropy of $q_\lambda$ can be computed analytically, which reduces the variance of the gradient estimate since we only have to estimate the expectation of the first term
• Recall that

$$\mathrm{ELBO}(q_\lambda) = \mathbb{E}_{q_\lambda}\big[\log p(\mathbf{x}, \mathbf{z})\big] + \mathbb{H}\big(q_\lambda(\mathbf{z})\big)$$

• E.g. in ADVI, where we are using a Gaussian mean-field approx., the entropy term reduces to $\sum_i \omega_i$ (plus a constant), where $\sigma_i = \exp(\omega_i)$
• Assumptions
• $\log p(\mathbf{x}, \mathbf{z})$ must be differentiable wrt. $\mathbf{z}$
• $g_\lambda(\epsilon)$ must be differentiable wrt. $\lambda$
• Notes
• Usually lower variance than REINFORCE, and can potentially be reduced further if analytical entropy is available
• Being reparametrizable (and this reparametrization being differentiable) limits the family of variational posteriors
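And a corresponding sketch of the reparametrized gradient for the same toy model, with the derivatives written out explicitly and the analytic Gaussian entropy used for the second term (again a hypothetical example, not the notes' own code):

import numpy as np

rng = np.random.default_rng(0)
x = 1.3   # prior z ~ N(0, 1), likelihood x | z ~ N(z, 1)

def reparam_grad(mu, log_sigma, n_samples=1_000):
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=n_samples)
    z = mu + sigma * eps                      # z = g_lambda(eps)

    # d/dz log p(x, z) = d/dz [log N(z; 0, 1) + log N(x; z, 1)] = -z + (x - z)
    dlogp_dz = -z + (x - z)

    # chain rule through z = mu + sigma * eps; the analytic entropy of q
    # contributes d/d(log_sigma) H[q] = 1 (and 0 wrt. mu)
    grad_mu = np.mean(dlogp_dz)
    grad_log_sigma = np.mean(dlogp_dz * sigma * eps) + 1.0
    return grad_mu, grad_log_sigma

print(reparam_grad(0.0, 0.0))   # typically much lower variance than the REINFORCE estimate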

Methods

• Automatic Differentiation Variational Inference (ADVI)
• Objective:

$$\mathcal{L}(\mu, \omega) = \mathbb{E}_{\eta \sim \mathcal{N}(0, I)}\Big[\log p\big(\mathbf{x}, T^{-1}(\zeta)\big) + \log \big|\det J_{T^{-1}}(\zeta)\big|\Big] + \sum_{i} \omega_i + \mathrm{const}, \quad \zeta = \mu + \exp(\omega) \odot \eta$$

where $T$ maps the support of the latent variables to all of $\mathbb{R}^d$
• Variational posterior:
• Gaussian mean-field approx. in the transformed (unconstrained) space, $q(\zeta) = \prod_i \mathcal{N}\big(\zeta_i; \mu_i, \exp(\omega_i)^2\big)$
• Assumptions:
• the support of each latent variable is transformable to $\mathbb{R}$ in each component (i.e. a differentiable, invertible $T$ exists)
• the transformed latent variables $\zeta_i$ are independent under $q$ (mean-field)
• Notes:
• If applicable, it just works
• Easy to implement
• Assumes independence
• Restrictive in choice of variational posterior (i.e. Gaussian mean-field)
• Black-box Variational Inference (BBVI)
• Objective: the ELBO, maximized using the score-function (REINFORCE) gradient

$$\nabla_\lambda \mathcal{L} = \mathbb{E}_{q_\lambda}\Big[\nabla_\lambda \log q_\lambda(\mathbf{z})\, \big(\log p(\mathbf{x}, \mathbf{z}) - \log q_\lambda(\mathbf{z})\big)\Big]$$

• Using variance reduction techniques (Rao-Blackwellization and control variates), for each $\lambda_i$,

$$\hat{\nabla}_{\lambda_i} \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q_i(z_s \mid \lambda_i)\, \big(\log p_i(\mathbf{x}, z_s) - \log q_i(z_s \mid \lambda_i) - \hat{a}_i\big), \quad z_s \sim q_\lambda$$

where $p_i$ contains only the terms of the joint density involving the Markov blanket of $z_i$, and $\hat{a}_i$ is the control-variate scaling (see the BBVI section below)

• Variational posterior: essentially unrestricted (anything we can sample from and whose score function $\nabla_\lambda \log q_\lambda$ we can evaluate)
• Assumptions: only need to be able to sample from $q_\lambda$ and evaluate $\log p(\mathbf{x}, \mathbf{z})$, $\log q_\lambda(\mathbf{z})$ and the score function pointwise
• Notes
• Does not require differentiability of joint-density
• Gradient estimator usually has high variance, even with variance-reduction techniques

Example implementation

using ForwardDiff
using Flux.Tracker
using Flux.Optimise

"""
ADVI(samples_per_step = 10, max_iters = 5000)

Automatic Differentiation Variational Inference (ADVI) for a given model.
"""
samples_per_step # number of samples used to estimate the ELBO in each optimization step
max_iters        # maximum number of gradient steps used in optimization
end

# setup
var_info = VarInfo()
model(var_info, SampleFromUniform())
num_params = size(var_info.vals, 1)

dists = var_info.dists
ranges = var_info.ranges

q = MeanField(zeros(num_params), zeros(num_params), dists, ranges)

# construct objective
elbo = ELBO()

θ = optimize(elbo, alg, q, model; optimizer = optimizer)
μ, ω = θ[1:length(q)], θ[length(q) + 1:end]

MeanField(μ, ω, dists, ranges)
end

# TODO: implement optimize like this?
# end

θ = randn(2 * length(q))
optimize!(elbo, alg, q, model, θ; optimizer = optimizer)

return θ
end

alg_name = alg_str(alg)
samples_per_step = alg.samples_per_step
max_iters = alg.max_iters

# number of previous gradients to use to compute s in adaGrad
stepsize_num_prev = 10

# setup
# var_info = Turing.VarInfo()
# model(var_info, Turing.SampleFromUniform())
# num_params = size(var_info.vals, 1)
num_params = length(q)

# # buffer
# θ = zeros(2 * num_params)

# HACK: re-use previous gradient acc if equal in value
# Can cause issues if two entries have identical values
if θ ∉ keys(optimizer.acc)
    vs = [v for v ∈ keys(optimizer.acc)]
    idx = findfirst(w -> vcat(q.μ, q.ω) == w, vs)
    if idx != nothing
        @info "[$alg_name] Re-using previous optimizer accumulator"
        θ .= vs[idx]
    end
else
    @info "[$alg_name] Already present in optimizer acc"
end

# TODO: in (Blei et al, 2015) TRUNCATED ADAGrad is suggested; this is not available in Flux.Optimise

i = 0
prog = PROGRESS[] ? ProgressMeter.Progress(max_iters, 1, "[$alg_name] Optimizing...", 0) : 0

time_elapsed = @elapsed while (i < max_iters) # & converged  # <= add criterion? A running mean maybe?
    # TODO: separate into a grad(...) call; need to manually provide diff_result buffers
    # ForwardDiff.gradient!(diff_result, f, x)
    grad!(elbo, alg, q, model, θ, diff_result, samples_per_step)

    # apply update rule
    Δ = DiffResults.gradient(diff_result)
    Δ = Optimise.apply!(optimizer, θ, Δ)
    @. θ = θ - Δ

    Turing.DEBUG && @debug "Step$i" Δ DiffResults.value(diff_result) norm(DiffResults.gradient(diff_result))

    PROGRESS[] && (ProgressMeter.next!(prog))

    i += 1
end

@info time_elapsed

return θ
end

# TODO: this probably slows down execution quite a bit; is there a better way of doing this?
f(θ_) = - vo(alg, q, model, θ_, args...)

chunk_size = getchunksize(alg)
# Set chunk size and do ForwardMode.
chunk = ForwardDiff.Chunk(min(length(θ), chunk_size))
end

# TODO: implement for Tracker
#     vo_tracked, vo_pullback = Tracker.forward()
# end
θ_tracked = Tracker.param(θ)
y = - vo(alg, q, model, θ_tracked, args...)
Tracker.back!(y, 1.0)

DiffResults.value!(out, Tracker.data(y))
end

function (elbo::ELBO)(alg::ADVI, q::MeanField, model::Model, θ::AbstractVector{T}, num_samples) where T <: Real
    # setup
    var_info = Turing.VarInfo()

    # initialize VarInfo object
    model(var_info, Turing.SampleFromUniform())

    num_params = length(q)
    μ, ω = θ[1:num_params], θ[num_params + 1:end]

    elbo_acc = 0.0

    # TODO: instead use rand(q, num_samples) and iterate through?

    for i = 1:num_samples
        # iterate through priors, sample and update
        for j = 1:size(q.dists, 1)
            prior = q.dists[j]
            r = q.ranges[j]

            # mean-field params for this set of model params
            μ_i = μ[r]
            ω_i = ω[r]

            # obtain samples from mean-field posterior approximation
            η = randn(length(μ_i))
            ζ = center_diag_gaussian_inv(η, μ_i, exp.(ω_i))

            # inverse-transform back to domain of original prior
            # (NOTE: assuming Turing's `invlink` here)
            θ = invlink(prior, ζ)

            # update
            var_info.vals[r] = θ

            # add the log-det-jacobian of inverse transform;
            # logabsdet returns (log(abs(det(M))), sign(det(M))) so return first entry
            elbo_acc += logabsdet(jac_inv_transform(prior, ζ))[1] / num_samples
        end

        # compute log density
        model(var_info)
        elbo_acc += var_info.logp / num_samples
    end

    # add the term for the entropy of the variational posterior
    variational_posterior_entropy = sum(ω)
    elbo_acc += variational_posterior_entropy

    elbo_acc
end

# extract the mean-field Gaussian params
θ = vcat(q.μ, q.ω)

elbo(alg, q, model, θ, num_samples)
end


Black-box Variational Inference (BBVI)

Controlling the variance

• Variance of the gradient estimator (under MC estimation of the ELBO gradient) can be too large to be useful
Rao-Blackwellization
• Reduces the variance of a rv. by replacing it with its conditional expectation wrt. a subset of the variables
• How

Simple example:

• Two rvs $X$ and $Y$
• A function $J(X, Y)$ of both
• Goal: compute the expectation $\mathbb{E}\big[J(X, Y)\big]$
• Letting

$$\hat{J}(X) = \mathbb{E}\big[J(X, Y) \mid X\big]$$

we note that

$$\mathbb{E}\big[\hat{J}(X)\big] = \mathbb{E}\big[J(X, Y)\big]$$

• Therefore: we can use $\hat{J}(X)$ as an MC approx. of $\mathbb{E}[J(X, Y)]$, with variance

$$\mathrm{Var}\big(J(X, Y)\big) = \mathrm{Var}\big(\hat{J}(X)\big) + \mathbb{E}\Big[\big(J(X, Y) - \hat{J}(X)\big)^2\Big]$$

which means that $\hat{J}(X)$ is a lower-variance estimator than $J(X, Y)$.

• In this case, the "subset of variables" we condition on is $X$, and $\hat{J}(X)$ is the Rao-Blackwellized estimator
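A quick numerical illustration (a made-up toy, not from the example above): take $X, Y \sim \mathcal{N}(0, 1)$ independent and $J(X, Y) = X^2 + XY$; conditioning on $X$ gives $\hat{J}(X) = X^2$, which has the same mean but lower variance.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=1_000_000)
Y = rng.normal(size=1_000_000)

J = X**2 + X * Y    # the raw estimator, E[J] = 1
J_hat = X**2        # its conditional expectation E[J | X]

print(J.mean(), J_hat.mean())   # both approximately 1
print(J.var(), J_hat.var())     # Var(J) ~ 3, Var(J_hat) ~ 2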

Consider the mean-field approximation:

$$q(\mathbf{z} \mid \lambda) = \prod_{i=1}^{m} q_i(z_i \mid \lambda_i)$$

where $\lambda_i$ denotes the parameter(s) of the variational posterior of $z_i$. Then the (Rao-Blackwellized) MC estimator for the gradient wrt. $\lambda_i$ is simply

$$\hat{\nabla}_{\lambda_i} \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q_i(z_s \mid \lambda_i)\, \big(\log p_i(\mathbf{x}, z_s) - \log q_i(z_s \mid \lambda_i)\big), \quad z_s \sim q_\lambda$$

<2019-06-03 Mon> But what the heck is $p_i$? Surely it's just the full joint $p(\mathbf{x}, \mathbf{z})$, right?

<2019-06-03 Mon> So you missed the bit where $p_i$ denotes the pdf of the variables in the model that depend on the i-th variable, i.e. the Markov blanket of $z_i$. Then $p_i(\mathbf{x}, \mathbf{z})$ is the part of the joint probability that depends on those variables.

Important: model here refers to the variational distribution $q$! This means that in the case of a mean-field approximation, where the Markov blanket (under $q$) is an empty set, we simply sample $\mathbf{z}$ jointly from $q$.

Control variates
• The idea is to replace the function $f$ being approximated by MC with another function $\hat{f}$ which has the same expectation but lower variance, i.e. choose $\hat{f}$ s.t.

$$\mathbb{E}_q\big[\hat{f}(z)\big] = \mathbb{E}_q\big[f(z)\big]$$

and

$$\mathrm{Var}_q\big(\hat{f}(z)\big) < \mathrm{Var}_q\big(f(z)\big)$$

• One particular example is, for some function $h$ with known expectation,

$$\hat{f}(z) = f(z) - a\,\big(h(z) - \mathbb{E}[h(z)]\big)$$

• We can then choose $a$ to minimize $\mathrm{Var}(\hat{f})$
• The variance of $\hat{f}$ is then

$$\mathrm{Var}(\hat{f}) = \mathrm{Var}(f) + a^2\, \mathrm{Var}(h) - 2a\, \mathrm{Cov}(f, h)$$

• Therefore, good control variates have high covariance with $f$
• Taking the derivative wrt. $a$ and setting it equal to zero, we get the optimal choice of $a$, denoted $a^*$:

$$a^* = \frac{\mathrm{Cov}(f, h)}{\mathrm{Var}(h)}$$
Maybe you recognize this form from somewhere? OH YEAH YOU DO, it's the same expression we have for the slope of the ordinary least squares (OLS) estimator:

$$\hat{\beta} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}$$

which we also know is the minimum variance estimator in that particular case.

We already know $\mathbb{E}[\hat{f}] = \mathbb{E}[f]$, which in the linear case just means that the intercepts are the same. If we rearrange:

$$f(z) = \hat{f}(z) + a\,\big(h(z) - \mathbb{E}[h(z)]\big)$$

So it's like we're performing a linear regression in expectation space, regressing $f$ on some function $h$?

Fun stuff we could do is to let $h$ be a parametrized function $h_\phi$ and then minimize the variance wrt. $\phi$ also, right future-Tor?

• This case

In this particular case (BBVI), we can choose the score function itself,

$$h(\mathbf{z}) = \nabla_\lambda \log q_\lambda(\mathbf{z})$$

This always has expectation zero, which simply follows from

$$\mathbb{E}_{q_\lambda}\big[\nabla_\lambda \log q_\lambda(\mathbf{z})\big] = \int q_\lambda(\mathbf{z})\, \frac{\nabla_\lambda q_\lambda(\mathbf{z})}{q_\lambda(\mathbf{z})}\, \mathrm{d}\mathbf{z} = \nabla_\lambda \int q_\lambda(\mathbf{z})\, \mathrm{d}\mathbf{z} = 0$$

under sufficient restrictions allowing us to "move" the partial derivative outside of the integral (e.g. smoothness wrt. $\lambda$ is sufficient).

With the Rao-Blackwellized estimator obtained in the previous section, we then have

$$f_i(\mathbf{z}) = \nabla_{\lambda_i} \log q_i(z_i \mid \lambda_i)\, \big(\log p_i(\mathbf{x}, \mathbf{z}) - \log q_i(z_i \mid \lambda_i)\big), \qquad h_i(\mathbf{z}) = \nabla_{\lambda_i} \log q_i(z_i \mid \lambda_i)$$

The estimate of the optimal choice of scaling $a_i$ is then

$$\hat{a}_i^* = \frac{\sum_{d=1}^{D_i} \widehat{\mathrm{Cov}}\big(f_i^d, h_i^d\big)}{\sum_{d=1}^{D_i} \widehat{\mathrm{Var}}\big(h_i^d\big)}$$

where

• $\widehat{\mathrm{Cov}}$ and $\widehat{\mathrm{Var}}$ denote the empirical estimators (computed from the $S$ samples)
• $f_i^d$ denotes the d-th component of $f_i$, i.e. $\lambda_i$ (and hence $f_i$, $h_i$) can be multi-dimensional

Therefore, we end up with the MC gradient estimator (with lower variance than the previous one):

$$\hat{\nabla}_{\lambda_i} \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q_i(z_s \mid \lambda_i)\, \big(\log p_i(\mathbf{x}, z_s) - \log q_i(z_s \mid \lambda_i) - \hat{a}_i^*\big), \quad z_s \sim q_\lambda$$
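A small numerical check of the control-variate recipe on a toy problem (estimating $\mathbb{E}[e^Z]$ for $Z \sim \mathcal{N}(0, 1)$, using $h(z) = z$, which has known mean 0, as the control variate):

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)

f = np.exp(z)    # E[f] = exp(1/2)
h = z            # E[h] = 0, so f - a * h has the same expectation as f

a_star = np.cov(f, h)[0, 1] / np.var(h)   # a* = Cov(f, h) / Var(h)
f_cv = f - a_star * h

print(np.exp(0.5))                # exact expectation
print(f.mean(), f.var())          # plain MC estimator
print(f_cv.mean(), f_cv.var())    # same mean, noticeably smaller variance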

Algorithm

• Initialize $\lambda$ randomly
• While not converged:
  • Draw $S$ samples $z_s \sim q_\lambda$
  • For each $i$, compute the noisy gradient $\hat{\nabla}_{\lambda_i} \mathcal{L}$ above (Rao-Blackwellized, with control variates)
  • Update $\lambda \leftarrow \lambda + \rho_t\, \hat{\nabla}_\lambda \mathcal{L}$

In ranganath13_black_box_variat_infer the update is written as $\lambda_{t+1} = \lambda_t + \rho_t\, \hat{\nabla}_\lambda \mathcal{L}$, where $\rho_t$ is the t-th value of a Robbins-Monro sequence.

Importance-weighted Autoencoder approach

Notation

• Unnormalized importance weights

$$w_i = \frac{p(\mathbf{x}, z_i)}{q(z_i \mid \mathbf{x})}$$

• Normalized importance weights

$$\tilde{w}_i = \frac{w_i}{\sum_{j=1}^{k} w_j}$$

Stuff

• burda15_impor_weigh_autoen
• Proposes new objective

$$\mathcal{L}_k(\mathbf{x}) = \mathbb{E}_{z_1, \dots, z_k \sim q(\mathbf{z} \mid \mathbf{x})}\left[\log \frac{1}{k} \sum_{i=1}^{k} w_i\right]$$

• Lower bound on the marginal log-likelihood, which follows from Jensen's inequality and the fact that the average importance weight is an unbiased estimator of $p(\mathbf{x})$:

$$\mathcal{L}_k(\mathbf{x}) = \mathbb{E}\left[\log \frac{1}{k} \sum_{i=1}^{k} w_i\right] \le \log \mathbb{E}\left[\frac{1}{k} \sum_{i=1}^{k} w_i\right] = \log p(\mathbf{x})$$

where the expectations are wrt. $\epsilon_1, \dots, \epsilon_k \sim p(\epsilon)$ and

$$w_i = \frac{p\big(\mathbf{x}, g_\phi(\epsilon_i, \mathbf{x})\big)}{q\big(g_\phi(\epsilon_i, \mathbf{x}) \mid \mathbf{x}\big)}$$

where we are using the reparametrization trick with $g_\phi$ being the reparametrization function

• This then has to be empirically approximated, e.g. sampling $\epsilon_1, \dots, \epsilon_k \sim p(\epsilon)$ and computing the empirical estimate
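A minimal sketch of the $k$-sample bound for the same toy conjugate Gaussian model as in the earlier sketches (prior $\mathcal{N}(0, 1)$, likelihood $\mathcal{N}(z, 1)$, and a deliberately mismatched $q$), showing that $k = 1$ recovers the ordinary ELBO and that the bound tightens as $k$ grows:

import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3   # single observation

def iwae_bound(k, q_mu=0.0, q_sigma=1.0, n_outer=20_000):
    # z_1, ..., z_k ~ q for each of n_outer outer Monte Carlo samples
    z = rng.normal(q_mu, q_sigma, size=(n_outer, k))
    log_w = (norm.logpdf(z, 0.0, 1.0)          # log p(z)
             + norm.logpdf(x, z, 1.0)          # log p(x | z)
             - norm.logpdf(z, q_mu, q_sigma))  # - log q(z)
    # E[log (1/k) sum_i w_i], computed stably via logsumexp
    return np.mean(logsumexp(log_w, axis=1) - np.log(k))

print(norm.logpdf(x, 0.0, np.sqrt(2.0)))   # exact log p(x)
print(iwae_bound(1))                       # k = 1 is the ordinary ELBO
print(iwae_bound(5), iwae_bound(50))       # tightens towards log p(x) as k grows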

How to implement

• Could just implement an alternative objective function and take the gradient?

Appendix A: Definitions

variational density
our approximate probability distribution, $q(\mathbf{z})$
variational parameter
a parameter required to compute our variational density, i.e. the parameters which define the approximating distribution over the latent variables. So sort of like "latent-latent variables", or as I like to call them, doubly latent variables (•_•) / ( •_•)>⌐■-■ / (⌐■_■). Disclaimer: I've never called them that before in my life.