# Variational Inference

## Table of Contents

## Overview

In **variational inference**, the posterior distribution over a set of *unobserved*
variables given some data
is approximated by a variational distribution, :

The distribution is restricted to belong to a family of
distributions of *simpler* form than ,
selected with the intention of making similar to the true
posterior, .

We're basically making live simpler for ourselves by casting the approximate
conditional inference as an *optimization problem*.

## The evidence lower bound (ELBO)

- Specify a family of densities over the latent variables
- Each is a candidate to the exact conditional distribution
**Goal**is then to find the best candidate, i.e. the one closest in KL divergence to the exact conditionanl distribution

- is the best approx. to the conditional distribution, within that family
However, the equation above requires us to compute the log

*evidence*which may be intractable, OR DOES IT?!2- Buuut, since we want to minimize wrt. , we can simply minizime the above,
*without*having to worry about ! - By dropping constant term (wrt. ) and moving the sign outside, we get

- Thus,
**maximizing**the is equivalent to minimizing the divergence

Why use the above representation of the `ELBO`

as our objective function?
Because can be rewritten as ,
thus we simply need to come up with:

- Model for the likelihood
*given*some latent variables - Some posterior probability for the latent variables

- We can rewrite the to give us a some intuition about the optimal variational density

- Basically says, "the evidence lower bound of is the maximization of the likelihood and minimization of
the divergence from the prior distribution, combined"
- encourages densities that increases the likelihood of the data
- encourages densities close to the prior

- Another property (and the reason for the name), is that it puts a lower bound on the (log) evidence,

- Which means that if we increase the => must decrease

## Examples of families

### Mean-field variational family

- Assumes the latent variables to be mutually independent, i.e.

**Does not model the observed data**, does not appear in the equation => it's the which connects the fitted variational density to the data

## Optimization

### Coordinate ascent mean-field variational inference (CAVI)

#### Notation

- is the data
- is the latent variables of our model for
- is the
*variational density* - is a
*variational factor* - , i.e. the
*expectation over all factors of keeping the factor constant*

#### Overview

Based on coordinate ascent, thus no gradient information required.

Iteratively optimizes each factor of the mean-field variational density,
while holding others fixed. Thus, climbing the `ELBO`

to a local optimum.

#### Algorithm

**Input:**Model , data set**Output:**Variational density

Then the actual algorithm goes as follows:

##### Psuedo implementation in Python

Which I believe could look something like this when written in Python:

import sys import numpy as np class UpdateableModel: def __init__(self): self.latent_var = np.random.rand(0, 1) def __call__(self, z): # compute q(z) or equivalently q(z_j | z_{l != j}) # since mean-field approx. allows us to assume independence pass @staticmethod def proba_product(qs, z): # compute q_1(z_1) * q_2(z_2) * ... * q_m(z_m) res = 1 for j in range(len(qs)): q_j = qs[j] z_j = z[j] res *= q_j(z_j) return res def compute_elbo(q, p_joint, z, x, all_possible_vals=None): m = len(z) joint_expect = 0 latent_expect = 0 for j in range(m): q_j = q[j] z_j = z[j] joint_expect += q_j(z_j) * np.log(p_joint(z=z, x=x)) latent_expect += q_j(z_j) * np.log(q_j(z_j)) return joint_expect + latent_expect def cavi(model, z, x, epsilon=5): """ Parameters ---------- model : callable Computes our model probability $p(z, x)$ or $p(z_j | z_{l != j}, x)$ z : array-like Initialized values for latent variables, e.g. for a GMM we would have mu = z[0], sigma = z[1]. x : array-like Represents the data. """ m = len(z) q = [UpdateableModel() for j in range(m)] elbo = sys.maxint while elbo > epsilon: for j in range(m): # expectation for all latent variables expect_log = 0 for j2 in range(m): if j2 == j: continue expect_log += q[j2](z[j2]) * np.log(model(fixed=j, z=z, x=x)) q[j].latent_var = np.exp(expect_log) elbo = compute_elbo(q, model, z=z, x=x) return q

Mean-field approx. → assume latent variables are independent of each other → can simply be represented as a vector with the entry being a independent model correspondig to

#### Derivation of update

We simply rewrite the `ELBO`

as a function of , since the independence between
the variational factors implies that we can maximize the `ELBO`

wrt. to each
of those separately.

Were we have written the expectation wrt. using iterated expectation.

## Gradient

### ELBO

#### Used in BBVI (REINFORCE)

- Can obtain unbiased gradient estimator by sampling
*from the variational distribution*without having to compute ELBO analytically - Only requires computation of score function of variational posterior:
Given by

- Unbiased estimator obtained by sampling from

#### Reparametrization trick

- needs to be reparametrizable
E.g. and

where and are parametrized means and std. dev

Then we instead do the following:

where is some determinstic function and is the "noise" distribution

- Observe that this then also takes advantage of the structure of the joint distribution
- But also requires the join-distribution to be differentiable wrt.

- Often the entropy of can be computed analytically which reduces the
*variance*of the gradient estimate since we only have to estimate the expectation of the first termRecall that

- E.g. in ADVI where we are using Gaussian Mean-Field approx. which means that the entropy terms reduces to where

- Assumptions
- must be differentiable
- must be differentiable

- Notes
- Usually lower variance than REINFORCE, and can potentially be reduced further if analytical entropy is available
- Not atually guaranteed zhang17_advan_variat_infer

- Being reparametrizable (and this reparametrization being differentiable) limits the family of variational posteriors
- Work towards using Gumbel-Max trick and replacing argmax operation with softmax operator for categorical distributions jang16_categ_repar_with_gumbel_softm

- Usually lower variance than REINFORCE, and can potentially be reduced further if analytical entropy is available

#### Differentiable lower-bounds of ELBO

- Aims to instead optimize lower-bounds of ELBO using more flexible variational posteriors
- E.g. SIVI yin18_semi_implic_variat_infer

## Methods

**Automatic Differentiation Variational Inference (ADVI)**Objective:

Gradient estimate:

- Variational posterior:
- Gaussian Mean-field approx.

- Assumptions:
- is transformable in each component
- and are
*independent*

- Notes:
- If applicable, it
*just works* - Easy to implement
- Assumes independence
- Restrictive in choice of variational posterior (i.e. Gaussian mean-field)

- If applicable, it

**Black-box Variational Inference (BBVI)**Objective

Gradient estimate

Using variance reduction techniques (Rao-Blackwellization and control variate), for each ,

where

- Variational posterior
- Assumptions
- Notes
- Does not require differentiability of joint-density
- Gradient estimator usually has high variance, even with variance-reduction techniques

### Automatic Differentiation Variational Inference (ADVI)

#### Example implementation

using ForwardDiff using Flux.Tracker using Flux.Optimise """ ADVI(samplers_per_step = 10, max_iters = 5000) Automatic Differentiation Variational Inference (ADVI) for a given model. """ struct ADVI{AD} <: VariationalInference{AD} samples_per_step # number of samples used to estimate the ELBO in each optimization step max_iters # maximum number of gradient steps used in optimization end ADVI(args...) = ADVI{ADBackend()}(args...) ADVI() = ADVI(10, 5000) alg_str(::ADVI) = "ADVI" vi(model::Model, alg::ADVI; optimizer = ADAGrad()) = begin # setup var_info = VarInfo() model(var_info, SampleFromUniform()) num_params = size(var_info.vals, 1) dists = var_info.dists ranges = var_info.ranges q = MeanField(zeros(num_params), zeros(num_params), dists, ranges) # construct objective elbo = ELBO() Turing.DEBUG && @debug "Optimizing ADVI..." θ = optimize(elbo, alg, q, model; optimizer = optimizer) μ, ω = θ[1:length(q)], θ[length(q) + 1:end] # TODO: make mutable instead? MeanField(μ, ω, dists, ranges) end # TODO: implement optimize like this? # (advi::ADVI)(elbo::EBLO, q::MeanField, model::Model) = begin # end function optimize(elbo::ELBO, alg::ADVI, q::MeanField, model::Model; optimizer = ADAGrad()) θ = randn(2 * length(q)) optimize!(elbo, alg, q, model, θ; optimizer = optimizer) return θ end function optimize!(elbo::ELBO, alg::ADVI{AD}, q::MeanField, model::Model, θ; optimizer = ADAGrad()) where AD alg_name = alg_str(alg) samples_per_step = alg.samples_per_step max_iters = alg.max_iters # number of previous gradients to use to compute `s` in adaGrad stepsize_num_prev = 10 # setup # var_info = Turing.VarInfo() # model(var_info, Turing.SampleFromUniform()) # num_params = size(var_info.vals, 1) num_params = length(q) # # buffer # θ = zeros(2 * num_params) # HACK: re-use previous gradient `acc` if equal in value # Can cause issues if two entries have idenitical values if θ ∉ keys(optimizer.acc) vs = [v for v ∈ keys(optimizer.acc)] idx = findfirst(w -> vcat(q.μ, q.ω) == w, vs) if idx != nothing @info "[$alg_name] Re-using previous optimizer accumulator" θ .= vs[idx] end else @info "[$alg_name] Already present in optimizer acc" end diff_result = DiffResults.GradientResult(θ) # TODO: in (Blei et al, 2015) TRUNCATED ADAGrad is suggested; this is not available in Flux.Optimise # Maybe consider contributed a truncated ADAGrad to Flux.Optimise i = 0 prog = PROGRESS[] ? ProgressMeter.Progress(max_iters, 1, "[$alg_name] Optimizing...", 0) : 0 time_elapsed = @elapsed while (i < max_iters) # & converged # <= add criterion? A running mean maybe? # TODO: separate into a `grad(...)` call; need to manually provide `diff_result` buffers # ForwardDiff.gradient!(diff_result, f, x) grad!(elbo, alg,q, model, θ, diff_result, samples_per_step) # apply update rule Δ = DiffResults.gradient(diff_result) Δ = Optimise.apply!(optimizer, θ, Δ) @. θ = θ - Δ Turing.DEBUG && @debug "Step $i" Δ DiffResults.value(diff_result) norm(DiffResults.gradient(diff_result)) PROGRESS[] && (ProgressMeter.next!(prog)) i += 1 end @info time_elapsed return θ end function grad!(vo::ELBO, alg::ADVI{AD}, q::MeanField, model::Model, θ::AbstractVector{T}, out::DiffResults.MutableDiffResult, args...) where {T <: Real, AD <: ForwardDiffAD} # TODO: this probably slows down executation quite a bit; exists a better way of doing this? f(θ_) = - vo(alg, q, model, θ_, args...) chunk_size = getchunksize(alg) # Set chunk size and do ForwardMode. chunk = ForwardDiff.Chunk(min(length(θ), chunk_size)) config = ForwardDiff.GradientConfig(f, θ, chunk) ForwardDiff.gradient!(out, f, θ, config) end # TODO: implement for `Tracker` # function grad(vo::ELBO, alg::ADVI, q::MeanField, model::Model, f, autodiff::Val{:backward}) # vo_tracked, vo_pullback = Tracker.forward() # end function grad!(vo::ELBO, alg::ADVI{AD}, q::MeanField, model::Model, θ::AbstractVector{T}, out::DiffResults.MutableDiffResult, args...) where {T <: Real, AD <: TrackerAD} θ_tracked = Tracker.param(θ) y = - vo(alg, q, model, θ_tracked, args...) Tracker.back!(y, 1.0) DiffResults.value!(out, Tracker.data(y)) DiffResults.gradient!(out, Tracker.grad(θ_tracked)) end function (elbo::ELBO)(alg::ADVI, q::MeanField, model::Model, θ::AbstractVector{T}, num_samples) where T <: Real # setup var_info = Turing.VarInfo() # initialize `VarInfo` object model(var_info, Turing.SampleFromUniform()) num_params = length(q) μ, ω = θ[1:num_params], θ[num_params + 1: end] elbo_acc = 0.0 # TODO: instead use `rand(q, num_samples)` and iterate through? for i = 1:num_samples # iterate through priors, sample and update for i = 1:size(q.dists, 1) prior = q.dists[i] r = q.ranges[i] # mean-field params for this set of model params μ_i = μ[r] ω_i = ω[r] # obtain samples from mean-field posterior approximation η = randn(length(μ_i)) ζ = center_diag_gaussian_inv(η, μ_i, exp.(ω_i)) # inverse-transform back to domain of original priro θ = invlink(prior, ζ) # update var_info.vals[r] = θ # add the log-det-jacobian of inverse transform; # `logabsdet` returns `(log(abs(det(M))), sign(det(M)))` so return first entry elbo_acc += logabsdet(jac_inv_transform(prior, ζ))[1] / num_samples end # compute log density model(var_info) elbo_acc += var_info.logp / num_samples end # add the term for the entropy of the variational posterior variational_posterior_entropy = sum(ω) elbo_acc += variational_posterior_entropy elbo_acc end function (elbo::ELBO)(alg::ADVI, q::MeanField, model::Model, num_samples) # extract the mean-field Gaussian params θ = vcat(q.μ, q.ω) elbo(alg, q, model, θ, num_samples) end

### Black-box Variational Inference (BBVI)

#### Stuff

#### Controlling the variance

- Variance of gradient estimator under MC estimation of ELBO) can be too large to be useful

##### Rao-Blackwellization

- Reduces variance of rv. by replacing it with its conditione expectation wrt. a subset of variables

- How

Simple example:

- Two rvs and
- Function
**Goal:**compute expectationLetting

we note that

**Therefore:**can use as MC approx. of , with variancewhich means that is a

*lower*variance estimator than .

- In this case

Consider mean-field approximation:

where denotes the parameter(s) of variational posterior of . Then the MC estimator for gradient wrt. is simply

*depend on the i-th variable*, i.e. the Markov blanket of . Then is the joint probability that depends on those variables.Important: model here refers to the variational distribution ! This means that in the case of a Mean-field approximation where the Markov blanket is an empty set, we simply sample jointly.

##### Control variates

The idea is to replace the function being approximated by MC with another function which has

*the same expectation but lower variance*, i.e. choose s.t.and

One particular example is, for some function ,

- We can then choose to minimize
Variance of is then

- Therefore,
*good control variates have high covariance with* Taking derivative wrt. and setting derivative equal to zero we get the optimal choice of , denoted :

Maybe you recognize this form from somewhere? OH YEAH YOU DO, it's the same expression we have for for the slope the ordinary least squares (OLS) estimator:

which we also know is the minimum variance estimator in that particular case.

We already know , which in the linear case just means that the intercept are the same. If we rearrange:

So it's like we're performing a linear regression in the expectation space, given some function ?

Fun stuff we could do is to let be a parametrized function and then minimize the variance wrt. also, right future-Tor?

- This case

In this particular case, we can choose

This always has expectation zero, which simply follows from

under sufficient restrictions allowing us to "move" the partial outside of the integral (e.g. smoothness wrt. is sufficient).

With the Rao-Blackwellized estimator obtained in previous section, we then have

The estimate of optimal choice of scaling , is then

where

- and denote the empirical estimators
- denotes the d-th component of , i.e. it can be multi-dimensional

Therefore, we end up with the MC gradient estimator (with lower variance than the previous one):

#### Algorithm

In the ranganath13_black_box_variat_infer as where is the t-th value of the Robbins-Monro sequence

### Importance-weighted Autoencoder approach

#### Notation

*Unnormalized*importance weights*Normalized*importance weights

#### Stuff

- burda15_impor_weigh_autoen
Proposes new objective

Lower bound on the marginal log-likelihood from the Jensen's Inequality and average importance weights is are unbiased estimator:

where expectations are wrt.

Gradient

where we are using the reparamtrization trick with being the reparametrization function

- This then has to be empirically approximated, e.g. sampling from and computing the empirical estimate

#### How to implement

- Could just implement an alternative objective function and take the gradient?

## Appendix A: Definitions

- variational density
- our approximate probability distribution
- variational parameter
- a parameter required to compute our
*variational density*, i.e. parameters which define the approx. distribution to the latent variables. So sort of like "latent-latent variables", or as I like to call them,*doubly latent variables*(•_•) / ( •_•)>⌐■-■ / (⌐■_■),**Disclaimer:**I've never called them that before in my life..

## Bibliography

# Bibliography

- [zhang17_advan_variat_infer] Zhang, Butepage, Kjellstrom, Hedvig & Mandt, Advances in Variational Inference,
*CoRR*, (2017). link. - [jang16_categ_repar_with_gumbel_softm] Jang, Gu & Poole, Categorical Reparameterization With Gumbel-Softmax,
*CoRR*, (2016). link. - [yin18_semi_implic_variat_infer] Yin & Zhou, Semi-Implicit Variational Inference,
*CoRR*, (2018). link. - [kucukelbir16_autom_differ_variat_infer] Kucukelbir, Tran, Ranganath, Rajesh, Gelman & Blei, Automatic Differentiation Variational Inference,
*CoRR*, (2016). link. - [kucukelbir15_autom_variat_infer_stan] Kucukelbir, Ranganath, Gelman, Andrew & Blei, Automatic Variational Inference in Stan,
*CoRR*, (2015). link. - [ranganath13_black_box_variat_infer] Ranganath, Gerrish, Blei & , Black Box Variational Inference,
*CoRR*, (2013). link. - [burda15_impor_weigh_autoen] Burda, Grosse, Salakhutdinov & Ruslan, Importance Weighted Autoencoders,
*CoRR*, (2015). link.