Variational Inference
Overview
In variational inference, the posterior distribution over a set of unobserved variables $Z = z_{1:m}$ given some data $X$ is approximated by a variational distribution $q(z)$:

$$p(z \mid x) \approx q(z)$$

The distribution $q(z)$ is restricted to belong to a family of distributions of simpler form than $p(z \mid x)$, selected with the intention of making $q(z)$ similar to the true posterior $p(z \mid x)$.
We're basically making life simpler for ourselves by casting approximate conditional inference as an optimization problem.
The evidence lower bound (ELBO)
- Specify a family of densities $\mathcal{Q}$ over the latent variables
- Each $q(z) \in \mathcal{Q}$ is a candidate approximation to the exact conditional distribution
- Goal is then to find the best candidate, i.e. the one closest in KL divergence to the exact conditional distribution:
  $$q^*(z) = \arg\min_{q(z) \in \mathcal{Q}} \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)$$
- $q^*(z)$ is the best approx. to the conditional distribution, within that family

However, expanding the KL divergence,

$$\mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big) = \mathbb{E}_q[\log q(z)] - \mathbb{E}_q[\log p(z, x)] + \log p(x)$$

the equation above requires us to compute the log evidence $\log p(x)$, which may be intractable, OR DOES IT?! Buuut, since we want to minimize wrt. $q(z)$, and $\log p(x)$ is constant wrt. $q(z)$, we can simply minimize the first two terms above, without having to worry about $\log p(x)$!

- By dropping the constant term (wrt. $q$) and moving the sign outside, we get
  $$\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(z, x)] - \mathbb{E}_q[\log q(z)]$$
- Thus, maximizing the $\mathrm{ELBO}$ is equivalent to minimizing the KL divergence
Why use the above representation of the ELBO as our objective function?

Because $p(z, x)$ can be rewritten as $p(x \mid z) \, p(z)$, thus we simply need to come up with:

- A model for the likelihood $p(x \mid z)$ of the data given some latent variables
- Some prior probability $p(z)$ for the latent variables

- We can rewrite the $\mathrm{ELBO}$ to give us some intuition about the optimal variational density:
  $$\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(x \mid z)] - \mathrm{KL}\big(q(z) \,\|\, p(z)\big)$$
- Basically says, "the evidence lower bound of $x$ is the maximization of the likelihood and minimization of the divergence from the prior distribution, combined"
- $\mathbb{E}_q[\log p(x \mid z)]$ encourages densities that increase the likelihood of the data
- $-\mathrm{KL}(q(z) \,\|\, p(z))$ encourages densities close to the prior
- Another property (and the reason for the name) is that it puts a lower bound on the (log) evidence:
  $$\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big) \geq \mathrm{ELBO}(q)$$
- Which means that if we increase the $\mathrm{ELBO}$ => the KL divergence to the true posterior must decrease (see the numeric check below)
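To see the lower-bound identity in action, here's a quick numeric check on a toy conjugate model of my own choosing (not from the text): prior $z \sim \mathcal{N}(0, 1)$ and likelihood $x \mid z \sim \mathcal{N}(z, 1)$, so both the exact posterior and the evidence are available in closed form.

```python
import numpy as np
from scipy import stats

x = 1.3
post = stats.norm(x / 2, np.sqrt(0.5))        # exact posterior p(z | x)
q = stats.norm(0.3, 0.9)                      # some candidate q(z)

zs = q.rvs(size=100_000, random_state=0)      # z ~ q
log_joint = stats.norm(0, 1).logpdf(zs) + stats.norm(zs, 1).logpdf(x)
elbo = np.mean(log_joint - q.logpdf(zs))      # E_q[log p(z, x) - log q(z)]
kl = np.mean(q.logpdf(zs) - post.logpdf(zs))  # MC estimate of KL(q || p(z | x))

log_evidence = stats.norm(0, np.sqrt(2)).logpdf(x)  # marginally, x ~ N(0, 2)
print(elbo + kl, log_evidence)                # ≈ equal, up to MC error
```

Any choice of $q$ shifts mass between the two terms, but their sum always equals $\log p(x)$.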
Examples of families
Mean-field variational family
- Assumes the latent variables to be mutually independent, i.e.
  $$q(z) = \prod_{j=1}^{m} q_j(z_j)$$
  (sketched below)
- Does not model the observed data: $x$ does not appear in the equation => it's the $\mathrm{ELBO}$ which connects the fitted variational density to the data
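As a tiny sketch (my own illustration, arbitrary choice of factors), the mean-field density is just a product of independent per-variable factors:

```python
import numpy as np
from scipy import stats

# three independent factors q_1, q_2, q_3 (arbitrary choices)
factors = [stats.norm(0.0, 1.0), stats.gamma(2.0), stats.beta(2.0, 3.0)]

def log_q(z):
    # log q(z) = sum_j log q_j(z_j) under the mean-field assumption
    return sum(q_j.logpdf(z_j) for q_j, z_j in zip(factors, z))

print(log_q([0.1, 1.5, 0.4]))
```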
Optimization
Coordinate ascent mean-field variational inference (CAVI)
Notation
- $x = x_{1:n}$ is the data
- $z = z_{1:m}$ are the latent variables of our model for $x$
- $q(z)$ is the variational density
- $q_j(z_j)$ is a variational factor
- $\mathbb{E}_{-j}[\cdot] := \mathbb{E}_{\prod_{l \neq j} q_l(z_l)}[\cdot]$, i.e. the expectation over all factors of $q$ keeping the $j$-th factor fixed
Overview
Based on coordinate ascent, thus no gradient information required.
Iteratively optimizes each factor of the mean-field variational density,
while holding others fixed. Thus, climbing the ELBO
to a local optimum.
Algorithm
- Input: Model $p(z, x)$, data set $x$
- Output: Variational density $q(z) = \prod_{j=1}^{m} q_j(z_j)$

Then the actual algorithm goes as follows:

1. Initialize the variational factors $q_j(z_j)$
2. While the $\mathrm{ELBO}$ has not converged:
   1. For each $j \in \{1, \dots, m\}$, set
      $$q_j(z_j) \propto \exp\big( \mathbb{E}_{-j}[ \log p(z_j \mid z_{-j}, x) ] \big)$$
   2. Compute $\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(z, x)] - \mathbb{E}_q[\log q(z)]$
3. Return $q(z)$
Pseudo implementation in Python
Which I believe could look something like this when written in Python:
```python
import numpy as np


class UpdateableModel:
    def __init__(self):
        self.latent_var = np.random.rand()

    def __call__(self, z):
        # compute q(z), or equivalently q(z_j | z_{l != j}),
        # since the mean-field approx. allows us to assume independence
        pass

    @staticmethod
    def proba_product(qs, z):
        # compute q_1(z_1) * q_2(z_2) * ... * q_m(z_m)
        res = 1
        for j in range(len(qs)):
            q_j = qs[j]
            z_j = z[j]
            res *= q_j(z_j)
        return res


def compute_elbo(q, p_joint, z, x):
    # ELBO(q) = E_q[log p(z, x)] - E_q[log q(z)]
    m = len(z)
    joint_expect = 0
    latent_expect = 0
    for j in range(m):
        q_j = q[j]
        z_j = z[j]
        joint_expect += q_j(z_j) * np.log(p_joint(z=z, x=x))
        latent_expect += q_j(z_j) * np.log(q_j(z_j))
    return joint_expect - latent_expect


def cavi(model, z, x, epsilon=1e-3):
    """
    Parameters
    ----------
    model : callable
        Computes our model probability $p(z, x)$ or $p(z_j | z_{l != j}, x)$.
    z : array-like
        Initialized values for latent variables, e.g. for a GMM we would have
        mu = z[0], sigma = z[1].
    x : array-like
        Represents the data.
    epsilon : float
        Convergence threshold on the change in the ELBO.
    """
    m = len(z)
    q = [UpdateableModel() for _ in range(m)]

    elbo_prev = -np.inf
    elbo = compute_elbo(q, model, z=z, x=x)

    while np.abs(elbo - elbo_prev) > epsilon:
        for j in range(m):
            # expectation over all latent variables other than z_j
            expect_log = 0
            for l in range(m):
                if l == j:
                    continue
                expect_log += q[l](z[l]) * np.log(model(fixed=j, z=z, x=x))

            # CAVI update: q_j(z_j) ∝ exp(E_{-j}[log p(z_j | z_{-j}, x)])
            q[j].latent_var = np.exp(expect_log)

        elbo_prev = elbo
        elbo = compute_elbo(q, model, z=z, x=x)

    return q
```
Mean-field approx. → assume latent variables are independent of each other → $q$ can simply be represented as a vector with the $j$-th entry being an independent model corresponding to $q_j(z_j)$.
Derivation of update
We simply rewrite the ELBO as a function of $q_j$, since the independence between the variational factors implies that we can maximize the ELBO wrt. each of those separately:

$$\mathrm{ELBO}(q_j) = \mathbb{E}_j\Big[ \mathbb{E}_{-j}\big[ \log p(z_j, z_{-j}, x) \big] \Big] - \mathbb{E}_j[\log q_j(z_j)] + \mathrm{const}$$

where we have written the expectation wrt. $q(z)$ using iterated expectation. Up to an additive constant this is a negative KL divergence, so it is maximized by the coordinate update

$$q_j^*(z_j) \propto \exp\big( \mathbb{E}_{-j}[ \log p(z_j \mid z_{-j}, x) ] \big)$$
Gradient
ELBO
Used in BBVI (REINFORCE)
- Can obtain an unbiased gradient estimator by sampling from the variational distribution, without having to compute the ELBO analytically
- Only requires computation of the score function of the variational posterior: $\nabla_\lambda \log q(z; \lambda)$

Given by

$$\nabla_\lambda \mathrm{ELBO}(\lambda) = \mathbb{E}_{q(z; \lambda)}\Big[ \nabla_\lambda \log q(z; \lambda) \, \big( \log p(x, z) - \log q(z; \lambda) \big) \Big]$$

- Unbiased estimator obtained by sampling from $q(z; \lambda)$ (see the sketch below)
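For concreteness, a minimal sketch of this estimator for a single Gaussian factor with $\lambda = (\mu, \log \sigma)$; `log_joint` is a hypothetical stand-in for $\log p(x, z)$ that the user would supply:

```python
import numpy as np

def score_grad(log_joint, mu, log_sigma, num_samples=1000,
               rng=np.random.default_rng(0)):
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=num_samples)  # z ~ q(z; λ)
    log_q = -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((z - mu) / sigma) ** 2

    # score function ∇_λ log q(z; λ), for λ = (μ, log σ)
    score_mu = (z - mu) / sigma ** 2
    score_log_sigma = ((z - mu) / sigma) ** 2 - 1.0

    f = log_joint(z) - log_q                     # log p(x, z) - log q(z; λ)
    return np.mean(score_mu * f), np.mean(score_log_sigma * f)
```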
Reparametrization trick
- $q(z; \lambda)$ needs to be reparametrizable

E.g. $z \sim \mathcal{N}(\mu, \sigma^2)$ and $\lambda = (\mu, \sigma)$, where $\mu$ and $\sigma$ are parametrized mean and std. dev. Then we instead do the following:

$$z = g(\epsilon, \lambda) = \mu + \sigma \epsilon, \quad \epsilon \sim p(\epsilon) = \mathcal{N}(0, 1)$$

where $g$ is some deterministic function and $p(\epsilon)$ is the "noise" distribution, giving the gradient

$$\nabla_\lambda \mathrm{ELBO}(\lambda) = \mathbb{E}_{p(\epsilon)}\Big[ \nabla_\lambda \Big( \log p\big(x, g(\epsilon, \lambda)\big) - \log q\big(g(\epsilon, \lambda); \lambda\big) \Big) \Big]$$

- Observe that this then also takes advantage of the structure of the joint distribution
- But also requires the joint distribution to be differentiable wrt. $z$
- Often the entropy of $q(z; \lambda)$ can be computed analytically, which reduces the variance of the gradient estimate since we only have to estimate the expectation of the first term

Recall that

$$\mathrm{ELBO}(\lambda) = \mathbb{E}_q[\log p(x, z)] + \mathbb{H}\big[q(z; \lambda)\big]$$

- E.g. in ADVI, where we are using a Gaussian mean-field approx., the entropy term reduces (up to an additive constant) to $\sum_k \omega_k$, where $\sigma_k = e^{\omega_k}$
- Assumptions
  - $p(x, z)$ must be differentiable (wrt. $z$)
  - $g(\epsilon, \lambda)$ must be differentiable (wrt. $\lambda$)
- Notes
- Usually lower variance than REINFORCE, and can potentially be reduced further if analytical entropy is available
- Not actually guaranteed [zhang17_advan_variat_infer]
- Being reparametrizable (and this reparametrization being differentiable) limits the family of variational posteriors
- Work towards using the Gumbel-Max trick and replacing the argmax operation with a softmax operator for categorical distributions [jang16_categ_repar_with_gumbel_softm]
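And the reparametrized counterpart of the earlier score-function sketch, for the same Gaussian family, using the analytical entropy $\mathbb{H}[q] = \log \sigma + \mathrm{const}$; here `grad_log_joint`, the gradient of $\log p(x, z)$ wrt. $z$, is a hypothetical stand-in:

```python
import numpy as np

def reparam_grad(grad_log_joint, mu, log_sigma, num_samples=100,
                 rng=np.random.default_rng(0)):
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=num_samples)  # ε ~ p(ε) = N(0, 1)
    z = mu + sigma * eps                # z = g(ε, λ) = μ + σε

    g = grad_log_joint(z)               # ∇_z log p(x, z), evaluated at g(ε, λ)
    grad_mu = np.mean(g)                # chain rule: ∂z/∂μ = 1
    # ∂z/∂(log σ) = σε; the +1 is the exact gradient of the analytic entropy
    grad_log_sigma = np.mean(g * eps * sigma) + 1.0
    return grad_mu, grad_log_sigma
```

Note how, compared with REINFORCE, the joint enters through its gradient wrt. $z$, which is why differentiability of $p(x, z)$ is required here.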
Differentiable lower-bounds of ELBO
- Aims to instead optimize lower-bounds of ELBO using more flexible variational posteriors
- E.g. SIVI yin18_semi_implic_variat_infer
Methods
- Automatic Differentiation Variational Inference (ADVI)
Objective:

$$\mathcal{L}(\mu, \omega) = \mathbb{E}_{\mathcal{N}(\eta; 0, I)}\Big[ \log p\big(x, T^{-1}(\zeta)\big) + \log \big| \det J_{T^{-1}}(\zeta) \big| \Big] + \sum_k \omega_k, \quad \zeta = \mu + e^{\omega} \odot \eta$$

(the entropy term $\sum_k \omega_k$ holds up to an additive constant)

Gradient estimate: obtained by automatic differentiation through a Monte Carlo estimate of the objective, using the reparametrization $\zeta = \mu + e^{\omega} \odot \eta$ with $\eta \sim \mathcal{N}(0, I)$

- Variational posterior:
  - Gaussian mean-field approx. $q(\zeta; \mu, \omega) = \prod_k \mathcal{N}(\zeta_k; \mu_k, e^{2\omega_k})$ in the transformed (unconstrained) space
- Assumptions:
  - $p(x, z)$ is differentiable wrt. $z$, and the support of $z$ is transformable in each component (i.e. there is a one-to-one map $T$ taking it to all of $\mathbb{R}$)
  - The components of $\zeta = T(z)$ are independent under $q$ (mean-field)
- Notes:
- If applicable, it just works
- Easy to implement
- Assumes independence
- Restrictive in choice of variational posterior (i.e. Gaussian mean-field)
- Black-box Variational Inference (BBVI)
Objective

$$\mathcal{L}(\lambda) = \mathrm{ELBO}(\lambda) = \mathbb{E}_{q(z; \lambda)}\big[ \log p(x, z) - \log q(z; \lambda) \big]$$

Gradient estimate

$$\hat{\nabla}_\lambda \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \nabla_\lambda \log q(z^{(s)}; \lambda) \big( \log p(x, z^{(s)}) - \log q(z^{(s)}; \lambda) \big), \quad z^{(s)} \sim q(z; \lambda)$$

Using variance reduction techniques (Rao-Blackwellization and control variates), for each $\lambda_i$,

$$\hat{\nabla}_{\lambda_i} \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q_i(z_i^{(s)}; \lambda_i) \big( \log p_i(x, z_{(i)}^{(s)}) - \log q_i(z_i^{(s)}; \lambda_i) - \hat{a}_i^* \big)$$

where $p_i(x, z_{(i)})$ denotes the terms of the joint that depend on $z_i$'s Markov blanket $z_{(i)}$, and $\hat{a}_i^*$ is the estimated optimal control-variate scaling (see the "Controlling the variance" section below)

- Variational posterior
  - Mean-field $q(z; \lambda) = \prod_i q_i(z_i; \lambda_i)$, from any family we can sample from and whose score function we can compute
- Assumptions
  - We can sample from $q(z; \lambda)$, and evaluate $\nabla_\lambda \log q(z; \lambda)$ and $\log p(x, z)$
- Notes
- Does not require differentiability of the joint density
- Gradient estimator usually has high variance, even with variance-reduction techniques
Automatic Differentiation Variational Inference (ADVI)
Example implementation
```julia
using ForwardDiff
using Flux.Tracker
using Flux.Optimise

"""
    ADVI(samples_per_step = 10, max_iters = 5000)

Automatic Differentiation Variational Inference (ADVI) for a given model.
"""
struct ADVI{AD} <: VariationalInference{AD}
    samples_per_step # number of samples used to estimate the ELBO in each optimization step
    max_iters        # maximum number of gradient steps used in optimization
end

ADVI(args...) = ADVI{ADBackend()}(args...)
ADVI() = ADVI(10, 5000)

alg_str(::ADVI) = "ADVI"

vi(model::Model, alg::ADVI; optimizer = ADAGrad()) = begin
    # setup
    var_info = VarInfo()
    model(var_info, SampleFromUniform())
    num_params = size(var_info.vals, 1)

    dists = var_info.dists
    ranges = var_info.ranges

    q = MeanField(zeros(num_params), zeros(num_params), dists, ranges)

    # construct objective
    elbo = ELBO()

    Turing.DEBUG && @debug "Optimizing ADVI..."
    θ = optimize(elbo, alg, q, model; optimizer = optimizer)
    μ, ω = θ[1:length(q)], θ[length(q) + 1:end]

    # TODO: make mutable instead?
    MeanField(μ, ω, dists, ranges)
end

# TODO: implement optimize like this?
# (advi::ADVI)(elbo::EBLO, q::MeanField, model::Model) = begin
# end

function optimize(elbo::ELBO, alg::ADVI, q::MeanField, model::Model; optimizer = ADAGrad())
    θ = randn(2 * length(q))
    optimize!(elbo, alg, q, model, θ; optimizer = optimizer)
    return θ
end

function optimize!(elbo::ELBO, alg::ADVI{AD}, q::MeanField, model::Model, θ; optimizer = ADAGrad()) where AD
    alg_name = alg_str(alg)
    samples_per_step = alg.samples_per_step
    max_iters = alg.max_iters

    # number of previous gradients to use to compute `s` in adaGrad
    stepsize_num_prev = 10

    # setup
    # var_info = Turing.VarInfo()
    # model(var_info, Turing.SampleFromUniform())
    # num_params = size(var_info.vals, 1)
    num_params = length(q)

    # # buffer
    # θ = zeros(2 * num_params)

    # HACK: re-use previous gradient `acc` if equal in value
    # Can cause issues if two entries have identical values
    if θ ∉ keys(optimizer.acc)
        vs = [v for v ∈ keys(optimizer.acc)]
        idx = findfirst(w -> vcat(q.μ, q.ω) == w, vs)
        if idx != nothing
            @info "[$alg_name] Re-using previous optimizer accumulator"
            θ .= vs[idx]
        end
    else
        @info "[$alg_name] Already present in optimizer acc"
    end

    diff_result = DiffResults.GradientResult(θ)

    # TODO: in (Blei et al, 2015) TRUNCATED ADAGrad is suggested; this is not available in Flux.Optimise
    # Maybe consider contributing a truncated ADAGrad to Flux.Optimise

    i = 0
    prog = PROGRESS[] ? ProgressMeter.Progress(max_iters, 1, "[$alg_name] Optimizing...", 0) : 0

    time_elapsed = @elapsed while (i < max_iters) # & converged # <= add criterion? A running mean maybe?
        # TODO: separate into a `grad(...)` call; need to manually provide `diff_result` buffers
        # ForwardDiff.gradient!(diff_result, f, x)
        grad!(elbo, alg, q, model, θ, diff_result, samples_per_step)

        # apply update rule
        Δ = DiffResults.gradient(diff_result)
        Δ = Optimise.apply!(optimizer, θ, Δ)
        @. θ = θ - Δ

        Turing.DEBUG && @debug "Step $i" Δ DiffResults.value(diff_result) norm(DiffResults.gradient(diff_result))

        PROGRESS[] && (ProgressMeter.next!(prog))

        i += 1
    end

    @info time_elapsed

    return θ
end

function grad!(vo::ELBO, alg::ADVI{AD}, q::MeanField, model::Model, θ::AbstractVector{T}, out::DiffResults.MutableDiffResult, args...) where {T <: Real, AD <: ForwardDiffAD}
    # TODO: this probably slows down execution quite a bit; exists a better way of doing this?
    f(θ_) = - vo(alg, q, model, θ_, args...)

    chunk_size = getchunksize(alg)
    # Set chunk size and do ForwardMode.
    chunk = ForwardDiff.Chunk(min(length(θ), chunk_size))
    config = ForwardDiff.GradientConfig(f, θ, chunk)
    ForwardDiff.gradient!(out, f, θ, config)
end

# TODO: implement for `Tracker`
# function grad(vo::ELBO, alg::ADVI, q::MeanField, model::Model, f, autodiff::Val{:backward})
#     vo_tracked, vo_pullback = Tracker.forward()
# end

function grad!(vo::ELBO, alg::ADVI{AD}, q::MeanField, model::Model, θ::AbstractVector{T}, out::DiffResults.MutableDiffResult, args...) where {T <: Real, AD <: TrackerAD}
    θ_tracked = Tracker.param(θ)
    y = - vo(alg, q, model, θ_tracked, args...)
    Tracker.back!(y, 1.0)

    DiffResults.value!(out, Tracker.data(y))
    DiffResults.gradient!(out, Tracker.grad(θ_tracked))
end

function (elbo::ELBO)(alg::ADVI, q::MeanField, model::Model, θ::AbstractVector{T}, num_samples) where T <: Real
    # setup
    var_info = Turing.VarInfo()

    # initialize `VarInfo` object
    model(var_info, Turing.SampleFromUniform())

    num_params = length(q)
    μ, ω = θ[1:num_params], θ[num_params + 1:end]

    elbo_acc = 0.0

    # TODO: instead use `rand(q, num_samples)` and iterate through?
    for i = 1:num_samples
        # iterate through priors, sample and update
        for i = 1:size(q.dists, 1)
            prior = q.dists[i]
            r = q.ranges[i]

            # mean-field params for this set of model params
            μ_i = μ[r]
            ω_i = ω[r]

            # obtain samples from mean-field posterior approximation
            η = randn(length(μ_i))
            ζ = center_diag_gaussian_inv(η, μ_i, exp.(ω_i))

            # inverse-transform back to domain of original prior
            θ = invlink(prior, ζ)

            # update
            var_info.vals[r] = θ

            # add the log-det-jacobian of inverse transform;
            # `logabsdet` returns `(log(abs(det(M))), sign(det(M)))` so take the first entry
            elbo_acc += logabsdet(jac_inv_transform(prior, ζ))[1] / num_samples
        end

        # compute log density
        model(var_info)
        elbo_acc += var_info.logp / num_samples
    end

    # add the term for the entropy of the variational posterior
    variational_posterior_entropy = sum(ω)
    elbo_acc += variational_posterior_entropy

    elbo_acc
end

function (elbo::ELBO)(alg::ADVI, q::MeanField, model::Model, num_samples)
    # extract the mean-field Gaussian params
    θ = vcat(q.μ, q.ω)

    elbo(alg, q, model, θ, num_samples)
end
```
Black-box Variational Inference (BBVI)
Stuff
Controlling the variance
- Variance of the gradient estimator under MC estimation of the ELBO gradient can be too large to be useful
Rao-Blackwellization
- Reduces variance of a rv. by replacing it with its conditional expectation wrt. a subset of variables
- How
Simple example:
- Two rvs $X$ and $Y$
- Function $J(X, Y)$
- Goal: compute expectation $\mathbb{E}[J(X, Y)]$

Letting

$$\hat{J}(X) = \mathbb{E}[J(X, Y) \mid X]$$

we note that

$$\mathbb{E}[\hat{J}(X)] = \mathbb{E}[J(X, Y)]$$

Therefore: can use $\hat{J}(X)$ as a MC approx. of $\mathbb{E}[J(X, Y)]$, with variance

$$\mathrm{Var}\big(\hat{J}(X)\big) = \mathrm{Var}\big(J(X, Y)\big) - \mathbb{E}\Big[\big(J(X, Y) - \hat{J}(X)\big)^2\Big]$$

which means that $\hat{J}(X)$ is a lower-variance estimator than $J(X, Y)$ (see the numeric sketch at the end of this section).
- In this case

Consider the mean-field approximation:

$$q(z; \lambda) = \prod_i q_i(z_i; \lambda_i)$$

where $\lambda_i$ denotes the parameter(s) of the variational posterior of $z_i$. Then the (Rao-Blackwellized) MC estimator for the gradient wrt. $\lambda_i$ is simply

$$\hat{\nabla}_{\lambda_i} \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q_i(z_i^{(s)}; \lambda_i) \big( \log p_i(x, z_{(i)}^{(s)}) - \log q_i(z_i^{(s)}; \lambda_i) \big)$$

where $q_{(i)}$ denotes the pdf of the variables in the model that depend on the $i$-th variable, i.e. the Markov blanket of $z_i$, and $p_i(x, z_{(i)})$ is the part of the joint probability that depends on those variables. Important: "model" here refers to the variational distribution $q$! This means that in the case of a mean-field approximation, where the Markov blanket within $q$ is an empty set, we simply sample the $z_i$ jointly (each from its own independent factor).
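To make the generic example above concrete, here's a quick numeric sketch with a toy of my own choosing: $X \sim \mathcal{N}(0, 1)$, $Y \mid X \sim \mathcal{N}(X, 1)$, and $J(X, Y) = (X + Y)^2$, for which $\hat{J}(X) = \mathbb{E}[J \mid X] = 4X^2 + 1$ in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = rng.normal(x, 1.0)

j = (x + y) ** 2       # naive estimator J(X, Y)
j_hat = 4 * x**2 + 1   # Rao-Blackwellized estimator E[J | X]

print(j.mean(), j_hat.mean())  # same expectation (≈ 5)
print(j.var(), j_hat.var())    # Var(Ĵ) < Var(J)
```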
Control variates
The idea is to replace the function $f$ being approximated by MC with another function $\hat{f}$ which has the same expectation but lower variance, i.e. choose $\hat{f}$ s.t.

$$\mathbb{E}_q[\hat{f}(z)] = \mathbb{E}_q[f(z)] \quad \text{and} \quad \mathrm{Var}(\hat{f}) \leq \mathrm{Var}(f)$$

One particular example is, for some function $h$,

$$\hat{f}(z) = f(z) - a \big( h(z) - \mathbb{E}[h(z)] \big)$$

- We can then choose $a$ to minimize $\mathrm{Var}(\hat{f})$

Variance of $\hat{f}$ is then

$$\mathrm{Var}(\hat{f}) = \mathrm{Var}(f) + a^2 \, \mathrm{Var}(h) - 2a \, \mathrm{Cov}(f, h)$$

- Therefore, good control variates have high covariance with $f$

Taking the derivative wrt. $a$ and setting it equal to zero, we get the optimal choice of $a$, denoted $a^*$:

$$a^* = \frac{\mathrm{Cov}(f, h)}{\mathrm{Var}(h)}$$

Maybe you recognize this form from somewhere? OH YEAH YOU DO, it's the same expression we have for the slope of the ordinary least squares (OLS) estimator:

$$\hat{\beta} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$$

which we also know is the minimum-variance estimator in that particular case.

We already know $\mathbb{E}[\hat{f}] = \mathbb{E}[f]$, which in the linear case just means that the intercepts are the same. If we rearrange:

$$f(z) = \hat{f}(z) + a^* \big( h(z) - \mathbb{E}[h(z)] \big)$$

So it's like we're performing a linear regression in the expectation space, given some function $h$?
Fun stuff we could do is to let $h$ be a parametrized function and then minimize the variance wrt. its parameters also, right future-Tor?
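Before moving on, a quick numeric sanity check of the $a^*$ formula (my own toy example, not from the text): estimate $\mathbb{E}[Z^2]$ for $Z \sim \mathcal{N}(1, 1)$, with $h(z) = z$ as the control variate and its known mean $\mathbb{E}[h(Z)] = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(1.0, 1.0, size=100_000)
f, h = z**2, z

a = np.cov(f, h)[0, 1] / np.var(h)  # a* = Cov(f, h) / Var(h)
f_hat = f - a * (h - 1.0)           # same mean, lower variance

print(f.mean(), f_hat.mean())       # both ≈ E[Z²] = 2
print(f.var(), f_hat.var())         # Var(f̂) < Var(f)
```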
- This case

In this particular case, we can choose $h(z) = \nabla_\lambda \log q(z; \lambda)$, i.e. the score function. This always has expectation zero, which simply follows from

$$\mathbb{E}_q[\nabla_\lambda \log q(z; \lambda)] = \int q(z; \lambda) \, \nabla_\lambda \log q(z; \lambda) \, dz = \int \nabla_\lambda q(z; \lambda) \, dz = \nabla_\lambda \int q(z; \lambda) \, dz = \nabla_\lambda 1 = 0$$

under sufficient restrictions allowing us to "move" the partial derivative outside of the integral (e.g. smoothness wrt. $\lambda$ is sufficient).

With the Rao-Blackwellized estimator obtained in the previous section, we then have

$$f_i(z) = \nabla_{\lambda_i} \log q_i(z_i; \lambda_i) \big( \log p_i(x, z_{(i)}) - \log q_i(z_i; \lambda_i) \big), \quad h_i(z) = \nabla_{\lambda_i} \log q_i(z_i; \lambda_i)$$

The estimate of the optimal choice of scaling, $\hat{a}_i^*$, is then

$$\hat{a}_i^* = \frac{\sum_{d=1}^{D} \widehat{\mathrm{Cov}}\big(f_i^d, h_i^d\big)}{\sum_{d=1}^{D} \widehat{\mathrm{Var}}\big(h_i^d\big)}$$

where

- $\widehat{\mathrm{Cov}}$ and $\widehat{\mathrm{Var}}$ denote the empirical estimators
- $f_i^d$ denotes the $d$-th component of $f_i$, i.e. it can be multi-dimensional

Therefore, we end up with the MC gradient estimator (with lower variance than the previous one):

$$\hat{\nabla}_{\lambda_i} \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q_i(z_i^{(s)}; \lambda_i) \big( \log p_i(x, z_{(i)}^{(s)}) - \log q_i(z_i^{(s)}; \lambda_i) - \hat{a}_i^* \big)$$
Algorithm
In [ranganath13_black_box_variat_infer] the resulting updates are written as $\lambda_{t+1} = \lambda_t + \rho_t \, \hat{\nabla}_\lambda \mathcal{L}$, where $\rho_t$ is the $t$-th value of the Robbins-Monro sequence.
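As a condensed illustration, here's a sketch of a single BBVI gradient estimate in Python (my own, not from the paper); `log_joint` is a hypothetical stand-in for $\log p(x, z)$, with a single Gaussian factor and $\lambda = (\mu, \log \sigma)$:

```python
import numpy as np

def bbvi_grad(log_joint, mu, log_sigma, num_samples=200,
              rng=np.random.default_rng(0)):
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=num_samples)  # z ~ q(z; λ)
    log_q = -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((z - mu) / sigma) ** 2

    # h(z) = ∇_λ log q(z; λ), stacked over λ = (μ, log σ)
    h = np.stack([(z - mu) / sigma**2, ((z - mu) / sigma) ** 2 - 1.0])
    f = h * (log_joint(z) - log_q)               # f(z), componentwise

    # â* = Σ_d Cov(f_d, h_d) / Σ_d Var(h_d), with empirical estimators
    cov = sum(np.cov(f[d], h[d])[0, 1] for d in range(2))
    var = sum(np.var(h[d], ddof=1) for d in range(2))
    a_hat = cov / var

    return (f - a_hat * h).mean(axis=1)          # ∇̂_λ ELBO

# the λ-update would then be, e.g. with a Robbins-Monro-style ρ_t:
# mu, log_sigma = np.array([mu, log_sigma]) + rho_t * bbvi_grad(log_joint, mu, log_sigma)
```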
Importance-weighted Autoencoder approach
Notation
Unnormalized importance weights

$$w_i = \frac{p(x, z_i)}{q(z_i \mid x)}$$

Normalized importance weights

$$\tilde{w}_i = \frac{w_i}{\sum_{j=1}^{k} w_j}$$
Stuff
- [burda15_impor_weigh_autoen]

Proposes new objective

$$\mathcal{L}_k(x) = \mathbb{E}_{z_1, \dots, z_k \sim q(z \mid x)}\Bigg[ \log \frac{1}{k} \sum_{i=1}^{k} w_i \Bigg]$$

Lower bound on the marginal log-likelihood follows from Jensen's inequality together with the fact that the average importance weight is an unbiased estimator of $p(x)$:

$$\mathcal{L}_k(x) = \mathbb{E}\Bigg[ \log \frac{1}{k} \sum_{i=1}^{k} w_i \Bigg] \leq \log \mathbb{E}\Bigg[ \frac{1}{k} \sum_{i=1}^{k} w_i \Bigg] = \log p(x)$$

where expectations are wrt. $q(z \mid x)$

Gradient

$$\nabla_\lambda \mathcal{L}_k(x) = \mathbb{E}_{\epsilon_1, \dots, \epsilon_k}\Bigg[ \sum_{i=1}^{k} \tilde{w}_i \, \nabla_\lambda \log w_i\big( g(\epsilon_i, \lambda) \big) \Bigg]$$

where we are using the reparametrization trick with $z_i = g(\epsilon_i, \lambda)$ being the reparametrization function

- This then has to be empirically approximated, e.g. sampling $\epsilon_1, \dots, \epsilon_k \sim p(\epsilon)$ and computing the empirical estimate
How to implement
- Could just implement an alternative objective function and take the gradient?
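E.g. a minimal sketch of such an alternative objective (my own; `log_joint`, `log_q`, and `sample_q` are hypothetical stand-ins the user would supply), computed as a log-mean-exp of $k$ importance weights:

```python
import numpy as np
from scipy.special import logsumexp

def iwae_objective(log_joint, log_q, sample_q, x, k=5,
                   rng=np.random.default_rng(0)):
    zs = sample_q(x, k, rng)  # z_1, ..., z_k ~ q(z | x)
    # log w_i = log p(x, z_i) - log q(z_i | x)
    log_w = np.array([log_joint(x, z) - log_q(z, x) for z in zs])
    return logsumexp(log_w) - np.log(k)  # log (1/k) Σ_i w_i
```

Taking the gradient of this objective (e.g. via an autodiff framework, with `sample_q` reparametrized) would then recover the estimator above.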
Appendix A: Definitions
- variational density $q(z)$
  - our approximate probability distribution
- variational parameter $\lambda$
  - a parameter required to compute our variational density, i.e. parameters which define the approx. distribution to the latent variables. So sort of like "latent-latent variables", or as I like to call them, doubly latent variables (•_•) / ( •_•)>⌐■-■ / (⌐■_■). Disclaimer: I've never called them that before in my life..
Bibliography
- [zhang17_advan_variat_infer] Zhang, Bütepage, Kjellström & Mandt, Advances in Variational Inference, CoRR, (2017). link.
- [jang16_categ_repar_with_gumbel_softm] Jang, Gu & Poole, Categorical Reparameterization With Gumbel-Softmax, CoRR, (2016). link.
- [yin18_semi_implic_variat_infer] Yin & Zhou, Semi-Implicit Variational Inference, CoRR, (2018). link.
- [kucukelbir16_autom_differ_variat_infer] Kucukelbir, Tran, Ranganath, Gelman & Blei, Automatic Differentiation Variational Inference, CoRR, (2016). link.
- [kucukelbir15_autom_variat_infer_stan] Kucukelbir, Ranganath, Gelman & Blei, Automatic Variational Inference in Stan, CoRR, (2015). link.
- [ranganath13_black_box_variat_infer] Ranganath, Gerrish & Blei, Black Box Variational Inference, CoRR, (2013). link.
- [burda15_impor_weigh_autoen] Burda, Grosse & Salakhutdinov, Importance Weighted Autoencoders, CoRR, (2015). link.