Variational Inference
Overview
In variational inference, the posterior distribution over a set of unobserved
variables $\mathbf{z} = z_{1:m}$ given some data $\mathbf{x}$
is approximated by a variational distribution $q(\mathbf{z})$:

\begin{equation*}
p(\mathbf{z} \mid \mathbf{x}) \approx q(\mathbf{z})
\end{equation*}

The distribution $q(\mathbf{z})$ is restricted to belong to a family of
distributions of simpler form than $p(\mathbf{z} \mid \mathbf{x})$,
selected with the intention of making $q(\mathbf{z})$ similar to the true
posterior, $p(\mathbf{z} \mid \mathbf{x})$.
We're basically making life simpler for ourselves by casting approximate conditional inference as an optimization problem.
The evidence lower bound (ELBO)
- Specify a family $\mathcal{Q}$ of densities over the latent variables
- Each $q(\mathbf{z}) \in \mathcal{Q}$ is a candidate approximation to the exact conditional distribution
- Goal is then to find the best candidate, i.e. the one closest in KL divergence to the exact conditional distribution:

\begin{equation*}
q^*(\mathbf{z}) = \underset{q \in \mathcal{Q}}{\arg \min} \ \text{KL} \Big( q(\mathbf{z}) \ || \ p(\mathbf{z} \mid \mathbf{x}) \Big)
\end{equation*}

$q^*(\mathbf{z})$ is the best approx. to the conditional distribution, within that family.
However, expanding the KL divergence,

\begin{equation*}
\text{KL} \Big( q(\mathbf{z}) \ || \ p(\mathbf{z} \mid \mathbf{x}) \Big) = \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z}, \mathbf{x})] + \log p(\mathbf{x})
\end{equation*}

we see that the equation above requires us to compute the log evidence $\log p(\mathbf{x})$, which may be intractable, OR DOES IT?!
- Buuut, since we want to minimize wrt. $q$, and $\log p(\mathbf{x})$ is constant wrt. $q$, we can simply minimize the above without having to worry about $\log p(\mathbf{x})$!
- By dropping the constant term (wrt. $q$) and moving the sign outside, we get
![\begin{equation*}
\text{ELBO}(q) = \mathbb{E}_q[\log p(\mathbf{z}, \mathbf{x})] - \mathbb{E}_q[\log q(\mathbf{z})]
\end{equation*}](../../assets/latex/variational_inference_9548be33807ca5988b0053f2bdff3f613d9f14eb.png)
- Thus, maximizing the ELBO is equivalent to minimizing the KL divergence
Why use the above representation of the ELBO
as our objective function?
Because $p(\mathbf{z}, \mathbf{x})$ can be rewritten as $p(\mathbf{z}) p(\mathbf{x} \mid \mathbf{z})$,
thus we simply need to come up with:
- A model for the likelihood of the data given some latent variables, $p(\mathbf{x} \mid \mathbf{z})$
- Some prior probability for the latent variables, $p(\mathbf{z})$
- We can rewrite the ELBO to give us some intuition about the optimal variational density
![\begin{equation*}
\begin{split}
\text{ELBO}(q) &= \mathbb{E}_q[\log p(\mathbf{z})] + \mathbb{E}_q[\log p(\mathbf{x} | \mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})] \\
&= \mathbb{E}_q[\log p(\mathbf{x} | \mathbf{z})] - \text{KL} \Big( q(\mathbf{z}) || p(\mathbf{z}) \Big)
\end{split}
\end{equation*}](../../assets/latex/variational_inference_15ccc87bf8a041ed5258c23354b244a51a12ddc0.png)
- Basically says: "the evidence lower bound is the maximization of the likelihood and the minimization of
the divergence from the prior distribution, combined"
  - The first term encourages densities that increase the likelihood of the data
  - The second term encourages densities close to the prior
- Another property (and the reason for the name) is that it puts a lower bound on the (log) evidence:

\begin{equation*}
\log p(\mathbf{x}) = \text{ELBO}(q) + \text{KL} \Big( q(\mathbf{z}) \ || \ p(\mathbf{z} \mid \mathbf{x}) \Big) \ge \text{ELBO}(q)
\end{equation*}

- Which means that, since $\log p(\mathbf{x})$ is fixed, if we increase the ELBO the KL divergence must decrease (a quick numerical check of this follows below)
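To convince ourselves of this identity, here's a tiny numerical check (my own sketch, using a made-up discrete model with a single binary latent variable) that $\log p(\mathbf{x}) = \text{ELBO}(q) + \text{KL}\big(q \ || \ p(\cdot \mid \mathbf{x})\big)$:

```python
import numpy as np

# Toy discrete model: one binary latent z and a fixed observation x.
p_z = np.array([0.4, 0.6])          # prior p(z)
p_x_given_z = np.array([0.9, 0.2])  # likelihood p(x | z) at the observed x

p_joint = p_z * p_x_given_z         # p(z, x)
p_x = p_joint.sum()                 # evidence p(x)
p_z_given_x = p_joint / p_x         # exact posterior p(z | x)

q = np.array([0.5, 0.5])            # an arbitrary variational density q(z)

elbo = np.sum(q * (np.log(p_joint) - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(p_z_given_x)))

# log-evidence decomposes exactly into ELBO + KL
assert np.isclose(np.log(p_x), elbo + kl)
```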
Examples of families
Mean-field variational family
- Assumes the latent variables $z_1, \dots, z_m$ to be mutually independent, i.e.

\begin{equation*}
q(\mathbf{z}) = \prod_{j=1}^{m} q_j(z_j)
\end{equation*}

- Does not model the observed data: $\mathbf{x}$ does not appear in the equation => it's the ELBO
which connects the fitted variational density to the data (see the small sketch below)
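As a concrete sketch (mine, assuming Gaussian factors), a mean-field density is just a collection of independent one-dimensional densities, so its log-density is a sum over factors:

```python
import numpy as np
from scipy.stats import norm

# Gaussian mean-field: q(z) = prod_j N(z_j; mu_j, sigma_j),
# so log q(z) is a sum of independent per-coordinate terms.
def mean_field_logpdf(z, mu, sigma):
    return np.sum(norm.logpdf(z, loc=mu, scale=sigma))

mu = np.array([0.0, 1.0])
sigma = np.array([1.0, 0.5])
print(mean_field_logpdf(np.array([0.3, 0.8]), mu, sigma))
```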
Optimization
Coordinate ascent mean-field variational inference (CAVI)
Notation
- $\mathbf{x} = x_{1:n}$ is the data
- $\mathbf{z} = z_{1:m}$ is the latent variables of our model for $\mathbf{x}$
- $q(\mathbf{z}) = \prod_{j=1}^{m} q_j(z_j)$ is the variational density
- $q_j(z_j)$ is a variational factor
- $\mathbb{E}_{-j}[\cdot] = \mathbb{E}_{\prod_{l \neq j} q_l}[\cdot]$, i.e. the expectation over all factors of $q$, keeping the $j$-th factor constant
Overview
Based on coordinate ascent, thus no gradient information is required.
Iteratively optimizes each factor of the mean-field variational density
while holding the others fixed, thus climbing the ELBO
to a local optimum.
Algorithm
- Input: Model $p(\mathbf{z}, \mathbf{x})$, data set $\mathbf{x}$
- Output: Variational density $q(\mathbf{z}) = \prod_{j=1}^{m} q_j(z_j)$
Then the actual algorithm goes as follows: while the ELBO has not converged, for each $j = 1, \dots, m$ set

\begin{equation*}
q_j(z_j) \propto \exp \Big( \mathbb{E}_{-j} \big[ \log p(z_j \mid \mathbf{z}_{-j}, \mathbf{x}) \big] \Big)
\end{equation*}

and then recompute $\text{ELBO}(q)$.
Pseudo implementation in Python
Which I believe could look something like this when written in Python:

```python
import sys

import numpy as np


class UpdateableModel:
    def __init__(self):
        self.latent_var = np.random.rand()

    def __call__(self, z):
        # compute q(z), or equivalently q(z_j | z_{l != j}),
        # since the mean-field approx. allows us to assume independence
        pass

    @staticmethod
    def proba_product(qs, z):
        # compute q_1(z_1) * q_2(z_2) * ... * q_m(z_m)
        res = 1
        for j in range(len(qs)):
            res *= qs[j](z[j])
        return res


def compute_elbo(q, p_joint, z, x):
    # ELBO(q) = E_q[log p(z, x)] - E_q[log q(z)]
    m = len(z)
    joint_expect = 0
    latent_expect = 0
    for j in range(m):
        q_j, z_j = q[j], z[j]
        joint_expect += q_j(z_j) * np.log(p_joint(z=z, x=x))
        latent_expect += q_j(z_j) * np.log(q_j(z_j))
    return joint_expect - latent_expect


def cavi(model, z, x, epsilon=5):
    """
    Parameters
    ----------
    model : callable
        Computes our model probability $p(z, x)$ or $p(z_j | z_{l != j}, x)$.
    z : array-like
        Initialized values for latent variables, e.g. for a GMM we would
        have mu = z[0], sigma = z[1].
    x : array-like
        Represents the data.
    """
    m = len(z)
    q = [UpdateableModel() for j in range(m)]

    prev_elbo, elbo = -sys.maxsize, sys.maxsize
    # iterate until the change in the ELBO is negligible
    while abs(elbo - prev_elbo) > epsilon:
        for j in range(m):
            # expectation over all latent variables except the j-th
            expect_log = 0
            for l in range(m):
                if l == j:
                    continue
                expect_log += q[l](z[l]) * np.log(model(fixed=j, z=z, x=x))

            # coordinate update: q_j(z_j) ∝ exp(E_{-j}[log p(z_j | z_{-j}, x)])
            q[j].latent_var = np.exp(expect_log)

        prev_elbo = elbo
        elbo = compute_elbo(q, model, z=z, x=x)

    return q
```
Mean-field approx. → assume latent variables are independent of each other
→ $q$ can simply be represented as a vector with the $j$-th
entry being an independent model corresponding to $z_j$
Derivation of update
We simply rewrite the ELBO
as a function of $q_j$, since the independence between
the variational factors $q_1, \dots, q_m$
implies that we can maximize the
ELBO
wrt. each
of those separately.
![\begin{equation*}
\text{ELBO}(q_j) = \mathbb{E}_j \Big[ \mathbb{E}_{-j} \log p(z_j, \mathbf{z}_{-j}, \mathbf{x}) \Big] - \mathbb{E}_j [q_j ( z_j)] + \text{constant}
\end{equation*}](../../assets/latex/variational_inference_8e1ec13f115a71d40274e9c6c5476ffbd999ab1c.png)
where we have written the expectation wrt. $\mathbf{z}$ using iterated expectation.
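Maximizing this wrt. $q_j$ (the standard variational argument; spelling out the conclusion here since it's what the algorithm above actually uses) gives the optimal coordinate update:

\begin{equation*}
q_j^*(z_j) \propto \exp \Big( \mathbb{E}_{-j} \big[ \log p(z_j \mid \mathbf{z}_{-j}, \mathbf{x}) \big] \Big) \propto \exp \Big( \mathbb{E}_{-j} \big[ \log p(z_j, \mathbf{z}_{-j}, \mathbf{x}) \big] \Big)
\end{equation*}

where the two forms agree because $p(z_j \mid \mathbf{z}_{-j}, \mathbf{x})$ and $p(z_j, \mathbf{z}_{-j}, \mathbf{x})$ differ only by a factor constant in $z_j$.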
Gradient
ELBO
Used in BBVI (REINFORCE)
- Can obtain an unbiased gradient estimator by sampling from the variational distribution
$q(\mathbf{z}; \lambda)$ without having to compute the ELBO analytically
- Only requires computation of the score function of the variational posterior: $\nabla_{\lambda} \log q(\mathbf{z}; \lambda)$
- Given by

\begin{equation*}
\nabla_{\lambda} \text{ELBO}(q) = \mathbb{E}_q \Big[ \nabla_{\lambda} \log q(\mathbf{z}; \lambda) \big( \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}; \lambda) \big) \Big]
\end{equation*}

- Unbiased estimator obtained by sampling from $q(\mathbf{z}; \lambda)$ (see the numerical sketch below)
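A minimal numerical sketch of this estimator (my own; the one-dimensional Gaussian $q$ and the joint density are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_joint(z):
    # made-up joint: log p(x, z) = log N(z; 0, 1) for a fixed x
    return -0.5 * z**2 - 0.5 * np.log(2 * np.pi)

def log_q(z, mu):
    # q(z; mu) = N(z; mu, 1)
    return -0.5 * (z - mu)**2 - 0.5 * np.log(2 * np.pi)

def score(z, mu):
    # score function: grad_mu log q(z; mu) = z - mu for unit variance
    return z - mu

mu, n_samples = 1.5, 100_000
z = rng.normal(mu, 1.0, size=n_samples)  # z ~ q(z; mu)

# unbiased sample-based estimate of grad_mu ELBO (true value here: -mu)
grad_samples = score(z, mu) * (log_joint(z) - log_q(z, mu))
print(grad_samples.mean(), grad_samples.std() / np.sqrt(n_samples))
```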
Reparametrization trick
- $q(\mathbf{z}; \lambda)$ needs to be reparametrizable
  - E.g. $z \sim \mathcal{N}(\mu_{\lambda}, \sigma_{\lambda}^2)$ can be written as $z = \mu_{\lambda} + \sigma_{\lambda} \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 1)$,
    where $\mu_{\lambda}$ and $\sigma_{\lambda}$ are parametrized means and std. dev.
- Then we instead do the following:

\begin{equation*}
\nabla_{\lambda} \text{ELBO}(q) = \mathbb{E}_{p(\varepsilon)} \Big[ \nabla_{\lambda} \Big( \log p\big(\mathbf{x}, g(\varepsilon, \lambda)\big) - \log q\big(g(\varepsilon, \lambda); \lambda\big) \Big) \Big]
\end{equation*}

where $z = g(\varepsilon, \lambda)$, $g$
is some deterministic function and $p(\varepsilon)$
is the "noise" distribution
- Observe that this then also takes advantage of the structure of the joint distribution
- But also requires the joint distribution to be differentiable wrt. $\mathbf{z}$
- Often the entropy of $q$
can be computed analytically, which reduces the variance of the gradient estimate since we only have to estimate the expectation of the first term
  - Recall that $\text{ELBO}(q) = \mathbb{E}_q[\log p(\mathbf{x}, \mathbf{z})] + \mathbb{H}\big(q(\mathbf{z})\big)$
  - E.g. in ADVI, where we are using a Gaussian Mean-Field approx., the entropy term reduces to $\sum_j \omega_j$ (up to an additive constant), where $\omega_j = \log \sigma_j$
- Assumptions
  - the joint density $p(\mathbf{x}, \mathbf{z})$ must be differentiable wrt. $\mathbf{z}$
  - the reparametrization $g(\varepsilon, \lambda)$ must be differentiable wrt. $\lambda$
- Notes
  - Usually lower variance than REINFORCE (see the sketch following this list), and can potentially be reduced further if an analytical entropy is available
    - Not actually guaranteed zhang17_advan_variat_infer
  - Being reparametrizable (and this reparametrization being differentiable) limits the family of variational posteriors
    - Work towards using the Gumbel-Max trick and replacing the argmax operation with the softmax operator for categorical distributions jang16_categ_repar_with_gumbel_softm
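A minimal sketch of the reparametrization gradient (mine, reusing the same hypothetical Gaussian setup as the REINFORCE sketch above, so the two estimators can be compared directly):

```python
import numpy as np

rng = np.random.default_rng(0)

# same made-up setup as above: q(z; mu) = N(z; mu, 1), log p(x, z) = log N(z; 0, 1)
def grad_log_joint(z):
    # d/dz log p(x, z) = -z
    return -z

mu, n_samples = 1.5, 100_000
eps = rng.normal(0.0, 1.0, size=n_samples)  # eps ~ p(eps) = N(0, 1)
z = mu + eps                                # z = g(eps, mu), dz/dmu = 1

# grad_mu log p(x, g(eps, mu)) = grad_log_joint(z) * dz/dmu; the entropy of q
# does not depend on mu here, so it contributes nothing to the gradient
grad_samples = grad_log_joint(z)
print(grad_samples.mean(), grad_samples.std() / np.sqrt(n_samples))
```

The standard error here comes out noticeably smaller than in the score-function sketch, which is exactly the variance advantage referred to above.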
Differentiable lower-bounds of ELBO
- Aims to instead optimize lower-bounds of ELBO using more flexible variational posteriors
- E.g. SIVI yin18_semi_implic_variat_infer
Methods
- Automatic Differentiation Variational Inference (ADVI) kucukelbir16_autom_differ_variat_infer
  - Objective: the ELBO of the model after transforming the support of the latent variables to the entire real coordinate space (which adds the log-abs-det-Jacobian of the inverse transform to the joint density)
  - Gradient estimate: reparametrization trick
  - Variational posterior:
    - Gaussian Mean-field approx.
  - Assumptions:
    - $p$ is transformable in each component (i.e. each latent variable's support can be mapped to $\mathbb{R}$)
    - the (transformed) latent variables are independent under $q$
  - Notes:
    - If applicable, it just works
    - Easy to implement
    - Assumes independence
    - Restrictive in choice of variational posterior
      (i.e. Gaussian mean-field)
- Black-box Variational Inference (BBVI) ranganath13_black_box_variat_infer
  - Objective: the ELBO
  - Gradient estimate: the score-function (REINFORCE) estimator from above; using variance reduction techniques (Rao-Blackwellization and control variate), for each $i$,

\begin{equation*}
\hat{\nabla}_{\lambda_i} \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q_i(z_s \mid \lambda_i) \Big( \log p_i(x, z_s) - \log q_i(z_s \mid \lambda_i) - \hat{a}_i^* \Big)
\end{equation*}

    where $p_i$ collects the terms of the joint density depending on the Markov blanket of $z_i$, and $\hat{a}_i^*$ is the control-variate scaling (both are defined in the BBVI section below)
  - Variational posterior: any density we can sample from and whose score function $\nabla_{\lambda} \log q(\mathbf{z} \mid \lambda)$ we can compute
  - Assumptions:
    - essentially none on the joint density, beyond being able to evaluate it
  - Notes:
    - Does not require differentiability of the joint density
    - Gradient estimator usually has high variance, even with variance-reduction techniques
Automatic Differentiation Variational Inference (ADVI)
Example implementation (in Julia, written against a Turing.jl-style interface)
```julia
using ForwardDiff
using Flux.Tracker
using Flux.Optimise

"""
    ADVI(samples_per_step = 10, max_iters = 5000)

Automatic Differentiation Variational Inference (ADVI) for a given model.
"""
struct ADVI{AD} <: VariationalInference{AD}
    samples_per_step # number of samples used to estimate the ELBO in each optimization step
    max_iters        # maximum number of gradient steps used in optimization
end

ADVI(args...) = ADVI{ADBackend()}(args...)
ADVI() = ADVI(10, 5000)

alg_str(::ADVI) = "ADVI"

vi(model::Model, alg::ADVI; optimizer = ADAGrad()) = begin
    # setup
    var_info = VarInfo()
    model(var_info, SampleFromUniform())
    num_params = size(var_info.vals, 1)

    dists = var_info.dists
    ranges = var_info.ranges

    q = MeanField(zeros(num_params), zeros(num_params), dists, ranges)

    # construct objective
    elbo = ELBO()

    Turing.DEBUG && @debug "Optimizing ADVI..."
    θ = optimize(elbo, alg, q, model; optimizer = optimizer)
    μ, ω = θ[1:length(q)], θ[length(q) + 1:end]

    # TODO: make mutable instead?
    MeanField(μ, ω, dists, ranges)
end

# TODO: implement optimize like this?
# (advi::ADVI)(elbo::EBLO, q::MeanField, model::Model) = begin
# end

function optimize(elbo::ELBO, alg::ADVI, q::MeanField, model::Model; optimizer = ADAGrad())
    θ = randn(2 * length(q))
    optimize!(elbo, alg, q, model, θ; optimizer = optimizer)
    return θ
end

function optimize!(elbo::ELBO, alg::ADVI{AD}, q::MeanField, model::Model, θ; optimizer = ADAGrad()) where AD
    alg_name = alg_str(alg)
    samples_per_step = alg.samples_per_step
    max_iters = alg.max_iters

    # number of previous gradients to use to compute `s` in adaGrad
    stepsize_num_prev = 10

    # setup
    # var_info = Turing.VarInfo()
    # model(var_info, Turing.SampleFromUniform())
    # num_params = size(var_info.vals, 1)
    num_params = length(q)

    # # buffer
    # θ = zeros(2 * num_params)

    # HACK: re-use previous gradient `acc` if equal in value
    # Can cause issues if two entries have identical values
    if θ ∉ keys(optimizer.acc)
        vs = [v for v ∈ keys(optimizer.acc)]
        idx = findfirst(w -> vcat(q.μ, q.ω) == w, vs)
        if idx != nothing
            @info "[$alg_name] Re-using previous optimizer accumulator"
            θ .= vs[idx]
        end
    else
        @info "[$alg_name] Already present in optimizer acc"
    end

    diff_result = DiffResults.GradientResult(θ)

    # TODO: in (Blei et al, 2015) TRUNCATED ADAGrad is suggested; this is not available in Flux.Optimise
    # Maybe consider contributing a truncated ADAGrad to Flux.Optimise

    i = 0
    prog = PROGRESS[] ? ProgressMeter.Progress(max_iters, 1, "[$alg_name] Optimizing...", 0) : 0

    time_elapsed = @elapsed while (i < max_iters) # & converged # <= add criterion? A running mean maybe?
        # TODO: separate into a `grad(...)` call; need to manually provide `diff_result` buffers
        # ForwardDiff.gradient!(diff_result, f, x)
        grad!(elbo, alg, q, model, θ, diff_result, samples_per_step)

        # apply update rule
        Δ = DiffResults.gradient(diff_result)
        Δ = Optimise.apply!(optimizer, θ, Δ)
        @. θ = θ - Δ

        Turing.DEBUG && @debug "Step $i" Δ DiffResults.value(diff_result) norm(DiffResults.gradient(diff_result))

        PROGRESS[] && (ProgressMeter.next!(prog))

        i += 1
    end

    @info time_elapsed

    return θ
end

function grad!(vo::ELBO, alg::ADVI{AD}, q::MeanField, model::Model, θ::AbstractVector{T}, out::DiffResults.MutableDiffResult, args...) where {T <: Real, AD <: ForwardDiffAD}
    # TODO: this probably slows down execution quite a bit; exists a better way of doing this?
    f(θ_) = - vo(alg, q, model, θ_, args...)

    chunk_size = getchunksize(alg)
    # Set chunk size and do ForwardMode.
    chunk = ForwardDiff.Chunk(min(length(θ), chunk_size))
    config = ForwardDiff.GradientConfig(f, θ, chunk)
    ForwardDiff.gradient!(out, f, θ, config)
end

# TODO: implement for `Tracker`
# function grad(vo::ELBO, alg::ADVI, q::MeanField, model::Model, f, autodiff::Val{:backward})
#     vo_tracked, vo_pullback = Tracker.forward()
# end

function grad!(vo::ELBO, alg::ADVI{AD}, q::MeanField, model::Model, θ::AbstractVector{T}, out::DiffResults.MutableDiffResult, args...) where {T <: Real, AD <: TrackerAD}
    θ_tracked = Tracker.param(θ)
    y = - vo(alg, q, model, θ_tracked, args...)
    Tracker.back!(y, 1.0)

    DiffResults.value!(out, Tracker.data(y))
    DiffResults.gradient!(out, Tracker.grad(θ_tracked))
end

function (elbo::ELBO)(alg::ADVI, q::MeanField, model::Model, θ::AbstractVector{T}, num_samples) where T <: Real
    # setup
    var_info = Turing.VarInfo()

    # initialize `VarInfo` object
    model(var_info, Turing.SampleFromUniform())

    num_params = length(q)
    μ, ω = θ[1:num_params], θ[num_params + 1:end]

    elbo_acc = 0.0

    # TODO: instead use `rand(q, num_samples)` and iterate through?
    for i = 1:num_samples
        # iterate through priors, sample and update
        for i = 1:size(q.dists, 1)
            prior = q.dists[i]
            r = q.ranges[i]

            # mean-field params for this set of model params
            μ_i = μ[r]
            ω_i = ω[r]

            # obtain samples from mean-field posterior approximation
            η = randn(length(μ_i))
            ζ = center_diag_gaussian_inv(η, μ_i, exp.(ω_i))

            # inverse-transform back to domain of original prior
            θ = invlink(prior, ζ)

            # update
            var_info.vals[r] = θ

            # add the log-det-jacobian of inverse transform;
            # `logabsdet` returns `(log(abs(det(M))), sign(det(M)))` so return first entry
            elbo_acc += logabsdet(jac_inv_transform(prior, ζ))[1] / num_samples
        end

        # compute log density
        model(var_info)
        elbo_acc += var_info.logp / num_samples
    end

    # add the term for the entropy of the variational posterior
    variational_posterior_entropy = sum(ω)
    elbo_acc += variational_posterior_entropy

    elbo_acc
end

function (elbo::ELBO)(alg::ADVI, q::MeanField, model::Model, num_samples)
    # extract the mean-field Gaussian params
    θ = vcat(q.μ, q.ω)

    elbo(alg, q, model, θ, num_samples)
end
```
Black-box Variational Inference (BBVI)
Stuff
The objective is the standard ELBO, with the gradient written using the score function:

\begin{equation*}
\nabla_{\lambda} \mathcal{L} = \mathbb{E}_q \Big[ \nabla_{\lambda} \log q(\mathbf{z} \mid \lambda) \big( \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z} \mid \lambda) \big) \Big]
\end{equation*}

Controlling the variance
- Variance of the gradient estimator (under MC estimation of the ELBO) can be too large to be useful
Rao-Blackwellization
- Reduces the variance of a rv. by replacing it with its conditional expectation wrt. a subset of the variables
- How?
  Simple example (see the numerical sketch below):
  - Two rvs $X$ and $Y$
  - Function $J(X, Y)$
  - Goal: compute expectation $\mathbb{E}\big[J(X, Y)\big]$
  - Letting $\hat{J}(X) = \mathbb{E}\big[J(X, Y) \mid X\big]$, we note that $\mathbb{E}\big[\hat{J}(X)\big] = \mathbb{E}\big[J(X, Y)\big]$
  - Therefore: we can use $\hat{J}(X)$ as an MC approx. of $\mathbb{E}\big[J(X, Y)\big]$, with variance

\begin{equation*}
\text{Var}\big(\hat{J}(X)\big) = \text{Var}\big(J(X, Y)\big) - \mathbb{E}\Big[\big(J(X, Y) - \hat{J}(X)\big)^2\Big]
\end{equation*}

    which means that $\hat{J}(X)$ is a lower-variance estimator than $J(X, Y)$.
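A quick numerical sketch of this (mine): estimate $\mathbb{E}[X + Y]$ for independent standard normals, either directly via $J(X, Y) = X + Y$ or via the conditional expectation $\hat{J}(X) = \mathbb{E}[X + Y \mid X] = X$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X, Y independent standard normals; J(X, Y) = X + Y with E[J] = 0
x = rng.normal(size=n)
y = rng.normal(size=n)

j = x + y    # naive MC samples of J(X, Y): variance 2
j_hat = x    # Rao-Blackwellized: Jhat(X) = E[X + Y | X] = X: variance 1

print(j.mean(), j.var())
print(j_hat.mean(), j_hat.var())
```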
- In this case
  - Consider the mean-field approximation $q(\mathbf{z}) = \prod_i q_i(z_i \mid \lambda_i)$,
    where $\lambda_i$
    denotes the parameter(s) of the variational posterior of $z_i$. Then the MC estimator for the gradient wrt. $\lambda_i$
    is simply

\begin{equation*}
\hat{\nabla}_{\lambda_i} \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q_i(z_s \mid \lambda_i) \Big( \log p_i(x, z_s) - \log q_i(z_s \mid \lambda_i) \Big)
\end{equation*}

  - ? Surely the samples should then be drawn as $z_s \sim q_{(i)}$, right?
  - $q_{(i)}$ denotes the pdf of the variables in the model that depend on the i-th variable, i.e. the Markov blanket of
    $z_i$. Then $p_i(x, \mathbf{z}_{(i)})$
    is the part of the joint probability that depends on those variables.
  - Important: model here refers to the variational distribution
    $q$! This means that in the case of a Mean-field approximation, where the Markov blanket is an empty set, we simply sample $z_i \sim q_i(z_i \mid \lambda_i)$
    jointly.
Control variates
The idea is to replace the function $f$
being approximated by MC with another function $\hat{f}$
which has the same expectation but lower variance, i.e. choose $\hat{f}$
s.t. $\mathbb{E}\big[\hat{f}(Z)\big] = \mathbb{E}\big[f(Z)\big]$
and $\text{Var}\big(\hat{f}(Z)\big) \le \text{Var}\big(f(Z)\big)$.
One particular example is, for some function $h$,

\begin{equation*}
\hat{f}(Z) = f(Z) - a \Big( h(Z) - \mathbb{E}\big[h(Z)\big] \Big)
\end{equation*}

- We can then choose $a$
to minimize $\text{Var}\big(\hat{f}(Z)\big)$
  - Variance of $\hat{f}(Z)$
    is then

\begin{equation*}
\text{Var}\big(\hat{f}(Z)\big) = \text{Var}\big(f(Z)\big) + a^2 \, \text{Var}\big(h(Z)\big) - 2a \, \text{Cov}\big(f(Z), h(Z)\big)
\end{equation*}

  - Therefore, good control variates have high covariance with $f$
  - Taking the derivative wrt. $a$
    and setting the derivative equal to zero, we get the optimal choice of $a$, denoted $a^*$:

\begin{equation*}
a^* = \frac{\text{Cov}\big(f(Z), h(Z)\big)}{\text{Var}\big(h(Z)\big)}
\end{equation*}

Maybe you recognize this form from somewhere? OH YEAH YOU DO, it's the same expression we have for the slope of the ordinary least squares (OLS) estimator:

\begin{equation*}
\hat{\beta} = \frac{\text{Cov}(x, y)}{\text{Var}(x)}
\end{equation*}

which we also know is the minimum-variance estimator in that particular case.
We already know $\mathbb{E}\big[\hat{f}(Z)\big] = \mathbb{E}\big[f(Z)\big]$, which in the linear case just means that the intercepts are the same. If we rearrange:
![\begin{equation*}
\mathbb{E} [\hat{f}(Z)] + a \big( h(Z) - \mathbb{E}[h(Z)] \big) = \mathbb{E} [f(Z)]
\end{equation*}](../../assets/latex/variational_inference_1fcc73cefcb4b6a7bb255a7ff1bed9383a9bdfcf.png)
So it's like we're performing a linear regression in the expectation space, given some function $h$?
Fun stuff we could do is to let $h$ be a parametrized function and then minimize the variance wrt. $h$
also, right future-Tor?
- This case
  - In this particular case, we can choose $h(Z) = \nabla_{\lambda} \log q(Z \mid \lambda)$
  - This always has expectation zero, which simply follows from

\begin{equation*}
\mathbb{E}_q \big[ \nabla_{\lambda} \log q(Z \mid \lambda) \big] = \int \nabla_{\lambda} q(z \mid \lambda) \, \mathrm{d}z = \nabla_{\lambda} \int q(z \mid \lambda) \, \mathrm{d}z = \nabla_{\lambda} 1 = 0
\end{equation*}

    under sufficient restrictions allowing us to "move" the partial outside of the integral (e.g. smoothness wrt. $\lambda$
    is sufficient).
  - With the Rao-Blackwellized estimator obtained in the previous section, we then have

\begin{equation*}
f_i(z) = \nabla_{\lambda_i} \log q_i(z \mid \lambda_i) \Big( \log p_i(x, z) - \log q_i(z \mid \lambda_i) \Big), \qquad h_i(z) = \nabla_{\lambda_i} \log q_i(z \mid \lambda_i)
\end{equation*}

  - The estimate of the optimal choice of scaling, $\hat{a}_i^*$, is then

\begin{equation*}
\hat{a}_i^* = \frac{\sum_{d=1}^{n_i} \widehat{\text{Cov}}\big(f_i^d, h_i^d\big)}{\sum_{d=1}^{n_i} \widehat{\text{Var}}\big(h_i^d\big)}
\end{equation*}

    where
    - $\widehat{\text{Cov}}$ and $\widehat{\text{Var}}$
      denote the empirical estimators
    - $f_i^d$ denotes the d-th component of $f_i$, i.e. it can be multi-dimensional
  - Therefore, we end up with the MC gradient estimator (with lower variance than the previous one; a small standalone demo of the control-variate trick follows below):

\begin{equation*}
\hat{\nabla}_{\lambda_i} \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} f_i(z_s) - \hat{a}_i^* \, h_i(z_s)
\end{equation*}
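Before moving on, here's a small numerical demo (my own, with $f$ and $h$ chosen for convenience rather than taken from BBVI) of the control-variate trick in isolation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Z ~ N(1, 1); estimate E[f(Z)] for f(z) = z^2 (true value: 1^2 + 1 = 2),
# using h(z) = z with known E[h(Z)] = 1 as the control variate.
z = rng.normal(1.0, 1.0, size=n)
f = z**2
h = z

# empirical estimate of a* = Cov(f, h) / Var(h)
a_star = np.cov(f, h)[0, 1] / h.var()

f_hat = f - a_star * (h - 1.0)  # same expectation, lower variance

print(f.mean(), f.var())          # mean ~ 2, variance ~ 6
print(f_hat.mean(), f_hat.var())  # mean ~ 2, variance ~ 2
```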
Algorithm
![\begin{algorithm*}[H]
\KwData{data $x$, joint distribution $p$, mean-field variational family $q$}
\KwResult{posterior distribution $\hat{p}$}
\theta_{1:n} = random() \\
\While{not converged} {
\tcp{draw samples from variational distribution}
\For{$s = 1, \dots, S$} {
$z_s \sim q_{\theta}$
}
\For{$i = 1, \dots, n$} {
\For{$s = 1, \dots, S$} {
$f_i[s] := \nabla_{\theta_i} \log q_{i} (z_s) \big( \log p_i(x, z_s) - \log q_i(z_s) \big)$ \\
$h_i[s] := \nabla_{\theta_i} \log q_i(z_s)$
}
\tcp{compute scaling}
$\hat{a}_i^* := \frac{\sum_{d=1}^{n_i} \widehat{\Cov}(f_i^d, h_i^d)}{\sum_{d = 1}^{n_i} \widehat{\Var}(h_i^d)}$ \\
\tcp{lower-variance gradient estimator}
$\hat{\nabla}_{\theta_i} \mathcal{L} := \frac{1}{S} \sum_{s=1}^{S} f_i[s] - \hat{a}_i^* h_i[s]$
}
\tcp{update $\theta$}
$\dots$
}
\caption{Black-Box Variational Inference (BBVI)}
\end{algorithm*}](../../assets/latex/variational_inference_a2f3ad8eaf7074a42c5d8c937b4c782d68644301.png)
In ranganath13_black_box_variat_infer the update uses a step size $\rho_t$, where $\rho_t$
is the t-th value of a Robbins-Monro sequence.
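For reference (a standard definition, not spelled out in the original notes): a Robbins-Monro sequence $(\rho_t)$ is a step-size sequence satisfying

\begin{equation*}
\sum_{t=1}^{\infty} \rho_t = \infty, \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty
\end{equation*}

e.g. $\rho_t = t^{-1}$, which ensures the stochastic optimization converges to a local optimum.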
Importance-weighted Autoencoder approach
Notation
- Unnormalized importance weights: $w_i = \frac{p(\mathbf{x}, \mathbf{z}_i)}{q(\mathbf{z}_i \mid \mathbf{x})}$
- Normalized importance weights: $\tilde{w}_i = \frac{w_i}{\sum_{j=1}^{k} w_j}$
Stuff
- burda15_impor_weigh_autoen
  - Proposes the new objective

\begin{equation*}
\mathcal{L}_k = \mathbb{E} \Bigg[ \log \frac{1}{k} \sum_{i=1}^{k} w_i \Bigg]
\end{equation*}

  - Lower bound on the marginal log-likelihood, from Jensen's Inequality and the fact that the average importance weight is an unbiased estimator of $p(\mathbf{x})$:

\begin{equation*}
\mathcal{L}_k = \mathbb{E} \Bigg[ \log \frac{1}{k} \sum_{i=1}^{k} w_i \Bigg] \le \log \mathbb{E} \Bigg[ \frac{1}{k} \sum_{i=1}^{k} w_i \Bigg] = \log p(\mathbf{x})
\end{equation*}

    where the expectations are wrt. $\mathbf{z}_1, \dots, \mathbf{z}_k \overset{\text{i.i.d.}}{\sim} q(\mathbf{z} \mid \mathbf{x})$
  - Gradient

\begin{equation*}
\nabla_{\lambda} \mathcal{L}_k = \mathbb{E}_{\varepsilon_1, \dots, \varepsilon_k} \Bigg[ \sum_{i=1}^{k} \tilde{w}_i \, \nabla_{\lambda} \log w\big(\mathbf{x}, g(\varepsilon_i, \lambda), \lambda\big) \Bigg]
\end{equation*}

    where we are using the reparametrization trick with $g$
    being the reparametrization function
  - This then has to be empirically approximated, e.g. sampling $\varepsilon_1, \dots, \varepsilon_k$
    from $p(\varepsilon)$ and computing the empirical estimate
How to implement
- Could just implement an alternative objective function and take the gradient?
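As a sanity check on that idea, here's a minimal sketch (mine; the densities are made up, and `scipy.special.logsumexp` is used for numerical stability) of computing the $\mathcal{L}_k$ objective by MC:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Hypothetical 1-d setup: q(z | x) = N(z; 1, 1) and log p(x, z) = log N(z; 0, 1),
# so that log p(x) = 0 and we can watch L_k approach it as k grows.
def log_joint(z):
    return -0.5 * z**2 - 0.5 * np.log(2 * np.pi)

def log_q(z):
    return -0.5 * (z - 1.0)**2 - 0.5 * np.log(2 * np.pi)

def iwae_bound(k, n_estimates=10_000):
    z = rng.normal(1.0, 1.0, size=(n_estimates, k))  # z_i ~ q(z | x)
    log_w = log_joint(z) - log_q(z)                  # log of unnormalized weights
    # log (1/k) sum_i w_i for each replication, averaged over replications
    return np.mean(logsumexp(log_w, axis=1) - np.log(k))

for k in (1, 5, 50):
    print(k, iwae_bound(k))  # non-decreasing in k, approaching log p(x) = 0
```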
Appendix A: Definitions
- variational density $q(\mathbf{z})$
  - our approximate probability distribution
- variational parameter $\lambda$
  - a parameter required to compute our variational density, i.e. parameters which define the approx. distribution to the latent variables. Sort of like "latent-latent variables", or as I like to call them, doubly latent variables (•_•) / ( •_•)>⌐■-■ / (⌐■_■). Disclaimer: I've never called them that before in my life.
Bibliography
- [zhang17_advan_variat_infer] Zhang, Butepage, Kjellstrom & Mandt, Advances in Variational Inference, CoRR, (2017). link.
- [jang16_categ_repar_with_gumbel_softm] Jang, Gu & Poole, Categorical Reparameterization With Gumbel-Softmax, CoRR, (2016). link.
- [yin18_semi_implic_variat_infer] Yin & Zhou, Semi-Implicit Variational Inference, CoRR, (2018). link.
- [kucukelbir16_autom_differ_variat_infer] Kucukelbir, Tran, Ranganath, Gelman & Blei, Automatic Differentiation Variational Inference, CoRR, (2016). link.
- [kucukelbir15_autom_variat_infer_stan] Kucukelbir, Ranganath, Gelman & Blei, Automatic Variational Inference in Stan, CoRR, (2015). link.
- [ranganath13_black_box_variat_infer] Ranganath, Gerrish & Blei, Black Box Variational Inference, CoRR, (2013). link.
- [burda15_impor_weigh_autoen] Burda, Grosse & Salakhutdinov, Importance Weighted Autoencoders, CoRR, (2015). link.