Variational Inference

Overview

In variational inference, the posterior distribution over a set of unobserved variables $\mathbf{Z} = \{ Z_{1} \dots Z_{n} \}$ given some data $\mathbf{X}$ is approximated by a variational distribution, $Q({\mathbf{Z}})$ :

\begin{equation*}
 P({\mathbf{Z}} \mid {\mathbf{X}}) \approx Q({\mathbf{Z}})
\end{equation*}

The distribution $Q({\mathbf  {Z}})$ is restricted to belong to a family of distributions of simpler form than $P({\mathbf  {Z}}\mid {\mathbf  {X}})$, selected with the intention of making $Q({\mathbf  {Z}})$ similar to the true posterior, $P({\mathbf  {Z}}\mid {\mathbf  {X}})$.

We're basically making life simpler for ourselves by casting approximate conditional inference as an optimization problem.

The evidence lower bound (ELBO)

  • Specify a family of densities over the latent variables
  • Each $q(\mathbf{z})$ is a candidate approximation to the exact conditional distribution
  • Goal is then to find the best candidate, i.e. the one closest in KL divergence to the exact conditional distribution
\begin{equation*}
      q^* (\mathbf{z}) = \underset{q(\mathbf{z})}{\text{arg min }} \text{KL} \Big( q(\mathbf{z}) || p(\mathbf{z} \mid \mathbf{x}) \Big)
\end{equation*}
  • $q^*(.)$ is the best approx. to the conditional distribution, within that family
  • However, the equation above requires us to compute the log evidence $\log p(\mathbf{x})$ which may be intractable, OR DOES IT?!

        \begin{equation*}
    \begin{split}
    \text{KL} \Big( q(\mathbf{z}) || p(\mathbf{z} | \mathbf{x}) \Big) &= \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z}|\mathbf{x})] \\
    &= \mathbb{E}_q[\log q(\mathbf{z})] - \mathbb{E}_q[\log p(\mathbf{z}, \mathbf{x})] + \log p(\mathbf{x})
    \end{split}
\end{equation*}
  • Buuut, since we want to minimize wrt. $q(\mathbf{z})$, we can simply minimize the above without having to worry about $\log p(\mathbf{x})$, which is constant wrt. $q$!
  • By dropping the constant term (wrt. $q$) and flipping the sign, we get
\begin{equation*}
\text{ELBO}(q) =  \mathbb{E}_q[\log p(\mathbf{z}, \mathbf{x})] - \mathbb{E}_q[\log q(\mathbf{z})]
\end{equation*}
  • Thus, maximizing the $\text{ELBO}$ is equivalent to minimizing the $\text{KL}$ divergence

Why use the above representation of the ELBO as our objective function? Because $p(\mathbf{z}, \mathbf{x})$ can be rewritten as $p(\mathbf{x} | \mathbf{z}) p(\mathbf{z})$, thus we simply need to come up with:

  • A model for the likelihood of the data given the latent variables $\mathbf{z}$, i.e. $p(\mathbf{x} | \mathbf{z})$
  • A prior over the latent variables $\mathbf{z}$, i.e. $p(\mathbf{z})$
  • We can rewrite the $\text{ELBO}$ to give us some intuition about the optimal variational density
\begin{equation*}
\begin{split}
      \text{ELBO}(q) &= \mathbb{E}_q[\log p(\mathbf{z})] + \mathbb{E}_q[\log p(\mathbf{x} | \mathbf{z})] - \mathbb{E}_q[\log q(\mathbf{z})] \\
      &= \mathbb{E}_q[\log p(\mathbf{x} | \mathbf{z})] - \text{KL} \Big( q(\mathbf{z}) || p(\mathbf{z}) \Big)
\end{split}
\end{equation*}
  • Basically says, "maximizing the evidence lower bound balances maximizing the expected likelihood of the data against minimizing the divergence from the prior"
    • $\mathbb{E}_q[\log p(\mathbf{x} | \mathbf{z})]$ encourages densities that increase the likelihood of the data
    • $\text{KL} \Big( q(\mathbf{z}) || p(\mathbf{z}) \Big)$ encourages densities close to the prior
  • Another property (and the reason for the name), is that it puts a lower bound on the (log) evidence, $\log p(\mathbf{x}) \geq \text{ELBO}(q)$
\begin{equation*}
\log p(\mathbf{x}) = \text{KL} \Big( q(\mathbf{z}) || p(\mathbf{z}|\mathbf{x}) \Big) + \text{ELBO}(q), \quad \text{KL}(\cdot) \geq 0
\end{equation*}
  • Which means that if we increase the $\text{ELBO}$ => $\text{KL} \Big( q(\mathbf{z}) || p(\mathbf{z}|\mathbf{x}) \Big)$ must decrease, since $\log p(\mathbf{x})$ is fixed
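
To make the ELBO concrete, here is a tiny numpy sketch (a toy conjugate model of my own choosing, not something from the text above) that estimates the ELBO by Monte Carlo and checks that it lower-bounds the log evidence:

import numpy as np

# Toy model (assumed for illustration): p(z) = N(0, 1), p(x | z) = N(z, 1),
# variational density q(z) = N(mu_q, sigma_q^2). For this model the exact
# posterior is N(x/2, 1/2) and log p(x) = log N(x; 0, 2) is available in
# closed form, so we can see the bound being attained.

def log_normal_pdf(y, mean, std):
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((y - mean) / std) ** 2

def elbo_estimate(x, mu_q, sigma_q, num_samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.normal(mu_q, sigma_q, size=num_samples)                      # z ~ q(z)
    log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)  # log p(z, x)
    log_q = log_normal_pdf(z, mu_q, sigma_q)                             # log q(z)
    return np.mean(log_joint - log_q)   # E_q[log p(z, x)] - E_q[log q(z)]

x = 1.0
print(log_normal_pdf(x, 0.0, np.sqrt(2.0)))                # log p(x) ≈ -1.52
print(elbo_estimate(x, mu_q=x / 2, sigma_q=np.sqrt(0.5)))  # exact posterior as q => ELBO ≈ log p(x)
print(elbo_estimate(x, mu_q=0.0, sigma_q=1.0))             # worse q => strictly smaller ELBO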

Examples of families

Mean-field variational family

  • Assumes the latent variables $\mathbf{z}$ to be mutually independent, i.e.
\begin{equation*}
q(\mathbf{z}) = \prod_{j=1}^m q_j (z_j)
\end{equation*}
  • Does not model the observed data: $\mathbf{x}$ does not appear in the equation => it's the $\text{ELBO}$ which connects the fitted variational density to the data
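
To make the factorization concrete, here is a small numpy sketch of a Gaussian mean-field family (my own toy example, not from the text): sampling and the log-density both factorize over the independent coordinates.

import numpy as np

class MeanFieldGaussian:
    """q(z) = prod_j N(z_j; mu_j, sigma_j^2), parametrized by mu and log sigma."""

    def __init__(self, mu, log_sigma):
        self.mu = np.asarray(mu, dtype=float)
        self.log_sigma = np.asarray(log_sigma, dtype=float)

    def sample(self, num_samples, seed=0):
        rng = np.random.default_rng(seed)
        sigma = np.exp(self.log_sigma)
        # each coordinate is drawn from its own independent factor q_j
        return self.mu + sigma * rng.standard_normal((num_samples, self.mu.size))

    def log_prob(self, z):
        sigma = np.exp(self.log_sigma)
        # independence => log q(z) = sum_j log q_j(z_j)
        return np.sum(
            -0.5 * np.log(2 * np.pi) - self.log_sigma - 0.5 * ((z - self.mu) / sigma) ** 2,
            axis=-1,
        )

q = MeanFieldGaussian(mu=[0.0, 1.0], log_sigma=[0.0, -1.0])
z = q.sample(5)
print(q.log_prob(z))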

Optimization

Coordinate ascent mean-field variational inference (CAVI)

Notation

  • $\mathbf{x}$ is the data
  • $\mathbf{z}$ is the latent variables of our model for $p(\mathbf{x}|\mathbf{z})$
  • $q(\mathbf{z}) = \prod_{j=1}^m q_j(z_j)$ is the variational density
  • $q_j(z_j)$ is a variational factor
  • $\mathbb{E}_{-j} = \mathbb{E}_{i \neq j}$, i.e. the expectation wrt. all factors of $q$ except the $j^{\text{th}}$ (which is held fixed)

Overview

Based on coordinate ascent, thus no gradient information required.

Iteratively optimizes each factor of the mean-field variational density, while holding others fixed. Thus, climbing the ELBO to a local optimum.

Algorithm

  • Input: Model $p(\mathbf{x}, \mathbf{z})$, data set $\mathbf{x}$
  • Output: Variational density $q(\mathbf{z}) = \prod_{j=1}^m q_j(z_j)$

Then the actual algorithm goes as follows:

[Figure: CAVI algorithm pseudocode (cavi_algorithm.PNG)]

Pseudo implementation in Python

Which I believe could look something like this when written in Python:

import numpy as np


class UpdateableModel:
    def __init__(self):
        self.latent_var = np.random.rand()  # random initialization in [0, 1)

    def __call__(self, z):
        # compute q(z) or equivalently q(z_j | z_{l != j})
        # since mean-field approx. allows us to assume independence
        pass

    @staticmethod
    def proba_product(qs, z):
        # compute q_1(z_1) * q_2(z_2) * ... * q_m(z_m)
        res = 1
        for j in range(len(qs)):
            q_j = qs[j]
            z_j = z[j]
            res *= q_j(z_j)
        return res


def compute_elbo(q, p_joint, z, x, all_possible_vals=None):
    m = len(z)
    joint_expect = 0
    latent_expect = 0

    for j in range(m):
        q_j = q[j]
        z_j = z[j]

        joint_expect += q_j(z_j) * np.log(p_joint(z=z, x=x))
        latent_expect += q_j(z_j) * np.log(q_j(z_j))

    return joint_expect - latent_expect  # ELBO = E_q[log p(z, x)] - E_q[log q(z)]


def cavi(model, z, x, epsilon=1e-5):
    """
    Parameters
    ----------
    model : callable
        Computes our model probability $p(z, x)$ or $p(z_j | z_{l != j}, x)$
    z : array-like
        Initialized values for latent variables, e.g. for a GMM we would have
        mu = z[0], sigma = z[1].
    x : array-like
        Represents the data.
    epsilon : float
        Convergence tolerance on the change in ELBO between iterations.
    """
    m = len(z)
    q = [UpdateableModel() for j in range(m)]

    elbo, prev_elbo = -np.inf, np.inf

    while abs(elbo - prev_elbo) > epsilon:
        for j in range(m):
            # expectation for all latent variables
            expect_log = 0
            for j2 in range(m):
                if j2 == j:
                    continue
                expect_log += q[j2](z[j2]) * np.log(model(fixed=j, z=z, x=x))
            q[j].latent_var = np.exp(expect_log)

        prev_elbo, elbo = elbo, compute_elbo(q, model, z=z, x=x)

    return q

Mean-field approx. → assume latent variables are independent of each other → $q(\mathbf{z})$ can simply be represented as a vector with the $j^{\text{th}}$ entry being an independent model corresponding to $q_j(z_j)$

Derivation of update

We simply rewrite the ELBO as a function of $q_j$, since the independence between the variational factors $q_j$ implies that we can maximize the ELBO wrt. each of them separately.

\begin{equation*}
\text{ELBO}(q_j) = \mathbb{E}_j \Big[ \mathbb{E}_{-j} \log p(z_j, \mathbf{z}_{-j}, \mathbf{x}) \Big] - \mathbb{E}_j [\log q_j ( z_j)] + \text{constant}
\end{equation*}

Where we have written the expectation $\mathbb{E} [\log p(\mathbf{z}, \mathbf{x})]$ wrt. $q$ using iterated expectation.
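
Maximizing this wrt. $q_j$ while holding the other factors fixed gives the standard coordinate update (this is the update CAVI iterates over the factors):

\begin{equation*}
q_j^*(z_j) \propto \exp \Big( \mathbb{E}_{-j} \big[ \log p(z_j \mid \mathbf{z}_{-j}, \mathbf{x}) \big] \Big) \propto \exp \Big( \mathbb{E}_{-j} \big[ \log p(z_j, \mathbf{z}_{-j}, \mathbf{x}) \big] \Big)
\end{equation*}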

Gradient

ELBO

Used in BBVI (REINFORCE)

  • Can obtain unbiased gradient estimator by sampling from the variational distribution $q_{\theta}$ without having to compute ELBO analytically
  • Only requires computation of score function of variational posterior: $\nabla_{\theta} \log q_{\theta}(z)$
  • Given by

    \begin{equation*}
\nabla_{\theta} \mathcal{L} = \mathbb{E}_{q_{\theta}} \big[ \nabla_{\theta} \log q_{\theta}(z) \big( \log p(x, z) - \log q_{\theta}(z) \big) \big]
\end{equation*}
    • Unbiased estimator obtained by sampling from $q_{\theta}$
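
A minimal numpy sketch of this estimator (my own toy example: 1-d Gaussian $q_{\theta}$ and a conjugate Gaussian model, so the exact gradient is known) is given below. Note that $\log p(x, z)$ is only evaluated, never differentiated.

import numpy as np

# Toy setup (assumed for illustration): q_theta(z) = N(mu, sigma^2) with
# theta = (mu, log sigma); model p(z) = N(0, 1), p(x | z) = N(z, 1).

def log_normal_pdf(y, mean, std):
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((y - mean) / std) ** 2

def score(z, mu, log_sigma):
    # nabla_theta log q_theta(z) for theta = (mu, log sigma)
    sigma = np.exp(log_sigma)
    return np.stack([(z - mu) / sigma**2, ((z - mu) / sigma) ** 2 - 1.0], axis=-1)

def reinforce_grad(x, mu, log_sigma, num_samples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=num_samples)                          # z ~ q_theta
    log_w = (log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
             - log_normal_pdf(z, mu, sigma))                             # log p(x, z) - log q_theta(z)
    return np.mean(score(z, mu, log_sigma) * log_w[:, None], axis=0)

# exact gradient at (mu, log sigma) = (0, 0) for this model is (1, -1);
# the estimate should be close, up to (fairly noticeable) Monte Carlo noise
print(reinforce_grad(x=1.0, mu=0.0, log_sigma=0.0))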

Reparametrization trick

  • $q_{\theta}$ needs to be reparametrizable
    • E.g. $\varepsilon \sim \mathcal{N}(0, 1)$ and

      \begin{equation*}
g(\varepsilon, \theta) = \mu(\theta) + \varepsilon \sigma(\theta)
\end{equation*}

      where $\mu$ and $\sigma$ are parametrized means and std. dev

  • Then we instead do the following:

    \begin{equation*}
\nabla_{\theta} \mathcal{L} = \mathbb{E}_{\varepsilon \sim r(\varepsilon)} \Big[ \nabla_{\theta} \Big( \log p\big(x, g(\varepsilon, \theta) \big) - \log q_{\theta} \big( g(\varepsilon, \theta) \big) \Big) \Big]
\end{equation*}

    where $g$ is some deterministic function and $r(\varepsilon)$ is the "noise" distribution (a sketch of this estimator for a toy Gaussian case is given after this list)

  • Observe that this then also takes advantage of the structure of the joint distribution
    • But also requires the joint distribution to be differentiable wrt. $z := g(\varepsilon, \theta)$
  • Often the entropy of $q_{\theta}$ can be computed analytically, which reduces the variance of the gradient estimate since we only have to estimate the expectation of the first term
    • Recall that

      \begin{equation*}
H(q_{\theta}) = \mathbb{E}_{z \sim q_{\theta}} \big[ \log q_{\theta}(z) \big] = \mathbb{E}_{\varepsilon \sim r(\varepsilon)} \big[ \log q_{\theta} \big( \underbrace{g(\varepsilon, \theta)}_{=: z} \big) \big]
\end{equation*}
    • E.g. in ADVI where we are using a Gaussian mean-field approx., the entropy term reduces (up to an additive constant) to $\sum_k \omega_k$ where $\sigma_k = \exp(\omega_k)$
  • Assumptions
    • $g(\varepsilon, \theta)$ must be differentiable
    • $p(x, z)$ must be differentiable
  • Notes
    • $\textcolor{green}{\checkmark}$ Usually lower variance than REINFORCE, and can potentially be reduced further if analytical entropy is available
    • $\textcolor{red}{\cross}$ Being reparametrizable (and this reparametrization being differentiable) limits the family of variational posteriors
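
The sketch promised above: the reparametrization-trick estimator for the same toy Gaussian setup (my own example, not from the text). The gradient of $\log p(x, z)$ wrt. $z$ is written out by hand here; in practice an autodiff library would supply it for any differentiable joint. The entropy of $\mathcal{N}(\mu, \sigma^2)$ is available analytically, so its exact gradient $(0, 1)$ wrt. $(\mu, \log \sigma)$ is added instead of being estimated, as discussed above.

import numpy as np

def reparam_grad(x, mu, log_sigma, num_samples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(num_samples)
    z = mu + sigma * eps                              # z = g(eps, theta)

    # d/dz log p(x, z) = d/dz [ log N(z; 0, 1) + log N(x; z, 1) ] = -z + (x - z)
    d_dz = -z + (x - z)

    d_mu = np.mean(d_dz * 1.0)                        # dz/dmu = 1
    d_log_sigma = np.mean(d_dz * sigma * eps) + 1.0   # dz/dlog_sigma = sigma * eps; +1 from the analytic entropy
    return np.array([d_mu, d_log_sigma])

# same exact gradient (1, -1) as in the REINFORCE sketch above, but typically
# with much lower variance for the same number of samples
print(reparam_grad(x=1.0, mu=0.0, log_sigma=0.0))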

Differentiable lower-bounds of ELBO

Methods

  • Automatic Differentiation Variational Inference (ADVI)
    • Objective:

      \begin{equation*}
\mathcal{L}(\mu, \omega) = \mathbb{E}_{\eta \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \bigg[ \log p \Big( \mathbf{X}, \big( T^{-1} \circ S_{\boldsymbol{\mu}, \boldsymbol{\omega}}^{-1} \big) (\eta) \Big) + \log \bigg( \left| \det \big( J(T^{-1} \circ S_{\boldsymbol{\mu}, \boldsymbol{\omega}}^{-1})\big|_{\eta} \big) \right| \bigg) \bigg] + \sum_{k=1}^{K} \omega_k
\end{equation*}
    • Gradient estimate:

      \begin{equation*}
\nabla_{\mu, \omega} \mathcal{L} \approx \frac{1}{M} \sum_{m=1}^{M} \nabla_{\mu, \omega} \bigg[ \log p \Big( \mathbf{X}, \big( T^{-1} \circ S_{\boldsymbol{\mu}, \boldsymbol{\omega}}^{-1} \big) (\eta_s) \Big) + \log \bigg( \left| \det \big( J(T^{-1} \circ S_{\boldsymbol{\mu}, \boldsymbol{\omega}}^{-1})\big|_{\eta_s} \big) \right| \bigg) + \sum_{k=1}^{K} \omega_k \bigg] \quad \text{where} \quad \eta_s \sim \mathcal{N}(0, I)
\end{equation*}
    • Variational posterior:
      • Gaussian Mean-field approx.
    • Assumptions:
      • Each latent variable can be transformed to the whole real line, i.e. the support of $p(z)$ is transformable componentwise to $\mathbb{R}$
      • $z_i$ and $z_j$ are independent
    • Notes:
      • $\textcolor{green}{\checkmark}$ If applicable, it just works
      • $\textcolor{green}{\checkmark}$ Easy to implement
      • $\textcolor{red}{\cross}$ Assumes independence
      • $\textcolor{red}{\cross}$ Restrictive in choice of variational posterior $q_{\theta}(z)$ (i.e. Gaussian mean-field)
  • Black-box Variational Inference (BBVI)
    • Objective

      \begin{equation*}
\mathcal{L}(\theta) := \mathbb{E}_{z \sim q_{\theta}(z)} \big[ \log p(x, z) - \log q_{\theta}(z) \big]
\end{equation*}
    • Gradient estimate

      \begin{equation*}
\nabla_{\theta} \mathcal{L} \approx \frac{1}{M} \sum_{m=1}^{M} \nabla_{\theta} \log q_{\theta} (z_m) \big( \log p(x, z_m) - \log q_{\theta}(z_m) \big) \quad \text{where} \quad z_m \sim q_{\theta}(z)
\end{equation*}
      • Using variance reduction techniques (Rao-Blackwellization and control variate), for each $\theta_i$,

        \begin{equation*}
\hat{\nabla}_{\theta_i} \mathcal{L} := \frac{1}{M} \sum_{m=1}^{M} \nabla_{\theta_i} \log q_{\theta_i}(z_m) \big( \log p_i(x, z_m) - \log q_{\theta_i}(z_m) - \hat{a}_i^* \big) \quad \text{where} \quad z_m \sim q_{(i)}(z)
\end{equation*}

        where

        \begin{equation*}
\begin{split}
  f_i(z) &= \nabla_{\theta_i} \log q_{\theta_i}(z)  \big( \log p(x, z) - \log q_{\theta_i}(z) \big) \\
  h_i(z) &= \nabla_{\theta_i} \log q_{\theta_i}(z) \\
  \hat{a}_i^* &= \frac{\sum_{d = 1}^{n_i} \widehat{\Cov}(f_i^d, h_i^d) }{\sum_{d=1}^{n_i} \widehat{\Var}(h_i^d)}
\end{split}
\end{equation*}
    • Variational posterior
      • Any $q_{\theta}$ we can sample from and whose score function $\nabla_{\theta} \log q_{\theta}(z)$ we can evaluate
    • Assumptions
      • Can sample $z \sim q_{\theta}$ and evaluate $\log q_{\theta}(z)$, $\nabla_{\theta} \log q_{\theta}(z)$ and $\log p(x, z)$ pointwise
    • Notes
      • $\textcolor{green}{\checkmark}$ Does not require differentiability of joint-density $p(x, z)$
      • $\textcolor{red}{\cross}$ Gradient estimator usually has high variance, even with variance-reduction techniques

Automatic Differentiation Variational Inference (ADVI)

Example implementation

using ForwardDiff
using Flux.Tracker
using Flux.Optimise


"""
    ADVI(samplers_per_step = 10, max_iters = 5000)

Automatic Differentiation Variational Inference (ADVI) for a given model.
"""
struct ADVI{AD} <: VariationalInference{AD}
    samples_per_step # number of samples used to estimate the ELBO in each optimization step
    max_iters        # maximum number of gradient steps used in optimization
end

ADVI(args...) = ADVI{ADBackend()}(args...)
ADVI() = ADVI(10, 5000)

alg_str(::ADVI) = "ADVI"

vi(model::Model, alg::ADVI; optimizer = ADAGrad()) = begin
    # setup
    var_info = VarInfo()
    model(var_info, SampleFromUniform())
    num_params = size(var_info.vals, 1)

    dists = var_info.dists
    ranges = var_info.ranges

    q = MeanField(zeros(num_params), zeros(num_params), dists, ranges)

    # construct objective
    elbo = ELBO()

    Turing.DEBUG && @debug "Optimizing ADVI..."
    θ = optimize(elbo, alg, q, model; optimizer = optimizer)
    μ, ω = θ[1:length(q)], θ[length(q) + 1:end]

    # TODO: make mutable instead?
    MeanField(μ, ω, dists, ranges) 
end

# TODO: implement optimize like this?
# (advi::ADVI)(elbo::ELBO, q::MeanField, model::Model) = begin
# end

function optimize(elbo::ELBO, alg::ADVI, q::MeanField, model::Model; optimizer = ADAGrad())
    θ = randn(2 * length(q))
    optimize!(elbo, alg, q, model, θ; optimizer = optimizer)

    return θ
end

function optimize!(elbo::ELBO, alg::ADVI{AD}, q::MeanField, model::Model, θ; optimizer = ADAGrad()) where AD
    alg_name = alg_str(alg)
    samples_per_step = alg.samples_per_step
    max_iters = alg.max_iters

    # number of previous gradients to use to compute `s` in adaGrad
    stepsize_num_prev = 10

    # setup
    # var_info = Turing.VarInfo()
    # model(var_info, Turing.SampleFromUniform())
    # num_params = size(var_info.vals, 1)
    num_params = length(q)

    # # buffer
    # θ = zeros(2 * num_params)

    # HACK: re-use previous gradient `acc` if equal in value
    # Can cause issues if two entries have identical values
    if θ ∉ keys(optimizer.acc)
        vs = [v for v ∈ keys(optimizer.acc)]
        idx = findfirst(w -> vcat(q.μ, q.ω) == w, vs)
        if idx != nothing
            @info "[$alg_name] Re-using previous optimizer accumulator"
            θ .= vs[idx]
        end
    else
        @info "[$alg_name] Already present in optimizer acc"
    end

    diff_result = DiffResults.GradientResult(θ)

    # TODO: in (Blei et al, 2015) TRUNCATED ADAGrad is suggested; this is not available in Flux.Optimise
    # Maybe consider contributing a truncated ADAGrad to Flux.Optimise

    i = 0
    prog = PROGRESS[] ? ProgressMeter.Progress(max_iters, 1, "[$alg_name] Optimizing...", 0) : 0

    time_elapsed = @elapsed while (i < max_iters) # & converged # <= add criterion? A running mean maybe?
        # TODO: separate into a `grad(...)` call; need to manually provide `diff_result` buffers
        # ForwardDiff.gradient!(diff_result, f, x)
        grad!(elbo, alg, q, model, θ, diff_result, samples_per_step)

        # apply update rule
        Δ = DiffResults.gradient(diff_result)
        Δ = Optimise.apply!(optimizer, θ, Δ)
        @. θ = θ - Δ

        Turing.DEBUG && @debug "Step $i" Δ DiffResults.value(diff_result) norm(DiffResults.gradient(diff_result))
        PROGRESS[] && (ProgressMeter.next!(prog))

        i += 1
    end

    @info time_elapsed

    return θ
end

function grad!(vo::ELBO, alg::ADVI{AD}, q::MeanField, model::Model, θ::AbstractVector{T}, out::DiffResults.MutableDiffResult, args...) where {T <: Real, AD <: ForwardDiffAD}
    # TODO: this probably slows down execution quite a bit; is there a better way of doing this?
    f(θ_) = - vo(alg, q, model, θ_, args...)

    chunk_size = getchunksize(alg)
    # Set chunk size and do ForwardMode.
    chunk = ForwardDiff.Chunk(min(length(θ), chunk_size))
    config = ForwardDiff.GradientConfig(f, θ, chunk)
    ForwardDiff.gradient!(out, f, θ, config)
end

# TODO: implement for `Tracker`
# function grad(vo::ELBO, alg::ADVI, q::MeanField, model::Model, f, autodiff::Val{:backward})
#     vo_tracked, vo_pullback = Tracker.forward()
# end
function grad!(vo::ELBO, alg::ADVI{AD}, q::MeanField, model::Model, θ::AbstractVector{T}, out::DiffResults.MutableDiffResult, args...) where {T <: Real, AD <: TrackerAD}
    θ_tracked = Tracker.param(θ)
    y = - vo(alg, q, model, θ_tracked, args...)
    Tracker.back!(y, 1.0)

    DiffResults.value!(out, Tracker.data(y))
    DiffResults.gradient!(out, Tracker.grad(θ_tracked))
end

function (elbo::ELBO)(alg::ADVI, q::MeanField, model::Model, θ::AbstractVector{T}, num_samples) where T <: Real
    # setup
    var_info = Turing.VarInfo()

    # initialize `VarInfo` object
    model(var_info, Turing.SampleFromUniform())

    num_params = length(q)
    μ, ω = θ[1:num_params], θ[num_params + 1: end]

    elbo_acc = 0.0

    # TODO: instead use `rand(q, num_samples)` and iterate through?

    for i = 1:num_samples
        # iterate through priors, sample and update
        for i = 1:size(q.dists, 1)
            prior = q.dists[i]
            r = q.ranges[i]

            # mean-field params for this set of model params
            μ_i = μ[r]
            ω_i = ω[r]

            # obtain samples from mean-field posterior approximation
            η = randn(length(μ_i))
            ζ = center_diag_gaussian_inv(η, μ_i, exp.(ω_i))

            # inverse-transform back to domain of original prior
            θ = invlink(prior, ζ)

            # update
            var_info.vals[r] = θ

            # add the log-det-jacobian of inverse transform;
            # `logabsdet` returns `(log(abs(det(M))), sign(det(M)))` so return first entry
            elbo_acc += logabsdet(jac_inv_transform(prior, ζ))[1] / num_samples
        end

        # compute log density
        model(var_info)
        elbo_acc += var_info.logp / num_samples
    end

    # add the term for the entropy of the variational posterior
    variational_posterior_entropy = sum(ω)
    elbo_acc += variational_posterior_entropy

    elbo_acc
end

function (elbo::ELBO)(alg::ADVI, q::MeanField, model::Model, num_samples)
    # extract the mean-field Gaussian params
    θ = vcat(q.μ, q.ω)

    elbo(alg, q, model, θ, num_samples)
end

Black-box Variational Inference (BBVI)

Stuff

\begin{equation*}
\nabla_{\theta} \mathcal{L} \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_{\theta} \log q_{\theta} (z_s) \big( \log p(x, z_s) - \log q_{\theta}(z_s) \big) \quad \text{where} \quad z_s \sim q_{\theta}
\end{equation*}

Controlling the variance

  • Variance of the gradient estimator (under MC estimation of the ELBO) can be too large to be useful
Rao-Blackwellization
  • Reduces the variance of an rv. by replacing it with its conditional expectation wrt. a subset of the variables (a small numeric sketch of this is given at the end of this block)
  • How

    Simple example:

    • Two rvs $X$ and $Y$
    • Function $f(X, Y)$
    • Goal: compute expectation $\mathbb{E} \big[ f(X, Y) \big]$
    • Letting

      \begin{equation*}
\hat{f}(X) := \mathbb{E} \big[ f(X, Y) \mid X \big]
\end{equation*}

      we note that

      \begin{equation*}
\mathbb{E} \big[ \hat{f}(X) \big] = \mathbb{E} \big[ f(X, Y) \big]
\end{equation*}
    • Therefore: can use $\hat{f}(X)$ as MC approx. of $\mathbb{E} \big[ f(X, Y) \big]$, with variance

      \begin{equation*}
\Var \big( \hat{f}(X)\big) = \Var \big( f(X, Y) \big) - \mathbb{E} \big[ \big( f(X, Y) - \hat{f}(X) \big)^2 \big]
\end{equation*}

      which means that $\hat{f}(X)$ is a lower variance estimator than $f(X, Y)$.

  • In this case

    Consider mean-field approximation:

    \begin{equation*}
q_{\theta} (z) = \prod_{i=1}^{n} q_{\theta_i} \big( z_i \big)
\end{equation*}

    where $\theta_i$ denotes the parameter(s) of variational posterior of $Z_i$. Then the MC estimator for gradient wrt. $\theta_i$ is simply

    \begin{equation*}
\frac{1}{S} \sum_{s=1}^{S} \nabla_{\theta_i} \log q_{\theta_i} (z_s)  \big( \log p_i(x, z_s) - \log q_{\theta_i}(z_s) \big) \quad \text{where} \quad z_s \sim q_{\theta_i}
\end{equation*}

    <2019-06-03 Mon> But what the heck is $p_i$? Surely $p_i = p$, right?

    <2019-06-03 Mon> So you missed the bit where $q_{(i)}$ denotes the pdf of variables in the model that depend on the i-th variable, i.e. the Markov blanket of $z_i$. Then $p_i(x, z_{(i)})$ is the joint probability that depends on those variables.

    Important: model here refers to the variational distribution $q$! This means that in the case of a Mean-field approximation where the Markov blanket is an empty set, we simply sample $z$ jointly.
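
Here is the small numeric sketch of the simple Rao-Blackwellization example above (my own toy choice of $X$, $Y$ and $f$): both estimators have the same expectation, but the conditional expectation has lower variance.

import numpy as np

# X, Y independent standard normals, f(X, Y) = X + Y,
# f_hat(X) = E[f(X, Y) | X] = X (since E[Y] = 0)
rng = np.random.default_rng(0)
num_samples = 100_000

x = rng.standard_normal(num_samples)
y = rng.standard_normal(num_samples)

f = x + y      # plain MC samples of f(X, Y)
f_hat = x      # Rao-Blackwellized samples

print(f.mean(), f_hat.mean())  # both ≈ 0
print(f.var(), f_hat.var())    # ≈ 2 vs. ≈ 1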

Control variates
  • The idea is to replace the function $f$ whose expectation is being approximated by MC with another function $\hat{f}$ which has the same expectation but lower variance (a toy numeric check is given at the end of this block), i.e. choose $\hat{f}$ s.t.

    \begin{equation*}
\mathbb{E}[f(X)] = \mathbb{E}[\hat{f}(X)]
\end{equation*}

    and

    \begin{equation*}
\Var \big( f(X) \big) \ge \Var \big( \hat{f}(X) \big)
\end{equation*}
  • One particular example is, for some function $f$,

    \begin{equation*}
\hat{f}(z) := f(z) - a \big( h(z) - \mathbb{E} [h(z)] \big)
\end{equation*}
    • We can then choose $a$ to minimize $\Var \big( \hat{f}(Z) \big)$
    • Variance of $\hat{f}$ is then

      \begin{equation*}
\Var (\hat{f}) = \Var(f) + a^2 \Var(h) - 2a \Cov(f, h)
\end{equation*}
    • Therefore, good control variates have high covariance with $f$
    • Taking derivative wrt. $a$ and setting derivative equal to zero we get the optimal choice of $a$, denoted $a^*$:

      \begin{equation*}
a^* := \frac{\Cov \big( f, h \big)}{\Var(h)}
\end{equation*}

Maybe you recognize this form from somewhere? OH YEAH YOU DO, it's the same expression we have for the slope of the ordinary least squares (OLS) estimator:

\begin{equation*}
\hat{\beta} = \frac{\Cov(X, Y)}{\Var(X)}
\end{equation*}

which we also know is the minimum variance estimator in that particular case.

We already know $\mathbb{E} [\hat{f}] = \mathbb{E}[f]$, which in the linear case just means that the intercepts are the same. If we rearrange the definition of $\hat{f}$:

\begin{equation*}
\hat{f}(Z) + a \big( h(Z) - \mathbb{E}[h(Z)] \big) = f(Z)
\end{equation*}

So it's like we're performing a linear regression in the expectation space, given some function $h$?

Fun stuff we could do is to let $h$ be a parametrized function and then minimize the variance wrt. $h$ also, right future-Tor?

  • This case

    In this particular case, we can choose

    \begin{equation*}
h(z) := \nabla_{\theta} \log q_{\theta}(z)
\end{equation*}

    This always has expectation zero, which simply follows from

    \begin{equation*}
\begin{split}
  \mathbb{E} \big[ \partial_{\theta_i} \log q_{\theta}(z) \big] &= \int_{}^{} \big( \partial_{\theta_i} \log q_{\theta}(z) \big) q_{\theta}(z) \dd{z} \\
  &= \int_{}^{} \frac{1}{q_{\theta}(z)} \pdv{q_{\theta}(z)}{\theta_i} q_{\theta}(z) \dd{z} \\
  &= \int_{}^{} \pdv{q_{\theta}(z)}{\theta_i} \dd{z} \\
  &= \pdv{}{\theta_i} \int_{}^{} q_{\theta}(z) \dd{z} \\
  &= \pdv{}{\theta_i} 1 \\
  &= 0
\end{split}
\end{equation*}

    under sufficient restrictions allowing us to "move" the partial outside of the integral (e.g. smoothness wrt. $\theta$ is sufficient).

    With the Rao-Blackwellized estimator obtained in previous section, we then have

    \begin{equation*}
\begin{split}
  f_i(z) &= \nabla_{\theta_i} \log q_{\theta_i}(z)  \big( \log p(x, z) - \log q_{\theta_i}(z) \big) \\
  h_i(z) &= \nabla_{\theta_i} \log q_{\theta_i}(z)
\end{split}
\end{equation*}

    The estimate of optimal choice of scaling $a$, is then

    \begin{equation*}
\hat{a}_i^* = \frac{\sum_{d = 1}^{n_i} \widehat{\Cov}(f_i^d, h_i^d) }{\sum_{d=1}^{n_i} \widehat{\Var}(h_i^d)}
\end{equation*}

    where

    • $\widehat{\Var}$ and $\widehat{\Cov}$ denote the empirical estimators
    • $f_i^d$ denotes the d-th component of $f_i$, i.e. it can be multi-dimensional

    Therefore, we end up with the MC gradient estimator (with lower variance than the previous one):

    \begin{equation*}
\hat{\nabla}_{\theta_i} \mathcal{L} := \frac{1}{S} \sum_{s=1}^{S} \nabla_{\theta_i} \log q_{\theta_i}(z_s) \big( \log p_i(x, z_s) - \log q_{\theta_i}(z_s) - \hat{a}_i^* \big) \quad \text{where} \quad z_s \sim q_{(i)}
\end{equation*}
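
And the toy numeric check of the generic control-variate idea promised earlier (my own example, using $h(z) = z$ with known mean rather than the score function): the corrected samples have the same mean but lower variance.

import numpy as np

# Estimate E[f(Z)] with f(z) = z^2 for Z ~ N(1, 1) (true value 2), using the
# control variate h(z) = z with known E[h(Z)] = 1 and the optimal scaling
# a* = Cov(f, h) / Var(h).
rng = np.random.default_rng(0)
z = rng.normal(1.0, 1.0, size=100_000)

f = z ** 2
h = z
a_star = np.cov(f, h)[0, 1] / np.var(h)   # empirical a*
f_cv = f - a_star * (h - 1.0)             # control-variate-corrected samples

print(f.mean(), f_cv.mean())   # both ≈ 2
print(f.var(), f_cv.var())     # ≈ 6 vs. ≈ 2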

Algorithm

\begin{algorithm*}[H]
  \KwData{data $x$, joint distribution $p$, mean-field variational family $q$}
  \KwResult{posterior distribution $\hat{p}$}
  $\theta_{1:n} := \mathrm{random}()$ \\
  \While{not converged} {
    \tcp{draw samples from variational distribution}
    \For{$s = 1, \dots, S$} {
      $z_s \sim q_{\theta}$
    }

    \For{$i = 1, \dots, n$} {
      \For{$s = 1, \dots, S$} {
        $f_i[s] := \nabla_{\theta_i} \log q_{i} (z_s) \big( \log p_i(x, z_s) - \log q_i(z_s) \big)$ \\
        $h_i[s] := \nabla_{\theta_i} \log q_i(z_s)$
      }

      \tcp{compute scaling}
      $\hat{a}_i^* := \frac{\sum_{d=1}^{n_i} \widehat{\Cov}(f_i^d, h_i^d)}{\sum_{d = 1}^{n_i} \widehat{\Var}(h_i^d)}$ \\

      \tcp{lower-variance gradient estimator}
      $\hat{\nabla}_{\theta_i} \mathcal{L} := \frac{1}{S} \sum_{s=1}^{S} f_i[s] - \hat{a}_i^* h_i[s]$
    }
    \tcp{update $\theta$}
    $\dots$
  }
  \caption{Black-Box Variational Inference (BBVI)}
\end{algorithm*}

In ranganath13_black_box_variat_infer the update of $\theta$ is given as $\theta_{t + 1} := \theta_t + \rho_t \hat{\nabla}_{\theta} \mathcal{L}$, where $\rho_t$ is the $t$-th value of a Robbins-Monro sequence.
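
To tie the pieces together, here is a small numpy sketch of the whole loop on a toy conjugate model (my own example; the step-size schedule and sample sizes are ad hoc, and the control variate is applied per variational parameter rather than per latent variable, since there is only one latent variable here):

import numpy as np

# Toy model: p(z) = N(0, 1), p(x | z) = N(z, 1) with x = 1 observed, and a
# Gaussian variational posterior q_theta(z) = N(mu, sigma^2), theta = (mu, log sigma).
# The exact posterior is N(0.5, 0.5), so we expect mu -> 0.5 and sigma -> 0.71.
rng = np.random.default_rng(0)
x = 1.0

def log_normal_pdf(y, mean, std):
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((y - mean) / std) ** 2

def score(z, mu, log_sigma):
    sigma = np.exp(log_sigma)
    return np.stack([(z - mu) / sigma**2, ((z - mu) / sigma) ** 2 - 1.0], axis=-1)

mu, log_sigma = 0.0, 0.0
num_samples = 200

for t in range(2000):
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=num_samples)                     # z_s ~ q_theta
    log_w = (log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
             - log_normal_pdf(z, mu, sigma))                        # log p(x, z_s) - log q_theta(z_s)

    h = score(z, mu, log_sigma)                                     # h_i[s]
    f = h * log_w[:, None]                                          # f_i[s]

    # control-variate scaling a_i* = Cov(f_i, h_i) / Var(h_i), per parameter
    a_star = np.array([np.cov(f[:, i], h[:, i])[0, 1] / np.var(h[:, i]) for i in range(2)])
    grad = np.mean(f - a_star * h, axis=0)                          # lower-variance gradient estimate

    rho = 0.1 / (1 + t) ** 0.7                                      # Robbins-Monro-style step size
    mu += rho * grad[0]
    log_sigma += rho * grad[1]

print(mu, np.exp(log_sigma))   # should end up roughly at (0.5, 0.71), up to MC noise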

Importance-weighted Autoencoder approach

Notation

  • Unnormalized importance weights

    \begin{equation*}
w_i := \frac{p(x, z_i)}{q(z_i \mid x)}
\end{equation*}
  • Normalized importance weights

    \begin{equation*}
\tilde{w}_i := \frac{w_i}{\sum_{j=1}^{k} w_j}
\end{equation*}

Stuff

  • burda15_impor_weigh_autoen
  • Proposes new objective

    \begin{equation*}
\mathcal{L}_k(x) = \mathbb{E}_{z_1, \dots, z_k \sim q_{\theta}(z)} \bigg[ \log \bigg( \frac{1}{k} \sum_{i=1}^{k} \frac{p(x, z_i)}{q_{\theta}(z_i \mid x)} \bigg) \bigg]
\end{equation*}
    • Lower bound on the marginal log-likelihood by Jensen's inequality, since the average importance weight is an unbiased estimator of the evidence:

      \begin{equation*}
\mathcal{L}_k = \mathbb{E} \bigg[ \log \bigg( \frac{1}{k} \sum_{i=1}^{k} w_i \bigg) \bigg] \le \log \mathbb{E} \bigg[ \frac{1}{k} \sum_{i=1}^{k} w_i \bigg] = \log p(x)
\end{equation*}

      where expectations are wrt. $q(z \mid x)$

  • Gradient

    \begin{equation*}
\begin{split}
  \nabla_{\theta} \mathcal{L}_k(x) &= \mathbb{E}_{\varepsilon_1, \dots, \varepsilon_k} \bigg[ \sum_{i=1}^{k} \tilde{w}_i \nabla_{\theta} \log \Big( w\big(x, \underbrace{g(x, \varepsilon_i, \theta)}_{=: z_i}, \theta\big) \Big) \bigg]
\end{split}
\end{equation*}

    where we are using the reparametrization trick with $g$ being the reparametrization function

    • This then has to be empirically approximated, e.g. by sampling $z_{i, m}$ for $m = 1, \dots, M$ and computing the empirical estimate
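
A small numpy sketch of estimating the bound itself (not its gradient) on the toy conjugate model used earlier (my own example): with $k = 1$ it is just the usual ELBO estimate, and larger $k$ gives a tighter bound on $\log p(x)$.

import numpy as np

# Toy model: p(z) = N(0, 1), p(x | z) = N(z, 1), q(z | x) = N(mu, sigma^2).

def log_normal_pdf(y, mean, std):
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((y - mean) / std) ** 2

def iwae_bound(x, mu, sigma, k, num_outer=2_000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.normal(mu, sigma, size=(num_outer, k))                  # z_1, ..., z_k ~ q(z | x)
    log_w = (log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
             - log_normal_pdf(z, mu, sigma))                        # log w_i
    # log( (1/k) sum_i w_i ), computed stably via log-sum-exp
    log_avg_w = np.logaddexp.reduce(log_w, axis=1) - np.log(k)
    return np.mean(log_avg_w)

print(iwae_bound(x=1.0, mu=0.0, sigma=1.0, k=1))    # ordinary ELBO
print(iwae_bound(x=1.0, mu=0.0, sigma=1.0, k=50))   # tighter; approaches log p(x) ≈ -1.52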

How to implement

  • Could just implement an alternative objective function and take the gradient?

Appendix A: Definitions

variational density
our approximate probability distribution $q(\mathbf{z})$
variational parameter
a parameter required to compute our variational density, i.e. parameters which define the approx. distribution over the latent variables. So sort of like "latent-latent variables", or as I like to call them, doubly latent variables (•_•) / ( •_•)>⌐■-■ / (⌐■_■), Disclaimer: I've never called them that before in my life..

Bibliography
