Variational Inference

Overview

In variational inference, the posterior distribution over a set of unobserved (latent) variables $\mathbf{z}$ given some data $\mathbf{x}$ is approximated by a variational distribution $q(\mathbf{z})$:

\[ p(\mathbf{z} \mid \mathbf{x}) \approx q(\mathbf{z}) \]

The distribution $q(\mathbf{z})$ is restricted to belong to a family of distributions of simpler form than $p(\mathbf{z} \mid \mathbf{x})$, selected with the intention of making $q(\mathbf{z})$ similar to the true posterior, $p(\mathbf{z} \mid \mathbf{x})$.

We're basically making life simpler for ourselves by casting approximate conditional inference as an optimization problem.

The evidence lower bound (ELBO)

  • Specify a family $\mathcal{Q}$ of densities over the latent variables
  • Each $q(\mathbf{z}) \in \mathcal{Q}$ is a candidate approximation to the exact conditional distribution
  • The goal is then to find the best candidate, i.e. the one closest in KL divergence to the exact conditional distribution

\[ q^*(\mathbf{z}) = \operatorname*{arg\,min}_{q(\mathbf{z}) \in \mathcal{Q}} \; \mathrm{KL}\big( q(\mathbf{z}) \;\|\; p(\mathbf{z} \mid \mathbf{x}) \big) \]

  • $q^*(\mathbf{z})$ is the best approx. to the conditional distribution, within that family
  • However, the equation above requires us to compute the log evidence $\log p(\mathbf{x})$, which may be intractable, OR DOES IT?!

    \[ \mathrm{KL}\big( q(\mathbf{z}) \;\|\; p(\mathbf{z} \mid \mathbf{x}) \big) = \mathbb{E}_q\big[ \log q(\mathbf{z}) \big] - \mathbb{E}_q\big[ \log p(\mathbf{z}, \mathbf{x}) \big] + \log p(\mathbf{x}) \]

  • Buuut, since we want to minimize wrt. $q(\mathbf{z})$, we can simply minimize the above without having to worry about $\log p(\mathbf{x})$, since it is constant wrt. $q$!
  • By dropping the constant term (wrt. $q$) and moving the sign outside, we get

\[ \mathrm{ELBO}(q) = \mathbb{E}_q\big[ \log p(\mathbf{z}, \mathbf{x}) \big] - \mathbb{E}_q\big[ \log q(\mathbf{z}) \big] \]

  • Thus, maximizing the $\mathrm{ELBO}(q)$ is equivalent to minimizing the $\mathrm{KL}\big( q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}) \big)$ divergence

Why use the above representation of the ELBO as our objective function? Because $p(\mathbf{z}, \mathbf{x})$ can be rewritten as $p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})$, thus we simply need to come up with:

  • A model for the likelihood of the data given the latent variables, $p(\mathbf{x} \mid \mathbf{z})$
  • A prior probability for the latent variables, $p(\mathbf{z})$
  • We can rewrite the $\mathrm{ELBO}(q)$ to give us some intuition about the optimal variational density

\[ \mathrm{ELBO}(q) = \mathbb{E}_q\big[ \log p(\mathbf{x} \mid \mathbf{z}) \big] - \mathrm{KL}\big( q(\mathbf{z}) \;\|\; p(\mathbf{z}) \big) \]

  • Basically says: "maximizing the evidence lower bound amounts to maximizing the (expected log-)likelihood and minimizing the divergence from the prior distribution, combined"
    • $\mathbb{E}_q\big[ \log p(\mathbf{x} \mid \mathbf{z}) \big]$ encourages densities that increase the likelihood of the data
    • $- \mathrm{KL}\big( q(\mathbf{z}) \,\|\, p(\mathbf{z}) \big)$ encourages densities close to the prior
  • Another property (and the reason for the name) is that it puts a lower bound on the (log) evidence, $\log p(\mathbf{x})$

\[ \log p(\mathbf{x}) = \mathrm{KL}\big( q(\mathbf{z}) \;\|\; p(\mathbf{z} \mid \mathbf{x}) \big) + \mathrm{ELBO}(q) \geq \mathrm{ELBO}(q) \]

  • Which means that, since $\log p(\mathbf{x})$ is fixed, if we increase the $\mathrm{ELBO}(q)$ => the $\mathrm{KL}\big( q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}) \big)$ must decrease (see the Monte Carlo sketch below)
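
To make the identity concrete, here is a minimal Monte Carlo sketch for a toy conjugate Gaussian model (the model, the variational family and all numbers are my own illustrative choices, not anything prescribed above): the estimate of $\mathrm{ELBO}(q) + \mathrm{KL}\big(q \,\|\, p(z \mid \mathbf{x})\big)$ stays (approximately) constant at $\log p(\mathbf{x})$ no matter which $q$ we pick, while the ELBO alone is largest when $q$ equals the exact posterior.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model: z ~ N(0, 1) prior, x_i | z ~ N(z, 1) likelihood (conjugate, so p(z | x) is Gaussian).
x = rng.normal(loc=1.5, scale=1.0, size=20)
n = len(x)
post_var = 1.0 / (1.0 + n)
post_mean = post_var * x.sum()

def elbo(q_mean, q_std, num_samples=100_000):
    """Monte Carlo estimate of ELBO(q) = E_q[log p(z, x)] - E_q[log q(z)]."""
    z = rng.normal(q_mean, q_std, size=num_samples)
    log_joint = norm(0, 1).logpdf(z) + norm.logpdf(x[:, None], loc=z[None, :], scale=1.0).sum(axis=0)
    log_q = norm(q_mean, q_std).logpdf(z)
    return np.mean(log_joint - log_q)

def kl_q_posterior(q_mean, q_std):
    """Closed-form KL(q || p(z | x)) between two univariate Gaussians."""
    return (np.log(np.sqrt(post_var) / q_std)
            + (q_std**2 + (q_mean - post_mean)**2) / (2 * post_var) - 0.5)

for q_mean, q_std in [(0.0, 1.0), (1.0, 0.5), (post_mean, np.sqrt(post_var))]:
    e, kl = elbo(q_mean, q_std), kl_q_posterior(q_mean, q_std)
    print(f"q = N({q_mean:.2f}, {q_std:.2f}^2): ELBO = {e:.3f}, ELBO + KL = {e + kl:.3f}")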

Examples of families

Mean-field variational family

  • Assumes the latent variables $\mathbf{z}$ to be mutually independent, i.e.

\[ q(\mathbf{z}) = \prod_{j=1}^{m} q_j(z_j) \]

  • Does not model the observed data: $\mathbf{x}$ does not appear in the equation => it's the $\mathrm{ELBO}(q)$ which connects the fitted variational density to the data (a concrete Gaussian example is sketched below)
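
For concreteness, a common choice (an illustrative example on my part, not something the mean-field assumption itself dictates) is to let each factor be Gaussian with its own variational parameters:

\[ q(\mathbf{z}; \lambda) = \prod_{j=1}^{m} \mathcal{N}\big( z_j \mid \mu_j, \sigma_j^2 \big), \qquad \lambda = \{ (\mu_j, \sigma_j) \}_{j=1}^{m}, \]

so that $\log q(\mathbf{z}; \lambda)$ is just a sum of univariate Gaussian log-densities.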

Optimization

Coordinate ascent mean-field variational inference (CAVI)

Notation

  • $\mathbf{x}$ is the data
  • $\mathbf{z}$ are the latent variables of our model for the data
  • $q(\mathbf{z})$ is the variational density
  • $q_j(z_j)$ is a variational factor
  • $\mathbb{E}_{-j}\big[\, \cdot \,\big] = \mathbb{E}_{\prod_{l \ne j} q_l}\big[\, \cdot \,\big]$, i.e. the expectation over all factors of $q$ keeping the $j$-th factor constant

Overview

Based on coordinate ascent, thus no gradient information required.

Iteratively optimizes each factor of the mean-field variational density, while holding others fixed. Thus, climbing the ELBO to a local optimum.

Algorithm

  • Input: Model $p(\mathbf{x}, \mathbf{z})$, data set $\mathbf{x}$
  • Output: Variational density $q(\mathbf{z}) = \prod_{j=1}^{m} q_j(z_j)$

Then the actual algorithm goes as follows:

cavi_algorithm.PNG

Pseudo implementation in Python

Which I believe could look something like this when written in Python:

import numpy as np


class UpdateableModel:
    """Represents a single variational factor q_j(z_j)."""

    def __init__(self):
        # variational parameter(s) for this factor, randomly initialized
        self.latent_var = np.random.rand()

    def __call__(self, z):
        # compute q_j(z_j), or equivalently q(z_j | z_{l != j}),
        # since the mean-field approx. allows us to assume independence
        raise NotImplementedError

    @staticmethod
    def proba_product(qs, z):
        # compute q_1(z_1) * q_2(z_2) * ... * q_m(z_m)
        res = 1.0
        for q_j, z_j in zip(qs, z):
            res *= q_j(z_j)
        return res


def compute_elbo(q, p_joint, z, x):
    # crude (single-point) approximation of
    # ELBO(q) = E[log p(z, x)] - E[log q(z)]
    joint_expect = 0.0
    latent_expect = 0.0

    for q_j, z_j in zip(q, z):
        joint_expect += q_j(z_j) * np.log(p_joint(z=z, x=x))
        latent_expect += q_j(z_j) * np.log(q_j(z_j))

    return joint_expect - latent_expect


def cavi(model, z, x, epsilon=1e-3):
    """
    Parameters
    ----------
    model : callable
        Computes our model probability $p(z, x)$ or $p(z_j | z_{l != j}, x)$
    z : array-like
        Initialized values for latent variables, e.g. for a GMM we would have
        mu = z[0], sigma = z[1].
    x : array-like
        Represents the data.
    epsilon : float
        Convergence tolerance on the change in the ELBO.
    """
    m = len(z)
    q = [UpdateableModel() for _ in range(m)]

    elbo_prev = -np.inf
    elbo = compute_elbo(q, model, z=z, x=x)

    while abs(elbo - elbo_prev) > epsilon:
        for j in range(m):
            # E_{-j}[log p(z_j, z_{-j}, x)]: expectation over all factors but the j-th
            expect_log = 0.0
            for l in range(m):
                if l == j:
                    continue
                expect_log += q[l](z[l]) * np.log(model(fixed=j, z=z, x=x))
            # CAVI update: q_j(z_j) ∝ exp(E_{-j}[log p(z_j, z_{-j}, x)])
            # (normalization omitted in this sketch)
            q[j].latent_var = np.exp(expect_log)

        elbo_prev = elbo
        elbo = compute_elbo(q, model, z=z, x=x)

    return q

Mean-field approx. → assume latent variables are independent of each other → $q(\mathbf{z})$ can simply be represented as a vector with the $j$-th entry being an independent model corresponding to $q_j(z_j)$

Derivation of update

We simply rewrite the ELBO as a function of $q_j$, since the independence between the variational factors $q_j$ implies that we can maximize the ELBO wrt. each of them separately.

\[ \mathrm{ELBO}(q_j) = \mathbb{E}_j\Big[ \mathbb{E}_{-j}\big[ \log p(z_j, \mathbf{z}_{-j}, \mathbf{x}) \big] \Big] - \mathbb{E}_j\big[ \log q_j(z_j) \big] + \mathrm{const} \]

where we have written the expectation $\mathbb{E}\big[ \log p(\mathbf{z}, \mathbf{x}) \big]$ wrt. $q(\mathbf{z})$ using iterated expectation.
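
Maximizing this expression wrt. $q_j$ (I omit the short variational argument) gives the standard CAVI coordinate update:

\[ q_j^*(z_j) \propto \exp\Big\{ \mathbb{E}_{-j}\big[ \log p(z_j \mid \mathbf{z}_{-j}, \mathbf{x}) \big] \Big\} \propto \exp\Big\{ \mathbb{E}_{-j}\big[ \log p(z_j, \mathbf{z}_{-j}, \mathbf{x}) \big] \Big\} \]

which is exactly the update applied inside the inner loop of the pseudo-implementation above.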

Gradient

ELBO

Used in BBVI (REINFORCE)

  • Can obtain an unbiased gradient estimator by sampling from the variational distribution $q(\mathbf{z} \mid \lambda)$, without having to compute the ELBO analytically
  • Only requires computation of the score function of the variational posterior: $\nabla_\lambda \log q(\mathbf{z} \mid \lambda)$
  • Given by

    \[ \nabla_\lambda \mathrm{ELBO}(q) = \mathbb{E}_{q(\mathbf{z} \mid \lambda)}\big[ \nabla_\lambda \log q(\mathbf{z} \mid \lambda) \big( \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z} \mid \lambda) \big) \big] \]

    • An unbiased estimator is obtained by sampling from $q(\mathbf{z} \mid \lambda)$ and averaging (see the numpy sketch below)
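
A minimal numpy sketch of this estimator, applied to the same kind of toy conjugate Gaussian model as earlier, with a Gaussian variational posterior $q(z \mid \lambda) = \mathcal{N}(\mu, \sigma^2)$ and $\lambda = (\mu, \log\sigma)$; the model, step size and sample counts are illustrative assumptions on my part.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Toy model: z ~ N(0, 1), x_i | z ~ N(z, 1); variational posterior q(z | lambda) = N(mu, sigma^2).
x = rng.normal(loc=1.5, scale=1.0, size=20)

def log_joint(z):
    return norm(0, 1).logpdf(z) + norm.logpdf(x[:, None], loc=z[None, :], scale=1.0).sum(axis=0)

def score_function_grad(mu, log_sigma, num_samples=1_000):
    """Unbiased MC estimate of E_q[grad_lambda log q(z) * (log p(x, z) - log q(z))]."""
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=num_samples)
    log_q = norm(mu, sigma).logpdf(z)
    # score function of N(mu, sigma^2) wrt (mu, log_sigma)
    score_mu = (z - mu) / sigma**2
    score_log_sigma = (z - mu)**2 / sigma**2 - 1.0
    weight = log_joint(z) - log_q
    return np.array([np.mean(score_mu * weight), np.mean(score_log_sigma * weight)])

# Plain stochastic gradient ascent on the ELBO.
lam = np.array([0.0, 0.0])  # (mu, log_sigma)
for _ in range(2_000):
    lam += 1e-3 * score_function_grad(*lam)

# Compare with the exact posterior: mean = sum(x) / (len(x) + 1), var = 1 / (len(x) + 1).
print("mu, sigma:", lam[0], np.exp(lam[1]))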

Reparametrization trick

  • $q(\mathbf{z} \mid \lambda)$ needs to be reparametrizable
    • E.g. $q(z \mid \lambda) = \mathcal{N}\big(z \mid \mu(\lambda), \sigma(\lambda)^2\big)$ and

      \[ z = \mu(\lambda) + \sigma(\lambda)\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 1) \]

      where $\mu(\lambda)$ and $\sigma(\lambda)$ are the parametrized mean and std. dev.

  • Then we instead do the following:

    \[ \nabla_\lambda\, \mathbb{E}_{q(\mathbf{z} \mid \lambda)}\big[ f(\mathbf{z}) \big] = \mathbb{E}_{p(\varepsilon)}\Big[ \nabla_\lambda\, f\big( g(\varepsilon, \lambda) \big) \Big] \]

    where $g$ is some deterministic function s.t. $\mathbf{z} = g(\varepsilon, \lambda)$, and $p(\varepsilon)$ is the "noise" distribution

  • Observe that this then also takes advantage of the structure of the joint distribution
    • But it also requires the joint distribution to be differentiable wrt. $\mathbf{z}$
  • Often the entropy of $q(\mathbf{z} \mid \lambda)$ can be computed analytically, which reduces the variance of the gradient estimate since we then only have to estimate the expectation of the first term
    • Recall that

      \[ \mathrm{ELBO}(q) = \mathbb{E}_q\big[ \log p(\mathbf{x}, \mathbf{z}) \big] - \mathbb{E}_q\big[ \log q(\mathbf{z} \mid \lambda) \big] = \mathbb{E}_q\big[ \log p(\mathbf{x}, \mathbf{z}) \big] + \mathbb{H}\big( q(\mathbf{z} \mid \lambda) \big) \]

    • E.g. in ADVI, where we use a Gaussian mean-field approximation, the entropy term reduces to $\sum_k \omega_k$ (up to an additive constant), where $\omega_k = \log \sigma_k$
  • Assumptions
    • $p(\mathbf{x}, \mathbf{z})$ must be differentiable (wrt. $\mathbf{z}$)
    • $q(\mathbf{z} \mid \lambda)$ must be differentiable (wrt. $\lambda$)
  • Notes
    • ✓ Usually lower variance than REINFORCE, and the variance can potentially be reduced further if an analytical entropy is available (see the numpy sketch after this list)
    • ✗ Being reparametrizable (and the reparametrization being differentiable) limits the family of variational posteriors
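
A minimal numpy sketch of the reparametrization trick for a Gaussian $q(z \mid \lambda) = \mathcal{N}(\mu, \sigma^2)$; the test function $f(z) = z^2$ and all values are my own illustrative choices, and I differentiate $f$ by hand rather than with an autodiff framework.

import numpy as np

rng = np.random.default_rng(2)

# q(z | lambda) = N(mu, sigma^2), reparametrized as z = g(eps, lambda) = mu + sigma * eps, eps ~ N(0, 1).
mu, sigma = 0.7, 1.3

# Test function and its derivative (done by hand here; normally an autodiff framework does this).
f = lambda z: z**2            # E_q[f(z)] = mu^2 + sigma^2
df = lambda z: 2.0 * z

eps = rng.normal(size=100_000)
z = mu + sigma * eps          # reparametrized samples

# grad_mu E_q[f(z)] = E_eps[f'(g(eps)) * dg/dmu] = E_eps[f'(z) * 1]
grad_mu = np.mean(df(z) * 1.0)
# grad_sigma E_q[f(z)] = E_eps[f'(g(eps)) * dg/dsigma] = E_eps[f'(z) * eps]
grad_sigma = np.mean(df(z) * eps)

print(grad_mu, 2 * mu)        # estimate vs exact 2 * mu
print(grad_sigma, 2 * sigma)  # estimate vs exact 2 * sigma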

Differentiable lower-bounds of ELBO

Methods

  • Automatic Differentiation Variational Inference (ADVI)
    • Objective:

      \[ \mathcal{L}(\mu, \omega) = \mathbb{E}_{\eta \sim \mathcal{N}(0, I)}\Big[ \log p\big(\mathbf{x}, T^{-1}(\zeta)\big) + \log \big| \det J_{T^{-1}}(\zeta) \big| \Big] + \sum_{k} \omega_k, \qquad \zeta = \mu + e^{\omega} \odot \eta \]

      where $T$ maps the latent variables to the unconstrained space $\mathbb{R}^K$ and $\sigma = e^{\omega}$ is the vector of mean-field std. deviations

    • Gradient estimate:

      \[ \widehat{\nabla}_{\mu, \omega}\, \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \nabla_{\mu, \omega} \Big[ \log p\big(\mathbf{x}, T^{-1}(\zeta_s)\big) + \log \big| \det J_{T^{-1}}(\zeta_s) \big| + \textstyle\sum_{k} \omega_k \Big], \qquad \zeta_s = \mu + e^{\omega} \odot \eta_s, \quad \eta_s \sim \mathcal{N}(0, I) \]

      i.e. a reparametrized Monte Carlo estimate, differentiating through $\zeta_s$

    • Variational posterior:
      • Gaussian Mean-field approx.
    • Assumptions:
      • $p(\mathbf{x}, \mathbf{z})$ is transformable in each component of $\mathbf{z}$, i.e. the support of every latent variable can be mapped to the real line
      • The latent variables $z_i$ and $z_j$ are independent under the variational posterior
    • Notes:
      • ✓ If applicable, it just works
      • ✓ Easy to implement
      • ✗ Assumes independence
      • ✗ Restrictive in the choice of variational posterior $q(\mathbf{z} \mid \lambda)$ (i.e. Gaussian mean-field)
  • Black-box Variational Inference (BBVI)
    • Objective

      \[ \mathcal{L}(\lambda) = \mathbb{E}_{q(\mathbf{z} \mid \lambda)}\big[ \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z} \mid \lambda) \big] \]

    • Gradient estimate

      \[ \widehat{\nabla}_{\lambda}\, \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda} \log q(\mathbf{z}_s \mid \lambda) \big( \log p(\mathbf{x}, \mathbf{z}_s) - \log q(\mathbf{z}_s \mid \lambda) \big), \qquad \mathbf{z}_s \sim q(\mathbf{z} \mid \lambda) \]

      • Using variance reduction techniques (Rao-Blackwellization and control variates), for each $\lambda_i$,

        \[ \widehat{\nabla}_{\lambda_i}\, \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q_i(z_{i,s} \mid \lambda_i) \big( \log p_i(\mathbf{x}, \mathbf{z}_{(i),s}) - \log q_i(z_{i,s} \mid \lambda_i) - \hat{a}_i^* \big) \]

        where

        \[ \hat{a}_i^* = \frac{\sum_{d} \widehat{\mathrm{Cov}}\big( f_i^d, h_i^d \big)}{\sum_{d} \widehat{\mathrm{Var}}\big( h_i^d \big)}, \qquad f_i = \nabla_{\lambda_i} \log q_i(z_i \mid \lambda_i) \big( \log p_i(\mathbf{x}, \mathbf{z}_{(i)}) - \log q_i(z_i \mid \lambda_i) \big), \qquad h_i = \nabla_{\lambda_i} \log q_i(z_i \mid \lambda_i) \]

        ($p_i$ and $q_{(i)}$ are explained in the BBVI section below, where these expressions are derived)

    • Variational posterior
      • Any $q(\mathbf{z} \mid \lambda)$ we can sample from and whose score function $\nabla_\lambda \log q(\mathbf{z} \mid \lambda)$ we can evaluate
    • Assumptions
      • We can evaluate $\log p(\mathbf{x}, \mathbf{z})$ pointwise (no gradients of the model required)
    • Notes
      • ✓ Black-box: applies to essentially arbitrary models, including non-differentiable ones
      • ✗ Gradient estimator usually has high variance, even with variance-reduction techniques

Automatic Differentiation Variational Inference (ADVI)

Example implementation

using ForwardDiff
using Flux.Tracker
using Flux.Optimise
using DiffResults
using LinearAlgebra: norm, logabsdet
using ProgressMeter

# NOTE: assumes the relevant Turing.jl internals (`Model`, `VarInfo`, `MeanField`,
# `ELBO`, `VariationalInference`, `ADBackend`, etc.) are in scope.


"""
    ADVI(samples_per_step = 10, max_iters = 5000)

Automatic Differentiation Variational Inference (ADVI) for a given model.
"""
struct ADVI{AD} <: VariationalInference{AD}
    samples_per_step # number of samples used to estimate the ELBO in each optimization step
    max_iters        # maximum number of gradient steps used in optimization
end

ADVI(args...) = ADVI{ADBackend()}(args...)
ADVI() = ADVI(10, 5000)

alg_str(::ADVI) = "ADVI"

vi(model::Model, alg::ADVI; optimizer = ADAGrad()) = begin
    # setup
    var_info = VarInfo()
    model(var_info, SampleFromUniform())
    num_params = size(var_info.vals, 1)

    dists = var_info.dists
    ranges = var_info.ranges

    q = MeanField(zeros(num_params), zeros(num_params), dists, ranges)

    # construct objective
    elbo = ELBO()

    Turing.DEBUG && @debug "Optimizing ADVI..."
    θ = optimize(elbo, alg, q, model; optimizer = optimizer)
    μ, ω = θ[1:length(q)], θ[length(q) + 1:end]

    # TODO: make mutable instead?
    MeanField(μ, ω, dists, ranges) 
end

# TODO: implement optimize like this?
# (advi::ADVI)(elbo::ELBO, q::MeanField, model::Model) = begin
# end

function optimize(elbo::ELBO, alg::ADVI, q::MeanField, model::Model; optimizer = ADAGrad())
    θ = randn(2 * length(q))
    optimize!(elbo, alg, q, model, θ; optimizer = optimizer)

    return θ
end

function optimize!(elbo::ELBO, alg::ADVI{AD}, q::MeanField, model::Model, θ; optimizer = ADAGrad()) where AD
    alg_name = alg_str(alg)
    samples_per_step = alg.samples_per_step
    max_iters = alg.max_iters

    # number of previous gradients to use to compute `s` in adaGrad
    stepsize_num_prev = 10

    # setup
    # var_info = Turing.VarInfo()
    # model(var_info, Turing.SampleFromUniform())
    # num_params = size(var_info.vals, 1)
    num_params = length(q)

    # # buffer
    # θ = zeros(2 * num_params)

    # HACK: re-use previous gradient `acc` if equal in value
    # Can cause issues if two entries have identical values
    if θ ∉ keys(optimizer.acc)
        vs = [v for v ∈ keys(optimizer.acc)]
        idx = findfirst(w -> vcat(q.μ, q.ω) == w, vs)
        if idx != nothing
            @info "[$alg_name] Re-using previous optimizer accumulator"
            θ .= vs[idx]
        end
    else
        @info "[$alg_name] Already present in optimizer acc"
    end

    diff_result = DiffResults.GradientResult(θ)

    # TODO: in (Blei et al, 2015) TRUNCATED ADAGrad is suggested; this is not available in Flux.Optimise
    # Maybe consider contributing a truncated ADAGrad to Flux.Optimise

    i = 0
    prog = PROGRESS[] ? ProgressMeter.Progress(max_iters, 1, "[$alg_name] Optimizing...", 0) : 0

    time_elapsed = @elapsed while (i < max_iters) # & converged # <= add criterion? A running mean maybe?
        # TODO: separate into a `grad(...)` call; need to manually provide `diff_result` buffers
        # ForwardDiff.gradient!(diff_result, f, x)
        grad!(elbo, alg, q, model, θ, diff_result, samples_per_step)

        # apply update rule
        Δ = DiffResults.gradient(diff_result)
        Δ = Optimise.apply!(optimizer, θ, Δ)
        @. θ = θ - Δ

        Turing.DEBUG && @debug "Step $i" Δ DiffResults.value(diff_result) norm(DiffResults.gradient(diff_result))
        PROGRESS[] && (ProgressMeter.next!(prog))

        i += 1
    end

    @info time_elapsed

    return θ
end

function grad!(vo::ELBO, alg::ADVI{AD}, q::MeanField, model::Model, θ::AbstractVector{T}, out::DiffResults.MutableDiffResult, args...) where {T <: Real, AD <: ForwardDiffAD}
    # TODO: this probably slows down execution quite a bit; is there a better way of doing this?
    f(θ_) = - vo(alg, q, model, θ_, args...)

    chunk_size = getchunksize(alg)
    # Set chunk size and do ForwardMode.
    chunk = ForwardDiff.Chunk(min(length(θ), chunk_size))
    config = ForwardDiff.GradientConfig(f, θ, chunk)
    ForwardDiff.gradient!(out, f, θ, config)
end

# TODO: implement for `Tracker`
# function grad(vo::ELBO, alg::ADVI, q::MeanField, model::Model, f, autodiff::Val{:backward})
#     vo_tracked, vo_pullback = Tracker.forward()
# end
function grad!(vo::ELBO, alg::ADVI{AD}, q::MeanField, model::Model, θ::AbstractVector{T}, out::DiffResults.MutableDiffResult, args...) where {T <: Real, AD <: TrackerAD}
    θ_tracked = Tracker.param(θ)
    y = - vo(alg, q, model, θ_tracked, args...)
    Tracker.back!(y, 1.0)

    DiffResults.value!(out, Tracker.data(y))
    DiffResults.gradient!(out, Tracker.grad(θ_tracked))
end

function (elbo::ELBO)(alg::ADVI, q::MeanField, model::Model, θ::AbstractVector{T}, num_samples) where T <: Real
    # setup
    var_info = Turing.VarInfo()

    # initialize `VarInfo` object
    model(var_info, Turing.SampleFromUniform())

    num_params = length(q)
    μ, ω = θ[1:num_params], θ[num_params + 1: end]

    elbo_acc = 0.0

    # TODO: instead use `rand(q, num_samples)` and iterate through?

    for i = 1:num_samples
        # iterate through priors, sample and update
        for j = 1:size(q.dists, 1)
            prior = q.dists[j]
            r = q.ranges[j]

            # mean-field params for this set of model params
            μ_j = μ[r]
            ω_j = ω[r]

            # obtain samples from mean-field posterior approximation
            η = randn(length(μ_j))
            ζ = center_diag_gaussian_inv(η, μ_j, exp.(ω_j))

            # inverse-transform back to domain of original prior
            θ_j = invlink(prior, ζ)

            # update
            var_info.vals[r] = θ_j

            # add the log-det-jacobian of inverse transform;
            # `logabsdet` returns `(log(abs(det(M))), sign(det(M)))` so take the first entry
            elbo_acc += logabsdet(jac_inv_transform(prior, ζ))[1] / num_samples
        end

        # compute log density
        model(var_info)
        elbo_acc += var_info.logp / num_samples
    end

    # add the term for the entropy of the variational posterior
    variational_posterior_entropy = sum(ω)
    elbo_acc += variational_posterior_entropy

    elbo_acc
end

function (elbo::ELBO)(alg::ADVI, q::MeanField, model::Model, num_samples)
    # extract the mean-field Gaussian params
    θ = vcat(q.μ, q.ω)

    elbo(alg, q, model, θ, num_samples)
end
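
To distill what the `(elbo::ELBO)(...)` method above is doing, here is a minimal numpy sketch of the same ELBO estimate for a model with a single positivity-constrained parameter (a Gamma-Exponential toy model of my own choosing, with $T(\theta) = \log\theta$ as the unconstraining transform): sample in the unconstrained space via the reparametrization, map back through $T^{-1}$, add the log-abs-det-Jacobian, and finally add the Gaussian entropy term.

import numpy as np
from scipy.stats import norm, gamma, expon

rng = np.random.default_rng(3)

# Toy model: theta ~ Gamma(2, 1) prior, x_i | theta ~ Exponential(rate = theta); theta > 0.
x = rng.exponential(scale=1.0 / 2.5, size=30)

def log_joint(theta):
    return gamma(a=2.0).logpdf(theta) + expon(scale=1.0 / theta).logpdf(x).sum()

def advi_elbo_estimate(mu, omega, num_samples=200):
    """MC estimate of the ADVI objective for q(zeta) = N(mu, exp(omega)^2), zeta = log(theta)."""
    sigma = np.exp(omega)
    acc = 0.0
    for _ in range(num_samples):
        eta = rng.normal()                 # eta ~ N(0, 1)
        zeta = mu + sigma * eta            # reparametrized sample in unconstrained space
        theta = np.exp(zeta)               # T^{-1}(zeta): back to the constrained space
        log_abs_det_jac = zeta             # log |d theta / d zeta| = log exp(zeta) = zeta
        acc += (log_joint(theta) + log_abs_det_jac) / num_samples
    # Entropy of N(mu, sigma^2) is omega (= log sigma) up to an additive constant.
    return acc + omega

print(advi_elbo_estimate(mu=0.5, omega=-1.0))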

Black-box Variational Inference (BBVI)

Stuff

The score-function (REINFORCE) gradient of the ELBO, which BBVI estimates by Monte Carlo:

\[ \nabla_\lambda\, \mathcal{L}(\lambda) = \mathbb{E}_{q(\mathbf{z} \mid \lambda)}\big[ \nabla_\lambda \log q(\mathbf{z} \mid \lambda) \big( \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z} \mid \lambda) \big) \big] \]

Controlling the variance

  • Variance of the gradient estimator (under MC estimation of the ELBO) can be too large to be useful
Rao-Blackwellization
  • Reduces the variance of a rv. by replacing it with its conditional expectation wrt. a subset of the variables
  • How

    Simple example:

    • Two rvs $X$ and $Y$
    • Function $J(X, Y)$
    • Goal: compute expectation $\mathbb{E}\big[ J(X, Y) \big]$
    • Letting

      \[ \hat{J}(X) = \mathbb{E}\big[ J(X, Y) \mid X \big] \]

      we note that

      \[ \mathbb{E}\big[ \hat{J}(X) \big] = \mathbb{E}\big[ J(X, Y) \big] \]

    • Therefore: can use $\hat{J}(X)$ as MC approx. of $\mathbb{E}\big[ J(X, Y) \big]$, with variance

      \[ \mathrm{Var}\big( \hat{J}(X) \big) = \mathrm{Var}\big( J(X, Y) \big) - \mathbb{E}\Big[ \big( J(X, Y) - \hat{J}(X) \big)^2 \Big] \]

      which means that $\hat{J}(X)$ is a lower-variance estimator than $J(X, Y)$.

  • In this case

    Consider the mean-field approximation:

    \[ q(\mathbf{z} \mid \lambda) = \prod_{i} q_i(z_i \mid \lambda_i) \]

    where $\lambda_i$ denotes the parameter(s) of the variational posterior of $z_i$. Then the MC estimator for the gradient wrt. $\lambda_i$ is simply

    \[ \widehat{\nabla}_{\lambda_i}\, \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q_i(z_{i,s} \mid \lambda_i) \big( \log p_i(\mathbf{x}, \mathbf{z}_{(i),s}) - \log q_i(z_{i,s} \mid \lambda_i) \big), \qquad \mathbf{z}_{(i),s} \sim q_{(i)} \]

    <2019-06-03 Mon> But what the heck is $p_i(\mathbf{x}, \mathbf{z}_{(i)})$? Surely it should just be $p(\mathbf{x}, \mathbf{z})$, right?

    <2019-06-03 Mon> So you missed the bit where $q_{(i)}$ denotes the pdf of the variables in the model that depend on the i-th variable, i.e. the Markov blanket of $z_i$. Then $p_i(\mathbf{x}, \mathbf{z}_{(i)})$ is the part of the joint probability that depends on those variables.

    Important: "model" here refers to the variational distribution $q$! This means that in the case of a mean-field approximation, where the Markov blanket is an empty set, we simply sample $\mathbf{z}_{(i)}$ jointly.
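
A quick numpy check of the Rao-Blackwellization idea, with a toy pair of variables of my own choosing for which the conditional expectation is available in closed form: both estimators target the same expectation, but the Rao-Blackwellized one has lower variance.

import numpy as np

rng = np.random.default_rng(4)

# Toy example: X ~ N(0, 1), Y | X ~ N(X, 1), and J(X, Y) = Y^2.
# Rao-Blackwellized version: J_hat(X) = E[J(X, Y) | X] = X^2 + 1 (known in closed form here).
num_samples = 100_000
x = rng.normal(size=num_samples)
y = rng.normal(loc=x, scale=1.0)

j = y**2          # plain MC estimator terms
j_hat = x**2 + 1  # Rao-Blackwellized terms

print("E[J]     ~", j.mean(), " var:", j.var())
print("E[J_hat] ~", j_hat.mean(), " var:", j_hat.var())
# Both means agree (the true value is 2), but Var(J_hat) <= Var(J).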

Control variates
  • The idea is to replace the function $f(z)$ being approximated by MC with another function $\hat{f}(z)$ which has the same expectation but lower variance, i.e. choose $\hat{f}(z)$ s.t.

    \[ \mathbb{E}_q\big[ \hat{f}(z) \big] = \mathbb{E}_q\big[ f(z) \big] \]

    and

    \[ \mathrm{Var}_q\big( \hat{f}(z) \big) \le \mathrm{Var}_q\big( f(z) \big) \]

  • One particular example is, for some function $h(z)$,

    \[ \hat{f}(z) = f(z) - a \big( h(z) - \mathbb{E}\big[ h(z) \big] \big) \]

    • We can then choose $a$ to minimize $\mathrm{Var}\big( \hat{f}(z) \big)$
    • Variance of $\hat{f}(z)$ is then

      \[ \mathrm{Var}\big( \hat{f}(z) \big) = \mathrm{Var}\big( f(z) \big) + a^2\, \mathrm{Var}\big( h(z) \big) - 2 a\, \mathrm{Cov}\big( f(z), h(z) \big) \]

    • Therefore, good control variates have high covariance with $f(z)$
    • Taking the derivative wrt. $a$ and setting it equal to zero, we get the optimal choice of $a$, denoted $a^*$:

      \[ a^* = \frac{\mathrm{Cov}\big( f(z), h(z) \big)}{\mathrm{Var}\big( h(z) \big)} \]

Maybe you recognize this form from somewhere? OH YEAH YOU DO, it's the same expression we have for the slope of the ordinary least squares (OLS) estimator:

\[ \hat{\beta} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} \]

which we also know is the minimum-variance (linear unbiased) estimator in that particular case.

We already know $\mathbb{E}\big[ \hat{f}(z) \big] = \mathbb{E}\big[ f(z) \big]$, which in the linear case just means that the intercepts are the same. If we rearrange:

\[ f(z) = \hat{f}(z) + a \big( h(z) - \mathbb{E}\big[ h(z) \big] \big) \]

So it's like we're performing a linear regression in the expectation space, given some function $h$?

Fun stuff we could do is to let $h$ be a parametrized function and then minimize the variance wrt. $h$ also, right, future-Tor?

  • This case

    In this particular case, we can choose

    \[ h(z) = \nabla_\lambda \log q(z \mid \lambda) \]

    This always has expectation zero, which simply follows from

    \[ \mathbb{E}_q\big[ \nabla_\lambda \log q(z \mid \lambda) \big] = \int q(z \mid \lambda)\, \nabla_\lambda \log q(z \mid \lambda)\, \mathrm{d}z = \int \nabla_\lambda\, q(z \mid \lambda)\, \mathrm{d}z = \nabla_\lambda \int q(z \mid \lambda)\, \mathrm{d}z = \nabla_\lambda 1 = 0 \]

    under sufficient restrictions allowing us to "move" the partial derivative outside of the integral (e.g. smoothness wrt. $\lambda$ is sufficient).

    With the Rao-Blackwellized estimator obtained in the previous section, we then have

    \[ f_i(\mathbf{z}) = \nabla_{\lambda_i} \log q_i(z_i \mid \lambda_i) \big( \log p_i(\mathbf{x}, \mathbf{z}_{(i)}) - \log q_i(z_i \mid \lambda_i) \big), \qquad h_i(\mathbf{z}) = \nabla_{\lambda_i} \log q_i(z_i \mid \lambda_i) \]

    The estimate of the optimal choice of scaling $a_i$ is then

    \[ \hat{a}_i^* = \frac{\sum_{d=1}^{D_i} \widehat{\mathrm{Cov}}\big( f_i^d, h_i^d \big)}{\sum_{d=1}^{D_i} \widehat{\mathrm{Var}}\big( h_i^d \big)} \]

    where

    • $\widehat{\mathrm{Cov}}$ and $\widehat{\mathrm{Var}}$ denote the empirical estimators
    • $f_i^d$ denotes the d-th component of $f_i$, i.e. it can be multi-dimensional

    Therefore, we end up with the MC gradient estimator (with lower variance than the previous one; a small standalone demo of the control-variate idea follows below):

    \[ \widehat{\nabla}_{\lambda_i}\, \mathcal{L} = \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q_i(z_{i,s} \mid \lambda_i) \big( \log p_i(\mathbf{x}, \mathbf{z}_{(i),s}) - \log q_i(z_{i,s} \mid \lambda_i) - \hat{a}_i^* \big), \qquad \mathbf{z}_{(i),s} \sim q_{(i)} \]
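
To see the control-variate machinery in isolation (detached from the BBVI estimator above), here is a small numpy sketch with a toy integrand of my own choosing and a control variate $h$ with known expectation; the scaling is estimated exactly as in the formula for $a^*$ above.

import numpy as np

rng = np.random.default_rng(5)

# Estimate E[f(Z)] for Z ~ N(0, 1) with f(z) = exp(z) (true value exp(1/2)),
# using the control variate h(z) = z, which has known expectation E[h(Z)] = 0.
num_samples = 100_000
z = rng.normal(size=num_samples)
f = np.exp(z)
h = z

# Optimal scaling a* = Cov(f, h) / Var(h), estimated from the same samples.
c = np.cov(f, h)
a_hat = c[0, 1] / c[1, 1]
f_cv = f - a_hat * (h - 0.0)   # f_hat(z) = f(z) - a * (h(z) - E[h(z)])

print("plain MC :", f.mean(), " var:", f.var())
print("with CV  :", f_cv.mean(), " var:", f_cv.var())
print("true     :", np.exp(0.5))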

Algorithm

(figure: the BBVI algorithm from ranganath13_black_box_variat_infer)

In ranganath13_black_box_variat_infer the step size is written as $\rho_t$, where $\rho_t$ is the t-th value of a Robbins-Monro sequence.
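
For reference, a Robbins-Monro sequence is any step-size sequence $(\rho_t)$ satisfying

\[ \sum_{t=1}^{\infty} \rho_t = \infty, \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty, \]

e.g. $\rho_t = t^{-\kappa}$ with $\kappa \in (0.5, 1]$; these are the standard conditions used to argue convergence of the stochastic optimization.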

Appendix A: Definitions

variational density
our approximate probability distribution $q(\mathbf{z})$
variational parameter
a parameter required to compute our variational density, i.e. a parameter which defines the approximating distribution over the latent variables. So sort of like "latent-latent variables", or as I like to call them, doubly latent variables (•_•) / ( •_•)>⌐■-■ / (⌐■_■). Disclaimer: I've never called them that before in my life.

Bibliography