Variational Optimization & Bayesian Inference

The intractability of marginal likelihoods

Bayesian inference boils down to computing integrals.

p(x) = \int p(x|z) p(z) dz

This marginal likelihood drives model selection and generalization. In high dimensions the posterior's typical set is a thin shell, and MCMC samplers have a hard time jumping between well-separated modes.

The variational approach swaps sampling for optimization. We pick a family of distributions \mathcal{Q} and find the member closest to the posterior,

q^* = \text{argmin}_{q \in \mathcal{Q}} D(q \| p(\cdot | x))

This turns inference into gradient descent.

Deriving the ELBO

We work in the space of density functions. The KL divergence gives us a functional,

\mathbb{J}[q] = \int q(z) \log \frac{q(z)}{p(z|x)} dz

Take the functional derivative. The integrand is L(z, q, q') = q \log q - q \log p and so

\frac{\delta L}{\delta q} = (1 + \log q) - \log p

Enforce the normalization constraint \int q(z) dz = 1 with a Lagrange multiplier \lambda,

\frac{\delta}{\delta q} \left( \mathbb{J}[q] + \lambda \left( \int q - 1 \right) \right) = 1 + \log q(z) - \log p(z|x) + \lambda = 0

Solve for q,

\log q(z) = \log p(z|x) - (1 + \lambda) \quad \Longrightarrow \quad q^*(z) \propto p(z|x)

As you would expect, the unconstrained optimum is the posterior itself. Since we cannot evaluate p(z|x) directly we stick to a parametric family \mathcal{Q} = \{q_\phi\} and work with the evidence lower bound instead.

The ELBO. Start from

\log p(x) = \log \int p(x, z) dz = \log \mathbb{E}_q \left[ \frac{p(x, z)}{q(z)} \right]

Jensen’s inequality kicks in because \log is concave and gives

\log p(x) \ge \mathbb{E}_q \left[ \log \frac{p(x, z)}{q(z)} \right] = \text{ELBO}(\phi)

\text{ELBO} = \underbrace{\mathbb{E}_q [\log p(x, z)]}_{\text{reconstruction}} - \underbrace{\mathbb{E}_q [\log q(z)]}_{\text{negative entropy}}

Maximizing the ELBO pulls q toward regions where p(x,z) is big and the entropy term punishes collapse and keeps q spread out. The gap between \log p(x) and the ELBO is exactly D_{KL}(q \| p(z|x)).
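
A quick numerical check of that identity, as a sketch: for a conjugate 1-D model where everything is available in closed form, log p(x) minus a Monte Carlo ELBO estimate should match the closed-form KL between q and the exact posterior. The model (prior z ~ N(0,1), likelihood x|z ~ N(z,1)) and the variational parameters below are illustrative assumptions, not from the text.

```python
import jax.numpy as jnp
from jax import random
from jax.scipy.stats import norm

x = 1.5                                    # observed data point
m, s = 0.3, 0.8                            # variational parameters of q(z) = N(m, s^2)

# Closed forms for this conjugate model
log_px = norm.logpdf(x, loc=0.0, scale=jnp.sqrt(2.0))     # marginal likelihood: N(0, 2)
post_mu, post_sigma = x / 2.0, jnp.sqrt(0.5)              # exact posterior: N(x/2, 1/2)

# Monte Carlo estimate of the ELBO under q
key = random.PRNGKey(0)
z = m + s * random.normal(key, (100_000,))
elbo = jnp.mean(norm.logpdf(x, loc=z, scale=1.0)          # log p(x|z)
                + norm.logpdf(z, loc=0.0, scale=1.0)      # log p(z)
                - norm.logpdf(z, loc=m, scale=s))         # -log q(z)

# Closed-form KL(q || posterior) for two univariate Gaussians
kl = jnp.log(post_sigma / s) + (s**2 + (m - post_mu)**2) / (2 * post_sigma**2) - 0.5

print(f"log p(x) - ELBO = {float(log_px - elbo):.4f}")    # these two numbers agree
print(f"KL(q || p(z|x)) = {float(kl):.4f}")               # up to Monte Carlo error
```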


Optimization methods

Mean field approximation

Assume q(z) = \prod q_i(z_i) and fix all the factors except q_j. The best update is

\log q_j^*(z_j) = \mathbb{E}_{-j} [ \log p(x, z) ] + \text{const}

No gradients are needed, just expectations. It works well for conjugate exponential families and it is the approach that sits under LDA.
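
To make the "expectations only" point concrete, here is a minimal coordinate-ascent (CAVI) sketch for a conjugate model. The model, priors, and update equations are standard textbook assumptions (x_i ~ N(\mu, 1/\tau), \mu|\tau ~ N(\mu_0, 1/(\lambda_0 \tau)), \tau ~ Gamma(a_0, b_0), mean field q(\mu)q(\tau)), not something taken from this text.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=500)        # synthetic data
N, xbar = len(x), x.mean()
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0              # prior hyperparameters

E_tau = 1.0                                         # initialize E_q[tau]
for _ in range(50):
    # q(mu) = N(mu_N, 1/lam_N): depends on the data and on E_q[tau] only
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # q(tau) = Gamma(a_N, b_N): depends only on the first two moments of q(mu)
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N
    a_N = a0 + (N + 1) / 2
    b_N = b0 + 0.5 * (np.sum(x**2) - 2 * E_mu * np.sum(x) + N * E_mu2
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = a_N / b_N

print(mu_N, 1.0 / np.sqrt(E_tau))   # roughly the true mean (2.0) and noise scale (0.5)
```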

Black-box variational inference

Take q_\lambda = \mathcal{N}(\mu, \sigma^2) and differentiate the ELBO straight through,

\nabla_\lambda \mathcal{L} = \nabla_\lambda \mathbb{E}_{q_\lambda} [ \log p(x, z) - \log q_\lambda(z) ]

The score function estimator (REINFORCE) uses

\nabla_\lambda \mathbb{E}_q [f] = \mathbb{E}_q [ f(z) \nabla_\lambda \log q_\lambda(z) ]

but it has wild variance in practice, often bad enough to be unusable.

The reparameterization trick fixes this. Write z = \mu + \sigma \epsilon with \epsilon \sim \mathcal{N}(0,1) and then

\nabla_\lambda \mathbb{E}_{p(\epsilon)} [ f(g(\epsilon, \lambda)) ] = \mathbb{E}_{p(\epsilon)} [ \nabla_z f \nabla_\lambda g ]

The gradient now runs through the model via \nabla_z \log p and this cuts variance by orders of magnitude. Reparameterization is the trick that makes VAEs actually work.
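
A compact sketch of that pathwise gradient in JAX. The model is an assumption for illustration (prior z ~ N(0,1), likelihood x|z ~ N(z,1)), with q_\lambda = N(\mu, \sigma^2) and \sigma = exp(log_sigma); the true posterior is N(x/2, 1/2), so the fitted parameters can be checked by eye.

```python
import jax
import jax.numpy as jnp
from jax import random
from jax.scipy.stats import norm

x_obs = 1.5

def log_joint(z):
    # log p(x, z) for the toy model: z ~ N(0,1), x | z ~ N(z, 1)
    return norm.logpdf(x_obs, loc=z, scale=1.0) + norm.logpdf(z, loc=0.0, scale=1.0)

def neg_elbo(lam, key, n_samples=64):
    mu, log_sigma = lam
    eps = random.normal(key, (n_samples,))
    z = mu + jnp.exp(log_sigma) * eps                       # z = g(eps, lambda)
    log_q = norm.logpdf(z, loc=mu, scale=jnp.exp(log_sigma))
    return -jnp.mean(log_joint(z) - log_q)                  # Monte Carlo -ELBO

grad_fn = jax.jit(jax.grad(neg_elbo))                       # gradient flows through g
lam = jnp.array([0.0, 0.0])
key = random.PRNGKey(0)
for _ in range(500):
    key, sub = random.split(key)
    lam = lam - 0.05 * grad_fn(lam, sub)

mu, sigma = lam[0], jnp.exp(lam[1])
print(mu, sigma)   # -> roughly 0.75 (= x/2) and 0.71 (= sqrt(1/2)), the exact posterior
```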


Natural gradients

Standard gradient descent treats parameter space as Euclidean. But the effect of nudging \mu depends entirely on the scale of \sigma. When \sigma = 0.001, shifting \mu by 0.1 is a huge change in KL terms; when \sigma = 100, the same shift barely matters.

The natural gradient fixes this by bounding each step by KL divergence rather than Euclidean distance,

\phi_{new} = \text{argmax}_\phi \mathcal{L}(\phi) \quad \text{s.t. } D_{KL}(q_\phi \| q_{\phi_{old}}) < \epsilon

To second order, D_{KL}(q_{\phi+d\phi} \| q_\phi) \approx \frac{1}{2} d\phi^T F(\phi) d\phi where F(\phi) = \mathbb{E}_q [ \nabla \log q \nabla \log q^T ] is the Fisher information matrix. The best direction turns out to be

g_{nat} = F^{-1}(\phi) \nabla_\phi \mathcal{L}

For exponential families the natural gradient with respect to the natural parameters \eta equals the ordinary gradient with respect to the mean (expectation) parameters, so no explicit matrix inversion is needed.
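
A small sketch of those two ideas: the Fisher matrix of q = N(\mu, \sigma^2) under the parameterization \phi = (\mu, \log\sigma), estimated by Monte Carlo as E_q[\nabla\log q\, \nabla\log q^T], and a natural-gradient step obtained by solving against it. The numbers are illustrative assumptions; for this parameterization the Fisher should come out close to diag(1/\sigma^2, 2).

```python
import jax
import jax.numpy as jnp
from jax import random
from jax.scipy.stats import norm

def log_q(phi, z):
    mu, log_sigma = phi
    return norm.logpdf(z, loc=mu, scale=jnp.exp(log_sigma))

phi = jnp.array([0.0, jnp.log(0.01)])            # tiny sigma: Euclidean steps in mu are misleading
key = random.PRNGKey(0)
z = phi[0] + jnp.exp(phi[1]) * random.normal(key, (50_000,))

# Monte Carlo Fisher: F = E_q[ grad log q  grad log q^T ]
scores = jax.vmap(jax.grad(log_q), in_axes=(None, 0))(phi, z)    # (S, 2)
F = scores.T @ scores / scores.shape[0]
print(F)                                         # close to diag(1/sigma^2, 2)

grad_elbo = jnp.array([1.0, 1.0])                # a placeholder ELBO gradient
nat_grad = jnp.linalg.solve(F, grad_elbo)        # g_nat = F^{-1} grad
print(nat_grad)                                  # the mu component is shrunk by ~sigma^2
```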

Natural gradients are what Stochastic Variational Inference (Hoffman et al. 2013) is built on, and they pushed VI out to massive datasets. The ELBO gradient for the global parameter breaks into a sum over local terms,

\nabla \mathcal{L} = \mathbb{E}_q [\eta_g] - \lambda_g + \sum_{i=1}^N (\mathbb{E}_{q_i} [\eta_l] - \lambda_l)

Estimate it with a minibatch of size M,

\hat{\nabla} \mathcal{L} = \mathbb{E}_q [\eta_g] - \lambda_g + \frac{N}{M} \sum_{m=1}^M (\mathbb{E}_{q_m} [\eta_l] - \lambda_l)

Taking steps \rho_t that satisfy the Robbins-Monro conditions \sum \rho_t = \infty and \sum \rho_t^2 < \infty gives you Bayesian inference at stochastic gradient descent scale.
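
A sketch of the minibatch scaling and the Robbins-Monro schedule. To keep it self-contained it uses a toy global latent variable and plain reparameterized gradients on a per-datapoint ELBO rather than the conjugate natural-gradient updates of SVI proper; the model and all names are assumptions for illustration.

```python
import jax
import jax.numpy as jnp
from jax import random
from jax.scipy.stats import norm

N, M = 10_000, 100                                   # dataset size, minibatch size
key = random.PRNGKey(0)
data = 2.0 + random.normal(key, (N,))                # synthetic data, true mean = 2

def neg_elbo_minibatch(lam, batch, key):
    m, log_s = lam
    eps = random.normal(key, (32,))
    z = m + jnp.exp(log_s) * eps                     # reparameterized samples of the global latent
    loglik = jnp.mean(norm.logpdf(batch[None, :], loc=z[:, None], scale=1.0), axis=0)
    expected_loglik = (N / M) * jnp.sum(loglik)      # rescale the minibatch term by N/M
    log_prior = jnp.mean(norm.logpdf(z, 0.0, 1.0))
    entropy = -jnp.mean(norm.logpdf(z, m, jnp.exp(log_s)))
    return -(expected_loglik + log_prior + entropy) / N   # per-datapoint ELBO keeps steps O(1)

grad_fn = jax.jit(jax.grad(neg_elbo_minibatch))
lam = jnp.array([0.0, 0.0])
for t in range(2_000):
    key, k1, k2 = random.split(key, 3)
    idx = random.choice(k1, N, (M,), replace=False)  # random minibatch
    rho_t = (t + 100.0) ** -0.75                     # Robbins-Monro: sum rho_t = inf, sum rho_t^2 < inf
    lam = lam - rho_t * grad_fn(lam, data[idx], k2)

print(lam)   # lam[0], the variational mean, converges toward the posterior mean (~2.0)
```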


Normalizing flows

Standard VI usually uses a Gaussian q, which is unimodal and light-tailed. Normalizing flows fix this by stacking invertible maps,

z_K = f_K \circ \dots \circ f_1(z_0), \quad z_0 \sim \mathcal{N}(0, I)

The density transforms by the change of variables formula,

\log q_K(z_K) = \log q_0(z_0) - \sum_{k=1}^K \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|

The maps have to be expressive and they also have to let you work out the Jacobian determinant cheaply. Planar flows (Rezende & Mohamed 2015) use

f(z) = z + u \tanh(w^T z + b)

and, with \psi(z) = \tanh'(w^T z + b)\, w, the matrix determinant lemma gives \det(I + u\psi(z)^T) = 1 + u^T \psi(z). Stacking 10 to 20 of these layers lets q pick up multimodal targets.
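
A sketch of one planar-flow layer and the running log-density bookkeeping in JAX, following the f(z) = z + u tanh(w^T z + b) form above. The parameter values are arbitrary illustration choices, kept in the w^T u \ge -1 region where the map stays invertible.

```python
import jax.numpy as jnp
from jax import random
from jax.scipy.stats import norm

def planar_flow(params, z):
    # f(z) = z + u * tanh(w^T z + b)
    u, w, b = params
    h = jnp.tanh(jnp.dot(w, z) + b)
    z_new = z + u * h
    # psi(z) = tanh'(w^T z + b) * w, so |det J| = |1 + u^T psi(z)| by the matrix determinant lemma
    psi = (1.0 - h**2) * w
    log_det = jnp.log(jnp.abs(1.0 + jnp.dot(u, psi)))
    return z_new, log_det

def flow_log_density(layers, z0, log_q0):
    # log q_K(z_K) = log q_0(z_0) - sum_k log |det df_k / dz_{k-1}|
    z, log_q = z0, log_q0
    for params in layers:
        z, log_det = planar_flow(params, z)
        log_q = log_q - log_det
    return z, log_q

# Two layers acting on a 2-D standard-normal sample (parameter values are arbitrary)
z0 = random.normal(random.PRNGKey(0), (2,))
log_q0 = jnp.sum(norm.logpdf(z0))
layers = [(jnp.array([0.5, -0.3]), jnp.array([1.0, 0.2]), 0.1),
          (jnp.array([-0.2, 0.4]), jnp.array([0.3, -1.0]), -0.5)]
zK, log_qK = flow_log_density(layers, z0, log_q0)
print(zK, log_qK)
```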


Stein variational gradient descent

SVGD (Liu & Wang 2016) takes a totally different angle and drops parametric families in favor of particles.

Let \{z_i\}_{i=1}^N stand in for q. We push them with z \to z + \epsilon \phi(z) to bring down the KL divergence. Writing T(z) = z + \epsilon \phi(z) and q_{[T]} = T_\# q,

\nabla_\epsilon D_{KL}(q_{[T]} \| p) |_{\epsilon=0} = - \mathbb{E}_q [ \text{trace}(\mathcal{A}_p \phi) ]

where \mathcal{A}_p \phi = \phi \nabla \log p + \nabla \cdot \phi is the Stein operator. The best perturbation inside the unit ball of an RKHS \mathcal{H}_k is

\phi^*(x) \propto \mathbb{E}_{y \sim q} [ k(x, y) \nabla_y \log p(y) + \nabla_y k(x, y) ]

The update rule is

z_i \leftarrow z_i + \epsilon \left( \frac{1}{N} \sum_j [ k(z_i, z_j) \nabla \log p(x, z_j) + \nabla_{z_j} k(z_i, z_j) ] \right)

Two forces drive the dynamics. The first term is the kernel-weighted score and it pushes particles toward high-probability regions. The second term \nabla k acts as a repulsive force that shoves particles apart and stops mode collapse. Starting from a tight cluster near the origin, the particles split up and spread out to cover each mode of the posterior.

```python
import jax
import jax.numpy as jnp
from functools import partial
from jax import grad, vmap, jit, random


# 1. RBF kernel and its gradient
def rbf_kernel(X, h=-1.0):
    # X: (N, D) particles. Returns K: (N, N) and grad_K: (N, N, D),
    # where grad_K[i, j] = grad_{z_j} k(z_i, z_j) -- the repulsive direction.
    diff = X[:, None, :] - X[None, :, :]          # (N, N, D), diff[i, j] = z_i - z_j
    sq_dist = jnp.sum(diff**2, axis=-1)           # (N, N)
    if h < 0:                                     # median heuristic for the bandwidth
        h = jnp.median(sq_dist) / jnp.log(X.shape[0])
    K = jnp.exp(-sq_dist / h)
    # grad_{z_j} exp(-||z_i - z_j||^2 / h) = (2/h) * (z_i - z_j) * K[i, j]
    grad_K = K[..., None] * diff * (2.0 / h)
    return K, grad_K


# 2. SVGD step (log_prob_grad is a function, so it must be a static argument for jit)
@partial(jit, static_argnums=1)
def svgd_step(particles, log_prob_grad, step_size):
    grad_logp = log_prob_grad(particles)          # (N, D) score at each particle
    K, grad_K = rbf_kernel(particles)             # (N, N), (N, N, D)
    term1 = K @ grad_logp                         # kernel-weighted pull toward high density
    term2 = jnp.sum(grad_K, axis=1)               # repulsive term, summed over j
    phi = (term1 + term2) / particles.shape[0]
    return particles + step_size * phi


# 3. Target distribution: bimodal Gaussian mixture at (-2, -2) and (2, 2)
def target_log_prob(x):
    mu1 = jnp.array([-2.0, -2.0])
    mu2 = jnp.array([2.0, 2.0])
    log_p1 = jnp.log(0.5) - 0.5 * jnp.sum((x - mu1) ** 2)
    log_p2 = jnp.log(0.5) - 0.5 * jnp.sum((x - mu2) ** 2)
    return jax.scipy.special.logsumexp(jnp.array([log_p1, log_p2]))


dist_grad = vmap(grad(target_log_prob))           # vectorized score function


def run_simulation(n_particles=100, n_steps=200, step_size=0.1):
    key = random.PRNGKey(42)
    # Start all particles in a tight cluster around the origin (a mode-collapsed state)
    particles = random.normal(key, (n_particles, 2)) * 0.1
    for _ in range(n_steps):
        particles = svgd_step(particles, dist_grad, step_size)
    return particles


# Observation: even starting from effectively a single point, the repulsive grad_K term
# pushes particles apart, so they spread out and cover both modes of the mixture --
# a regime where a single MCMC chain can easily get stuck in one mode.
particles = run_simulation()
```

Variance reduction and the reparameterization trick

The practical success of VAEs (Kingma & Welling 2014) hangs on variance reduction. The core problem is computing \nabla_\phi \mathbb{E}_{q_\phi} [f(z)].

Score function estimator (REINFORCE).

\nabla_\phi \mathbb{E}_q [f] = \mathbb{E}_q [ f(z) \nabla_\phi \log q_\phi(z) ]

This works for any q and even discrete distributions. But the variance scales with \text{Var}(f(z)) and that is usually big. The estimator pokes at the function with random samples and never uses its local structure.

Pathwise derivative (reparameterization). Write z = g(\epsilon, \phi) for some diffeomorphism and you get

\nabla_\phi \mathbb{E} = \mathbb{E}_{p(\epsilon)} [ \nabla_z f(z) \nabla_\phi g(\epsilon, \phi) ]

This runs gradient information through the analytic gradient of the model and cuts variance dramatically. The catch is that both f and q have to be differentiable. For discrete latent variables you need relaxations like Gumbel-Softmax.
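
An empirical comparison of the two estimators, as a sketch. The setup is an illustrative assumption: f(z) = z^2 and q_\phi = N(\mu, 1) with \mu = 0.5, so the exact gradient d/d\mu E_q[z^2] = 2\mu = 1 and both estimators can be checked for bias and variance.

```python
import jax.numpy as jnp
from jax import random

mu, n, trials = 0.5, 100, 2000
eps = random.normal(random.PRNGKey(0), (trials, n))
z = mu + eps                                       # samples from q = N(mu, 1)

# Score function (REINFORCE): f(z) * d/dmu log q(z) = z^2 * (z - mu)
score_est = jnp.mean(z**2 * (z - mu), axis=1)
# Pathwise (reparameterization): d/dz f(z) * d/dmu g(eps, mu) = 2z * 1
path_est = jnp.mean(2 * z, axis=1)

print("score-function: mean", jnp.mean(score_est), "variance", jnp.var(score_est))
print("pathwise:       mean", jnp.mean(path_est), "variance", jnp.var(path_est))
# Both estimators are unbiased (mean ~1), but the pathwise variance is far smaller.
```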


Summary

Variational inference turns posterior computation into optimization. Instead of waiting for a Markov chain to mix you write down an objective, the ELBO, and optimize it with gradients. The tradeoff is real: the result is capped by the approximation family and the KL objective is mode-seeking. But it scales, and normalizing flows or SVGD can close the approximation gap.

For big models and big datasets, optimization-based inference will probably remain the dominant approach over sampling for most practical purposes.


VAE Loss Derivation

The VAE has a generative model p_\theta(x|z), a prior p(z), and an approximate posterior q_\phi(z|x).

\mathcal{L} = \mathbb{E}_{q_\phi(z|x)} [ \log p_\theta(x|z) ] - D_{KL}( q_\phi(z|x) \| p(z) )

The KL term. Let q(z|x) = \mathcal{N}(\mu, \Sigma) and p(z) = \mathcal{N}(0, I). The general Gaussian KL is

D_{KL}(N_0 \| N_1) = \frac{1}{2} \left[ \text{tr}(\Sigma_1^{-1} \Sigma_0) + (\mu_1 - \mu_0)^T \Sigma_1^{-1} (\mu_1 - \mu_0) - k + \ln \frac{\det \Sigma_1}{\det \Sigma_0} \right]

With \mu_1 = 0, \Sigma_1 = I, and a diagonal \Sigma_0 = \text{diag}(\sigma_1^2, \dots, \sigma_k^2) this reduces to -\frac{1}{2} \sum_j (1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2). It punishes \mu and \sigma for drifting away from the standard normal and works as a regularizer that keeps the latent space organized.

The reconstruction term. Monte Carlo sample z^{(l)} = \mu + \sigma \odot \epsilon^{(l)}. If p(x|z) = \mathcal{N}(x; \text{Decoder}(z), I) then \log p(x|z) = -\frac{1}{2}\|x - \text{Decoder}(z)\|^2 + \text{const}, which is just a (negative) mean squared error.
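
Putting the two terms together, here is a per-example VAE loss sketch with a diagonal Gaussian encoder and a unit-variance Gaussian decoder. The linear encoder/decoder and the shapes are stand-in assumptions, not the architecture from the original paper.

```python
import jax
import jax.numpy as jnp
from jax import random

def vae_loss(params, x, key):
    enc_W, enc_b, dec_W, dec_b = params
    # Encoder outputs mu and log sigma^2 of q(z|x)
    h = enc_W @ x + enc_b
    mu, logvar = jnp.split(h, 2)
    # Reparameterized sample z = mu + sigma * eps
    eps = random.normal(key, mu.shape)
    z = mu + jnp.exp(0.5 * logvar) * eps
    # Reconstruction: log N(x; Decoder(z), I) = -0.5 * ||x - x_hat||^2 + const
    x_hat = dec_W @ z + dec_b
    recon = -0.5 * jnp.sum((x - x_hat) ** 2)
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )
    kl = -0.5 * jnp.sum(1.0 + logvar - mu**2 - jnp.exp(logvar))
    return -(recon - kl)                # negative per-example ELBO

# Tiny usage example: x in R^4, z in R^2, random linear encoder/decoder
key = random.PRNGKey(0)
k1, k2, k3 = random.split(key, 3)
params = (0.1 * random.normal(k1, (4, 4)), jnp.zeros(4),
          0.1 * random.normal(k2, (4, 2)), jnp.zeros(4))
x = random.normal(k3, (4,))
print(vae_loss(params, x, key))
print(jax.grad(vae_loss)(params, x, key)[0].shape)   # gradients flow to the encoder weights
```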


Gumbel-Softmax for discrete variables

Reparameterization needs continuous variables. For categoricals z \sim \text{Cat}(\pi) you use the Gumbel-Max trick,

z = \text{one\_hot}( \text{argmax}_i ( \log \pi_i + g_i ) ), \quad g_i \sim \text{Gumbel}(0, 1)

Then relax it with a temperature \tau (Jang et al. 2016),

y_i = \frac{\exp((\log \pi_i + g_i)/\tau)}{\sum_j \exp((\log \pi_j + g_j)/\tau)}

\tau \to 0 brings back discrete samples and \tau \to \infty gives uniform. In practice \tau gets annealed during training.
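
A short sketch contrasting the hard Gumbel-Max sample with its Gumbel-Softmax relaxation. The class probabilities \pi and the temperature \tau are arbitrary illustration values.

```python
import jax
import jax.numpy as jnp
from jax import random

def sample_gumbel(key, shape):
    u = random.uniform(key, shape, minval=1e-8, maxval=1.0)
    return -jnp.log(-jnp.log(u))                    # Gumbel(0, 1) noise

log_pi = jnp.log(jnp.array([0.1, 0.6, 0.3]))        # class log-probabilities
key = random.PRNGKey(0)
g = sample_gumbel(key, log_pi.shape)

# Hard sample (Gumbel-Max): exactly categorical, but argmax blocks gradients
hard = jax.nn.one_hot(jnp.argmax(log_pi + g), 3)

# Relaxed sample (Gumbel-Softmax): differentiable w.r.t. log_pi
tau = 0.5
soft = jax.nn.softmax((log_pi + g) / tau)

print(hard)
print(soft)   # approaches the hard one-hot as tau -> 0, the uniform vector as tau -> infinity
```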


Alpha-divergences

The “exclusive” KL (q \| p) is mode-seeking and q just collapses onto one mode of p. The “inclusive” KL (p \| q) is mass-covering but it needs you to evaluate p. Rényi’s \alpha-divergence interpolates between them,

D_\alpha(p \| q) = \frac{1}{\alpha - 1} \log \int p(z)^\alpha q(z)^{1-\alpha} dz

\alpha \to 1 brings back D_{KL}(p \| q) and \alpha \to 0 brings back -\log \int_{\text{supp}(p)} q\, dz, which measures support overlap. Values in between slide between mass-covering when \alpha < 1 and mode-seeking when \alpha > 1. Black Box Alpha-VI (Hernandez-Lobato et al. 2016) minimizes D_\alpha with

\mathcal{L}_\alpha(q) \approx \frac{1}{\alpha} \log \frac{1}{K} \sum_{k=1}^K \left( \frac{p(x, z_k)}{q(z_k)} \right)^\alpha

and this ties VI to particle filters and sequential Monte Carlo.
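
A sketch that evaluates the Monte Carlo objective above with a logsumexp for numerical stability. The model and q are illustrative univariate Gaussians chosen only so the importance weights are easy to compute; nothing here reproduces the paper's setup.

```python
import jax.numpy as jnp
from jax import random
from jax.scipy.stats import norm
from jax.scipy.special import logsumexp

def alpha_objective(alpha, log_joint, log_q, z):
    # (1/alpha) * log( (1/K) * sum_k (p(x, z_k) / q(z_k))^alpha ), computed in log space
    log_w = log_joint(z) - log_q(z)
    K = z.shape[0]
    return (1.0 / alpha) * (logsumexp(alpha * log_w) - jnp.log(K))

x_obs = 1.5
log_joint = lambda z: norm.logpdf(x_obs, z, 1.0) + norm.logpdf(z, 0.0, 1.0)
m, s = 0.6, 0.9
log_q = lambda z: norm.logpdf(z, m, s)

z = m + s * random.normal(random.PRNGKey(0), (5_000,))   # samples from q
for alpha in (0.5, 1.0, 2.0):                            # evaluate the stated objective at a few alphas
    print(alpha, float(alpha_objective(alpha, log_joint, log_q, z)))
```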


Development of variational methods

When | What | Why it mattered
---- | ---- | ---------------
1998 | Shun-Ichi Amari publishes natural gradient descent | Showed that parameter space geometry determines the correct gradient direction
1999 | Michael Jordan et al. formalize variational methods for graphical models | Established VI as a principled alternative to MCMC for structured models
2013 | Hoffman et al. introduce Stochastic Variational Inference | Extended VI to massive datasets via stochastic optimization
2014 | Kingma & Welling introduce VAEs and the reparameterization trick | Connected deep learning (backprop) with variational inference
2015 | Rezende & Mohamed develop normalizing flows | Enabled arbitrarily complex approximate posteriors via invertible transformations
2016 | Liu & Wang propose SVGD | Particle-based VI using kernelized Stein discrepancy, requiring no parametric family

Terms and definitions

ELBO (Evidence Lower Bound) is the objective function for VI and it is defined as \mathcal{L}(q) = \mathbb{E}_q[\log p(x,z)] - \mathbb{E}_q[\log q(z)]. Maximizing the ELBO is the same thing as minimizing \text{KL}(q \| p).

Mean Field is the assumption that q factors all the way across latent dimensions. It is simple and it scales but it cannot pick up posterior correlations.

Natural Gradient is the gradient step on the Riemannian manifold set up by Fisher information. In natural parameters of exponential families ordinary gradients are already natural gradients.

Normalizing Flow is a chain of invertible transformations applied to a simple base distribution to make a complex density. The key constraint is that you have to compute the Jacobian determinant cheaply.

Reparameterization Trick writes z = g(\epsilon, \lambda) with \epsilon \sim p(\epsilon) so that \nabla_\lambda \mathbb{E}[f(z)] = \mathbb{E}[\nabla f \cdot \nabla_\lambda g]. It cuts gradient variance by orders of magnitude compared to score-function estimators.

SVGD (Stein Variational Gradient Descent) is a deterministic particle method that moves samples toward the posterior using a kernelized velocity field that comes from Stein’s identity.

BBVI (Black-Box Variational Inference) uses score-function gradients of the ELBO and it works for any model with a log-density you can compute.

LDA (Latent Dirichlet Allocation) is the classic topic model and historically it was one of the first large-scale applications of mean field VI.


References

1. Blei, D. M. et al. (2017). “Variational Inference: A Review for Statisticians”. Extensive overview of VI as an alternative to MCMC. Covers Mean Field, Stochastic VI, and connections to convex optimization.

2. Liu, Q., & Wang, D. (2016). “Stein Variational Gradient Descent”. Introduces SVGD. Uses Stein’s identity and RKHS logic to derive a deterministic particle flow that simulates the heat flow of the posterior.

3. Kingma, D. P., & Welling, M. (2014). “Auto-Encoding Variational Bayes”. The paper that introduced VAEs and the Reparameterization Trick. It connected Deep Learning (backprop) with Variational Inference.

4. Rezende, D. J., & Mohamed, S. (2015). “Variational Inference with Normalizing Flows”. Showed how to construct arbitrarily complex posteriors q(z) by transforming simple Gaussians through invertible neural networks.

5. Amari, S. I. (1998). “Natural Gradient Works Efficiently in Learning”. The seminal paper defining the Natural Gradient for neural networks. It shows that the parameter space is Riemannian and the Fisher Information Metric is the unique metric invariant to reparameterization.

6. Jordan, M. I. et al. (1999). “An Introduction to Variational Methods for Graphical Models”. The classic tutorial that established the Mean Field approximation as the standard tool for exponential families in graphical models.