
Demystifying Variational Autoencoders (VAEs): From Variational Inference to Deep Generative Models

Generative modeling is one of the most exciting areas in modern artificial intelligence. While models like GANs and Diffusion Models often dominate the headlines, the Variational Autoencoder (VAE) remains a foundational, elegant, and deeply principled framework that beautifully bridges the gap between probabilistic graphical models and deep neural networks.

Unlike standard autoencoders—which simply compress data into a static bottleneck vector—VAEs learn the underlying probability distribution of the data. This allows them to not only compress information but also generate entirely new, realistic data samples.

In this post, we will break down the mathematical mechanics of VAEs, starting from the foundational principles of Variational Inference and mapping them directly to deep learning architectures.


1. The Core Problem: Variational Inference

To understand VAEs, we first need to look at Variational Inference (VI). While classic methods like Gaussian Mixture Models (GMMs) parameterize a fixed, pre-determined number of distributions, variational inference provides a fundamental optimization framework for estimating any complex data distribution.

Imagine we have an observed data sample \(x\) and we believe it is influenced by some hidden, unobserved factors (latent variables) \(z\). Our goal is to model the true distribution of our data, \(p(x)\).

However, computing \(p(x)\) directly requires integrating over all possible configurations of the latent variables \(z\):

\[p(x) = \int p_\theta(x,z) dz\]

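In practice, this integral is intractable for complex data such as images: when \(p_\theta(x|z)\) is a nonlinear function of \(z\) (for example, a neural network), the integral has no closed form, and numerical integration over a high-dimensional latent space scales exponentially with the number of dimensions. To bypass this, we look for a tractable approximation.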
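To make the integral concrete, here is a minimal sketch on a toy model where it happens to be tractable. The model (prior \(z \sim \mathcal{N}(0, 1)\), likelihood \(x|z \sim \mathcal{N}(z, 1)\)) and the function names are illustrative assumptions, not part of the VAE recipe. For this model \(p(x) = \mathcal{N}(x; 0, 2)\), so brute-force grid integration can be checked against the exact answer.

```python
import math

def normal_pdf(v, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at v."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def marginal_by_grid(x, lo=-10.0, hi=10.0, n=4001):
    """Approximate p(x) = integral of p(x|z) p(z) dz with the trapezoidal rule."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0)
    return total * step

x = 1.5
p_grid = marginal_by_grid(x)
p_exact = normal_pdf(x, 0.0, 2.0)  # closed form, valid only for this toy model
print(p_grid, p_exact)

# The same grid resolution in d latent dimensions needs n**d evaluations.
evals_needed = {d: 101 ** d for d in (1, 2, 10)}
print(evals_needed)
```

In one dimension the grid is cheap, but the evaluation count grows as a power of the latent dimension, which is why this approach does not carry over to realistic models.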

Derivation 1: The Density Estimation Angle (ELBO)

To find an approximation for our data distribution, we can look at the log-likelihood \(\log p(x)\). By introducing an arbitrary proposal distribution \(q_\lambda(z|x)\) over the latent variables and applying Jensen's Inequality, we can derive a mathematical lower bound:

\[\begin{aligned} \log p(x) & = \log \int p_\theta(x,z) dz \\ & = \log \int \frac{q_\lambda(z|x)}{q_\lambda(z|x)} p_\theta(x,z) dz \\ & \ge \int q_\lambda(z|x) \log \frac{p_\theta(x,z)}{q_\lambda(z|x)} dz \quad \text{(via Jensen's Inequality)} \\ & = \mathbb{E}_{q_\lambda(z|x)} \biggl[ \log \frac{p_\theta(x,z)}{q_\lambda(z|x)} \biggr] \\ & \equiv \mathtt{ELBO}(x; \theta, \lambda) \end{aligned}\]

This expression is known as the Evidence Lower Bound (ELBO). Because \(\log p(x)\) is intractable, we maximize the ELBO instead. Raising the bound with respect to \(\theta\) pushes up a guaranteed floor on the log-likelihood, while raising it with respect to \(\lambda\) tightens the bound itself.
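The bound can be checked numerically. The sketch below assumes a toy model chosen purely for illustration (prior \(z \sim \mathcal{N}(0, 1)\), likelihood \(x|z \sim \mathcal{N}(z, 1)\)), for which \(\log p(x)\) and the true posterior \(\mathcal{N}(x/2, 1/2)\) are known exactly. It estimates the ELBO by Monte Carlo for two choices of \(q\); the helper names are hypothetical.

```python
import math
import random

def log_normal(v, mean, var):
    """Log-density of a 1-D Gaussian N(mean, var) evaluated at v."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (v - mean) ** 2 / var

def elbo_estimate(x, q_mean, q_var, n_samples, rng):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z|x)] with q = N(q_mean, q_var)."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        log_joint = log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)  # log p(x|z) + log p(z)
        total += log_joint - log_normal(z, q_mean, q_var)
    return total / n_samples

rng = random.Random(0)
x = 1.5
log_px = log_normal(x, 0.0, 2.0)  # exact log-evidence of the toy model

elbo_loose = elbo_estimate(x, 0.0, 1.0, 20000, rng)      # q set to the prior
elbo_tight = elbo_estimate(x, x / 2.0, 0.5, 20000, rng)  # q set to the true posterior
print(log_px, elbo_loose, elbo_tight)
```

With \(q\) equal to the prior the estimate sits clearly below \(\log p(x)\); with \(q\) equal to the true posterior the ratio inside the expectation is constant, so the bound is met with equality.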

Derivation 2: The Latent Projection Angle (KL Divergence)

Another way to look at this is from the perspective of representation learning. We want to find the true hidden factors \(z\) that "cause" our observation \(x\). This is defined by the true posterior distribution \(p_\theta(z|x)\).

Since we cannot compute \(p_\theta(z|x)\) directly, we introduce a tractable distribution \(q_\lambda(z|x)\)—acting as an inference model—to approximate it. To make \(q_\lambda(z|x)\) match the true distribution as closely as possible, we minimize the Kullback-Leibler (KL) Divergence between them:

\[\begin{aligned} D_{KL}(q_\lambda(z|x) \parallel p_\theta(z|x)) & = \int q_\lambda(z|x) \log \frac{q_\lambda(z|x)}{p_\theta(z|x)} dz \\ & = \int q_\lambda(z|x) \log \frac{q_\lambda(z|x)p_\theta(x)}{p_\theta(z,x)} dz \\ & = \int q_\lambda(z|x) \biggl( \log p_\theta(x) + \log \frac{q_\lambda(z|x)}{p_\theta(z,x)} \biggr) dz \\ & = \log p_\theta(x) + \int q_\lambda(z|x) \log \frac{q_\lambda(z|x)}{p_\theta(z,x)} dz \\ & = \log p_\theta(x) - \mathtt{ELBO}(x; \theta, \lambda) \end{aligned}\]

Rearranging this formula highlights a beautiful relationship:

\[\log p_\theta(x) = \mathtt{ELBO}(x; \theta, \lambda) + D_{KL}(q_\lambda(z|x) \parallel p_\theta(z|x))\]

Because \(\log p_\theta(x)\) is fixed with respect to our variational parameters \(\lambda\), minimizing the KL divergence is exactly equivalent to maximizing the ELBO. Furthermore, because KL divergence is always \(\ge 0\), this mathematically proves that the ELBO is always a lower bound on the true log-likelihood.
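As a sanity check on this decomposition, the following sketch evaluates both sides in closed form for an illustrative 1-D model (prior \(\mathcal{N}(0, 1)\), likelihood \(\mathcal{N}(x; z, 1)\), Gaussian \(q = \mathcal{N}(m, s^2)\)). The model and the function names are assumptions made for the example; the formulas are the standard Gaussian expectations, entropy, and KL divergence.

```python
import math

def log_normal(v, mean, var):
    """Log-density of a 1-D Gaussian N(mean, var) evaluated at v."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (v - mean) ** 2 / var

def elbo_closed_form(x, m, s2):
    """Exact ELBO for prior N(0, 1), likelihood N(x; z, 1) and q = N(m, s2)."""
    expected_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    expected_log_prior = -0.5 * math.log(2.0 * math.pi) - 0.5 * (m ** 2 + s2)
    entropy_q = 0.5 * math.log(2.0 * math.pi * math.e * s2)
    return expected_log_lik + expected_log_prior + entropy_q

def kl_gaussians(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for 1-D Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

x = 1.5
log_px = log_normal(x, 0.0, 2.0)  # exact evidence for this toy model
gaps = []
for m, s2 in [(0.0, 1.0), (0.75, 0.5), (2.0, 0.1)]:
    elbo = elbo_closed_form(x, m, s2)
    kl = kl_gaussians(m, s2, x / 2.0, 0.5)  # true posterior is N(x/2, 1/2)
    gaps.append(log_px - (elbo + kl))
    print(m, s2, elbo, kl)
print(gaps)
```

For every choice of \(q\) the ELBO and the KL term trade off against each other while their sum stays at \(\log p_\theta(x)\), up to floating-point error.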


2. Bringing in Deep Learning: The VAE Framework

While classic variational inference relies on iterative schemes such as coordinate ascent updates or the stochastic-gradient estimators of Black-Box Variational Inference, Variational Autoencoders (VAEs) parameterize these distributions using neural networks.

To turn the ELBO into a functional neural network loss function, we rewrite it by breaking the joint distribution \(p_\theta(x, z)\) into \(p_\theta(x|z)p_\theta(z)\):

\[\begin{aligned} \mathtt{ELBO}(x; \theta, \lambda) & = \mathbb{E}_{q_\lambda(z|x)} \biggl[ \log \frac{p_\theta(x|z) p_\theta(z)}{q_\lambda(z|x)} \biggr] \\ & = \mathbb{E}_{q_\lambda(z|x)} \bigl[ \log p_\theta(x|z) \bigr] - D_{KL}(q_\lambda(z|x) \parallel p_\theta(z)) \end{aligned}\]

In deep learning, we traditionally minimize a loss function rather than maximizing an objective. Therefore, we flip the sign to minimize the Negative ELBO:

\[\mathcal{L}_{VAE}(x) = -\mathbb{E}_{q_\lambda(z|x)} \bigl[ \log p_\theta(x|z) \bigr] + D_{KL}(q_\lambda(z|x) \parallel p_\theta(z))\]

This gives us two clear, competing terms to optimize:

  1. The Reconstruction Loss (\(-\mathbb{E}_{q_\lambda(z|x)} \bigl[ \log p_\theta(x|z) \bigr]\))
  2. The KL Regularization Term (\(D_{KL}(q_\lambda(z|x) \parallel p_\theta(z))\))

3. Breaking Down the Components

To make this architecture concrete, we make specific probabilistic assumptions that yield clean, analytical solutions.

The Regularization Term (The Encoder)

We assume that the true prior of our latent space is a standard, isotropic Gaussian distribution: \(p_\theta(z) = \mathcal{N}(z; 0, \mathbb{I})\). This acts as a force pulling our latent representations toward the center of the vector space.

The inference distribution \(q_\lambda(z|x)\) is modeled as a diagonal Gaussian, parameterized by an Encoder Neural Network that outputs a mean vector \(f_\mu(x)\) and a standard deviation vector \(f_\sigma(x)\), whose element-wise square gives the variances:

\[q_\lambda(z|x) = \mathcal{N}(z; f_{\mu}(x), f_{\sigma}^2(x)\mathbb{I})\]

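Because both the prior and the variational posterior are Gaussian, the KL divergence has a closed form, so no sampling-based approximation is needed. Writing \(\mu_j\) and \(\sigma_j\) for the \(j\)-th components of \(f_\mu(x)\) and \(f_\sigma(x)\), and \(J\) for the dimensionality of the latent space: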

\[D_{KL}(q_\lambda(z|x) \parallel p_\theta(z)) = -\frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2 \right)\]
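The closed form translates almost directly into code. This is a minimal sketch using plain Python lists; taking the log-variance \(\log \sigma_j^2\) as input is an assumption made here because it is a common, numerically convenient choice, and the function name is hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

# q identical to the prior: the divergence vanishes.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Unit variances, shifted means: each dimension contributes mu_j**2 / 2.
kl_shifted = kl_to_standard_normal([1.0, -2.0], [0.0, 0.0])

# Zero mean, but a very narrow posterior (sigma**2 = 0.01) is penalized as well.
kl_narrow = kl_to_standard_normal([0.0], [math.log(0.01)])

print(kl_zero, kl_shifted, kl_narrow)
```

The three cases show the two ways the term pushes back: against means that drift away from the origin and against variances that move away from one.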

The Reconstruction Term (The Decoder)

The second part of our network is the Decoder, \(p_\theta(x|z)\), which takes a sample \(z\) from our latent space and maps it back onto the data space. Conventionally, we assume \(p_\theta(x|z)\) is an isotropic Gaussian whose mean is produced by a decoder network \(g_\mu(z)\) and whose variance is a fixed constant \(c\):

\[p_\theta(x | z) = \mathcal{N}(x ; g_{\mu}(z), c\mathbb{I})\]

We approximate the expectation over \(q_\lambda(z|x)\) using Monte Carlo sampling. Empirically, a single sample of \(z\) per data point per training step is sufficient, provided the minibatch is reasonably large.

Under this Gaussian assumption, \(-\log p_\theta(x|z) = \frac{1}{2c}\lVert x - g_\mu(z) \rVert^2 + \text{const}\), so maximizing the log-likelihood amounts to minimizing an \(L_2\) Loss (squared error) between the original input and the reconstructed output, weighted by \(1/(2c)\).
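The link between the Gaussian likelihood and squared error can be verified on a small example. In this sketch the vectors, the variance value and the function names are made up for illustration; the point is that the negative log-likelihoods of two candidate reconstructions differ by exactly their squared-error difference divided by \(2c\), because the normalizing constant cancels.

```python
import math

def sum_squared_error(x, x_hat):
    """Squared L2 distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def gaussian_nll(x, x_hat, c):
    """-log N(x; x_hat, c * I), where c is the fixed variance of the decoder."""
    d = len(x)
    return 0.5 * d * math.log(2.0 * math.pi * c) + sum_squared_error(x, x_hat) / (2.0 * c)

x = [0.2, 0.9, 0.4]
recon_a = [0.1, 0.8, 0.5]  # close reconstruction
recon_b = [0.6, 0.2, 0.0]  # poor reconstruction
c = 0.5

nll_gap = gaussian_nll(x, recon_a, c) - gaussian_nll(x, recon_b, c)
sse_gap = (sum_squared_error(x, recon_a) - sum_squared_error(x, recon_b)) / (2.0 * c)
print(nll_gap, sse_gap)
```

Since \(c\) also sets the weight of the reconstruction term relative to the KL term, its value is a modeling choice rather than a detail that can be ignored.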

💡 Pro-Tip: Changing your structural assumptions about \(p_\theta(x|z)\) changes your loss function! If you assume a Laplace distribution, your reconstruction term becomes an \(L_1\) Loss. If your input data is binary (or normalized between 0 and 1) and you choose a Bernoulli distribution, it becomes a Binary Cross-Entropy Loss.
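Both alternatives can be written out in a few lines. The sketch below is illustrative only: the function names, the example vectors and the Laplace scale \(b\) are assumptions, and the decoder outputs are simply given as lists.

```python
import math

def bernoulli_nll(x, p):
    """-log prod_i Bernoulli(x_i; p_i), i.e. binary cross-entropy summed over dimensions."""
    return -sum(a * math.log(q) + (1.0 - a) * math.log(1.0 - q) for a, q in zip(x, p))

def laplace_nll(x, x_hat, b):
    """-log prod_i Laplace(x_i; x_hat_i, b): an L1 distance scaled by 1/b, plus a constant."""
    l1 = sum(abs(a - q) for a, q in zip(x, x_hat))
    return len(x) * math.log(2.0 * b) + l1 / b

x = [1.0, 0.0, 1.0]   # binary targets
p = [0.9, 0.2, 0.6]   # decoder outputs interpreted as probabilities

bce = bernoulli_nll(x, p)
expected_bce = -(math.log(0.9) + math.log(0.8) + math.log(0.6))

lap = laplace_nll(x, p, 1.0)
expected_lap = 3 * math.log(2.0) + 0.7  # constant term plus the L1 distance

print(bce, lap)
```

In a real implementation the probabilities are usually clamped away from 0 and 1 (or the loss is computed from logits) to avoid taking the log of zero.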


4. The Secret Sauce: The Reparameterization Trick

There is a glaring engineering issue in the architecture described so far. The encoder outputs parameters \(\mu\) and \(\sigma\), we sample a latent vector \(z\) from \(\mathcal{N}(\mu, \sigma^2)\), and we pass \(z\) to the decoder.

Because sampling is a stochastic operation, it creates a break in our computational graph. A value drawn directly from \(\mathcal{N}(\mu, \sigma^2)\) is not a differentiable function of \(\mu\) and \(\sigma\), so standard backpropagation cannot deliver gradients to the encoder.

To fix this, VAEs use the Reparameterization Trick. Instead of sampling directly from \(\mathcal{N}(\mu, \sigma^2)\), we sample a random noise vector \(\epsilon\) from a static standard normal distribution:

\[\epsilon \sim \mathcal{N}(0, \mathbb{I})\]

We then construct \(z\) deterministically by scaling and shifting \(\epsilon\):

\[z = f_{\mu}(x) + f_{\sigma}(x) \odot \epsilon\]

Here \(\odot\) denotes element-wise multiplication. By moving the stochasticity into an external input (\(\epsilon\)), \(z\) becomes a deterministic, differentiable function of \(\mu\) and \(\sigma\), so gradients can flow from the decoder back to the encoder.
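A small numerical experiment shows why this matters. The sketch below uses a stand-in objective \(f(z) = z^2\) in place of a decoder loss (an assumption made so the true gradients are known: \(\mathbb{E}[z^2] = \mu^2 + \sigma^2\), giving \(2\mu\) and \(2\sigma\)). It applies the chain rule through \(z = \mu + \sigma \epsilon\) by hand, which is what an autodiff framework would do automatically.

```python
import random

def reparameterize(mu, sigma, rng):
    """Return z = mu + sigma * eps (element-wise) together with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps

rng = random.Random(0)
mu, sigma = 1.0, 0.5
n = 50000

grad_mu = 0.0
grad_sigma = 0.0
for _ in range(n):
    z_vec, eps_vec = reparameterize([mu], [sigma], rng)
    z, eps = z_vec[0], eps_vec[0]
    # f(z) = z**2, so df/dz = 2z; chain rule with dz/dmu = 1 and dz/dsigma = eps.
    grad_mu += 2.0 * z / n
    grad_sigma += 2.0 * z * eps / n

print(grad_mu, 2.0 * mu)        # estimate vs. analytic gradient w.r.t. mu
print(grad_sigma, 2.0 * sigma)  # estimate vs. analytic gradient w.r.t. sigma
```

The averaged pathwise gradients land close to the analytic values, which is the property that lets the encoder be trained with ordinary backpropagation.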


5. VAEs vs. Traditional Autoencoders

It's tempting to look at a VAE and think it's just a regular Autoencoder (AE) with some statistical flavor. However, the operational dynamics are fundamentally different:

| Feature | Traditional Autoencoder (AE) | Variational Autoencoder (VAE) |
| --- | --- | --- |
| Latent Bottleneck | Maps an input to a single deterministic point (a fixed vector). | Maps an input to a continuous probability distribution. |
| Latent Space Structure | Unregularized; prone to large "gaps" and overfitting. | Continuous and complete; regularized toward a standard normal distribution. |
| Generative Capability | Poor; decoding random coordinates typically yields garbage outputs. | Good; decoding samples from \(\mathcal{N}(0, \mathbb{I})\) yields structured, plausible novel data. |

Without the KL regularization term forcing the latent space to remain cohesive, the encoder network would cheat. It would give every single training image its own isolated island in the latent space, far away from any other image, minimizing reconstruction loss perfectly but completely destroying the model's ability to generalize or generate novel data.


Conclusion

By framing the objective around the Evidence Lower Bound (ELBO), Variational Autoencoders give us a mathematically sound recipe for training deep generative networks via standard stochastic gradient descent (SGD) or Adam.

They provide a structured, smooth latent space in which you can linearly interpolate between data points (for example, blending two faces or morphing smoothly from one digit to another), making them an essential milestone in the evolution of modern deep learning and generative artificial intelligence.
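Latent interpolation itself is only a few lines. In the sketch below, `interpolate_latents` is a hypothetical helper and `decode` is a placeholder linear map standing in for a trained decoder \(g_\mu(z)\); with a real model you would encode two inputs, interpolate between their latent means, and decode each intermediate point.

```python
def interpolate_latents(z_start, z_end, steps):
    """Return `steps` latent vectors evenly spaced on the line from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # runs from 0.0 to 1.0 inclusive
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path

def decode(z):
    """Hypothetical stand-in for a trained decoder; here just a fixed linear map."""
    return [2.0 * z[0] - z[1], 0.5 * z[1]]

path = interpolate_latents([0.0, 1.0], [2.0, -1.0], steps=5)
frames = [decode(z) for z in path]

for z, frame in zip(path, frames):
    print(z, frame)
```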