1 Introduction

Autoencoders learn a low-dimensional representation of the data in an unsupervised way by trying to imitate the identity function, i.e. by reconstructing the original data while forcing it through a low-dimensional bottleneck. In other words, autoencoders are neural networks trained to generate an output y that is as close as possible to the input x, with an internal layer that gives the representation z(x) for each new input.

An autoencoder is composed of two networks:

Encoder $E_{\theta} : X \to Z$: Maps the high-dimensional input to a low-dimensional latent representation, usually with $\lvert Z \rvert \leq \lvert X \rvert$.
Decoder $D_{\phi} : Z \to Y$: Reconstructs an output as close as possible to the original input from the latent representation.

The encoder network is generally used for dimensionality reduction when $\lvert Z \rvert \ll \lvert X \rvert$, and can be seen as a non-linear generalization of PCA.

Deterministic AE

Denoising AE

Sparse AE

Contractive AE

Consider random variables x and z, with g a deterministic decoder (generator) and f a deterministic encoder. The encoding and generation processes look like:

$$
x \sim p_{D}, \quad \hat{z} = f(x) \qquad\qquad z \sim p_{z}, \quad \hat{x} = g(z)
$$

Our goal is to generate $x$, or in other words to sample $x$ from a distribution $p_{\theta}(x)$. If we know the distribution of the latent variable z, then $p_{\theta}(x)$ is the marginal likelihood $p_{\theta}(x) = \int_{z} p_{\theta}(x \mid z)\, p(z)\, dz$, obtained by integrating the conditional distribution $p_{\theta}(x \mid z)$ over the latent prior $p(z)$. Alternatively, if we have access to the ground-truth latent encoder $p(z \mid x)$, we can write $p(x) = \frac{p(x, z)}{p(z \mid x)}$.

Using the log-likelihood objective $\sum_{i=1}^{N} \log p_{\theta}(x^{(i)})$, we can optimize $\theta$ by minimizing the negative log-likelihood (NLL).

But generally z is not known, and the primary goal of a generative model is to create output from scratch. So we need a distribution $p_{z}$ from which z can be sampled. You might ask, why can’t an AE be used for this, with the encoder providing the sample z? Firstly, the encoder needs an input to work on, and secondly, it is optimized to act as an identity lookup table. The latent space might not be structured, and an AE gives no guarantee about the structure of the latent variable distribution.

2 Likelihood

To estimate the intractable data likelihood $p_{\theta}(x)$, we make the following assumptions:

The hypothesis space $p_{\theta}(x \mid z)$ is modelled as a mixture of simple distributions (e.g., Gaussians or Bernoullis), weighted by a prior $p_{z}$ (usually Gaussian). Because z is continuous, a VAE can be seen as an infinite mixture of Gaussians. To avoid learning a separate set of parameters for every mixture component, as one would over a discrete latent space, the VAE uses a function $g_{\theta}$ that maps the continuous latent variable to the parameters of its component, which also smooths the latent space.

$$
p_{\theta}(x) = \int_{z} \mathcal{N}\left(x;\, g_{\theta}^{\mu}(z),\, g_{\theta}^{\Sigma}(z)\right)\, \mathcal{N}(z;\, 0, I)\, dz
$$

Importance sampling using another density $z \sim q_{z}$. Instead of approximating the integral by sampling over all z (most terms will be near zero), we want to sample z from a distribution whose samples place high likelihood $p_{\theta}(x \mid z)$ on x. The optimal choice is the posterior, $q^{*} = p_{\theta}(z \mid x)$.

$$
p_{\theta}(x) = \mathbb{E}_{z \sim p_{z}}\left[p_{\theta}(x \mid z)\right] = \int_{z} q_{z}(z)\, \frac{p_{z}(z)}{q_{z}(z)}\, p_{\theta}(x \mid z)\, dz = \mathbb{E}_{z \sim q_{z}}\left[\frac{p_{z}(z)}{q_{z}(z)}\, p_{\theta}(x \mid z)\right]
$$
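The effect of the proposal can be seen on a toy problem. The sketch below is only an illustration under assumed settings that are not from the text above: a 1-D model with prior $\mathcal{N}(0, 1)$ and likelihood $\mathcal{N}(x; z, s^{2})$, chosen because the evidence and the posterior are then known in closed form. The helper names are hypothetical.

```python
import math
import random

def normal_pdf(v, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at v."""
    return math.exp(-0.5 * ((v - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def naive_estimate(x, s, n, rng):
    """Average p(x|z) over z drawn from the prior N(0, 1)."""
    return sum(normal_pdf(x, rng.gauss(0.0, 1.0), s) for _ in range(n)) / n

def importance_estimate(x, s, n, rng, q_mean, q_std):
    """Average p(z) p(x|z) / q(z) over z drawn from the proposal q."""
    total = 0.0
    for _ in range(n):
        z = rng.gauss(q_mean, q_std)
        weight = normal_pdf(z, 0.0, 1.0) / normal_pdf(z, q_mean, q_std)
        total += weight * normal_pdf(x, z, s)
    return total / n

# Assumed toy model: p(z) = N(0, 1), p(x|z) = N(z, s^2).
x, s = 3.0, 0.1
exact = normal_pdf(x, 0.0, math.sqrt(1.0 + s * s))  # p(x) = N(x; 0, 1 + s^2)

# Exact posterior p(z|x) of this toy model, used as the optimal proposal q*.
post_mean = x / (1.0 + s * s)
post_std = s / math.sqrt(1.0 + s * s)

rng = random.Random(0)
naive = naive_estimate(x, s, 1000, rng)
optimal = importance_estimate(x, s, 1000, rng, post_mean, post_std)

print("exact p(x):          ", exact)
print("prior sampling:      ", naive)
print("posterior as proposal:", optimal)
```

With the posterior as proposal, every weighted sample equals $p_{\theta}(x)$, so the estimate has zero variance. Sampling from the prior mostly produces z values for which $p_{\theta}(x \mid z)$ is negligible.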

Use variational inference to approximate the posterior $p_{\theta}(z \mid x)$ with a model $q_{\phi}$ parametrized by $\phi$. The approximate posterior $q_{\phi}(z \mid x) = \mathcal{N}(f_{\phi}^{\mu}(x), f_{\phi}^{\Sigma}(x))$ uses an approximation function $f$ that outputs the parameters of the distribution, and can be seen as a probabilistic encoder.

Our goal will be to use the representation of the distribution $p(x)$ to derive a term called the Evidence Lower Bound (ELBO), which gives a lower bound on the evidence. The evidence is the log-likelihood of the observed data, $\log p(x)$. The ELBO gives a proxy objective that can be optimized with respect to a latent variable model, and in the best case (when the approximate posterior matches the true posterior), the ELBO exactly equals the evidence.

3 Objective

The estimated posterior $q_{\phi}(z \mid x)$ needs to be close to the true posterior $p_{\theta}(z \mid x)$. Closeness is measured with the reverse KL divergence $D_{KL}(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z \mid x))$.

Why use reverse KL?

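Doesn’t $D_{KL}(p \,\|\, q)$ measure how many extra bits are required to encode samples from p using a code built for q? Then why are we measuring the divergence from q to p, when we are approximating p with q? The practical reason is that the forward direction needs an expectation under $p_{\theta}(z \mid x)$, which we cannot sample from, while the reverse direction only needs samples from $q_{\phi}$.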

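$$
\begin{aligned}
D_{KL}(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z \mid x)) &= \mathbb{E}_{z \sim q_{\phi}}\left[\log q_{\phi}(z \mid x) - \log p_{\theta}(z \mid x)\right] \\
&= \mathbb{E}_{z \sim q_{\phi}}\left[\log q_{\phi}(z \mid x) - \log p_{\theta}(x \mid z) - \log p_{z}(z) + \log p_{\theta}(x)\right] && \text{(using Bayes' rule)} \\
&= \mathbb{E}_{z \sim q_{\phi}}\left[\log q_{\phi}(z \mid x) - \log p_{\theta}(x \mid z) - \log p_{z}(z)\right] + \log p_{\theta}(x) && \text{($\log p_{\theta}(x)$ is constant over $q_{\phi}$)} \\
&= \mathbb{E}_{z \sim q_{\phi}}\left[\log \frac{q_{\phi}(z \mid x)}{p_{z}(z)}\right] - \mathbb{E}_{z \sim q_{\phi}}\left[\log p_{\theta}(x \mid z)\right] + \log p_{\theta}(x) \\
&= D_{KL}(q_{\phi}(z \mid x) \,\|\, p_{z}(z)) - \mathbb{E}_{z \sim q_{\phi}}\left[\log p_{\theta}(x \mid z)\right] + \log p_{\theta}(x)
\end{aligned}
$$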
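The decomposition of the KL divergence can be checked numerically. The following sketch assumes a 1-D linear-Gaussian model (prior $\mathcal{N}(0, 1)$, likelihood $\mathcal{N}(x; z, s^{2})$) and an arbitrary Gaussian $q$, so that every term has a closed form. The model, the numbers, and the helper names are illustrative assumptions, not part of the derivation.

```python
import math

def log_normal_pdf(v, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) evaluated at v."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((v - mean) / std) ** 2

def kl_gauss(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) for 1-D Gaussians."""
    return math.log(s2 / s1) + (s1 * s1 + (m1 - m2) ** 2) / (2.0 * s2 * s2) - 0.5

# Assumed toy model: p(z) = N(0, 1), p(x|z) = N(z, s^2).
x, s = 1.5, 0.5
# An arbitrary approximate posterior q(z|x) = N(m, v^2).
m, v = 0.8, 0.6

# Closed-form evidence and true posterior of the toy model.
log_evidence = log_normal_pdf(x, 0.0, math.sqrt(1.0 + s * s))
post_mean = x / (1.0 + s * s)
post_std = s / math.sqrt(1.0 + s * s)

# Left-hand side: KL(q || true posterior).
kl_to_posterior = kl_gauss(m, v, post_mean, post_std)

# Right-hand side: KL(q || prior) - E_q[log p(x|z)] + log p(x).
kl_to_prior = kl_gauss(m, v, 0.0, 1.0)
expected_log_lik = (-0.5 * math.log(2.0 * math.pi * s * s)
                    - ((x - m) ** 2 + v * v) / (2.0 * s * s))
rhs = kl_to_prior - expected_log_lik + log_evidence

# The ELBO is the evidence minus the (non-negative) KL to the posterior.
elbo = expected_log_lik - kl_to_prior

print("KL(q || posterior):", kl_to_posterior)
print("right-hand side:   ", rhs)
print("ELBO:", elbo, "<= log p(x):", log_evidence)
```

Both sides agree up to floating-point error, and the gap between $\log p_{\theta}(x)$ and the ELBO is exactly the KL divergence to the true posterior.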

We can rearrange the terms to get the learning objective. To get the optimal parameters $\theta^{*}, \phi^{*}$, we want to minimize the KL divergence between the two distributions $p_{\theta}(z \mid x)$ and $q_{\phi}(z \mid x)$, and maximize the log-likelihood of generating real data, $\log p_{\theta}(x)$.

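$$
L(\theta, \phi) = -\log p_{\theta}(x) + D_{KL}(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z \mid x)) = -\underbrace{\mathbb{E}_{z \sim q_{\phi}}\left[\log p_{\theta}(x \mid z)\right]}_{\text{reconstruction term}} + \underbrace{D_{KL}(q_{\phi}(z \mid x) \,\|\, p_{z}(z))}_{\text{prior matching term}}
$$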

The negative of this loss, $-L$, is known as the variational lower bound or evidence lower bound (ELBO). For more details on why it is a variational bound, refer to this amazing post. Let’s understand what each term in the objective represents:

Reconstruction term: Measures how well we are able to convert a latent vector $z$ into an observation $x$.
Prior matching term: Measures how well the learned encoder $q_{\phi}$ matches our prior belief over the latent variables, $p(z)$.

The lower bound follows because the KL divergence is always non-negative: $L \geq -\log p_{\theta}(x)$, or equivalently $-L \leq \log p_{\theta}(x)$.

As noted in the assumptions, the encoder of the VAE is generally chosen to be a multivariate Gaussian with diagonal covariance, and the prior is chosen to be a distribution we can easily sample from, such as the standard multivariate Gaussian.

$$
\begin{aligned}
q_{\phi}(z \mid x) &= \mathcal{N}(z;\, \mu_{\phi}(x),\, \sigma_{\phi}^{2}(x) I) \\
p(z) &= \mathcal{N}(z;\, 0, I)
\end{aligned}
$$

While training the model, each training step proceeds as follows:

Compute the KL term using its closed-form solution.
To compute the likelihood term:
- Encode x using $f_{\phi}(x) = (\mu_{z}, \sigma_{z}^{2})$
- Sample z from the distribution $\mathcal{N}(\mu_{z}, \sigma_{z}^{2})$
- Decode z to obtain the parameters of the likelihood distribution, $g_{\theta}(z) = (\mu_{x}, \sigma_{x}^{2})$
- Measure the reconstruction error through the log-likelihood $\log p_{\theta}(x \mid z) = \log \mathcal{N}(x; \mu_{x}, \sigma_{x}^{2})$, which for fixed $\sigma_{x}^{2}$ equals $-\frac{(x - \mu_{x})^{2}}{2 \sigma_{x}^{2}}$ up to an additive constant
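The steps above can be sketched for a single data point. This is a minimal illustration, not a real model: the `encode` and `decode` functions are hypothetical element-wise affine maps standing in for the neural networks $f_{\phi}$ and $g_{\theta}$, and the parameter dictionaries are made up. The closed-form KL used here is the standard result for a diagonal Gaussian against $\mathcal{N}(0, I)$: $\frac{1}{2} \sum_{j} (\mu_{j}^{2} + \sigma_{j}^{2} - 1 - \log \sigma_{j}^{2})$.

```python
import math
import random

def encode(x, phi):
    """Hypothetical encoder f_phi: mean and variance of q(z|x), element-wise."""
    mu_z = [phi["w"] * xi + phi["b"] for xi in x]
    var_z = [math.exp(phi["log_var"]) for _ in x]
    return mu_z, var_z

def decode(z, theta):
    """Hypothetical decoder g_theta: mean and variance of p(x|z), element-wise."""
    mu_x = [theta["w"] * zi + theta["b"] for zi in z]
    var_x = [theta["var"] for _ in z]
    return mu_x, var_x

def gaussian_log_prob(x, mu, var):
    """log N(x; mu, diag(var)), summed over dimensions."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mu, var))

def kl_to_standard_normal(mu, var):
    """Closed-form KL( N(mu, diag(var)) || N(0, I) )."""
    return sum(0.5 * (m * m + v - 1.0 - math.log(v)) for m, v in zip(mu, var))

def negative_elbo(x, phi, theta, rng):
    """Single-sample estimate of L = -E_q[log p(x|z)] + KL(q(z|x) || p(z))."""
    mu_z, var_z = encode(x, phi)                                   # encode
    z = [m + math.sqrt(v) * rng.gauss(0.0, 1.0)                    # sample z
         for m, v in zip(mu_z, var_z)]
    mu_x, var_x = decode(z, theta)                                 # decode
    recon = gaussian_log_prob(x, mu_x, var_x)                      # log p(x|z)
    kl = kl_to_standard_normal(mu_z, var_z)                        # prior matching
    return -recon + kl, recon, kl

# Made-up parameters and data, for illustration only.
phi = {"w": 0.5, "b": 0.0, "log_var": -1.0}
theta = {"w": 2.0, "b": 0.0, "var": 0.25}
x = [1.0, -0.5]

loss, recon, kl = negative_elbo(x, phi, theta, random.Random(0))
print("reconstruction term:", recon)
print("prior matching term:", kl)
print("loss:", loss)
```

In practice the loss would be averaged over a minibatch and differentiated with an automatic differentiation library. This sketch only shows how the two terms are assembled.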

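We want to compute the gradient of $\mathbb{E}_{z \sim q_{\phi}(z \mid x)}\left[\log p_{\theta}(x \mid z)\right]$ with respect to the parameters $\phi$, but the gradient cannot flow through z because z is random, being sampled from the distribution. The reparameterization trick is to treat $\mu_{z}, \sigma_{z}$ as deterministic outputs of the encoder and move the randomness into an auxiliary noise variable, writing $z = \mu_{z} + \sigma_{z} \odot \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$. Note that the noise is scaled by the standard deviation $\sigma_{z}$, not the variance.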
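A small sketch of why this works: once z is written as a deterministic function of $\mu$, $\sigma$ and $\epsilon$, the chain rule gives per-sample gradients that can be averaged. The integrand $f(z) = z^{2}$ below is an assumed stand-in for $\log p_{\theta}(x \mid z)$, chosen only because $\mathbb{E}[z^{2}] = \mu^{2} + \sigma^{2}$ has known gradients $2\mu$ and $2\sigma$ to compare against.

```python
import random

def reparam_gradients(mu, sigma, n, rng):
    """Pathwise estimates of d/dmu and d/dsigma of E_{z ~ N(mu, sigma^2)}[z^2]."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)   # noise that does not depend on mu or sigma
        z = mu + sigma * eps        # reparameterized sample
        df_dz = 2.0 * z             # derivative of the toy integrand f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / n, grad_sigma / n

mu, sigma = 1.0, 0.5
g_mu, g_sigma = reparam_gradients(mu, sigma, 200_000, random.Random(0))

print("d/dmu    estimate:", g_mu, " analytic:", 2.0 * mu)
print("d/dsigma estimate:", g_sigma, " analytic:", 2.0 * sigma)
```

The same pattern applies to a VAE: the decoder log-likelihood replaces $f$, and the encoder outputs replace `mu` and `sigma`.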

After training the VAE, to use it as a generative model we can simply sample a latent variable $z \sim p(z)$ from the latent space and run it through the decoder. VAEs learn a compressed, low-dimensional representation of the high-dimensional data manifold when the dimension of $z$ is much smaller than that of $x$, and the output can be controlled by carefully editing the latent variable.

4 Hierarchical VAE

Introduced in [2, 3], the hierarchical VAE (HVAE) can be seen as a generalization of the VAE to multiple hierarchies of latent variables. In the general case, each latent can be conditioned on all previous latents in an arbitrary graphical model, and those latents are themselves generated from higher-level, more abstract latents.


As a simple example, take a Markovian chain of VAEs, where each latent variable $z_{t}$ in the decoding direction depends only on the previous one, $z_{t + 1}$. This can be seen as a recursive stack of VAEs, and we can write the joint distribution and the posterior as:

$$
\begin{aligned}
p(x, z_{1:T}) &= p(z_{T})\, p_{\theta}(x \mid z_{1}) \prod_{t=2}^{T} p_{\theta}(z_{t-1} \mid z_{t}) \\
q_{\phi}(z_{1:T} \mid x) &= q_{\phi}(z_{1} \mid x) \prod_{t=2}^{T} q_{\phi}(z_{t} \mid z_{t-1})
\end{aligned}
$$
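The factorization of the joint distribution maps directly onto ancestral sampling: draw $z_{T}$ from the prior, walk down the chain, then emit $x$. The sketch below assumes hypothetical linear-Gaussian conditionals, $p(z_{t-1} \mid z_{t}) = \mathcal{N}(a z_{t}, b^{2})$ and $p(x \mid z_{1}) = \mathcal{N}(z_{1}, s^{2})$, purely for illustration. In an actual HVAE these conditionals would be neural networks.

```python
import math
import random

def normal_log_pdf(v, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) evaluated at v."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((v - mean) / std) ** 2

def sample_joint(T, a, b, s, rng):
    """Ancestral sampling: z_T ~ N(0,1), z_{t-1} ~ N(a*z_t, b^2), x ~ N(z_1, s^2)."""
    zs = [0.0] * T                 # zs[i] holds z_{i+1}
    zs[T - 1] = rng.gauss(0.0, 1.0)
    for t in range(T, 1, -1):      # t = T, ..., 2
        zs[t - 2] = rng.gauss(a * zs[t - 1], b)
    x = rng.gauss(zs[0], s)
    return x, zs

def log_joint(x, zs, a, b, s):
    """log p(x, z_{1:T}) = log p(z_T) + log p(x|z_1) + sum_{t=2}^{T} log p(z_{t-1}|z_t)."""
    T = len(zs)
    total = normal_log_pdf(zs[T - 1], 0.0, 1.0) + normal_log_pdf(x, zs[0], s)
    for t in range(2, T + 1):
        total += normal_log_pdf(zs[t - 2], a * zs[t - 1], b)
    return total

# Made-up chain length and coefficients.
T, a, b, s = 3, 0.9, 0.5, 0.1
x, zs = sample_joint(T, a, b, s, random.Random(0))
print("latents z_1..z_T:", zs)
print("sample x:", x)
print("log p(x, z_{1:T}):", log_joint(x, zs, a, b, s))
```

With $T = 1$ the chain collapses to an ordinary VAE, with a single prior term and a single likelihood term.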

Similarly, the ELBO can be extended, with the final inequality following from Jensen's inequality:

$$
\begin{aligned}
\log p(x) &= \log \int p(x, z_{1:T})\, dz_{1:T} \\
&= \log \mathbb{E}_{q_{\phi}(z_{1:T} \mid x)}\left[\frac{p(x, z_{1:T})}{q_{\phi}(z_{1:T} \mid x)}\right] \\
&\geq \mathbb{E}_{q_{\phi}(z_{1:T} \mid x)}\left[\log \frac{p(x, z_{1:T})}{q_{\phi}(z_{1:T} \mid x)}\right]
\end{aligned}
$$

5 References

[1] Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[2] Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., & Winther, O. (2016). Ladder variational autoencoders. Advances in Neural Information Processing Systems, 29.
[3] Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., & Welling, M. (2016). Improved variational inference with inverse autoregressive flow. Advances in Neural Information Processing Systems, 29.
