1 Score Based Generative Models

We have shown in VDM that approximating the score function is analogous to predicting the posterior mean $μ_{q}$ or noise $ϵ$ . We can now look at another class of models: Score-based Generative models that directly learn the score function and then sample from the distribution using MCMC methods like Annealed Langevin sampling.

Recall, in EBMs, we represent an arbitrary distribution using Boltzmann energy function, $p_{θ} (x) = \frac{1}{Z ( θ )} exp (- f_{θ} (x))$ , where $f_{θ}$ is arbitrary, flexible and parametrizable function and $Z_{θ}$ is the normalizing or partition function written as $\int exp (- f_{θ} (x)) d x$ . One way to learn such a distribution is Maximum likelihood estimation but requires tackling the constant $Z_{θ}$ which can be computed using MC estimate of the samples, but that might be difficult to get for complex $f_{θ}$ functions.

To make likelihood training feasible, models restrict their architectures or approximate normalizing constant:

Causal convolutions in AR models: likelihood is written using chain rule that avoids normalizing constant
Normalizing flows: uses change-of-variables to solve the problem differently.
VAE: Avoid latent variable integral using ELBO maximization.
MCMC sampling: approximate expectation using real samples or the current best samples.

Learning the score function using neural network approximation $s_{θ}$ is one way to avoid modeling the normalization constant.

\nabla_{x} lo g p (x) = \nabla_{x} lo g exp (- f_{θ} (x)) - \nabla_{x} lo g Z_{θ} = - \nabla_{x} f_{θ} (x) \approx s_{θ} (x)

We can then compute the score either by directly modeling the unnormalized energy function $f_{θ}$ or approximating the score function with a neural network $s_{θ}$ . The neural network is then trained using the fisher divergence measure between estimated score and ground truth score function.

E_{p (x)} [∥ s_{θ} (x) - \nabla lo g p (x) ∥_{2}^{2}]

Why use fisher divergence and not other divergence measures?

Due to the flexibility of fisher divergence, $s_{θ}$ doesn’t need to be any normalized distribution, and simply compares the $ℓ_{2}$ distance between ground truth score and the learned model.

What does score function mean, what does it represent?

Score function is the gradient of the log likelihood of the data $x$ . Geometrically, it defines a vector field over the space of $x$ pointing towards the peaks.

add a plot for langevin dynamics and score function.

After learning the true distribution by modelling the score function, we can just sample from the distribution. Langevin sampling is an MCMC sampling procedure that enables drawing sample from the distribution using just the score function which we’ve learned as $s_{θ}$ , and arbitrary isotropic Gaussian noise. We update the values and iteratively follow the direction until a mode is reached.

x_{i + 1} \leftarrow x_{i} + ϵ \nabla_{x} lo g p (x) + 2 ϵ z_{i}, i = 1, 2, \dots, K

and $z_{i} \sim N (0, I)$ . As $ϵ \to 0$ and $K \to \infty$ , the sequence converges to the mode of the distribution. The extra noise term ensures that the procedure is not entirely deterministic and sufficient exploration is performed at each iteration.

For what scenarios, we can have access to the ground truth score function, and for what, do we not have?

In case, we do not have access, we can use score matching techniques to minimize fisher divergence. Combination of learning the score function using score matching, and sampling using Langevin dynamics is known as SCGM.
naive-score-matching

What are the three main problems of vanilla score matching?

From manifold hypothesis, we know that x lies on a low-dimensional manifold in a high-dimensional space. Computing log of the probability of the points not in low-dimensional manifold is undefined, which means, even if we have access to ground truth score function, it leads to numerical instability errors when learning on low-probability points.
Model is trying to estimate the expectation of L2 norm of difference between learned score function and ground truth. In low-density regions, where few data points are available, probability assigned to the point will be very low, and model will not gain any significant information from the input. Sampling from Langevin dynamics involves starting from a random point and iteratively following the score. Starting sample is highly likely to be in low-density regions. Due to inaccurate scores in those regions, model may never find the mode, and the final generated sample may be suboptimal.
Langevin sampling may not consider the weight of the mixture of distributions. Suppose the distribution is p(x)=cp1(x)+c2p2(x). Computing the score using gradient of log probability erases the weight of the distribution. The learned score function may then be agnostic to the different weights, and Langevin sampling then leads to a different peak irrespective of the strengths in the combined distribution.

How to mitigate this problem? How can the problem be summed in one sentence? The problem is due to lack of signal in low-density regions. This is equivalent of using an unseen image from the true distribution to the network.

The solution is quite simple, yet elegant, and follows the same process as VDM, i.e. to perturb data points with noise, and train a score based model on the noisy data. This fixes two of the three previously stated problems. Suppose, we perturb a high-dimensional distribution with additional Gaussian noise: $\tilde{x} = x + σ z, z \sim N (0, I)$ .

Due to the support of Gaussian being the entire space, a perturbed sample is no longer confined to high-density region, $p_{σ} (x) > 0, \forall x$ . Mathematically, this is equivalent of taking a convolution with a Gaussian kernel: $p_{σ} (\tilde{x}) = \int p (x) N (\tilde{x}; x, σ^{2} I) d x$
Adding Large Gaussian noise smoothens out the peaks of the distribution more aggressively, and model can get training signal from low-density areas.

To fix the third problem is a little tricky, because adding single big noise could alter the distribution and the model could potentially be learning an incorrect distribution altogether. The solution to this is actually, quite similar to VDMs again, i.e. add multiple noise at successive levels that result in intermediate distributions. Distributions with smaller noise respect the weighing coefficients in a mixture of distributions while larger noise the model can learn global structure to sufficiently guide the score matching objective.

Suppose, we always perturb the data with isotropic Gaussians.

Take L Gaussians with increasing standard deviations: $σ_{1} < σ_{2} < \dots < σ_{L}$ .
Perturb the data distribution $p (x)$ with each of the noisy Gaussian to obtain a sequence of more noisy distributions: $p_{σ_{i}} (x) = \int p (z) N (x; z, σ_{i}^{2} I) d z, i = 1, 2, \dots, L$
Drawing samples from each of the distribution is trivial, $x \sim p (x)$ , and computing $x + σ_{i} z, z \sim N (0, I)$
Train a neural network approximator to learn the score function for all noise level simultaneously. The objective is a weighted sum of fisher divergences of each noisy distribution $ar g min_{θ} \sum_{t = 1}^{T} λ (t) E_{p_{σ_{t}} (x_{t})} [∥ s_{θ} (x, t) - \nabla lo g p_{σ_{t}} (x_{t}) ∥_{2}^{2}]$

After training, produce samples by running Annealed Langevin dynamics which is just Langevin dynamics that runs for each distribution $t = T, T - 1, \dots, 1$ in sequence, and initialization for each sampling is the output of the previous sampler. You can note how the noise reduces with each new sampler, and the most recent sample is used as the initializer to carry-forward the information learned from previous step. This can be interpreted as reverse diffusion process of a VDM, where an isotropic noisy vector is gradually refined towards lesser noise levels.

Score matching techniques

Explicit score matching

Implicit score matching

Denoising score matching

sliced score matching

2 References

Generative Modeling by Estimating Gradients of the Data Distribution | Yang Song

Song, Yang, and Stefano Ermon. “Generative modeling by estimating gradients of the data distribution.” Advances in neural information processing systems 32 (2019).

3 Score Based Generative Modelling Using SDEs

Write a table that compares between diffusion and SDE formulation of diffusion and compares forward and reverse process with an SDE.

A quick primer on SDEs

Time reversal of diffusion/reverse SDE

Why do we want to reverse an SDE? What do we aim to get by tracing back the forward trajectory of an SDE?
- We can’t trace back the trajectory exactly, and get back the initial position, but what we can do aim is, at initial time $t_{0}$ , the particles should have the position as per the forward SDE’s initial distribution.
- Or even stronger assumption, at any point t, the position of the particles match the distribution of the forward trajectory.
- We say that distribution is matched when the probability mass is conserved as the process progresses.
- Andersen’s theorem provides the solution to reverse an SDE and is mathematically stated as $d x = [f (x, t) - g (t)^{2} \nabla_{x} lo g p_{t} (x)] d t + g (t) d \overset{ˉ}{w}$ , where $\overset{ˉ}{w} = w_{t - 1} - w_{t}$ .

Take an ODE, reverse it using sign change

sampling methods for ODE
- first-order method: Euler’s method
- second-order method: Runge-Kutta

For an SDE,

sample using Euler-Maruyama method
reverse using Andersen’s theorem
Ornstein-Uhlenbeck: linear drift + constant diffusion.

We can take the DDPM process and write it in the form of continuous time SDE. To see the process, let’s start with forward diffusion: $x_{i} = 1 - β_{i} x_{i - 1} + β_{i} z_{i - 1}$ , where $z_{i} \sim N (0, I)$ . Any iterative process can bee converted into an ODE. For an SDE conversion, we’ll follow similar approach.

For N-step diffusion process:

Define a step size $Δ t = \frac{1}{N}$
Consider auxiliary noise level ${\overset{ˉ}{β}_{i}}_{i = 1}^{N}$ where $β_{i} = \frac{β ˉ _{i}}{N}$ . We define $β_{i}$ this way because as $lim_{N \to \infty} \frac{β ˉ _{i}}{N} \approx β (t) Δ t$ .
Similarly define $x_{i} = x (\frac{i}{N}) = x (t + Δ t)$ and $z_{i} = z (\frac{i}{N}) = z (t + Δ t)$
Following the diffusion process,

x_{i} x (t + Δ t) = 1 - β_{i} x_{i - 1} + β_{i} z_{i - 1} = 1 - β (t + Δ t) Δ t x (t) + β (t + Δ t) Δ t z (t) \approx (1 - \frac{1}{2} β (t + Δ t) Δ t) x (t) + β (t + Δ t) Δ t z (t) \approx x (t) - \frac{1}{2} β (t) Δ t x (t) + β (t) Δ t z (t) (Putting β_{i} = \overset{ˉ}{β}_{i} / N and substituting) (substituting β (t + Δ t) \to β (t))

As $Δ t \to 0$ , $z (t) = w (t + Δ t) - w (t) = d w$

d x = - \frac{1}{2} β (t) x (t) d t + β (t) d w

Reverse sampling equation of DDPM can be written as SDE as

d x = - β (t) [\frac{x}{2} + \nabla_{x} lo g p_{t} (x)] d t + β (t) d \overset{ˉ}{w}

Following similar iterative update scheme, this can be converted into DDPM reverse sampling equation: $x_{i - 1} \approx \frac{1}{1 - β _{i}} [x_{i} + \frac{β _{i}}{2} \nabla_{x} lo g p_{i} (x_{i})] + β_{i} z_{i}$ .

Why DDPM SDE is called Variance-Preserving SDE?

Why SMLD SDE is called Variance exploding SDE?

lonerapier.me

Table of Contents

Backlinks

Explorer

Score Matching

1 Score Based Generative Models

2 References

3 Score Based Generative Modelling Using SDEs