Diffusion Models and (Many) Differential Equations | Katie Keegan

Reading about the math of diffusion models gives you a lot of DEs. The most common one (in this context) might be stochastic differential equations (SDEs) - most papers on score-based diffusion models (SBDMs) will tell you something towards the beginning about a forward and reverse SDE when introducing how diffusion models work in continuous time. You’ll then maybe hear something about a Fokker-Planck PDE. Finally, if you go down enough of a rabbit hole, or if you’re studying for a generative modeling exam (not that I’d know anything about that), or if you’re into consistency models lately :), you might have heard something about probability flow ODEs (PF-ODEs).

At least to me, the relationship between all of these things is not immediately clear. Unless you made it further in graduate-level probability theory than I did (which is actually quite likely - I lasted a month of auditing before sneaking out), the idea that these concepts all describe the same thing from different perspectives might not be very obvious. For my own sake, I’ve written down my understanding of the relationship between all of these here. This is done at a moderately high level, with links at each step to the mathematical derivations that I chose not to include here. I hope these will be useful to others!

SDEs

As a quick note, I first thought that SDEs would be the “hardest” of the three DEs to talk about, and that I’d build up in the ODE $\to$ PDE $\to$ SDE order. It turns out that the other way around is much easier (with a slight return to SDE territory at the end for completeness). Hopefully I can convince you to agree with me :)

First, let’s break down what we want. Diffusion models learn to transform between a nice, tractable, easy-to-sample distribution (which will pretty much always be a standard Gaussian) and something complex, intractable, and impossible to directly sample from. This transformation needs calculus. The calculus we know and love is deterministic. But, by definition, we are trying to learn a distribution, so we need some kind of “probabilistic calculus.”

This “probabilistic calculus” is called stochastic calculus, and people have cared about it for a long time. If you’re a stock (stonk???) trader, stochastic calculus is quite important in forecasting financial quantities while accounting for their inherent (seemingly random) variability.

There is a very nice tutorial here which explains what a stochastic process is and shows some cool examples. If you’re willing to accept that stochastic calculus exists and that it has built up nice ways of dealing with the evolution of random variables in time, let’s move on.

In diffusion models, we design a very simple, first-order SDE (specifically, an Itô SDE) which transforms our data $x \in \mathbb{R}^{d}$ in a stochastic way. We choose this forward SDE to look like this:

\[dx = f(x,t)dt + g(t)dW_{t}\]

We define our SDE to start at some time $t=0$ and stop at some $t=T$. The only randomness in this whole differential equation is in $W_{t}$, which just describes standard Brownian motion in $\mathbb{R}^{d}$ (colloquially, this is a fairly simple way of imposing a standard amount of randomness). We scale the influence of this randomness with $g(t)$, which is the (usually scalar) diffusion coefficient. Here, $f(x,t)$ is a deterministic drift (no randomness here). I hope this sounds less scary than SDEs might seem - after all, we’re just saying that when we nudge $x$ a bit, this nudging will happen according to some deterministic drift and some scaled randomness.
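To make the “nudging” concrete, here is a minimal sketch of simulating the forward SDE with the Euler-Maruyama scheme (about the simplest way to discretize an SDE). The specific drift $f(x,t) = -\frac{1}{2}\beta x$ and diffusion $g(t) = \sqrt{\beta}$ with a constant $\beta$ are illustrative assumptions (a variance-preserving-style choice), and the names `euler_maruyama` and `BETA` are hypothetical, not taken from any particular paper or library.

```python
import numpy as np

def euler_maruyama(x0, f, g, T=1.0, n_steps=500, rng=None):
    """Simulate dx = f(x,t) dt + g(t) dW_t from t=0 to t=T (Euler-Maruyama)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    dt = T / n_steps
    for k in range(n_steps):
        t = k * dt
        # Brownian increment over a step of length dt is N(0, dt)
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)
        # deterministic drift nudge + scaled random nudge
        x = x + f(x, t) * dt + g(t) * dW
    return x

# Assumed (illustrative) coefficients: f = -0.5*beta*x, g = sqrt(beta)
BETA = 2.0
f = lambda x, t: -0.5 * BETA * x
g = lambda t: np.sqrt(BETA)

rng = np.random.default_rng(0)
x0 = np.full(20000, 3.0)  # every particle starts at x = 3
xT = euler_maruyama(x0, f, g, T=1.0, n_steps=500, rng=rng)

# For this linear SDE the exact answer is known:
# mean = 3*exp(-beta*T/2), variance = 1 - exp(-beta*T)
print("sample mean:", xT.mean(), "exact:", 3.0 * np.exp(-1.0))
print("sample var: ", xT.var(), "exact:", 1.0 - np.exp(-2.0))
```

Each particle follows its own random path, but the cloud of particles as a whole drifts towards zero and spreads out in a very predictable way.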

This gives us a way of describing the dynamics of $x$ (equivalent mathematically to a particle in $\mathbb{R}^{d}$). This assumes that $x$ is just some point in $\mathbb{R}^{d}$. The whole point of generative modeling, though, is that $x$ doesn’t exist in a vacuum. Rather, it is a sample from a data distribution, $x \sim p(x)$, where $p(x)$ is the data’s probability density. This introduces another probabilistic component - but strangely enough, brings us back to PDE language.

PDEs

We’ve imposed some stochastic forward dynamics of our particle $x$ over time. However, if we think of evolving all of the particles over our data density $p(x)$ in time, a PDE naturally falls out. Let’s see why.

It intuitively makes sense that when we evolve each particle $x \sim p(x)$ in time according to the above SDE, we will obtain a new distribution, $p_{t}(x)$, at each time $t$. Whenever we apply any transformation to a probability distribution, however, we have to remember a crucial ground rule: probability distributions always integrate to 1. Otherwise, it wouldn’t be a proper probability distribution! This means that if we think of each particle in $p_{t}(x)$ as some tiny amount of mass, all that the SDE should be allowed to do is move all of this (probability) mass around - not create or destroy it. This condition is a conservation law, very loosely given by

\[\partial_{t} p_{t}(x) = -(\textup{outflow of mass due to drift}) + (\textup{random spreading due to noise $W_{t}$}).\]

This is precisely what the famous Fokker-Planck PDE does! There is some detailed mathematical work which is necessary to get from the general Itô process/SDE above to the Fokker-Planck PDE (FP-PDE). Some good mathematical references which accomplish this are here: a blog post, and a paper which focuses on the FP-PDE as the link between the diffusion SDE and the PF-ODE.

With this, our PDE becomes

\[\partial_{t}p_{t}(x) = -\nabla \cdot (f(x,t) \cdot p_{t}(x)) + \frac{1}{2} g(t)^{2} \nabla^{2}p_{t}(x).\]

Some things can be notationally simplified here. First of all, we can write $\nabla^{2}p_{t}(x) = \Delta p_{t}$, which is the Laplacian. We can also drop a few $x$’s and $t$’s (and write $g_{t}$ for $g(t)$) since we hopefully are pretty comfortable with each function’s dependence on each of these variables by now. Our simplified version is now

\[\partial_{t}p_{t} = - \nabla \cdot (f p_{t}) + \frac{1}{2} g_{t}^{2} \Delta p_{t}.\]
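If you’d like to see the FP-PDE hold on an actual example, here is a small numerical sanity check in one dimension. It assumes the illustrative coefficients $f(x,t) = -\frac{1}{2}\beta x$ and $g_{t} = \sqrt{\beta}$ with Gaussian initial data $\mathcal{N}(m_{0}, v_{0})$ (all assumptions for this sketch, not something derived in the post), for which $p_{t}$ stays Gaussian with a known mean and variance. The sketch compares both sides of the PDE using finite differences.

```python
import numpy as np

# Assumed (illustrative) setup: f = -0.5*beta*x, g = sqrt(beta), p_0 = N(M0, V0)
BETA, M0, V0 = 2.0, 3.0, 0.25

def p(x, t):
    """Closed-form marginal density p_t for this linear SDE."""
    m = M0 * np.exp(-0.5 * BETA * t)          # mean decays towards 0
    v = 1.0 + (V0 - 1.0) * np.exp(-BETA * t)  # variance relaxes towards 1
    return np.exp(-(x - m) ** 2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)

x = np.linspace(-6.0, 8.0, 1401)
dx = x[1] - x[0]
t, dt = 0.5, 1e-4

# Left-hand side: time derivative of p_t (central difference in t)
lhs = (p(x, t + dt) - p(x, t - dt)) / (2.0 * dt)

# Right-hand side: -d/dx (f p) + 0.5 * g^2 * d^2 p / dx^2
f = -0.5 * BETA * x
pt = p(x, t)
rhs = -np.gradient(f * pt, dx) + 0.5 * BETA * np.gradient(np.gradient(pt, dx), dx)

residual = np.max(np.abs(lhs - rhs))
scale = np.max(np.abs(lhs))
print("relative residual:", residual / scale)
```

The residual is not exactly zero because of finite-difference error, but it should be tiny compared to the size of either side.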

ODEs

If you’re reading this, you’ve probably at least heard of score-based diffusion models (SBDMs). You may even know the definition of the score:

\[s(x,t) = s_{t}(x)= \nabla_{x}\log(p_{t}(x)).\]
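As a purely illustrative example, suppose that $p_{t}$ happens to be an isotropic Gaussian with mean $\mu_{t}$ and variance $\sigma_{t}^{2}$ (these symbols are assumptions introduced only for this aside). Then the score has a simple closed form:

\[\log p_{t}(x) = -\frac{\|x-\mu_{t}\|^{2}}{2\sigma_{t}^{2}} + \textup{const} \implies s_{t}(x) = \nabla_{x}\log p_{t}(x) = -\frac{x-\mu_{t}}{\sigma_{t}^{2}}.\]

In words, the score at $x$ points from $x$ back towards the mean (where the density is higher), and it is larger when the distribution is narrower.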

If you’re already familiar with SBDMs, you know that this score $s_{t}$ is particularly crucial in the reverse SDE, as this is how one ultimately samples from the complex distribution. The score also shows up once we rewrite the FP-PDE associated with the forward SDE (as we will show below), and its role becomes even more apparent as we move into PF-ODE territory.

By simply applying a derivative rule from standard calculus (the chain rule, applied to the logarithm) and rearranging, we may note

\[s_{t} = \frac{\nabla_{x}p_{t}}{p_{t}} \implies \nabla_{x} p_{t} = s_{t} \cdot p_{t}.\]

From this, we may write the Laplacian in terms of the score:

\[\Delta p_{t} = \nabla_{x} \cdot (\nabla_{x} p_{t}) = \nabla_{x} \cdot (p_{t} \nabla_{x} \log p_{t}) = \nabla_{x} \cdot (s_{t}p_{t}).\]

We now do a little bit of term-collecting and rearranging of the FP-PDE:

\[\begin{align*} \partial_{t}p_{t} &= - \nabla_{x} \cdot (f p_{t}) + \frac{1}{2}g_{t}^{2} \Delta p_{t}\\ &= - \nabla_{x} \cdot (fp_{t}) + \frac{1}{2}g_{t}^{2} \, \nabla_{x} \cdot (s_{t} p_{t})\\ &= -\nabla_{x} \cdot \left [ (f-\frac{1}{2}g_{t}^{2} s_{t}) \cdot p_{t} \right ]. \end{align*}\]

Remember that we assume $g_{t}$ does not depend on $x$, so moving it inside a divergence which is computed with respect to $x$ is fine.

Now, if I were a real mathematician, this is the part where I say that “the astute reader will notice that this is a continuity equation.” Unfortunately, I don’t think I am a real mathematician, and I’m certainly not an astute reader (as this connection was certainly not immediate to me). I’d rather say something like this: PDE folks were kind enough to notice or develop a nice link between a frequently-used equation called the continuity equation in their field (there’s even a Wikipedia page!) and the forward time evolution of probability distributions, which we can now exploit (mwahaha!) for generative modeling.

In general, a continuity equation will follow the form

\[\partial_{t} \rho_{t} + \nabla_{x} \cdot \left [ v(x,t) \cdot \rho_{t} \right ]= 0,\]

where $\rho_{t}$ is often referring to a fluid density in classical PDE applications.

So, since our FP-PDE was rearranged and simplified to

\[\partial_{t}p_{t} + \nabla_{x} \cdot \left [ (f-\frac{1}{2}g_{t}^{2} s_{t}) \cdot p_{t} \right ] = 0,\]

we observe (hopefully very little astuteness required!) that our rearrangement follows a continuity equation form with a velocity field of the form

\[v(x,t) = (f-\frac{1}{2}g_{t}^{2} s_{t}).\]

This velocity field defines an ordinary differential equation (ODE)! A particle carried along by it obeys

\[\frac{dx}{dt} = v(x,t) = f(x,t) - \frac{1}{2}g(t)^{2}\nabla_{x}\log p_{t}(x).\]

In particular, this is the probability flow ODE (PF-ODE), which interprets the probability density in the same way that one would interpret a fluid density and evolves it accordingly. Our drift and diffusion coefficients are still here, but they are all contained within this velocity field. All of this only describes a nice way of building up the connections between the forward SDE, the FP-PDE, and the PF-ODE. Finally, let us wrap things up by returning to the reverse-time SDE.
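Here is a minimal sketch of what the PF-ODE does in practice: particles move deterministically along the velocity field $f - \frac{1}{2}g_{t}^{2}s_{t}$, and yet their distribution should follow the same $p_{t}$ as the noisy forward SDE. As an assumption for illustration, it uses $f(x,t) = -\frac{1}{2}\beta x$, $g_{t} = \sqrt{\beta}$, and Gaussian initial data, so that the score is available in closed form; in a real diffusion model the `score` function would instead be a trained network.

```python
import numpy as np

# Assumed (illustrative) setup: f = -0.5*beta*x, g = sqrt(beta), p_0 = N(M0, V0)
BETA, M0, V0, T = 2.0, 3.0, 0.25, 1.0

def mean(t):
    return M0 * np.exp(-0.5 * BETA * t)

def var(t):
    return 1.0 + (V0 - 1.0) * np.exp(-BETA * t)

def score(x, t):
    # Closed-form Gaussian score; stands in for a learned score network
    return -(x - mean(t)) / var(t)

def velocity(x, t):
    # v = f - 0.5 * g^2 * s
    return -0.5 * BETA * x - 0.5 * BETA * score(x, t)

rng = np.random.default_rng(0)
x = M0 + np.sqrt(V0) * rng.standard_normal(20000)  # samples from p_0

n_steps = 1000
dt = T / n_steps
for k in range(n_steps):
    # Plain Euler step: no noise is injected anywhere in this loop
    x = x + velocity(x, k * dt) * dt

print("sample mean:", x.mean(), "target:", mean(T))
print("sample var: ", x.var(), "target:", var(T))
```

The only randomness is in the initial samples; after that, each particle follows a smooth, deterministic path, and the sample statistics at $t=T$ should still match those of $p_{T}$.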

SDE (The Sequel)

As a brief aside, diffusion modeling papers will often start off the background section with both the forward and reverse SDEs. Something I wondered when I started reading these papers more critically was this: “There is clearly a lot of nice math that comes out of the forward SDE, and it’s awfully nice that we have a closed-form equation describing the reverse-time SDE. Do all forward SDEs have a corresponding reverse-time SDE? Is the world really that nice?”

Sadly, it is not. However, the folks who designed diffusion models (and all of the folks who have studied their mathematical properties afterwards) have ensured that this is the case for the SDEs used here. One can show that the reverse-time SDE one sees so often in these papers is truly valid by imposing some terminal time $t=T$, reformulating the time variable so that it operates in reverse time, $t = T-s$, and setting the forward and reverse Fokker-Planck PDEs to be equal to each other. This blog post does an excellent job of doing this. Alternatively, one can derive the reverse-time SDE from scratch. Plenty of resources do this as well, although these require a bit more knowledge of continuous-time Markov chain and Markov process properties. However, the reverse SDE that is postulated to begin with should be extremely familiar, now that we’ve derived the PF-ODE. Let’s take a look (we’re just presenting this as it is written in here, not deriving anything):

\[d x = \left[f(x,t) - g(t)^{2}\nabla_{x}\log p_{t}(x)\right]dt + g(t)d\bar{W}_{t},\]

where $\bar{W}_{t}$ is now the reverse-time Wiener process. The score $s_{t} = \nabla_{x}\log p_{t}(x)$ has appeared yet again!

The reverse-time SDE starts its simulation at the tractable Gaussian distribution and runs time backwards from the terminal time $t=T$ to the distribution of interest. Other than the score, everything in this SDE is something that we designed: we impose $f(x,t)$ and $g(t)$ based on our own choices (e.g. whether we choose a variance-exploding or variance-preserving SDE), and $\bar{W}_{t}$ is something that we know and can simulate. The score is the one part that we must fit to data (or at least to the data generated by propagating each data sample $x$ along a path induced by the forward SDE).
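To see the reverse-time SDE actually turn Gaussian noise back into “data,” here is a minimal sampling sketch using Euler-Maruyama steps taken backwards in time. It assumes the illustrative choices $f(x,t) = -\frac{1}{2}\beta x$ and $g(t) = \sqrt{\beta}$, and pretends that the data distribution is the one-dimensional Gaussian $\mathcal{N}(3, 0.25)$ so that the true score is known in closed form. In a real SBDM, the `score` function below would be replaced by a network fit to data.

```python
import numpy as np

# Assumed (illustrative) setup: f = -0.5*beta*x, g = sqrt(beta), data ~ N(M0, V0)
BETA, M0, V0, T = 2.0, 3.0, 0.25, 5.0

def mean(t):
    return M0 * np.exp(-0.5 * BETA * t)

def var(t):
    return 1.0 + (V0 - 1.0) * np.exp(-BETA * t)

def score(x, t):
    # Closed-form Gaussian score; stands in for a learned score network
    return -(x - mean(t)) / var(t)

rng = np.random.default_rng(0)
n_steps = 2000
dt = T / n_steps

# Start at t = T from a standard Gaussian, which is very close to p_T here
x = rng.standard_normal(20000)

for k in range(n_steps):
    t = T - k * dt
    g = np.sqrt(BETA)
    # Reverse-time drift: f - g^2 * score
    drift = -0.5 * BETA * x - g**2 * score(x, t)
    z = rng.standard_normal(x.shape)
    # Step from t to t - dt: subtract the drift, add fresh noise
    x = x - drift * dt + g * np.sqrt(dt) * z

print("sample mean:", x.mean(), "data mean:", M0)
print("sample var: ", x.var(), "data var: ", V0)
```

Everything in the loop except `score` is something we chose ourselves, which is exactly why learning the score is the whole game.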

Conclusion

To wrap things up, we’ve shown (at a very high level, with lots of pointers towards resources for those who really want to get in the weeds of the math) a few things about diffusion models. First, we’ve shown that the famous forward diffusion SDE (which is a relatively simple one, at least in the world of SDEs) on particles can extend to a Fokker-Planck PDE over probability distributions. Then, with some rearranging, we can interpret the FP-PDE as a continuity equation whose velocity field defines a probability flow ODE (PF-ODE), which relies on the score of the distribution $p_{t}$. Finally, we revisited SDE World and noticed that many of the elements which arose out of the PF-ODE are (hopefully as expected) crucial in the reverse sampling case. In all cases, hopefully it is clear how mathematically important learning the score is in these models (and why they are, after all, called score-based diffusion models, or SBDMs): it gives us the velocity field for the PF-ODE and the drift of the reverse-time SDE.

I hope this is useful, and thank you for taking the time to read this!