Motivating Energy-Based Models from Statistical Physics
Posted on August 20, 2025
Depending on who you talk to, Energy-Based Models (EBMs) are the next big thing (or have already been the current big thing???) in generative modeling. However, they struggle painfully in higher dimensions.
When you first read about them, you’ll see some references to topics from statistical physics and things like the Boltzmann distribution. This got me curious - where exactly does the physics motivation for EBMs come from, and why is that formulation useful there? If they’ve been useful since the 1800s (!!!) in physics, what makes them intractable (at least as written) in generative modeling applications?
The Physics Origin: Canonical Ensembles
To answer the above questions, we first need to look at a topic called canonical ensembles. Energy-based models are sometimes referred to as something like “learning by canonical ensemble.” This is something that was a little confusing to me - the term “ensemble” usually means something a bit different in machine learning, so I initially thought we were training a group of models. It turns out that this is just a terminology difference between fields: in physics, people might use the term “ensemble” to mean “probability space” (which is pretty nonintuitive to me, at least!). There’s a few notes about this jargon difference around the internet.
Historically, the notion of a canonical ensemble was first studied by Boltzmann in 1884 and later formalized by Gibbs in 1902 (both of these names will come up plenty throughout this post). This was during a pretty important few decades in the history of thermodynamics and the study of how systems exchange energy with their surroundings. The canonical ensemble, in particular, emerged as a way to assign probabilities to the different possible states of a system with $N$ particles, volume $V$, and a heat bath at absolute temperature $T$. I like the way that these lecture notes put it:
“In statistical mechanics, we study the possible microstates of a system. We never know exactly which microstate the system is in. Nor do we care. We are interested only in the behavior of a system based on the possible microstates it could be in, that share some macroscopic property (like volume V, energy E, or number of particles N). The possible microstates a system could be in are known as the ensemble of states for a system.”
The canonical ensemble fixes $N$, $V$, and $T$, but allows the total energy $E$ to vary. As written by Boltzmann, the canonical ensemble (or equivalently, the probability space under these fixed parameters and conditions) is given by
\(P_{i} = \frac{e^{-\frac{E_{i}}{k_{B}T}}}{Z},\) where $E_{i}$ is the energy of the $i$-th possible microstate of the system, $k_{B}$ is Boltzmann’s constant, and $Z$ is an incredibly important (like lowkey the reason for this blog post) function called the partition function. In this context, $Z$ is defined as follows: \(Z = \sum_{i=1}^{n}e^{-\frac{E_{i}}{k_{B}T}}.\) The partition function simply sums the numerator over every possible microstate, so its role here is just that of a normalizing constant: dividing by it lets $P_{i}$ be properly interpreted as a probability distribution. In general, a probability distribution of this form
\(p(x) := \frac{e^{-E(x)}}{Z}, Z := \int_{x \in \mathcal{X}} e^{-E(x)} dx\) is called the Boltzmann distribution. Here, $\mathcal{X}$ represents the space of all possible configurations $x$ which could be passed into the numerator.
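As a quick sanity check, here’s a minimal numpy sketch of the discrete version of this (the energies themselves are made up): compute the Boltzmann weights, sum them to get $Z$, and divide.

```python
import numpy as np

# Hypothetical energies (in units of k_B * T) for four microstates.
energies = np.array([0.0, 1.0, 1.0, 3.0])

# Unnormalized Boltzmann weights e^{-E_i / (k_B T)}.
weights = np.exp(-energies)

# The partition function is just the sum of the weights...
Z = weights.sum()

# ...so dividing by it turns the weights into honest probabilities.
probs = weights / Z

print(np.isclose(probs.sum(), 1.0))  # True: properly normalized
print(int(probs.argmax()))           # 0: the lowest-energy state is most likely
```

Note that the two states with equal energy get equal probability - the energy function is the only thing that matters.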
This formulation ended up being an extremely convenient and flexible function in a lot of areas. For example, a Gaussian distribution can also be represented as a Boltzmann distribution! As a reminder, a Gaussian looks like
\(p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{1}{2\sigma^{2}}(x - \mu)^{2}}.\) The energy function here is $E(x) = \frac{1}{2\sigma^{2}}(x - \mu)^{2}$, and one can compute (I’d say this is left as an exercise to the reader if I was feeling sassy) \(Z = \int_{x \in \mathcal{X}} e^{-\frac{1}{2\sigma^{2}}(x - \mu)^{2}} dx = \sqrt{2 \pi \sigma^{2}}.\)
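If you don’t feel like doing that integral by hand, a small numpy sketch can brute-force it on a grid and check it against the closed form (the particular $\mu$ and $\sigma$ here are arbitrary):

```python
import numpy as np

mu, sigma = 0.5, 2.0

# Energy function read off from writing the Gaussian in Boltzmann form.
def energy(x):
    return (x - mu) ** 2 / (2 * sigma ** 2)

# Brute-force the normalizer: integrate e^{-E(x)} on a wide grid (the tails
# beyond 10 standard deviations contribute essentially nothing).
xs = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 400_001)
dx = xs[1] - xs[0]
Z_numeric = np.exp(-energy(xs)).sum() * dx

Z_closed_form = np.sqrt(2 * np.pi * sigma ** 2)
print(np.isclose(Z_numeric, Z_closed_form))  # True
```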
While looking this up, I learned that there is actually a nice physical connection between Gaussians (or at least the energy function you get when viewing them as Boltzmann distributions) and the harmonic oscillation of springs. There’s lots of material about this on the internet, but there is a concise and interesting comment about it in this StackExchange thread.
Gaussians aren’t the only example of Boltzmann distributions in the wild. In fact, they have been used to model even more complex probabilistic phenomena, from quantum physics to economics. In these cases, the partition function $Z$ unfortunately becomes harder and harder to compute (Gaussians, while ubiquitous, are quite idealized as far as probability distributions go). Luckily, other fields (or the same field? Names like Gibbs show up in both thermodynamics as well as in statistical sampling techniques) have developed ways to approximate tricky integrals like $Z$, and so methods such as Monte Carlo sampling are traditionally used to turn what would otherwise be an intractable integral into a tractable (or at least approximate-able? Is that a word? Approximable?) one.
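Here’s a rough numpy sketch of that Monte Carlo idea, using importance sampling with a standard normal proposal to estimate $Z$ for a toy quartic energy (both the energy and the proposal are just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# A non-Gaussian toy energy with no obvious closed-form partition function.
def energy(x):
    return x ** 4 - x ** 2

# Importance sampling: Z = E_q[ e^{-E(x)} / q(x) ] for x ~ q, where q is a
# standard normal that covers the low-energy region.
def log_q(x):
    return -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)

samples = rng.standard_normal(200_000)
Z_mc = np.mean(np.exp(-energy(samples) - log_q(samples)))

# Sanity check against brute-force quadrature on a grid.
xs = np.linspace(-5.0, 5.0, 100_001)
Z_grid = np.exp(-energy(xs)).sum() * (xs[1] - xs[0])
print(abs(Z_mc - Z_grid) / Z_grid < 0.05)  # True: the estimates agree
```

In one dimension the grid check is of course the easier route - the Monte Carlo estimator earns its keep when there is no feasible grid, which is exactly where we’re headed.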
The Generative Modeling Technique: Energy-Based Models
Naturally, the generative modelers got ahold of Boltzmann distributions and canonical ensembles. It’s hopefully not too hard to see the initial attraction of this framework: write down any unnormalized density you like (via an energy function), figure out how to normalize it, and you immediately have a probability distribution! In particular, our parameterization of $p(x)$ would look like this:
\[p_{\theta}(x) = \frac{e^{-E_{\theta}(x)}}{Z_{\theta}},\]where
\(Z_{\theta} = \int_{x \in \mathcal{X}} e^{-E_{\theta}(x)} dx\) and \(E_{\theta}(x) = \textup{NeuralNetwork}_{\theta}(x).\) At first glance, this is nothing but the earlier Boltzmann distribution with the energy function parameterized by a neural network. This may be true, but computing the denominator $Z_{\theta}$ has suddenly become far more difficult. To get this integral, we’d need to evaluate the network at every single possible value that $x$ could take, which is impossible! Since integrating over all of the possible inputs to the neural network is intractable, one could reasonably ask: well, we’ve had tough integrals before in other fields! Why can’t the solutions to those problems, like MC-based approaches for approximate integral estimation, work well here?
The answer is twofold. First, MC-based methods don’t scale too nicely to higher dimensions. For things like image generation, where even a very small image lives in a vector space of at least 1000 dimensions, approximating this integral quickly becomes infeasible.
Second, even if we could somehow get a good approximation of the integral, we must remember that we’re trying to optimize $p_{\theta}(x)$. That means that every time we update the parameters $\theta$, we would have to recompute the $Z_{\theta}$ which corresponds to our updated $e^{-E_{\theta}(x)}$ in the numerator. Ultimately, this means that we cannot do (standard) maximum likelihood estimation (MLE) to “fit” a $p_{\theta}(x)$ to the true $p(x)$ (or at least what we can figure out about it from our data). This necessitates some careful thought on how to efficiently train an EBM.
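To make the asymmetry concrete, here’s a toy numpy sketch (the tiny tanh network standing in for $E_{\theta}$ is entirely made up): evaluating the unnormalized density is one cheap forward pass, but even a laughably crude grid for $Z_{\theta}$ is astronomically large.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000  # dimension of x, roughly the size of a very small image

# A toy two-layer "neural network" energy E_theta(x); the random weights
# stand in for the parameters theta.
W1, b1 = rng.standard_normal((128, d)) * 0.01, np.zeros(128)
w2 = rng.standard_normal(128) * 0.01

def energy(x):
    h = np.tanh(W1 @ x + b1)  # hidden layer
    return float(w2 @ h)      # scalar energy

# The unnormalized density e^{-E_theta(x)} is one cheap forward pass...
x = rng.standard_normal(d)
unnormalized = np.exp(-energy(x))

# ...but Z_theta requires integrating over ALL of R^d. Even a crude grid with
# just 10 points per coordinate would need 10^1000 energy evaluations.
grid_points_needed = 10 ** d
print(unnormalized > 0)  # True: the numerator is easy; the denominator is not
```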
Pa(rt)i(tio)n-free Training Strategies
So, as we’ve seen, we can’t do our usual MLE and minimization of the Kullback-Leibler (KL) divergence between our estimated and true probability distributions. However, generative modelers have an arsenal of familiar techniques that you end up seeing quite often. First, let’s consider the MCMC case and see how we can make it more amenable to the objective of learning the data distribution. Remember that we wish to maximize the likelihood over the data distribution. “Maximize” suggests “gradient!” Also, since logarithms make the math nice, let’s throw one of those in there as well. To sum up (so to speak), the gradient with respect to the parameters $\theta$ of the log-likelihood decomposes as
\(\nabla_{\theta}\log p_{\theta}(x) = \nabla_{\theta}\log\left(\frac{e^{-E_{\theta}(x)}}{Z_{\theta}}\right) = - \nabla_{\theta}E_{\theta}(x) - \nabla_{\theta} \log Z_{\theta}.\) The first term, as we’ve discussed ad nauseam, is fine to compute. It turns out, according to some nice and clear derivations on page 3 in here, that
\[\nabla_{\theta} \log Z_{\theta} = \mathbb{E}_{x\sim p_{\theta}(x)} \left[ - \nabla_{\theta} E_{\theta}(x) \right].\]Sampling $x \sim p_{\theta}(x)$ and averaging $-\nabla_{\theta}E_{\theta}(x)$ therefore gives us an unbiased estimate of $\nabla_{\theta} \log Z_{\theta}$ (one that, of course, improves with more samples). The “more samples” part is where things get difficult - drawing samples from $p_{\theta}$ itself requires MCMC, and letting a full Markov chain run to convergence could take quite a long time.
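We can at least sanity-check the identity above on a toy discrete model, where $Z_{\theta}$ is a finite sum (the quadratic energy here is an arbitrary choice):

```python
import numpy as np

# Discrete toy model: states x in {-2,...,2}, energy E_theta(x) = theta * x^2.
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

def log_Z(theta):
    return np.log(np.sum(np.exp(-theta * xs ** 2)))

theta = 0.7

# Left side: numerical gradient of log Z via central finite differences.
eps = 1e-5
grad_fd = (log_Z(theta + eps) - log_Z(theta - eps)) / (2 * eps)

# Right side: E_{x ~ p_theta}[ -dE/dtheta ], where dE/dtheta = x^2.
p = np.exp(-theta * xs ** 2)
p /= p.sum()
grad_expect = np.sum(p * (-xs ** 2))

print(np.isclose(grad_fd, grad_expect))  # True: the two sides agree
```

Here the expectation is exact because the state space is tiny; in the continuous, high-dimensional case it is precisely this expectation that must be estimated with samples.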
To get around this, people use something called contrastive divergence. Similar to the canonical ensemble, this terminology got me really confused. “Is this anything like a contrastive loss?” I cried plaintively into the universe. “Why do machine learning people keep using the same word for different things?”
Unlike contrastive learning in the general machine learning sense, where the “contrastive” part comes in because we want to preserve the distances between similar and dissimilar data points when we embed them, contrastive divergence just says: let’s run a block-Gibbs sampling procedure for $K$ steps and use the result as our Monte Carlo estimate of the model expectation. I think there is some nice math explaining why this procedure is called contrastive divergence (I guess it sounds nicer than “using finitely many steps of a Gibbs sampling procedure to approximate a high-quality sample”) - the original paper has math which explains a lot of this, so maybe there is a clue there. My general understanding of the name is that we’re following the gradient of the contrast (or difference) between the true data distribution and the approximation of the model’s distribution that we get through Gibbs sampling. There’s also a nice visual in these slides which shows the loss in terms of a positive phase and a negative phase combining to form the full loss, and I think you can view the block-Gibbs estimate as approximating the negative phase.
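Since the classic home of contrastive divergence is the restricted Boltzmann machine (where block-Gibbs alternates between visible and hidden units), here’s a toy CD-1 sketch in numpy - the sizes, learning rate, and training pattern are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3

# RBM parameters for the energy E(v, h) = -v @ W @ h - a @ v - b @ h.
W = rng.standard_normal((n_visible, n_hidden)) * 0.1
a = np.zeros(n_visible)
b = np.zeros(n_hidden)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_step(v_data, lr=0.1):
    """One CD-1 update: a single block-Gibbs sweep stands in for a full chain."""
    global W, a, b
    # Positive phase: hidden probabilities given the data.
    ph_data = sigmoid(v_data @ W + b)
    h_sample = (rng.random(n_hidden) < ph_data).astype(float)

    # Negative phase: reconstruct v, then recompute hidden probabilities.
    pv_model = sigmoid(h_sample @ W.T + a)
    v_model = (rng.random(n_visible) < pv_model).astype(float)
    ph_model = sigmoid(v_model @ W + b)

    # Gradient estimate: <v h>_data minus <v h>_model (positive - negative).
    W += lr * (np.outer(v_data, ph_data) - np.outer(v_model, ph_model))
    a += lr * (v_data - v_model)
    b += lr * (ph_data - ph_model)

v = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
for _ in range(100):
    cd1_step(v)

# After training on one pattern, the reconstruction should resemble it.
recon = sigmoid(sigmoid(v @ W + b) @ W.T + a)
```

The two `np.outer` terms are exactly the positive and negative phases from the slides; CD-1 just replaces the model expectation with a single reconstruction.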
As a fun fact, Gibbs is also one of the folks who worked on this idea of canonical ensembles over a century ago. I know that statistical physics and thermodynamics aren’t too far away from each other and that it makes sense that the same names would show up quite often in a field derived from their work. Even then, I think it’s pretty cool.
The above approach focuses on how to match $p_{\theta}$ to $p$. Another thing we could try to do is match the scores of each respective distribution, i.e. $\nabla_{x}\log p_{\theta}$ and $\nabla_{x}\log p$. You might have heard the term “score” in the context of score-based diffusion models. I personally thought it was cool that the technique of score-matching comes up in other areas of generative modeling beyond diffusion models. It really is a useful tool! Anyway, the score-matching objective comes out to look like this (more details in these slides):
\(\mathcal{L} = \frac{1}{2} \mathbb{E}_{x \sim p(x)} \left [ \left \| \nabla_{x}\log p_{\theta}(x) - \nabla_{x}\log p(x) \right \|_{2}^{2} \right ] = \mathbb{E}_{x\sim p(x)} \left[ \frac{1}{2} \| \nabla_{x}E_{\theta}(x)\|_{2}^{2} - \textup{Tr}(\nabla_{x}^{2}E_{\theta}(x))\right ] + \textup{Constant}.\) So, we now have something with no partition function in sight! However, we now have a Hessian trace that we need to compute, and Hessians are famously expensive. That’s sad. However, trace computation is another one of those things that comes up a lot, and there is a great deal of work on how to efficiently estimate the trace of a Hessian. Some excellent blog posts describing the active field of trace estimation are this one and this one. The key behind these ideas is that while we may not have access to the Hessian itself, we can figure out what the Hessian does - i.e. we can efficiently get matrix-vector products even if we can’t get the matrix itself. When we approximate the trace term in the score-matching loss using Hessian-vector products along random directions (which are actually quite easy to get, thanks to automatic differentiation), we get sliced score matching. Cool!
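Here’s a toy numpy sketch of the Hutchinson-style trace estimator those posts describe - the matrix $H$ is built explicitly only so we can check the answer; the whole point is that the estimator touches $H$ exclusively through matrix-vector products:

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in symmetric matrix playing the role of the Hessian. We materialize
# it here only so we can compare against np.trace at the end.
d = 10
A = rng.standard_normal((d, d))
H = A + A.T

def hvp(v):
    # In a real EBM this Hessian-vector product would come from automatic
    # differentiation, without ever forming H itself.
    return H @ v

# Hutchinson's estimator: Tr(H) = E[v^T H v] for random +/-1 (Rademacher) v.
n_probes = 20_000
probes = rng.choice([-1.0, 1.0], size=(n_probes, d))
estimate = np.mean([v @ hvp(v) for v in probes])

print(abs(estimate - np.trace(H)) < 1.0)  # True, within Monte Carlo error
```

Each probe costs one matrix-vector product, so the estimator scales to problems where even storing the Hessian would be out of the question.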
This is a very high-level overview of the past and present of energy-based models. I’ve intentionally glossed over things like Restricted Boltzmann Machines, a kind of EBM which imposes some special structure on how we parameterize $E_{\theta}(x)$ and introduces some latent variables. EBMs are certainly a very flexible parameterization in theory and show much promise - in fact, this idea in some form has been an area of research for almost two decades, predating many recent ideas such as diffusion models. Alleviating the computational intractability of the partition function so that EBMs can be successful in generative modeling applications is a very active field of research.
I hope this was useful and interesting! Please let me know if there are any inaccuracies (especially with the physics overview) and any other insights you might have about this area! :)
