For any distribution $p(x)$, we can define the energy function $E(x)=-\log p(x)$, so that $p(x)=\exp\left(-E(x)\right)$.

Thus, at a point $x$ where $E(x)$ is low, $p(x)$ is high, and we can say that the variable $X$ favors taking the value $x$. This is inspired by physics, where a particle tends to move to a state with low energy.

The energy function can be used for density estimation of a distribution from observed data.

Let's assume that we have a true (but unknown) distribution $p(x)$ with observed data $x_1,x_2,...,x_N$ and we want to estimate the density $p(x)$.

Let the energy function be a neural network $E_\theta(x)$ (parameterized by $\theta$). Our goal is to tune $\theta$ so that this energy function estimates the density of the distribution well.

We have: \begin{equation} p_\theta(x) = \frac{\exp\left(-E_\theta(x)\right)}{Z(\theta)}, \end{equation} based on the Gibbs distribution, where $Z(\theta)=\int \exp\left(-E_\theta(x')\right)dx'$ is the normalizing constant.
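To make the role of the normalizing constant concrete, here is a minimal sketch that normalizes a one-dimensional energy numerically. The energy $E(x)=x^2/2$, the integration range, and the function names are all assumptions chosen for illustration; in high dimensions this kind of grid integration is not feasible, which is why $Z(\theta)$ is usually treated as intractable.

```python
import math

def energy(x):
    """Hypothetical 1-D energy E(x) = x^2 / 2 (an unnormalized standard Gaussian)."""
    return 0.5 * x * x

def partition_function(energy_fn, lo=-10.0, hi=10.0, n=20001):
    """Approximate Z = integral of exp(-E(x)) dx on [lo, hi] with the trapezoidal rule."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0  # endpoints count half
        total += weight * math.exp(-energy_fn(x))
    return total * h

Z = partition_function(energy)

def density(x):
    """Normalized density p(x) = exp(-E(x)) / Z."""
    return math.exp(-energy(x)) / Z
```

For this particular energy, `Z` should agree closely with $\sqrt{2\pi}$, the normalizing constant of a standard Gaussian.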

\begin{equation} \begin{split}
\log p_\theta(x) &= -E_\theta(x)-\log \left(\int \exp\left(-E_\theta(x')\right)dx'\right)\\
\frac{\partial\log p_\theta(x)}{\partial\theta} &= -\frac{\partial E_\theta(x)}{\partial\theta} - \frac{\partial \log \left(\int \exp\left(-E_\theta(x')\right)dx'\right)}{\partial \theta}\\
&= -\frac{\partial E_\theta(x)}{\partial\theta} - \frac{1}{\int \exp\left(-E_\theta(x')\right)dx'}\cdot\frac{\partial \int \exp\left(-E_\theta(x')\right)dx'}{\partial \theta}\\
&= -\frac{\partial E_\theta(x)}{\partial\theta} - \frac{1}{\int \exp\left(-E_\theta(x')\right)dx'}\cdot\int \frac{\partial \exp\left(-E_\theta(x')\right)}{\partial \theta}dx'\\
&= -\frac{\partial E_\theta(x)}{\partial\theta} - \frac{1}{Z(\theta)}\cdot\int \exp\left(-E_\theta(x')\right)\frac{\partial \left(-E_\theta(x')\right)}{\partial \theta}dx'\\
&= -\frac{\partial E_\theta(x)}{\partial\theta} + \int \frac{\exp\left(-E_\theta(x')\right)}{Z(\theta)}\frac{\partial E_\theta(x')}{\partial \theta}dx'\\
&= -\frac{\partial E_\theta(x)}{\partial\theta} + \int p_\theta(x')\frac{\partial E_\theta(x')}{\partial \theta}dx'\\
&= -\frac{\partial E_\theta(x)}{\partial\theta} + \mathbb{E}_{p_\theta(x')}\left[\frac{\partial E_\theta(x')}{\partial \theta}\right]
\end{split} \end{equation}

Then, to maximize the data likelihood, we ascend the gradient of the log-likelihood objective over the dataset $D=\{x_1,...,x_N\}$: \begin{equation} \frac{\partial\log p_\theta(D)}{\partial\theta}=\sum_{i=1}^{N} \frac{\partial\log p_\theta(x_i)}{\partial\theta} =-\sum_{i=1}^{N}\frac{\partial E_\theta(x_i)}{\partial\theta} + N\cdot\mathbb{E}_{p_\theta(x')}\left[\frac{\partial E_\theta(x')}{\partial \theta}\right] \end{equation}
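The identity derived above can be sanity-checked numerically. The sketch below assumes a tiny discrete state space (so the integral becomes a sum) and a hypothetical one-parameter energy $E_\theta(x)=\theta x^2$; neither comes from the text, they only keep the check exact and cheap. It compares the closed-form gradient with a central finite difference of $\log p_\theta(x)$.

```python
import math

# Hypothetical finite state space, so the integral over x' is an exact sum.
STATES = [-2.0, -1.0, 0.0, 1.0, 2.0]

def energy(theta, x):
    """Hypothetical energy E_theta(x) = theta * x^2."""
    return theta * x * x

def d_energy_d_theta(theta, x):
    """Derivative of E_theta(x) with respect to theta."""
    return x * x

def log_prob(theta, x):
    """log p_theta(x) = -E_theta(x) - log Z(theta)."""
    log_z = math.log(sum(math.exp(-energy(theta, s)) for s in STATES))
    return -energy(theta, x) - log_z

def grad_log_prob(theta, x):
    """Closed form: -dE(x)/dtheta + E_{p_theta(x')}[dE(x')/dtheta]."""
    z = sum(math.exp(-energy(theta, s)) for s in STATES)
    expected = sum(
        math.exp(-energy(theta, s)) / z * d_energy_d_theta(theta, s) for s in STATES
    )
    return -d_energy_d_theta(theta, x) + expected

def finite_difference(theta, x, eps=1e-6):
    """Central finite-difference estimate of d log p_theta(x) / d theta."""
    return (log_prob(theta + eps, x) - log_prob(theta - eps, x)) / (2 * eps)

analytic = grad_log_prob(0.7, 1.0)
numeric = finite_difference(0.7, 1.0)
```

The two values should match up to finite-difference error, which is a quick way to catch sign mistakes in the derivation.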

The first term is easy to compute, but the second term is intractable, so we have to approximate it with samples from $p_\theta(x')$. Stochastic Gradient Langevin Dynamics can be used for such sampling.
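As an illustration of the sampling step, here is a minimal sketch of a Langevin chain in one dimension. It assumes the update $x \leftarrow x - \frac{\alpha}{2}\nabla_x E(x) + \sqrt{\alpha}\,\epsilon$ with $\epsilon\sim\mathcal{N}(0,1)$ and a hypothetical energy $E(x)=x^2/2$ whose gradient is known in closed form; with a neural network energy, the gradient would come from automatic differentiation instead. The step size, chain length, and burn-in are arbitrary choices.

```python
import math
import random

def grad_energy(x):
    """Gradient w.r.t. x of the hypothetical energy E(x) = x^2 / 2."""
    return x

def langevin_chain(grad_energy_fn, x0, n_steps, step_size, rng):
    """Run an unadjusted Langevin chain and return every visited state."""
    x = x0
    trajectory = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Drift toward low energy plus Gaussian noise.
        x = x - 0.5 * step_size * grad_energy_fn(x) + math.sqrt(step_size) * noise
        trajectory.append(x)
    return trajectory

rng = random.Random(0)
trajectory = langevin_chain(grad_energy, x0=5.0, n_steps=20000, step_size=0.1, rng=rng)
samples = trajectory[1000:]  # discard the burn-in phase
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Because the chain uses a finite step size and no accept/reject correction, its samples only approximately follow $\exp(-E(x))/Z$; for this energy the empirical mean and variance should land near 0 and 1.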

The (intuitive) meaning of the above objective is that, by maximizing it, we minimize $E_\theta(x_i)$ for every observed $x_i$ (low-energy state) while maximizing $E_\theta(x')$ for the points $x'$ sampled from the model (high-energy state).
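The push-down/push-up intuition can be seen in a toy training loop. This is only a sketch under strong assumptions: a hypothetical one-parameter energy $E_\theta(x)=(x-\theta)^2/2$, synthetic data drawn around 2, short Langevin chains restarted from noise at every iteration, and hand-picked hyperparameters. It uses the per-example average of the gradient formula, with the expectation replaced by a sample mean.

```python
import random

def d_energy_d_theta(theta, x):
    """For the hypothetical energy E_theta(x) = (x - theta)^2 / 2: dE/dtheta = theta - x."""
    return theta - x

def d_energy_d_x(theta, x):
    """Gradient of the same energy with respect to x."""
    return x - theta

def sample_model(theta, rng, n_samples=32, n_steps=30, step=0.5):
    """Draw approximate samples from p_theta with short Langevin chains."""
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)  # each chain starts from noise
        for _ in range(n_steps):
            x = x - 0.5 * step * d_energy_d_x(theta, x) + (step ** 0.5) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples

def fit(data, n_iters=100, lr=0.1, seed=0):
    """Gradient ascent on the average log-likelihood using sampled negatives."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(n_iters):
        negatives = sample_model(theta, rng)
        # Data term: lowers the energy of observed points.
        data_term = sum(d_energy_d_theta(theta, x) for x in data) / len(data)
        # Model term: raises the energy of points the model currently favors.
        model_term = sum(d_energy_d_theta(theta, x) for x in negatives) / len(negatives)
        theta += lr * (-data_term + model_term)
    return theta

data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(200)]  # synthetic observations
theta_hat = fit(data)
```

For this energy the gradient reduces to the data mean minus the mean of the model samples, so `theta_hat` should settle near the mean of `data`, up to sampling noise.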

In this paper, the authors show that learning a discriminative classifier $p_\theta(y|x)$ is similar to estimating the density of the true joint distribution of $(x,y)$ with $p_\theta(x,y)$, using an energy-based model.

First, let's assume the number of classes is $K$ $(0\leq y\leq K-1)$.

Define $E_\theta(x,y)=-f_\theta(x)[y]$, where $f_\theta$ is a function that outputs a $K$-dimensional vector (the logits), so that $p_\theta(x,y)=\frac{\exp(f_\theta(x)[y])}{Z(\theta)}$.

Then: \begin{equation} \log p_\theta(x,y) = \log p_\theta(y|x) + \log p_\theta(x) \end{equation}

where $p_\theta(y|x) = \frac{\exp(f_\theta(x)[y])}{\sum_{y'}\exp(f_\theta(x)[y'])}$

and $p_\theta(x)=\frac{\sum_y \exp(f_\theta(x)[y])}{Z(\theta)}$; in other words, the energy function w.r.t. $x$ is $E_\theta(x)=-\log \sum_y \exp(f_\theta(x)[y])$.
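The quantities above can all be read off a single logit vector. The following sketch assumes the logits $f_\theta(x)$ are given as a plain Python list (the values used are made up) and takes the joint energy to be the negative logit, which is the sign convention consistent with the softmax form of $p_\theta(y|x)$. The helper names are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_p_y_given_x(logits, y):
    """log p(y|x): the usual log-softmax of the logits."""
    return logits[y] - logsumexp(logits)

def energy_x(logits):
    """E(x) = -log sum_y exp(f(x)[y]), the energy of x with y marginalized out."""
    return -logsumexp(logits)

def energy_xy(logits, y):
    """Joint energy E(x, y) = -f(x)[y]."""
    return -logits[y]

logits = [2.0, -1.0, 0.5]  # made-up output f(x) for K = 3 classes
class_probs = [math.exp(log_p_y_given_x(logits, y)) for y in range(len(logits))]
```

Up to the shared constant $\log Z(\theta)$, `-energy_xy(logits, y)` equals `log_p_y_given_x(logits, y) - energy_x(logits)`, which is the factorization $\log p_\theta(x,y)=\log p_\theta(y|x)+\log p_\theta(x)$ in code form.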

Here, maximizing the first term is exactly minimizing the cross-entropy loss of a normal classifier. The second term can be maximized with the gradient formula from the section above. Note that we could also maximize $\log p_\theta(x,y)$ directly with that gradient formula. However, since we run MCMC sampling for a finite number of steps to approximate the expectation, that estimator is biased. Since we are mainly interested in $p_\theta(y|x)$, we keep it out of the biased estimator and use the estimator only to maximize $\log p_\theta(x)$ (which is less important).
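Written out, the resulting training gradient combines an exact part and a sampled part. The following is assembled from the formulas in this post, with $E_\theta(x)=-\log \sum_y \exp(f_\theta(x)[y])$ and the expectation approximated by finite-step MCMC samples; the labels under the braces are added here for illustration. \begin{equation} \frac{\partial\log p_\theta(x,y)}{\partial\theta} = \underbrace{\frac{\partial\log p_\theta(y|x)}{\partial\theta}}_{\text{exact (cross-entropy)}} + \underbrace{\left(-\frac{\partial E_\theta(x)}{\partial\theta} + \mathbb{E}_{p_\theta(x')}\left[\frac{\partial E_\theta(x')}{\partial \theta}\right]\right)}_{\text{approximate (MCMC samples)}} \end{equation}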