Suppose that we have a generative model $p_\theta(x,z) = p_\theta(z)p_\theta(x|z)$, with observed variable $x$ ($D=\{x_1,x_2,...,x_n\}$) and latent variable $z$. The goal is to learn the parameter $\theta$ to maximize the (marginal) log likelihood of the observed data (e.g. maximize $\log p_\theta(X) = \sum_{i=1}^n \log p_\theta(x_i) = \sum_{i=1}^n \log \int p_\theta(x_i|z)p_\theta(z)dz$ ).

Expectation Maximization (EM) and Variational Auto-Encoder (VAE) are both methods to solve this problem.

EM is an iterative algorithm that consist of 2 steps: expectation and maximization. The algorithm is:

- Initialize $\theta^{(1)}$
- At step $t$:
- Infer the posterior $p_{\theta^{(t)}}(z|x)$, $x \in D$
- Choose $\theta^{(t+1)}$ as: \begin{equation} \theta^{(t+1)} = argmax_\theta \sum_{x\in D} \mathbb{E}_{p_{\theta^{(t)}}(z|x)}[\log p_\theta(x,z)] \end{equation}

By performing these two steps repeatedly, $\sum_{x\in D}\log p_\theta(x)$ will converge to a (local) maximum (show proof).

Usually, the E step is to compute the expectation $\mathbb{E}_{p_{\theta^{(t)}}(z|x)}[\log p_\theta(x,z)]$, while the M step is to maximize this expectation. We will write the steps in this equivalent form, though, since it will relate to the VAE algorithm.

This algorithm, however, requires $p_\theta(z|x)$ to be tractable. If it is not, we might have to use approximate distribution (similar to the below VAE).

In VAE, we introduce an amortized inference network $q_\phi(z|x)$. The goal is to maximize the Evidence Lowever Bound:

\begin{equation} \begin{split} ELBO &= \sum_{x \in D} \left[\log p_\theta(x) - KL[q_\phi(z|x)||p_\theta(z|x)]\right]\\ &= \sum_{x \in D} \left[\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL[q_\phi(z|x)||p_\theta(z)]\right] \end{split} \end{equation}By maximizing the $ELBO$ w.r.t. $\theta$ and $\phi$ simultaneously, we will also maximize $\sum_{x\in D} \log p_\theta(x)$ (which has lower bound $ELBO$), which is our goal. Also, $q_\phi(z|x)$ will approximate $p_\theta(z|x)$

See my note on Variational Inference for more details on how to estimate the gradient of $ELBO$ w.r.t. $\theta$ and $\phi$.

Em and VAE both solve the same problem (as described above). Moreover, there are (incredibly) more similarities between them.

Let's look at the optimization w.r.t. $\phi$ and $\theta$ separately,

- When maximizing ELBO w.r.t. $\phi$, we minimize $KL[q_\phi(z|x)||p_\theta(z|x)]$. On another word, $q_\phi(z|x)$ will approximate $p_\theta(z|x)$. This is similar to step 1 of EM, but instead of inferring $p_\theta(z|x)$ directly, we approximate it by $q_\phi$.
- When maximizing ELBO w.r.t. $\theta$, we maximize $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)+\log p_\theta(z)]=\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x,z)]$. This is similar to step 2 of EM.

In [ ]:

```
```