DIVA

This is a note (and potentially a rant) on the paper DIVA: Domain Invariant Variational Autoencoders, and some thoughts on the information bottleneck.

DIVA (Domain Invariant VAE) is built on the following generative model: \begin{equation} p(x,y,d,z_y,z_d,z_x) = p(d)p(y)p(z_x)p_{\theta_d}(z_d|d)p_{\theta_y}(z_y|y)p_\theta(x|z_y,z_d,z_x) \end{equation}

where $d$ is the domain, $y$ is the class, $z_d$ is domain-specific, $z_y$ is class-specific, and $z_x$ captures all other variation in $x$. The claim is that since $z_y$ only contains class-specific information, it should be domain-invariant. Therefore, if we can infer $p(z_y|x)$, we can use this domain-invariant feature to predict the class.
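
To make the pieces concrete, here is a minimal PyTorch-style sketch of the generative side. The layer sizes and the names `LATENT_DIM`, `NUM_CLASSES`, `NUM_DOMAINS`, `X_DIM`, `diag_gaussian`, `ConditionalPrior`, and `Decoder` are illustrative assumptions of mine, not the paper's architecture; a Bernoulli decoder over binarized inputs is assumed only to keep the likelihood simple.

```python
# Sketch of DIVA's generative model: two conditional Gaussian priors,
# a standard Gaussian prior on z_x (implicit here), and a decoder on all three latents.
import torch
import torch.nn as nn
from torch.distributions import Normal, Independent

LATENT_DIM, NUM_CLASSES, NUM_DOMAINS, X_DIM = 64, 10, 5, 784  # illustrative sizes

def diag_gaussian(mu, log_std):
    """Factorized Gaussian over the last dimension."""
    return Independent(Normal(mu, log_std.exp()), 1)

class ConditionalPrior(nn.Module):
    """p_{theta_y}(z_y|y) or p_{theta_d}(z_d|d): a Gaussian whose parameters
    are produced from a one-hot class/domain code."""
    def __init__(self, num_categories, latent_dim=LATENT_DIM):
        super().__init__()
        self.num_categories = num_categories
        self.net = nn.Sequential(nn.Linear(num_categories, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * latent_dim))

    def forward(self, labels):
        one_hot = torch.nn.functional.one_hot(labels, self.num_categories).float()
        mu, log_std = self.net(one_hot).chunk(2, dim=-1)
        return diag_gaussian(mu, log_std)

class Decoder(nn.Module):
    """p_theta(x|z_y, z_d, z_x): reconstructs x from the three latents."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 * LATENT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, X_DIM))

    def forward(self, z_y, z_d, z_x):
        # Bernoulli logits for binarized images (an assumption of this sketch).
        return self.net(torch.cat([z_y, z_d, z_x], dim=-1))
```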

The inference network is: \begin{equation} q(z_y,z_d,z_x|x) = q_{\phi_y}(z_y|x)q_{\phi_d}(z_d|x)q_{\phi_x}(z_x|x) \end{equation}
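
Continuing the sketch above, each factor of $q$ gets its own amortized diagonal Gaussian encoder; the shared `Encoder` class below reuses the hypothetical `diag_gaussian` helper and constants from the previous snippet.

```python
# Continues the previous sketch: one encoder per latent, matching the factorized q.
class Encoder(nn.Module):
    """One of q_{phi_y}(z_y|x), q_{phi_d}(z_d|x), q_{phi_x}(z_x|x)."""
    def __init__(self, x_dim=X_DIM, latent_dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * latent_dim))

    def forward(self, x):
        mu, log_std = self.net(x).chunk(2, dim=-1)
        return diag_gaussian(mu, log_std)

# Three separate encoders with no shared parameters, one per latent.
enc_y, enc_d, enc_x = Encoder(), Encoder(), Encoder()
```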

The lower bound is: \begin{equation} L=\mathbb{E}_{q(z_y,z_d,z_x|x)}[\log p_\theta(x|z_y,z_d,z_x)] - \beta\mathrm{KL}[q_{\phi_y}(z_y|x)||p_{\theta_y}(z_y|y)] - \beta\mathrm{KL}[q_{\phi_d}(z_d|x)||p_{\theta_d}(z_d|d)] - \beta\mathrm{KL}[q_{\phi_x}(z_x|x)||p(z_x)] \end{equation}
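
A rough implementation of this bound, reusing the sketched modules above; the Bernoulli likelihood, the standard normal $p(z_x)$, and the single shared $\beta$ are assumptions of the sketch.

```python
# Continues the previous sketch: Monte Carlo estimate of the lower bound L.
import torch.nn.functional as F
from torch.distributions import kl_divergence

def diva_elbo(x, y, d, enc_y, enc_d, enc_x, prior_y, prior_d, decoder, beta=1.0):
    """Reconstruction term minus the beta-weighted KL terms."""
    q_zy, q_zd, q_zx = enc_y(x), enc_d(x), enc_x(x)
    z_y, z_d, z_x = q_zy.rsample(), q_zd.rsample(), q_zx.rsample()

    # log p(x|z_y, z_d, z_x) under a Bernoulli decoder.
    logits = decoder(z_y, z_d, z_x)
    log_px = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)

    # p(z_x) is a standard Gaussian; the other priors are conditioned on y and d.
    p_zx = diag_gaussian(torch.zeros_like(z_x), torch.zeros_like(z_x))
    kl = (kl_divergence(q_zy, prior_y(y)) +
          kl_divergence(q_zd, prior_d(d)) +
          kl_divergence(q_zx, p_zx))
    return (log_px - beta * kl).mean()
```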

To further encourage the separation of the latent features, they introduce two auxiliary networks $q_{w_d}(d|z_d)$ and $q_{w_y}(y|z_y)$ to predict the domain and the class from the two latent variables. After training, $q_{w_y}(y|z_y)$ will be used to make the prediction of interest.

The final objective function is \begin{equation} L + \alpha_d\mathbb{E}_{q_{\phi_d}(z_d|x)}[\log q_{w_d}(d|z_d)] + \alpha_y\mathbb{E}_{q_{\phi_y}(z_y|x)}[\log q_{w_y}(y|z_y)] \end{equation}
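
Putting it together, a hedged sketch of the full training objective (a quantity to maximize). `AuxClassifier` is a made-up name for the two auxiliary networks, and the latents are re-sampled here only for brevity; in practice one would share the samples with the ELBO computation.

```python
# Continues the previous sketch: ELBO plus the alpha-weighted auxiliary terms.
class AuxClassifier(nn.Module):
    """q_{w_d}(d|z_d) or q_{w_y}(y|z_y): a softmax classifier on one latent."""
    def __init__(self, num_categories, latent_dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Linear(latent_dim, num_categories)

    def forward(self, z):
        return self.net(z)  # logits

def diva_objective(x, y, d, enc_y, enc_d, enc_x, prior_y, prior_d, decoder,
                   cls_y, cls_d, beta=1.0, alpha_y=1.0, alpha_d=1.0):
    elbo = diva_elbo(x, y, d, enc_y, enc_d, enc_x, prior_y, prior_d, decoder, beta)
    z_y, z_d = enc_y(x).rsample(), enc_d(x).rsample()
    log_qy = -F.cross_entropy(cls_y(z_y), y)   # E_q[log q_{w_y}(y|z_y)]
    log_qd = -F.cross_entropy(cls_d(z_d), d)   # E_q[log q_{w_d}(d|z_d)]
    return elbo + alpha_y * log_qy + alpha_d * log_qd
```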

The authors also found that the model still works well, with even better domain generalization performance, when $z_d$ is removed entirely. I think this makes sense, since $z_d$ takes no part in the predictive distribution.

However, looking more closely, we can see that only $z_y$ matters for the prediction: the reconstruction of $x$ has hardly any apparent effect on the prediction model.

Therefore, intuitively, if we keep only two terms of the objective function, $\mathbb{E}_{q_{\phi_y}(z_y|x)}[\log q_{w_y}(y|z_y)]-\beta\mathrm{KL}[q_{\phi_y}(z_y|x)||p_{\theta_y}(z_y|y)] \;(1)$, the model should still perform well. I have tested this reduced model empirically and observed that it matches the performance of the full model exactly. This is entirely sensible: as long as we can learn a class-specific feature $z_y$, we can use it to make good predictions. Note that neither the reconstruction of $x$ nor the learning of $z_x$ and $z_d$ has any effect on the learning of $z_y$.
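
For reference, the reduced objective $(1)$ needs only the class encoder, the class-conditional prior, and the auxiliary classifier; here is a sketch under the same assumptions as the snippets above.

```python
# Continues the previous sketch: objective (1), with no decoder, no z_d, no z_x.
def reduced_objective(x, y, enc_y, prior_y, cls_y, beta=1.0):
    """Classify y from z_y while keeping q(z_y|x) close to p_{theta_y}(z_y|y)."""
    q_zy = enc_y(x)
    z_y = q_zy.rsample()
    log_qy = -F.cross_entropy(cls_y(z_y), y)        # E_q[log q_{w_y}(y|z_y)]
    kl = kl_divergence(q_zy, prior_y(y)).mean()     # KL[q(z_y|x) || p(z_y|y)]
    return log_qy - beta * kl
```

At test time the prediction would be made the same way as with the full model, e.g. something like `cls_y(enc_y(x).mean)`.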

This calls into question the use of a VAE for this problem: it adds a lot of computation without any apparent benefit. The only upside is that the VAE lets us exploit unlabelled data in a semi-supervised setting. However, as reported by the authors, the gain is very marginal. I think this is not surprising, since the model can reconstruct $x$ using $z_x$ alone and ignore $z_y$ completely. In order to force the model to use $z_y$ for the reconstruction (and thus to actually learn $z_y$), we would need to constrain $z_x$ so that it does not contain class information, for example by minimizing $I(Z_x,Y)$.

Interestingly enough, the objective function in $(1)$ can be viewed through the information bottleneck lens as a variational lower bound of: \begin{equation} I(Z_y,Y)-\beta I(Z_y,X) \end{equation}
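
To spell out the connection (this is my sketch of the standard variational-information-bottleneck argument, not something taken from the paper): the Barber–Agakov bound with $q_{w_y}(y|z_y)$ as the variational posterior gives \begin{equation} I(Z_y,Y) \geq H(Y) + \mathbb{E}_{p(x,y)}\mathbb{E}_{q_{\phi_y}(z_y|x)}[\log q_{w_y}(y|z_y)], \end{equation} while for any fixed marginal $r(z_y)$ we have the variational upper bound \begin{equation} I(Z_y,X) \leq \mathbb{E}_{p(x)}\mathrm{KL}[q_{\phi_y}(z_y|x)||r(z_y)]. \end{equation} Substituting the class-conditional prior $p_{\theta_y}(z_y|y)$ for $r(z_y)$ recovers objective $(1)$ up to the constant $H(Y)$. Strictly speaking, a prior that depends on $y$ turns the second bound into a bound on $I(Z_y,X|Y)=I(Z_y,X)-I(Z_y,Y)$ rather than $I(Z_y,X)$, but the bottleneck reading is the same: predict $y$ from $z_y$ while compressing away everything else about $x$.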