There are two kinds of uncertainty in the prediction of an ML model:

Epistemic uncertainty is the uncertainty due to a lack of knowledge. In ML terms, it corresponds to the case where the model parameters are poorly determined because of limited data, so that the posterior over the parameters is broad.

Aleatoric uncertainty is due to the stochasticity of the data, e.g. noise in the data generation/collection process. Even if the model is given enough data, it cannot make perfect predictions because of that noise.

In a classification setting, the following can be used as measures/proxies for uncertainty:

*1. Prediction entropy*

When the output of a model is a conditional probability distribution $P(y|x)$ (where $x$ is the input and $y$ is the label), a straightforward measure of the uncertainty is the entropy of that distribution:

\begin{equation} H[P(y|x)] = -\sum_y P(y|x)\log P(y|x), \quad \text{ in a classification setting, where the output is discrete} \end{equation}

However, this measure of uncertainty does not distinguish between epistemic and aleatoric uncertainty.
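As a small illustration of the entropy formula, here is a minimal sketch in plain Python. The function name `prediction_entropy` and the two example probability vectors are made up for this note; entropy is measured in nats (natural logarithm).

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (in nats) of a discrete predictive distribution."""
    # Terms with p == 0 are skipped, following the convention 0 * log 0 = 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical softmax outputs for a 3-class problem.
confident = [0.98, 0.01, 0.01]
uniform = [1 / 3, 1 / 3, 1 / 3]

h_confident = prediction_entropy(confident)
h_uniform = prediction_entropy(uniform)  # the maximum for 3 classes, log(3)
```

The uniform prediction attains the maximum entropy $\log C$, while the confident one is close to zero. Nothing in the number itself says whether the uncertainty comes from noisy data or from a poorly determined model.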

*2. Information gain between the model parameters and the data (point)*

Recall that the mutual information between two random variables $X$ and $Y$ is:

\begin{equation} I(X,Y) = H[P(X)] - \mathbb{E}_{P(y)}\left[H[P(X|Y)]\right] = H[P(Y)] - \mathbb{E}_{P(x)}\left[H[P(Y|X)]\right] \end{equation}

The amount of information we would gain about the model parameters $w$ if we were to receive a label $y$ for a new point $x$, given the dataset $D$, is then given by:

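\begin{equation} I(w,y|D,x) = H[p(y|x,D)] - \mathbb{E}_{p(w|D)}\left[H[p(y|x,w)]\right] \end{equation}

The first term is the entropy of the posterior predictive distribution $p(y|x,D) = \mathbb{E}_{p(w|D)}\left[p(y|x,w)\right]$, and the second is the average entropy of the predictions made by individual parameter settings drawn from the posterior. Intuitively, if the mutual information is high, plausible parameter settings disagree about the label of $x$, so observing that label would tell us a lot about $w$: the model is uncertain about the point because it lacks knowledge. If the mutual information is low, the parameter settings already agree on the prediction, and any remaining uncertainty comes from the data itself. This is therefore a measure of the epistemic uncertainty.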

Both terms in the above equation can be approximated by Monte Carlo (MC) sampling of the parameters $w$ from the posterior $p(w|D)$.
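The MC approximation can be sketched as follows, assuming we already have $T$ post-softmax vectors, one per parameter sample (for example from MC dropout or an ensemble; how the samples are obtained is outside this sketch). The function names and the two toy sample sets `agree` and `disagree` are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats), with 0 * log 0 treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mc_mutual_information(samples):
    """MC estimate of I(w, y | D, x) from T sampled probability vectors.

    samples: list of T lists, each a length-C probability vector p(y | x, w_i).
    Returns (mutual information, predictive entropy, expected entropy).
    """
    T = len(samples)
    C = len(samples[0])
    # Approximate p(y | x, D) by the mean of the sampled predictions.
    mean = [sum(s[j] for s in samples) / T for j in range(C)]
    predictive_entropy = entropy(mean)
    # Approximate E_{p(w|D)}[H[p(y | x, w)]] by the mean per-sample entropy.
    expected_entropy = sum(entropy(s) for s in samples) / T
    return predictive_entropy - expected_entropy, predictive_entropy, expected_entropy

# Every sampled model says "50/50": the uncertainty is purely aleatoric.
agree = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
# Sampled models are individually confident but contradict each other.
disagree = [[0.99, 0.01], [0.01, 0.99], [0.99, 0.01], [0.01, 0.99]]

mi_agree, h_agree, _ = mc_mutual_information(agree)
mi_disagree, h_disagree, _ = mc_mutual_information(disagree)
```

Both toy cases have the same mean prediction and hence (up to rounding) the same predictive entropy, but only `disagree` yields a large mutual information. This is the distinction that the plain prediction entropy cannot make.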

*3. Variance of the probs (post-softmax), a proxy*

This is just a proxy, and in fact, it is an approximation of the mutual information described above.

If we have $T$ MC samples of the parameters, where $p_i$ is the post-softmax vector of the prediction given sample $i$, $p_{ij}$ is $p_i$'s $j^{th}$ element and $\hat{p}_j$ is the mean of $(p_{ij})_{i=1}^T$, we have:

\begin{equation} \begin{split} \hat{\sigma}^2 &= \frac{1}{C}\sum_{j=1}^C \frac{1}{T}\sum_{i=1}^T (p_{ij}-\hat{p}_j)^2 \\ &= \frac{1}{C}\sum_{j=1}^C \left(\frac{1}{T}\left(\sum_{i=1}^T p_{ij}^2\right) - \hat{p}_j^2\right) \end{split} \end{equation}

And the approximate mutual information is:

\begin{equation} \begin{split} \hat{I} &= H(\hat{p})- \frac{1}{T} \sum_{i=1}^T H(p_i)\\ &= \sum_j \left[\frac{1}{T}\left(\sum_{i=1}^T p_{ij}\log p_{ij}\right) - \hat{p}_j \log \hat{p}_j\right] \end{split} \end{equation}

Using the first-order Taylor expansion of the logarithm around $p=1$, we have $\log p \approx p-1$, so:

\begin{equation} \begin{split} \hat{I} &\approx \sum_j \left[\frac{1}{T}\left(\sum_{i=1}^T p_{ij}(p_{ij}-1)\right) - \hat{p}_j (\hat{p}_j-1)\right]\\ &= \sum_j \left[\frac{1}{T}\left(\sum_{i=1}^T p_{ij}^2\right) - \hat{p}_j^2 - \frac{1}{T}\left(\sum_{i=1}^T p_{ij}\right) + \hat{p}_j\right]\\ &= \sum_j \left[\frac{1}{T}\left(\sum_{i=1}^T p_{ij}^2\right) - \hat{p}_j^2\right]\\ &= C \hat{\sigma}^2, \quad \text{where $C$ is the number of classes} \end{split} \end{equation}

The last two terms in the second line cancel because $\frac{1}{T}\sum_{i=1}^T p_{ij} = \hat{p}_j$ by definition.

Therefore, the variance of the probs (post-softmax) can be used as a proxy for the mutual information (epistemic uncertainty).
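The variance proxy can be sketched in plain Python as below, again assuming $T$ sampled post-softmax vectors are available. The function name `softmax_variance` and the toy data are hypothetical. Both forms of $\hat{\sigma}^2$ from the derivation are computed so they can be checked against each other.

```python
def softmax_variance(samples):
    """Variance of post-softmax probabilities across T samples, averaged over C classes.

    samples: list of T lists, each a length-C probability vector.
    Returns the value computed in two equivalent ways (centered form, moment form).
    """
    T = len(samples)
    C = len(samples[0])
    mean = [sum(s[j] for s in samples) / T for j in range(C)]
    # First form: average squared deviation from the per-class mean.
    centered = sum(
        sum((s[j] - mean[j]) ** 2 for s in samples) / T for j in range(C)
    ) / C
    # Second form: mean of squares minus square of the mean.
    moments = sum(
        sum(s[j] ** 2 for s in samples) / T - mean[j] ** 2 for j in range(C)
    ) / C
    return centered, moments

# Toy samples: models that agree vs. models that contradict each other.
agree = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
disagree = [[0.99, 0.01], [0.01, 0.99], [0.99, 0.01], [0.01, 0.99]]

var_agree, _ = softmax_variance(agree)
var_disagree, var_disagree_alt = softmax_variance(disagree)
proxy_disagree = len(disagree[0]) * var_disagree  # C * sigma^2, the approximate MI
```

Because $\log p \approx p-1$ is only accurate near $p=1$, $C\hat{\sigma}^2$ should not be expected to match the mutual information numerically. In this toy case it does preserve the ordering: zero when the samples agree, large when they disagree.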
