Regression Probability in Deep Networks (Bayesian Deep Learning Part II)

This post is a continuation of Explaining Dropout (Bayesian Deep Learning Part I). It is continued here for the classification problem.

Operations in the physical world are inherently uncertain, and the consequences of not understanding when to act upon the information you have are severe. Some of the most prevalent algorithms in robotics, such as particle filters for SLAM and Kalman filters for sensor fusion, are popular because they handle this uncertainty. And while no one is going to dispute the power of deep neural networks, now famous for tasks such as image classification, object detection, and text-to-speech, they do not have a well understood metric of uncertainty. You may have noticed this when Google maps happily sent you to Rome instead of Home.

The interpretation of dropout shown in the previous post casts the output of a neural network with dropout as an approximation to the output’s  true distribution. Since it is now a distribution instead of a point estimate, we can derive an expected value and variance.

The derivation for regression (taken from Yarin Gal’s thesis) is a little more straightforward, so I’ll go through that one first.

Expected Value

Since we are dealing with a probability distribution, we don’t get a single output – we must instead find the expected value of the distribution. The formula for an expected value of a continuous distribution is:

    \begin{align*} \mathbb{E}(y) = \int_{-\infty}^{\infty}yp(y)dx \end{align*}

Using our approximating distribution, q_{\theta}(\omega) \approx p(\omega|X,Y), and the chain rule for probability, we get the integral:

    \begin{align*} \mathbb{E}(y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}yp(y|x,\omega)q_{\theta}(\omega)d\omega dy \end{align*}

Where q_{\theta}(\omega) is our approximating distribution, p is our actual distribution for input x, label y, and parameters (weights) \omega. X and Y represent our training data. Note that since our weights are random variables, this becomes a double integral as we need to marginalize them out.

In the previous post, I mentioned that a probability prior would have to be found in order to make the objective functions match. It turns out that this prior amounts to placing independent normal distributions across each weight, with a specific standard deviation (\tau^{-1}). We can then reformulate:

    \begin{align*} \mathbb{E}(y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}yN(y;f^{\omega}(x),\tau^{-1}I) q_{\theta}(\omega)d\omega dy \end{align*}

The expected value of a Gaussian distribution is just its mean, and the mean is determined by our network output (the parameter), f^{\omega}(x). So if we integrate out our y, we find the expected value with one more parameter to marginalize out.

    \begin{align*} \mathbb{E}(y) = \int_{-\infty}^{\infty}f^{\omega}(x) q_{\theta}(\omega)d\omega \end{align*}

If we evaluate every possible combination of weights and average them (since the more likely output will occur more frequently), we can estimate this integral. This is called Monte Carlo Integration.

Since the weights are Bernoulli distributed, a combination of weights is simply one potential set of activations. Or, pragmatically, a forward pass with dropout enabled. Each of these forward passes results is a single Monte Carlo sample, which leaves us with the final estimator:

(1)   \begin{equation*} \mathbb{E}[y] := \frac{1}{T}\sum_{t=1}^Tf^{\hat{\omega}_t}(x) \end{equation*}

For T samples. As T \rightarrow \infty, we approach the true expected value.


While performing inference time dropout improves network performance slightly, the  gain is not sufficient to justify performing multiple inferences for Monte Carlo integration. The real reason this process is valuable is its ability to estimate variance.

We start with the statement that:

(2)   \begin{equation*} Var[y] = \mathbb{E}(yy^T)-\mathbb{E}(y)\mathbb{E}(y)^T \end{equation*}

I note that my version of the equation differs slightly from the one in Gal’s thesis. The only reason for this is that I more traditionally see covariance written as yy^T for column vectors, while Gal is performs his derivation with row vectors. \mathbb{E}(y) is known from previous, but we still need to figure out \mathbb{E}(yy^T). Conveniently, the procedure is much the same as for expected value. We start with the expected value formula, using yy^T instead of y. We note that y and yy^T have the same probability density function.

    \begin{align*} \mathbb{E}[yy^T] = \int\int \left(yy^Tp(y|x,\omega)dy\right)q_{\theta}(\omega)d\omega \end{align*}

We note that the integral with respect to y is just the expected covariance, so we can rearrange 2 such that it represents expected variance.

    \begin{align*} \mathbb{E}[yy^T] = \int\left(Cov_{p(y|x,\omega)}[y]+\mathbb{E}_{p(y|x,\omega)}[y]\mathbb{E}_{p(y|x,\omega)}[y]^T\right)dyq_{\theta}(\omega)d\omega \end{align*}

Since we know \mathbb{E}(y), and the covariance, this becomes:

    \begin{align*} \mathbb{E}[yy^T] = \int\left(\tau^{-1}\bf{I}+f^{\omega}(x)f^{\omega}(x)^T\right)q_{\theta}(\omega)d\omega \end{align*}

The term \tau is actually a characteristic of the network itself. It’s tied to the weight-decay parameter and dropout rate. I haven’t needed to dig my teeth into this yet, so I can’t provide a good explanation of that parameter. HOWEVER, the definition of this term is such that if your network has no weight decay parameter, it can be ignored (as \lambda \rightarrow 0, \tau \rightarrow \infty).

Since our \tau^{-1}\bf{I} term is not dependent on \omega, and q_{\theta}(\omega) integrates to 1, we can pull it out of the integral.  Monte Carlo integration is performed on the remaining term in the same manner as 1. This results in the final unbiased estimator for \mathbb{E}[yy^T]:

    \begin{align*} \mathbb{E}[yy^T] = \tau^{-1}\bf{I} + \frac{1}{T}\sum_{t=1}^T f^{\hat{\omega}_t}(x)f^{\hat{\omega}_t}(x)^T \end{align*}

And substituting this estimate back into 2 we end up with our final estimator for variance:

    \begin{align*} Var[Y] := \tau^{-1}\bf{I} + \frac{1}{T}\sum_{t=1}^T f^{\hat{\omega}_t}(x)f^{\hat{\omega}_t}(x)^T - \mathbb{E}[y]\mathbb{E}[y]^T \end{align*}

Explaining Dropout (Bayesian Deep Learning Part I)

Dropout is a now-ubiquitous regularization technique introduced by Hinton in 2012, and originally provided without any meaningful theoretical grounding. In a network with dropout, neurons are randomly turned on and off at training time, and the outputs are averaged at inference time. Intuitively, this creates a ensemble of classifiers, each of which focuses on slightly different features, thus preventing overfitting. This quora question explains it a bit more in depth in a few different ways.

While intuitively satisfying and generally accepted, this explanation of dropout lacks mathematical rigor. A Bayesian perspective, taken recently by Yarin Gal provides a mathematically grounded explanation of dropout, showing it is equivalent to minimizing the divergence between  true and approximate distributions over the network weights.

I’ve pulled the derivation that shows this for a single hidden layer network from his thesis, and added a bit of commentary based on my understanding of it.

Finding an Approximating Distribution

With \omega as our weights, \theta our variational parameters, and X and Y our training inputs and outputs, we find the best approximating distribution, q_{\theta}(\omega), for our actual distribution, p(\omega|X,Y), by minimizing Kullback-Leibler divergence.   Or, in less words, we seek to minimize:

    \begin{align*} KL(q_{\theta}(\omega)||p(\omega|X,Y)) = \int q_{\theta}(\omega)log\left(\frac{q_{\theta}(\omega)}{p(\omega|X,Y)}\right)d\omega \end{align*}

Via the magic of Bayes’ theorem and logarithm rules, we can reach an alternate form of the loss:

    \begin{align*} \mathcal{L}_{VI} = -\int q_{\theta}(\omega)log(p(Y|X,\omega))d\omega + KL(q_{\theta}(\omega)||p(\omega)) + C \end{align*}

We note that our Y and X are discrete, and reformat our log probability:

    \begin{align*} \mathcal{L}_{VI} = -\sum_{i=1}^N\int q_{\theta}(\omega)log(p(y_i|f^{\omega}(x_i))d\omega + KL(q_{\theta}(\omega)||p(\omega)) + C \end{align*}

Where f^{\omega} represents the model output for a given input and weight parameters, and x_i and y_i represent single inputs and outputs from our training set, N is the number of training samples. As a constant, C can (and will be) dropped from our optimization procedure.

Further, we note we can reformulate our final loss function using a subset of the training data:

    \begin{align*} \hat{\mathcal{L}}_{VI} = -\frac{N}{M}\sum_{i\in S}\int q_{\theta}(\omega)log(p(y_i|f^{\omega}(x_i))d\omega + KL(q_{\theta}(\omega)||p(\omega)) \end{align*}

Where M is the size of our set S. The \frac{N}{M} term works as a normalizer, making sure our scale doesn’t change. This is important to show because dropout generates a sample population, not a predictable iteration through all datapoints.

Utilizing the pathwise derivative estimator, we arrive at a new loss function:

    \begin{align*} \label{MCLoss} \hat{\mathcal{L}}_{MC}(\theta) = -\frac{N}{M}\sum_{i\in S}log(p(y_i|f^{g(\theta,\epsilon)}(x_i)) + KL(q_{\theta}(\omega)||p(\omega)) \end{align*}

where \epsilon is drawn from an as of yet undefined distribution.

Reformulating Dropout

We’ve reformulated our KL minimization, now we need to reformulate dropout to match. We start with a standard regularized loss function, where M_1, M_2 and b represent weights and biases with no dropout:

    \begin{align*} \mathcal{L}_d = \frac{1}{M}\sum_{i\in S}\frac{1}{2}||y_i-f^{M1,M2,b}(x)||^2 + \lambda_1||M_1||^2 + \lambda_2||M_2||^2+\lambda_3||b||^2 \end{align*}

In Consistent inference of probabilities in layered networks: Predictions and generalizations, it is shown that the squared error term is equivalent to:

    \begin{align*} \frac{1}{2}||y-f^{M1,M2,b}(x)||^2 = -\frac{1}{\tau}log(p(y|f^{M_1,M_2,b}(x))+C \end{align*}

If we designate a function g(\theta,\hat{\epsilon}) - \{diag(\hat{\epsilon}_1)M_1, diag(\hat{\epsilon}_2)M_2,b\}, which corresponds to a dropout layer if the epsilon are a Bernoulli distribution, our fully formed loss function becomes:

    \begin{align*} \mathcal{L}_d = \frac{1}{M}\sum_{i\in S}log(p(y|f^{g(\theta,\hat{\epsilon})}(x)) + \lambda_1||M_1||^2 + \lambda_2||M_2||^2+\lambda_3||b||^2 \end{align*}

What this means

If you look closely, you can see how similar \mathcal{L}_d and \hat{\mathcal{L}}_{MC} are. Essentially, if we can select a prior, p(\omega), where the KL convergence is equal to our L2 penalty, they match to a scale.

Of course, due to the fact that it’s implemented via enabling and disabling neurons, we don’t ACTUALLY have to worry about this in implementation. But, mathematically, it turns out that by setting the prior to independent multivariate normals with appropriate variance, we can make this true.

Most importantly, though, this means that we can link mathematics related to the approximating distribution and KL estimation to our dropout, which becomes important when we talk about the idea of Bayesian Deep Learning (continued here if you want to read about regression, or here for classification).