Explaining Dropout (Bayesian Deep Learning Part I)

Dropout is a now-ubiquitous regularization technique introduced by Hinton and his collaborators in 2012, originally without much theoretical grounding. In a network with dropout, neurons are randomly dropped (zeroed out) at training time; at inference time the full network is used, with activations scaled so that the output approximates an average over all the thinned networks seen during training. Intuitively, this trains an ensemble of classifiers, each of which focuses on slightly different features, thus preventing overfitting. This Quora question explains the intuition in more depth in a few different ways.
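As a concrete picture of that behaviour, here is a minimal NumPy sketch (the function name and the keep probability p_keep are illustrative choices, not from any particular library):

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_forward(h, p_keep=0.5, train=True):
        # Training: keep each unit with probability p_keep, zero the rest.
        # Inference: keep every unit and scale by p_keep so the expected
        # activation matches what the network saw during training.
        if train:
            return h * rng.binomial(1, p_keep, size=h.shape)
        return h * p_keep

    h = np.array([1.0, 2.0, 3.0, 4.0])
    print(dropout_forward(h, train=True))    # some units zeroed at random
    print(dropout_forward(h, train=False))   # all units, scaled by p_keep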

While intuitively satisfying and generally accepted, this explanation of dropout lacks mathematical rigor. A Bayesian perspective, taken recently by Yarin Gal, provides a mathematically grounded explanation of dropout, showing that training with dropout is equivalent to minimizing the Kullback-Leibler divergence between an approximate distribution and the true posterior over the network weights.

I’ve pulled the derivation that shows this for a single hidden layer network from his thesis, and added a bit of commentary based on my understanding of it.

Finding an Approximating Distribution

With \omega as our weights, \theta our variational parameters, and X and Y our training inputs and outputs, we find the best approximating distribution, q_{\theta}(\omega), for the true posterior, p(\omega|X,Y), by minimizing the Kullback-Leibler divergence between the two. In other words, we seek to minimize:

    \begin{align*} KL(q_{\theta}(\omega)||p(\omega|X,Y)) = \int q_{\theta}(\omega)log\left(\frac{q_{\theta}(\omega)}{p(\omega|X,Y)}\right)d\omega \end{align*}

Via the magic of Bayes’ theorem and logarithm rules, we can reach an alternate form of the loss:

    \begin{align*} \mathcal{L}_{VI} = -\int q_{\theta}(\omega)log(p(Y|X,\omega))d\omega + KL(q_{\theta}(\omega)||p(\omega)) + C \end{align*}
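Spelling that step out: substituting p(\omega|X,Y) = p(Y|X,\omega)p(\omega)/p(Y|X) into the KL integral and splitting the logarithm gives

    \begin{align*} \int q_{\theta}(\omega)log\left(\frac{q_{\theta}(\omega)\,p(Y|X)}{p(Y|X,\omega)\,p(\omega)}\right)d\omega = -\int q_{\theta}(\omega)log(p(Y|X,\omega))d\omega + KL(q_{\theta}(\omega)||p(\omega)) + log(p(Y|X)) \end{align*}

(the last term uses \int q_{\theta}(\omega)d\omega = 1), so the constant C is just the log model evidence, log(p(Y|X)), which does not depend on \theta.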

We note that the training points are modelled as independent given the weights, so the likelihood factorizes over the N data points and the log probability becomes a sum:

    \begin{align*} \mathcal{L}_{VI} = -\sum_{i=1}^N\int q_{\theta}(\omega)log(p(y_i|f^{\omega}(x_i)))d\omega + KL(q_{\theta}(\omega)||p(\omega)) + C \end{align*}

Where f^{\omega} represents the model output for a given input under weights \omega, x_i and y_i are single inputs and outputs from our training set, and N is the number of training samples. As a constant, C can (and will) be dropped from the optimization.
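The sum over data points comes from that independence assumption: the likelihood factorizes, so

    \begin{align*} log(p(Y|X,\omega)) = log\left(\prod_{i=1}^N p(y_i|f^{\omega}(x_i))\right) = \sum_{i=1}^N log(p(y_i|f^{\omega}(x_i))) \end{align*}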

Further, we note we can estimate this loss using a random subset (minibatch) of the training data:

    \begin{align*} \hat{\mathcal{L}}_{VI} = -\frac{N}{M}\sum_{i\in S}\int q_{\theta}(\omega)log(p(y_i|f^{\omega}(x_i)))d\omega + KL(q_{\theta}(\omega)||p(\omega)) \end{align*}

Where S is a random subset of the training data of size M. The \frac{N}{M} factor rescales the minibatch sum so that, in expectation, it matches the full-data term. This matters because dropout networks are trained on random minibatches, not a deterministic pass through every data point.
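A quick numerical sanity check of that claim (the per-point losses here are random stand-ins for -log(p(y_i|f^{\omega}(x_i)))):

    import numpy as np

    rng = np.random.default_rng(0)
    N, M = 1000, 32
    per_point_loss = rng.normal(size=N)       # stand-ins for per-point losses

    full_sum = per_point_loss.sum()
    estimates = [N / M * per_point_loss[rng.choice(N, size=M, replace=False)].sum()
                 for _ in range(5000)]

    print(full_sum, np.mean(estimates))       # the two agree in expectation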

Utilizing the pathwise derivative estimator (also known as the re-parameterization trick), we replace the integral over \omega with a single sample \omega = g(\theta,\epsilon) and arrive at a Monte Carlo estimate of the loss:

    \begin{align*} \hat{\mathcal{L}}_{MC}(\theta) = -\frac{N}{M}\sum_{i\in S}log(p(y_i|f^{g(\theta,\epsilon)}(x_i))) + KL(q_{\theta}(\omega)||p(\omega)) \end{align*}

where \epsilon is drawn from a distribution we have yet to specify, and g(\theta,\epsilon) deterministically maps the variational parameters and the noise to a set of weights.
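To make the trick concrete, here is a one-dimensional Gaussian example (purely illustrative; for dropout the noise will turn out to be Bernoulli, as shown in the next section):

    import numpy as np

    rng = np.random.default_rng(0)

    # Re-parameterization: instead of sampling w ~ q_theta(w) = N(mu, sigma^2)
    # directly, sample parameter-free noise and push it through g(theta, eps).
    mu, sigma = 0.3, 0.8             # variational parameters theta
    eps = rng.standard_normal()      # eps ~ N(0, 1), independent of theta
    w = mu + sigma * eps             # w = g(theta, eps), distributed as q_theta

    # An expectation over q_theta is now an expectation over eps, so a single
    # sample of eps gives an unbiased estimate whose gradient with respect to
    # theta flows through g.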

Reformulating Dropout

We've reformulated our KL minimization; now we need to reformulate dropout to match. We start with a standard L2-regularized dropout loss for a single-hidden-layer network, where M_1, M_2 and b are the weight matrices and bias before any dropout is applied:

    \begin{align*} \mathcal{L}_d = \frac{1}{M}\sum_{i\in S}\frac{1}{2}||y_i-f^{M_1,M_2,b}(x_i)||^2 + \lambda_1||M_1||^2 + \lambda_2||M_2||^2+\lambda_3||b||^2 \end{align*}

In Consistent inference of probabilities in layered networks: Predictions and generalizations, it is shown that, under a Gaussian observation model with precision \tau, the squared error term is equivalent (up to a constant) to a log probability:

    \begin{align*} \frac{1}{2}||y-f^{M_1,M_2,b}(x)||^2 = -\frac{1}{\tau}log(p(y|f^{M_1,M_2,b}(x)))+C \end{align*}
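To see where this comes from, take the likelihood to be a Gaussian centered on the network output with precision \tau (and output dimension D); then

    \begin{align*} -log(p(y|f^{M_1,M_2,b}(x))) = \frac{\tau}{2}||y-f^{M_1,M_2,b}(x)||^2 + \frac{D}{2}log\left(\frac{2\pi}{\tau}\right) \end{align*}

and dividing through by \tau and moving the constant to the other side gives the expression above.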

If we define a function g(\theta,\hat{\epsilon}) = \{diag(\hat{\epsilon}_1)M_1, diag(\hat{\epsilon}_2)M_2, b\}, which corresponds to a network with dropout when the \hat{\epsilon}_i are drawn from Bernoulli distributions, our fully formed loss function becomes:

    \begin{align*} \mathcal{L}_d = -\frac{1}{\tau M}\sum_{i\in S}log(p(y_i|f^{g(\theta,\hat{\epsilon})}(x_i))) + \lambda_1||M_1||^2 + \lambda_2||M_2||^2+\lambda_3||b||^2 \end{align*}
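As a sketch of what g(\theta,\hat{\epsilon}) means in code (the dimensions, the ReLU nonlinearity, and the regularization weight are all illustrative choices, not taken from the thesis):

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x, M1, M2, b, eps1, eps2):
        # Single-hidden-layer net with weights {diag(eps1) M1, diag(eps2) M2, b}.
        # Bernoulli eps1, eps2 zero out rows of M1 and M2, i.e. dropout applied
        # to the input units and to the hidden units.
        h = np.maximum(0.0, x @ (np.diag(eps1) @ M1) + b)   # hidden layer
        return h @ (np.diag(eps2) @ M2)                      # output layer

    Q, K, D, p = 4, 8, 2, 0.5                                # toy dimensions
    M1, M2, b = rng.standard_normal((Q, K)), rng.standard_normal((K, D)), np.zeros(K)
    x, y = rng.standard_normal(Q), rng.standard_normal(D)

    eps1, eps2 = rng.binomial(1, p, Q), rng.binomial(1, p, K)
    sq_err = 0.5 * np.sum((y - f(x, M1, M2, b, eps1, eps2)) ** 2)
    l2 = np.sum(M1 ** 2) + np.sum(M2 ** 2) + np.sum(b ** 2)
    loss = sq_err + 1e-4 * l2                                # one term of L_d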

What this means

If you look closely, you can see how similar \mathcal{L}_d and \hat{\mathcal{L}}_{MC} are. Essentially, if we can select a prior, p(\omega), for which the KL divergence term equals our L2 penalty, the two objectives match up to a constant scale factor.

Of course, since dropout is implemented simply by enabling and disabling neurons, we never have to worry about this in practice. But, mathematically, it turns out that choosing the prior to be a product of independent zero-mean Gaussians with appropriate variance makes this approximately true.

Most importantly, though, this means we can link the mathematics of the approximating distribution and the KL estimate to dropout itself, which becomes important when we talk about the idea of Bayesian Deep Learning (continued here if you want to read about regression, or here for classification).
