Auto-Encoding Variational Bayes

In my previous post on variational inference, I gave a quick foundation of the theory behind variational auteoncoders (VAEs). The dominant lay understanding of VAEs is that they’re models which allow you to take map an image to a smaller latent space and manipulate that image in meaningful ways by changing the values of the latent space. Most machine learning tutorials are centered around this interpretation. However, I feel this short-sells their meaning (and makes people say “just use a GAN instead”, which is disappointing). It is more correct, and interesting, to say that VAEs are an implementation of variational inference which allows use of arbitrarily complex functions (read: neural networks) as approximating distributions.

Like the previous post, a lot of this information was pulled from the Stanford Probabilistic Graphical Models Course, but it’s also worth reading the original (if opaque) Kingma & Welling Paper, watching this video (starting at 7 minutes or so), and reading this StackOverflow post.

Throughout this post, x refers to an input, z refers to latent variables, p(*) refers to an actual distribution, and q(*) refers to an approximating distribution.

In a variational autoencoder, we have a (true) equation p(z|x)=\frac{p(x,z)}{p(x)}, and want to estimate our p(x).

In order to do this, the first thing we do is re-formulate the VAE objective. We still use the upper bound:

    \begin{equation*} $log(Z(\theta)) = KL(q||p)-J(q)$ \end{equation*}

Which, in our variables, is:

    \begin{align*} log(p(x)) &= KL(q(z|x)||p(z|x)) - KL(q(z|x)||p(x,z)) \\ &= KL(q(z|x)||p(z|x)) + \mathbb{E}[(log(p(x,z))-log(q(z|x)))] \end{align*}

We note that KL is always greater than or equal to zero, so we remove the intractable first KL term. We then expand p(x,z) with the probability chain rule (and logarithm product rule).

    \begin{equation*} log(p(x)) \geq \mathbb{E}[log(p(x|z))+log(p(z))-log(q(z|x))] \end{equation*}

Exploit the linearity of expectation:

    \begin{equation*} log(p(x)) \geq \mathbb{E}[(log(p(x|z)))]+\mathbb{E}[log(p(z))-log(q(z|x)))] \end{equation*}

Flip the signs and collect the right two terms as a KL divergence (as q(z|x) marginalizes out, this is a straightforward algebraic manipulation of the KL divergence):

    \begin{equation*} log(p(x)) \geq \mathbb{E}[log(p(x|z))] - KL(q(z|x)||p(z)) \end{equation*}

Which is our objective function for the variational autoencoder. We can pause here for a minute and take a look at how this lines up with the common interpretation of our variational autoencoder. We note that there is a q(z|x), which corresponds to our encoder network, a p(z), which corresponds to our prior (a set of independent multivariate unit gaussians), and a p(x|z), which corresponds to our decoder network. The expected value operator is also significant, as it means we never actually need to know the value of the probability. The expected value is simply the output of the decoder network when all latent variables are at their mean value, and making the probability of x higher under this distribution corresponds to just making the input and output images match more closely!

There is one (not small) detail missing before we can perform gradient descent. Since z are random variables, we can’t differentiate through them to train the encoder and decoder network. This is what the reparameterization trick is for. In the reparameterization trick, we produce two outputs in our latent space (mean and standard deviation) and treat them as deterministic. That is, these are the outputs we want to learn. But we have a third, random, variable that we multiply sigma by which is drawn from a unit normal distribution. Since this isn’t learned, we can still backpropagate through the network as desired despite having random variables within it.

A rapid look at variational inference with KL divergence

I’ve been spending some time reading up on variational autoencoders (VAEs), which are a paradigm of machine learning that draws some interesting parallels with Bayesian Inference. In order to understand why VAEs worked (aside from just saying neural networks and refusing to think further), I had to get at least a surface level understanding of variational methods.

This information is rephrased from the Stanford Course on Probabilistic Graphical Models here.

Variational inference seeks to cast (often intractable) inference problems as optimization problems. There are some differences between how various fields refer to inference, but in this case inference signifies extracting information from the probabilistic model. This information may be the probability of a given variable after we sum everything else out (marginal inference) or the most likely assignment to the variables in the model (maximum a-posteriori).

Performing variational inference requires an approximating family of distributions, Q, and an optimization objective, J(q). One type of variational inference, called “mean field” uses the product of many univariate distributions as the approximating family. However, there are many other options (for example, VAEs use neural networks as approximating families).

For the optimization objective, J(q), we typically use the KL divergence, which is defined as:

    \begin{equation*} KL(q||p) = \Sigma q(x) log(\frac{q(x)}{p(x)}) \end{equation*}

The KL divergence is not symmetric (KL(q||p) \neq KL(p||q)) but is always greater than 0 when p \neq q and 0 when p = q. This second point is important.

As a specific problem, we show p(x) to be a joint distribution of many variables, which (for purely pedagogical reasons) is the normalized version of \tilde{p}(x). \theta represents the parameters of the distribution, and Z is the normalizing distribution.

    \begin{equation*} $p(x_1,...,x_n;\theta) = \frac{\tilde{p}(x_1,...,x_n;\theta)}{Z(\theta)}$ \end{equation*}

From here, we note that we can’t calculate p(x) or Z, but we can still play with our optimization objective.

    \begin{equation*} J(q) = \Sigma q(x)log(\frac{q(x)}{\tilde{p}(x)}) \end{equation*}

We note that p(x) = \frac{\tilde{p(x)}}{Z(\theta)}, so it follows that our optimization objective can be expressed as:

    \begin{equation*} J(q) = \Sigma q(x) log(\frac{q(x)}{p(x)}) - log(Z(\theta)) \end{equation*}

And re-formed as:

    \begin{equation*} log(Z(\theta)) = KL(q||p)-J(q) \end{equation*}

The KL divergence between q and p is still intractable. HOWEVER, since KL divergences are always greater than or equal to zero, our KL divergence between q and \tilde{p} (i.e., J(q)) is tractable, and is a lower bound on the probability of log(Z(\theta)). This is called the Variational Lower Bound (VLBO) or more commonly (in my reading) the Evidence Lower Bound (ELBO).

How is this equation useful? It shows up in the mathematics for the variational autoencoder, which I describe in the next post. It also shows up all sorts of other places that I haven’t covered.

Help! My PyTorch script returns GPU Out of Memory when running on CPU!

I often find myself training a neural network on my while simultaneously attempting to implement changes for a different experiment. In this case, I typically run the training on the GPU and perform debugging on the CPU.

I do this primarily to avoid the dreaded out of memory error. However, if you’re really maxing out memory usage on your GPU, the error may still appear. This is due to the fact that PyTorch writes information to the default GPU device regardless of where you place your tensors and model.

To prevent this, you need to empty the environment variable CUDA_VISIBLE_DEVICES, which could be done via your bashrc or a manual export:


. However, it’s simplest to hide your GPU devices when you call your Python program, by running it with the following syntax:


Classification Probability in Deep Networks (Bayesian Deep Learning Part III)

In a previous post, I described Yarin Gal’s interpretation of dropout. This implementation allows us to generate an approximate distribution across the output of a deep neural network via Monte Carlo sampling, implemented through inference time dropout. I previously described how to use this distribution for uncertainty in regression, where the concepts of probability apply in a fairly straightforward way. In this post, I will describe how these distributions are used in classification problems.

While regression seeks to map an input to a numerical output, classification seeks to map an input to a categorical output (cat, dog, bird, etc).  These categorical outputs do not lend themselves to expected values and covariances in the same way a regression problem would, so we have to determine another way to characterize the distribution.

First, let’s think about what our Monte Carlo integration is giving us. Every inference gives us a distribution across our categorical outputs, and all of these distributions are combined to one output distribution. An example distribution may look like this:

This distribution has the highest probability for cat, with dog and squirrel running close behind. This distribution is neither Gaussian nor unimodal, which means that our mean and covariance statistics will not function in this case (though if the categorical outputs are contiguous, such statistics may be justifiable). So our guiding question is: how do we properly describe our confidence in this distribution’s output?

Gal suggests three measures for determining the spread of the distribution. The first, variation ratio, is calculated as 1 - \frac{f_x}{T}, where T is the number of samples, and f_x is the number of times the network’s highest value is the same as the mode of highest values across all forward passes. By counting the number of times the mode answer is selected, it approximates 1-p(y=c|X,D_{train}).

A second measure, predictive entropy, averages the distribution across all of our Monte Carlo samples, then runs them through a softmax operator, giving a biased estimator of the form:

    \begin{align*} \mathbb{H}[y|x,D_{train}] := -\sum_cp(y=c|x,D_{train})logp(y=c|x,D_{train}) \end{align*}

Where our p(y=c|x,D_{train}) is the softmaxed average output for each class, c.

Closely related is the mutual information metric. This metric provides a measure of how consistent the model is across multiple MC passes (Gal uses the term “model’s confidence”), while variation ratio and predictive entropy measure how likely the output is to be correct (“uncertainty in the output”). The equation for mutual information is:

    \begin{align*} \mathbb{I}[y,\omega|x,D_{train}:=&\mathbb{H}[y|x,D_{train}]-\mathbb{E}_{p(\omega|D_{train})}[\mathbb{H}[y|x,D_{train}]]\\ =&\sum_{c}(\frac{1}{T}\sum_tp(y=c|x,\omega_t)log(\frac{1}{T}\sum_tp(y=c|x,\omega_t))\\+&\frac{1}{T}\sum_{c,t}p(y=c|x,\omega_t)log(p(y=c|x,\omega_t)) \end{align*}


These three methods for determining the certainty in classification conclude the high level, and possibly slightly incorrect, tour of the method described in Yarin Gal’s thesis. His proposed approach allows us to develop the critically necessary notion of uncertainty while leveraging the power of deep networks. The downside is, of course, that these approximations require many forward passes. This may not be an issue where power and time are comparatively cheap, such as medical diagnosis. However, with power-limited hardware or real time requirements (cellphones and robots), we may need to reframe networks yet again to get a good answer.

But perhaps by the time we figure that out, Moore’s law would have made it all a moot point.

If you’re interested in this subject, I would definitely recommend reading Yarin Gal’s thesis. One of his former labmates, Alex Kendall, also wrote a quite informative blog post arguing for this line of research. Gal has, of course, published many other papers on the subject. If you’re intimately familiar with Gaussian Processes, you may be more comfortable with these, but I found his thesis to be the most digestable of all I read.

Regression Probability in Deep Networks (Bayesian Deep Learning Part II)

This post is a continuation of Explaining Dropout (Bayesian Deep Learning Part I). It is continued here for the classification problem.

Operations in the physical world are inherently uncertain, and the consequences of not understanding when to act upon the information you have are severe. Some of the most prevalent algorithms in robotics, such as particle filters for SLAM and Kalman filters for sensor fusion, are popular because they handle this uncertainty. And while no one is going to dispute the power of deep neural networks, now famous for tasks such as image classification, object detection, and text-to-speech, they do not have a well understood metric of uncertainty. You may have noticed this when Google maps happily sent you to Rome instead of Home.

The interpretation of dropout shown in the previous post casts the output of a neural network with dropout as an approximation to the output’s  true distribution. Since it is now a distribution instead of a point estimate, we can derive an expected value and variance.

The derivation for regression (taken from Yarin Gal’s thesis) is a little more straightforward, so I’ll go through that one first.

Expected Value

Since we are dealing with a probability distribution, we don’t get a single output – we must instead find the expected value of the distribution. The formula for an expected value of a continuous distribution is:

    \begin{align*} \mathbb{E}(y) = \int_{-\infty}^{\infty}yp(y)dx \end{align*}

Using our approximating distribution, q_{\theta}(\omega) \approx p(\omega|X,Y), and the chain rule for probability, we get the integral:

    \begin{align*} \mathbb{E}(y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}yp(y|x,\omega)q_{\theta}(\omega)d\omega dy \end{align*}

Where q_{\theta}(\omega) is our approximating distribution, p is our actual distribution for input x, label y, and parameters (weights) \omega. X and Y represent our training data. Note that since our weights are random variables, this becomes a double integral as we need to marginalize them out.

In the previous post, I mentioned that a probability prior would have to be found in order to make the objective functions match. It turns out that this prior amounts to placing independent normal distributions across each weight, with a specific standard deviation (\tau^{-1}). We can then reformulate:

    \begin{align*} \mathbb{E}(y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}yN(y;f^{\omega}(x),\tau^{-1}I) q_{\theta}(\omega)d\omega dy \end{align*}

The expected value of a Gaussian distribution is just its mean, and the mean is determined by our network output (the parameter), f^{\omega}(x). So if we integrate out our y, we find the expected value with one more parameter to marginalize out.

    \begin{align*} \mathbb{E}(y) = \int_{-\infty}^{\infty}f^{\omega}(x) q_{\theta}(\omega)d\omega \end{align*}

If we evaluate every possible combination of weights and average them (since the more likely output will occur more frequently), we can estimate this integral. This is called Monte Carlo Integration.

Since the weights are Bernoulli distributed, a combination of weights is simply one potential set of activations. Or, pragmatically, a forward pass with dropout enabled. Each of these forward passes results is a single Monte Carlo sample, which leaves us with the final estimator:

(1)   \begin{equation*} \mathbb{E}[y] := \frac{1}{T}\sum_{t=1}^Tf^{\hat{\omega}_t}(x) \end{equation*}

For T samples. As T \rightarrow \infty, we approach the true expected value.


While performing inference time dropout improves network performance slightly, the  gain is not sufficient to justify performing multiple inferences for Monte Carlo integration. The real reason this process is valuable is its ability to estimate variance.

We start with the statement that:

(2)   \begin{equation*} Var[y] = \mathbb{E}(yy^T)-\mathbb{E}(y)\mathbb{E}(y)^T \end{equation*}

I note that my version of the equation differs slightly from the one in Gal’s thesis. The only reason for this is that I more traditionally see covariance written as yy^T for column vectors, while Gal is performs his derivation with row vectors. \mathbb{E}(y) is known from previous, but we still need to figure out \mathbb{E}(yy^T). Conveniently, the procedure is much the same as for expected value. We start with the expected value formula, using yy^T instead of y. We note that y and yy^T have the same probability density function.

    \begin{align*} \mathbb{E}[yy^T] = \int\int \left(yy^Tp(y|x,\omega)dy\right)q_{\theta}(\omega)d\omega \end{align*}

We note that the integral with respect to y is just the expected covariance, so we can rearrange 2 such that it represents expected variance.

    \begin{align*} \mathbb{E}[yy^T] = \int\left(Cov_{p(y|x,\omega)}[y]+\mathbb{E}_{p(y|x,\omega)}[y]\mathbb{E}_{p(y|x,\omega)}[y]^T\right)dyq_{\theta}(\omega)d\omega \end{align*}

Since we know \mathbb{E}(y), and the covariance, this becomes:

    \begin{align*} \mathbb{E}[yy^T] = \int\left(\tau^{-1}\bf{I}+f^{\omega}(x)f^{\omega}(x)^T\right)q_{\theta}(\omega)d\omega \end{align*}

The term \tau is actually a characteristic of the network itself. It’s tied to the weight-decay parameter and dropout rate. I haven’t needed to dig my teeth into this yet, so I can’t provide a good explanation of that parameter. HOWEVER, the definition of this term is such that if your network has no weight decay parameter, it can be ignored (as \lambda \rightarrow 0, \tau \rightarrow \infty).

Since our \tau^{-1}\bf{I} term is not dependent on \omega, and q_{\theta}(\omega) integrates to 1, we can pull it out of the integral.  Monte Carlo integration is performed on the remaining term in the same manner as 1. This results in the final unbiased estimator for \mathbb{E}[yy^T]:

    \begin{align*} \mathbb{E}[yy^T] = \tau^{-1}\bf{I} + \frac{1}{T}\sum_{t=1}^T f^{\hat{\omega}_t}(x)f^{\hat{\omega}_t}(x)^T \end{align*}

And substituting this estimate back into 2 we end up with our final estimator for variance:

    \begin{align*} Var[Y] := \tau^{-1}\bf{I} + \frac{1}{T}\sum_{t=1}^T f^{\hat{\omega}_t}(x)f^{\hat{\omega}_t}(x)^T - \mathbb{E}[y]\mathbb{E}[y]^T \end{align*}

Explaining Dropout (Bayesian Deep Learning Part I)

Dropout is a now-ubiquitous regularization technique introduced by Hinton in 2012, and originally provided without any meaningful theoretical grounding. In a network with dropout, neurons are randomly turned on and off at training time, and the outputs are averaged at inference time. Intuitively, this creates a ensemble of classifiers, each of which focuses on slightly different features, thus preventing overfitting. This quora question explains it a bit more in depth in a few different ways.

While intuitively satisfying and generally accepted, this explanation of dropout lacks mathematical rigor. A Bayesian perspective, taken recently by Yarin Gal provides a mathematically grounded explanation of dropout, showing it is equivalent to minimizing the divergence between  true and approximate distributions over the network weights.

I’ve pulled the derivation that shows this for a single hidden layer network from his thesis, and added a bit of commentary based on my understanding of it.

Finding an Approximating Distribution

With \omega as our weights, \theta our variational parameters, and X and Y our training inputs and outputs, we find the best approximating distribution, q_{\theta}(\omega), for our actual distribution, p(\omega|X,Y), by minimizing Kullback-Leibler divergence.   Or, in less words, we seek to minimize:

    \begin{align*} KL(q_{\theta}(\omega)||p(\omega|X,Y)) = \int q_{\theta}(\omega)log\left(\frac{q_{\theta}(\omega)}{p(\omega|X,Y)}\right)d\omega \end{align*}

Via the magic of Bayes’ theorem and logarithm rules, we can reach an alternate form of the loss:

    \begin{align*} \mathcal{L}_{VI} = -\int q_{\theta}(\omega)log(p(Y|X,\omega))d\omega + KL(q_{\theta}(\omega)||p(\omega)) + C \end{align*}

We note that our Y and X are discrete, and reformat our log probability:

    \begin{align*} \mathcal{L}_{VI} = -\sum_{i=1}^N\int q_{\theta}(\omega)log(p(y_i|f^{\omega}(x_i))d\omega + KL(q_{\theta}(\omega)||p(\omega)) + C \end{align*}

Where f^{\omega} represents the model output for a given input and weight parameters, and x_i and y_i represent single inputs and outputs from our training set, N is the number of training samples. As a constant, C can (and will be) dropped from our optimization procedure.

Further, we note we can reformulate our final loss function using a subset of the training data:

    \begin{align*} \hat{\mathcal{L}}_{VI} = -\frac{N}{M}\sum_{i\in S}\int q_{\theta}(\omega)log(p(y_i|f^{\omega}(x_i))d\omega + KL(q_{\theta}(\omega)||p(\omega)) \end{align*}

Where M is the size of our set S. The \frac{N}{M} term works as a normalizer, making sure our scale doesn’t change. This is important to show because dropout generates a sample population, not a predictable iteration through all datapoints.

Utilizing the pathwise derivative estimator, we arrive at a new loss function:

    \begin{align*} \label{MCLoss} \hat{\mathcal{L}}_{MC}(\theta) = -\frac{N}{M}\sum_{i\in S}log(p(y_i|f^{g(\theta,\epsilon)}(x_i)) + KL(q_{\theta}(\omega)||p(\omega)) \end{align*}

where \epsilon is drawn from an as of yet undefined distribution.

Reformulating Dropout

We’ve reformulated our KL minimization, now we need to reformulate dropout to match. We start with a standard regularized loss function, where M_1, M_2 and b represent weights and biases with no dropout:

    \begin{align*} \mathcal{L}_d = \frac{1}{M}\sum_{i\in S}\frac{1}{2}||y_i-f^{M1,M2,b}(x)||^2 + \lambda_1||M_1||^2 + \lambda_2||M_2||^2+\lambda_3||b||^2 \end{align*}

In Consistent inference of probabilities in layered networks: Predictions and generalizations, it is shown that the squared error term is equivalent to:

    \begin{align*} \frac{1}{2}||y-f^{M1,M2,b}(x)||^2 = -\frac{1}{\tau}log(p(y|f^{M_1,M_2,b}(x))+C \end{align*}

If we designate a function g(\theta,\hat{\epsilon}) - \{diag(\hat{\epsilon}_1)M_1, diag(\hat{\epsilon}_2)M_2,b\}, which corresponds to a dropout layer if the epsilon are a Bernoulli distribution, our fully formed loss function becomes:

    \begin{align*} \mathcal{L}_d = \frac{1}{M}\sum_{i\in S}log(p(y|f^{g(\theta,\hat{\epsilon})}(x)) + \lambda_1||M_1||^2 + \lambda_2||M_2||^2+\lambda_3||b||^2 \end{align*}

What this means

If you look closely, you can see how similar \mathcal{L}_d and \hat{\mathcal{L}}_{MC} are. Essentially, if we can select a prior, p(\omega), where the KL convergence is equal to our L2 penalty, they match to a scale.

Of course, due to the fact that it’s implemented via enabling and disabling neurons, we don’t ACTUALLY have to worry about this in implementation. But, mathematically, it turns out that by setting the prior to independent multivariate normals with appropriate variance, we can make this true.

Most importantly, though, this means that we can link mathematics related to the approximating distribution and KL estimation to our dropout, which becomes important when we talk about the idea of Bayesian Deep Learning (continued here if you want to read about regression, or here for classification).