## Auto-Encoding Variational Bayes

In my previous post on variational inference, I gave a quick foundation of the theory behind variational auteoncoders (VAEs). The dominant lay understanding of VAEs is that they’re models which allow you to take map an image to a smaller latent space and manipulate that image in meaningful ways by changing the values of the latent space. Most machine learning tutorials are centered around this interpretation. However, I feel this short-sells their meaning (and makes people say “just use a GAN instead”, which is disappointing). It is more correct, and interesting, to say that VAEs are an implementation of variational inference which allows use of arbitrarily complex functions (read: neural networks) as approximating distributions.

Like the previous post, a lot of this information was pulled from the Stanford Probabilistic Graphical Models Course, but it’s also worth reading the original (if opaque) Kingma & Welling Paper, watching this video (starting at 7 minutes or so), and reading this StackOverflow post.

Throughout this post, refers to an input, refers to latent variables, refers to an actual distribution, and refers to an approximating distribution.

In a variational autoencoder, we have a (true) equation , and want to estimate our .

In order to do this, the first thing we do is re-formulate the VAE objective. We still use the upper bound:

Which, in our variables, is:

We note that KL is always greater than or equal to zero, so we remove the intractable first KL term. We then expand with the probability chain rule (and logarithm product rule).

Exploit the linearity of expectation:

Flip the signs and collect the right two terms as a KL divergence (as marginalizes out, this is a straightforward algebraic manipulation of the KL divergence):

Which is our objective function for the variational autoencoder. We can pause here for a minute and take a look at how this lines up with the common interpretation of our variational autoencoder. We note that there is a , which corresponds to our encoder network, a , which corresponds to our prior (a set of independent multivariate unit gaussians), and a , which corresponds to our decoder network. The expected value operator is also significant, as it means we never actually need to know the value of the probability. The expected value is simply the output of the decoder network when all latent variables are at their mean value, and making the probability of higher under this distribution corresponds to just making the input and output images match more closely!

There is one (not small) detail missing before we can perform gradient descent. Since are random variables, we can’t differentiate through them to train the encoder and decoder network. This is what the reparameterization trick is for. In the reparameterization trick, we produce two outputs in our latent space (mean and standard deviation) and treat them as deterministic. That is, these are the outputs we want to learn. But we have a third, random, variable that we multiply sigma by which is drawn from a unit normal distribution. Since this isn’t learned, we can still backpropagate through the network as desired despite having random variables within it.

## A rapid look at variational inference with KL divergence

I’ve been spending some time reading up on variational autoencoders (VAEs), which are a paradigm of machine learning that draws some interesting parallels with Bayesian Inference. In order to understand why VAEs worked (aside from just saying neural networks and refusing to think further), I had to get at least a surface level understanding of variational methods.

This information is rephrased from the Stanford Course on Probabilistic Graphical Models here.

Variational inference seeks to cast (often intractable) inference problems as optimization problems. There are some differences between how various fields refer to inference, but in this case inference signifies extracting information from the probabilistic model. This information may be the probability of a given variable after we sum everything else out (marginal inference) or the most likely assignment to the variables in the model (maximum a-posteriori).

Performing variational inference requires an approximating family of distributions, , and an optimization objective, . One type of variational inference, called “mean field” uses the product of many univariate distributions as the approximating family. However, there are many other options (for example, VAEs use neural networks as approximating families).

For the optimization objective, , we typically use the KL divergence, which is defined as:

The KL divergence is not symmetric () but is always greater than 0 when and 0 when . This second point is important.

As a specific problem, we show to be a joint distribution of many variables, which (for purely pedagogical reasons) is the normalized version of . represents the parameters of the distribution, and is the normalizing distribution.

From here, we note that we can’t calculate or , but we can still play with our optimization objective.

We note that , so it follows that our optimization objective can be expressed as:

And re-formed as:

The KL divergence between and is still intractable. HOWEVER, since KL divergences are always greater than or equal to zero, our KL divergence between and (i.e., ) is tractable, and is a lower bound on the probability of . This is called the Variational Lower Bound (VLBO) or more commonly (in my reading) the Evidence Lower Bound (ELBO).

How is this equation useful? It shows up in the mathematics for the variational autoencoder, which I describe in the next post. It also shows up all sorts of other places that I haven’t covered.

## Help! My PyTorch script returns GPU Out of Memory when running on CPU!

I often find myself training a neural network on my while simultaneously attempting to implement changes for a different experiment. In this case, I typically run the training on the GPU and perform debugging on the CPU.

I do this primarily to avoid the dreaded out of memory error. However, if you’re really maxing out memory usage on your GPU, the error may still appear. This is due to the fact that PyTorch writes information to the default GPU device regardless of where you place your tensors and model.

To prevent this, you need to empty the environment variable CUDA_VISIBLE_DEVICES, which could be done via your bashrc or a manual export:

export CUDA_VISIBLE_DEVICES=

. However, it’s simplest to hide your GPU devices when you call your Python program, by running it with the following syntax:


CUDA_VISIBLE_DEVICES= python your_program.py



## Classification Probability in Deep Networks (Bayesian Deep Learning Part III)

In a previous post, I described Yarin Gal’s interpretation of dropout. This implementation allows us to generate an approximate distribution across the output of a deep neural network via Monte Carlo sampling, implemented through inference time dropout. I previously described how to use this distribution for uncertainty in regression, where the concepts of probability apply in a fairly straightforward way. In this post, I will describe how these distributions are used in classification problems.

While regression seeks to map an input to a numerical output, classification seeks to map an input to a categorical output (cat, dog, bird, etc).  These categorical outputs do not lend themselves to expected values and covariances in the same way a regression problem would, so we have to determine another way to characterize the distribution.

First, let’s think about what our Monte Carlo integration is giving us. Every inference gives us a distribution across our categorical outputs, and all of these distributions are combined to one output distribution. An example distribution may look like this:

This distribution has the highest probability for cat, with dog and squirrel running close behind. This distribution is neither Gaussian nor unimodal, which means that our mean and covariance statistics will not function in this case (though if the categorical outputs are contiguous, such statistics may be justifiable). So our guiding question is: how do we properly describe our confidence in this distribution’s output?

Gal suggests three measures for determining the spread of the distribution. The first, variation ratio, is calculated as , where is the number of samples, and is the number of times the network’s highest value is the same as the mode of highest values across all forward passes. By counting the number of times the mode answer is selected, it approximates .

A second measure, predictive entropy, averages the distribution across all of our Monte Carlo samples, then runs them through a softmax operator, giving a biased estimator of the form:

Where our is the softmaxed average output for each class, .

Closely related is the mutual information metric. This metric provides a measure of how consistent the model is across multiple MC passes (Gal uses the term “model’s confidence”), while variation ratio and predictive entropy measure how likely the output is to be correct (“uncertainty in the output”). The equation for mutual information is:

Conclusion

These three methods for determining the certainty in classification conclude the high level, and possibly slightly incorrect, tour of the method described in Yarin Gal’s thesis. His proposed approach allows us to develop the critically necessary notion of uncertainty while leveraging the power of deep networks. The downside is, of course, that these approximations require many forward passes. This may not be an issue where power and time are comparatively cheap, such as medical diagnosis. However, with power-limited hardware or real time requirements (cellphones and robots), we may need to reframe networks yet again to get a good answer.

But perhaps by the time we figure that out, Moore’s law would have made it all a moot point.

If you’re interested in this subject, I would definitely recommend reading Yarin Gal’s thesis. One of his former labmates, Alex Kendall, also wrote a quite informative blog post arguing for this line of research. Gal has, of course, published many other papers on the subject. If you’re intimately familiar with Gaussian Processes, you may be more comfortable with these, but I found his thesis to be the most digestable of all I read.

## Regression Probability in Deep Networks (Bayesian Deep Learning Part II)

This post is a continuation of Explaining Dropout (Bayesian Deep Learning Part I). It is continued here for the classification problem.

Operations in the physical world are inherently uncertain, and the consequences of not understanding when to act upon the information you have are severe. Some of the most prevalent algorithms in robotics, such as particle filters for SLAM and Kalman filters for sensor fusion, are popular because they handle this uncertainty. And while no one is going to dispute the power of deep neural networks, now famous for tasks such as image classification, object detection, and text-to-speech, they do not have a well understood metric of uncertainty. You may have noticed this when Google maps happily sent you to Rome instead of Home.

The interpretation of dropout shown in the previous post casts the output of a neural network with dropout as an approximation to the output’s  true distribution. Since it is now a distribution instead of a point estimate, we can derive an expected value and variance.

The derivation for regression (taken from Yarin Gal’s thesis) is a little more straightforward, so I’ll go through that one first.

Expected Value

Since we are dealing with a probability distribution, we don’t get a single output – we must instead find the expected value of the distribution. The formula for an expected value of a continuous distribution is:

Using our approximating distribution, , and the chain rule for probability, we get the integral:

Where is our approximating distribution, is our actual distribution for input , label , and parameters (weights) . and represent our training data. Note that since our weights are random variables, this becomes a double integral as we need to marginalize them out.

In the previous post, I mentioned that a probability prior would have to be found in order to make the objective functions match. It turns out that this prior amounts to placing independent normal distributions across each weight, with a specific standard deviation (). We can then reformulate:

The expected value of a Gaussian distribution is just its mean, and the mean is determined by our network output (the parameter), . So if we integrate out our , we find the expected value with one more parameter to marginalize out.

If we evaluate every possible combination of weights and average them (since the more likely output will occur more frequently), we can estimate this integral. This is called Monte Carlo Integration.

Since the weights are Bernoulli distributed, a combination of weights is simply one potential set of activations. Or, pragmatically, a forward pass with dropout enabled. Each of these forward passes results is a single Monte Carlo sample, which leaves us with the final estimator:

(1)

For samples. As , we approach the true expected value.

Variance

While performing inference time dropout improves network performance slightly, the  gain is not sufficient to justify performing multiple inferences for Monte Carlo integration. The real reason this process is valuable is its ability to estimate variance.

(2)

I note that my version of the equation differs slightly from the one in Gal’s thesis. The only reason for this is that I more traditionally see covariance written as for column vectors, while Gal is performs his derivation with row vectors. is known from previous, but we still need to figure out . Conveniently, the procedure is much the same as for expected value. We start with the expected value formula, using instead of . We note that and have the same probability density function.

We note that the integral with respect to is just the expected covariance, so we can rearrange 2 such that it represents expected variance.

Since we know , and the covariance, this becomes:

The term is actually a characteristic of the network itself. It’s tied to the weight-decay parameter and dropout rate. I haven’t needed to dig my teeth into this yet, so I can’t provide a good explanation of that parameter. HOWEVER, the definition of this term is such that if your network has no weight decay parameter, it can be ignored (as , ).

Since our term is not dependent on , and integrates to 1, we can pull it out of the integral.  Monte Carlo integration is performed on the remaining term in the same manner as 1. This results in the final unbiased estimator for :

And substituting this estimate back into 2 we end up with our final estimator for variance:

## Explaining Dropout (Bayesian Deep Learning Part I)

Dropout is a now-ubiquitous regularization technique introduced by Hinton in 2012, and originally provided without any meaningful theoretical grounding. In a network with dropout, neurons are randomly turned on and off at training time, and the outputs are averaged at inference time. Intuitively, this creates a ensemble of classifiers, each of which focuses on slightly different features, thus preventing overfitting. This quora question explains it a bit more in depth in a few different ways.

While intuitively satisfying and generally accepted, this explanation of dropout lacks mathematical rigor. A Bayesian perspective, taken recently by Yarin Gal provides a mathematically grounded explanation of dropout, showing it is equivalent to minimizing the divergence between  true and approximate distributions over the network weights.

I’ve pulled the derivation that shows this for a single hidden layer network from his thesis, and added a bit of commentary based on my understanding of it.

Finding an Approximating Distribution

With as our weights, our variational parameters, and and our training inputs and outputs, we find the best approximating distribution, , for our actual distribution, , by minimizing Kullback-Leibler divergence.   Or, in less words, we seek to minimize:

Via the magic of Bayes’ theorem and logarithm rules, we can reach an alternate form of the loss:

We note that our and are discrete, and reformat our log probability:

Where represents the model output for a given input and weight parameters, and and represent single inputs and outputs from our training set, is the number of training samples. As a constant, can (and will be) dropped from our optimization procedure.

Further, we note we can reformulate our final loss function using a subset of the training data:

Where is the size of our set . The term works as a normalizer, making sure our scale doesn’t change. This is important to show because dropout generates a sample population, not a predictable iteration through all datapoints.

Utilizing the pathwise derivative estimator, we arrive at a new loss function:

where is drawn from an as of yet undefined distribution.

Reformulating Dropout

We’ve reformulated our KL minimization, now we need to reformulate dropout to match. We start with a standard regularized loss function, where and represent weights and biases with no dropout:

In Consistent inference of probabilities in layered networks: Predictions and generalizations, it is shown that the squared error term is equivalent to:

If we designate a function , which corresponds to a dropout layer if the epsilon are a Bernoulli distribution, our fully formed loss function becomes:

What this means

If you look closely, you can see how similar and are. Essentially, if we can select a prior, , where the KL convergence is equal to our L2 penalty, they match to a scale.

Of course, due to the fact that it’s implemented via enabling and disabling neurons, we don’t ACTUALLY have to worry about this in implementation. But, mathematically, it turns out that by setting the prior to independent multivariate normals with appropriate variance, we can make this true.

Most importantly, though, this means that we can link mathematics related to the approximating distribution and KL estimation to our dropout, which becomes important when we talk about the idea of Bayesian Deep Learning (continued here if you want to read about regression, or here for classification).

## Corner Detection and Good Features to Track

In a recent discussion on video tracking, the paper Good Features to Track (GFTT) came up, and my advisor “suggested” I read it. This paper, and feature tracking in general, is heavily based on corner detection literature, so much so that they you never see GFTT mentioned without mentioning corner detection. An application of corner detection and the fundamental goal of GFTT is feature matching, which seeks to match patches of images, generally for the purpose of finding the relationship between the two images (or video frames).

Why is corner detection important in this application? Well, we could probably calculate features for all possible windows in all possible images. But that would be slow, and we would end up with a lot of false positives. Corners have the property that they are invariant to homography (the type of transform with the most degrees of freedom in image processing), meaning a corner in one image of a scene will always be a corner in another image of the same scene. Of course, the corner may disappear from view, or the corner detector may fail, but generating our features around corners makes it much more likely we’ll be able to find the same feature in multiple images.

The image below, which I include without citation because I received it without citation, demonstrates this. Note the directions in which each window can move without changing the image within the window. Note that at the corner there is no way for the window to move without the image within the window changing.

With this in mind, we know what we need to find is the change induced by moving the window, which can be written as an equation:

(1)

Where the calculations are performed over a window ( and ), is a weighting function that can be used to apply more weight to the center of the window (or you can use the identity), represents the value of the image at and , and and are the magnitude of the shift.

At this point the three algorithms – Moravec, Harris, and Shi-Tomasi, become distinct, building on each other. I’ll describe them individually.

Moravec

Moravec’s corner detector was first published in his thesis in 1977. I can’t find his thesis in a five second Google search, so I won’t link it. But it’s out there.

This strategy takes the direct approach to solving 1. Moravec proposes shifting the window to the right 1 pixel, down 1 pixel, down right 1 pixel, down left 1 pixel, and calculating for each transformation. The minimum of these four options is taken, so that an edge, in theory, still exhibits a relatively low score.

Harris

This strategy was introduced in the paper A combined corner and edge detector in 1988 (11 years later!). I also used this derivation as a reference.

Harris noted 3 issues with Moravec’s approach:

• Shifts only occur at every 45 degrees
• Since the window is binary and rectangular, the response is noisy
• Edges and corners are not sufficiently distinct

To resolve these issues, Harris used a bit more mathematical rigor. He noted that an image with slight offsets, as used in 1 is exactly what a Taylor Series Approximation is meant for. So he performed the first order expansion and found:

(2)

which, in matrix form, is:

(3)

Where and represent the partial derivatives of the image with respect to x and y (which can be found using a sobel operator or similar), respectively.

If we substitute this back into 1, we get:

(4)

(5)

Let’s go ahead and expand that (remembering )

(6)

(7)

and are constants, so they can be pulled out of the summation, leaving us with just:

(8)

This is our Structure Tensor, and the Harris detector seeks to find the maximum variance of this matrix. This is simply the PCA problem. Without going into too much detail on PCA (but read about it, super cool, super useful), we know that the eigenvalues capture variance in the direction of the largest variance, and the direction orthogonal to the direction of the largest variance. Since it solves for the direction, we’ve solved the first issue pointed out by Harris. Since the variance magnitudes given by the eigenvalues are for orthogonal directions, we know that two large eigenvalues represents a corner – solving the third issue. The 2nd issue can be resolved by proper application of the weighting function.

While much literature (including Shi-Tomasi) simply selects the minimum eigenvalue, Harris suggests using the trace and determinant instead of eigenvalues via the equation , where k is a weighting factor. This speeds up computation and is based on the fact that the determinant is the product of the eigenvalues, and the trace is the sum of the eigenvalues. With this equation, becomes positive in corners, negative in edges, and small in flat regions.

Shi-Tomasi/Good Features to Track

The Good Features to Track paper was published in 1994. The full derivation is here if you’d like to follow along, but I’m going to give this algorithm a high level treatment.

GFTT is designed for the domain of video tracking, where points of interest are tracked in successive frames of a video. The similarity measure provided by 1 is used in video tracking, but it is changed slightly. Instead of using a spatial offset, they use a temporal offset, and instead of seeking the motion that maximizes change, they seek the motion that minimizes it.

This work substituted an affine transform, which is a more realistic model of motion, for the original linear transform. This does not outperform the linear model for frame-to-frame tracking (I suspect they were disappointed when it didn’t work, but their alternate application has 8,700 citations, so I guess it turned out ok.) What they found was that by comparing the most recent version of the feature to its original version, which allows affine motion to have more of an effect, they could determine whether or not a feature was worth tracking or not. If the value was large, the point has likely disappeared, changed significantly, or never existed at all (i.e., was an artifact of the imaging process).

## A Probabilistic Derivation of the Linear Kalman Filter

Introduction

The Kalman Filter is one of those things – it doesn’t make any sense until you understand it. Then once you understand it, you don’t remember what was difficult. It took me an embarrassingly long time to get a grip on it, and a big part of that is that there are a lot of resources that provide a little bit of information, each with slightly different notation, so I’ll add one more to the pile. This will cover the probabilistic derivation of the Kalman Filter, with no example. I may add an example in a later post.

The Kalman Filter uses measurements and a guess from the previous state to tell us what the state of the system PROBABLY is. It also tells you how much to trust your estimate. How does it do this? By (not-correctly-but-close-enough) assuming that our variables are multivariate Gaussian distribution, and our noise is zero mean Gaussian and doesn’t covary with our states. If you aren’t familiar with Gaussian distributions, you just need to know that they have some very nice mathematical properties, which is why we use them even if they don’t perfectly describe the underlying process. This derivation is built on two properties of Gaussians which show how to condition Gaussian random vector x on Gaussian random vector y.

(1)

(2)

To round out our background knowledge and system description, the equations that describe the system are:

(3)

(4)

Depending of what you’ve been reading, you may note that there is no control term (commonly denoted ) in 3. I’ve left it off for simplicity, but if you understand the derivation well it’s straightforward to add.

Variables

Now that I’ve introduced a little bit of math, I’m going to outline all of the variables that get used. One of the things I’ve found most confusing about the Kalman filter is how many variables there are flying around, so you may find yourself referring back to this quite a bit.

– state and estimate of the state, respectively

– a linear transform that uses the old state to find the new state

– state update noise, which we assume to be zero mean Gaussian and uncorrelated to the state.

– covariance of the state update noise

– covariance of the state

– Kalman gain. For the purpose of this derivation, we can think of the Kalman gain as a notational convenience. This stackexchange post has an interesting perspective on it, and it also appears (via a completely different derivation) in the Minimum Variance Unbiased Estimator.

– our measurement

– describes the relationship between our measurement and the state. (See 4).

– measurement noise, which we assume to be zero mean Gaussian and uncorrelated to the measurement.

– covariance of the measurement noise

– index variable, which increments by 1 at each reading.

– denotes a covariance matrix between its subscripted variables.

Derivation

Recall from the introduction, our goal is to figure out what our state, , is, given our measurements, . If we refer back to 1 and 2, we can see that the problem comes down to filling in the variables of our distribution:

(Which is math notation for the multivariate normal, with the expected values to the left, and covariances to the right).

We designate our as (because ), and use 4 to find our expected value of y, based on x. Note that throughout this section I’ve dropped the variable indices for neatness.

(5)

Using the facts that expected value is a linear operator, C is a known, and v is assumed zero mean Gaussian, we can say:

(6)

Now we need to find the covariances. By definition:

(7)

For the other covariances, we note that .

(8)

We note the definition of Q and P, and also that the noise is assumed not to covary with the state.

(9)

Great! Two down, two to go. These are quick – remember that v and x are assumed independent (i.e., they do not covary).

(10)

(11)

Now we have the matrices describing our distribution:

(12)

And we can apply 1 and 2 to find our updates:

(13)

(14)

.

Remember when I mentioned K just being a notational convenience? Notice the similarities in the previous two equations? The Kalman gain, is simply the common term in these two equations. I’m also going to (clumsily) re-introduce the indices, to give us our Measurement Update Equations.

(15)

(16)

(17)

So why the subscripts and ? In short, we have some knowledge of what we expect the state to be at the new time, and we don’t want to just throw that useful information away. So what we do is update the expected values of and based on our model in the Prediction Step. This means that the information coming in from our measurements doesn’t need to try to compensate for the change in state, it just adjusts the error in the state update. Variables based on the prediction step are reliant on the previous measurement, and are thus denoted while variables based on the measurement update step are reliant only on estimates made at the current timestep, and are thus denoted . Remember represents the current time, so saying means “everything we know about x from information available at the current time” and means “everything we know about x from information available at the last timestep.”

So how do we perform the prediction step? First, we use our model to estimate x at the next timestep. This is the most straightforward part of the whole algorithm, though since it concerns a transition between states, you might see variation in whether it’s denoted or .

(18)

And for P:

(19)

And that ends the derivation.

Summary and Equations

This derivation shows the Kalman filter as an exploitation of the rules of Gaussians. When it comes down to it, the tasks is just to find the information needed to perform the conditioning operation, as shown in 1 and 2. Of course, this is only one derivation of one kind of Kalman Filter. For nonlinear systems, there is the Extended Kalman Filter, Unscented Kalman Filter, and others.

Measurement Update Step

(20)

(21)

(22)

Prediction Step

(23)

(24)