Corner Detection and Good Features to Track

In a recent discussion on video tracking, the paper Good Features to Track (GFTT) came up, and my advisor “suggested” I read it. This paper, and feature tracking in general, is heavily based on corner detection literature, so much so that they you never see GFTT mentioned without mentioning corner detection. An application of corner detection and the fundamental goal of GFTT is feature matching, which seeks to match patches of images, generally for the purpose of finding the relationship between the two images (or video frames).

Why is corner detection important in this application? Well, we could probably calculate features for all possible windows in all possible images. But that would be slow, and we would end up with a lot of false positives. Corners have the property that they are invariant to homography (the type of transform with the most degrees of freedom in image processing), meaning a corner in one image of a scene will always be a corner in another image of the same scene. Of course, the corner may disappear from view, or the corner detector may fail, but generating our features around corners makes it much more likely we’ll be able to find the same feature in multiple images.

The image below, which I include without citation because I received it without citation, demonstrates this. Note the directions in which each window can move without changing the image within the window. Note that at the corner there is no way for the window to move without the image within the window changing.

With this in mind, we know what we need to find is the change induced by moving the window, which can be written as an equation:

(1)   \begin{equation*} E =  \sum_x\sum_y w(x,y)[I(x+u,y+v)-I(x,y)]^2  \end{equation*}

Where the calculations are performed over a window (x and y), w is a weighting function that can be used to apply more weight to the center of the window (or you can use the identity), I(x,y) represents the value of the image at x and y, and u and v are the magnitude of the shift.

At this point the three algorithms – Moravec, Harris, and Shi-Tomasi, become distinct, building on each other. I’ll describe them individually.


Moravec’s corner detector was first published in his thesis in 1977. I can’t find his thesis in a five second Google search, so I won’t link it. But it’s out there.

This strategy takes the direct approach to solving 1. Moravec proposes shifting the window to the right 1 pixel, down 1 pixel, down right 1 pixel, down left 1 pixel, and calculating E for each transformation. The minimum of these four options is taken, so that an edge, in theory, still exhibits a relatively low score.


This strategy was introduced in the paper A combined corner and edge detector in 1988 (11 years later!). I also used this derivation as a reference.

Harris noted 3 issues with Moravec’s approach:

  • Shifts only occur at every 45 degrees
  • Since the window is binary and rectangular, the response is noisy
  • Edges and corners are not sufficiently distinct

To resolve these issues, Harris used a bit more mathematical rigor. He noted that an image with slight offsets, as used in 1 is exactly what a Taylor Series Approximation is meant for. So he performed the first order expansion and found:

(2)   \begin{equation*}I(x+u,y+v) = I(x,y)+I'_x(x,y)(x+u-x)+I'_y(x,y)(y+v-y)\end{equation*}

which, in matrix form, is:

(3)   \begin{equation*}I(x+u,y+v) =I(x,y) + \begin{bmatrix}I'_x(x,y) & I'_y(x,y)\end{bmatrix} \begin{bmatrix}u\\v\end{bmatrix}\end{equation*}

Where I'_x and I'_y represent the partial derivatives of the image with respect to x and y (which can be found using a sobel operator or similar), respectively.

If we substitute this back into 1, we get:

(4)   \begin{equation*}E = \sum_x\sum_y w(x,y) \left[I(x,y) + \begin{bmatrix}I'_x(x,y) & I'_y(x,y)\end{bmatrix} \begin{bmatrix}u\\v\end{bmatrix} - I(x,y)\right]^2\end{equation*}

(5)   \begin{equation*} E = \sum_x\sum_y w(x,y)\left [\begin{bmatrix}I'_x(x,y) & I'_y(x,y)\end{bmatrix} \begin{bmatrix}u\\v\end{bmatrix}\right]^2\end{equation*}

Let’s go ahead and expand that (remembering (AB)^T = B^TA^T)

(6)   \begin{equation*} E = \sum_x\sum_y w(x,y)\left[\begin{bmatrix}I'_x(x,y) & I'_y(x,y)\end{bmatrix} \begin{bmatrix}u\\v\end{bmatrix}\right]^T\left[\begin{bmatrix}I'_x(x,y) & I'_y(x,y)\end{bmatrix} \begin{bmatrix}u\\v\end{bmatrix}\right]\end{equation*}

(7)   \begin{equation*} E = \sum_x\sum_y w(x,y)\left[\begin{bmatrix}u & v\end{bmatrix} \begin{bmatrix}I'_x(x,y)\\I'_y(x,y)\end{bmatrix}\begin{bmatrix}I'_x(x,y) & I'_y(x,y)\end{bmatrix} \begin{bmatrix}u\\v\end{bmatrix}\right]\end{equation*}

U and V are constants, so they can be pulled out of the summation, leaving us with just:

(8)   \begin{equation*} T = \sum_x\sum_y w(x,y)\begin{bmatrix}I'_x(x,y)I'_x(x,y) & I'_x(x,y)I'_y(x,y)\\I'_y(x,y)I'_x(x,y)&I'_y(x,y)I'_y(x,y)\end{bmatrix}\end{equation*}

This is our Structure Tensor, and the Harris detector seeks to find the maximum variance of this matrix. This is simply the PCA problem. Without going into too much detail on PCA (but read about it, super cool, super useful), we know that the eigenvalues capture variance in the direction of the largest variance, and the direction orthogonal to the direction of the largest variance. Since it solves for the direction, we’ve solved the first issue pointed out by Harris. Since the variance magnitudes given by the eigenvalues are for orthogonal directions, we know that two large eigenvalues represents a corner – solving the third issue. The 2nd issue can be resolved by proper application of the weighting function.

While much literature (including Shi-Tomasi) simply selects the minimum eigenvalue, Harris suggests using the trace and determinant instead of eigenvalues via the equation R = det(T) - k*tr(T), where k is a weighting factor. This speeds up computation and is based on the fact that the determinant is the product of the eigenvalues, and the trace is the sum of the eigenvalues. With this equation, R becomes positive in corners, negative in edges, and small in flat regions.

Shi-Tomasi/Good Features to Track

The Good Features to Track paper was published in 1994. The full derivation is here if you’d like to follow along, but I’m going to give this algorithm a high level treatment.

GFTT is designed for the domain of video tracking, where points of interest are tracked in successive frames of a video. The similarity measure provided by 1 is used in video tracking, but it is changed slightly. Instead of using a spatial offset, they use a temporal offset, and instead of seeking the motion that maximizes change, they seek the motion that minimizes it.

This work substituted an affine transform, which is a more realistic model of motion, for the original linear transform. This does not outperform the linear model for frame-to-frame tracking (I suspect they were disappointed when it didn’t work, but their alternate application has 8,700 citations, so I guess it turned out ok.) What they found was that by comparing the most recent version of the feature to its original version, which allows affine motion to have more of an effect, they could determine whether or not a feature was worth tracking or not. If the E value was large, the point has likely disappeared, changed significantly, or never existed at all (i.e., was an artifact of the imaging process).


A Probabilistic Derivation of the Linear Kalman Filter


The Kalman Filter is one of those things – it doesn’t make any sense until you understand it. Then once you understand it, you don’t remember what was difficult. It took me an embarrassingly long time to get a grip on it, and a big part of that is that there are a lot of resources that provide a little bit of information, each with slightly different notation, so I’ll add one more to the pile. This will cover the probabilistic derivation of the Kalman Filter, with no example. I may add an example in a later post.

The Kalman Filter uses measurements and a guess from the previous state to tell us what the state of the system PROBABLY is. It also tells you how much to trust your estimate. How does it do this? By (not-correctly-but-close-enough) assuming that our variables are multivariate Gaussian distribution, and our noise is zero mean Gaussian and doesn’t covary with our states. If you aren’t familiar with Gaussian distributions, you just need to know that they have some very nice mathematical properties, which is why we use them even if they don’t perfectly describe the underlying process. This derivation is built on two properties of Gaussians which show how to condition Gaussian random vector x on Gaussian random vector y.

(1)   \begin{equation*}\mu_{x|y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y-\mu_y) \end{equation*}

(2)   \begin{equation*}\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\end{equation*}

To round out our background knowledge and system description, the equations that describe the system are:

(3)   \begin{equation*}x_{k+1} = A_{k}x_k + w_k\end{equation*}

(4)   \begin{equation*}y_k = C_kx_k + v_k\end{equation*}

Depending of what you’ve been reading, you may note that there is no control term (commonly denoted Gu) in 3. I’ve left it off for simplicity, but if you understand the derivation well it’s straightforward to add.


Now that I’ve introduced a little bit of math, I’m going to outline all of the variables that get used. One of the things I’ve found most confusing about the Kalman filter is how many variables there are flying around, so you may find yourself referring back to this quite a bit.

x, \hat{x} – state and estimate of the state, respectively

A – a linear transform that uses the old state to find the new state

w – state update noise, which we assume to be zero mean Gaussian and uncorrelated to the state.

R – covariance of the state update noise (ww^T)

P – covariance of the state (xx^T)

K – Kalman gain. For the purpose of this derivation, we can think of the Kalman gain as a notational convenience. This stackexchange post has an interesting perspective on it, and it also appears (via a completely different derivation) in the Minimum Variance Unbiased Estimator.

y – our measurement

C – describes the relationship between our measurement and the state. (See 4).

v – measurement noise, which we assume to be zero mean Gaussian and uncorrelated to the measurement.

Q – covariance of the measurement noise (vv^T)

k – index variable, which increments by 1 at each reading.

\Sigma – denotes a covariance matrix between its subscripted variables.


Recall from the introduction, our goal is to figure out what our state, x, is, given our measurements, y. If we refer back to 1 and 2, we can see that the problem comes down to filling in the variables of our distribution:

~N(\begin{bmatrix} x\\y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix})

(Which is math notation for the multivariate normal, with the expected values to the left, and covariances to the right).

We designate our x as \hat{x} (because \mathbb{E}(x) = \hat{x}), and use 4 to find our expected value of y, based on x. Note that throughout this section I’ve dropped the variable indices for neatness.

(5)   \begin{equation*} \hat{y} =  \mathbb{E}(Cx + v)\end{equation*}

Using the facts that expected value is a linear operator, C is a known, and v is assumed zero mean Gaussian, we can say:

(6)   \begin{equation*} \hat{y} =  \mathbb{E}(Cx + v) = \mathbb{E}(C)\mathbb{E}(x) + \mathbb{E}(v) = C\hat{x}\end{equation*}

Now we need to find the covariances. By definition:

(7)   \begin{equation*} \Sigma_{xx} = P \end{equation*}

For the other covariances, we note that \Sigma_{ab} = ab^T.

(8)   \begin{equation*} \Sigma_{yy} = (Cx + v)(Cx + v)^T = (Cx + v)(x^TC^T + v^T) \\= Cxx^TC^T + Cxv^T + vx^TC^T + vv^T \end{equation*}

We note the definition of Q and P, and also that the noise is assumed not to covary with the state.

(9)   \begin{equation*} \Sigma_{yy} = CPC^T + Q \end{equation*}

Great! Two down, two to go. These are quick – remember that v and x are assumed independent (i.e., they do not covary).

(10)   \begin{equation*} \Sigma_{xy} = x(Cx+v)^T = x(x^TC^T+v^T) = xx^TC^T+xv^T = PC^T \end{equation*}

(11)   \begin{equation*} \Sigma_{yx} = (Cx+v)x^T = Cxx^T+vx^T = CP \end{equation*}

Now we have the matrices describing our distribution:

(12)   \begin{equation*}\begin{bmatrix}x\\y\end{bmatrix} \sim N(\begin{bmatrix}\hat{x}\\C\hat{x}\end{bmatrix},\begin{bmatrix}P & PC^T \\ CP  & CPC^T+Q \end{bmatrix}) \end{equation*}

And we can apply 1 and 2 to find our updates:

(13)   \begin{equation*}\hat{x} = \hat{x} + PC^T(CPC^T + Q)^{-1}(y - C\hat{x}}) \end{equation*}

(14)   \begin{equation*}P = P - PC^T(CPC^T + Q)^{-1}(CP)\end{equation*}


Remember when I mentioned K just being a notational convenience? Notice the similarities in the previous two equations? The Kalman gain, K is simply the common term in these two equations. I’m also going to (clumsily) re-introduce the indices, to give us our Measurement Update Equations.

(15)   \begin{equation*}K_k = P_{k|k-1}C_k^T(C_kP_{k|k-1}C_k^T + Q_k)^{-1}\end{equation*}

(16)   \begin{equation*}\hat{x}_{k|k} = \hat{x}_{k|k-1} + K(y_k-C_k\hat{x}_{k|k-1})\end{equation*}

(17)   \begin{equation*}P_{k|k} = P_{k|k-1} - K_kC_kP_{k|k-1}\end{equation*}

So why the subscripts x_{k|k} and x_{k|k-1}? In short, we have some knowledge of what we expect the state to be at the new time, and we don’t want to just throw that useful information away. So what we do is update the expected values of x and P based on our model in theĀ Prediction Step. This means that the information coming in from our measurements doesn’t need to try to compensate for the change in state, it just adjusts the error in the state update. Variables based on the prediction step are reliant on the previous measurement, and are thus denoted x_{k|k-1} while variables based on the measurement update step are reliant only on estimates made at the current timestep, and are thus denoted x_{k|k}. Remember k represents the current time, so saying x_{k|k} means “everything we know about x from information available at the current time” and x_{k|k-1} means “everything we know about x from information available at the last timestep.”

So how do we perform the prediction step? First, we use our model to estimate x at the next timestep. This is the most straightforward part of the whole algorithm, though since it concerns a transition between states, you might see variation in whether it’s denoted x_{k+1|k} or x_{k|k-1}.

(18)   \begin{equation*}\mathbb{E}(X_{k+1|k}) = \mathbb{E}(A_kx_{k|k}+w) =A_k\mathbb{E}(x_{k|k}) + \mathbb{E}(w) = A_k\hat{x_{k|k}}\end{equation*}

And for P:

(19)   \begin{align*}P_{k+1|k} = (A_{k}x_k + w_k)(A_{k}x_k + w_k)^T = (A_{k}x_k + w_k)(x_k^TA_k^T + w_k^T) \\= A_{k}x_kx_k^TA_k^T + A_{k}x_kw_k^T + w_kx_k^TA_k^T + w_kw_k^T = AP_{k|k}A^T + R \end{align*}

And that ends the derivation.

Summary and Equations

This derivation shows the Kalman filter as an exploitation of the rules of Gaussians. When it comes down to it, the tasks is just to find the information needed to perform the conditioning operation, as shown in 1 and 2. Of course, this is only one derivation of one kind of Kalman Filter. For nonlinear systems, there is the Extended Kalman Filter, Unscented Kalman Filter, and others.

Measurement Update Step

(20)   \begin{align*}K_k = P_{k|k-1}C_k^T(C_kP_{k|k-1}C_k^T + Q_k)^{-1}\end{align*}

(21)   \begin{align*}\hat{x}_{k|k} = \hat{x}_{k|k-1} + K(y_k-C_k\hat{x}_{k|k-1})\end{align*}

(22)   \begin{equation*}P_{k|k} = P_{k|k-1} - K_kC_kP_{k|k-1}\end{equation*}

Prediction Step

(23)   \begin{align*}\mathbb{E}(X_{k+1|k}) = A_k\hat{x_{k|k}}\end{align*}

(24)   \begin{align*}P_{k+1|k} = AP_{k|k}A^T + R \end{align*}