Corner Detection and Good Features to Track

In a recent discussion on video tracking, the paper Good Features to Track (GFTT) came up, and my advisor “suggested” I read it. This paper, and feature tracking in general, is heavily based on the corner detection literature, so much so that you never see GFTT mentioned without corner detection coming up too. An application of corner detection, and the fundamental goal of GFTT, is feature matching, which seeks to match patches of images, generally for the purpose of finding the relationship between the two images (or video frames).

Why is corner detection important in this application? Well, we could probably calculate features for all possible windows in all possible images. But that would be slow, and we would end up with a lot of false positives. Corners have the property that they are invariant to homography (the type of transform with the most degrees of freedom in image processing), meaning a corner in one image of a scene will always be a corner in another image of the same scene. Of course, the corner may disappear from view, or the corner detector may fail, but generating our features around corners makes it much more likely we’ll be able to find the same feature in multiple images.

The image below, which I include without citation because I received it without citation, demonstrates this. Note the directions in which each window can move without changing the image within the window. Note that at the corner there is no way for the window to move without the image within the window changing.

With this in mind, we know that what we need to find is the change induced by shifting the window, which can be written as:

(1)   \begin{equation*} E =  \sum_x\sum_y w(x,y)[I(x+u,y+v)-I(x,y)]^2  \end{equation*}

Where the calculations are performed over a window (x and y), w is a weighting function that can be used to apply more weight to the center of the window (or you can use the identity), I(x,y) represents the value of the image at x and y, and u and v are the horizontal and vertical components of the shift.
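To make equation (1) concrete, here is a minimal NumPy sketch that evaluates E for one window and one shift. The function name, its arguments, and the lack of border handling are my own choices for illustration, not anything from the papers.

import numpy as np

def window_change(image, x0, y0, size, u, v, weights=None):
    """Evaluate E from equation (1): the weighted sum of squared differences
    between the window anchored at (x0, y0) and the same window shifted by (u, v)."""
    img = np.asarray(image, dtype=float)              # avoid uint8 wraparound
    if weights is None:
        weights = np.ones((size, size))               # the "identity" weighting
    window = img[y0:y0 + size, x0:x0 + size]
    shifted = img[y0 + v:y0 + v + size, x0 + u:x0 + u + size]
    # Assumes both windows lie fully inside the image.
    return np.sum(weights * (shifted - window) ** 2)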

At this point the three algorithms (Moravec, Harris, and Shi-Tomasi) become distinct, each building on the last. I'll describe them individually.

Moravec

Moravec’s corner detector was first published in his thesis in 1977. I can’t find his thesis in a five second Google search, so I won’t link it. But it’s out there.

This strategy takes the direct approach to evaluating equation (1). Moravec proposes shifting the window right 1 pixel, down 1 pixel, down-right 1 pixel, and down-left 1 pixel, and calculating E for each shift. The minimum of these four responses is taken as the score, so that an edge, in theory, still exhibits a relatively low score.
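A rough sketch of that in code, reusing the same windowing idea (the function name and defaults are mine, and border handling is again ignored):

import numpy as np

def moravec_score(image, x0, y0, size=3):
    """Moravec-style corner score: evaluate E for the four one-pixel shifts
    and keep the minimum, so an edge aligned with one shift still scores low."""
    img = np.asarray(image, dtype=float)
    window = img[y0:y0 + size, x0:x0 + size]
    shifts = [(1, 0), (0, 1), (1, 1), (-1, 1)]    # right, down, down-right, down-left
    scores = []
    for u, v in shifts:
        shifted = img[y0 + v:y0 + v + size, x0 + u:x0 + u + size]
        scores.append(np.sum((shifted - window) ** 2))
    return min(scores)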

Harris

This strategy was introduced in the paper A combined corner and edge detector in 1988 (11 years later!). I also used this derivation as a reference.

Harris noted 3 issues with Moravec’s approach:

  • Shifts are only considered at 45 degree increments
  • Since the window is binary and rectangular, the response is noisy
  • Edges and corners are not sufficiently distinct

To resolve these issues, Harris used a bit more mathematical rigor. He noted that an image evaluated at a small offset, as used in equation (1), is exactly what a Taylor series approximation is meant for. So he performed the first-order expansion and found:

(2)   \begin{equation*}I(x+u,y+v) \approx I(x,y)+I'_x(x,y)u+I'_y(x,y)v\end{equation*}

which, in matrix form, is:

(3)   \begin{equation*}I(x+u,y+v) \approx I(x,y) + \begin{bmatrix}I'_x(x,y) & I'_y(x,y)\end{bmatrix} \begin{bmatrix}u\\v\end{bmatrix}\end{equation*}

Where I'_x and I'_y represent the partial derivatives of the image with respect to x and y (which can be found using a Sobel operator or similar), respectively.
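In practice those gradients are a single call to an existing filter. A small OpenCV sketch (the filename is a placeholder):

import cv2
import numpy as np

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE).astype(np.float64)
Ix = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)    # partial derivative with respect to x
Iy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)    # partial derivative with respect to y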

If we substitute this back into equation (1), we get:

(4)   \begin{equation*}E = \sum_x\sum_y w(x,y) \left[I(x,y) + \begin{bmatrix}I'_x(x,y) & I'_y(x,y)\end{bmatrix} \begin{bmatrix}u\\v\end{bmatrix} - I(x,y)\right]^2\end{equation*}

(5)   \begin{equation*} E = \sum_x\sum_y w(x,y)\left [\begin{bmatrix}I'_x(x,y) & I'_y(x,y)\end{bmatrix} \begin{bmatrix}u\\v\end{bmatrix}\right]^2\end{equation*}

Let’s go ahead and expand that (remembering (AB)^T = B^TA^T)

(6)   \begin{equation*} E = \sum_x\sum_y w(x,y)\left[\begin{bmatrix}I'_x(x,y) & I'_y(x,y)\end{bmatrix} \begin{bmatrix}u\\v\end{bmatrix}\right]^T\left[\begin{bmatrix}I'_x(x,y) & I'_y(x,y)\end{bmatrix} \begin{bmatrix}u\\v\end{bmatrix}\right]\end{equation*}

(7)   \begin{equation*} E = \sum_x\sum_y w(x,y)\left[\begin{bmatrix}u & v\end{bmatrix} \begin{bmatrix}I'_x(x,y)\\I'_y(x,y)\end{bmatrix}\begin{bmatrix}I'_x(x,y) & I'_y(x,y)\end{bmatrix} \begin{bmatrix}u\\v\end{bmatrix}\right]\end{equation*}

u and v are constant with respect to the summation, so they can be pulled outside it, leaving us with E = [u v] T [u v]^T, where:

(8)   \begin{equation*} T = \sum_x\sum_y w(x,y)\begin{bmatrix}I'_x(x,y)I'_x(x,y) & I'_x(x,y)I'_y(x,y)\\I'_y(x,y)I'_x(x,y)&I'_y(x,y)I'_y(x,y)\end{bmatrix}\end{equation*}

This is our Structure Tensor, and the Harris detector seeks the directions of maximum variance captured by this matrix. This is simply the PCA problem. Without going into too much detail on PCA (but read about it, super cool, super useful), we know that the eigenvalues capture the variance along the direction of largest variance and along the direction orthogonal to it. Since this solves for direction, we've solved the first issue pointed out by Harris. Since the variance magnitudes given by the eigenvalues correspond to orthogonal directions, we know that two large eigenvalues represent a corner, solving the third issue. The second issue can be resolved by proper application of the weighting function.
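Here's a sketch of how the per-pixel structure tensor and its eigenvalues might be computed with NumPy and OpenCV, using a Gaussian blur as the weighting function w(x,y). The function name and the parameter defaults are mine; a real detector would also threshold and apply non-maximum suppression.

import cv2
import numpy as np

def structure_tensor_eigenvalues(img, ksize=5, sigma=1.5):
    """Per-pixel structure tensor T from equation (8), with a Gaussian weighting
    window, plus the closed-form eigenvalues of each 2x2 tensor."""
    img = np.asarray(img, dtype=np.float64)
    Ix = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
    Iy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
    # The Gaussian blur plays the role of the weighted sum over the window.
    Ixx = cv2.GaussianBlur(Ix * Ix, (ksize, ksize), sigma)
    Ixy = cv2.GaussianBlur(Ix * Iy, (ksize, ksize), sigma)
    Iyy = cv2.GaussianBlur(Iy * Iy, (ksize, ksize), sigma)
    # Eigenvalues of the symmetric matrix [[Ixx, Ixy], [Ixy, Iyy]] at every pixel.
    half_trace = (Ixx + Iyy) / 2.0
    root = np.sqrt(((Ixx - Iyy) / 2.0) ** 2 + Ixy ** 2)
    lam1 = half_trace + root    # both eigenvalues large -> corner
    lam2 = half_trace - root    # the smaller eigenvalue is the Shi-Tomasi score
    return lam1, lam2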

While much literature (including Shi-Tomasi) simply selects the minimum eigenvalue, Harris suggests using the trace and determinant instead of the eigenvalues via the equation R = det(T) - k*tr(T)^2, where k is a weighting factor (0.04 is a typical choice). This speeds up computation and is based on the fact that the determinant is the product of the eigenvalues and the trace is their sum. With this equation, R becomes large and positive at corners, negative at edges, and small in flat regions.
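As a sketch, the response is just element-wise arithmetic on the weighted tensor entries from the previous snippet; the function name is mine, and OpenCV packages the same computation as cv2.cornerHarris.

def harris_response(Ixx, Ixy, Iyy, k=0.04):
    """Harris corner response R = det(T) - k*tr(T)^2, computed from the
    weighted tensor entries (Ixx, Ixy, Iyy) at every pixel."""
    det_T = Ixx * Iyy - Ixy ** 2    # product of the eigenvalues
    tr_T = Ixx + Iyy                # sum of the eigenvalues
    return det_T - k * tr_T ** 2    # large positive at corners, negative at edges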

Shi-Tomasi/Good Features to Track

The Good Features to Track paper was published in 1994. The full derivation is here if you’d like to follow along, but I’m going to give this algorithm a high level treatment.

GFTT is designed for the domain of video tracking, where points of interest are tracked in successive frames of a video. The dissimilarity measure provided by equation (1) is used in video tracking, but it is changed slightly: instead of comparing a window to a shifted copy of itself within the same image, the window is compared to the corresponding window in a later frame, and instead of seeking the shift that maximizes change, the tracker seeks the displacement that minimizes it.

This work substituted an affine transform, which is a more realistic model of motion, for the original pure translation. The affine model does not outperform translation for frame-to-frame tracking (I suspect they were disappointed when it didn't work, but their alternate application has 8,700 citations, so I guess it turned out ok). What they found was that by comparing the most recent version of the feature to its original version, which gives affine motion time to accumulate, they could determine whether or not a feature was worth tracking. If the E value was large, the point had likely disappeared, changed significantly, or never existed at all (i.e., was an artifact of the imaging process).
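For completeness: OpenCV exposes the Shi-Tomasi selection criterion directly as goodFeaturesToTrack, which is the usual front end to its pyramidal Lucas-Kanade tracker. The filename and parameter values below are placeholders, not recommendations from the paper.

import cv2

frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
corners = cv2.goodFeaturesToTrack(frame,
                                  maxCorners=200,     # keep at most this many features
                                  qualityLevel=0.01,  # fraction of the best minimum-eigenvalue score
                                  minDistance=10)     # minimum spacing between accepted features
# corners is an (N, 1, 2) float32 array of (x, y) points, ready to hand to
# cv2.calcOpticalFlowPyrLK for frame-to-frame tracking.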