A Probabilistic Derivation of the Linear Kalman Filter


The Kalman Filter is one of those things – it doesn’t make any sense until you understand it. Then once you understand it, you don’t remember what was difficult. It took me an embarrassingly long time to get a grip on it, and a big part of that is that there are a lot of resources that provide a little bit of information, each with slightly different notation, so I’ll add one more to the pile. This will cover the probabilistic derivation of the Kalman Filter, with no example. I may add an example in a later post.

The Kalman Filter uses measurements and a guess from the previous state to tell us what the state of the system PROBABLY is. It also tells you how much to trust your estimate. How does it do this? By (not-correctly-but-close-enough) assuming that our variables are multivariate Gaussian distribution, and our noise is zero mean Gaussian and doesn’t covary with our states. If you aren’t familiar with Gaussian distributions, you just need to know that they have some very nice mathematical properties, which is why we use them even if they don’t perfectly describe the underlying process. This derivation is built on two properties of Gaussians which show how to condition Gaussian random vector x on Gaussian random vector y.

(1)   \begin{equation*}\mu_{x|y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y-\mu_y) \end{equation*}

(2)   \begin{equation*}\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\end{equation*}

To round out our background knowledge and system description, the equations that describe the system are:

(3)   \begin{equation*}x_{k+1} = A_{k}x_k + w_k\end{equation*}

(4)   \begin{equation*}y_k = C_kx_k + v_k\end{equation*}

Depending of what you’ve been reading, you may note that there is no control term (commonly denoted Gu) in 3. I’ve left it off for simplicity, but if you understand the derivation well it’s straightforward to add.


Now that I’ve introduced a little bit of math, I’m going to outline all of the variables that get used. One of the things I’ve found most confusing about the Kalman filter is how many variables there are flying around, so you may find yourself referring back to this quite a bit.

x, \hat{x} – state and estimate of the state, respectively

A – a linear transform that uses the old state to find the new state

w – state update noise, which we assume to be zero mean Gaussian and uncorrelated to the state.

R – covariance of the state update noise (ww^T)

P – covariance of the state (xx^T)

K – Kalman gain. For the purpose of this derivation, we can think of the Kalman gain as a notational convenience. This stackexchange post has an interesting perspective on it, and it also appears (via a completely different derivation) in the Minimum Variance Unbiased Estimator.

y – our measurement

C – describes the relationship between our measurement and the state. (See 4).

v – measurement noise, which we assume to be zero mean Gaussian and uncorrelated to the measurement.

Q – covariance of the measurement noise (vv^T)

k – index variable, which increments by 1 at each reading.

\Sigma – denotes a covariance matrix between its subscripted variables.


Recall from the introduction, our goal is to figure out what our state, x, is, given our measurements, y. If we refer back to 1 and 2, we can see that the problem comes down to filling in the variables of our distribution:

~N(\begin{bmatrix} x\\y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix})

(Which is math notation for the multivariate normal, with the expected values to the left, and covariances to the right).

We designate our x as \hat{x} (because \mathbb{E}(x) = \hat{x}), and use 4 to find our expected value of y, based on x. Note that throughout this section I’ve dropped the variable indices for neatness.

(5)   \begin{equation*} \hat{y} =  \mathbb{E}(Cx + v)\end{equation*}

Using the facts that expected value is a linear operator, C is a known, and v is assumed zero mean Gaussian, we can say:

(6)   \begin{equation*} \hat{y} =  \mathbb{E}(Cx + v) = \mathbb{E}(C)\mathbb{E}(x) + \mathbb{E}(v) = C\hat{x}\end{equation*}

Now we need to find the covariances. By definition:

(7)   \begin{equation*} \Sigma_{xx} = P \end{equation*}

For the other covariances, we note that \Sigma_{ab} = ab^T.

(8)   \begin{equation*} \Sigma_{yy} = (Cx + v)(Cx + v)^T = (Cx + v)(x^TC^T + v^T) \\= Cxx^TC^T + Cxv^T + vx^TC^T + vv^T \end{equation*}

We note the definition of Q and P, and also that the noise is assumed not to covary with the state.

(9)   \begin{equation*} \Sigma_{yy} = CPC^T + Q \end{equation*}

Great! Two down, two to go. These are quick – remember that v and x are assumed independent (i.e., they do not covary).

(10)   \begin{equation*} \Sigma_{xy} = x(Cx+v)^T = x(x^TC^T+v^T) = xx^TC^T+xv^T = PC^T \end{equation*}

(11)   \begin{equation*} \Sigma_{yx} = (Cx+v)x^T = Cxx^T+vx^T = CP \end{equation*}

Now we have the matrices describing our distribution:

(12)   \begin{equation*}\begin{bmatrix}x\\y\end{bmatrix} \sim N(\begin{bmatrix}\hat{x}\\C\hat{x}\end{bmatrix},\begin{bmatrix}P & PC^T \\ CP  & CPC^T+Q \end{bmatrix}) \end{equation*}

And we can apply 1 and 2 to find our updates:

(13)   \begin{equation*}\hat{x} = \hat{x} + PC^T(CPC^T + Q)^{-1}(y - C\hat{x}}) \end{equation*}

(14)   \begin{equation*}P = P - PC^T(CPC^T + Q)^{-1}(CP)\end{equation*}


Remember when I mentioned K just being a notational convenience? Notice the similarities in the previous two equations? The Kalman gain, K is simply the common term in these two equations. I’m also going to (clumsily) re-introduce the indices, to give us our Measurement Update Equations.

(15)   \begin{equation*}K_k = P_{k|k-1}C_k^T(C_kP_{k|k-1}C_k^T + Q_k)^{-1}\end{equation*}

(16)   \begin{equation*}\hat{x}_{k|k} = \hat{x}_{k|k-1} + K(y_k-C_k\hat{x}_{k|k-1})\end{equation*}

(17)   \begin{equation*}P_{k|k} = P_{k|k-1} - K_kC_kP_{k|k-1}\end{equation*}

So why the subscripts x_{k|k} and x_{k|k-1}? In short, we have some knowledge of what we expect the state to be at the new time, and we don’t want to just throw that useful information away. So what we do is update the expected values of x and P based on our model in theĀ Prediction Step. This means that the information coming in from our measurements doesn’t need to try to compensate for the change in state, it just adjusts the error in the state update. Variables based on the prediction step are reliant on the previous measurement, and are thus denoted x_{k|k-1} while variables based on the measurement update step are reliant only on estimates made at the current timestep, and are thus denoted x_{k|k}. Remember k represents the current time, so saying x_{k|k} means “everything we know about x from information available at the current time” and x_{k|k-1} means “everything we know about x from information available at the last timestep.”

So how do we perform the prediction step? First, we use our model to estimate x at the next timestep. This is the most straightforward part of the whole algorithm, though since it concerns a transition between states, you might see variation in whether it’s denoted x_{k+1|k} or x_{k|k-1}.

(18)   \begin{equation*}\mathbb{E}(X_{k+1|k}) = \mathbb{E}(A_kx_{k|k}+w) =A_k\mathbb{E}(x_{k|k}) + \mathbb{E}(w) = A_k\hat{x_{k|k}}\end{equation*}

And for P:

(19)   \begin{align*}P_{k+1|k} = (A_{k}x_k + w_k)(A_{k}x_k + w_k)^T = (A_{k}x_k + w_k)(x_k^TA_k^T + w_k^T) \\= A_{k}x_kx_k^TA_k^T + A_{k}x_kw_k^T + w_kx_k^TA_k^T + w_kw_k^T = AP_{k|k}A^T + R \end{align*}

And that ends the derivation.

Summary and Equations

This derivation shows the Kalman filter as an exploitation of the rules of Gaussians. When it comes down to it, the tasks is just to find the information needed to perform the conditioning operation, as shown in 1 and 2. Of course, this is only one derivation of one kind of Kalman Filter. For nonlinear systems, there is the Extended Kalman Filter, Unscented Kalman Filter, and others.

Measurement Update Step

(20)   \begin{align*}K_k = P_{k|k-1}C_k^T(C_kP_{k|k-1}C_k^T + Q_k)^{-1}\end{align*}

(21)   \begin{align*}\hat{x}_{k|k} = \hat{x}_{k|k-1} + K(y_k-C_k\hat{x}_{k|k-1})\end{align*}

(22)   \begin{equation*}P_{k|k} = P_{k|k-1} - K_kC_kP_{k|k-1}\end{equation*}

Prediction Step

(23)   \begin{align*}\mathbb{E}(X_{k+1|k}) = A_k\hat{x_{k|k}}\end{align*}

(24)   \begin{align*}P_{k+1|k} = AP_{k|k}A^T + R \end{align*}

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.