The Kalman Filter is a widely used data-processing algorithm. It was originally developed by Rudolf Kálmán and applied to spacecraft navigation at NASA. It is now used in many fields, including signal processing, noise cancellation, and autonomous driving. In the previous post, we used a Kalman Filter to estimate landmark locations. Generally, the Kalman Filter solves a class of problems: accurately estimating system states, such as the position of a car, the temperature of a room, or the value of a random variable, by incorporating external measurements of those states.

To put it in a mathematical form, let \(x_k \in \Re^n\) be the state of the system at time \(k\). The system is governed by a linear stochastic difference equation,

\[x_k = Ax_{k-1} + Bu_{k-1} + w_{k-1}\]

with a measurement \(z \in \Re^m\)

\[z_k = Hx_k + v_k\]

The matrix \(A\) relates the current state to the previous state of the system. \(u\) is an optional control input that affects the system state, such as an air conditioner when estimating room temperature. The matrix \(H\) relates the observation to the current state. Both \(w\) and \(v\) are zero-mean Gaussian white noise, with covariance matrices \(Q\) and \(R\) respectively.
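To make the model concrete, here is a minimal sketch of one step of the process and measurement equations, using a hypothetical 1-D constant-velocity example (the matrices \(A\), \(B\), \(H\) and the noise covariances \(Q\), \(R\) below are illustrative choices, not from the original post):

```python
import numpy as np

# Hypothetical example: state = [position, velocity], measured position only.
dt = 1.0
A = np.array([[1.0, dt],
              [0.0, 1.0]])      # state transition matrix
B = np.array([[0.5 * dt**2],
              [dt]])            # control matrix (acceleration input)
H = np.array([[1.0, 0.0]])      # observation matrix: measure position only

Q = 0.01 * np.eye(2)            # process noise covariance (of w)
R = np.array([[0.25]])          # measurement noise covariance (of v)

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0])        # state at time k-1
u = np.array([0.1])             # constant acceleration command

# One step of the stochastic difference equation and its measurement
w = rng.multivariate_normal(np.zeros(2), Q)
x = A @ x + B @ u + w           # x_k = A x_{k-1} + B u_{k-1} + w_{k-1}
v = rng.multivariate_normal(np.zeros(1), R)
z = H @ x + v                   # z_k = H x_k + v_k
print(x, z)
```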

Based on the previous state of the system, the first equation gives us an estimate of the current state. Based on the observation, we can invert the measurement model to get another estimate of the current state. Both estimates are corrupted by noise. How do we combine the two estimates to reduce the prediction error? Kalman gave the following answer.

\[\hat{x}_k = \hat{x}_k^- + K(z_k - H\hat{x}_k^-)\]

where \(\hat{x}_k^- = Ax_{k-1} + Bu_{k-1}\)

The matrix \(K\) is called the Kalman gain; it minimizes the expected prediction error \(E((x_k - \hat{x}_k)(x_k - \hat{x}_k)^T)\). The posterior estimate after incorporating the observation is a linear combination of the prior estimate and the difference between the actual measurement and the predicted measurement.

To compute \(K\), we take the derivative of the expected prediction error with respect to \(K\), set it to zero, and solve for \(K\).
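The minimization can be sketched as follows (using the so-called Joseph form of the posterior covariance, a standard identity that holds for any gain \(K\); the notation \(P_k^-\) for the prior covariance is defined below):

\[P_k = (I - KH)P_k^-(I - KH)^T + KRK^T\]

\[\frac{\partial\,\mathrm{tr}(P_k)}{\partial K} = -2(I - KH)P_k^-H^T + 2KR = 0\]

\[K = P_k^-H^T(HP_k^-H^T + R)^{-1}\]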

Here we show where the update formula comes from in the first place. It is just a simple Bayesian update.

Suppose we have the state information at time \(k-1\): \(x_{k-1}|z_1 \dots z_{k-1}\) follows a Gaussian distribution with mean \(x_{k-1}\) and covariance matrix \(P_{k-1}\). Assume there is no control input \(u\). Our goal is to find the distribution of \(x_{k}|z_1 \dots z_{k}\). Since the model is linear, the resulting distribution is still Gaussian, and we only need to find its mean and covariance matrix. Since the distribution of \(x_{k-1}\) summarizes all the observations up to time \(k-1\), it suffices to find the distribution of \(x_{k}|x_{k-1}, z_{k}\).

Using Bayes rule,

\[p(x_{k}|x_{k-1}, z_{k}) = \frac{p(x_{k}|x_{k-1})p(z_k|x_{k})}{p(z_k|x_{k-1})}\]

The first step is the time update, where we use the previous state information to get an estimate of the current one: \(x_k | x_{k-1}\) follows a Gaussian distribution with mean \(x_{k}^-\) and covariance matrix \(P_k^-\), where

\[x_{k}^- = Ax_{k-1}\]

\[P_k^- = AP_{k-1}A^T + Q\]
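The time update can be sketched as a small function (the particular \(A\), \(Q\), and initial state below are illustrative values, not from the original post):

```python
import numpy as np

def predict(x_prev, P_prev, A, Q):
    """Time update: propagate the state mean and covariance one step forward."""
    x_pred = A @ x_prev              # x_k^- = A x_{k-1}
    P_pred = A @ P_prev @ A.T + Q    # P_k^- = A P_{k-1} A^T + Q
    return x_pred, P_pred

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])           # hypothetical constant-velocity model
Q = 0.01 * np.eye(2)
x_pred, P_pred = predict(np.array([0.0, 1.0]), np.eye(2), A, Q)
print(x_pred)  # [1. 1.]
```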

The second step is the observation update, where we apply Bayes' rule to find the posterior distribution. Based on the observation model, \(z_k|x_k\) follows a Gaussian distribution with mean \(Hx_k\) and covariance matrix \(R\).

We can find the posterior distribution by first working out the joint distribution and then isolating \(x_k\) to find its first and second moments.

To find the joint distribution, we take the sum of the log likelihoods of \(p(z_k|x_{k})\) and \(p(x_{k}|x_{k-1})\) and complete the square to find the inverse of the covariance matrix of the joint distribution. The computational details are omitted; if you are interested, see Section 2.3 of Bishop's book.

The resulting mean and covariance matrix of the posterior distribution are

\[x_k = x_{k}^- + K(z_k - Hx_{k}^-)\]

\[P_k = (I - KH)P_k^-\]

where \(K = P_k^-H^T(R+HP_k^-H^T)^{-1}\)
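The observation update can be sketched as follows (the prior, measurement, and matrices below are illustrative values):

```python
import numpy as np

def update(x_pred, P_pred, z, H, R):
    """Observation update: fold measurement z_k into the prior (x_pred, P_pred)."""
    S = H @ P_pred @ H.T + R                         # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)              # K = P_k^- H^T (H P_k^- H^T + R)^{-1}
    x_post = x_pred + K @ (z - H @ x_pred)           # x_k = x_k^- + K (z_k - H x_k^-)
    P_post = (np.eye(len(x_pred)) - K @ H) @ P_pred  # P_k = (I - K H) P_k^-
    return x_post, P_post

H = np.array([[1.0, 0.0]])          # measure position only
R = np.array([[0.25]])
x_post, P_post = update(np.array([1.0, 1.0]), np.eye(2), np.array([1.2]), H, R)
# The posterior position variance is smaller than the prior's (0.2 vs 1.0).
print(x_post, P_post)
```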

By incorporating the latest observation, we reduce the prediction error (the diagonal entries of \(P_k\) are smaller than those of \(P_k^-\)).

If \(R\) approaches zero, meaning the measurements are very accurate, then \(K \to H^{-1}\) (assuming \(H\) is invertible) and our estimate relies entirely on the observation. If \(P_k^-\) approaches zero, then \(K \to 0\): we have more confidence in the prediction from the previous state, so we weight the measurement difference less.
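These two limits are easy to check numerically with a scalar model (the value \(H = 2\) below is an arbitrary illustrative choice):

```python
import numpy as np

def gain(P_pred, H, R):
    """Kalman gain K = P^- H^T (H P^- H^T + R)^{-1}."""
    return P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)

H = np.array([[2.0]])                # scalar, invertible observation model
# Very accurate measurement (R -> 0): K approaches H^{-1} = 0.5
K_trust_z = gain(np.array([[1.0]]), H, np.array([[1e-12]]))
# Very confident prior (P^- -> 0): K approaches 0
K_trust_x = gain(np.array([[1e-12]]), H, np.array([[1.0]]))
print(K_trust_z[0, 0], K_trust_x[0, 0])
```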

In practice, the time-update and observation models are often non-linear, in which case the posterior distribution is no longer Gaussian. The Extended Kalman Filter (EKF) was proposed to address this issue. The model becomes

\[x_k = f(x_{k-1}, u_{k-1}, w_{k-1})\]

\[z_k = h(x_k, v_k)\]

Since we still want to maintain the Gaussian property, the EKF uses a Taylor expansion to linearize the model around the current state estimate. Once the model is linearized, the rest of the computation is the same as in the ordinary Kalman Filter.
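A minimal scalar sketch of one EKF step, assuming hypothetical nonlinear models \(f(x) = x + \sin x\) and \(h(x) = x^2\) (these functions, and the numbers below, are illustrative and not from the original post):

```python
import numpy as np

# Hypothetical nonlinear process and measurement models with their Jacobians
def f(x):  return x + np.sin(x)
def F(x):  return 1.0 + np.cos(x)    # df/dx: linearized process model
def h(x):  return x ** 2
def Hj(x): return 2.0 * x            # dh/dx: linearized measurement model

def ekf_step(x, P, z, Q, R):
    # Predict with the nonlinear model; propagate covariance with the Jacobian
    x_pred = f(x)
    P_pred = F(x) * P * F(x) + Q
    # Update: same algebra as the linear filter, with H replaced by the Jacobian
    Hx = Hj(x_pred)
    K = P_pred * Hx / (Hx * P_pred * Hx + R)
    x_post = x_pred + K * (z - h(x_pred))
    P_post = (1.0 - K * Hx) * P_pred
    return x_post, P_post

x_post, P_post = ekf_step(x=0.5, P=1.0, z=1.1, Q=0.01, R=0.1)
print(x_post, P_post)
```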

References:

Greg Welch and Gary Bishop, An Introduction to the Kalman Filter.

Christopher Bishop, Pattern Recognition and Machine Learning, Section 2.3.