# Hidden Markov Models (part I): recurrence equations for filtering and prediction

This semester I will be attending the doctoral course MA8702 – Advanced Modern Statistical Methods  with the excellent Prof. Håvard Rue. It will be course about statistical models defined over sparse structures (chains and graphs). We will start with Hidden Markov Chains and after go to Gaussian Markov Random Fields, Latent Gaussian Models and approximate inference with Integrated Nested Laplace Approximation (INLA). All this models are interesting for my research objective of developing sound latent models for recommender systems and I am really happy of taking this course with this great teacher and researcher. So, I will try to cover some of the material of the course, starting from what we saw in the first lecture: exact recurrence for Hidden Markov Chains and dynamic programming. In other words, general equations for predictions, filtering, smoothing, sampling, mode and marginal likelihood calculation of state-space model with latent variables. We will start by introduction the general model and specifying how to obtain the prediction and filtering equation.

• Markovian property: $\pi(x_1,x_2,\cdots,x_T)=\pi(x_{1:T})=\prod_{t=1}^{t=T}\pi(x_t|x_{t-1})$, with $\pi(x_1|x_0)=\pi(x_1)$
• $y_t$ are observed and $x_t$ are latent, so $\pi(y_t|x_t)$ is always known.
• If we know $x_{t-1}$ than no other variable will add any information to the conditional distribution of $x_t$.

With the last information of the model we will demonstrate an important relationship between $\pi(x_t)$ and $\pi(y_t)$:

Claim 1: $\pi(y_t|x_{1:T})=\pi(y_t|x_t)$ (in other words, for the conditional distribution of observation $y_t$, knowing about other latent states does not add more information when we know latent state $x_t$)

• Proof: By the definition of conditional probability we have, $\pi(y_t|x_{1:T})=\frac{\pi(y_t,x_{1:T})}{\pi(x_{1:T})}$. Using the index $-t$ to indicate the set of indexes $1:T$ with the exception of $t$ we may rearrange the joint distribution $\pi(y_t,x_{1:T})=\pi(x_{-t}|x_t,y_t)\pi(x_t,y_t)$. Note that $y_t$ may be taking off the conditional distribution $\pi(x_{-t}|x_t,y_t)$ since given $x_t$, $y_t$ is irrelevant for the distribution of $x$, so $\pi(x_{-t}|x_t,y_t)=\pi(x_{-t}|x_t)=\frac{\pi(x_{1:T})}{\pi(x_t)}$. Putting it all together

$\frac{\pi(y_t,x_{1:T})}{\pi(x_{1:T})} = \frac{\pi(x_{-t}|x_t,y_t)\pi(x_t,y_t)}{\pi(x_{1:T})}=\pi(x_t,y_t) \frac{\pi(x_{-t}|x_t)}{\pi(x_{1:T})} = \pi(x_t,y_t)\frac{1}{\pi(x_t)}$
$\Rightarrow \pi(y_t|x_{1:T})= \frac{\pi(x_t,y_t)}{\pi(x_t)}=\pi(y_t|x_t)$

We should say that we have two aims, compute:

• Prediction(t): $\pi(x_{t+s}|y_{1:t})$, $s>0$ (usually 1)
• Filtering(t): $\pi(x_t|y_{1:t})$

1) Prediction: starting with prediction for $s=1$ (easy to generalize for $s>0$, since in this case we just need to multiply all the conditional distribution of $x_{t+1}|x_t$, between t and t+s and marginalize in all variable except $x_t$ and $x_{t+s}$)

$\pi(x_{t+1}|y_{1:t}) = \frac{\pi(x_{t+1},y_{1:t})}{\pi(y_{1:t}) }=\pi(y_{1:t})^{-1}\int \pi(x_{t+1},x_t,y_{1:t})d x_t$
$\Rightarrow \pi(x_{t+1}|y_{1:t}) = \pi(y_{1:t})^{-1}\int \pi(x_{t+1}|x_t,y_{1:t}) \pi(x_t|y_{1:t}) \pi(y_{1:t}) dx_t$
$\Rightarrow \pi(x_{t+1}|y_{1:t}) = \int \pi(x_{t+1}|x_t,y_{1:t}) \pi(x_t|y_{1:t}) dx_t = \int \pi(x_{t+1}|x_t) \pi(x_t|y_{1:t}) dx_t$

Notice that we are marginalizing the state variable $x_t$ while we multiple its filtering distribution $\pi(x_t|y_{1:t})$ with the state transtion distribution $\pi(x_{t+1}|x_t)$. The same derivation would be valid for $\pi(x_{t+s}|y_{1:t})$, with the additional step of calculating the transition distribution $\pi(x_{t+s}|x_t) = \int \prod_{i=t}^{i=t+s-1} \pi(x_{i+1}|x_i) d x_{t+1} \cdots d x_{t+s-1}$

2) Filtering:  $\pi(x_t|y_{1:t}) = \pi(y_{1:t})^{-1} \pi(x_t,y_{1:t})$. Focusing on the joint, we have
$\pi(x_t,y_{1:t})= \pi(y_t|x_t,y_{1:t-1}) \pi(x_t|y_{1:t-1}) \pi(y_{1:t-1})$, implying that $\pi(x_t|y_{1:t}) = (\pi(y_{1:t})/\pi(y_{1:t-1}))^{-1} \pi(y_t|x_t,y_{1:t-1}) \pi(x_t|y_{1:t-1})$
Observation 1: $(\pi(y_{1:t})/\pi(y_{1:t-1}))^{-1}=(\pi(y_t|y_{1:t-1}))^{-1}$ and $\pi(y_t|x_t,y_{1:t-1})= \pi(y_t|x_t)$.
Observation 2: $\pi(x_t|y_{1:t-1})$ is the prediction of the previous step (prediction(t-1)).
Now we are able to have a final recurrence equation for the filtering (using the previous time step result of the prediction distribution):
$\Rightarrow \pi(x_t|y_{1:t}) = \pi(y_t|y_{1:t-1})^{-1} \pi(y_t|x_t) \pi(x_t|y_{1:t-1})$

If we maintain a table with the relevant information of the distribution Prediction(t) and Filtering(t) ($\pi(x_{t+s}|y_{1:t})$ and $\pi(x_t|y_{1:t})$ ) we will be able to calculate the predictive distribution and filtering distribution in time $O(TK^2)$ (for a chain of T variables and K discrete values). Note that if we using s>1 than the complexity if $O(TK^{1+s})$. Needless to say that the integrals could be replaced by sums if the distribution is discrete, but we are using integral since we are considering a generic case. Also we described the recurrence in alternate form at each step estimating the prediction and filtering at the same time, but this same computations could be done independently, but we would be just wasting efforts since they can be done at once. Just for the sake of completeness the independent recurrence equations:
1) $\underbrace{\pi(x_{t+1}|y_{1:t})}_{\text{prediction}(t)} = \pi(y_t|y_{1:t-1})^{-1} \int \pi(x_{t+1}|x_t) \pi(y_t|x_t) \underbrace{\pi(x_t|y_{1:t-1})}_{\text{prediction}(t-1)} dx_t$
2) $\underbrace{\pi(x_t|y_{1:t})}_{\text{filtering}(t)} = \pi(y_t|y_{1:t-1})^{-1} \pi(y_t|x_t) \int \pi(x_{t}|x_{t-1}) \underbrace{\pi(x_{t-1}|y_{1:t-1})}_{\text{filtering}(t-1)} dx_{t-1}$

In the next posts of this series we will develop the recurrence equations for smoothing, sampling and marginal likelihood. We will also have an application of this equation in the Kalman Filter setting (Gaussian transition and gaussian noise).

## 1 thought on “Hidden Markov Models (part I): recurrence equations for filtering and prediction”

1. […] the same series HMM (part I): recurrence equations for filtering and prediction) Our starting point is the marginal probability of given all the observations […]