Hidden Markov Models (part II): forward-backward algorithm for marginal conditional probability of the states

(in the same series HMM (part I): recurrence equations for filtering and prediction)

Consider a Hidden Markov Model (HMM) with hidden states $x_t$ (for $t \in {1, 2, \cdots, T}$), initial probability $p(x_1)$, observed states $y_t$, transition probability $p(x_t|x_{t-1})$ and observation model $p(y_t|x_t)$. This model can be factorized as
$p(x_{1:T},y_{1:T}) = p(y_1|x_1)p(x_1)\prod_{t=2}^{t=T}p(y_t|x_t)p(x_t|x_{t-1})$. We will use the notation $X=x_{1:T}$ to represent the set $X=\{x_1,x_2,\cdots,x_T\}$.
In this post we will present the details of the method to find the smoothing distribution $p(x_t|y_{1:T})$ of a HMM, given a set of observations $y_{1:T}$:
Our starting point is the marginal probability $p(x_t|y_{1:T})$ of $x_t$ given all the observations $y_{1:T}$.

\begin{aligned} p(x_t|y_{1:T}) &= \frac{p(x_t,y_{1:T})}{p(y_{1:T})} \\ &= \frac{p(x_t,y_{1:t},y_{(t+1):T})}{p(y_{1:T})}\\ &= \underbrace{p(y_{(t+1):T}|x_t)}_{\beta_t(x_t)}\underbrace{p(x_t,y_{1:t})}_{\alpha_t(x_t)}\frac{1}{p(y_{1:T})} \\ &= \frac{\alpha_t(x_t) \beta_t(x_t)}{p(y_{1:T})} \end{aligned}

This marginal can be computed by using dynamic programming to calculate the values of $\beta_t(x_t)$ and $\alpha_t(x_t)$. Starting with $\alpha_t(x_t)$ we have
\begin{aligned} \alpha_t(x_t) &= p(x_t,y_{1:t}) \\ &= \sum_{x_{t-1}} p(x_t,x_{t-1},y_{1:t}) \\ &= \sum_{x_{t-1}} p(x_t,x_{t-1},y_{1:(t-1)},y_t) \\ &= \sum_{x_{t-1}} p(y_t|x_t,x_{t-1},y_{1:(t-1)})p(x_t|x_{t-1},y_{1:(t-1)})\underbrace{p(x_{t-1},y_{1:(t-1)})}_{\alpha_{t-1}(x_{t-1})} \\ \Rightarrow \underbrace{\alpha_t(x_t)}_{\text{\texttt{alpha[k,t]}, k=}x_t} & = \underbrace{p(y_t|x_t)}_{\text{\texttt{fyx}}} \sum_{x_{t-1}} \underbrace{p(x_t|x_{t-1})}_{\text{\texttt{fxx}}} \underbrace{\alpha_{t-1}(x_{t-1})}_{\text{\texttt{alpha[ks,t-1]}, ks=}x_{t-1}} \end{aligned}

For $\beta_t(x_t)$ we have:
\begin{aligned} \beta_t(x_t) &= p(y_{(t+1):T}|x_t) \\ &= \sum_{x_{t+1}} \frac{p(y_{t+1},y_{(t+2):T},x_t,x_{t+1})} {p(x_t)} \\ &= \sum_{x_{t+1}} \frac{p(y_{(t+2):T}|y_{t+1},x_t,x_{t+1})p(y_{t+1}|x_{t+1},x_t)p(x_{t+1}|x_t)p(x_t)} {p(x_t)} \\ &= \sum_{x_{t+1}} \frac{p(y_{(t+2):T}|x_{t+1})p(y_{t+1}|x_{t+1})p(x_{t+1}|x_t){p(x_t)}} {p(x_t)} \\ &= \sum_{x_{t+1}} \underbrace{p(y_{(t+2):T}|x_{t+1})}_{\beta_{t+1}(x_{t+1})} p(y_{t+1}|x_{t+1})p(x_{t+1}|x_t) \\ \Rightarrow \underbrace{\beta_t(x_t)}_{\text{\texttt{beta[k,t]}, k=}x_t} &= \sum_{x_{t+1}} \underbrace{\beta_{t+1}(x_{t+1})}_{\text{\texttt{beta[ks,t+1]}, ks=}x_{t+1}} \underbrace{p(y_{t+1}|x_{t+1})}_{\text{\texttt{fyx}}} \underbrace{p(x_{t+1}|x_t)}_{\text{\texttt{fxx}}} \end{aligned}

Considering a chain of length $T$ and $K$ number of states, both this recursive calculations can be done in $O(TK^2)$ by storing the intermediate results in two vectors $\text{\texttt{alpha}}$ and $\text{\texttt{beta}}$ using $O(TK)$ space (where $\text{\texttt{fxx}}$ and $\text{\texttt{fyx}}$ are function calculating $p(x_{t+1}|x_t)$ and $p(y_t|x_t)$ respectively). The marginal probability of $x_t$ given the observations $y_{1:T}$ can be calculated in $O(TK)$.