An obvious way to go from offline estimation to recursive (online) estimation is to take an offline method and modify it. Here we show how to modify offline least squares into recursive least squares.

Linear Difference Equation

\[\begin{align} y(t)+a_1y(t-1) + \dots + a_n y(t-n) = b_1u(t-1)+\dots+b_mu(t-m)+v(t) \label{eq:linear} \end{align}\]

where \(\{u(t)\},\{y(t)\}\) are input and output signals and \(v(t)\) is some disturbance of unspecified character.

Let \(q^{-1}\) be the backward shift operator; then

\[A(q^{-1})y(t) = B(q^{-1})u(t) + v(t),\]

where \(A(q^{-1})\) and \(B(q^{-1})\) are polynomials in the delay operator:

\[A(q^{-1}) = 1+a_1q^{-1}+\dots + a_nq^{-n}\\ B(q^{-1}) = b_1q^{-1}+\dots + b_m q^{-m}\]

Introduce the vector of lagged input-output data (regressor),

\[\phi^T(t)=\begin{bmatrix} -y(t-1) & \dots & -y(t-n) & u(t-1) & \dots & u(t-m) \end{bmatrix}\]

Then, \(~\eqref{eq:linear}\) can be rewritten as

\[y(t)=\theta^T\phi(t)+v(t)\]

where \(\theta=\begin{bmatrix}a_1 & \dots & a_n & b_1 & \dots & b_m \end{bmatrix}^T\) is the parameter vector.

If the character of the disturbance term \(v(t)\) is not specified, it is natural to use

\[\hat{y}(t\vert \theta)\triangleq\theta^T\phi(t)\]

as the prediction of \(y(t)\) having observed previous inputs and outputs.
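As a concrete illustration, here is a minimal NumPy sketch of the regressor \(\phi(t)\) and the one-step predictor \(\hat{y}(t\vert\theta)\); the function names and the 0-based array indexing are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def regressor(y, u, t, n, m):
    """phi(t) = [-y(t-1), ..., -y(t-n), u(t-1), ..., u(t-m)]^T,
    built from signal arrays y and u (y[t] stands for y(t))."""
    past_y = [-y[t - i] for i in range(1, n + 1)]
    past_u = [u[t - j] for j in range(1, m + 1)]
    return np.array(past_y + past_u)

def predict(theta, phi):
    """One-step-ahead prediction yhat(t | theta) = theta^T phi(t)."""
    return theta @ phi
```

With \(\theta=\begin{bmatrix}a_1 & \dots & a_n & b_1 & \dots & b_m\end{bmatrix}^T\), `predict(theta, regressor(y, u, t, n, m))` reproduces the model equation up to the disturbance term \(v(t)\).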

Offline Identification: The Least Squares

The parameter vector can be estimated from the measurements of \(y(t)\) and \(\phi(t)\) with \(t=1,2,\dots,N\). A common way to choose the estimate is to minimize

\[V_N(\theta)=\frac{1}{N}\sum_1^N \alpha_t[y(t)-\theta^T\phi(t)]^2\]

with respect to \(\theta\), where \(\{\alpha_t\}\) is a sequence of positive numbers that allows different weights to be given to different observations. The criterion \(V_N(\theta)\) is quadratic in \(\theta\), so it can be minimized analytically,

\[\begin{align} \hat{\theta}(N) = \left[ \sum_{1}^N \alpha_t\phi(t)\phi(t)^T \right]^{-1} \sum_{1}^N \alpha_t\phi(t)y(t), \label{eq:offline_ls} \end{align}\]

where we assume that the inverse exists. This estimate can also be computed in a recursive fashion. Let

\[\bar{R}(t) = \sum_{k=1}^{t} \alpha_k\phi(k)\phi^T(k).\]

Then, applying \(\eqref{eq:offline_ls}\) at time \(t-1\), we get

\[\sum_{k=1}^{t-1} \alpha_k\phi(k)y(k) = \bar{R}(t-1)\hat{\theta}(t-1).\]

From the definition of \(\bar{R}(t)\),

\[\bar{R}(t-1)=\bar{R}(t)-\alpha_t\phi(t)\phi^T(t)\]

Thus

\[\begin{align} \hat{\theta}(t) &=\bar{R}^{-1}(t)\left[ \sum_{k=1}^{t-1} \alpha_k\phi(k)y(k) + \alpha_t \phi(t)y(t) \right]\\ &=\bar{R}^{-1}(t)\left[ \bar{R}(t-1)\hat{\theta}(t-1)+ \alpha_t \phi(t)y(t) \right]\\ &=\bar{R}^{-1}(t)\left[ \bar{R}(t)\hat{\theta}(t-1)+ \alpha_t \phi(t)[y(t)-\phi^T(t)\hat{\theta}(t-1)] \right]\\ &=\hat{\theta}(t-1) + \bar{R}^{-1}(t)\phi(t)\alpha_t[y(t) - \hat{\theta}^T(t-1)\phi(t)] \end{align}\]

and

\[\bar{R}(t)=\bar{R}(t-1) + \alpha_t\phi(t)\phi^T(t)\]

Sometimes we may prefer to work with

\[R(t) \triangleq \frac{1}{t}\bar{R}(t)\]

Then

\[R(t)=\frac{1}{t} \left [\bar{R}(t-1) + \alpha_t\phi(t)\phi^T(t)\right]=\frac{t-1}{t} R(t-1)+\frac{1}{t}\alpha_t\phi(t)\phi^T(t)\\ =R(t-1) + \frac{1}{t}[ \alpha_t\phi(t)\phi^T(t)-R(t-1) ]\]

In summary, we can write

\[\begin{align} \hat{\theta}(t)&=\hat{\theta}(t-1) + \frac{1}{t}R^{-1}(t)\phi(t)\alpha_t [y(t)-\hat{\theta}^T(t-1)\phi(t)],\\ R(t)&=R(t-1) + \frac{1}{t}[ \alpha_t\phi(t)\phi^T(t)-R(t-1)]. \label{eq:offline_rls1} \end{align}\]
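A direct implementation of \(\eqref{eq:offline_rls1}\) might look like the sketch below; it is a sketch only, assuming \(R\) is initialized to some small positive-definite matrix so that the linear solve is well defined from the first step (initial conditions are discussed later). Note that a linear system in \(R(t)\), i.e. an effective matrix inversion, has to be solved at every step.

```python
import numpy as np

def rls_normalized_step(theta, R, phi, y, t, alpha=1.0):
    """One step of the normalized recursion:
       R(t)     = R(t-1) + (1/t) [alpha_t phi phi^T - R(t-1)]
       theta(t) = theta(t-1) + (1/t) R(t)^{-1} phi alpha_t [y - theta(t-1)^T phi]
    The inverse is realized as a linear solve, which still costs
    O(dim(theta)^3) per time step."""
    R = R + (1.0 / t) * (alpha * np.outer(phi, phi) - R)
    err = y - theta @ phi
    theta = theta + (1.0 / t) * np.linalg.solve(R, alpha * phi * err)
    return theta, R
```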

An Equivalent Form: Recursive Least Squares

Equation \(\eqref{eq:offline_rls1}\) is not well suited for computation, since a matrix inverse has to be calculated at each time step. It is more natural to introduce

\[P(t)=\bar{R}^{-1}(t)=\frac{1}{t}R^{-1}(t)\]

and update \(P(t)\) directly instead. This can be done with the matrix inversion lemma

\[[A+BCD]^{-1}=A^{-1}-A^{-1}B[DA^{-1}B+C^{-1}]^{-1}DA^{-1}.\]

The proof can be done by multiplying the RHS by \((A+BCD)\).
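A quick numerical sanity check of the lemma, with randomly generated matrices chosen purely for illustration, can be done as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)   # kept well conditioned
B = rng.standard_normal((n, k))
C = rng.standard_normal((k, k)) + k * np.eye(k)   # invertible C
D = rng.standard_normal((k, n))

Ai = np.linalg.inv(A)
lhs = np.linalg.inv(A + B @ C @ D)
rhs = Ai - Ai @ B @ np.linalg.inv(D @ Ai @ B + np.linalg.inv(C)) @ D @ Ai
print(np.allclose(lhs, rhs))   # expected to print True (up to floating-point error)
```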

Let \(A=P^{-1}(t-1)\), \(B=\phi(t)\), \(C=\alpha_t\) and \(D=\phi^T(t)\); then

\[\begin{align} P(t)&=\left[P^{-1}(t-1)+\phi(t)\alpha_t\phi^T(t) \right]^{-1}\\ &=P(t-1)-P(t-1)\phi(t) \left[ \phi^T(t)P(t-1)\phi(t)+\frac{1}{\alpha_t} \right]^{-1}\phi^T(t)P(t-1)\\ &=P(t-1)-\frac{P(t-1)\phi(t)\phi^T(t)P(t-1)}{1/\alpha_t + \phi^T(t)P(t-1)\phi(t)} \end{align}\]

Now the inversion of a square matrix of dimension \(\dim\theta\) is replaced by the inversion of a scalar.

Thus the recursive least squares can be written as

\[\begin{align} \hat{\theta}(t)&=\hat{\theta}(t-1) + L(t)[y(t) -\hat{\theta}^T(t-1)\phi(t)],\\ L(t)&=\frac{P(t-1)\phi(t)}{1/\alpha_t + \phi^T(t)P(t-1)\phi(t)},\\ P(t)&=P(t-1)-\frac{P(t-1)\phi(t)\phi^T(t)P(t-1)}{1/\alpha_t + \phi^T(t)P(t-1)\phi(t)}. \label{eq:rls} \end{align}\]
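In code, one step of \(\eqref{eq:rls}\) might be sketched as below (NumPy; `alpha` plays the role of \(\alpha_t\), and the function name is illustrative):

```python
import numpy as np

def rls_step(theta, P, phi, y, alpha=1.0):
    """One recursive least squares update:
       L     = P phi / (1/alpha + phi^T P phi)
       theta = theta + L [y - theta^T phi]
       P     = P - (P phi)(P phi)^T / (1/alpha + phi^T P phi)"""
    Pphi = P @ phi
    denom = 1.0 / alpha + phi @ Pphi
    L = Pphi / denom
    theta = theta + L * (y - theta @ phi)
    P = P - np.outer(Pphi, Pphi) / denom
    return theta, P
```

Only the scalar `denom` is ever "inverted" (divided by), in line with the remark above.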

Initial Conditions

The only assumption we made is that \(\bar{R}(t)\) is invertible, and typically it becomes invertible at time \(t_0=\dim \phi(t) = \dim\theta\). Thus, strictly speaking, the proper initial values for \(\eqref{eq:rls}\) are obtained by starting at the time \(t_0\) for which

\[P(t_0)= \left [ \sum_{k=1}^{t_0} \alpha_k\phi(k)\phi^T(k) \right ]^{-1} \\ \hat{\theta}(t_0) = P(t_0)\sum_{k=1}^{t_0}\alpha_k \phi(k)y(k).\]

It is more common to start at \(t=0\) with some invertible matrix \(P(0)\) and a vector \(\hat{\theta}(0)\). Then, the resulting estimates are

\[\hat{\theta}(t) = \left [ P^{-1}(0)+\sum_{k=1}^t \alpha_k\phi(k)\phi^T(k) \right]^{-1}\left [ P^{-1}(0)\hat{\theta}(0)+\sum_{k=1}^t \alpha_k\phi(k)y(k) \right].\]

We can see that the relative importance of the initial values decays over time as the magnitudes of the sums increase. Also, as \(P^{-1}(0)\rightarrow0\), the recursive estimate goes to the offline one. A common choice of initial values is to take \(P(0)=C \cdot I\) and \(\hat{\theta}(0)=0\), where \(C\) is some large constant.
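Putting the pieces together, a usage sketch with the common initialization \(P(0)=C\cdot I\), \(\hat{\theta}(0)=0\) could look like this; the first-order system, its coefficients, and the noise level are made-up illustrations, and `rls_step` is the hypothetical function sketched above.

```python
import numpy as np

rng = np.random.default_rng(1)
a1, b1 = -0.8, 0.5                 # made-up "true" parameters theta_o
N = 500
u = rng.standard_normal(N)
v = 0.1 * rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):              # y(t) + a1 y(t-1) = b1 u(t-1) + v(t)
    y[t] = -a1 * y[t - 1] + b1 * u[t - 1] + v[t]

theta = np.zeros(2)                # theta_hat(0) = 0
P = 1e4 * np.eye(2)                # P(0) = C I with a large C
for t in range(1, N):
    phi = np.array([-y[t - 1], u[t - 1]])
    theta, P = rls_step(theta, P, phi, y[t])

print(theta)                       # should be close to [a1, b1]
```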

Asymptotic Properties

We assume that the data are generated by

\[y(t) = \theta_\circ^T \phi(t) + v(t)\]

Inserting this equation into \(\eqref{eq:offline_ls}\),

\[\begin{align} \hat{\theta}(N) &= \left[ \sum_{t=1}^N \alpha_t\phi(t)\phi(t)^T \right]^{-1} \left \{ \sum_{t=1}^N \alpha_t\left[\phi(t)\phi^T(t)\theta_\circ+\phi(t)v(t)\right] \right\}\\ &= \theta_\circ + \left[ \frac{1}{N} \sum_{t=1}^N \alpha_t\phi(t)\phi(t)^T \right]^{-1} \frac{1}{N}\sum_{t=1}^N \alpha_t\phi(t)v(t) \end{align}\]

According to the law of large numbers, the sum \(\frac{1}{N}\sum_{t=1}^N \alpha_t\phi(t)v(t)\) converges to its expected value as \(N\) goes to infinity. This expected value depends on the correlation between the disturbance \(v(t)\) and the data vector \(\phi(t)\); it is zero only when \(v(t)\) and \(\phi(t)\) are uncorrelated. This is true when

  • \(\{v(t)\}\) is i.i.d. with zero mean
  • \(n=0\) and \(\{u(t)\}\) is independent of the zero-mean noise sequence \(\{v(t)\}\)

In both cases, \(\hat{\theta}(N)\) approaches \(\theta_\circ\) as \(N\) goes to infinity.
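This consistency claim can be checked empirically with the offline estimate \(\eqref{eq:offline_ls}\). A rough sketch (same made-up first-order system as above, \(\alpha_t\equiv 1\), white zero-mean noise) is:

```python
import numpy as np

rng = np.random.default_rng(2)
a1, b1 = -0.8, 0.5
N = 20000
u = rng.standard_normal(N)
v = 0.1 * rng.standard_normal(N)
y = np.zeros(N)
for t in range(1, N):
    y[t] = -a1 * y[t - 1] + b1 * u[t - 1] + v[t]

for n_samples in (100, 1000, 20000):
    Phi = np.column_stack([-y[:n_samples - 1], u[:n_samples - 1]])  # rows are phi^T(t)
    Y = y[1:n_samples]
    theta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)  # batch LS, alpha_t = 1
    print(n_samples, theta_hat)    # estimation error should shrink as N grows
```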

Interpretations of RLS

  1. We have shown above how RLS is derived from the offline LS version and that \(\hat{\theta}(t) = \left [ P^{-1}(0)+\sum_{k=1}^t \alpha_k\phi(k)\phi^T(k) \right]^{-1}\left [ P^{-1}(0)\hat{\theta}(0)+\sum_{k=1}^t \alpha_k\phi(k)y(k) \right].\)

    With \(P^{-1}(0)=0\), it minimizes the least squares criterion

    \[V_t(\theta)=\frac{1}{t}\sum_{k=1}^t \alpha_k[y(k)-\theta^T\phi(k)]^2\]
  2. The estimate \(\hat{\theta}\) can be seen as the Kalman filter state estimate for the state-space model \(\theta(t+1) =\theta(t),\ y(t)=\phi^T(t)\theta(t)+v(t)\) (a concrete check of this correspondence is sketched after this list)

  3. The RLS is a recursive minimization of \(\bar{V}(\theta)=\mathbb{E}\frac{1}{2}[y(t)-\theta^T\phi(t)]^2.\)

    The factor \(\phi(t)[y(t) -\hat{\theta}^T(t-1)\phi(t)]\) is then an estimate of the negative gradient of \(\bar{V}\), while

    \[t\cdot P(t) =\left [ \frac{1}{t} P^{-1}(0)+\frac{1}{t}\sum_{k=1}^t \alpha_k\phi(k)\phi^T(k) \right]^{-1}\]

    is the inverse of an estimate of the second derivative of the criterion. The estimate \(\hat{\theta}(t)\) is thus updated in a “Newton” direction with a decaying step size \(\frac{1}{t}\). More can be found in another blog post, “Recursive Least Squares Derived from Stochastic Approximation Approach”.
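Regarding interpretation 2, the correspondence with the Kalman filter can be made concrete: a standard Kalman filter measurement update for the state-space model \(\theta(t+1)=\theta(t)\), \(y(t)=\phi^T(t)\theta(t)+v(t)\) with measurement-noise variance \(R_2=1/\alpha_t\) and no process noise reproduces the gain \(L(t)\) and the covariance update in \(\eqref{eq:rls}\). A minimal sketch of that update is given below.

```python
import numpy as np

def kalman_step(theta, P, phi, y, R2):
    """Kalman filter measurement update for the static state
    theta(t+1) = theta(t), y(t) = phi^T(t) theta(t) + v(t), Var v(t) = R2.
    With R2 = 1/alpha_t this coincides with the RLS update sketched above."""
    Pphi = P @ phi
    S = phi @ Pphi + R2               # innovation variance
    K = Pphi / S                      # Kalman gain, equal to L(t) in RLS
    theta = theta + K * (y - theta @ phi)
    P = P - np.outer(Pphi, Pphi) / S
    return theta, P
```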