Here we focus on minimizing the general criterion \(\bar{\mathbb{E}}\, l(t,\theta,\varepsilon(t,\theta))\) using a general search direction,

\[\begin{align} \begin{aligned} &\varepsilon(t) = y(t)-\hat{y}(t) \\ &R(t) = R(t-1) + \alpha(t) H(t,R(t-1),\hat{\theta}(t-1),\varepsilon(t),\eta(t))\\ &\hat{\theta}(t) = \hat{\theta}(t-1) + \alpha(t) R^{-1}(t) h(t,\hat{\theta}(t-1), \varepsilon(t), \eta(t))\\ &\xi (t+1)= A(\hat{\theta}(t))\xi(t) + B(\hat{\theta}(t))z(t)\\ &\begin{pmatrix} \hat{y}(t+1) \\ \text{col } \eta(t+1) \end{pmatrix} = C(\hat{\theta}(t)) \xi(t+1). \end{aligned} \label{eq:general_form} \end{align}\]
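To make the abstract recursion concrete, here is a minimal numerical sketch. It assumes the simplest special case (my choice for illustration, not prescribed by the text): a linear regression model \(\hat{y}(t) = \varphi(t)^T\theta\) with the quadratic criterion, so \(h = \varphi\varepsilon\), \(H = \varphi\varphi^T - R\), and no auxiliary signal \(\eta\).

```python
import numpy as np

def recursive_identification(Phi, y, mu=1.0):
    """Sketch of the general recursive scheme for a linear regression
    model y(t) = phi(t)^T theta + e(t), with the Gauss-Newton choices
    h = phi * eps and H = phi phi^T - R (so R(t) tracks the average
    Hessian of the quadratic criterion)."""
    n = Phi.shape[1]
    theta = np.zeros(n)
    R = np.eye(n)                      # start positive definite so R is invertible
    for t in range(len(y)):
        alpha = mu / (t + 10)          # gain ~ mu/t (offset keeps the first steps tame)
        eps = y[t] - Phi[t] @ theta                       # prediction error
        R = R + alpha * (np.outer(Phi[t], Phi[t]) - R)    # R-update with H(...)
        theta = theta + alpha * np.linalg.solve(R, Phi[t] * eps)
    return theta

# quick check on simulated data
rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])
Phi = rng.normal(size=(5000, 2))
y = Phi @ theta_true + 0.1 * rng.normal(size=5000)
theta_hat = recursive_identification(Phi, y)
```

With \(\alpha(t)\sim\mu/t\) and \(R(t)\to \mathbb{E}\,\varphi\varphi^T\), this behaves essentially like recursive least squares; the offset in the gain is only a numerical convenience for the early iterations.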

Smoothness conditions on \(h\) and \(H\) (for \(\theta\) in the model set \(D_{\mathcal{M}}\)):

\[\textbf{Cr1}: \ h(t, \theta, \epsilon, \eta) \text{ is differentiable w.r.t. } \theta, \epsilon, \text{ and } \eta, \text{ such that, for some } C < \infty,\\ |h(t, \theta, \epsilon, \eta)| + |\nabla_{\theta} h(t, \theta, \epsilon, \eta)| \leq C(1 + |\epsilon|^2 + |\eta|^2)\\ \text{and}\\ |\nabla_{\epsilon} h(t, \theta, \epsilon, \eta)| + |\nabla_{\eta} h(t, \theta, \epsilon, \eta)| \leq C(1 + |\epsilon| + |\eta|) \text{ for } \theta \in D_{\mathcal{M}}.\] \[\textbf{Cr2}: \ H(t, R, \theta, \epsilon, \eta) \text{ is differentiable w.r.t. } R, \theta, \epsilon, \text{ and } \eta \text{ such that, for some } C < \infty, \\ |H(t, R, \theta, \epsilon, \eta)| + |\nabla_{R} H(t, R, \theta, \epsilon, \eta)| \leq C(1 + |\epsilon|^2 + |\eta|^2 + |R|), \\ |\nabla_{\theta} H(t, R, \theta, \epsilon, \eta)| \leq C(1 + |\epsilon|^2 + |\eta|^2), \\ |\nabla_{\epsilon} H(t, R, \theta, \epsilon, \eta)| + |\nabla_{\eta} H(t, R, \theta, \epsilon, \eta)| \leq C(1 + |\epsilon| + |\eta|), \text{ for } \theta \in D_{\mathcal{M}}.\\ \text{Here } \nabla_{\theta} \text{ denotes the partial derivative w.r.t. } \theta, \text{ etc.}\] \[\textbf{Cr3}: \ \text{The function } l(t, \theta, \epsilon) \text{ is twice continuously differentiable w.r.t. } \theta \text{ and } \epsilon, \text{ and} \\ |\nabla_{\theta} l(t, \theta, \epsilon)| + |\nabla_{\theta\theta} l(t, \theta, \epsilon)| \leq C(1 + |\epsilon|)^2, \\ |\nabla_{\epsilon} l(t, \theta, \epsilon)| + |\nabla_{\epsilon\theta} l(t, \theta, \epsilon)| \leq C(1 + |\epsilon|), \\ |\nabla_{\epsilon\epsilon} l(t, \theta, \epsilon)| \leq C \text{ for } \theta \in D_{\mathcal{M}}.\] \[\textbf{Cr4}: \ \text{The function } l(t, \theta, \epsilon) \text{ is three times continuously differentiable w.r.t. } \theta \text{ and } \epsilon, \text{ and} \\ |\nabla_{\epsilon\epsilon} l(t, \theta, \epsilon)| + |\nabla_{\epsilon\epsilon\epsilon} l(t, \theta, \epsilon)| + |\nabla_{\epsilon\epsilon\theta} l(t, \theta, \epsilon)| \leq C, \\ |\nabla_{\theta\theta} l(t, \theta, \epsilon)| + |\nabla_{\theta\theta\theta} l(t, \theta, \epsilon)| \leq C(1 + |\epsilon|)^2, \\ |\nabla_{\theta\theta\epsilon} l(t, \theta, \epsilon)| \leq C \text{ for } \theta \in D_{\mathcal{M}}.\]

Additional Condition on the Matrix \(R(t)\)

R1: The matrix \(R(t)\) generated by the recursive update should remain symmetric and positive semi-definite.
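Condition R1 holds automatically for the common choice \(H = \psi\psi^T - R\) with gains in \((0,1]\), since each update is then a convex combination of PSD matrices. A quick numerical check (the specific \(H\) and gain here are illustrative assumptions, not the only admissible ones):

```python
import numpy as np

rng = np.random.default_rng(1)
R = np.eye(3)                           # symmetric, positive definite start
for t in range(200):
    alpha = 1.0 / (t + 2)               # gain in (0, 1)
    psi = rng.normal(size=3)
    # R(t) = (1 - alpha) R(t-1) + alpha psi psi^T: a convex combination
    # of a PSD matrix and a rank-one PSD term, hence symmetric PSD
    R = (1 - alpha) * R + alpha * np.outer(psi, psi)

symmetry_error = np.abs(R - R.T).max()
smallest_eigenvalue = np.linalg.eigvalsh(R).min()
```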

Gain Sequence Condition \(\alpha(t)\)

The gain sequence \(\alpha(t)\) should asymptotically behave like \(\mu/t\) for some \(\mu > 0\):

\[\lim_{t\rightarrow \infty} t\cdot \alpha(t) = \mu > 0.\]

This condition ensures that \(R(t)\) remains well-behaved and invertible, which is necessary for the algorithm’s stability.
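As a tiny illustration, any gain of the form \(\mu/(t+c)\) satisfies the condition (the constant \(c\) is an arbitrary choice here), since \(t \cdot \alpha(t) \to \mu\):

```python
mu = 2.0

def alpha(t, c=100):
    # gain sequence alpha(t) = mu / (t + c), asymptotically mu / t
    return mu / (t + c)

# t * alpha(t) approaches mu as t grows
ratios = [t * alpha(t) for t in (10**2, 10**4, 10**6)]
```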

Conditions on the Data \(z(t)\)

\[\begin{align} \text{(a)} & \quad \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} h(t, \theta, \epsilon(t, \theta), \eta(t, \theta)) \triangleq f(\theta). \\ \text{(b)} & \quad \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} H(t, R, \theta, \epsilon(t, \theta), \eta(t, \theta)) \triangleq F(R, \theta). \\ \text{(c)} & \quad \limsup_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} \left[ 1 + |z(t)|^3 \right] < \infty. \end{align}\]

A1(c): This ensures that the data do not contain extremely large values that could disrupt the convergence of the algorithm; it is a safeguard against outliers or unbounded data. Introduce \(h_t \triangleq h(t, \theta, \epsilon(t, \theta), \eta(t, \theta))\); then A1(a) will hold w.p.1 if the following two conditions are satisfied:

\[\frac{1}{N} \sum_{t=1}^N (h_t - \mathbb{E} h_t) \rightarrow 0 \text{ w.p.1 as } N \rightarrow \infty\]

and

\[\frac{1}{N} \sum_{t=1}^N \mathbb{E} h_t \rightarrow f(\theta) \text{ as } N \rightarrow \infty\]

When \(h_t\) is sampled independently, the first condition is the strong law of large numbers from probability theory. Although in our application the sequence is not independent, we can still establish it using a result of Cramér and Leadbetter (1967), with trivial modifications from the continuous-time case to the discrete-time case.

Let \(\{x(t)\}\) be a sequence of random variables, each of zero mean, and suppose that

\[\left| \mathbb{E} x(t)x(s) \right| \leq C \cdot \frac{t^p + s^p}{1 + |t - s|^q}, \quad 0 \leq 2 \cdot p < q < 1.\]

Then

\[\frac{1}{N} \sum_{t=1}^{N} x(t) \to 0 \quad \text{w.p.1 as} \quad N \to \infty.\]

The second condition (mean convergence) holds under assumption A2:

\[\mathbb{E} h(t, \theta, \epsilon(t, \theta), \eta(t, \theta)) = f(\theta),\\ \mathbb{E} H(t, R, \theta, \epsilon(t, \theta), \eta(t, \theta)) = F(R, \theta).\]
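The averaging lemma above can be sanity-checked numerically. An AR(1) sequence (my illustrative choice) has geometrically decaying covariances, which certainly satisfy the polynomial bound in the lemma, and its running mean indeed tends to zero:

```python
import numpy as np

# x(t) = a x(t-1) + e(t): zero mean, covariance decaying like a^|t-s|,
# far stronger than the 1/(1 + |t-s|^q) bound required by the lemma
rng = np.random.default_rng(2)
a, N = 0.9, 200_000
e = rng.normal(size=N)
x = np.empty(N)
x[0] = e[0]
for t in range(1, N):
    x[t] = a * x[t - 1] + e[t]

running_mean = np.cumsum(x) / np.arange(1, N + 1)
```

The final running mean is small even though consecutive samples are strongly correlated.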

As before,

\[\bar{\mathbb{E}}f(t)\triangleq \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} \mathbb{E} f(t),\]

where the expectation is over the stochastic process \(\{z(t)\}\), and the notation implies that the limit exists.

To ensure that \(h_t\) and \(h_s\) are asymptotically independent for large \(|t - s|\), we introduce condition S2:

\[\textbf{S2}: \ \text{For each } t, s \text{ with } t \geq s, \text{ there exists a random vector } z_s^0(t)\\ \text{that belongs to the } \sigma\text{-algebra generated by } z^t\\ \text{but is independent of } z^s \text{ (for } s = t, \text{ take } z_s^0(t) = 0\text{), such that}\\ \mathbb{E} \left| z(t) - z_s^0(t) \right|^4 \leq C \cdot \lambda^{t-s}, \quad C < \infty, \quad \lambda < 1.\]
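For concreteness, this condition can be verified by hand for AR(1) data \(z(t) = a\,z(t-1) + e(t)\) (an illustrative assumption): take \(z_s^0(t) = \sum_{k=0}^{t-s-1} a^k e(t-k)\), which uses only the noise after time \(s\) and hence is independent of \(z^s\); the remainder is \(z(t) - z_s^0(t) = a^{t-s} z(s)\), whose fourth moment decays geometrically with \(\lambda = a^4\). A numerical check:

```python
import numpy as np

rng = np.random.default_rng(3)
a, reps, t, s = 0.8, 50_000, 30, 20

# simulate many independent AR(1) paths z(t) = a z(t-1) + e(t), z(0) = 0
e = rng.normal(size=(reps, t + 1))
z = np.zeros((reps, t + 1))
for k in range(1, t + 1):
    z[:, k] = a * z[:, k - 1] + e[:, k]

# z_s^0(t): the part of z(t) built only from noise after time s
z_s0 = sum(a**k * e[:, t - k] for k in range(t - s))
diff = z[:, t] - z_s0                   # equals a^(t-s) * z(s)
fourth_moment = np.mean(diff**4)        # should be <= C * (a^4)^(t-s)
decay = a ** (4 * (t - s))
```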

When Cr1, Cr2, A2, and S2 hold, we can prove that A1 holds w.p.1; see Appendix 4.A of the book for details.

Questions I have and answers I guess:

In mathematics, particularly in topology and analysis, a compact set (in \(\mathbb{R}^n\)) is a set that satisfies two key properties: it is closed and bounded (the Heine–Borel theorem).

Reasons: