Here we focus on minimizing the general criterion \(\bar{\mathbb{E}}\, l(t,\theta,\varepsilon(t,\theta))\) using a general search direction,

\[\begin{align} \begin{aligned} &\varepsilon(t) = y(t)-\hat{y}(t) \\ &R(t) = R(t-1) + \alpha(t) H(t,R(t-1),\hat{\theta}(t-1),\varepsilon(t),\eta(t))\\ &\hat{\theta}(t) = \hat{\theta}(t-1) + \alpha(t) R^{-1}(t) h(t,\hat{\theta}(t-1), \varepsilon(t), \eta(t))\\ &\xi (t+1)= A(\hat{\theta}(t))\xi(t) + B(\hat{\theta}(t))z(t)\\ &\begin{pmatrix} \hat{y}(t+1) \\ \text{col } \eta(t+1) \end{pmatrix} = C(\hat{\theta}(t)) \xi(t+1). \end{aligned} \label{eq:general_form} \end{align}\]
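To make the abstract recursion concrete, here is a minimal numerical sketch. It assumes the simplest special case (my choice for illustration, not prescribed by the text): a linear regression model \(\hat{y}(t) = \varphi(t)^T\theta\) with the quadratic criterion, so \(h = \varphi\varepsilon\), \(H = \varphi\varphi^T - R\), and no auxiliary signal \(\eta\).

```python
import numpy as np

def recursive_identification(Phi, y, mu=1.0):
    """Sketch of the general recursive scheme for a linear regression
    model y(t) = phi(t)^T theta + e(t), with the Gauss-Newton choices
    h = phi * eps and H = phi phi^T - R (so R(t) tracks the average
    Hessian of the quadratic criterion)."""
    n = Phi.shape[1]
    theta = np.zeros(n)
    R = np.eye(n)                      # start positive definite so R is invertible
    for t in range(len(y)):
        alpha = mu / (t + 10)          # gain ~ mu/t (offset keeps the first steps tame)
        eps = y[t] - Phi[t] @ theta                       # prediction error
        R = R + alpha * (np.outer(Phi[t], Phi[t]) - R)    # R-update with H(...)
        theta = theta + alpha * np.linalg.solve(R, Phi[t] * eps)
    return theta

# quick check on simulated data
rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])
Phi = rng.normal(size=(5000, 2))
y = Phi @ theta_true + 0.1 * rng.normal(size=5000)
theta_hat = recursive_identification(Phi, y)
```

With \(\alpha(t)\sim\mu/t\) and \(R(t)\to \mathbb{E}\,\varphi\varphi^T\), this behaves essentially like recursive least squares; the offset in the gain is only a numerical convenience for the early iterations.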

Smoothness conditions on \(h\) and \(H\) (for \(\theta\) in the model set \(D_{\mathcal{M}}\)):

\[\textbf{Cr1}: \ h(t, \theta, \epsilon, \eta) \text{ is differentiable w.r.t. } \theta, \epsilon, \text{ and } \eta, \text{ such that, for some } C < \infty,\\ |h(t, \theta, \epsilon, \eta)| + |\nabla_{\theta} h(t, \theta, \epsilon, \eta)| \leq C(1 + |\epsilon|^2 + |\eta|^2)\\ \text{and}\\ |\nabla_{\epsilon} h(t, \theta, \epsilon, \eta)| + |\nabla_{\eta} h(t, \theta, \epsilon, \eta)| \leq C(1 + |\epsilon| + |\eta|) \text{ for } \theta \in D_{\mathcal{M}}.\] \[\textbf{Cr2}: \ H(t, R, \theta, \epsilon, \eta) \text{ is differentiable w.r.t. } R, \theta, \epsilon, \text{ and } \eta \text{ such that, for some } C < \infty, \\ |H(t, R, \theta, \epsilon, \eta)| + |\nabla_{R} H(t, R, \theta, \epsilon, \eta)| \leq C(1 + |\epsilon|^2 + |\eta|^2 + |R|), \\ |\nabla_{\theta} H(t, R, \theta, \epsilon, \eta)| \leq C(1 + |\epsilon|^2 + |\eta|^2), \\ |\nabla_{\epsilon} H(t, R, \theta, \epsilon, \eta)| + |\nabla_{\eta} H(t, R, \theta, \epsilon, \eta)| \leq C(1 + |\epsilon| + |\eta|), \text{ for } \theta \in D_{\mathcal{M}}.\\ \text{Here } \nabla_{\theta} \text{ denotes the partial derivative w.r.t. } \theta, \text{ etc.}\] \[\textbf{Cr3}: \ \text{The function } l(t, \theta, \epsilon) \text{ is twice continuously differentiable w.r.t. } \theta \text{ and } \epsilon, \text{ and} \\ |\nabla_{\theta} l(t, \theta, \epsilon)| + |\nabla_{\theta\theta} l(t, \theta, \epsilon)| \leq C(1 + |\epsilon|)^2, \\ |\nabla_{\epsilon} l(t, \theta, \epsilon)| + |\nabla_{\epsilon\theta} l(t, \theta, \epsilon)| \leq C(1 + |\epsilon|), \\ |\nabla_{\epsilon\epsilon} l(t, \theta, \epsilon)| \leq C \text{ for } \theta \in D_{\mathcal{M}}.\] \[\textbf{Cr4}: \ \text{The function } l(t, \theta, \epsilon) \text{ is three times continuously differentiable w.r.t. } \theta \text{ and } \epsilon, \text{ and} \\ |\nabla_{\epsilon\epsilon} l(t, \theta, \epsilon)| + |\nabla_{\epsilon\epsilon\epsilon} l(t, \theta, \epsilon)| + |\nabla_{\epsilon\epsilon\theta} l(t, \theta, \epsilon)| \leq C, \\ |\nabla_{\theta\theta} l(t, \theta, \epsilon)| + |\nabla_{\theta\theta\theta} l(t, \theta, \epsilon)| \leq C(1 + |\epsilon|)^2, \\ |\nabla_{\theta\theta\epsilon} l(t, \theta, \epsilon)| \leq C \text{ for } \theta \in D_{\mathcal{M}}.\]

Additional Condition on the Matrix \(R(t)\)

R1: The matrix \(R(t)\) generated by the recursive update should remain symmetric and positive semi-definite.
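Condition R1 holds automatically for the common choice \(H = \psi\psi^T - R\) with gains in \((0,1]\), since each update is then a convex combination of PSD matrices. A quick numerical check (the specific \(H\) and gain here are illustrative assumptions, not the only admissible ones):

```python
import numpy as np

rng = np.random.default_rng(1)
R = np.eye(3)                           # symmetric, positive definite start
for t in range(200):
    alpha = 1.0 / (t + 2)               # gain in (0, 1)
    psi = rng.normal(size=3)
    # R(t) = (1 - alpha) R(t-1) + alpha psi psi^T: a convex combination
    # of a PSD matrix and a rank-one PSD term, hence symmetric PSD
    R = (1 - alpha) * R + alpha * np.outer(psi, psi)

symmetry_error = np.abs(R - R.T).max()
smallest_eigenvalue = np.linalg.eigvalsh(R).min()
```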

Gain Sequence Condition \(\alpha(t)\)

The gain sequence \(\alpha(t)\) should asymptotically behave like \(\mu/t\) for some \(\mu > 0\):

\[\lim_{t\rightarrow \infty} t\cdot \alpha(t) = \mu > 0.\]

This condition ensures that \(R(t)\) remains well-behaved and invertible, which is necessary for the algorithm’s stability.
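As a tiny illustration, any gain of the form \(\mu/(t+c)\) satisfies the condition (the constant \(c\) is an arbitrary choice here), since \(t \cdot \alpha(t) \to \mu\):

```python
mu = 2.0

def alpha(t, c=100):
    # gain sequence alpha(t) = mu / (t + c), asymptotically mu / t
    return mu / (t + c)

# t * alpha(t) approaches mu as t grows
ratios = [t * alpha(t) for t in (10**2, 10**4, 10**6)]
```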

Conditions on the Data \(z(t)\)

\[\begin{align} \text{(a)} & \quad \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} h(t, \theta, \epsilon(t, \theta), \eta(t, \theta)) \triangleq f(\theta). \\ \text{(b)} & \quad \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} H(t, R, \theta, \epsilon(t, \theta), \eta(t, \theta)) \triangleq F(R, \theta). \\ \text{(c)} & \quad \limsup_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} \left[ 1 + |z(t)|^3 \right] < \infty. \end{align}\]

A1(c): This ensures that the data do not contain extremely large values that could disrupt the convergence of the algorithm; it is a safeguard against outliers or unbounded data. Introduce \(h_t \triangleq h(t, \theta, \epsilon(t, \theta), \eta(t, \theta))\); then A1(a) will hold w.p.1 if the following two conditions are satisfied:

\[\frac{1}{N} \sum_{t=1}^N (h_t - \mathbb{E} h_t) \rightarrow 0 \text{ w.p.1 as } N \rightarrow \infty\]

and

\[\frac{1}{N} \sum_{t=1}^N \mathbb{E} h_t \rightarrow f(\theta) \text{ as } N \rightarrow \infty\]

When \(h_t\) is sampled independently, the first condition is the strong law of large numbers from probability theory. Although in our application the sequence is not independent, we can still establish it using a result of Cramér and Leadbetter (1967), with trivial modifications from the continuous-time case to the discrete-time case.

Let \(\{x(t)\}\) be a sequence of random variables, each of zero mean, and suppose that

\[\left| \mathbb{E} x(t)x(s) \right| \leq C \cdot \frac{t^p + s^p}{1 + |t - s|^q}, \quad 0 \leq 2 \cdot p < q < 1.\]

Then

\[\frac{1}{N} \sum_{t=1}^{N} x(t) \to 0 \quad \text{w.p.1 as} \quad N \to \infty.\]

The second condition (mean convergence) holds under assumption A2:

\[\mathbb{E} h(t, \theta, \epsilon(t, \theta), \eta(t, \theta)) = f(\theta),\\ \mathbb{E} H(t, R, \theta, \epsilon(t, \theta), \eta(t, \theta)) = F(R, \theta).\]
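The averaging lemma above can be sanity-checked numerically. An AR(1) sequence (my illustrative choice) has geometrically decaying covariances, which certainly satisfy the polynomial bound in the lemma, and its running mean indeed tends to zero:

```python
import numpy as np

# x(t) = a x(t-1) + e(t): zero mean, covariance decaying like a^|t-s|,
# far stronger than the 1/(1 + |t-s|^q) bound required by the lemma
rng = np.random.default_rng(2)
a, N = 0.9, 200_000
e = rng.normal(size=N)
x = np.empty(N)
x[0] = e[0]
for t in range(1, N):
    x[t] = a * x[t - 1] + e[t]

running_mean = np.cumsum(x) / np.arange(1, N + 1)
```

The final running mean is small even though consecutive samples are strongly correlated.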

As before,

\[\bar{\mathbb{E}}f(t)\triangleq \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} \mathbb{E} f(t),\]

where the expectation is over the stochastic process \(\{z(t)\}\), and the notation implies that the limit exists.

To ensure that \(h_t\) and \(h_s\) are asymptotically independent for large \(|t - s|\), we introduce condition S2:

\[\textbf{S2}: \ \text{For each } t, s \text{ with } t \geq s, \text{ there exists a random vector } z_s^0(t)\\ \text{that belongs to the } \sigma\text{-algebra generated by } z^t\\ \text{but is independent of } z^s \text{ (for } s = t, \text{ take } z_s^0(t) = 0\text{), such that}\\ \mathbb{E} \left| z(t) - z_s^0(t) \right|^4 \leq C \cdot \lambda^{t-s}, \quad C < \infty, \quad \lambda < 1.\]
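For concreteness, this condition can be verified by hand for AR(1) data \(z(t) = a\,z(t-1) + e(t)\) (an illustrative assumption): take \(z_s^0(t) = \sum_{k=0}^{t-s-1} a^k e(t-k)\), which uses only the noise after time \(s\) and hence is independent of \(z^s\); the remainder is \(z(t) - z_s^0(t) = a^{t-s} z(s)\), whose fourth moment decays geometrically with \(\lambda = a^4\). A numerical check:

```python
import numpy as np

rng = np.random.default_rng(3)
a, reps, t, s = 0.8, 50_000, 30, 20

# simulate many independent AR(1) paths z(t) = a z(t-1) + e(t), z(0) = 0
e = rng.normal(size=(reps, t + 1))
z = np.zeros((reps, t + 1))
for k in range(1, t + 1):
    z[:, k] = a * z[:, k - 1] + e[:, k]

# z_s^0(t): the part of z(t) built only from noise after time s
z_s0 = sum(a**k * e[:, t - k] for k in range(t - s))
diff = z[:, t] - z_s0                   # equals a^(t-s) * z(s)
fourth_moment = np.mean(diff**4)        # should be <= C * (a^4)^(t-s)
decay = a ** (4 * (t - s))
```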

When Cr1, Cr2, A2, and S2 hold, we can prove that A1 holds w.p.1; see Appendix 4.A of the book for details.

Questions I have and answers I guess:

In mathematics, particularly in topology and analysis, a compact set (in \(\mathbb{R}^n\)) is a set that satisfies two key properties: it is closed and bounded (the Heine–Borel theorem).

Reasons: