12  Basic Parameter Estimators For ARMA Models

12.1 Method of moments estimation

Consider an \(\ARMA p q\) model \[ (\rX_t - \mu) - \vf_1 (\rX_{t-1} - \mu) - \dots - \vf_p (\rX_{t-p} - \mu) = \rve_t + \tht_1 \rve_{t-1} + \dots + \tht_q \rve_{t-q}, \] that is \[ \arPoly(\backs)(\rX_t - \mu) = \maPoly(\backs)\rve_t, \] where \(\set{\rve_t} \sim \WN{0}{\sigma^2}\). Our goal is to estimate the unknown parameters \(\vf_1, \dots, \vf_p, \tht_1, \dots, \tht_q, \mu, \sigma^2\); we assume that the orders \(p, q\) are known. Note that the mean \(\mu\) is easy to estimate from the data, \[ \estim{\rnd \mu} = \Avg \vrX. \] Similarly, the autocovariance function \(\acov\) is easy to estimate from the data, \[ \estim{\rnd \acov}(h) = \frac 1 {n - h - 1} \sum_{i = 1}^{n - h} (\rX_i - \Avg \vrX)(\rX_{i+h} - \Avg \vrX). \]
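
The following minimal sketch implements these two estimators in Python with numpy (the helper names `sample_mean` and `sample_acov` are ours, not from any library; the divisor \(n - h - 1\) matches the formula above, although many texts divide by \(n\) instead):

```python
import numpy as np

def sample_mean(x):
    """Estimate of mu: the sample mean of the observed series."""
    return np.mean(x)

def sample_acov(x, h):
    """Sample autocovariance at lag h >= 0, normalized by n - h - 1 as above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    return np.sum((x[:n - h] - xbar) * (x[h:] - xbar)) / (n - h - 1)
```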

Now we can use these estimates to obtain the remaining parameters of the model. With the method of moments, we choose the parameter values for which the theoretical moments equal the empirical moments.

12.1.1 Yule-Walker estimation for AR model

Consider an \(\ARmod p\) model with mean \(0\), \[ \arPoly(\backs) \rX_t = \rve_t, \quad \text{i.e.} \quad \rX_t - \vf_1 \rX_{t-1} - \dots - \vf_p \rX_{t-p} = \rve_t. \] By causality, \(\Expect{\rX_{t-h} \rve_t} = 0\) for \(h \geq 1\) and \(\Expect{\rX_t \rve_t} = \sigma^2\). Hence, for \(h = 1, \dots, p\), we get \[\begin{align*} \acov(h) & = \Expect{\rX_{t-h} \rX_t} \\ &= \Expect{\rX_{t-h}(\vf_1 \rX_{t-1} + \dots + \vf_p \rX_{t-p} + \rve_t)} \\ &= \acov(h-1)\vf_1 + \dots + \acov(h-p)\vf_p \end{align*}\] and for \(h = 0\) \[\begin{align*} \acov(0) & = \Expect{\rX^2_t} \\ &= \Expect{\rX_t(\vf_1 \rX_{t-1} + \dots + \vf_p \rX_{t-p} + \rve_t)} \\ &= \acov(1)\vf_1 + \dots + \acov(p)\vf_p + \sigma^2. \end{align*}\]
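
For instance, for \(p = 1\) the equations reduce to \[ \acov(1) = \acov(0)\vf_1, \qquad \acov(0) = \acov(1)\vf_1 + \sigma^2, \] so \(\vf_1 = \acf(1)\) and \(\sigma^2 = \acov(0)\brackets{1 - \vf_1^2}\); replacing the theoretical quantities by their estimators gives \(\estim{\rnd \vf}_1 = \estim{\rnd \acf}(1)\) and \(\estim{\rnd \sigma}^2 = \estim{\rnd \acov}(0)\brackets{1 - \estim{\rnd \vf}_1^2}\).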

The Yule-Walker equations can now be written in matrix form as \[ \covMat^{(p)} \vvf = \vi \acov^{(p)}, \quad \acov(0) - \scal{\vi \acov^{(p)}}{\vvf} = \sigma^2, \tag{12.1}\] where \[ \covMat^{(p)} = \brackets{\acov(i - j)}_{i,j = 1}^p, \quad \vi \acov^{(p)} = (\acov(1), \dots, \acov(p))\Tr, \quad \vvf = (\vf_1, \dots, \vf_p)\Tr. \] We have already seen the Yule-Walker equations (12.1) when calculating the autocorrelation function of \(\ARmod p\), see Section 8.1; there \(\vf_1, \dots, \vf_p\) were known and we solved for \(\acov(h)\). Now we solve for \(\vf_1, \dots, \vf_p\) and replace \(\acov(h)\) by the corresponding estimators \(\estim{\rnd \acov}(h)\). The Yule-Walker estimators \(\estim{\vrr \vf}\) of \(\vf_1, \dots, \vf_p\) thus solve the first equation of (12.1), \[ \estim{\rcovMat}^{(p)} \estim{\vrr \vf} = \estim{\vrr \acov}^{(p)}, \] and the estimator of \(\sigma^2\) is obtained from the second equation as \[ \estim{\rnd \sigma}^2 = \estim{\rnd \acov}(0) - \scal{\estim{\vrr \acov}^{(p)}}{\estim{\vrr \vf}}. \] If the matrix is invertible, we can compute \[ \estim{\vrr \vf} = \brackets{\estim{\rcovMat}^{(p)}}\Inv \estim{\vrr \acov}^{(p)}. \] If \(p\) is large or models of many different orders need to be estimated, this may get computationally demanding – instead, the Durbin-Levinson recursive algorithm is typically used.
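
A minimal numpy sketch of the Yule-Walker estimator, reusing the hypothetical `sample_acov` helper sketched above; for large \(p\), the explicit solve could be replaced by `scipy.linalg.solve_toeplitz`, which uses a Levinson recursion (cf. the Durbin-Levinson algorithm mentioned above):

```python
import numpy as np

def yule_walker(x, p):
    """Yule-Walker estimates of (phi_1, ..., phi_p) and sigma^2 for an AR(p)."""
    gamma = np.array([sample_acov(x, h) for h in range(p + 1)])
    # Toeplitz matrix Gamma^(p) with entries gamma(|i - j|)
    Gamma = gamma[np.abs(np.subtract.outer(np.arange(p), np.arange(p)))]
    g = gamma[1:]
    phi = np.linalg.solve(Gamma, g)           # first Yule-Walker equation
    sigma2 = gamma[0] - g @ phi               # second Yule-Walker equation
    return phi, sigma2
```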

Theorem 12.1 For a causal \(\ARmod p\) process, the Yule-Walker estimators satisfy \[ n^{1/2} (\estim{\vrr \vf} - \vi \vf) \inDistWhen{n \to \infty} \normalD{0, \sigma^2 \brackets{\covMat^{(p)}}\Inv} \] and \(\estim{\rnd \sigma}^2 \inProbWhen{n \to \infty} \sigma^2\).

Let \(\rX_t\) be an \(\ARmod p\) process and \(n\) be large. The distribution of \(n^{1/2} (\estim{\vrr \vf} - \vi \vf)\) is then approximately \(\normalD{0, \estim{\rnd \sigma}^2 \brackets{\estim{\rcovMat}^{(p)}}\Inv}\). Hence, with approximate probability \(1 - \alpha\), the interval \[ \brackets{ \estim{\rnd \vf}_j - c_{1 - \alpha/2}\, n^{-1/2} \estim{\rnd \sigma} \brackets{\brackets{\estim{\rcovMat}^{(p)}}\Inv_{jj}}^{1/2}, \ \estim{\rnd \vf}_j + c_{1 - \alpha/2}\, n^{-1/2} \estim{\rnd \sigma} \brackets{\brackets{\estim{\rcovMat}^{(p)}}\Inv_{jj}}^{1/2} } \] covers the true value of \(\vf_j\), where \(c_{1 - \alpha/2}\) is the \((1 - \alpha/2)\)-quantile of the standard normal distribution.
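
For illustration, these approximate confidence intervals could be computed as follows (a sketch reusing the hypothetical `yule_walker` and `sample_acov` helpers from above; the function name `yw_confints` is ours):

```python
import numpy as np
from scipy.stats import norm

def yw_confints(x, p, alpha=0.05):
    """Approximate (1 - alpha) confidence intervals for phi_1, ..., phi_p."""
    n = len(x)
    phi, sigma2 = yule_walker(x, p)
    gamma = np.array([sample_acov(x, h) for h in range(p)])
    Gamma = gamma[np.abs(np.subtract.outer(np.arange(p), np.arange(p)))]
    Ginv = np.linalg.inv(Gamma)
    c = norm.ppf(1 - alpha / 2)                     # (1 - alpha/2)-quantile of N(0, 1)
    half = c * np.sqrt(sigma2 * np.diag(Ginv) / n)  # half-widths of the intervals
    return np.column_stack((phi - half, phi + half))
```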

12.1.2 Method of moments estimation for general ARMA models

We can again compute the theoretical ACF; the resulting expressions depend on the parameters, and we equate them to the empirical ACF. This requires solving nonlinear equations. For example, consider \(\MAmod 1\):

  • we have already seen that \[ \acf(1) = \frac {\tht_1} {1 + \tht_1^2}; \]
  • then we can estimate \(\tht_1\) by solving \[ \estim{\rnd \acf}(1) = \frac {\estim{\rnd \tht}_1} {1 + \estim{\rnd \tht}_1^2}; \]
  • and we will use the invertible solution, if it exists (a real solution exists only when \(|\estim{\rnd \acf}(1)| \leq 1/2\)).

For higher-order MA or ARMA models, this quickly becomes complicated; moreover, the resulting moment estimator is inefficient, as we will see later.
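
A sketch of this \(\MAmod 1\) moment estimator in Python (the function name `ma1_moment_estimate` is ours):

```python
import numpy as np

def ma1_moment_estimate(rho1):
    """Invertible MA(1) moment estimate from the lag-1 sample autocorrelation.

    Solves rho1 = theta / (1 + theta^2); a real root exists only when
    |rho1| <= 1/2, and the invertible root satisfies |theta| <= 1.
    """
    if abs(rho1) > 0.5:
        raise ValueError("no real solution: |rho(1)| must be at most 1/2")
    if rho1 == 0.0:
        return 0.0
    # rho1 * theta^2 - theta + rho1 = 0  =>  theta = (1 - sqrt(1 - 4 rho1^2)) / (2 rho1)
    return (1.0 - np.sqrt(1.0 - 4.0 * rho1 ** 2)) / (2.0 * rho1)
```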

12.2 Conditional least squares

Consider again the \(\ARmod p\) model \[ \rX_t - \mu = \vf_1(\rX_{t-1} - \mu) + \dots + \vf_p (\rX_{t-p} - \mu) + \rve_t. \]

By subtracting \(\estim{\rnd \mu} = \Avg \vrX\), we can ignore the constant level for now. Then \(\rX_t - \estim{\rnd \mu}\), \(t = p+1, \dots, n\) satisfy a linear regression model:

  • response is \(\rX_t - \estim{\rnd \mu}\);
  • covariates are \(\rX_{t-1} - \estim{\rnd \mu}, \dots, \rX_{t-p} - \estim{\rnd \mu}\);
  • coefficients are \(\vf_1, \dots, \vf_p\);
  • uncorrelated errors \(\rve_t\) with mean zero and variance \(\sigma^2\) are assumed.

We then estimate the coefficients \(\vf_1, \dots, \vf_p\) by ordinary least squares, that is, by minimizing \[ S_0(\vi \vf) = \sum_{t = p+1}^n \brackets{ (\rX_t - \estim{\rnd \mu}) - \vf_1 (\rX_{t-1} - \estim{\rnd \mu}) - \dots - \vf_p(\rX_{t-p} - \estim{\rnd \mu}) }^2. \] This is called conditional least squares because the first \(p\) values are taken as fixed (their randomness is ignored). More specifically, in matrix notation we get

  • response vector (length \(n - p\)) \[ \vr Y = (\rX_{p+1} - \estim{\rnd \mu}, \dots, \rX_n - \estim{\rnd \mu})\Tr, \]
  • covariate (predictor) matrix (of dimension \((n - p) \times p\)) \[ \vrr \Xi = (\rnd \Xi_{ij}), \quad \rnd \Xi_{ij} = \rX_{p+i-j} - \estim{\rnd \mu}, \]
  • coefficient vector (length \(p\)) \[ \vi \vf = (\vf_1, \dots, \vf_p)\Tr, \]
  • errors (length \(n - p\)) \[ \vrr e = (\rve_{p+1}, \dots, \rve_n)\Tr. \]

Then the regression model has the form \[ \vr Y = \vrr \Xi \vi \vf + \vrr e \] and the least-squares criterion reads \[ S_0(\vi \vf) = \norm{\vr Y - \vrr \Xi \vi \vf}^2. \]
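
A minimal sketch of this conditional least squares fit with numpy (the function name `ar_cls` is ours; the divisor \(n - p - 1\) for \(\estim{\rnd \sigma}^2\) anticipates Section 12.2.2):

```python
import numpy as np

def ar_cls(x, p):
    """Conditional least squares for a mean-centered AR(p)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    y = x[p:] - xbar                      # response vector, length n - p
    # predictor matrix Xi: column j holds X_{p+i-j} - mean, i = 1, ..., n - p
    Xi = np.column_stack([x[p - j:n - j] - xbar for j in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    resid = y - Xi @ phi
    sigma2 = resid @ resid / (n - p - 1)
    return phi, sigma2
```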

12.2.1 Normal equations and relation to Yule-Walker

Recall that the optimization problem \[ \min_{\vi \vf} \norm{\vr Y - \vrr \Xi \vi \vf}^2 \] is solved by the “normal equations” \[ \vrr \Xi\Tr \vrr \Xi \vi \vf = \vrr \Xi\Tr \vr Y. \]

For \(j \neq k\), the \((j,k)\) entry of \((n - p - 1)^{-1} \vrr \Xi\Tr \vrr \Xi\) is \[ \frac 1 {n - p - 1} \sum_{i = 1}^{n-p} (\rX_{p +i-j} - \Avg \vrX)(\rX_{p+i-k} - \Avg \vrX), \] which is similar to \(\estim{\rnd \acov}(j-k)\). Similarly, the diagonal entries resemble \(\estim {\rnd \acov}(0)\). Thus the OLS estimator is close to the Yule-Walker estimator (when \(p\) is small relative to \(n\)).
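
A quick numerical check of this claim on simulated data, reusing the hypothetical `yule_walker` and `ar_cls` sketches from above (the AR(2) coefficients are chosen arbitrarily but give a causal model):

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate a causal AR(2): X_t = 0.5 X_{t-1} - 0.3 X_{t-2} + eps_t
n = 1000
phi_true = np.array([0.5, -0.3])
x = np.zeros(n)
eps = rng.normal(size=n)
for t in range(2, n):
    x[t] = phi_true[0] * x[t - 1] + phi_true[1] * x[t - 2] + eps[t]

phi_yw, _ = yule_walker(x, 2)
phi_ols, _ = ar_cls(x, 2)
print(phi_yw)   # both estimates should be close to (0.5, -0.3)
print(phi_ols)
```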

12.2.2 Conditional least squares for an autoregression model with unknown mean

We can rewrite the model \[ \rX_t - \mu = \vf_1(\rX_{t-1} - \mu) + \dots + \vf_p (\rX_{t-p} - \mu) + \rve_t \] as \[ \rX_t = \alpha + \vf_1 \rX_{t-1} + \dots + \vf_p \rX_{t-p} + \rve_t, \] where \(\alpha = (1 - \vf_1 - \dots - \vf_p)\mu\). Now we can estimate both \(\vi \vf\) and \(\mu\) (or, equivalently, \(\alpha\)) by least squares. This gives us a linear model with intercept \(\alpha\), and we can estimate \(\vi \vf, \alpha\) by minimizing \[ S_0(\vi \vf, \alpha) = \sum_{t = p+1}^n \brackets{ \rX_t - \alpha - \vf_1 \rX_{t-1} - \dots - \vf_p \rX_{t-p} }^2, \] and then estimate \(\sigma^2\) by \(\estim{\rnd \sigma}^2 = (n - p - 1)\Inv S_0(\estim {\vrr \vf}, \estim{\rnd \alpha})\).

Consider now the least squares criterion for \(\ARmod 1\), \[ S_0(\vf_1, \mu) = \sum_{t = 2}^n \brackets{\rX_t - \mu - \vf_1(\rX_{t-1} - \mu)}^2. \] Setting the derivative with respect to \(\mu\) equal to zero gives \[ \mu = \frac 1 {(n-1)(1 - \vf_1)} \brackets{ \sum_{t = 2}^n \rX_t - \vf_1 \sum_{t = 2}^n \rX_{t-1} }. \] For large \(n\), we can use the approximation \[ \frac 1 {n-1} \sum_{t = 2}^n \rX_t \approx \frac 1 {n-1} \sum_{t = 2}^n \rX_{t-1} \approx \Avg \vrX, \] thus, regardless of \(\vf_1\), \(\estim{\rnd \mu} = \frac 1 {1 - \vf_1} (\Avg \vrX - \vf_1 \Avg \vrX) = \Avg \vrX\). Moreover, differentiating \(S_0(\vf_1, \Avg \vrX)\) with respect to \(\vf_1\) and solving gives \[ \estim{\rnd \vf}_1 = \frac {\sum_{t = 2}^n (\rX_t - \Avg \vrX)(\rX_{t - 1} - \Avg \vrX)} {\sum_{t = 2}^n (\rX_{t - 1} - \Avg \vrX)^2}. \] Except for the term \((\rX_n - \Avg \vrX)^2\) that is missing in the denominator, this is essentially \(\estim{\rnd \acf}(1)\), which is precisely the Yule-Walker estimator of \(\vf_1\) for \(\ARmod 1\). Hence, for large \(n\), the OLS and Yule-Walker estimators are almost identical.
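
A sketch of the intercept version in numpy (the name `ar_cls_intercept` is ours); the mean is recovered from the intercept via \(\estim{\rnd \mu} = \estim{\rnd \alpha} / (1 - \estim{\rnd \vf}_1 - \dots - \estim{\rnd \vf}_p)\):

```python
import numpy as np

def ar_cls_intercept(x, p):
    """Conditional least squares for an AR(p) with unknown mean."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    y = x[p:]
    # design matrix: intercept column plus lagged values X_{t-1}, ..., X_{t-p}
    Xi = np.column_stack([np.ones(n - p)] +
                         [x[p - j:n - j] for j in range(1, p + 1)])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    alpha, phi = beta[0], beta[1:]
    resid = y - Xi @ beta
    sigma2 = resid @ resid / (n - p - 1)   # divisor as suggested in the text
    mu = alpha / (1.0 - phi.sum())
    return phi, mu, sigma2
```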