13 Parameter Estimation in ARMA Models
13.1 Maximum likelihood estimation
Consider an \(\ARMA p q\) model \[ (\rX_t - \mu) - \vf_1 (\rX_{t-1} - \mu) - \dots - \vf_p (\rX_{t-p} - \mu) = \rve_t + \tht_1 \rve_{t-1} + \dots + \tht_q \rve_{t-q}, \] that is \(\ArPoly{\backs}(\rX_t - \mu) = \MaPoly{\backs}\rve_t\), where \(\set{\rve_t} \sim \WN 0 {\sigma^2}\). Now let us assume we observe data \(\rX_1, \dots, \rX_n\) and let our goal be to estimate unknown parameters \[ \vi \beta = (\overbrace{\vf_1, \dots, \vf_p}^{\vi \vf}, \overbrace{\tht_1, \dots, \tht_q}^{\vi \tht}, \mu, \sigma^2) \] given known orders \(p,q\).
Here, \(\vi \beta\) may also include other parameters of the deterministic trend \(\mu_t\), e.g. regression coefficients for external covariates (see the xreg argument in R).
Clearly, there are multiple possible approaches to the estimation:
- Method of moments: we can equate the theoretical moments to their empirical estimates and solve for the parameters (e.g. Yule-Walker);
- Least squares: we can minimize the sum of squared distances between the observed values and their model-based expectations;
- Maximum likelihood estimation.
We focus on maximum likelihood. Let the joint density of \(\rX_1, \dots, \rX_n\) be \(f(x_1, \dots, x_n; \vi \beta) = f(\vi x; \vi \beta)\); our goal is to find the value of \(\vi \beta\) for which the likelihood \(\Likely{\vi \beta} = f(\rX_1, \dots, \rX_n; \vi \beta)\) is maximal.
Maximum likelihood estimation brings several advantages: efficiency (low asymptotic variance), optimality properties, a clear justification of the results, and a unified framework. Moreover, the Gaussian assumption is often reasonable. On the other hand, it leads to a possibly difficult optimization problem (the solution is defined only implicitly and requires an iterative numerical procedure), and a good starting point is needed (often obtained from other estimators).
13.1.1 Gaussian likelihood function
Let us assume the process is now Gaussian and data \(\vr X = (\rX_1, \dots, \rX_n)\Tr\) are jointly Gaussian with mean \(\mu\) and covariance matrix \(\covMat\). The likelihood function is then \[ \Likely{\vi \beta} = \frac 1 {(2\pi)^{n/2} \Norm{\covMat}^{1/2}} \exp \brackets{- \frac{(\vr X - \mu)\Tr \covMat\Inv (\vr X - \mu)} 2}, \] which the MLE approach maximizes.
13.1.1.1 Maximum likelihood estimation for \(\ARmod 1\) process
Consider now a case where \(\rX_t\), \(t = 1, \dots, n\) follow an \(\ARmod 1\) model, i.e. \(\rX_t = \mu + \vf_1 (\rX_{t - 1} - \mu) + \rve_t\), and rewrite it again as a linear model \[ \rX_t = \alpha + \vf_1 \rX_{t-1} + \rve_t, \] where \(\alpha = \mu(1 - \vf_1)\). The covariance matrix of \((\rX_1, \dots, \rX_n)\Tr\) then has entries \[ \CovMat_{ij} = \frac {\sigma^2}{1 - \vf_1^2} \vf_1^{\absval{i - j}}. \] Note that evaluating the likelihood function \[ \frac 1 {(2\pi)^{n/2} \Norm{\covMat}^{1/2}} \exp \brackets{- \frac{(\vr X - \mu)\Tr \covMat\Inv (\vr X - \mu)} 2} \] requires the inverse and the determinant of this \(n \times n\) matrix, which is a difficult computation.
Notice the difference from situations with independent data!
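To make this concrete, the following sketch (illustrative R code with assumed parameter values, not part of the original derivation) simulates an \(\ARmod 1\) series, builds the full covariance matrix from the formula above and evaluates the Gaussian log-likelihood directly, using a Cholesky factorization for the determinant and the quadratic form.

```r
## Minimal sketch: direct evaluation of the exact Gaussian log-likelihood of an
## AR(1) sample via the full covariance matrix. Parameter values are illustrative.
set.seed(1)
n     <- 200
phi1  <- 0.6
sigma <- 1
mu    <- 2
x <- mu + arima.sim(model = list(ar = phi1), n = n, sd = sigma)

loglik_ar1_direct <- function(mu, phi1, sigma, x) {
  n     <- length(x)
  Sigma <- sigma^2 / (1 - phi1^2) * phi1^abs(outer(1:n, 1:n, "-"))  # Sigma_ij
  e     <- as.numeric(x) - mu
  R     <- chol(Sigma)                     # Sigma = R'R, R upper triangular
  # log-determinant via diag(R); quadratic form via a triangular solve
  -n / 2 * log(2 * pi) - sum(log(diag(R))) -
    sum(backsolve(R, e, transpose = TRUE)^2) / 2
}

loglik_ar1_direct(mu, phi1, sigma, x)
```

This brute-force evaluation costs \(O(n^3)\) operations, which is exactly why the conditioning and innovations arguments below are useful.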
Even for fixed parameter values, the likelihood is difficult to evaluate directly, hence we use conditioning \[\begin{align*} f(x_1, \dots, x_n; \vi \beta) &= f(x_n \condp x_{n-1}, \dots, x_1; \vi \beta) f(x_{n-1}, \dots, x_1; \vi \beta) \\ &= \brackets{ \prod_{i = 2}^n f(x_i\condp x_{i-1}, \dots, x_1; \vi \beta) } f(x_1; \vi \beta). \end{align*}\] For Gaussian data, all conditional distributions are still Gaussian and for \(i \geq 2\), the conditional expectation and variance are \[ \Expect{\rX_i \condp \rX_{i-1}, \dots, \rX_1} = \alpha + \vf_1 \rX_{i-1}, \quad \Variance{\rX_i \condp \rX_{i-1}, \dots, \rX_1}= \sigma^2, \] hence \(f(x_i\condp x_{i-1}, \dots, x_1; \vi \beta) \sim \normalD{\alpha + \vf_1 x_{i-1}, \sigma^2}\). Moreover, the (unconditional) distribution of \(\rX_1\) is also Gaussian with mean \(\mu = \alpha / (1 - \vf_1)\) and variance (by causality) \[ \Expect{\rX_1 - \mu}^2 = \Expect{\sum_{j = 0}^\infty \vf_1^j \rve_{1 - j}}^2 = \frac {\sigma^2}{1 - \vf_1^2}. \] Hence \(f(x_1; \vi \beta) \sim \normalD{\alpha / (1 - \vf_1), \sigma^2 / (1 - \vf_1^2)}\) and, put together, the likelihood is \[\begin{align*} \Likely{\vi \beta} =& \brackets{ \prod_{i = 2}^n \frac 1 {(2\pi)^{1/2} \sigma} \exp \brackets{- \frac {(\rX_i - \alpha - \vf_1 \rX_{i-1})^2} {2\sigma^2}} } \\ &\times \frac 1 {(2\pi)^{1/2} \sigma / (1 - \vf_1^2)^{1/2}} \exp \brackets{- \frac {(\rX_1 - \alpha / (1 - \vf_1))^2} {2\sigma^2 / (1 - \vf_1^2)}}. \end{align*}\] As one can see, it is now easy to compute the likelihood for a given set of parameter values. To solve for the optimal \(\vi \beta\), we work as usual with the log-likelihood (ignoring terms independent of the parameters), \[\begin{align*} \Loglikely{\vi \beta} = &- \sum_{i = 2}^n \frac {(\rX_i - \alpha - \vf_1 \rX_{i - 1})^2} {2 \sigma^2} - \frac {(\rX_1 - \alpha / (1 - \vf_1))^2} {2\sigma^2 / (1 - \vf_1^2)} \\ &- n\log \sigma + \frac 1 2 \log(1 - \vf_1^2), \end{align*}\] which is clearly non-linear in the parameters and therefore has to be maximized numerically.
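A possible numerical maximization of this log-likelihood is sketched below (reusing the simulated series x from the previous snippet; the function name and starting values are illustrative assumptions). The model is parametrized by \((\alpha, \vf_1, \log \sigma)\) so that the optimization over \(\sigma\) is unconstrained.

```r
## Sketch: exact AR(1) MLE by numerical maximization of the log-likelihood above.
negloglik_ar1 <- function(par, x) {
  alpha <- par[1]; phi1 <- par[2]; sigma <- exp(par[3])
  if (abs(phi1) >= 1) return(1e10)           # crude penalty to enforce causality
  n  <- length(x)
  mu <- alpha / (1 - phi1)
  ll <- -sum((x[-1] - alpha - phi1 * x[-n])^2) / (2 * sigma^2) -
    (x[1] - mu)^2 * (1 - phi1^2) / (2 * sigma^2) -
    n * log(sigma) + log(1 - phi1^2) / 2
  -ll                                        # optim() minimizes
}

fit_ml_ar1 <- optim(c(mean(x), 0, log(sd(x))), negloglik_ar1, x = as.numeric(x))
c(alpha = fit_ml_ar1$par[1], phi1 = fit_ml_ar1$par[2],
  sigma = exp(fit_ml_ar1$par[3]))
```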
Notice that in the likelihood function \(\Likely{\vi \beta}\) all terms except the one for \(\rX_1\) have the same form; an alternative approach therefore treats \(\rX_1\) as fixed. From this, we get a conditional likelihood \[\begin{align*} \Likely{\vi \beta \condp \rX_1} &= f(\rX_2, \dots, \rX_n \condp \rX_1; \vi \beta) \\ &= \prod_{i = 2}^n f(\rX_i\condp \rX_1, \dots, \rX_{i - 1}; \vi \beta) \\ &= \prod_{i = 2}^n \frac 1 {(2\pi)^{1/2} \sigma} \exp \brackets{- \frac {(\rX_i - \alpha - \vf_1 \rX_{i-1})^2} {2\sigma^2}} \end{align*}\] and the conditional log-likelihood similarly becomes \[ \Loglikely{\vi \beta \condp \rX_1} = - \sum_{i = 2}^n \frac {(\rX_i - \alpha - \vf_1 \rX_{i-1})^2} {2\sigma^2} - (n-1) \log \sigma. \] Note that this is a conditional least squares problem (maximizing the conditional likelihood amounts to minimizing the conditional sum of squares) and that it can be solved explicitly (as opposed to the implicit formulation before).
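Indeed, maximizing the conditional log-likelihood over \((\alpha, \vf_1)\) is ordinary least squares of \(\rX_t\) on \(\rX_{t-1}\). A short sketch, again reusing the simulated series x purely for illustration:

```r
## Sketch: conditional (on X_1) maximum likelihood for AR(1) = simple linear
## regression of X_t on X_{t-1}; sigma^2 comes from the conditional sum of squares.
n       <- length(x)
fit_cls <- lm(x[-1] ~ x[-n])                        # intercept = alpha, slope = phi1
alpha_hat  <- coef(fit_cls)[1]
phi1_hat   <- coef(fit_cls)[2]
sigma2_hat <- sum(residuals(fit_cls)^2) / (n - 1)   # n - 1 terms in the sum
c(mu = unname(alpha_hat / (1 - phi1_hat)),
  phi1 = unname(phi1_hat), sigma2 = sigma2_hat)
```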
13.1.1.2 Maximum likelihood estimation for an autoregressive process
Now consider an \(\ARmod p\) process, i.e. \[ \rX_t = \alpha + \vf_1 \rX_{t - 1} + \dots + \vf_p \rX_{t - p} + \rve_t, \] for which the likelihood comes in the form \[\begin{align*} \Likely{\vi \beta} &= \prod_{i = p+1}^n f(\rX_i \condp \rX_1, \dots, \rX_{i - 1}; \vi \beta) f(\rX_1, \dots, \rX_p; \vi \beta) \\ &= \Likely{\vi \beta \condp \rX_1, \dots, \rX_p} f(\rX_1, \dots, \rX_p; \vi \beta) \end{align*}\] with conditional densities \[ f(\rX_i \condp \rX_1, \dots, \rX_{i - 1}; \vi \beta) \sim \normalD{\alpha + \vf_1 \rX_{i-1} + \dots + \vf_p \rX_{i - p}, \sigma^2}. \] The density \(f(\rX_1, \dots, \rX_p; \vi \beta)\) is Gaussian with a more complicated covariance matrix, while the conditional likelihood \(\Likely{\vi \beta \condp \rX_1, \dots, \rX_p}\) again leads to a least squares problem.
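The conditional least squares fit for a general \(\ARmod p\) can again be written as a regression, sketched below for an assumed AR(2) series (the series, the order and the use of embed() are illustrative choices, not prescribed by the lecture).

```r
## Sketch: conditional least squares for AR(p) as a regression on p lagged values.
p   <- 2
x2  <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 500)   # illustrative AR(2)
lagmat  <- embed(as.numeric(x2), p + 1)       # columns: X_t, X_{t-1}, ..., X_{t-p}
fit_arp <- lm(lagmat[, 1] ~ lagmat[, -1])
coef(fit_arp)                                 # alpha, phi_1, ..., phi_p
sum(residuals(fit_arp)^2) / nrow(lagmat)      # sigma^2 from the n - p conditional terms

## built-in conditional-sum-of-squares fit for comparison (parametrized via the mean)
arima(x2, order = c(p, 0, 0), method = "CSS")
```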
13.1.2 Evaluation of the likelihood function in the general case
In \(\ARMA p q\) models, the likelihood is much more complicated due to the MA terms: it requires a large non-diagonal matrix to be inverted and its determinant to be computed. Similarly to how conditioning on previous values was useful in AR models, in the general case the likelihood can also be written as a product of simpler terms (a product of densities corresponds to transforming the data into independent quantities).
Note that we have already seen a transformation of the series into uncorrelated variables: the innovations.
13.2 Recursive likelihood calculation using innovations
13.2.1 Innovations
Recall the innovations \(\rZ_i = \rX_i - \estim \rX_i\), where \(\estim \rX_i\) is the best linear prediction of \(\rX_i\) based on \(\rX_1, \dots, \rX_{i - 1}\) (which corresponds to the conditional expectation in the Gaussian case). The predictions \(\estim \rX_i\) can be expressed in terms of the previous observations \(\rX_1, \dots, \rX_{i - 1}\) or in terms of the previous innovations as \[ \estim \rX_i = \mu + \sum_{j = 1}^{i - 1} \vf_{i - 1, j} (\rX_{i-j} - \mu) = \mu + \sum_{j = 1}^{i - 1} \tht_{i - 1, j} (\rX_{i-j} - \estim \rX_{i - j}). \]
The innovations have the following properties:
- zero mean;
- variance (the prediction error) \(\Expect{\rX_i - \estim \rX_i}^2 = v_{i - 1} = r_{i - 1} \sigma^2\) for appropriate \(r_{i - 1} > 0\);
- most importantly, they are uncorrelated (even independent in the Gaussian case) because \(\rZ_j \in \linspace{\rX_1, \dots, \rX_j}\) and \(\rZ_k \perp \linspace{\rX_1, \dots, \rX_j}\) for \(j < k\).
The innovations \(\vr Z = (\rnd Z_1, \dots, \rnd Z_n)\Tr\) are obtained from the observations \(\vrX = (\rX_1, \dots, \rX_n)\Tr\) by the linear transformation \[ \vr Z = \AA (\vrX - \mu), \] where, using the representation of \(\estim \rX_i\) in terms of the previous observations, the \(n\times n\) matrix \(\AA\) is given as \[ \AA = \mtr{ 1 & 0 & 0 & \cdots & 0 & 0 \\ - \vf_{1, 1} & 1 & 0 & \cdots & 0 & 0 \\ - \vf_{2, 2} & -\vf_{2,1} & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ - \vf_{n-2, n-2} & -\vf_{n-2, n-3} & \cdots & -\vf_{n-2, 1} & 1 & 0 \\ - \vf_{n-1, n-1} & -\vf_{n-1, n-2} & \cdots & -\vf_{n-1, 2} & -\vf_{n-1, 1} & 1 \\ }. \] Clearly, the matrix \(\AA\) is regular, hence the correspondence between the data \(\rX_i\) and the innovations \(\rZ_i = \rX_i - \estim \rX_i\) is one-to-one. Thus they contain the same information and lead to the same likelihood. Moreover, the linear transformation preserves normality: if \(\rX_1, \dots, \rX_n\) are jointly Gaussian, then \(\rZ_1, \dots, \rZ_n\) are jointly Gaussian as well. All in all, \(\rZ_i\) is Gaussian with mean zero and variance \(\sigma^2 r_{i-1}\), i.e. \[ f(z_i; \vi \beta) = \frac 1 {(2\pi)^{1/2} \sigma r^{1/2}_{i-1}} \exp \brackets{- \frac {z_i^2}{2\sigma^2 r_{i-1}}}, \] and since the \(\rZ_i\)'s are uncorrelated, they are, by the properties of the Gaussian distribution, even independent. Hence, the joint density of \(\vr Z = (\rZ_1, \dots, \rZ_n)\Tr\) is the product of the marginal densities \[ \begin{aligned} f(\vi z; \vi \beta) &= f(z_1, \dots, z_n; \vi \beta) = \prod_{i = 1}^n f(z_i; \vi \beta) \\ &= \frac 1 {(2\pi)^{n/2} \sigma^n \prod_{i = 1}^n r_{i-1}^{1/2}} \exp \brackets{-\sum_{i = 1}^n \frac {z_i^2} {2 \sigma^2 r_{i-1}}}. \end{aligned} \] The likelihood then becomes \[\begin{align*} \Likely{\vi \beta} &= \frac 1 {(2\pi)^{n/2} \sigma^n \prod_{i = 1}^n r_{i-1}^{1/2}} \exp \brackets{-\sum_{i = 1}^n \frac {\rZ_i^2} {2 \sigma^2 r_{i-1}}} \\ &= \frac 1 {(2\pi)^{n/2} \sigma^n \prod_{i = 1}^n r_{i-1}^{1/2}} \exp \brackets{-\sum_{i = 1}^n \frac {(\rX_i - \estim \rX_i)^2} {2 \sigma^2 r_{i-1}}}, \end{align*}\] where \(\estim \rX_i\) and \(r_{i-1}\) depend on the covariance structure of the series, i.e. on the parameters of the model. For a given set of parameter values, the likelihood can thus be computed easily, provided we can obtain \(\estim \rX_i\) and \(r_{i-1}\). The innovations algorithm computes the one-step predictions \(\estim \rX_i\) and their errors \(v_{i-1} = \sigma^2 r_{i-1}\) recursively using simple linear operations.
This can be used for any Gaussian series, not only ARMA processes. Also, another recursive approach we might use is the Kálmán filter.
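The following sketch illustrates the innovations form of the likelihood for a Gaussian series, here an ARMA(1,1) with illustrative parameter values. Instead of the recursive innovations algorithm, it reads the prediction error variances and the innovations off a Cholesky-type factorization \(\covMat = \vi C \,\mathrm{diag}(v)\, \vi C\Tr\) with \(\vi C\) lower unitriangular; this is mathematically equivalent, but not the recursion used in practice.

```r
## Sketch: Gaussian log-likelihood in innovations form for a zero-mean series with
## autocovariances `acvf` (lags 0, 1, ..., n-1). Not the recursive algorithm itself.
innovations_loglik <- function(x, acvf) {
  n     <- length(x)
  Sigma <- toeplitz(acvf[1:n])            # Gamma_n built from the autocovariances
  L     <- t(chol(Sigma))                 # lower triangular, Sigma = L L'
  v     <- diag(L)^2                      # prediction error variances v_0, ..., v_{n-1}
  C     <- L %*% diag(1 / diag(L))        # unitriangular factor, Sigma = C diag(v) C'
  z     <- forwardsolve(C, x)             # innovations Z_i = X_i - X_i-hat
  -n / 2 * log(2 * pi) - sum(log(v)) / 2 - sum(z^2 / v) / 2
}

## illustrative ARMA(1,1): phi = 0.5, theta = 0.4, sigma^2 = 1
phi <- 0.5; theta <- 0.4
y <- arima.sim(model = list(ar = phi, ma = theta), n = 300)
gamma0 <- (1 + 2 * phi * theta + theta^2) / (1 - phi^2)   # gamma(0) of ARMA(1,1)
acvf   <- gamma0 * ARMAacf(ar = phi, ma = theta, lag.max = length(y) - 1)
innovations_loglik(as.numeric(y), acvf)
```

In R, arima() with method = "ML" evaluates this Gaussian likelihood through a state-space representation and the Kalman filter.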
13.2.2 Conditional least squares
For an \(\ARMA p q\) model, one can condition on the first \(p\) (or possibly more) values of the series and then minimize \(\sum_{i = p+1}^n \estim \rve_i^2(\vi \beta)\), where \[\begin{align*} \estim \rve_i(\vi \beta) =& \rX_i - \mu - \vf_1(\rX_{i-1} - \mu) - \dots - \vf_p(\rX_{i-p} - \mu) \\ &- \tht_1 \estim \rve_{i-1}(\vi \beta) - \dots - \tht_q \estim \rve_{i - q}(\vi \beta) \end{align*}\] for \(i > p\) and \(\estim \rve_i(\vi \beta) = 0\) for \(i \leq p\). It is worth comparing the exact likelihood with conditional least squares: the exact likelihood uses the innovations \(\rZ_i\) (errors of predictions based on the complete observed history), whereas conditional least squares uses the residuals \(\estim \rve_i\) (obtained recursively from the model formula with the initial values fixed).
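In R, both variants are available in arima(); a small comparison on the simulated ARMA(1,1) series y from the previous sketch (illustrative only):

```r
## Sketch: exact (ML) versus conditional-sum-of-squares (CSS) fits of ARMA(1,1).
fit_ml  <- arima(y, order = c(1, 0, 1), method = "ML")
fit_css <- arima(y, order = c(1, 0, 1), method = "CSS")
cbind(ML = coef(fit_ml), CSS = coef(fit_css))   # estimates are typically close
c(ML = fit_ml$sigma2, CSS = fit_css$sigma2)
```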
13.3 Computation
Although the conditional likelihood for AR models can be maximized analytically via least squares, other situations require numerical optimization methods. While a grid search may occasionally be useful, the general approach is the Newton-Raphson algorithm.
13.3.1 The Newton-Raphson algorithm
Our objective now shall be to maximize \(\Loglikely{\vi \beta}\). First, compute the gradient \[ \gradient \Loglikely{\vi \beta} = \brackets{\partialDeriv{\Loglikely{\vi \beta}} {\beta_1}, \dots, \partialDeriv{\Loglikely{\vi \beta}} {\beta_d}}\Tr \] and then solve \(\gradient \Loglikely{\vi \beta} = \vi 0\), for which we can use a Taylor expansion \[ \vi 0 = \gradient \Loglikely{\vi \beta} \approx \gradient \Loglikely{\vi \beta_{(0)}} - \hess \Loglikely{\vi \beta_{(0)}} \brackets{\vi \beta - \vi \beta_{(0)}}, \] where \(\hess \Loglikely{\vi \beta}\) denotes the negative of the Hessian matrix of \(\loglikely\). Solving the linearized equation yields the next iteration \[ \vi \beta_{(k+1)} = \vi \beta_{(k)} + \hess \Loglikely{\vi \beta_{(k)}}\Inv \gradient \Loglikely{\vi \beta_{(k)}}. \]
Here a good initial estimate is important for the convergence of the iterative process; typically, the conditional least squares estimate is used as the starting point. We then iterate until convergence under some criterion (a sketch of the iteration follows the list below), for example
- small relative change of the solution, e.g. \[ \frac {\norm{\vi \beta_{(k)} - \vi \beta_{(k-1)}}} {\norm{\vi \beta_{(k-1)}}} < \delta; \]
- small relative change in the value of the objective function, e.g. \[ \frac {\absval{\Loglikely{\vi \beta_{(k)}} - \Loglikely{\vi \beta_{(k-1)}}}} {\absval{\Loglikely{\vi \beta_{(k-1)}}}} < \delta; \]
- small norm of the gradient vector \[ \norm{\gradient \Loglikely{\vi \beta_{(k)}}} < \delta. \]
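A minimal sketch of such an iteration in R, assuming the numDeriv package for numerical derivatives and reusing negloglik_ar1, the series x and the conditional least squares estimates from the AR(1) sketches above (all of these are illustrative assumptions, not the course's code):

```r
## Sketch: Newton-Raphson for a (negative) log-likelihood with numerical gradient
## and Hessian; stops on a small relative change of the solution.
library(numDeriv)

newton_raphson <- function(start, nll, tol = 1e-8, maxit = 50) {
  beta <- start
  for (k in seq_len(maxit)) {
    g <- grad(nll, beta)              # gradient of the negative log-likelihood
    H <- hessian(nll, beta)           # = negative Hessian of the log-likelihood
    beta_new <- beta - solve(H, g)    # same step as beta + H_l^{-1} grad_l
    if (sqrt(sum((beta_new - beta)^2) / sum(beta^2)) < tol) return(beta_new)
    beta <- beta_new
  }
  warning("no convergence in ", maxit, " iterations")
  beta
}

nll   <- function(par) negloglik_ar1(par, x = x)                # closure over the data
start <- unname(c(alpha_hat, phi1_hat, log(sqrt(sigma2_hat))))  # conditional LS start
newton_raphson(start, nll)
```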
13.4 Properties of estimators
Theorem 13.1 Under appropriate technical assumptions, the asymptotic distribution of the maximum likelihood estimator is given as follows, \[ n^{1/2} \brackets{(\estim{\vrr \vf}, \estim{\vrr \tht})\Tr - (\vvf, \vi \tht)\Tr} \inDist \normalD{\vi 0, \sigma^2 \vi V\Inv}, \] where the block \(\vi V_{\vvf, \vvf}\) of the matrix \[ \vi V = \mtr{ \vi V_{\vvf, \vvf} & \vi V_{\vvf, \vi \tht} \\ \vi V_{\vi \tht, \vvf} & \vi V_{\vi \tht, \vi \tht} } \] contains the autocovariances up to lag \(p-1\) of the AR process \[ \ArPoly{\backs} \rY_t = \rve_t, \tag{13.1}\] the block \(\vi V_{\vi \tht, \vi \tht}\) contains the autocovariances up to lag \(q-1\) of the AR process \[ \MaPoly{\backs} \rZ_t = \rve_t \tag{13.2}\] and the block \(\vi V_{\vvf, \vi \tht}\) the cross-covariances between the processes (13.1) and (13.2).
Example 13.1 (Maximum likelihood estimation asymptotics for \(\ARMA 1 1\)) Consider the model \[ \rX_t = \vf_1 \rX_{t-1} + \rve_t + \tht_1 \rve_{t-1}, \] for which the limiting distribution of \(n^{1/2} ((\estim{\rnd \vf}_1, \estim{\rnd \tht}_1)\Tr - (\vf_1, \tht_1)\Tr)\) is bivariate Gaussian with mean zero and covariance matrix \(\sigma^2 \vi V^{-1}\). Here \[ \vi V = \mtr{ \vi V_{\vf, \vf} & \vi V_{\vf, \tht} \\ \vi V_{\tht, \vf} & \vi V_{\tht, \tht} } \] and \(V_{\vf, \vf}\) is given by the autocovariance of the \(\ARmod 1\) process \(\rY_t = \vf_1 \rY_{t-1} + \rve_t\), that is \[ V_{\vf, \vf} = \acov_{\vr Y}(0) = \frac {\sigma^2} {1 - \vf_1^2}. \] Similarly, \(V_{\tht, \tht}\) is given by the autocovariance of the \(\ARmod 1\) process \(\rZ_t = - \tht_1 \rZ_{t-1} + \rve_t\), that is \[ V_{\tht, \tht} = \acov_{\vr Z}(0) = \frac {\sigma^2} {1 - \tht_1^2}. \] Last but not least, the entry \(V_{\vf, \tht}\) is given by the covariance of \(\vr Y\) and \(\vr Z\), that is \[\begin{align*} V_{\vf, \tht} &= \cov (\rY_t, \rZ_t) = \cov (\vf_1 \rY_{t-1} + \rve_t, -\tht_1 \rZ_{t-1} + \rve_t) \\ &= - \vf_1 \tht_1 \cov(\rY_{t-1}, \rZ_{t-1}) + \sigma^2. \end{align*}\] By stationarity we get \(V_{\vf, \tht} = -\vf_1 \tht_1 V_{\vf, \tht} + \sigma^2\), thus \[ V_{\vf, \tht} = \frac {\sigma^2} {1 + \vf_1 \tht_1} = V_{\tht, \vf}. \] Combined, this produces \[ \vi V = \sigma^2 \mtr{ (1 - \vf_1^2)\Inv & (1 + \vf_1 \tht_1)\Inv \\ (1 + \vf_1 \tht_1)\Inv & (1 - \tht_1^2)\Inv }, \] therefore \((\estim{\rnd \vf}_1, \estim{\rnd \tht}_1)\Tr\) is approximately normal with mean \((\vf_1, \tht_1)\Tr\) and covariance matrix \[ \frac 1 n \mtr{ (1 - \vf_1^2)\Inv & (1 + \vf_1 \tht_1)\Inv \\ (1 + \vf_1 \tht_1)\Inv & (1 - \tht_1^2)\Inv }\Inv. \]
Notice that the resulting covariance matrix does not depend on \(\sigma^2\).
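As a quick numerical illustration (a sketch with assumed parameter values, not part of the original example), one can compare these theoretical asymptotic standard errors with the ones reported by arima() on a simulated \(\ARMA 1 1\) series:

```r
## Sketch: theoretical vs. estimated standard errors for ARMA(1,1).
phi_true <- 0.5; theta_true <- 0.4; n_sim <- 2000
w       <- arima.sim(model = list(ar = phi_true, ma = theta_true), n = n_sim)
fit_sim <- arima(w, order = c(1, 0, 1), include.mean = FALSE, method = "ML")

## M = V / sigma^2; the asymptotic covariance of the estimators is M^{-1} / n
M <- matrix(c(1 / (1 - phi_true^2),            1 / (1 + phi_true * theta_true),
              1 / (1 + phi_true * theta_true), 1 / (1 - theta_true^2)),
            nrow = 2, byrow = TRUE)
theor_se <- sqrt(diag(solve(M)) / n_sim)        # note: does not involve sigma^2
rbind(theoretical = theor_se, estimated = sqrt(diag(fit_sim$var.coef)))
```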