9  Prediction of ARMA Processes

9.1 Linear prediction

Let our goal now be to predict future values \(\rX_{n+h}\) based on the data up to time \(n\), that is \(\set{\rX_n,\dots, \rX_1}\). Specifically, we want to

  • predict the future value (point prediction);
  • quantify uncertainty (prediction error, prediction intervals).

Thus we need to specify criteria for the quality of predictions, find good (optimal) predictions, focus on predictions that are easy to compute, and provide good algorithms.

9.1.1 Mean squared error criterion

Suppose we want to predict \(\rnd Y\) given \(\rnd Z_1, \dots, \rnd Z_n\) (mean zero variables with finite second moments). We look for a measurable function \(g: \R^n \to \R\) such that \(g(\vr Z)\) is on average close to the unobserved value of \(\rnd Y\); specifically, we minimize the mean squared error \(\Expect{\rnd Y - g(\vr Z)}^2\), which can be decomposed as \[ \begin{aligned} \Expect{\rnd Y - g(\vr Z)}^2 &= \Expect{\rnd Y - \Expect{\rnd Y \condp \vr Z}}^2 + \Expect{\Expect{\rnd Y \condp \vr Z} - g(\vr Z)}^2 \\ &+ 2\Expect{\brackets{\rnd Y - \Expect{\rnd Y \condp \vr Z}} \brackets{ \Expect{\rnd Y \condp \vr Z} - g(\vr Z)}}. \end{aligned} \]

The last term is zero (condition on \(\vr Z\) and use that \(\Expect{\rnd Y \condp \vr Z} - g(\vr Z)\) is a function of \(\vr Z\)) and \(\Expect{\rnd Y - \Expect{\rnd Y \condp \vr Z}}^2 \geq 0\). Thus for all functions \(g\) \[ \Expect{\rnd Y - g(\vr Z)}^2 \geq \Expect{\rnd Y - \Expect{\rnd Y \condp \vr Z}}^2 \] and the minimum is attained for \(g(\vr Z) = \Expect{\rnd Y \condp \vr Z}\), i.e. the best prediction is the conditional expectation. Unfortunately, this is often complicated to compute, as it depends on the joint distribution; in the Gaussian case, however, the conditional expectation is linear. We shall therefore focus on linear predictions.
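
As a quick numerical sanity check (not part of the derivation), the following Monte Carlo sketch uses a toy Gaussian model, chosen purely for illustration, and compares the mean squared error of the conditional expectation with that of another measurable function of \(\vr Z\):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy Gaussian model (illustration only): Y = 0.8 Z + noise, so E[Y | Z] = 0.8 Z.
n = 100_000
Z = rng.normal(size=n)
Y = 0.8 * Z + rng.normal(scale=0.5, size=n)

mse_conditional = np.mean((Y - 0.8 * Z) ** 2)   # conditional expectation as predictor
mse_other = np.mean((Y - np.tanh(Z)) ** 2)      # some other measurable function g(Z)
print(mse_conditional, mse_other)               # the first is (up to noise) the smaller one
```

In this toy model the conditional expectation is linear in \(\rnd Z\), which is exactly the Gaussian situation that motivates the restriction to linear predictions.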

9.2 Intermezzo

9.2.1 The Hilbert space setting for prediction

As we mentioned earlier, we focus on linear predictions, which naturally live in the linear space of mean zero second-order variables. Consider this space as a Hilbert space with inner product \(\scal{\vr X}{\vr Y} = \expect \vr X \vr Y\) and squared distance \(\norm{\vr X- \vr Y}^2 = \Expect{\vr X - \vr Y}^2\). Given \(\vr Z_1, \dots, \vr Z_n\), the candidate linear predictions \(c_1 \vr Z_1 + \dots + c_n \vr Z_n\) form the linear span of \(\vr Z_1, \dots, \vr Z_n\). We will also need to consider infinite collections \(\vr Z_1, \vr Z_2, \dots\), for which we define the linear span as the set of all finite linear combinations \[ \linspace{\vr Z_1, \vr Z_2, \dots} = \set{c_1 \vr Z_{i_1} + \dots + c_m \vr Z_{i_m} : m \in \N,\ i_1, \dots, i_m \in \N,\ c_1, \dots, c_m \in \R}. \]

Its closure consists of all mean-square limits of convergent sequences of its elements, and it is a closed subspace of the Hilbert space. It can be shown that the best approximation of an element of the Hilbert space by an element of a closed subspace is given by the orthogonal projection.

Theorem 9.1 Let \(M\) be a closed subspace of a Hilbert space \(H\). Then every \(y \in H\) can be uniquely decomposed as \(y = \estim y + u\) where \(\estim y \in M\) and \(u\) is orthogonal to \(M\) (i.e., \(\scal{u}{z} = 0\) for all \(z \in M\)). Furthermore, \[ \norm{y - \estim y} = \min_{z \in M}\norm{y - z} \] and \[ \norm{y}^2 = \norm{\estim y}^2 + \norm{u}^2. \]
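
The statement can be illustrated in the simplest possible Hilbert space, \(\R^3\) with the usual inner product. The sketch below (plain NumPy, purely illustrative) projects a vector onto a two-dimensional subspace and checks the orthogonality of the residual and the Pythagorean identity:

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(3, 2))      # columns of Z span the closed subspace M
y = rng.normal(size=3)

coef, *_ = np.linalg.lstsq(Z, y, rcond=None)   # orthogonal projection via least squares
y_hat = Z @ coef                               # component in M
u = y - y_hat                                  # residual, orthogonal to M

print(np.abs(Z.T @ u).max())                        # approximately 0: u is orthogonal to M
print(np.isclose(y @ y, y_hat @ y_hat + u @ u))     # ||y||^2 = ||y_hat||^2 + ||u||^2
```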

9.3 Linear prediction - cont.

Consider the linear space \(\linspace{\vr Z_1, \dots, \vr Z_n}\) (it has finite dimension and is therefore automatically closed) and linear predictions of the form \(\estim{\vr Y} = \sum_{j = 1}^n c_j \vr Z_j\). Our task is to find constants \(c_1, \dots, c_n\) such that \(\Expect{\vr Y - \estim{\vr Y}}^2\) is minimal, i.e. to find the orthogonal projection onto \(\linspace{\vr Z_1, \dots, \vr Z_n}\), which satisfies \[ \vr Y - \estim{\vr Y} \perp \linspace{\vr Z_1, \dots, \vr Z_n}. \] In other words, \[ \Expect{\vr Z (\vr Y - \vr Z\Tr \vi c)} = 0, \] that is, \(\variance (\vr Z)\vi c = \cov (\vr Z, \vr Y)\), or, if \(\variance(\vr Z)\) is invertible, \(\vi c = \variance (\vr Z)\Inv \cov (\vr Z,\vr Y)\).

Tip

In linear models, we had \(\vr X\Tr \vr X\) in the role of \(\variance (\vr Z)\) and \(\vr X\Tr \vr Y\) in the role of \(\cov (\vr Z, \vr Y)\).

The prediction error is then given by \[ \Expect{\vr Y - \vr Z\Tr \vi c}^2 = \variance \vr Y - \cov (\vr Y, \vr Z) \variance (\vr Z)\Inv \cov(\vr Z, \vr Y). \]

If \(\cov(\vr Y, \vr Z) = 0\), the prediction error equals \(\variance \vr Y\): we have the same amount of uncertainty about \(\vr Y\) as if we had not observed \(\vr Z\) at all, so \(\vr Z\) provides no additional information about \(\vr Y\). Hence correlation is good for prediction (which matches intuition). In regression with iid errors, prediction can be done only by extrapolating the fitted mean, while in time series we can exploit the association (similarity, dissimilarity) between variables. Also notice that the prediction error is positive (unless \(\vr Y\) is a linear function of \(\vr Z\)): there is always some uncertainty, some additional randomness in \(\vr Y\) that is not contained in \(\vr Z\).
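
In matrix form the computation is a single linear solve. A minimal sketch, using a small hypothetical covariance structure for \((\rnd Y, \rnd Z_1, \rnd Z_2)\) (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical second-order structure -- illustrative numbers only.
var_Z = np.array([[1.0, 0.5],
                  [0.5, 1.0]])          # var(Z)
cov_ZY = np.array([0.8, 0.3])           # cov(Z, Y)
var_Y = 1.0

c = np.linalg.solve(var_Z, cov_ZY)      # c = var(Z)^{-1} cov(Z, Y)
mse = var_Y - cov_ZY @ c                # var(Y) - cov(Y, Z) var(Z)^{-1} cov(Z, Y)
print(c, mse)                           # the prediction error stays strictly positive here
```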

9.3.1 Time-series forecasting

Now let us focus on a zero-mean stationary time series with finite second moments. Based on \(\rnd X_1, \dots, \rnd X_n\), we attempt to predict \(\rnd X_{n+1}\) by \(\estim{\rnd X}_{n+1}\) (one step ahead) or \(\rnd X_{n+h}\) by \(\estim{\rnd X}_{n+h}(n)\) (\(h\) steps ahead). Thus we look for \(\estim{\rnd X}_{n+1} = \sum_{j = 1}^n \vf_{nj} \rnd X_{n+1-j}\). With stationarity, the estimating equation for \(\estim{\rnd X}_{n+1}\) becomes \[ \covMat^{(n)} \vvf^{(n)} = \vi \acov^{(n)}, \] where \(\covMat^{(n)}\) is the \((n \times n)\) matrix with entries \(\covMat^{(n)}_{ij} = \acov(i-j)\) (autocovariance), \(\vi \acov^{(n)} = (\acov(1), \dots, \acov(n))\Tr\) and \(\vvf^{(n)} = (\vf_{n1}, \dots, \vf_{nn})\Tr\). The non-stationary case can be handled analogously, and a similar procedure works for \(h\)-step-ahead predictions. Of course, if \(\covMat^{(n)}\) is invertible, one can compute \[ \vvf^{(n)} = {\covMat^{(n)}}\Inv \vi \acov^{(n)}. \]
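
For a stationary series with known (or estimated) autocovariance function, the coefficients \(\vvf^{(n)}\) come from one linear solve. A minimal sketch, using the AR(1) autocovariance \(\acov(h) = \vf^{|h|}/(1-\vf^2)\) only as a convenient test case:

```python
import numpy as np

def one_step_coeffs(acvf, n):
    """Solve Gamma^(n) phi^(n) = gamma^(n) for a zero-mean stationary series."""
    Gamma = np.array([[acvf(abs(i - j)) for j in range(n)] for i in range(n)])
    gamma_vec = np.array([acvf(h) for h in range(1, n + 1)])
    return np.linalg.solve(Gamma, gamma_vec)

# Test case: AR(1) with phi = 0.6 and sigma^2 = 1, i.e. acvf(h) = 0.6**|h| / (1 - 0.36).
phi = 0.6
print(one_step_coeffs(lambda h: phi ** abs(h) / (1 - phi ** 2), n=5))
# approximately [0.6, 0, 0, 0, 0]: the one-step predictor is phi * X_n, as expected
```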

Theorem 9.2 For a mean zero stationary \(L^2\) sequence with \(\acov(0) > 0\) and \(\acov(k) \to 0\) as \(k \to \infty\), the matrix \(\covMat^{(n)}\) is non-singular for every \(n\).

9.4 Recursive algorithms for computing predictions

9.4.1 Difficulties in the computation of predictions

Realize that solving \[ \covMat^{(n)} \vvf^{(n)} = \vi \acov^{(n)} \] may be difficult. While a special structure of \(\covMat^{(n)}\) (e.g., banded for MA models) may help, it would still be no small feat. If the series is long (e.g., \(n = 10000\)), we need to solve a large linear system, i.e. invert a large matrix (problems with numerical errors, computing time, memory). Hence we would like to avoid inverting matrices. Also, when a new observation arrives, the prediction needs to be recomputed from scratch; we would like to update previous predictions without recalculating everything. Combined, this suggests using recursive procedures that use simple operations to update previous results.

9.4.2 Innovations

The linear prediction \[ \estim{\rnd X}_{n+1} = \sum_{j = 1}^n \vf_{nj} \rX_{n+1-j} \] is an element of the linear span \(\linspace{\rnd X_1, \dots, \rnd X_n}\). The principal difficulty in solving \(\vvf^{(n)} = {\covMat^{(n)}}\Inv \vi \acov^{(n)}\) is that a non-trivial matrix must be inverted. Thus we try to re-express (change the basis of) the linear span to get a simpler matrix. Notice that \[ \linspace{\rnd X_1, \dots, \rnd X_n} = \linspace{\rnd X_1 - \estim{\rnd X}_1, \dots, \rnd X_n - \estim{\rnd X}_n}, \] where we set \(\estim{\rnd X}_1 = 0\), as there is no history to predict from. Innovations \(\rnd X_k - \estim{\rnd X}_k\) are differences between the actual observation and its one-step-ahead prediction based on the past. So instead of looking for \[ \estim{\rnd X}_{n+1} = \sum_{j = 1}^n \vf_{nj} \rnd X_{n+1-j}, \] we equivalently search for \[ \estim{\rnd X}_{n+1} = \sum_{j = 1}^n \tht_{nj} \brackets{\rnd X_{n+1-j} - \estim{\rnd X}_{n+1-j}}. \] Notice that the innovations are uncorrelated, i.e.  \[ \expect\brackets{\rnd X_j - \estim{\rnd X}_j}\brackets{\rnd X_k - \estim{\rnd X}_k} = 0 \tag{9.1}\] for all \(j < k\), because

  • \(\rnd X_j - \estim{\rnd X}_j \in \linspace{\rnd X_1, \dots \rnd X_j}\);
  • \(\rnd X_k - \estim{\rnd X}_k \perp \linspace{\rnd X_1, \dots, \rnd X_{k-1}} \supseteq \linspace{\rnd X_1, \dots, \rnd X_j}\) by orthogonality of the projection (recall \(j < k\)).

Hence \(\rnd X_1 - \estim{\rnd X}_1, \dots, \rnd X_n - \estim{\rnd X}_n\) are indeed uncorrelated, as we claimed, i.e. orthogonal in our Hilbert space. Projections on orthogonal elements are particularly easy: the orthogonal projection of \(\rnd Y\) on \(\vr Z^{(n)} = (\rnd Z_n, \dots, \rnd Z_1)\Tr\) is given by \[ \estim{\rnd Y} = \sum_{j = 1}^n \tht_{nj} \rnd Z_{n+1-j} = {\vi \tht^{(n)}}\Tr \vr Z^{(n)}, \] where \[ \vi \tht^{(n)} = \cov \brackets{\vr Z^{(n)}}^{-1} \cov(\vr Z^{(n)}, \rnd Y). \] If the \(\rnd Z_j\) are orthogonal, then \(\cov(\vr Z^{(n)})\) is diagonal, thus easy to invert, and \[ \tht_{nj} = \frac {\cov(\rnd Z_{n+1-j}, \rnd Y)} {\variance \rnd Z_{n+1-j}}. \] As the innovations \(\rnd Z_k = \rnd X_{k+1} - \estim{\rnd X}_{k+1}\), \(k = 0, \dots, n-1\), are orthogonal, denote \[ v_k = \expect \rnd Z_k^2 = \Expect {\rX_{k+1} - \estim{\rX}_{k+1}}^2. \tag{9.2}\] The covariance matrix of the innovations is \(\diagonal \set{v_{n-1}, v_{n-2}, \dots, v_0}\). The prediction coefficients then become \[ \tht_{n,n-k} = v_k^{-1}\expect \rX_{n+1} \rnd Z_k = v_k^{-1}\expect \rX_{n+1} (\rX_{k+1} - \estim{\rX}_{k+1}) \] for \(k = 0, \dots, n-1\), that is \[\begin{align*} \tht_{n, n-k} & = v_k\Inv \Expect{\rX_{n+1} \brackets{\rX_{k+1} - \estim \rX_{k+1}}} \\ &= v_k\Inv \brackets{\acov(n+1, k+1) - \Expect{\rX_{n+1} \estim \rX_{k+1}}}. \end{align*}\] To compute \(\expect \rX_{n+1} \estim \rX_{k+1}\), recall that \[ \estim \rX_{k+1} = \sum_{j = 0}^{k-1} \tht_{k, k-j}(\rX_{j+1} - \estim \rX_{j+1}), \tag{9.3}\] hence \[ \expect \rX_{n+1}\estim \rX_{k+1} = \sum_{j = 0}^{k-1} \tht_{k, k-j} \expect \rX_{n+1} (\rX_{j+1} - \estim \rX_{j+1}), \] where for \(j < k\) we get by (9.1) \[\begin{align*} \Expect{\rX_{n+1} (\rX_{j+1} - \estim \rX_{j+1})} &= \Expect{(\rX_{n+1} - \estim \rX_{n+1})(\rX_{j+1} - \estim \rX_{j+1})}\\ &\hphantom{=} + \Expect{\estim \rX_{n+1}(\rX_{j+1} - \estim \rX_{j+1})} \\ &= \Expect{\sum_{h = 0}^{n-1} \tht_{n, n-h}(\rX_{h+1} - \estim \rX_{h+1})(\rX_{j+1} - \estim \rX_{j+1})} \\ &= \tht_{n, n-j}v_j, \end{align*}\] since the first term vanishes by (9.1) and only the \(h = j\) term survives. Put together, we get \[ \tht_{n, n-k} = v_k^{-1} \brackets{ \acov(n+1, k+1) - \sum_{j = 0}^{k-1} \tht_{k, k-j} \tht_{n, n-j}v_j }, \tag{9.4}\] and when we then compute \(v_n = \Expect{\rX_{n+1} - \estim \rX_{n+1}}^2\), using orthogonality \[ \expect \rX_{n+1}^2 = \Expect{\rX_{n+1} - \estim \rX_{n+1}}^2 + \expect \estim \rX_{n+1}^2, \] by Definition 2.2 and the assumption that \(\set{\rX_t}\) has mean zero, together with (9.3) and (9.4), it finally yields \[\begin{align*} v_n & = \expect \rX_{n+1}^2 - \expect \estim \rX_{n+1}^2 \\ &= \acov (n+1, n+1) - \Expect{\sum_{j = 0}^{n-1} \tht_{n, n-j}(\rX_{j+1} - \estim \rX_{j+1})}^2 \\ &= \acov(n+1, n+1) - \sum_{j = 0}^{n-1} \tht^2_{n,n-j} v_j. \end{align*}\] This is summarized in the following Theorem 9.3.

Theorem 9.3 (Innovations) If \(\set{\rX_t}\) has mean zero and \(\expect \rX_i \rX_j = \acov(i,j)\), where the matrix with entries \(\acov(i,j)\), \(i,j = 1, \dots, n\) is non-singular for each \(n = 1,2,\dots\), then the one-step predictors \(\estim \rX_{n+1}\), \(n \geq 0\), and their mean squared errors \(v_n\), see (9.2), are given by \[ \estim \rX_{n+1} = \begin{cases} 0, & n = 0 \\ \sum_{j = 1}^n \tht_{n,j} (\rX_{n+1-j} - \estim \rX_{n+1-j}), & n \geq 1 \end{cases} \] and \[\begin{align*} v_0 &= \acov(1,1), \\ \tht_{n, n-k} &= v_k^{-1} \brackets{\acov(n+1, k+1) - \sum_{j = 0}^{k-1} \tht_{k,k-j} \tht_{n,n-j} v_j}, \quad k = 0, \dots, n-1, \\ v_n &= \acov(n+1, n+1) - \sum_{j = 0}^{n-1} \tht_{n, n-j}^2 v_j. \end{align*}\]

Important

Thus the one-step prediction can be computed easily and quickly, without any matrix operations, using only basic arithmetic.

Thus this algorithm relies on the recursive computation of \(\tht_{n,j}\) and \(v_n\) from previously computed values; it requires no matrix inversion, only basic operations, and allows easy updating when new observations arrive. Also, stationarity is not necessary. Essentially, this is the Gram–Schmidt orthogonalization procedure, and it is useful in maximum likelihood estimation.
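
A minimal sketch of the recursions of Theorem 9.3 (the function names and the MA(1) test case are mine, not from the notes); the autocovariance is passed as a function \(\acov(i,j)\), so stationarity is not assumed:

```python
import numpy as np

def innovations(gamma, n):
    """Innovations algorithm: returns theta[m][j] = theta_{m,j} and v[m], m = 0, ..., n."""
    v = np.zeros(n + 1)
    theta = [np.zeros(m + 1) for m in range(n + 1)]
    v[0] = gamma(1, 1)
    for m in range(1, n + 1):
        for k in range(m):                      # computes theta_{m, m-k} for k = 0, ..., m-1
            s = sum(theta[k][k - j] * theta[m][m - j] * v[j] for j in range(k))
            theta[m][m - k] = (gamma(m + 1, k + 1) - s) / v[k]
        v[m] = gamma(m + 1, m + 1) - sum(theta[m][m - j] ** 2 * v[j] for j in range(m))
    return theta, v

def one_step_predictors(x, gamma):
    """Sequential one-step predictors hat{X}_1, ..., hat{X}_{n+1} for observations x."""
    n = len(x)
    theta, v = innovations(gamma, n)
    xhat = np.zeros(n + 1)                      # xhat[k] corresponds to hat{X}_{k+1}
    for m in range(1, n + 1):
        xhat[m] = sum(theta[m][j] * (x[m - j] - xhat[m - j]) for j in range(1, m + 1))
    return xhat, v

# Test case: MA(1), X_t = eps_t + 0.5 eps_{t-1}, sigma^2 = 1, so gamma(i, i) = 1.25,
# gamma(i, j) = 0.5 for |i - j| = 1, and 0 otherwise.
gamma_ma1 = lambda i, j: {0: 1.25, 1: 0.5}.get(abs(i - j), 0.0)

rng = np.random.default_rng(0)
eps = rng.normal(size=201)
x = eps[1:] + 0.5 * eps[:-1]
xhat, v = one_step_predictors(x, gamma_ma1)
print(xhat[-1], v[-1])          # v[-1] approaches sigma^2 = 1 as n grows
```

In this test case the coefficients \(\tht_{n,j}\) with \(j > 1\) come out as zero, in line with the remark on \(\MAmod{q}\) processes below.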

Tip

A similar (but better) algorithm is a Kalman filter, which we will discuss in the next course.

We can also simplify the innovations algorithm for \(\MAmod{q}\) processes, as the coefficients \(\tht_{n, q+1}, \tht_{n, q+2}, \dots\) are zero, because \(\expect \rX_{n+1} (\rX_{k+1} - \estim \rX_{k+1}) = 0\) for \(n - k > q\).

9.4.3 Recursive prediction for ARMA model

For \(\MAmod{q}\), the innovations algorithm simplifies due to the simple autocovariance structure. The ARMA autocovariance is more complicated, but a question arises whether we could transform the process into an MA-like one. Thus consider \(\ARMA{p}{q}\) \[ \ArPoly{\backs} \rX_t = \maPoly(\backs) \rve_t, \quad \set{\rve_t} \sim \WN{0}{\sigma^2}. \] For \(m = \max(p,q)\), define the transformed process \[\begin{gather*} \rnd W_t = \sigma^{-1} \rX_t, \quad t = 1, \dots, m \\ \rnd W_t = \sigma^{-1} \ArPoly{\backs}\rX_t, \quad t > m. \end{gather*}\] The linear spans of \(\rX_1, \dots, \rX_n\) and \(\rnd W_1, \dots, \rnd W_n\) are the same, so we can apply the algorithm to \(\rnd W_t\) to get \[ \estim{\rnd W}_{n+1} = \sum_{j = 1}^{n} \tht_{nj} (\rnd W_{n+1-j} - \estim{\rnd W}_{n+1-j}), \] where the coefficients \(\tht_{nj}\) and the mean square error \(r_n\) are found by the recursive procedure applied to the autocovariances of \(\rnd W_t\). For \(t > m\), \(\rnd W_t\) is \(\MAmod{q}\), and thus \(\tht_{nj} = 0\) for \(j > q\) once \(n \geq m\). Now we would like to obtain \(\estim \rX_{t}\) from \(\estim{\rnd W}_{t}\). We can project each side of the defining equation for \(\rnd W_t\) on \(\rX_1, \dots, \rX_{t-1}\) to get \[\begin{align*} \estim{\rnd W}_t &= \sigma^{-1} \estim \rX_t, \quad t = 1, \dots, m, \\ \estim{\rnd W}_t &= \sigma^{-1} \brackets{\estim \rX_t - \vf_1 \rX_{t-1} - \dots - \vf_p \rX_{t-p}}, \quad t > m, \end{align*}\] and thus \(\rX_t - \estim \rX_t = \sigma (\rnd W_t - \estim{\rnd W}_t)\), which gives \[\begin{align*} \estim \rX_{n+1} &= \sum_{j = 1}^n \tht_{nj} (\rX_{n+1-j} - \estim \rX_{n+1-j}), \quad 1 \leq n < m, \\ \estim \rX_{n+1} &= \vf_1 \rX_{n} + \dots + \vf_p \rX_{n+1-p} + \sum_{j = 1}^n \tht_{nj} (\rX_{n+1-j} - \estim \rX_{n+1-j}), \quad n \geq m, \end{align*}\] and finally \(v_n = \Expect{\rX_{n+1} - \estim \rX_{n+1}}^2 = \sigma^2 r_n\).

Tip

To summarize, we derived a way to make one-step predictions for \(\MAmod{q}\) and then used it in the ARMA model, where the AR part is easy to predict.

9.4.4 \(h\)-step prediction

Given data \(\rX_1, \dots, \rX_n\), we want to predict \(\rX_{n+h}\) using \(\rX_1, \dots, \rX_n\) – via best linear prediction – which leads us to solve \[ \covMat^{(n)} \vvf^{(h)} = \vi \acov^{(h)}, \] where the right-hand side now contains the autocovariances at lags \(h, \dots, h+n-1\). Alternatively, we can continue the iterations of the innovations algorithm to obtain the \(h\)-step prediction \[ \estim \rX_{n+h}(n) = \sum_{j = h}^{n+h-1} \tht_{n+h-1, j} (\rX_{n+h-j} - \estim \rX_{n+h-j}(n+h-j-1)) \] with mean squared error \[ \Expect{\rX_{n+h} - \estim \rX_{n+h}(n)}^2 = \acov(n+h, n+h) - \sum_{j = h}^{n+h-1} \tht_{n+h-1, j}^2 v_{n+h-1-j}. \] The required coefficients \(\tht_{n+h-1,j}\) and errors \(v_j\) are obtained by simply continuing the iterations.

9.4.4.1 Limiting behavior of the \(h\)-step best linear prediction

Now let \(h \to \infty\). What is then the behavior of the prediction and its error? We have \[ \vvf^{(h)} = {\covMat^{(n)}}\Inv \vi \acov^{(h)}, \] and often \(\vi \acov^{(h)} \to 0\) as \(h \to \infty\) (e.g., for a stationary ARMA process). Hence \(\vvf^{(h)} \to 0\) and \(\estim \rX_{n+h}(n) \to 0\) as \(h \to \infty\), i.e. the prediction converges to the mean of the series. Under stationarity, the prediction error is \[ \acov(0) - {\vi \acov^{(h)}}\Tr{\covMat^{(n)}}\Inv \vi \acov^{(h)} \to \acov(0) \] as \(h \to \infty\).

Warning

Thus long-term forecasts have approximately the same degree of uncertainty as if no data were observed.

9.4.5 Prediction with trend

Now consider a model with a deterministic trend \[ \rY_t = \mu_t + \rX_t, \] where \(\rX_t\) is stationary and, for example, \(\mu_t = \beta_0 + \beta_1 t + \beta_2 s(t)\). Then we have data \(\rY_1, \dots, \rY_n\) and our goal is to predict \(\rY_{n+h}\) by \(\estim \rY_{n+h}(n)\), so we extrapolate the trend (which can be either known or estimated) and then predict by \[ \estim \rY_{n+h}(n) = \mu_{n+h} + \estim \rX_{n+h}(n) \] where \(\estim \rX_{n+h}(n)\) is constructed as the \(h\)-step prediction from \(\rX_t = \rY_t - \mu_t\), \(t = 1, \dots, n\).
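
A small sketch of this decompose–forecast–recompose step; the trend function and the plugged-in forecaster below are placeholders (any \(h\)-step forecaster for the stationary part could be substituted):

```python
import numpy as np

def forecast_with_trend(y, trend, h, stationary_forecaster):
    """Predict Y_{n+1}, ..., Y_{n+h} as mu_{n+k} + hat{X}_{n+k}(n).

    trend(t)                    -- deterministic trend mu_t (assumed known here),
    stationary_forecaster(x, h) -- any h-step forecaster for the detrended series.
    """
    n = len(y)
    x = y - trend(np.arange(1, n + 1))        # X_t = Y_t - mu_t
    xhat = stationary_forecaster(x, h)        # hat{X}_{n+1}(n), ..., hat{X}_{n+h}(n)
    return trend(np.arange(n + 1, n + h + 1)) + xhat

# Placeholders: a linear trend and a naive AR(1)-style forecaster.
trend = lambda t: 2.0 + 0.1 * t
ar1_forecaster = lambda x, h: np.array([0.6 ** k * x[-1] for k in range(1, h + 1)])

y = trend(np.arange(1, 51)) + np.random.default_rng(2).normal(size=50)
print(forecast_with_trend(y, trend, h=3, stationary_forecaster=ar1_forecaster))
```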

9.4.6 Prediction errors and intervals

We can derive the following asymptotic behavior \[ \estim \rY_{n+h}(n) - \mu_{n+h} = \estim \rX_{n+h}(n) \to 0, \quad h \to \infty, \] so keep in mind that in the limit we predict by extrapolating the trend. The prediction error is then given by \[ \estim \rY_{n+h}(n) - \rY_{n+h} = \estim \rX_{n+h}(n) - \rX_{n+h}, \] thus the mean squared prediction error is the same as for \(\estim \rX_{n+h}(n)\) and it converges to the series variance \(\acov(0)\). If the trend is estimated, the parametric estimation error is of order \(n^{-1}\), comparatively smaller than the prediction error.

Note

Our prediction was simply constructed as a best linear prediction and as such we made no assumptions on the distribution of our data.

If the data are Gaussian, then \[ \rX_{n+h} \condp (\rX_1, \dots, \rX_n) \sim \normalD{\estim \rX_{n+h}(n), \Expect{\rX_{n+h} - \estim \rX_{n+h}(n)}^2}, \] where in the stationary case the variance equals \(\acov(0) - {\vi \acov^{(h)}}\Tr {\covMat^{(n)}}\Inv \vi \acov^{(h)}\); denote it by \(v_{n,h}\). A Gaussian prediction interval with coverage \(1 - \alpha\) is then given by \[ \brackets{ \estim \rY_{n+h}(n) - c_{1 - \alpha / 2} v^{1/2}_{n,h}, \estim \rY_{n+h} (n) + c_{1 - \alpha / 2} v^{1/2}_{n,h} } \] with \(c_{1 - \alpha /2}\) being the \((1 - \alpha/2)\)-quantile of \(\normalD{0,1}\).
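
Given a point forecast and its mean squared prediction error, the interval is a one-liner. A sketch (treating the MSPE as known and using the normal quantile from SciPy):

```python
import numpy as np
from scipy.stats import norm

def gaussian_prediction_interval(forecast, mspe, alpha=0.10):
    """(1 - alpha) prediction interval: forecast +/- z_{1 - alpha/2} * sqrt(mspe)."""
    c = norm.ppf(1 - alpha / 2)               # (1 - alpha/2)-quantile of N(0, 1)
    half_width = c * np.sqrt(mspe)
    return forecast - half_width, forecast + half_width

print(gaussian_prediction_interval(forecast=1.3, mspe=0.8, alpha=0.10))   # 90% interval
```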

Tip

It is good practice to use prediction intervals with a coverage of 80% or 90% – insisting on stricter (higher-coverage) intervals makes (a bit) less sense.

9.5 Prediction of ARMA processes with infinite past

Recall an \(\ARmod{p}\) model given by \[ \rX_t = \vf_1 \rX_{t-1} + \dots + \vf_p \rX_{t-p} + \rve_t \] and suppose we want to predict \(\rX_{n+1}\) based on \(\rX_1, \dots, \rX_n\), \(n \geq p\). The best linear prediction is then given directly by the autoregressive formulation \[ \estim \rX_{n+1} = \vf_1 \rX_n + \dots + \vf_p \rX_{n -p + 1}. \]

9.5.1 \(h\)-step prediction in AR process

Now we want to predict \(\rX_{n+h}\) given \(\rX_1, \dots, \rX_n\), \(n \geq p\). We can obtain the projection \(\estim \rX_{n+h}(n) = \BLP_n \rX_{n+h}\) of \(\rX_{n+h}\) on \(\rX_1, \dots, \rX_n\) by projecting first on \(\rX_1, \dots, \rX_{n+h-1}\), then projecting \(\BLP_{n+h-1} \rX_{n+h}\) on \(\rX_1, \dots, \rX_{n+h-2}\), etc., and finally on \(\rX_1, \dots, \rX_n\), from which we obtain \[ \estim \rX_{n+h}(n) = \BLP_n \rX_{n+h} = \dots = \BLP_n \BLP_{n+1} \dots \BLP_{n+h-1} \rX_{n+h}. \] As an example, the two-steps-ahead best prediction is \[ \estim \rX_{n+2}(n) = \vf_1 \estim \rX_{n+1} + \vf_2 \rX_n + \dots + \vf_p \rX_{n-p+2} \] and analogously, the \(h\)-steps-ahead best prediction is \[ \estim \rX_{n+h}(n) = \vf_1 \estim \rX_{n+h-1}(n) + \dots + \vf_p \estim \rX_{n+h-p}(n), \] where we set \(\estim \rX_t(n) = \rX_t\) for \(t = 1, \dots, n\). Hence we have a straightforward recursive computation.
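
The recursion is only a few lines of code. A sketch for a generic \(\ARmod{p}\) coefficient vector (the parameter values and placeholder data are illustrative only):

```python
import numpy as np

def ar_forecast(x, phi, h):
    """Recursive h-step forecasts for X_t = phi_1 X_{t-1} + ... + phi_p X_{t-p} + eps_t.

    Observed values are used where available (hat{X}_t = X_t for t <= n),
    previously computed forecasts otherwise; requires len(x) >= len(phi).
    """
    p = len(phi)
    history = list(x)
    forecasts = []
    for _ in range(h):
        pred = sum(phi[j] * history[-1 - j] for j in range(p))
        history.append(pred)
        forecasts.append(pred)
    return np.array(forecasts)

# Illustration: an AR(2) with (phi_1, phi_2) = (0.6, -0.2) on placeholder data.
rng = np.random.default_rng(3)
x = rng.normal(size=100)
print(ar_forecast(x, phi=[0.6, -0.2], h=5))
```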

9.5.2 Prediction with infinite past

Recall that prediction in \(\ARmod{p}\) is an autoregression on the \(p\) preceding values of the series. Also remember that for an invertible \(\MAmod{q}\) process we may write \[ \rve_t = \maPoly(\backs)\Inv \rX_t = \sum_{j = 0}^{\infty} \pi_j \rX_{t-j}, \] which is an autoregression on the infinite past (the process must be invertible). Thus assume now that the infinite past \[ \rX_n, \rX_{n-1}, \dots, \rX_1, \rX_0, \rX_{-1}, \dots \] is available and investigate the projection \(\infEstim \rX_{n+h}(n) = \infEstim{\BLP}_n \rX_{n+h}\) of \(\rX_{n+h}\) on the infinite past. Consider a full \(\ARMA{p}{q}\) model \[ \ArPoly{\backs} \rX_t = \MaPoly{\backs}\rve_t \] and let it be causal and invertible, i.e. we can write \[ \rX_{n+h} = \sum_{j = 0}^{\infty}\psi_j \rve_{n+h-j}, \quad \rve_{n+h} = \sum_{j = 0}^{\infty} \pi_j \rX_{n+h-j}. \] By projecting both sides of the \(\ARmod{\infty}\) representation on the past we get \[ 0 = \estim \rX_{n+h}(n) + \sum_{j = 1}^\infty \pi_j \estim \rX_{n+h-j}(n), \] therefore \[ \estim \rX_{n+h}(n) = - \sum_{j = 1}^{h-1}\pi_j \estim \rX_{n+h-j}(n) - \sum_{j = h}^\infty \pi_j \rX_{n+h-j}, \] which gives us a recursive computation of the prediction. The mean squared prediction error is again \(\Expect {\rX_{n+h} - \estim \rX_{n+h}(n)}^2\). Also, the past \[ \rX_n, \rX_{n-1}, \dots, \rX_1, \rX_0, \rX_{-1}, \dots \] is equivalent to \[ \rve_n, \rve_{n-1}, \dots, \rve_1, \rve_0, \rve_{-1}, \dots \] as the subspaces generated by both sequences coincide due to causality and invertibility. By projecting both sides of the \(\MAmod{\infty}\) representation on the past we get (recall the orthogonality) \[ \estim \rX_{n+h}(n) = \sum_{j = 0}^\infty \psi_j \infEstim{\BLP}_n \rve_{n+h-j} = \sum_{j = h}^\infty \psi_j \rve_{n+h-j}. \] From this, we obtain that the mean squared prediction error is \[ \Expect {\rX_{n+h} - \estim \rX_{n+h}(n)}^2 = \Expect{\sum_{j = 0}^{h-1} \psi_j \rve_{n+h-j}}^2 = \sigma^2 \sum_{j = 0}^{h-1} \psi_j^2. \] As we consider the prediction horizon \(h \to \infty\), from \[ \estim \rX_{n+h}(n) = \sum_{j = h}^\infty \psi_j \rve_{n+h-j} \] and the fact that the \(\psi\)-weights decay exponentially, we get that \(\estim \rX_{n+h}(n)\) quickly converges to the (zero) mean of the series (in the \(L^2\) sense). As for the mean squared prediction error, we see that \[ \Expect {\rX_{n+h} - \estim \rX_{n+h}(n)}^2 = \sigma^2 \sum_{j = 0}^{h-1} \psi_j^2 \to \sigma^2 \sum_{j = 0}^\infty \psi_j^2 = \acov(0). \]
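
The \(\psi\)-weights, and hence the limiting behavior of the prediction error, are easy to compute numerically. A sketch using the standard recursion \(\psi_j = \tht_j + \sum_{k=1}^{\min(j,p)} \vf_k \psi_{j-k}\), \(\psi_0 = 1\) (which follows from \(\MaPoly{z} = \psi(z)\ArPoly{z}\)); the ARMA(1,1) parameter values are illustrative only:

```python
import numpy as np

def psi_weights(phi, theta, m):
    """First m+1 MA(infinity) weights psi_0, ..., psi_m of a causal ARMA(p, q).

    Convention: X_t = phi_1 X_{t-1} + ... + phi_p X_{t-p}
                      + eps_t + theta_1 eps_{t-1} + ... + theta_q eps_{t-q}.
    """
    psi = np.zeros(m + 1)
    psi[0] = 1.0
    for j in range(1, m + 1):
        t_j = theta[j - 1] if j <= len(theta) else 0.0
        psi[j] = t_j + sum(phi[k - 1] * psi[j - k] for k in range(1, min(j, len(phi)) + 1))
    return psi

def h_step_mspe(phi, theta, sigma2, h):
    """Mean squared error of the h-step prediction with infinite past."""
    psi = psi_weights(phi, theta, h - 1)
    return sigma2 * float(np.sum(psi ** 2))    # sigma^2 * sum_{j=0}^{h-1} psi_j^2

# Illustration: ARMA(1,1) with phi_1 = 0.6, theta_1 = 0.3, sigma^2 = 1.
for h in (1, 2, 5, 20):
    print(h, h_step_mspe(phi=[0.6], theta=[0.3], sigma2=1.0, h=h))
# the values increase with h and level off at gamma(0) = sigma^2 * sum_j psi_j^2
```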

Caution

Hence the mean squared prediction error quickly converges to the variance of the series. This implies that short-term predictions are good, but long-term predictions behave as if no data had been observed.

9.5.3 Truncated prediction in ARMA models with finite past

Lastly, consider the finite observed past \(\rX_1, \dots, \rX_n\) instead of the infinite past and assume \(n\) is sufficiently large. We use the same relation as with the infinite past, \[ \estim \rX_{n+h}(n) = - \sum_{j = 1}^{h-1}\pi_j \estim \rX_{n+h-j}(n) - \sum_{j = h}^\infty \pi_j \rX_{n+h-j}, \] but set \(\rX_t = 0\) for \(t \leq 0\) (their weights are small anyway if \(n\) is large), that is, we use the truncated forecast \[ \estim \rX_{n+h}(n) = - \sum_{j = 1}^{h-1}\pi_j \estim \rX_{n+h-j}(n) - \sum_{j = h}^{n+h-1} \pi_j \rX_{n+h-j}, \] which again leads to a recursive computation.
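
A sketch of the truncated forecast; the \(\pi\)-weights are computed from the recursion \(\pi_j = -\vf_j - \sum_{k=1}^{\min(j,q)} \tht_k \pi_{j-k}\), \(\pi_0 = 1\) (from \(\ArPoly{z} = \pi(z)\MaPoly{z}\)), and the ARMA(1,1) parameters and placeholder data are illustrative only:

```python
import numpy as np

def pi_weights(phi, theta, m):
    """First m+1 AR(infinity) weights pi_0, ..., pi_m of an invertible ARMA(p, q).

    Convention: phi(z) = 1 - phi_1 z - ... - phi_p z^p,
                theta(z) = 1 + theta_1 z + ... + theta_q z^q.
    """
    pi = np.zeros(m + 1)
    pi[0] = 1.0
    for j in range(1, m + 1):
        phi_j = phi[j - 1] if j <= len(phi) else 0.0
        s = sum(theta[k - 1] * pi[j - k] for k in range(1, min(j, len(theta)) + 1))
        pi[j] = -phi_j - s
    return pi

def truncated_forecast(x, phi, theta, h):
    """Truncated forecasts hat{X}_{n+1}(n), ..., hat{X}_{n+h}(n), with X_t = 0 for t <= 0."""
    n = len(x)
    pi = pi_weights(phi, theta, n + h)
    xhat = []
    for step in range(1, h + 1):
        from_forecasts = -sum(pi[j] * xhat[step - 1 - j] for j in range(1, step))
        from_data = -sum(pi[j] * x[n + step - 1 - j] for j in range(step, n + step))
        xhat.append(from_forecasts + from_data)
    return np.array(xhat)

# Illustration: ARMA(1,1) with phi_1 = 0.6, theta_1 = 0.3 on placeholder data.
rng = np.random.default_rng(4)
x = rng.normal(size=200)
print(truncated_forecast(x, phi=[0.6], theta=[0.3], h=5))
```

For an AR(1), for instance, the \(\pi\)-weights vanish beyond lag one and the truncated forecast reduces to the exact recursion \(\estim \rX_{n+h}(n) = \vf_1^h \rX_n\), which provides a convenient sanity check.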