The richness of the class of (S)AR(I)MA models is a double-edged sword – they are flexible but also difficult to use (an R call collecting all of the following orders is sketched after this list):
ARMA has two orders: AR order \(p\), MA order \(q\);
ARIMA has one more order: order of difference \(d\);
seasonal ARMA has another two orders: seasonal AR order \(P\), seasonal MA order \(Q\);
SARIMA has one more order: order of seasonal difference \(D\).
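As a quick orientation, all six orders enter R's arima function in a single call. The sketch below is purely illustrative – the AirPassengers data set and the chosen orders are not part of the notes:

```r
# Illustrative SARIMA(1,1,1)x(0,1,1)_12 fit; x is a monthly ts object
x <- log(AirPassengers)                       # log stabilizes the variance
fit <- arima(x,
             order    = c(1, 1, 1),                           # p = 1, d = 1, q = 1
             seasonal = list(order = c(0, 1, 1), period = 12)) # P = 0, D = 1, Q = 1
fit
```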
Now we need sensible model-building strategies and model adequacy criteria. Thus we will honor the principle of parsimony: models should be as simple as possible but not simpler.
“All models are wrong but some are useful”
Also, recall that over-fitting and/or over-differencing leads to overly complicated models and an increased variance of the estimates. Thus we should first plot the data. Then we may transform it to stabilize the variance if necessary and possibly remove deterministic components (trends, seasonality) if appropriate. What remains is to model the stochastic component, which can be done by the following steps (an R sketch of the whole strategy follows the list):
differencing appropriately (until stationary, not too much, use unit root tests if uncertain);
starting with a simple model (white noise);
performing diagnostics of the residuals (ACF, PACF, Ljung–Box test);
identifying the orders and modifying the model (increasing its complexity) to remedy the deficiencies;
continuing until the residuals are approximately white noise;
possibly checking the model by slightly overfitting.
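A minimal R sketch of this iterative strategy follows; the data set, the transformation, and the final orders are illustrative choices, not prescribed by the notes:

```r
library(tseries)                       # for adf.test

x <- log(AirPassengers)                # log stabilizes the variance
plot(x)                                # always plot the data first

# difference until (approximately) stationary -- here one ordinary and
# one seasonal difference; a unit root test can support the decision
dx <- diff(diff(x), lag = 12)
adf.test(dx)                           # small p-value speaks against a (further) unit root

# inspect the correlation structure and start with a simple model
acf(dx); pacf(dx)
fit <- arima(x, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
tsdiag(fit)                            # residual ACF and Ljung-Box p-values

# if the residuals are not white noise, increase p or q and repeat;
# a slightly overfitted model (e.g. order = c(1, 1, 1)) can serve as a check
```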
Last, but not least, we use the model for prediction. A question that naturally arises is whether we should difference or not:
if the series is non-stationary, differencing may make it stationary;
if the series is stationary, differencing preserves stationarity but complicates the autocorrelation structure (it introduces non-invertibility, i.e. a unit root in the MA polynomial).
Recall that the random walk is an AR sequence with \(\arPoly(z) = 1 - z\), hence \(\arPoly\) has a unit root (\(\arPoly(1) = 0\)). The presence of a unit root among the roots of the AR polynomial implies non-stationarity and thus the need to difference. Hypothesis tests for a unit root are available, with the hypotheses:
null hypothesis \(\nullHypo\): \(1\) is a root;
alternative \(\altHypo\): all roots are outside the unit circle.
One can use, for example, the Dickey–Fuller test or the Phillips–Perron test; both rely on non-standard distributions of the test statistics (the data are non-stationary under the null hypothesis). In R they are available as PP.test (in the stats package) or as adf.test and pp.test in the package tseries.
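For example, on a simulated random walk (the simulated series is an illustration of mine, not an example from the notes):

```r
library(tseries)

set.seed(1)
rw <- cumsum(rnorm(200))          # random walk: AR polynomial 1 - z, unit root

adf.test(rw)                      # augmented Dickey-Fuller: H0 = unit root
pp.test(rw)                       # Phillips-Perron:         H0 = unit root
PP.test(rw)                       # base-R (stats) variant of the Phillips-Perron test

# large p-values: the unit root is not rejected -> difference the series
adf.test(diff(rw))                # after differencing, H0 is typically rejected
```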
11.2 Model diagnostics
For model diagnostics, we consider the standardized residuals \[
\estim {\rnd e}_t = \frac {\rX_t - \estim \rX_t(\estim{\vrr \beta})} {v_{t-1}(\estim{\vrr \beta})^{1/2}},
\] where
\(\estim{\vrr \beta}\) is an estimator (e.g. maximum likelihood) of all model parameters (AR, MA coefficients, white noise variance, possibly mean and covariates);
\(v_{t-1}(\estim{\vrr \beta})\) is the mean squared error of the prediction of \(\rX_t\) from \(\rX_1, \dots, \rX_{t-1}\).
If the right model is used, then for large \(n\) the residuals are approximately white noise, so we can plot their ACF to check for the absence of correlation or use portmanteau tests (the Ljung–Box test) to test the significance of the autocorrelations up to some lag (in R, use the function tsdiag). Also, check the residuals for approximate normality using a QQ plot, a histogram, etc. – normality is needed for prediction intervals.
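In R these checks can look as follows. The fitted model is an arbitrary illustration; the standardization by fit$sigma2 is an approximation, since the one-step prediction variances settle down to the innovation variance for large \(t\):

```r
fit <- arima(log(AirPassengers), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# approximately standardized residuals (prediction variances ~ sigma^2 for large t)
e <- residuals(fit) / sqrt(fit$sigma2)

acf(e)                                    # should show no significant correlation
Box.test(e, lag = 24, type = "Ljung-Box",
         fitdf = 2)                       # fitdf = number of fitted ARMA coefficients (here 2)
tsdiag(fit)                               # the same checks in one call

qqnorm(e); qqline(e)                      # approximate normality for prediction intervals
hist(e)
```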
11.3 Model order selection
We need an objective criterion that quantitatively measures model adequacy, because bigger models always provide a better fit – the likelihood or the least-squares criterion can only improve as the model grows. However, there is a price to pay: a more complex model increases the variance of the estimates. We therefore need a good compromise between fit and variance, penalizing for complexity – that is, we maximize \[
(\text{model fit}) - (\text{model complexity penalty}),
\] or minimize \[
(\text{model error}) + (\text{model complexity penalty}).
\] To perform this optimization, we search over a set of candidate models and optimize the selection criterion. We then need to choose the criterion itself – one of the most widespread is Akaike’s Information Criterion (\(\AIC\)), defined as \[
\AIC = -2 \loglikely(\estim{\vrr \beta}) + 2r,
\] where
\(\estim{\vrr \beta}\) denotes collectively all parameters (AR, MA coefficients, white noise variance, possibly mean and regression coefficients);
\(r = \dim (\vi \beta)\) is the number of parameters (e.g. \(r = 1 + 1 + p + q\) for an \(\ARMA{p}{q}\) model with mean);
\(\loglikely(\vrr \beta) = \log f(\rX_1, \dots, \rX_n; \vrr \beta)\) is the \(\log\)-likelihood.
Our goal is then to select the model that has the smallest value of \(\AIC\) (within a set of candidate models, e.g., with orders up to some limits).
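A small grid search over candidate orders might look as follows; this is a sketch with a fixed differencing order \(d = 1\) for all candidates and an illustrative data set:

```r
x <- log(AirPassengers)

# candidate ARIMA(p,1,q) models, p, q = 0, ..., 2, all with the same d
grid <- expand.grid(p = 0:2, q = 0:2)
grid$aic <- apply(grid, 1, function(o)
  tryCatch(AIC(arima(x, order = c(o["p"], 1, o["q"]))),
           error = function(e) NA_real_))   # skip fits that fail to converge

grid[which.min(grid$aic), ]                 # candidate with the smallest AIC
```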
Caution
Note that selection criteria may only be used to compare models estimated from the same data; in particular, do not compare models with different orders of differencing!
The use of pure \(\AIC\) is discouraged as it does not penalize large models enough. Hence the corrected \(\AIC\), denoted as \(\AICc\), is recommended \[
\AICc = -2 \loglikely(\estim{\vrr \beta}) + 2r \frac n {n - r - 1}.
\]
Also one can use \(\BIC\) (Bayesian Information Criterion, or Schwarz’s selection rule), which is defined as \[
\BIC = - 2\loglikely(\estim{\vrr \beta}) + r\log n,
\] which penalizes large models more strictly than \(\AIC\) (the penalty \(r \log n\) exceeds \(2r\) already for \(n \geq 8\)) and as such tends to select smaller models.
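All three criteria can be computed from a single fit. In the sketch below, r counts all estimated parameters including the white noise variance, and n is the effective sample size after differencing (an illustrative fit, not the notes' example):

```r
x   <- log(AirPassengers)
fit <- arima(x, order = c(1, 1, 1))

n  <- length(x) - 1                  # effective sample size after first differencing (d = 1)
r  <- length(coef(fit)) + 1          # AR/MA coefficients + white noise variance
ll <- as.numeric(logLik(fit))

aic  <- -2 * ll + 2 * r
aicc <- -2 * ll + 2 * r * n / (n - r - 1)
bic  <- -2 * ll + r * log(n)

c(AIC = aic, AICc = aicc, BIC = bic)
c(AIC(fit), BIC(fit))                # built-in AIC and BIC for comparison
```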
Tip
As a rule of thumb, \(\AIC\) or \(\AICc\) is recommended when the goal is forecasting, whereas \(\BIC\) is better suited for identifying the true model.
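For instance, automated order selection in the forecast package (an additional package, not mentioned in the notes) lets you choose the criterion explicitly:

```r
library(forecast)

auto.arima(log(AirPassengers), ic = "aicc")   # AICc: usual choice when forecasting
auto.arima(log(AirPassengers), ic = "bic")    # BIC: stricter penalty, smaller models
```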