The richness of the class of (S)AR(I)MA models is a double-edged sword – they are flexible but also difficult to use (an R call collecting all of the following orders is sketched after this list):
ARMA has two orders: AR order \(p\), MA order \(q\);
ARIMA has one more order: order of difference \(d\);
seasonal ARMA has another two orders: seasonal AR order \(P\), seasonal MA order \(Q\);
SARIMA has one more order: order of seasonal difference \(D\).
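As a quick orientation, all six orders enter R's arima function in a single call. The sketch below is purely illustrative – the AirPassengers data set and the chosen orders are not part of the notes:

```r
# Illustrative SARIMA(1,1,1)x(0,1,1)_12 fit; x is a monthly ts object
x <- log(AirPassengers)                       # log stabilizes the variance
fit <- arima(x,
             order    = c(1, 1, 1),                           # p = 1, d = 1, q = 1
             seasonal = list(order = c(0, 1, 1), period = 12)) # P = 0, D = 1, Q = 1
fit
```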
Now we need sensible model-building strategies and model adequacy criteria. Thus we will honor the principle of parsimony: models should be as simple as possible but not simpler.
“All models are wrong but some are useful”
Also, recall that over-fitting and/or over-differencing leads to overly complicated models and an increased variance of the estimates. Thus we should first plot the data. Then we may transform it to stabilize the variance if necessary and possibly remove deterministic components (trends, seasonality) if appropriate. What remains is to model the stochastic component, which can be done by the following steps (an R sketch of the whole strategy follows the list):
differencing appropriately (until stationary, not too much, use unit root tests if uncertain);
starting with a simple model (white noise);
performing diagnostics of the residuals (ACF, PACF, Ljung–Box test);
identifying the orders and modifying the model (increasing its complexity) to remedy the deficiencies;
continuing until the residuals are approximately white noise;
possibly checking the model by slightly overfitting.
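A minimal R sketch of this iterative strategy follows; the data set, the transformation, and the final orders are illustrative choices, not prescribed by the notes:

```r
library(tseries)                       # for adf.test

x <- log(AirPassengers)                # log stabilizes the variance
plot(x)                                # always plot the data first

# difference until (approximately) stationary -- here one ordinary and
# one seasonal difference; a unit root test can support the decision
dx <- diff(diff(x), lag = 12)
adf.test(dx)                           # small p-value speaks against a (further) unit root

# inspect the correlation structure and start with a simple model
acf(dx); pacf(dx)
fit <- arima(x, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
tsdiag(fit)                            # residual ACF and Ljung-Box p-values

# if the residuals are not white noise, increase p or q and repeat;
# a slightly overfitted model (e.g. order = c(1, 1, 1)) can serve as a check
```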
Last, but not least, we use the model for prediction. A question that naturally arises is whether we should difference or not:
if the series is non-stationary, differencing may make it stationary;
if the series is stationary, differencing preserves stationarity but complicates the autocorrelation structure (it introduces non-invertibility, i.e. a unit root in the MA polynomial).
Recall that the random walk is an AR sequence with \(\arPoly(z) = 1 - z\), hence \(\arPoly\) has a unit root (\(\arPoly(1) = 0\)). The presence of a unit root among the roots of the AR polynomial implies non-stationarity and thus the need to difference. Hypothesis tests for a unit root are available, with the hypotheses:
null hypothesis \(\nullHypo\): \(1\) is a root;
alternative \(\altHypo\): all roots are outside the unit circle.
One can use, for example, the Dickey–Fuller test or the Phillips–Perron test; both rely on non-standard distributions of the test statistics (the data are non-stationary under the null hypothesis). In R they are available as PP.test (in the stats package) or as adf.test and pp.test in the package tseries.
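For example, on a simulated random walk (the simulated series is an illustration of mine, not an example from the notes):

```r
library(tseries)

set.seed(1)
rw <- cumsum(rnorm(200))          # random walk: AR polynomial 1 - z, unit root

adf.test(rw)                      # augmented Dickey-Fuller: H0 = unit root
pp.test(rw)                       # Phillips-Perron:         H0 = unit root
PP.test(rw)                       # base-R (stats) variant of the Phillips-Perron test

# large p-values: the unit root is not rejected -> difference the series
adf.test(diff(rw))                # after differencing, H0 is typically rejected
```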
11.2 Model diagnostics
For model diagnostics, we consider the standardized residuals \[
\estim {\rnd e}_t = \frac {\rX_t - \estim \rX_t(\estim{\vrr \beta})} {v_{t-1}(\estim{\vrr \beta})^{1/2}},
\] where
\(\estim{\vrr \beta}\) is an estimator (e.g. maximum likelihood) of all model parameters (AR, MA coefficients, white noise variance, possibly mean and covariates);
\(v_{t-1}(\estim{\vrr \beta})\) is the mean squared error of the prediction of \(\rX_t\) from \(\rX_1, \dots, \rX_{t-1}\).
If the right model is used, then for large \(n\) the residuals are approximately white noise, so we can plot their ACF to check for the absence of correlation or use portmanteau tests (the Ljung–Box test) to test the significance of the autocorrelations up to some lag (in R, use the function tsdiag). Also, check the residuals for approximate normality using a QQ plot, a histogram, etc. – normality is needed for prediction intervals.
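In R these checks can look as follows. The fitted model is an arbitrary illustration; the standardization by fit$sigma2 is an approximation, since the one-step prediction variances settle down to the innovation variance for large \(t\):

```r
fit <- arima(log(AirPassengers), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# approximately standardized residuals (prediction variances ~ sigma^2 for large t)
e <- residuals(fit) / sqrt(fit$sigma2)

acf(e)                                    # should show no significant correlation
Box.test(e, lag = 24, type = "Ljung-Box",
         fitdf = 2)                       # fitdf = number of fitted ARMA coefficients (here 2)
tsdiag(fit)                               # the same checks in one call

qqnorm(e); qqline(e)                      # approximate normality for prediction intervals
hist(e)
```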
11.3 Model order selection
We need an objective criterion that quantitatively measures model adequacy, because bigger models always provide a better fit – the likelihood or the least-squares criterion can only improve as the model grows. However, there is a price to pay: a more complex model increases the variance of the estimates. We therefore need a good compromise between fit and variance, penalizing for complexity – that is, we maximize \[
(\text{model fit}) - (\text{model complexity penalty}),
\] or minimize \[
(\text{model error}) + (\text{model complexity penalty}).
\] To perform this optimization, we search over a set of candidate models and optimize the selection criterion. We then need to choose the criterion itself – one of the most widespread is Akaike’s Information Criterion (\(\AIC\)), defined as \[
\AIC = -2 \loglikely(\estim{\vrr \beta}) + 2r,
\] where
\(\estim{\vrr \beta}\) denotes collectively all parameters (AR, MA coefficients, white noise variance, possibly mean and regression coefficients);
\(r = \dim (\vi \beta)\) is the number of parameters (e.g. \(r = 1 + 1 + p + q\) for an \(\ARMA{p}{q}\) model with mean);
\(\loglikely(\vrr \beta) = \log f(\rX_1, \dots, \rX_n; \vrr \beta)\) is the \(\log\)-likelihood.
Our goal is then to select the model that has the smallest value of \(\AIC\) (within a set of candidate models, e.g., with orders up to some limits).
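A small grid search over candidate orders might look as follows; this is a sketch with a fixed differencing order \(d = 1\) for all candidates and an illustrative data set:

```r
x <- log(AirPassengers)

# candidate ARIMA(p,1,q) models, p, q = 0, ..., 2, all with the same d
grid <- expand.grid(p = 0:2, q = 0:2)
grid$aic <- apply(grid, 1, function(o)
  tryCatch(AIC(arima(x, order = c(o["p"], 1, o["q"]))),
           error = function(e) NA_real_))   # skip fits that fail to converge

grid[which.min(grid$aic), ]                 # candidate with the smallest AIC
```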
Caution
Note that selection criteria may only be used to compare models estimated from the same data; in particular, do not compare models with different orders of differencing!
The use of pure \(\AIC\) is discouraged as it does not penalize large models enough. Hence the corrected \(\AIC\), denoted as \(\AICc\), is recommended \[
\AICc = -2 \loglikely(\estim{\vrr \beta}) + 2r \frac n {n - r - 1}.
\]
Also one can use \(\BIC\) (Bayesian Information Criterion, or Schwarz’s selection rule), which is defined as \[
\BIC = - 2\loglikely(\estim{\vrr \beta}) + r\log n,
\] which penalizes large models more strictly than \(\AIC\) (the penalty \(r \log n\) exceeds \(2r\) already for \(n \geq 8\)) and as such tends to select smaller models.
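All three criteria can be computed from a single fit. In the sketch below, r counts all estimated parameters including the white noise variance, and n is the effective sample size after differencing (an illustrative fit, not the notes' example):

```r
x   <- log(AirPassengers)
fit <- arima(x, order = c(1, 1, 1))

n  <- length(x) - 1                  # effective sample size after first differencing (d = 1)
r  <- length(coef(fit)) + 1          # AR/MA coefficients + white noise variance
ll <- as.numeric(logLik(fit))

aic  <- -2 * ll + 2 * r
aicc <- -2 * ll + 2 * r * n / (n - r - 1)
bic  <- -2 * ll + r * log(n)

c(AIC = aic, AICc = aicc, BIC = bic)
c(AIC(fit), BIC(fit))                # built-in AIC and BIC for comparison
```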
Tip
As a rule of thumb, \(\AIC\) or \(\AICc\) is recommended when the goal is forecasting, whereas \(\BIC\) is better suited for identifying the true model.
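For instance, automated order selection in the forecast package (an additional package, not mentioned in the notes) lets you choose the criterion explicitly:

```r
library(forecast)

auto.arima(log(AirPassengers), ic = "aicc")   # AICc: usual choice when forecasting
auto.arima(log(AirPassengers), ic = "bic")    # BIC: stricter penalty, smaller models
```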