Step-by-step guide on using the AutoARIMA Model with Statsforecast.
The statsforecast.models library provides the AutoARIMA function, an implementation of auto-ARIMA that automatically selects the optimal parameters of an ARIMA model for a given time series.
Parameter | Description
---|---
p | order of the autoregressive part
d | degree of first differencing involved
q | order of the moving average part
Model | p d q | Description
---|---|---
ARIMA (0,0,0) | 0 0 0 | White noise
ARIMA (0,1,0) | 0 1 0 | Random walk
ARIMA (0,2,0) | 0 2 0 | Constant
ARIMA (1,0,0) | 1 0 0 | AR(1): First-order autoregressive model
ARIMA (2,0,0) | 2 0 0 | AR(2): Second-order autoregressive model
ARIMA (1,1,0) | 1 1 0 | Differenced first-order autoregressive model
ARIMA (0,1,1) | 0 1 1 | Simple exponential smoothing
ARIMA (0,0,1) | 0 0 1 | MA(1): First-order moving average model
ARIMA (0,0,2) | 0 0 2 | MA(2): Second-order moving average model
ARIMA (1,0,1) | 1 0 1 | ARMA model
ARIMA (1,1,1) | 1 1 1 | ARIMA model
ARIMA (1,1,2) | 1 1 2 | Damped-trend linear exponential smoothing
ARIMA (0,2,1) or (0,2,2) | 0 2 1 | Linear exponential smoothing
Instead of selecting these parameters by hand, the AutoARIMA() function from statsforecast will do it for you automatically. For more information, see here.
Using the AutoARIMA() model to model and predict time series has several advantages, including:

- The AutoARIMA() function automates the ARIMA parameter selection process, which can save the user time and effort by eliminating the need to manually try different combinations of parameters.
- The AutoARIMA() function can identify complex patterns in the data that may be difficult to detect visually or with other time series modeling techniques.
- The AutoARIMA() function can help improve the efficiency and accuracy of time series modeling and forecasting, especially for users who are inexperienced with manual parameter selection for ARIMA models.
We used the Daily, Hourly and Weekly data from the M4 competition. The following table summarizes the results. As can be seen, our auto_arima is the best model in accuracy (measured by the MASE loss) and time, even compared with the original implementation in R.
dataset | metric | auto_arima_nixtla | auto_arima_pmdarima [1] | auto_arima_r | prophet |
---|---|---|---|---|---|
Daily | MASE | 3.26 | 3.35 | 4.46 | 14.26 |
Daily | time | 1.41 | 27.61 | 1.81 | 514.33 |
Hourly | MASE | 0.92 | — | 1.02 | 1.78 |
Hourly | time | 12.92 | — | 23.95 | 17.27 |
Weekly | MASE | 2.34 | 2.47 | 2.58 | 7.29 |
Weekly | time | 0.42 | 2.92 | 0.22 | 19.82 |
[1] auto_arima from pmdarima had a problem with the Hourly data; an issue was opened.
The following table summarizes the data details.
group | n_series | mean_length | std_length | min_length | max_length |
---|---|---|---|---|---|
Daily | 4,227 | 2,371 | 1,756 | 107 | 9,933 |
Hourly | 414 | 901 | 127 | 748 | 1,008 |
Weekly | 359 | 1,035 | 707 | 93 | 2,610 |
Tip: Statsforecast will be needed. To install it, see the instructions.

Next, we import plotting libraries and configure the plotting style.
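A minimal sketch of the setup, assuming the candy production series (IPG3113N) is available locally as candy_production.csv; the file name and plotting style are illustrative, not fixed by this guide:

```python
# Install (if needed): pip install statsforecast
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use("ggplot")  # any style; purely cosmetic

# Read the candy production series (IPG3113N); the file name is a placeholder
df = pd.read_csv("candy_production.csv")
df.head()
```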
 | observation_date | IPG3113N
---|---|---|
0 | 1972-01-01 | 85.6945 |
1 | 1972-02-01 | 71.8200 |
2 | 1972-03-01 | 66.0229 |
3 | 1972-04-01 | 64.5645 |
4 | 1972-05-01 | 65.0100 |
The input to StatsForecast is a data frame in long format with the following columns (a renaming sketch follows the list):

- unique_id (string, int or category): an identifier for the series.
- ds (datestamp): a column in a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp.
- y (numeric): the measurement we wish to forecast.
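A sketch of how the raw columns read above could be renamed into this format (the single series gets an arbitrary identifier of 1):

```python
# Rename the raw columns to the expected ds / y names and add a series identifier
df = df.rename(columns={"observation_date": "ds", "IPG3113N": "y"})
df["unique_id"] = 1
df.head()
```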
 | ds | y | unique_id
---|---|---|---|
0 | 1972-01-01 | 85.6945 | 1 |
1 | 1972-02-01 | 71.8200 | 1 |
2 | 1972-03-01 | 66.0229 | 1 |
3 | 1972-04-01 | 64.5645 | 1 |
4 | 1972-05-01 | 65.0100 | 1 |
We then convert the ds column from the object type to datetime.
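For example, with pandas:

```python
# Convert ds from object (string) to datetime
df["ds"] = pd.to_datetime(df["ds"])
```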
We split our time series into two sets:

1. Data to train our AutoArima model
2. Data to test our model

For the test data we will use the last 12 months to test and evaluate the performance of our model, as sketched below.
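A minimal sketch of the split, keeping the last 12 monthly observations for testing (variable names are illustrative):

```python
# Keep the last 12 months for testing; everything before that is training data
train = df.iloc[:-12]
test = df.iloc[-12:]
```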
We fit the model by instantiating a new StatsForecast object with the following parameters (a fitting sketch follows this list):

- freq: a string indicating the frequency of the data. (See pandas' available frequencies.)
- n_jobs: int, number of jobs used in the parallel processing; use -1 for all cores.
- fallback_model: a model to be used if a model fails.
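A minimal sketch of instantiating and fitting the model, assuming monthly data with yearly seasonality (season_length=12) and month-start frequency; depending on your statsforecast version, the training data frame may need to be passed to the constructor instead of fit:

```python
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

sf = StatsForecast(
    models=[AutoARIMA(season_length=12)],  # yearly seasonality for monthly data
    freq="MS",                             # month-start frequency
    n_jobs=-1,                             # use all available cores
)
sf.fit(df=train)
```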
After fitting, we can use the arima_string function to see the parameters that the model has found.
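A sketch of that call, assuming the StatsForecast object sf fitted above (the fitted_ attribute holds the fitted models):

```python
from statsforecast.arima import arima_string

# Show the (p,d,q)(P,D,Q)[s] order selected for the single series/model
arima_string(sf.fitted_[0, 0].model_)
```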
The selected model is ARIMA(4,0,3)(0,1,1)[12]. This means that the non-seasonal part of the model has order (4,0,3): four autoregressive terms, no differencing and three moving average terms. The seasonal part has order (0,1,1) with a period of 12: one seasonal difference and one seasonal moving average term.
To know the values of the terms of our model, we can use the following statement to inspect the full results of the fitted model. We then use the .get() function to extract the residuals and save them in a pd.DataFrame().
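A sketch of the residual extraction, reusing the fitted_ access shown above:

```python
# The fitted model is stored as a dict; .get("residuals") pulls out the residuals
residual = pd.DataFrame(
    sf.fitted_[0, 0].model_.get("residuals"),
    columns=["residual Model"],
)
residual
```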
 | residual Model
---|---|
0 | 0.085694 |
1 | 0.071820 |
2 | 0.066022 |
… | … |
533 | 1.615486 |
534 | -0.394285 |
535 | -6.733548 |
To generate forecasts we use the StatsForecast.forecast method instead of .fit and .predict.

The main difference is that .forecast does not store the fitted values and is highly scalable in distributed environments.

The forecast method takes two arguments: forecasts the next h (horizon) and level.
- h (int): represents the forecast h steps into the future. In this case, 12 months ahead.
- level (list of floats): this optional parameter is used for probabilistic forecasting. Set the level (or confidence percentile) of your prediction interval. For example, level=[90] means that the model expects the real value to be inside that interval 90% of the time.
(If you want to speed things up to a couple of seconds, remove AutoModels such as ARIMA and Theta.)
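A sketch of the forecast call that could produce the point forecasts shown below, reusing the StatsForecast object sf fitted above:

```python
# Point forecasts for the next 12 months
Y_hat_df = sf.forecast(df=train, h=12)
Y_hat_df.head()
```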
 | unique_id | ds | AutoARIMA
---|---|---|---|
0 | 1 | 2016-09-01 | 111.235874 |
1 | 1 | 2016-10-01 | 124.948376 |
2 | 1 | 2016-11-01 | 125.401639 |
3 | 1 | 2016-12-01 | 123.854826 |
4 | 1 | 2017-01-01 | 110.439451 |
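The in-sample fitted values shown next can be recovered by asking forecast to store them and then retrieving them; a sketch, assuming the fitted=True flag and the forecast_fitted_values method:

```python
# Keep the in-sample fitted values during forecasting, then retrieve them
sf.forecast(df=train, h=12, fitted=True)
fitted_df = sf.forecast_fitted_values()
fitted_df.head()
```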
 | unique_id | ds | y | AutoARIMA
---|---|---|---|---|
0 | 1 | 1972-01-01 | 85.6945 | 85.608806 |
1 | 1 | 1972-02-01 | 71.8200 | 71.748180 |
2 | 1 | 1972-03-01 | 66.0229 | 65.956878 |
… | … | … | … | … |
533 | 1 | 2016-06-01 | 102.4044 | 100.788914 |
534 | 1 | 2016-07-01 | 102.9512 | 103.345485 |
535 | 1 | 2016-08-01 | 104.6977 | 111.431248 |
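Adding the level argument to forecast yields the prediction-interval columns shown below; a sketch:

```python
# Point forecasts plus a 95% prediction interval
forecast_df = sf.forecast(df=train, h=12, level=[95])
forecast_df.head()
```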
 | unique_id | ds | AutoARIMA | AutoARIMA-lo-95 | AutoARIMA-hi-95
---|---|---|---|---|---|
0 | 1 | 2016-09-01 | 111.235874 | 104.140621 | 118.331128 |
1 | 1 | 2016-10-01 | 124.948376 | 116.244661 | 133.652090 |
2 | 1 | 2016-11-01 | 125.401639 | 115.882093 | 134.921185 |
… | … | … | … | … | … |
9 | 1 | 2017-06-01 | 98.304446 | 85.884572 | 110.724320 |
10 | 1 | 2017-07-01 | 99.630306 | 87.032356 | 112.228256 |
11 | 1 | 2017-08-01 | 105.426708 | 92.639159 | 118.214258 |
The predict method takes two arguments: forecasts the next h (for horizon) and level.
- h (int): represents the forecast h steps into the future. In this case, 12 months ahead.
- level (list of floats): this optional parameter is used for probabilistic forecasting. Set the level (or confidence percentile) of your prediction interval. For example, level=[95] means that the model expects the real value to be inside that interval 95% of the time.
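Since the model was fitted earlier, predict can be called directly; a sketch:

```python
# Forecast 12 months ahead with the already fitted model
Y_hat = sf.predict(h=12)
Y_hat.head()
```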
 | unique_id | ds | AutoARIMA
---|---|---|---|
0 | 1 | 2016-09-01 | 111.235874 |
1 | 1 | 2016-10-01 | 124.948376 |
2 | 1 | 2016-11-01 | 125.401639 |
… | … | … | … |
9 | 1 | 2017-06-01 | 98.304446 |
10 | 1 | 2017-07-01 | 99.630306 |
11 | 1 | 2017-08-01 | 105.426708 |
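Passing several levels returns one pair of lo/hi columns per level; a sketch:

```python
# Add 80% and 95% prediction intervals
forecast_df = sf.predict(h=12, level=[80, 95])
forecast_df.head()
```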
 | unique_id | ds | AutoARIMA | AutoARIMA-lo-95 | AutoARIMA-lo-80 | AutoARIMA-hi-80 | AutoARIMA-hi-95
---|---|---|---|---|---|---|---|
0 | 1 | 2016-09-01 | 111.235874 | 104.140621 | 106.596537 | 115.875211 | 118.331128 |
1 | 1 | 2016-10-01 | 124.948376 | 116.244661 | 119.257323 | 130.639429 | 133.652090 |
2 | 1 | 2016-11-01 | 125.401639 | 115.882093 | 119.177142 | 131.626136 | 134.921185 |
… | … | … | … | … | … | … | … |
9 | 1 | 2017-06-01 | 98.304446 | 85.884572 | 90.183527 | 106.425365 | 110.724320 |
10 | 1 | 2017-07-01 | 99.630306 | 87.032356 | 91.392949 | 107.867663 | 112.228256 |
11 | 1 | 2017-08-01 | 105.426708 | 92.639159 | 97.065379 | 113.788038 | 118.214258 |
We can join the forecast result with the historical data using pd.concat(), and then use this result for graphing.
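A sketch of the concatenation, assuming the history df and the interval forecasts forecast_df from the previous step (the table below displays only part of the history):

```python
# Stack the history and the forecasts, and index by date for plotting
df_plot = pd.concat([df, forecast_df]).set_index("ds")
df_plot
```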
ds | y | unique_id | AutoARIMA | AutoARIMA-lo-95 | AutoARIMA-lo-80 | AutoARIMA-hi-80 | AutoARIMA-hi-95
---|---|---|---|---|---|---|---
2000-05-01 | 108.7202 | 1 | NaN | NaN | NaN | NaN | NaN |
2000-06-01 | 114.2071 | 1 | NaN | NaN | NaN | NaN | NaN |
2000-07-01 | 111.8737 | 1 | NaN | NaN | NaN | NaN | NaN |
… | … | … | … | … | … | … | … |
2017-06-01 | NaN | 1 | 98.304446 | 85.884572 | 90.183527 | 106.425365 | 110.724320 |
2017-07-01 | NaN | 1 | 99.630306 | 87.032356 | 91.392949 | 107.867663 | 112.228256 |
2017-08-01 | NaN | 1 | 105.426708 | 92.639159 | 97.065379 | 113.788038 | 118.214258 |
We evaluate the model over the last 5 windows (n_windows=5), forecasting every 12 months (step_size=12). Depending on your computer, this step should take around 1 min.
The cross_validation method from the StatsForecast class takes the
following arguments.
- df: training data frame
- h (int): represents h steps into the future that are being forecasted. In this case, 12 months ahead.
- step_size (int): step size between each window. In other words: how often do you want to run the forecasting processes.
- n_windows (int): number of windows used for cross validation. In other words: what number of forecasting processes in the past do you want to evaluate.
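A sketch of the call, reusing the StatsForecast object sf from above:

```python
# 5 validation windows of 12 months each, rolling forward 12 months at a time
crossvalidation_df = sf.cross_validation(
    df=train,
    h=12,
    step_size=12,
    n_windows=5,
)
crossvalidation_df.head()
```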
The crossvalidation_df object is a new data frame that includes the following columns:

- unique_id: series identifier
- ds: datestamp or temporal index
- cutoff: the last datestamp or temporal index for the n_windows
- y: true value
- "model": columns with the model's name and fitted value.

 | unique_id | ds | cutoff | y | AutoARIMA
---|---|---|---|---|---
0 | 1 | 2011-09-01 | 2011-08-01 | 93.9062 | 105.235606 |
1 | 1 | 2011-10-01 | 2011-08-01 | 116.7634 | 118.739813 |
2 | 1 | 2011-11-01 | 2011-08-01 | 116.8258 | 114.572924 |
3 | 1 | 2011-12-01 | 2011-08-01 | 114.9563 | 114.991219 |
4 | 1 | 2012-01-01 | 2011-08-01 | 99.9662 | 100.133142 |
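The per-metric table below can be reproduced with an evaluation helper; this sketch assumes the evaluate function and loss functions from the utilsforecast package (not introduced earlier in this guide), with seasonality=12 for MASE:

```python
from functools import partial

from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mae, mape, mase, rmse, smape

# Evaluate the cross-validation results per metric; MASE also needs the training set
evaluation_df = evaluate(
    crossvalidation_df.drop(columns="cutoff"),
    metrics=[mae, mape, partial(mase, seasonality=12), rmse, smape],
    train_df=train,
)
evaluation_df
```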
 | unique_id | metric | AutoARIMA
---|---|---|---|
0 | 1 | mae | 5.012894 |
1 | 1 | mape | 0.045046 |
2 | 1 | mase | 0.967601 |
3 | 1 | rmse | 5.680362 |
4 | 1 | smape | 0.022673 |