Step-by-step guide on using the AutoRegressive Model with StatsForecast
The autoregressive time series model (AutoRegressive) is a statistical technique used to analyze and predict univariate time series. In essence, the autoregressive model is based on the idea that previous values of the time series can be used to predict future values. In this model, the dependent variable (the time series) is regressed on itself at different moments in time, creating a dependency relationship between past and present values. The idea is that past values can help us understand and predict future values of the series.
The autoregressive model can be fitted with different orders, which indicate how many past values are used to predict the present value. For example, an autoregressive model of order 1 uses only the immediately previous value to predict the current value, while an autoregressive model of order p uses the p previous values.
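To make the order-p idea concrete, here is a minimal sketch (not from the tutorial) that fits an AR(2) model by ordinary least squares on synthetic data; the coefficients 0.6 and 0.3, the noise scale, and the series length are illustrative assumptions.

```python
import numpy as np

# Illustrative AR(2) process: y_t = 0.6*y_{t-1} + 0.3*y_{t-2} + noise
rng = np.random.default_rng(0)
n, p = 500, 2
y = np.zeros(n)
for t in range(p, n):
    y[t] = 0.6 * y[t - 1] + 0.3 * y[t - 2] + rng.normal(scale=0.5)

# Lag matrix: row t holds the p previous values [y_{t-1}, ..., y_{t-p}]
X = np.column_stack([y[p - k - 1 : n - k - 1] for k in range(p)])
target = y[p:]

# Ordinary least squares estimate of the AR coefficients
coefs, *_ = np.linalg.lstsq(X, target, rcond=None)

# One-step-ahead forecast from the last p observed values
forecast = coefs @ y[-1 : -p - 1 : -1]
print(coefs, forecast)
```

With enough data, the estimated coefficients recover the values used to generate the series.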
The autoregressive model is one of the basic models of time series analysis and is widely used in a variety of fields, from finance and economics to meteorology and the social sciences. The model's ability to capture linear dependencies in time series data makes it especially useful for forecasting and trend analysis.
In a multiple regression model, we forecast the variable of interest using a linear combination of predictors. In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable. The term autoregression indicates that it is a regression of the variable against itself.
Tip: StatsForecast will be needed. To install it, see the installation instructions.

Next, we import plotting libraries and configure the plotting style.
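A minimal setup sketch, assuming installation via pip and a matplotlib style of our own choosing; instead of reading the tutorial's data file, the snippet rebuilds the five preview rows shown below.

```python
# Hedged sketch: install with `pip install statsforecast`
# (see the StatsForecast installation instructions)
import matplotlib.pyplot as plt
import pandas as pd

plt.style.use("ggplot")  # any built-in style works; "ggplot" is an assumption

# The raw data has a Date column and a Total column, as in the preview below
df = pd.DataFrame(
    {
        "Date": ["1986-1-01", "1986-2-01", "1986-3-01", "1986-4-01", "1986-5-01"],
        "Total": [9034, 9596, 10558, 9002, 9239],
    }
)
print(df.head())
```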
| | Date | Total |
|---|---|---|
| 0 | 1986-1-01 | 9034 |
| 1 | 1986-2-01 | 9596 |
| 2 | 1986-3-01 | 10558 |
| 3 | 1986-4-01 | 9002 |
| 4 | 1986-5-01 | 9239 |
The input to StatsForecast is a data frame with three columns:

- `unique_id` (string, int or category): an identifier for the series.
- `ds` (datestamp): should be of a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp.
- `y` (numeric): the measurement we wish to forecast.
| | ds | y | unique_id |
|---|---|---|---|
| 0 | 1986-1-01 | 9034 | 1 |
| 1 | 1986-2-01 | 9596 | 1 |
| 2 | 1986-3-01 | 10558 | 1 |
| 3 | 1986-4-01 | 9002 | 1 |
| 4 | 1986-5-01 | 9239 | 1 |
Since `ds` is in an object format, we need to convert it to a datetime format.
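A small sketch of the conversion using `pd.to_datetime`; the two example rows are taken from the preview above.

```python
import pandas as pd

# df as prepared above: ds is stored as object (string) dtype
df = pd.DataFrame(
    {"ds": ["1986-1-01", "1986-2-01"], "y": [9034, 9596], "unique_id": [1, 1]}
)
assert df["ds"].dtype == object

# Convert ds from object to a proper datetime dtype
df["ds"] = pd.to_datetime(df["ds"])
print(df.dtypes)
```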
Dickey-Fuller test

The Augmented Dickey-Fuller test gives us a p-value of 0.488664, which means the null hypothesis cannot be rejected: the data in our series are not stationary.

We need to difference our time series in order to make the data stationary.
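Differencing can be done with pandas' `Series.diff`; a minimal sketch using the first five observations from the preview above (after differencing, the Dickey-Fuller test can be re-run to confirm stationarity).

```python
import pandas as pd

y = pd.Series([9034, 9596, 10558, 9002, 9239], dtype=float)

# First difference: y_t - y_{t-1}; the first value becomes NaN
y_diff = y.diff()
print(y_diff.tolist())
```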
We split our data into two sets: 1. data to train our AutoRegressive model, and 2. data to test our model.

For the test data we will use the last 12 months to test and evaluate the performance of our model.
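A sketch of the split, assuming a 312-month series ending in December 2011 (consistent with the tables further below); the `y` values here are placeholders.

```python
import pandas as pd

# Monthly series; the exact values of y are illustrative placeholders
ds = pd.date_range("1986-01-01", periods=312, freq="MS")
df = pd.DataFrame({"unique_id": 1, "ds": ds, "y": range(312)})

# Hold out the last 12 months for testing
train = df.iloc[:-12]
test = df.iloc[-12:]
print(len(train), len(test))
```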
- `freq:` a string indicating the frequency of the data. (See pandas' available frequencies.)
- `n_jobs:` int, number of jobs used in the parallel processing; use -1 for all cores.
- `fallback_model:` a model to be used if a model fails.
We use the `.get()` function to extract the element and then save it in a `pd.DataFrame()`.
| | residual Model |
|---|---|
| 0 | -11998.537347 |
| 1 | NaN |
| 2 | NaN |
| … | … |
| 309 | -2718.312961 |
| 310 | -1306.795172 |
| 311 | -2713.284999 |
You can use the `StatsForecast.forecast` method instead of `.fit` and `.predict`. The main difference is that `.forecast` does not store the fitted values and is highly scalable in distributed environments.
The forecast method takes two arguments: the forecast horizon `h` and `level`.

- `h (int):` represents the forecast h steps into the future. In this case, 12 months ahead.
- `level (list of floats):` this optional parameter is used for probabilistic forecasting. Set the level (or confidence percentile) of your prediction interval. For example, `level=[90]` means that the model expects the real value to be inside that interval 90% of the time.
(StatsForecast also offers other models, such as ARIMA and Theta.)
| | unique_id | ds | AutoRegressive |
|---|---|---|---|
| 0 | 1 | 2012-01-01 | 15905.582031 |
| 1 | 1 | 2012-02-01 | 13597.894531 |
| 2 | 1 | 2012-03-01 | 15488.883789 |
| … | … | … | … |
| 9 | 1 | 2012-10-01 | 14087.901367 |
| 10 | 1 | 2012-11-01 | 13274.105469 |
| 11 | 1 | 2012-12-01 | 12498.226562 |
| | unique_id | ds | y | AutoRegressive |
|---|---|---|---|---|
| 0 | 1 | 1986-01-01 | 9034.0 | 21032.537109 |
| 1 | 1 | 1986-02-01 | 9596.0 | NaN |
| 2 | 1 | 1986-03-01 | 10558.0 | NaN |
| 3 | 1 | 1986-04-01 | 9002.0 | 126172.937500 |
| 4 | 1 | 1986-05-01 | 9239.0 | 10020.040039 |
| | unique_id | ds | AutoRegressive | AutoRegressive-lo-95 | AutoRegressive-hi-95 |
|---|---|---|---|---|---|
| 0 | 1 | 2012-01-01 | 15905.582031 | 2119.586426 | 29691.578125 |
| 1 | 1 | 2012-02-01 | 13597.894531 | -188.101135 | 27383.890625 |
| 2 | 1 | 2012-03-01 | 15488.883789 | 1702.888062 | 29274.878906 |
| … | … | … | … | … | … |
| 9 | 1 | 2012-10-01 | 14087.901367 | -1050.068359 | 29225.871094 |
| 10 | 1 | 2012-11-01 | 13274.105469 | -1886.973145 | 28435.183594 |
| 11 | 1 | 2012-12-01 | 12498.226562 | -2675.547607 | 27672.001953 |
| | ds | y | unique_id | AutoRegressive |
|---|---|---|---|---|
| 0 | 2012-01-01 | 13427 | 1 | 15905.582031 |
| 1 | 2012-02-01 | 14447 | 1 | 13597.894531 |
| 2 | 2012-03-01 | 14717 | 1 | 15488.883789 |
| … | … | … | … | … |
| 9 | 2012-10-01 | 13795 | 1 | 14087.901367 |
| 10 | 2012-11-01 | 13352 | 1 | 13274.105469 |
| 11 | 2012-12-01 | 12716 | 1 | 12498.226562 |
The predict method takes two arguments: `h` (for horizon) and `level`.

- `h (int):` represents the forecast h steps into the future. In this case, 12 months ahead.
- `level (list of floats):` this optional parameter is used for probabilistic forecasting. Set the level (or confidence percentile) of your prediction interval. For example, `level=[95]` means that the model expects the real value to be inside that interval 95% of the time.
| | unique_id | ds | AutoRegressive |
|---|---|---|---|
| 0 | 1 | 2012-01-01 | 15905.582031 |
| 1 | 1 | 2012-02-01 | 13597.894531 |
| 2 | 1 | 2012-03-01 | 15488.883789 |
| … | … | … | … |
| 9 | 1 | 2012-10-01 | 14087.901367 |
| 10 | 1 | 2012-11-01 | 13274.105469 |
| 11 | 1 | 2012-12-01 | 12498.226562 |
| | unique_id | ds | AutoRegressive | AutoRegressive-lo-95 | AutoRegressive-hi-95 |
|---|---|---|---|---|---|
| 0 | 1 | 2012-01-01 | 15905.582031 | 2119.586426 | 29691.578125 |
| 1 | 1 | 2012-02-01 | 13597.894531 | -188.101135 | 27383.890625 |
| 2 | 1 | 2012-03-01 | 15488.883789 | 1702.888062 | 29274.878906 |
| … | … | … | … | … | … |
| 9 | 1 | 2012-10-01 | 14087.901367 | -1050.068359 | 29225.871094 |
| 10 | 1 | 2012-11-01 | 13274.105469 | -1886.973145 | 28435.183594 |
| 11 | 1 | 2012-12-01 | 12498.226562 | -2675.547607 | 27672.001953 |
We perform cross-validation of our time series model using the last 5 windows (n_windows=5), forecasting every 12 months (step_size=12). Depending on your computer, this step should take around 1 min.
The cross_validation method from the StatsForecast class takes the following arguments:

- `df:` training data frame
- `h (int):` represents the h steps into the future that are being forecasted. In this case, 12 months ahead.
- `step_size (int):` step size between each window. In other words: how often do you want to run the forecasting process.
- `n_windows (int):` number of windows used for cross-validation. In other words: the number of forecasting processes in the past you want to evaluate.
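To illustrate how `h`, `step_size`, and `n_windows` interact, here is a conceptual sketch (not the library call) that lays out the rolling-origin windows on a monthly index; the date range is an assumption chosen to match a 312-month series.

```python
import pandas as pd

# Rolling-origin cross-validation layout, simplified:
# each window trains up to a cutoff and forecasts the next h months
ds = pd.date_range("1986-01-01", periods=312, freq="MS")
h, step_size, n_windows = 12, 12, 5

# index of the first test observation in each window
cutoff_idx = sorted(len(ds) - h - step_size * i for i in range(n_windows))
for c in cutoff_idx:
    print(
        f"cutoff={ds[c - 1].date()}  "
        f"test window {ds[c].date()} .. {ds[c + h - 1].date()}"
    )
```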
The cross-validation output is a data frame with the following columns:

- `unique_id:` series identifier
- `ds:` datestamp or temporal index
- `cutoff:` the last datestamp or temporal index for the n_windows
- `y:` true value
- `"model":` columns with the model's name and fitted value

| | unique_id | ds | cutoff | y | AutoRegressive |
|---|---|---|---|---|---|
| 0 | 1 | 2009-01-01 | 2008-12-01 | 19262.0 | 24295.837891 |
| 1 | 1 | 2009-02-01 | 2008-12-01 | 20658.0 | 23993.947266 |
| 2 | 1 | 2009-03-01 | 2008-12-01 | 22660.0 | 21201.121094 |
| … | … | … | … | … | … |
| 57 | 1 | 2011-10-01 | 2010-12-01 | 12893.0 | 19349.708984 |
| 58 | 1 | 2011-11-01 | 2010-12-01 | 11843.0 | 16899.849609 |
| 59 | 1 | 2011-12-01 | 2010-12-01 | 11321.0 | 18159.574219 |
| | metric | AutoRegressive |
|---|---|---|
| 0 | mae | 962.023763 |
| 1 | mape | 0.072733 |
| 2 | mase | 0.601808 |
| 3 | rmse | 1195.013050 |
| 4 | smape | 0.034858 |
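As a sanity check, MAE and RMSE can be recomputed by hand; this sketch uses only the first three test months from the comparison table above, so the numbers will not match the full 12-month metrics.

```python
import numpy as np

# y and AutoRegressive values taken from the comparison table above
y_true = np.array([13427.0, 14447.0, 14717.0])
y_hat = np.array([15905.582031, 13597.894531, 15488.883789])

mae = np.mean(np.abs(y_true - y_hat))           # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))  # root mean squared error
print(mae, rmse)
```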