Intermittent or sparse data has very few non-zero observations. This type of data is hard to forecast because the zero values increase the uncertainty about the underlying patterns in the data. Furthermore, once a non-zero observation occurs, there can be considerable variation in its size. Intermittent time series are common in many industries, including finance, retail, transportation, and energy. Given the ubiquity of this type of series, special methods have been developed to forecast them. The first was from Croston (1972), followed by several variants and by different aggregation frameworks.

StatsForecast has implemented several models to forecast intermittent time series. By the end of this tutorial, you’ll have a good understanding of these models and how to use them.

Outline:

  1. Install libraries
  2. Load and explore the data
  3. Train models for intermittent data
  4. Plot forecasts and compute accuracy

Tip

You can use Colab to run this Notebook interactively

Tip

For forecasting at scale, we recommend you check this notebook done on Databricks.

Install libraries

We assume that you have StatsForecast already installed. If not, check this guide for instructions on how to install StatsForecast

Install the necessary packages using pip install statsforecast

pip install statsforecast -U

Load and explore the data

For this example, we’ll use a subset of the M5 Competition dataset. Each time series represents the unit sales of a particular product in a given Walmart store. At this level (product-store), most of the data is intermittent. We first need to import the data, and to do that, we’ll need datasetsforecast.

pip install datasetsforecast -U
from datasetsforecast.m5 import M5

The function to load the data is M5.load(). It requieres the following argument: - directory: (str) The directory where the data will be downloaded.

This function returns multiple outputs, but only the first one with the unit sales is needed.

df_total, *_ = M5.load('./data')
df_total.head()
unique_iddsy
0FOODS_1_001_CA_12011-01-293.0
1FOODS_1_001_CA_12011-01-300.0
2FOODS_1_001_CA_12011-01-310.0
3FOODS_1_001_CA_12011-02-011.0
4FOODS_1_001_CA_12011-02-024.0

From this dataset, we’ll select the first 8 time series. You can select any number you want by changing the value of n_series.

n_series = 8 
uids = df_total['unique_id'].unique()[:n_series]
df = df_total.query('unique_id in @uids')

We can plot these series using the plot method from the StatsForecast class. This method has multiple parameters, and the required ones to generate the plots in this notebook are explained below.

  • df: A pandas dataframe with columns [unique_id, ds, y].
  • forecasts_df: A pandas dataframe with columns [unique_id, ds] and models.
  • plot_random: (bool = True) Plots the time series randomly.
  • max_insample_length: (int) The maximum number of train/insample observations to be plotted.
  • engine: (str = plotly). The library used to generate the plots. It can also be matplotlib for static plots.
from statsforecast import StatsForecast
StatsForecast.plot(df, plot_random = False, max_insample_length = 100)

Here we only plotted the last 100 observations, but we can visualize the complete history by removing max_insample_length. From these plots, we can confirm that the data is indeed intermittent since it has multiple periods with zero sales. In fact, in all cases but one, the median value is zero.

df.groupby('unique_id')[['y']].median().query('unique_id in @uids')
y
unique_id
FOODS_1_001_CA_10.0
FOODS_1_001_CA_21.0
FOODS_1_001_CA_30.0
FOODS_1_001_CA_40.0
FOODS_1_001_TX_10.0
FOODS_1_001_TX_20.0
FOODS_1_001_TX_30.0
FOODS_1_001_WI_10.0

Train models for intermittent data

Before training any model, we need to separate the data in a train and a test set. The M5 Competition used the last 28 days as test set, so we’ll do the same.

dates = df['ds'].unique()[-28:] # last 28 days

train = df.query('ds not in @dates')
test = df.query('ds in @dates')

StatsForecast has efficient implementations of multiple models for intermittent data. The complete list of models available is here. In this notebook, we’ll use:

To use these models, we first need to import them from statsforecast.models and then we need to instantiate them.

from statsforecast.models import (
    ADIDA,
    CrostonClassic, 
    IMAPA, 
    TSB
)

# Create a list of models and instantiation parameters 
models = [
    ADIDA(), 
    CrostonClassic(), 
    IMAPA(), 
    TSB(alpha_d = 0.2, alpha_p = 0.2)
]

To instantiate a new StatsForecast object, we need the following parameters:

  • df: The dataframe with the training data.
  • models: The list of models defined in the previous step.
  • freq: A string indicating the frequency of the data. See pandas’ available frequencies.
  • n_jobs: An integer that indicates the number of jobs used in parallel processing. Use -1 to select all cores.
sf = StatsForecast(
    df = train, 
    models = models, 
    freq = 'D', 
    n_jobs = -1
)

Now we’re ready to generate the forecast. To do this, we’ll use the forecast method, which requires the forecasting horizon (in this case, 28 days) as argument.

The models for intermittent series that are currently available in StatsForecast can only generate point-forecasts. If prediction intervals are needed, then a probabilisitic model should be used.

horizon = 28 
forecasts = sf.forecast(h=horizon)
forecasts = forecasts.reset_index()
forecasts.head()
unique_iddsADIDACrostonClassicIMAPATSB
0FOODS_1_001_CA_12016-05-230.7918520.8982470.7058350.434313
1FOODS_1_001_CA_12016-05-240.7918520.8982470.7058350.434313
2FOODS_1_001_CA_12016-05-250.7918520.8982470.7058350.434313
3FOODS_1_001_CA_12016-05-260.7918520.8982470.7058350.434313
4FOODS_1_001_CA_12016-05-270.7918520.8982470.7058350.434313

Finally, we’ll merge the forecast with the actual values.

test = test.merge(forecasts, how='left', on=['unique_id', 'ds'])

Plot forecasts and compute accuracy

We can generate plots using the plot described above.

StatsForecast.plot(train, test, plot_random = False, max_insample_length = 100)

To compute the accuracy of the forecasts, we’ll use the Mean Average Error (MAE), which is the sum of the absolute errors divided by the number of forecasts. We’ll create a function to compute the MAE, and for that, we’ll need to import numpy.

import numpy as np 

def mae(y_hat, y_true):
    return np.mean(np.abs(y_hat-y_true))
y_true = test['y'].values
adida_preds = test['ADIDA'].values
croston_preds = test['CrostonClassic'].values
imapa_preds = test['IMAPA'].values
tsb_preds = test['TSB'].values

print('ADIDA MAE: \t %0.3f' % mae(adida_preds, y_true))
print('Croston Classic MAE: \t %0.3f' % mae(croston_preds, y_true))
print('IMAPA MAE: \t %0.3f' % mae(imapa_preds, y_true))
print('TSB   MAE: \t %0.3f' % mae(tsb_preds, y_true))
ADIDA MAE:   0.949
Croston Classic MAE:     0.944
IMAPA MAE:   0.957
TSB   MAE:   1.023

Hence, on average, the forecasts are one unit off.

References

Croston, J. D. (1972). Forecasting and stock control for intermittent demands. Journal of the Operational Research Society, 23(3), 289-303.