Data

This shows an example using just 4 series from the M4 dataset. If you want to run it yourself on all of them, you can refer to this notebook.

import random
import tempfile

import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xgboost as xgb
from datasetsforecast.m4 import M4, M4Info
from fastcore.test import test_eq
from sklearn.linear_model import LinearRegression
from utilsforecast.plotting import plot_series

from mlforecast import MLForecast
from mlforecast.lag_transforms import ExpandingMean, ExponentiallyWeightedMean, RollingMean
from mlforecast.lgb_cv import LightGBMCV
from mlforecast.target_transforms import Differences, LocalStandardScaler
from mlforecast.utils import PredictionIntervals, generate_daily_series

group = 'Hourly'
await M4.async_download('data', group=group)
df, *_ = M4.load(directory='data', group=group)
df['ds'] = df['ds'].astype('int')
ids = df['unique_id'].unique()
random.seed(0)
sample_ids = random.choices(ids, k=4)
sample_df = df[df['unique_id'].isin(sample_ids)]
sample_df
| | unique_id | ds | y |
|---|---|---|---|
| 86796 | H196 | 1 | 11.8 |
| 86797 | H196 | 2 | 11.4 |
| 86798 | H196 | 3 | 11.1 |
| 86799 | H196 | 4 | 10.8 |
| 86800 | H196 | 5 | 10.6 |
| … | … | … | … |
| 325235 | H413 | 1004 | 99.0 |
| 325236 | H413 | 1005 | 88.0 |
| 325237 | H413 | 1006 | 47.0 |
| 325238 | H413 | 1007 | 41.0 |
| 325239 | H413 | 1008 | 34.0 |

We now split this data into train and validation.

info = M4Info[group]
horizon = info.horizon
valid = sample_df.groupby('unique_id').tail(horizon)
train = sample_df.drop(valid.index)
train.shape, valid.shape
((3840, 3), (192, 3))


MLForecast

 MLForecast (models: Union[sklearn.base.BaseEstimator, List[sklearn.base.BaseEstimator], Dict[str, sklearn.base.BaseEstimator]],
             freq: Union[int, str],
             lags: Optional[Iterable[int]] = None,
             lag_transforms: Optional[Dict[int, List[Union[Callable, Tuple[Callable, Any]]]]] = None,
             date_features: Optional[Iterable[Union[str, Callable]]] = None,
             num_threads: int = 1,
             target_transforms: Optional[List[Union[mlforecast.target_transforms.BaseTargetTransform, mlforecast.target_transforms._BaseGroupedArrayTargetTransform]]] = None,
             lag_transforms_namer: Optional[Callable] = None)

Forecasting pipeline

| | Type | Default | Details |
|---|---|---|---|
| models | Union | | Models that will be trained and used to compute the forecasts. |
| freq | Union | | Pandas offset, pandas offset alias, e.g. 'D', 'W-THU', or integer denoting the frequency of the series. |
| lags | Optional | None | Lags of the target to use as features. |
| lag_transforms | Optional | None | Mapping of target lags to their transformations. |
| date_features | Optional | None | Features computed from the dates. Can be pandas date attributes or functions that will take the dates as input. |
| num_threads | int | 1 | Number of threads to use when computing the features. |
| target_transforms | Optional | None | Transformations that will be applied to the target before computing the features and restored after the forecasting step. |
| lag_transforms_namer | Optional | None | Function that takes a transformation (either function or class), a lag and extra arguments and produces a name. |

The MLForecast object encapsulates the feature engineering, model training, and forecasting steps.

fcst = MLForecast(
    models=lgb.LGBMRegressor(random_state=0, verbosity=-1),
    freq=1,
    lags=[24 * (i+1) for i in range(7)],
    lag_transforms={
        48: [ExponentiallyWeightedMean(alpha=0.3)],
    },
    num_threads=1,
    target_transforms=[Differences([24])],
)
fcst
MLForecast(models=[LGBMRegressor], freq=1, lag_features=['lag24', 'lag48', 'lag72', 'lag96', 'lag120', 'lag144', 'lag168', 'exponentially_weighted_mean_lag48_alpha0.3'], date_features=[], num_threads=1)
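
If the default feature names are too long you can supply a custom lag_transforms_namer. The following is a minimal sketch (the namer below is hypothetical; per the table above it receives the transformation, the lag and any extra arguments, and must return a name):

def short_namer(tfm, lag, *args):
    # use the class (or function) name plus the lag, e.g. 'ExponentiallyWeightedMean_lag48'
    name = getattr(tfm, '__name__', type(tfm).__name__)
    return f'{name}_lag{lag}'

fcst_named = MLForecast(
    models=lgb.LGBMRegressor(random_state=0, verbosity=-1),
    freq=1,
    lag_transforms={48: [ExponentiallyWeightedMean(alpha=0.3)]},
    lag_transforms_namer=short_namer,
)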

Once we have this setup we can compute the features and fit the model.



MLForecast.fit

 MLForecast.fit (df: Union[pandas.core.frame.DataFrame, polars.dataframe.frame.DataFrame],
                 id_col: str = 'unique_id',
                 time_col: str = 'ds',
                 target_col: str = 'y',
                 static_features: Optional[List[str]] = None,
                 dropna: bool = True,
                 keep_last_n: Optional[int] = None,
                 max_horizon: Optional[int] = None,
                 prediction_intervals: Optional[mlforecast.utils.PredictionIntervals] = None,
                 fitted: bool = False,
                 as_numpy: bool = False)

Apply the feature engineering and train the models.

| | Type | Default | Details |
|---|---|---|---|
| df | Union | | Series data in long format. |
| id_col | str | unique_id | Column that identifies each series. |
| time_col | str | ds | Column that identifies each timestep; its values can be timestamps or integers. |
| target_col | str | y | Column that contains the target. |
| static_features | Optional | None | Names of the features that are static and will be repeated when forecasting. If None, will consider all columns (except id_col and time_col) as static. |
| dropna | bool | True | Drop rows with missing values produced by the transformations. |
| keep_last_n | Optional | None | Keep only this many records from each series for the forecasting step. Can save time and memory if your features allow it. |
| max_horizon | Optional | None | Train this many models, where each model will predict a specific horizon. |
| prediction_intervals | Optional | None | Configuration to calibrate prediction intervals (Conformal Prediction). |
| fitted | bool | False | Save in-sample predictions. |
| as_numpy | bool | False | Cast features to numpy array. |
| Returns | MLForecast | | Forecast object with series values and trained models. |
fcst = MLForecast(
    models=lgb.LGBMRegressor(random_state=0, verbosity=-1),
    freq=1,
    lags=[24 * (i+1) for i in range(7)],
    lag_transforms={
        48: [ExponentiallyWeightedMean(alpha=0.3)],
    },
    num_threads=1,
    target_transforms=[Differences([24])],
)
fcst.fit(train, fitted=True);
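
If you want one model per horizon step (a direct strategy) instead of the default recursive one, you can pass max_horizon to fit. A minimal sketch reusing the setup above:

fcst_direct = MLForecast(
    models=lgb.LGBMRegressor(random_state=0, verbosity=-1),
    freq=1,
    lags=[24 * (i+1) for i in range(7)],
    target_transforms=[Differences([24])],
)
# trains `horizon` models, each one specialized on a specific step ahead
fcst_direct.fit(train, max_horizon=horizon)
preds_direct = fcst_direct.predict(horizon)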


MLForecast.save

 MLForecast.save (path:Union[str,pathlib.Path])

Save forecast object

| | Type | Details |
|---|---|---|
| path | Union | Directory where artifacts will be stored. |
| Returns | None | |


MLForecast.load

 MLForecast.load (path:Union[str,pathlib.Path])

Load forecast object

| | Type | Details |
|---|---|---|
| path | Union | Directory with saved artifacts. |
| Returns | MLForecast | |
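
For example, a save/load round trip (a minimal sketch using the tempfile module imported above):

with tempfile.TemporaryDirectory() as tmpdir:
    fcst.save(tmpdir)  # write the serialized pipeline to disk
    fcst_loaded = MLForecast.load(tmpdir)  # restore it
    preds_loaded = fcst_loaded.predict(horizon)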


MLForecast.update

 MLForecast.update (df: Union[pandas.core.frame.DataFrame, polars.dataframe.frame.DataFrame])

Update the values of the stored series.

| | Type | Details |
|---|---|---|
| df | Union | Dataframe with new observations. |
| Returns | None | |
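
As a sketch, we can append a made-up observation for the next timestep of each series. The values are placeholders, purely for illustration, so we re-fit afterwards because the examples below assume the training data hasn't changed:

new_rows = pd.DataFrame({
    'unique_id': ['H196', 'H256', 'H381', 'H413'],
    'ds': [961] * 4,
    'y': [16.0, 20.0, 18.0, 50.0],  # placeholder values
})
fcst.update(new_rows)
# predictions would now continue from ds=962; re-fit to restore the original state
fcst.fit(train, fitted=True);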


MLForecast.make_future_dataframe

 MLForecast.make_future_dataframe (h:int)

Create a dataframe with all ids and future times in the forecasting horizon.

| | Type | Details |
|---|---|---|
| h | int | Number of periods to predict. |
| Returns | Union | DataFrame with expected ids and future times. |
expected_future = fcst.make_future_dataframe(h=1)
expected_future
| | unique_id | ds |
|---|---|---|
| 0 | H196 | 961 |
| 1 | H256 | 961 |
| 2 | H381 | 961 |
| 3 | H413 | 961 |


MLForecast.get_missing_future

 MLForecast.get_missing_future (h:int, X_df:~DFType)

Get the missing id and time combinations in X_df.

| | Type | Details |
|---|---|---|
| h | int | Number of periods to predict. |
| X_df | DFType | Dataframe with the future exogenous features. Should have the id column and the time column. |
| Returns | DFType | DataFrame with expected ids and future times missing in X_df. |
missing_future = fcst.get_missing_future(h=1, X_df=expected_future.head(2))
pd.testing.assert_frame_equal(
    missing_future,
    expected_future.tail(2).reset_index(drop=True)
)


MLForecast.forecast_fitted_values

 MLForecast.forecast_fitted_values (level: Optional[List[Union[int, float]]] = None)

Access in-sample predictions.

| | Type | Default | Details |
|---|---|---|---|
| level | Optional | None | Confidence levels between 0 and 100 for prediction intervals. |
| Returns | Union | | Dataframe with predictions for the training set. |
fcst.forecast_fitted_values()
| | unique_id | ds | y | LGBMRegressor |
|---|---|---|---|---|
| 0 | H196 | 193 | 12.7 | 12.671271 |
| 1 | H196 | 194 | 12.3 | 12.271271 |
| 2 | H196 | 195 | 11.9 | 11.871271 |
| 3 | H196 | 196 | 11.7 | 11.671271 |
| 4 | H196 | 197 | 11.4 | 11.471271 |
| … | … | … | … | … |
| 3067 | H413 | 956 | 59.0 | 68.280574 |
| 3068 | H413 | 957 | 58.0 | 70.427570 |
| 3069 | H413 | 958 | 53.0 | 44.767965 |
| 3070 | H413 | 959 | 38.0 | 48.691257 |
| 3071 | H413 | 960 | 46.0 | 46.652238 |
fcst.forecast_fitted_values(level=[90])
| | unique_id | ds | y | LGBMRegressor | LGBMRegressor-lo-90 | LGBMRegressor-hi-90 |
|---|---|---|---|---|---|---|
| 0 | H196 | 193 | 12.7 | 12.671271 | 12.540634 | 12.801909 |
| 1 | H196 | 194 | 12.3 | 12.271271 | 12.140634 | 12.401909 |
| 2 | H196 | 195 | 11.9 | 11.871271 | 11.740634 | 12.001909 |
| 3 | H196 | 196 | 11.7 | 11.671271 | 11.540634 | 11.801909 |
| 4 | H196 | 197 | 11.4 | 11.471271 | 11.340634 | 11.601909 |
| … | … | … | … | … | … | … |
| 3067 | H413 | 956 | 59.0 | 68.280574 | 58.846640 | 77.714509 |
| 3068 | H413 | 957 | 58.0 | 70.427570 | 60.993636 | 79.861504 |
| 3069 | H413 | 958 | 53.0 | 44.767965 | 35.334031 | 54.201899 |
| 3070 | H413 | 959 | 38.0 | 48.691257 | 39.257323 | 58.125191 |
| 3071 | H413 | 960 | 46.0 | 46.652238 | 37.218304 | 56.086172 |

Once we’ve run this we’re ready to compute our predictions.



MLForecast.predict

 MLForecast.predict (h:int,
                     before_predict_callback:Optional[Callable]=None,
                     after_predict_callback:Optional[Callable]=None,
                     new_df:Optional[~DFType]=None,
                     level:Optional[List[Union[int,float]]]=None,
                     X_df:Optional[~DFType]=None,
                     ids:Optional[List[str]]=None)

Compute the predictions for the next h steps.

| | Type | Default | Details |
|---|---|---|---|
| h | int | | Number of periods to predict. |
| before_predict_callback | Optional | None | Function to call on the features before computing the predictions. This function will take the input dataframe that will be passed to the model for predicting and should return a dataframe with the same structure. The series identifier is on the index. |
| after_predict_callback | Optional | None | Function to call on the predictions before updating the targets. This function will take a pandas Series with the predictions and should return another one with the same structure. The series identifier is on the index. |
| new_df | Optional | None | Series data of new observations for which forecasts are to be generated. This dataframe should have the same structure as the one used to fit the model, including any features and time series data. If new_df is not None, the method will generate forecasts for the new observations. |
| level | Optional | None | Confidence levels between 0 and 100 for prediction intervals. |
| X_df | Optional | None | Dataframe with the future exogenous features. Should have the id column and the time column. |
| ids | Optional | None | List with subset of ids seen during training for which the forecasts should be computed. |
| Returns | DFType | | Predictions for each series and timestep, with one column per model. |
predictions = fcst.predict(horizon)

We can take a look at a couple of results.

results = valid.merge(predictions, on=['unique_id', 'ds'])
fig = plot_series(forecasts_df=results)
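
The callbacks in predict let you hook into the recursive loop. As a minimal sketch, a before_predict_callback that fills any missing feature values before they reach the model (the fill value is arbitrary, purely for illustration):

def fill_gaps(features_df):
    # receives the features for one step of the prediction loop and
    # must return a dataframe with the same structure
    return features_df.fillna(0.0)

preds_cb = fcst.predict(horizon, before_predict_callback=fill_gaps)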

Prediction intervals

With MLForecast, you can generate prediction intervals using Conformal Prediction. To configure Conformal Prediction, you need to pass an instance of the PredictionIntervals class to the prediction_intervals argument of the fit method. The class takes three parameters: n_windows, h and method.

  • n_windows represents the number of cross-validation windows used to calibrate the intervals
  • h is the forecast horizon
  • method can be conformal_distribution or conformal_error. conformal_distribution (the default) creates forecast paths based on the cross-validation errors and calculates quantiles using those paths, while conformal_error calculates the error quantiles to produce the prediction intervals. Both strategies adjust the intervals for each horizon step, resulting in different widths for each step. Please note that a minimum of 2 cross-validation windows must be used.
fcst.fit(
    train,
    prediction_intervals=PredictionIntervals(n_windows=3, h=48)
);

After that, you just have to pass your desired confidence levels to the predict method using the level argument. Levels must lie between 0 and 100.

predictions_w_intervals = fcst.predict(48, level=[50, 80, 95])
predictions_w_intervals.head()
| | unique_id | ds | LGBMRegressor | LGBMRegressor-lo-95 | LGBMRegressor-lo-80 | LGBMRegressor-lo-50 | LGBMRegressor-hi-50 | LGBMRegressor-hi-80 | LGBMRegressor-hi-95 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | H196 | 961 | 16.071271 | 15.958042 | 15.971271 | 16.005091 | 16.137452 | 16.171271 | 16.184501 |
| 1 | H196 | 962 | 15.671271 | 15.553632 | 15.553632 | 15.578632 | 15.763911 | 15.788911 | 15.788911 |
| 2 | H196 | 963 | 15.271271 | 15.153632 | 15.153632 | 15.162452 | 15.380091 | 15.388911 | 15.388911 |
| 3 | H196 | 964 | 14.971271 | 14.858042 | 14.871271 | 14.905091 | 15.037452 | 15.071271 | 15.084501 |
| 4 | H196 | 965 | 14.671271 | 14.553632 | 14.553632 | 14.562452 | 14.780091 | 14.788911 | 14.788911 |

Let’s explore the generated intervals.

results = valid.merge(predictions_w_intervals, on=['unique_id', 'ds'])
fig = plot_series(forecasts_df=results, level=[50, 80, 95])

If you want to reduce the computational time and produce intervals with the same width for the whole forecast horizon, simply pass h=1 to the PredictionIntervals class. The caveat of this strategy is that in some cases the variance of the absolute residuals may be small (even zero), so the intervals may be too narrow.

fcst.fit(
    train,  
    prediction_intervals=PredictionIntervals(n_windows=3, h=1)
);
predictions_w_intervals_ws_1 = fcst.predict(48, level=[80, 90, 95])

Let’s explore the generated intervals.

results = valid.merge(predictions_w_intervals_ws_1, on=['unique_id', 'ds'])
fig = plot_series(forecasts_df=results, level=[90])

Forecast using a pretrained model

MLForecast allows you to use a pretrained model to generate forecasts for a new dataset. Simply provide a pandas dataframe containing the new observations as the value for the new_df argument when calling the predict method. The dataframe should have the same structure as the one used to fit the model, including any features and time series data. The pretrained model will then be used to generate forecasts for the new observations, so you can apply it to a new dataset without retraining.

ercot_df = pd.read_csv('https://datasets-nixtla.s3.amazonaws.com/ERCOT-clean.csv')
# we have to convert the ds column to integers
# since MLForecast was trained with that structure
ercot_df['ds'] = np.arange(1, len(ercot_df) + 1)
# use the `new_df` argument to pass the ercot dataset 
ercot_fcsts = fcst.predict(horizon, new_df=ercot_df)
fig = plot_series(ercot_df, ercot_fcsts, max_insample_length=48 * 2)

If you want to take a look at the data that will be used to train the models you can call MLForecast.preprocess.



MLForecast.preprocess

 MLForecast.preprocess (df:~DFType, id_col:str='unique_id',
                        time_col:str='ds', target_col:str='y',
                        static_features:Optional[List[str]]=None,
                        dropna:bool=True, keep_last_n:Optional[int]=None,
                        max_horizon:Optional[int]=None,
                        return_X_y:bool=False, as_numpy:bool=False)

Add the features to data.

| | Type | Default | Details |
|---|---|---|---|
| df | DFType | | Series data in long format. |
| id_col | str | unique_id | Column that identifies each series. |
| time_col | str | ds | Column that identifies each timestep; its values can be timestamps or integers. |
| target_col | str | y | Column that contains the target. |
| static_features | Optional | None | Names of the features that are static and will be repeated when forecasting. |
| dropna | bool | True | Drop rows with missing values produced by the transformations. |
| keep_last_n | Optional | None | Keep only this many records from each series for the forecasting step. Can save time and memory if your features allow it. |
| max_horizon | Optional | None | Train this many models, where each model will predict a specific horizon. |
| return_X_y | bool | False | Return a tuple with the features and the target. If False will return a single dataframe. |
| as_numpy | bool | False | Cast features to numpy array. Only works for return_X_y=True. |
| Returns | Union | | df plus added features and target(s). |
prep_df = fcst.preprocess(train)
prep_df
| | unique_id | ds | y | lag24 | lag48 | lag72 | lag96 | lag120 | lag144 | lag168 | exponentially_weighted_mean_lag48_alpha0.3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 86988 | H196 | 193 | 0.1 | 0.0 | 0.0 | 0.0 | 0.3 | 0.1 | 0.1 | 0.3 | 0.002810 |
| 86989 | H196 | 194 | 0.1 | -0.1 | 0.1 | 0.0 | 0.3 | 0.1 | 0.1 | 0.3 | 0.031967 |
| 86990 | H196 | 195 | 0.1 | -0.1 | 0.1 | 0.0 | 0.3 | 0.1 | 0.2 | 0.1 | 0.052377 |
| 86991 | H196 | 196 | 0.1 | 0.0 | 0.0 | 0.0 | 0.3 | 0.2 | 0.1 | 0.2 | 0.036664 |
| 86992 | H196 | 197 | 0.0 | 0.0 | 0.0 | 0.1 | 0.2 | 0.2 | 0.1 | 0.2 | 0.025665 |
| … | … | … | … | … | … | … | … | … | … | … | … |
| 325187 | H413 | 956 | 0.0 | 10.0 | 1.0 | 6.0 | -53.0 | 44.0 | -21.0 | 21.0 | 7.963225 |
| 325188 | H413 | 957 | 9.0 | 10.0 | 10.0 | -7.0 | -46.0 | 27.0 | -19.0 | 24.0 | 8.574257 |
| 325189 | H413 | 958 | 16.0 | 8.0 | 5.0 | -9.0 | -36.0 | 32.0 | -13.0 | 8.0 | 7.501980 |
| 325190 | H413 | 959 | -3.0 | 17.0 | -7.0 | 2.0 | -31.0 | 22.0 | 5.0 | -2.0 | 3.151386 |
| 325191 | H413 | 960 | 15.0 | 11.0 | -6.0 | -5.0 | -17.0 | 22.0 | -18.0 | 10.0 | 0.405970 |

If we do this we then have to call MLForecast.fit_models, since preprocess only stores the series information.



MLForecast.fit_models

 MLForecast.fit_models (X: Union[pandas.core.frame.DataFrame, polars.dataframe.frame.DataFrame, numpy.ndarray],
                        y: numpy.ndarray)

Manually train models. Use this if you called MLForecast.preprocess beforehand.

| | Type | Details |
|---|---|---|
| X | Union | Features. |
| y | ndarray | Target. |
| Returns | MLForecast | Forecast object with trained models. |
X, y = prep_df.drop(columns=['unique_id', 'ds', 'y']), prep_df['y']
fcst.fit_models(X, y)
MLForecast(models=[LGBMRegressor], freq=1, lag_features=['lag24', 'lag48', 'lag72', 'lag96', 'lag120', 'lag144', 'lag168', 'exponentially_weighted_mean_lag48_alpha0.3'], date_features=[], num_threads=1)
predictions2 = fcst.predict(horizon)
pd.testing.assert_frame_equal(predictions, predictions2)


MLForecast.cross_validation

 MLForecast.cross_validation (df: ~DFType, n_windows: int, h: int,
                              id_col: str = 'unique_id',
                              time_col: str = 'ds',
                              target_col: str = 'y',
                              step_size: Optional[int] = None,
                              static_features: Optional[List[str]] = None,
                              dropna: bool = True,
                              keep_last_n: Optional[int] = None,
                              refit: Union[bool, int] = True,
                              max_horizon: Optional[int] = None,
                              before_predict_callback: Optional[Callable] = None,
                              after_predict_callback: Optional[Callable] = None,
                              prediction_intervals: Optional[mlforecast.utils.PredictionIntervals] = None,
                              level: Optional[List[Union[int, float]]] = None,
                              input_size: Optional[int] = None,
                              fitted: bool = False,
                              as_numpy: bool = False)

Perform time series cross validation. Creates n_windows splits where each window has h test periods, trains the models, computes the predictions and merges the actuals.

| | Type | Default | Details |
|---|---|---|---|
| df | DFType | | Series data in long format. |
| n_windows | int | | Number of windows to evaluate. |
| h | int | | Forecast horizon. |
| id_col | str | unique_id | Column that identifies each series. |
| time_col | str | ds | Column that identifies each timestep; its values can be timestamps or integers. |
| target_col | str | y | Column that contains the target. |
| step_size | Optional | None | Step size between each cross validation window. If None it will be equal to h. |
| static_features | Optional | None | Names of the features that are static and will be repeated when forecasting. |
| dropna | bool | True | Drop rows with missing values produced by the transformations. |
| keep_last_n | Optional | None | Keep only this many records from each series for the forecasting step. Can save time and memory if your features allow it. |
| refit | Union | True | Retrain model for each cross validation window. If False, the models are trained at the beginning and then used to predict each window. If a positive integer, the models are retrained every refit windows. |
| max_horizon | Optional | None | Train this many models, where each model will predict a specific horizon. |
| before_predict_callback | Optional | None | Function to call on the features before computing the predictions. This function will take the input dataframe that will be passed to the model for predicting and should return a dataframe with the same structure. The series identifier is on the index. |
| after_predict_callback | Optional | None | Function to call on the predictions before updating the targets. This function will take a pandas Series with the predictions and should return another one with the same structure. The series identifier is on the index. |
| prediction_intervals | Optional | None | Configuration to calibrate prediction intervals (Conformal Prediction). |
| level | Optional | None | Confidence levels between 0 and 100 for prediction intervals. |
| input_size | Optional | None | Maximum training samples per series in each window. If None, will use an expanding window. |
| fitted | bool | False | Store the in-sample predictions. |
| as_numpy | bool | False | Cast features to numpy array. |
| Returns | DFType | | Predictions for each window with the series id, timestamp, last train date, target value and predictions from each model. |

If we would like to know how good our forecast will be for a specific model and set of features we can perform cross validation. Cross validation takes our data and splits it in two parts, where the first part is used for training and the second one for validation. Since the data is time dependent we usually take the last x observations from our data as the validation set.

This process is implemented in MLForecast.cross_validation, which takes our data and performs the process described above n_windows times, where each window has h validation samples in it. For example, if we have 100 samples and we want to perform 2 backtests each of size 14, the splits will be as follows:

  1. Train: 1 to 72. Validation: 73 to 86.
  2. Train: 1 to 86. Validation: 87 to 100.

You can control the spacing between each cross validation window using the step_size argument. For example, if we have 100 samples and we want to perform 2 backtests each of size 14 and move one step ahead in each fold (step_size=1), the splits will be as follows:

  1. Train: 1 to 85. Validation: 86 to 99.
  2. Train: 1 to 86. Validation: 87 to 100.

You can also perform cross validation without refitting your models for each window by setting refit=False. This allows you to evaluate the performance of your models over multiple windows without having to retrain them each time.

fcst = MLForecast(
    models=lgb.LGBMRegressor(random_state=0, verbosity=-1),
    freq=1,
    lags=[24 * (i+1) for i in range(7)],
    lag_transforms={
        1: [RollingMean(window_size=24)],
        24: [RollingMean(window_size=24)],
        48: [ExponentiallyWeightedMean(alpha=0.3)],
    },
    num_threads=1,
    target_transforms=[Differences([24])],
)
cv_results = fcst.cross_validation(
    train,
    n_windows=2,
    h=horizon,
    step_size=horizon,
    fitted=True,
)
cv_results
| | unique_id | ds | cutoff | y | LGBMRegressor |
|---|---|---|---|---|---|
| 0 | H196 | 865 | 864 | 15.5 | 15.373393 |
| 1 | H196 | 866 | 864 | 15.1 | 14.973393 |
| 2 | H196 | 867 | 864 | 14.8 | 14.673393 |
| 3 | H196 | 868 | 864 | 14.4 | 14.373393 |
| 4 | H196 | 869 | 864 | 14.2 | 14.073393 |
| … | … | … | … | … | … |
| 379 | H413 | 956 | 912 | 59.0 | 64.284167 |
| 380 | H413 | 957 | 912 | 58.0 | 64.830429 |
| 381 | H413 | 958 | 912 | 53.0 | 40.726851 |
| 382 | H413 | 959 | 912 | 38.0 | 42.739657 |
| 383 | H413 | 960 | 912 | 46.0 | 52.802769 |

Since we set fitted=True we can access the predictions for the training sets as well with the cross_validation_fitted_values method.

fcst.cross_validation_fitted_values()
| | unique_id | ds | fold | y | LGBMRegressor |
|---|---|---|---|---|---|
| 0 | H196 | 193 | 0 | 12.7 | 12.673393 |
| 1 | H196 | 194 | 0 | 12.3 | 12.273393 |
| 2 | H196 | 195 | 0 | 11.9 | 11.873393 |
| 3 | H196 | 196 | 0 | 11.7 | 11.673393 |
| 4 | H196 | 197 | 0 | 11.4 | 11.473393 |
| … | … | … | … | … | … |
| 5563 | H413 | 908 | 1 | 49.0 | 50.620196 |
| 5564 | H413 | 909 | 1 | 39.0 | 35.972331 |
| 5565 | H413 | 910 | 1 | 29.0 | 29.359678 |
| 5566 | H413 | 911 | 1 | 24.0 | 25.784563 |
| 5567 | H413 | 912 | 1 | 20.0 | 23.168413 |

We can also compute prediction intervals by passing a configuration to prediction_intervals as well as the desired confidence levels through the level argument.

cv_results_intervals = fcst.cross_validation(
    train,
    n_windows=2,
    h=horizon,
    step_size=horizon,
    prediction_intervals=PredictionIntervals(h=horizon),
    level=[80, 90]
)
cv_results_intervals
| | unique_id | ds | cutoff | y | LGBMRegressor | LGBMRegressor-lo-90 | LGBMRegressor-lo-80 | LGBMRegressor-hi-80 | LGBMRegressor-hi-90 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | H196 | 865 | 864 | 15.5 | 15.373393 | 15.311379 | 15.316528 | 15.430258 | 15.435407 |
| 1 | H196 | 866 | 864 | 15.1 | 14.973393 | 14.940556 | 14.940556 | 15.006230 | 15.006230 |
| 2 | H196 | 867 | 864 | 14.8 | 14.673393 | 14.606230 | 14.606230 | 14.740556 | 14.740556 |
| 3 | H196 | 868 | 864 | 14.4 | 14.373393 | 14.306230 | 14.306230 | 14.440556 | 14.440556 |
| 4 | H196 | 869 | 864 | 14.2 | 14.073393 | 14.006230 | 14.006230 | 14.140556 | 14.140556 |
| … | … | … | … | … | … | … | … | … | … |
| 379 | H413 | 956 | 912 | 59.0 | 64.284167 | 29.890099 | 34.371545 | 94.196788 | 98.678234 |
| 380 | H413 | 957 | 912 | 58.0 | 64.830429 | 56.874572 | 57.827689 | 71.833169 | 72.786285 |
| 381 | H413 | 958 | 912 | 53.0 | 40.726851 | 35.296195 | 35.846206 | 45.607495 | 46.157506 |
| 382 | H413 | 959 | 912 | 38.0 | 42.739657 | 35.292153 | 35.807640 | 49.671674 | 50.187161 |
| 383 | H413 | 960 | 912 | 46.0 | 52.802769 | 42.465597 | 43.895670 | 61.709869 | 63.139941 |

The refit argument allows us to control if we want to retrain the models in every window. It can either be:

  • A boolean: True will retrain on every window and False only on the first one.
  • A positive integer: The models will be trained on the first window and then every refit windows.
fcst = MLForecast(
    models=LinearRegression(),
    freq=1,
    lags=[1, 24],
)
for refit, expected_models in zip([True, False, 2], [4, 1, 2]):
    fcst.cross_validation(
        train,
        n_windows=4,
        h=horizon,
        refit=refit,
    )
    test_eq(len(fcst.cv_models_), expected_models)
fig = plot_series(forecasts_df=cv_results.drop(columns='cutoff'))

fig = plot_series(forecasts_df=cv_results_intervals.drop(columns='cutoff'), level=[90])



MLForecast.from_cv

 MLForecast.from_cv (cv:mlforecast.lgb_cv.LightGBMCV)

Once you’ve found a set of features and parameters that work for your problem you can build a forecast object from it using MLForecast.from_cv, which takes the trained LightGBMCV object and builds an MLForecast object that will use the same features and parameters. Then you can call fit and predict as you normally would.

cv = LightGBMCV(
    freq=1,
    lags=[24 * (i+1) for i in range(7)],
    lag_transforms={
        48: [ExponentiallyWeightedMean(alpha=0.3)],
    },
    num_threads=1,
    target_transforms=[Differences([24])]
)
hist = cv.fit(
    train,
    n_windows=2,
    h=horizon,
    params={'verbosity': -1},
)
[LightGBM] [Info] Start training from score 0.084340
[10] mape: 0.118569
[20] mape: 0.111506
[30] mape: 0.107314
[40] mape: 0.106089
[50] mape: 0.106630
Early stopping at round 50
Using best iteration: 40
fcst = MLForecast.from_cv(cv)
assert cv.best_iteration_ == fcst.models['LGBMRegressor'].n_estimators
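
After that you can use the resulting object as you normally would, e.g.:

fcst.fit(train)
predictions_from_cv = fcst.predict(horizon)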