Leverage scikit-learn’s composability to define pipelines as models
mlforecast accepts scikit-learn estimators as models, which means you can provide scikit-learn pipelines as models in order to apply further transformations to the data before passing it to the model.
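The reason this works is that a scikit-learn pipeline exposes the same fit/predict interface as a plain estimator. As a minimal standalone sketch (toy data; the StandardScaler step is an illustrative choice, not part of this tutorial):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A pipeline has fit() and predict(), just like a plain estimator,
# so anything that accepts an estimator can accept a pipeline.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0  # exact linear relation: y = 2x + 1

model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)
pred = model.predict(np.array([[10.0]]))  # close to 21.0
```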
Data setup
from mlforecast.utils import generate_daily_series
series = generate_daily_series(5)
series.head()
| | unique_id | ds | y |
|---|---|---|---|
| 0 | id_0 | 2000-01-01 | 0.428973 |
| 1 | id_0 | 2000-01-02 | 1.423626 |
| 2 | id_0 | 2000-01-03 | 2.311782 |
| 3 | id_0 | 2000-01-04 | 3.192191 |
| 4 | id_0 | 2000-01-05 | 4.148767 |
Pipelines definition
Suppose that you want to use a linear regression model with lag 1 and the day of the week as features. mlforecast returns the day of the week as a single column; however, that's not the optimal format for a linear regression model, which benefits more from having indicator columns for each day of the week (dropping one to avoid collinearity). We can achieve this by using scikit-learn's
OneHotEncoder
and then fitting our linear regression model, which we can do in the following way:
from mlforecast import MLForecast
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
fcst = MLForecast(
models=[],
freq='D',
lags=[1],
date_features=['dayofweek']
)
X, y = fcst.preprocess(series, return_X_y=True)
X.head()
| | lag1 | dayofweek |
|---|---|---|
| 1 | 0.428973 | 6 |
| 2 | 1.423626 | 0 |
| 3 | 2.311782 | 1 |
| 4 | 3.192191 | 2 |
| 5 | 4.148767 | 3 |
This is what will be passed to our model, so we'd like to take the
dayofweek column and perform one-hot encoding on it, leaving the lag1
column untouched. We can achieve that with the following:
ohe = ColumnTransformer(
transformers=[
('encoder', OneHotEncoder(drop='first'), ['dayofweek'])
],
remainder='passthrough',
)
X_transformed = ohe.fit_transform(X)
X_transformed.shape
We can see that our data now has 7 columns, 1 for the lag plus 6 for the
days of the week (we dropped the first one).
ohe.get_feature_names_out()
array(['encoder__dayofweek_1', 'encoder__dayofweek_2',
'encoder__dayofweek_3', 'encoder__dayofweek_4',
'encoder__dayofweek_5', 'encoder__dayofweek_6', 'remainder__lag1'],
dtype=object)
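As a standalone sanity check of the column count (plain scikit-learn, toy data), one-hot encoding 7 distinct day-of-week values with drop='first' yields 6 indicator columns:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy input: the 7 possible day-of-week values, 0 through 6.
days = np.arange(7).reshape(-1, 1)
encoded = OneHotEncoder(drop="first").fit_transform(days)
# 7 rows and 6 indicator columns: one per day except the dropped first
# category, which is represented by an all-zeros row.
```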
Training
We can now build a pipeline that does this and then passes it to our
linear regression model.
model = make_pipeline(ohe, LinearRegression())
We then provide this pipeline as a model to mlforecast:
fcst = MLForecast(
models={'ohe_lr': model},
freq='D',
lags=[1],
date_features=['dayofweek']
)
fcst.fit(series)
MLForecast(models=[ohe_lr], freq=<Day>, lag_features=['lag1'], date_features=['dayofweek'], num_threads=1)
Forecasting
Finally, we compute the forecasts.
fcst.predict(1)
| | unique_id | ds | ohe_lr |
|---|---|---|---|
| 0 | id_0 | 2000-08-10 | 4.312748 |
| 1 | id_1 | 2000-04-07 | 4.537019 |
| 2 | id_2 | 2000-06-16 | 4.160505 |
| 3 | id_3 | 2000-08-30 | 3.777040 |
| 4 | id_4 | 2001-01-08 | 2.676933 |
Summary
You can provide complex scikit-learn pipelines as models to mlforecast,
which allows you to perform different transformations depending on the
model and to use any scikit-learn-compatible estimator.