Using scikit-learn pipelines
Leverage scikit-learn’s composability to define pipelines as models
mlforecast takes scikit-learn estimators as models, so you can also provide scikit-learn pipelines as models to apply further transformations to the data before passing it to the final estimator.
Data setup
from mlforecast.utils import generate_daily_series
series = generate_daily_series(5)
series.head()
| | unique_id | ds | y |
|---|---|---|---|
| 0 | id_0 | 2000-01-01 | 0.428973 |
| 1 | id_0 | 2000-01-02 | 1.423626 |
| 2 | id_0 | 2000-01-03 | 2.311782 |
| 3 | id_0 | 2000-01-04 | 3.192191 |
| 4 | id_0 | 2000-01-05 | 4.148767 |
Pipelines definition
Suppose that you want to use a linear regression model with lag1 and the day of the week as features. mlforecast returns the day of the week as a single column; however, that is not the best format for a linear regression model, which benefits more from having an indicator column for each day of the week (dropping one to avoid collinearity). We can achieve this by applying scikit-learn’s OneHotEncoder before fitting the linear regression model, as follows:
from mlforecast import MLForecast
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
fcst = MLForecast(
    models=[],
    freq='D',
    lags=[1],
    date_features=['dayofweek'],
)
X, y = fcst.preprocess(series, return_X_y=True)
X.head()
| | lag1 | dayofweek |
|---|---|---|
| 1 | 0.428973 | 6 |
| 2 | 1.423626 | 0 |
| 3 | 2.311782 | 1 |
| 4 | 3.192191 | 2 |
| 5 | 4.148767 | 3 |
This is what will be passed to our model, so we’d like to take the dayofweek column and perform one-hot encoding, leaving the lag1 column untouched. We can achieve that with the following:
ohe = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(drop='first'), ['dayofweek'])
    ],
    remainder='passthrough',
)
X_transformed = ohe.fit_transform(X)
X_transformed.shape
(1096, 7)
We can see that our data now has 7 columns, 1 for the lag plus 6 for the days of the week (we dropped the first one).
ohe.get_feature_names_out()
array(['encoder__dayofweek_1', 'encoder__dayofweek_2',
'encoder__dayofweek_3', 'encoder__dayofweek_4',
'encoder__dayofweek_5', 'encoder__dayofweek_6', 'remainder__lag1'],
dtype=object)
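The reason for dropping one indicator column can be checked numerically. The following is a minimal sketch on a small made-up dayofweek array (not the series above), assuming NumPy is available. It models the intercept that LinearRegression fits by default as a column of ones and compares the rank of the design matrix with and without the first indicator.

```python
import numpy as np

# Hypothetical toy data: two weeks of day-of-week codes (0..6).
dayofweek = np.tile(np.arange(7), 2)

# Full one-hot encoding: one indicator column per day (14 rows x 7 columns).
full = np.eye(7)[dayofweek]
# The same encoding with the first category dropped (14 rows x 6 columns).
dropped = full[:, 1:]

# LinearRegression fits an intercept by default, modelled here as a column of ones.
intercept = np.ones((len(dayofweek), 1))

# With all seven indicators, the indicators sum to the intercept column,
# so the design matrix cannot have full column rank.
rank_full = np.linalg.matrix_rank(np.hstack([intercept, full]))
rank_dropped = np.linalg.matrix_rank(np.hstack([intercept, dropped]))

print(rank_full, "of", 1 + full.shape[1], "columns are linearly independent")
print(rank_dropped, "of", 1 + dropped.shape[1], "columns are linearly independent")
```

When the rank is lower than the number of columns, the coefficients are not uniquely determined, which is what dropping one indicator avoids.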
Training
We can now build a pipeline that applies this transformation and then passes the result to our linear regression model.
model = make_pipeline(ohe, LinearRegression())
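Before handing the pipeline to mlforecast, it can be useful to check it on its own. The sketch below fits the same kind of pipeline on a made-up feature frame with lag1 and dayofweek columns (X_toy and y_toy are hypothetical, not the output of fcst.preprocess) and inspects the step names and the number of coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy features with the same column names as X above.
rng = np.random.default_rng(0)
n = 70
X_toy = pd.DataFrame({
    "lag1": rng.normal(size=n),
    "dayofweek": np.arange(n) % 7,
})
# Made-up target: linear in lag1 plus a day-of-week effect.
y_toy = 2.0 * X_toy["lag1"] + X_toy["dayofweek"]

ohe = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(drop="first"), ["dayofweek"])],
    remainder="passthrough",
)
model = make_pipeline(ohe, LinearRegression())
model.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name.
step_names = list(model.named_steps)
print(step_names)

# One coefficient per transformed column: six indicators followed by lag1
# (passthrough columns come after the encoded ones).
coefs = model.named_steps["linearregression"].coef_
print(coefs.shape)
```

The pipeline receives the raw feature frame and applies the encoder itself, so the same object can be passed to mlforecast unchanged.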
We then provide this pipeline as a model to mlforecast:
fcst = MLForecast(
    models={'ohe_lr': model},
    freq='D',
    lags=[1],
    date_features=['dayofweek'],
)
fcst.fit(series)
MLForecast(models=[ohe_lr], freq=<Day>, lag_features=['lag1'], date_features=['dayofweek'], num_threads=1)
Forecasting
Finally, we compute the forecasts.
fcst.predict(1)
| | unique_id | ds | ohe_lr |
|---|---|---|---|
| 0 | id_0 | 2000-08-10 | 4.312748 |
| 1 | id_1 | 2000-04-07 | 4.537019 |
| 2 | id_2 | 2000-06-16 | 4.160505 |
| 3 | id_3 | 2000-08-30 | 3.777040 |
| 4 | id_4 | 2001-01-08 | 2.676933 |
Summary
You can provide complex scikit-learn pipelines as models to mlforecast, which allows you to perform different transformations depending on the model and use any of scikit-learn’s compatible estimators.