Cross-validation
Implement cross-validation to evaluate models on historical data
Time series cross-validation is a method for evaluating how a model would have performed on historical data. It works by defining a sliding window across past observations and predicting the period following it. It differs from standard cross-validation by maintaining the chronological order of the data instead of randomly splitting it.
This method allows for a better estimation of our model’s predictive capabilities by considering multiple periods. When only one window is used, it resembles a standard train-test split, where the test data is the last set of observations, and the training set consists of the earlier data.
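As a rough sketch of this windowing (plain Python; `cv_windows` is a hypothetical helper for illustration, not part of any library), each window trains on all observations up to a cutoff and is evaluated on the `h` observations that follow it:

```python
def cv_windows(n_obs, h, n_windows, step_size):
    """Return (cutoff, test_start, test_end) index triples, oldest window first.

    `cutoff` is the index of the last training observation; the window is
    evaluated on the half-open slice [test_start, test_end).
    """
    windows = []
    for i in range(n_windows - 1, -1, -1):
        test_end = n_obs - i * step_size  # the last window ends at the final observation
        test_start = test_end - h
        windows.append((test_start - 1, test_start, test_end))
    return windows

# 100 observations, horizon 10, 3 non-overlapping windows (step_size == h):
# the windows cover the last 30 observations, 10 at a time.
print(cv_windows(100, 10, 3, 10))
```

With a single window this reduces to the plain train-test split described above, where the test set is just the last `h` observations.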
The following graph showcases how time series cross-validation works.
In this tutorial, we’ll explain how to perform cross-validation in NeuralForecast.
Outline:

1. Install NeuralForecast
2. Load and plot the data
3. Train multiple models using cross-validation
4. Evaluate models and select the best for each series
5. Plot cross-validation results
Prerequisites
This guide assumes basic familiarity with neuralforecast. For a minimal example, visit the Quick Start.
The following line of code is used so that the output of cross-validation has the ids of the series as a column rather than as the index.
import os
os.environ['NIXTLA_ID_AS_COL'] = '1'
1. Install NeuralForecast
!pip install neuralforecast
2. Load and plot the data
We’ll use pandas to load the hourly dataset from the M4 Forecasting Competition, which has been stored in a parquet file for efficiency.
import pandas as pd
Y_df = pd.read_parquet('https://datasets-nixtla.s3.amazonaws.com/m4-hourly.parquet')
Y_df.head()
|   | unique_id | ds | y     |
|---|-----------|----|-------|
| 0 | H1        | 1  | 605.0 |
| 1 | H1        | 2  | 586.0 |
| 2 | H1        | 3  | 586.0 |
| 3 | H1        | 4  | 559.0 |
| 4 | H1        | 5  | 511.0 |
The input to neuralforecast should be a data frame in long format with three columns: unique_id, ds, and y.

- unique_id (string, int, or category): A unique identifier for each time series.
- ds (int or timestamp): An integer indexing time or a timestamp in format YYYY-MM-DD or YYYY-MM-DD HH:MM:SS.
- y (numeric): The target variable to forecast.
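For illustration, a minimal valid long-format frame (toy values, not from the M4 dataset) could be built like this:

```python
import pandas as pd

# Two toy series ('A' and 'B') with an integer time index and a numeric target.
df = pd.DataFrame({
    'unique_id': ['A', 'A', 'A', 'B', 'B', 'B'],
    'ds': [1, 2, 3, 1, 2, 3],
    'y': [10.0, 12.0, 11.0, 100.0, 95.0, 98.0],
})
print(df)
```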
This dataset contains 414 unique time series. To reduce the total execution time, we’ll use only the first 10.
uids = Y_df['unique_id'].unique()[:10] # Select 10 ids to make the example run faster
Y_df = Y_df.query('unique_id in @uids').reset_index(drop=True)
To plot the series, we’ll use the plot_series function from utilsforecast.plotting. utilsforecast is a dependency of neuralforecast, so it should already be installed.
from utilsforecast.plotting import plot_series
plot_series(Y_df)
3. Train multiple models using cross-validation
We’ll train different models from neuralforecast using the cross_validation method to decide which one performs best on the historical data. To do this, we need to import the NeuralForecast class and the models that we want to compare.
from neuralforecast import NeuralForecast
from neuralforecast.models import MLP, NBEATS, NHITS
In this tutorial, we will use neuralforecast's MLP, NBEATS, and NHITS models.
First, we need to create a list of models and then instantiate the NeuralForecast class. For each model, we’ll define the following hyperparameters:

- h: The forecast horizon. Here, we will use the same horizon as in the M4 competition, which was 48 steps ahead.
- input_size: The number of historical observations (lags) that the model uses to make predictions. In this case, it will be twice the forecast horizon.
- loss: The loss function to optimize. Here, we’ll use the Multi Quantile Loss (MQLoss) from neuralforecast.losses.pytorch.
Warning
The Multi Quantile Loss (MQLoss) is the sum of the quantile losses for each target quantile. The quantile loss for a single quantile measures how well a model has predicted a specific quantile of the actual distribution, penalizing overestimations and underestimations asymmetrically based on the quantile’s value. For more details see here.
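To make the asymmetry concrete, here is a small sketch of the quantile (pinball) loss for a single observation. `quantile_loss` is an illustrative helper, not the neuralforecast implementation:

```python
def quantile_loss(y, y_hat, q):
    """Pinball loss for one observation and one quantile q in (0, 1)."""
    diff = y - y_hat
    # Underestimation (diff > 0) is weighted by q; overestimation by (1 - q).
    return max(q * diff, (q - 1) * diff)

# For q = 0.9, underestimating by 10 is penalized nine times more
# than overestimating by 10.
print(quantile_loss(100, 90, 0.9))   # underestimate by 10
print(quantile_loss(100, 110, 0.9))  # overestimate by 10
```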
While there are other hyperparameters that can be defined for each model, we’ll use the default values for the purposes of this tutorial. To learn more about the hyperparameters of each model, please check out the corresponding documentation.
from neuralforecast.losses.pytorch import MQLoss
horizon = 48
models = [
    MLP(h=horizon, input_size=2*horizon, loss=MQLoss()),
    NBEATS(h=horizon, input_size=2*horizon, loss=MQLoss()),
    NHITS(h=horizon, input_size=2*horizon, loss=MQLoss()),
]
nf = NeuralForecast(models=models, freq=1)
Seed set to 1
Seed set to 1
Seed set to 1
The cross_validation method takes the following arguments:

- df: The data frame in the format described in section 2.
- n_windows (int): The number of windows to evaluate. Default is 1, and here we’ll use 3.
- step_size (int): The number of steps between consecutive windows to produce the forecasts. In this example, we’ll set step_size=horizon to produce non-overlapping forecasts. The following diagram shows how the forecasts are produced based on the step_size parameter and forecast horizon h of a model. In this diagram, step_size=2 and h=4.
- refit (bool or int): Whether to retrain models for each cross-validation window. If False, the models are trained at the beginning and then used to predict each window. If a positive integer, the models are retrained every refit windows. Default is False, but here we’ll use refit=1 so that the models are retrained after each window using the data with timestamps up to and including the cutoff.
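As a sketch of how step_size and h interact (plain Python; `forecast_ranges` is a hypothetical helper for illustration): with step_size smaller than h, consecutive forecast windows overlap, while step_size equal to h yields disjoint windows, as in this tutorial.

```python
def forecast_ranges(n_obs, h, n_windows, step_size):
    """Half-open index ranges [start, end) forecast in each window, oldest first."""
    ends = [n_obs - i * step_size for i in range(n_windows - 1, -1, -1)]
    return [(end - h, end) for end in ends]

# 12 observations, horizon 4, 3 windows.
print(forecast_ranges(12, 4, 3, 2))  # step_size=2 < h: windows overlap by two steps
print(forecast_ranges(12, 4, 3, 4))  # step_size == h: non-overlapping windows
```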
cv_df = nf.cross_validation(Y_df, n_windows=3, step_size=horizon, refit=1)
It’s worth mentioning that the default version of the cross_validation method in neuralforecast diverges from other libraries, where models are typically retrained at the start of each window. By default, it trains the models once and then uses them to generate predictions over all the windows, thus reducing the total execution time. For scenarios where the models need to be retrained, you can use the refit parameter to specify the number of windows after which the models should be retrained.
cv_df.head()
|   | unique_id | ds  | cutoff | MLP-median | MLP-lo-90  | MLP-lo-80  | MLP-hi-80  | MLP-hi-90  | NBEATS-median | NBEATS-lo-90 | NBEATS-lo-80 | NBEATS-hi-80 | NBEATS-hi-90 | NHITS-median | NHITS-lo-90 | NHITS-lo-80 | NHITS-hi-80 | NHITS-hi-90 | y     |
|---|-----------|-----|--------|------------|------------|------------|------------|------------|---------------|--------------|--------------|--------------|--------------|--------------|-------------|-------------|-------------|-------------|-------|
| 0 | H1        | 605 | 604    | 627.988586 | 534.398438 | 557.919495 | 706.230042 | 729.458984 | 622.185120    | 582.681030   | 598.783203   | 648.851318   | 653.490173   | 632.352661   | 579.588623  | 581.992249  | 641.553772  | 663.227234  | 622.0 |
| 1 | H1        | 606 | 604    | 567.843018 | 473.527313 | 503.057220 | 652.319458 | 702.797974 | 544.922485    | 489.482178   | 518.790222   | 572.878967   | 585.352661   | 553.002197   | 495.692871  | 504.136749  | 595.143494  | 616.993591  | 558.0 |
| 2 | H1        | 607 | 604    | 521.304260 | 412.619751 | 446.281494 | 598.887146 | 630.079163 | 496.092041    | 439.469055   | 450.099365   | 524.822876   | 523.129578   | 496.740112   | 449.902313  | 453.967438  | 539.902588  | 560.253723  | 513.0 |
| 3 | H1        | 608 | 604    | 488.616760 | 393.635559 | 403.897003 | 557.288452 | 600.689697 | 464.539612    | 409.801483   | 425.563171   | 483.017029   | 497.362915   | 455.020721   | 416.871094  | 420.002441  | 500.547150  | 507.826721  | 476.0 |
| 4 | H1        | 609 | 604    | 464.543854 | 339.080414 | 367.943787 | 536.813965 | 558.877808 | 444.119019    | 384.093872   | 409.416290   | 470.489075   | 474.619446   | 432.959076   | 389.460663  | 397.191345  | 465.580536  | 491.252747  | 449.0 |
The output of the cross_validation method is a data frame that includes the following columns:

- unique_id: The unique identifier for each time series.
- ds: The timestamp or temporal index.
- cutoff: The last timestamp or temporal index used in that cross-validation window.
- "model": Columns with the model’s point forecasts (median) and prediction intervals. By default, the 80% and 90% prediction intervals are included when using the MQLoss.
- y: The actual value.
4. Evaluate models and select the best for each series
To evaluate the point forecasts of the models, we’ll use the Root Mean Squared Error (RMSE), defined as the square root of the mean of the squared differences between the actual and the predicted values.
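Written out directly from that definition, RMSE can be sketched in plain Python (for illustration only; the tutorial itself uses the rmse function from utilsforecast):

```python
import math

def root_mean_squared_error(actual, predicted):
    """Square root of the mean of the squared differences."""
    sq_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(sq_errors) / len(sq_errors))

# Errors are 1, 0, and -2, so RMSE = sqrt((1 + 0 + 4) / 3) ≈ 1.29.
print(root_mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0]))
```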
For convenience, we’ll use the evaluate and rmse functions from utilsforecast.
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import rmse
The evaluate function takes the following arguments:

- df: The data frame with the forecasts to evaluate.
- metrics (list): The metrics to compute.
- models (list): Names of the models to evaluate. Default is None, which uses all columns after removing id_col, time_col, and target_col.
- id_col (str): Column that identifies the unique ids of the series. Default is unique_id.
- time_col (str): Column with the timestamps or the temporal index. Default is ds.
- target_col (str): Column with the target variable. Default is y.
Notice that if we use the default value of models, then we need to exclude the cutoff column from the cross-validation data frame.
evaluation_df = evaluate(cv_df.loc[:, cv_df.columns != 'cutoff'], metrics=[rmse])
For each unique id, we’ll select the model with the lowest RMSE.
evaluation_df['best_model'] = evaluation_df.drop(columns=['metric', 'unique_id']).idxmin(axis=1)
evaluation_df
|   | unique_id | metric | MLP-median  | NBEATS-median | NHITS-median | best_model    |
|---|-----------|--------|-------------|---------------|--------------|---------------|
| 0 | H1        | rmse   | 44.453354   | 48.828117     | 47.718341    | MLP-median    |
| 1 | H10       | rmse   | 24.375221   | 18.887296     | 17.938162    | NHITS-median  |
| 2 | H100      | rmse   | 165.888534  | 211.220029    | 200.504549   | MLP-median    |
| 3 | H101      | rmse   | 375.096472  | 201.191102    | 149.309690   | NHITS-median  |
| 4 | H102      | rmse   | 475.430266  | 321.459725    | 331.499041   | NBEATS-median |
| 5 | H103      | rmse   | 8552.597224 | 9091.500057   | 8169.006459  | NHITS-median  |
| 6 | H104      | rmse   | 187.017183  | 194.206979    | 148.196734   | NHITS-median  |
| 7 | H105      | rmse   | 349.070196  | 304.997747    | 312.518832   | NBEATS-median |
| 8 | H106      | rmse   | 196.937493  | 361.771205    | 357.442132   | MLP-median    |
| 9 | H107      | rmse   | 211.585091  | 236.404925    | 241.229147   | MLP-median    |
We can summarize the results to see how many times each model won.
summary_df = evaluation_df.groupby(['metric', 'best_model']).size().sort_values().to_frame()
summary_df = summary_df.reset_index()
summary_df.columns = ['metric', 'model', 'num. of unique_ids']
summary_df
|   | metric | model         | num. of unique_ids |
|---|--------|---------------|--------------------|
| 0 | rmse   | NBEATS-median | 2                  |
| 1 | rmse   | MLP-median    | 4                  |
| 2 | rmse   | NHITS-median  | 4                  |
With this information, we now know which model performs best for each series in the historical data.
5. Plot cross-validation results
To visualize the cross-validation results, we will use the plot_series function again. We’ll need to rename the y column in the cross-validation output to avoid duplicates with the original data frame. We’ll also exclude the cutoff column and use the max_insample_length argument to plot only the last 300 observations for better visualization.
cv_df.rename(columns={'y': 'actual'}, inplace=True) # rename actual values
plot_series(Y_df, cv_df.loc[:, cv_df.columns != 'cutoff'], max_insample_length=300)
To clarify the concept of cross-validation further, we’ll plot the forecasts generated at each cutoff for the series with unique_id='H1'. There are three cutoffs because we set n_windows=3. In this example, we used refit=1, so each model is retrained for each window using data with timestamps up to and including the respective cutoff. Additionally, since step_size is equal to the forecast horizon, the resulting forecasts are non-overlapping.
cutoff1, cutoff2, cutoff3 = cv_df['cutoff'].unique()
plot_series(Y_df, cv_df[cv_df['cutoff'] == cutoff1].loc[:, cv_df.columns != 'cutoff'], ids=['H1']) # use ids parameter to select specific series
plot_series(Y_df, cv_df[cv_df['cutoff'] == cutoff2].loc[:, cv_df.columns != 'cutoff'], ids=['H1'])
plot_series(Y_df, cv_df[cv_df['cutoff'] == cutoff3].loc[:, cv_df.columns != 'cutoff'], ids=['H1'])