Skip to main content
To assist the evaluation of hierarchical forecasting systems, we make available an evaluate function that can be used in combination with loss functions from utilsforecast.losses.

evaluate

evaluate(df, metrics, tags, models=None, train_df=None, level=None, id_col='unique_id', time_col='ds', target_col='y', agg_fn='mean', benchmark=None)
Evaluate hierarchical forecast using different metrics. Parameters:
NameTypeDescriptionDefault
dfpandas, polars, dask or spark DataFrameForecasts to evaluate. Must have id_col, time_col, target_col and modelsโ€™ predictions.required
metricslist of callableFunctions with arguments df, models, id_col, target_col and optionally train_df.required
tagsdictEach key is a level in the hierarchy and its value contains tags associated to that level. Each key is a level in the hierarchy and its value contains tags associated to that level.required
modelslist of strNames of the models to evaluate. If None will use every column in the dataframe after removing id, time and target.None
train_dfpandas, polars, dask or spark DataFrameTraining set. Used to evaluate metrics such as mase.None
levellist of intPrediction interval levels. Used to compute losses that rely on quantiles.None
id_colstrColumn that identifies each serie.โ€˜unique_idโ€™
time_colstrColumn that identifies each timestep, its values can be timestamps or integers.โ€˜dsโ€™
target_colstrColumn that contains the target.โ€˜yโ€™
agg_fnstrStatistic to compute on the scores by id to reduce them to a single number.โ€˜meanโ€™
benchmarkstrIf passed, evaluators are scaled by the error of this benchmark model.None
Returns:
TypeDescription
FrameTpandas, polars DataFrame: Metrics with one row per (id, metric) combination and one column per model. If agg_fn is not None, there is only one row per metric.

Example

import pandas as pd

from hierarchicalforecast.core import HierarchicalReconciliation
from hierarchicalforecast.methods import BottomUp, MinTrace
from hierarchicalforecast.utils import aggregate
from hierarchicalforecast.evaluation import evaluate
from statsforecast.core import StatsForecast
from statsforecast.models import AutoETS
from utilsforecast.losses import mase, rmse
from functools import partial

# Load TourismSmall dataset
df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/tourism.csv')
df = df.rename({'Trips': 'y', 'Quarter': 'ds'}, axis=1)
df.insert(0, 'Country', 'Australia')
qs = df['ds'].str.replace(r'(\d+) (Q\d)', r'\1-\2', regex=True)
df['ds'] = pd.PeriodIndex(qs, freq='Q').to_timestamp()

# Create hierarchical seires based on geographic levels and purpose
# And Convert quarterly ds string to pd.datetime format
hierarchy_levels = [['Country'],
                    ['Country', 'State'],
                    ['Country', 'Purpose'],
                    ['Country', 'State', 'Region'],
                    ['Country', 'State', 'Purpose'],
                    ['Country', 'State', 'Region', 'Purpose']]

Y_df, S_df, tags = aggregate(df=df, spec=hierarchy_levels)

# Split train/test sets
Y_test_df  = Y_df.groupby('unique_id').tail(8)
Y_train_df = Y_df.drop(Y_test_df.index)

# Compute base auto-ETS predictions
# Careful identifying correct data freq, this data quarterly 'Q'
fcst = StatsForecast(models=[AutoETS(season_length=4, model='ZZA')], freq='QS', n_jobs=-1)
Y_hat_df = fcst.forecast(df=Y_train_df, h=8, fitted=True)
Y_fitted_df = fcst.forecast_fitted_values()

reconcilers = [
                BottomUp(),
                MinTrace(method='ols'),
                MinTrace(method='mint_shrink'),
               ]
hrec = HierarchicalReconciliation(reconcilers=reconcilers)
Y_rec_df = hrec.reconcile(Y_hat_df=Y_hat_df,
                          Y_df=Y_fitted_df,
                          S_df=S_df, tags=tags)

# Evaluate
eval_tags = {}
eval_tags['Total'] = tags['Country']
eval_tags['Purpose'] = tags['Country/Purpose']
eval_tags['State'] = tags['Country/State']
eval_tags['Regions'] = tags['Country/State/Region']
eval_tags['Bottom'] = tags['Country/State/Region/Purpose']

Y_rec_df_with_y = Y_rec_df.merge(Y_test_df, on=['unique_id', 'ds'], how='left')
mase_p = partial(mase, seasonality=4)

evaluation = evaluate(Y_rec_df_with_y,
         metrics=[mase_p, rmse],
         tags=eval_tags,
         train_df=Y_train_df)

numeric_cols = evaluation.select_dtypes(include="number").columns
evaluation[numeric_cols] = evaluation[numeric_cols].map('{:.2f}'.format)

References