module hierarchicalforecast.evaluation
function mse
Mean squared error (MSE) between y and y_hat. MSE measures the relative prediction accuracy of a forecasting method by computing the squared deviation between the prediction and the true value at each time step, and averaging these deviations over the length of the series.
Args:
- y (np.ndarray): Actual values.
- y_hat (np.ndarray): Predicted values.
- weights (Optional[np.ndarray], optional): Specifies which timestamps of each series to consider in the loss. Default is None.
- axis (Optional[int], optional): Axis along which to compute the metric. Default is None.
Returns:
- Union[float, np.ndarray]: The loss, either a single value or a numpy array.
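The averaging described above can be illustrated with a minimal NumPy sketch. The name mse_sketch is hypothetical and this is not the library's implementation; it only assumes that weights and axis behave like the corresponding arguments of np.average.

```python
import numpy as np

def mse_sketch(y, y_hat, weights=None, axis=None):
    """Illustrative weighted mean of squared deviations (not the library code)."""
    squared_dev = np.square(y - y_hat)
    # np.average falls back to a plain mean when weights is None.
    return np.average(squared_dev, weights=weights, axis=axis)

y = np.array([[1.0, 2.0], [3.0, 4.0]])
y_hat = np.array([[1.0, 4.0], [0.0, 4.0]])

overall = mse_sketch(y, y_hat)             # one value for the whole array
per_series = mse_sketch(y, y_hat, axis=1)  # one value per series
first_only = mse_sketch(y, y_hat, weights=np.array([[1.0, 1.0], [0.0, 0.0]]))
```

Passing zero weights for a series, as in the last call, excludes that series from the average.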
function mqloss
Multi-quantile loss (MQL) between y and y_hat. MQL calculates the average multi-quantile loss for a given set of quantiles, based on the absolute difference between predicted quantiles and observed values.
In the limit, MQL measures the accuracy of a full predictive distribution through the continuous ranked probability score (CRPS). This is achieved with a numerical integration technique that discretizes the quantiles and treats the CRPS integral with a left Riemann approximation, averaging over uniformly spaced quantiles.
Args:
- y (np.ndarray): Actual values.
- y_hat (np.ndarray): Predicted values.
- quantiles (np.ndarray): Quantiles between 0 and 1 on which to perform the evaluation, of size (n_quantiles).
- weights (Optional[np.ndarray], optional): Specifies which timestamps of each series to consider in the loss. Default is None.
- axis (Optional[int], optional): Axis along which to compute the metric. Default is None.
Returns:
- Union[float, np.ndarray]: The loss, either a single value or a numpy array.
- Roger Koenker and Gilbert Bassett, Jr., “Regression Quantiles”.
- James E. Matheson and Robert L. Winkler, “Scoring Rules for Continuous Probability Distributions”.
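As a rough illustration of the multi-quantile loss, the sketch below averages the standard quantile (pinball) loss over all entries and quantiles. The name mqloss_sketch is hypothetical, and the assumption that y_hat carries the quantiles on its last axis is ours; the library's actual handling of weights and axis may differ.

```python
import numpy as np

def mqloss_sketch(y, y_hat, quantiles, weights=None, axis=None):
    """Illustrative average pinball loss; y_hat has shape y.shape + (n_quantiles,)."""
    diff = y[..., None] - y_hat  # positive when the forecast is too low
    # Pinball loss: q * diff if diff >= 0, (1 - q) * (-diff) otherwise.
    loss = np.maximum(quantiles * diff, (quantiles - 1.0) * diff)
    if weights is not None:
        # Repeat the per-observation weights across the quantile axis.
        weights = np.broadcast_to(weights[..., None], loss.shape)
    return np.average(loss, weights=weights, axis=axis)

y = np.array([[10.0]])
y_hat = np.array([[[8.0, 10.0, 13.0]]])  # forecasts for q = 0.1, 0.5, 0.9
quantiles = np.array([0.1, 0.5, 0.9])
mql = mqloss_sketch(y, y_hat, quantiles)
```

With a dense, uniformly spaced grid of quantiles, twice this average approximates the CRPS, which is the limit behavior mentioned above.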
function rel_mse
Args:
- y (np.ndarray): Actual values of size (n_series,horizon).
- y_hat (np.ndarray): Predicted values of size (n_series,horizon).
- y_train (np.ndarray): Training values.
- mask (Optional[np.ndarray], optional): Specifies which timestamps of each series to consider in the loss. Default is None.
Returns:
- float: loss.
- Hyndman, R. J and Koehler, A. B. (2006). “Another look at measures of forecast accuracy”. International Journal of Forecasting, Volume 22, Issue 4.
- Kin G. Olivares, O. Nganba Meetei, Ruijun Ma, Rohan Reddy, Mengfei Cao, Lee Dicker. “Probabilistic Hierarchical Forecasting with Deep Poisson Mixtures”. International Journal of Forecasting, Volume 40, Issue 2.
function msse
In this metric, n is the size of the training data (the second dimension of y_train) and h is the forecasting horizon (horizon).
Args:
- y (np.ndarray): Actual values of size (n_series,horizon).
- y_hat (np.ndarray): Predicted values of size (n_series,horizon).
- y_train (np.ndarray): Training values of size (n_series,n).
- mask (Optional[np.ndarray], optional): Specifies which timestamps of each series to consider in the loss. Default is None.
Returns:
- float: loss.
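The roles of n and horizon can be seen in a short sketch of a mean squared scaled error in the sense of Hyndman and Koehler (2006): the forecast MSE over the horizon divided by the in-sample MSE of the one-step naive forecast. The name msse_sketch and the pooling across all series are assumptions made for illustration; the library's implementation may aggregate or regularize differently.

```python
import numpy as np

def msse_sketch(y, y_hat, y_train, mask=None):
    """Illustrative scaled error: forecast MSE over in-sample naive MSE."""
    if mask is None:
        mask = np.ones_like(y, dtype=float)
    # Forecast error over the horizon, counting only the masked-in entries.
    forecast_mse = np.sum(mask * (y - y_hat) ** 2) / np.sum(mask)
    # One-step naive forecast on the training data: predict the previous value.
    naive_mse = np.mean((y_train[:, 1:] - y_train[:, :-1]) ** 2)
    return float(forecast_mse / naive_mse)

y_train = np.array([[1.0, 2.0, 3.0, 4.0]])  # (n_series, n)
y = np.array([[5.0, 6.0]])                  # (n_series, horizon)
y_hat = np.array([[5.0, 8.0]])
score = msse_sketch(y, y_hat, y_train)
```

A value below 1 means the forecast beats the in-sample naive benchmark on average; the sketch does not guard against a constant training series, where the denominator is zero.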
function scaled_crps
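Scaled continuous ranked probability score (sCRPS), which measures the accuracy of the predicted quantiles y_hat compared to the observation y.
This metric averages percentage-weighted absolute deviations as defined by the quantile losses.
Here the forecast is an estimated multivariate distribution, and the observations y are its realizations.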
Args:
- y (np.ndarray): Actual values of size (n_series,horizon).
- y_hat (np.ndarray): Predicted quantiles of size (n_series,horizon,n_quantiles).
- quantiles (np.ndarray): Quantiles to estimate from the distribution of y, of size (n_quantiles).
Returns:
- float: loss.
- Gneiting, Tilmann. (2011). “Quantiles as optimal point forecasts”. International Journal of Forecasting.
- Spyros Makridakis, Evangelos Spiliotis, Vassilios Assimakopoulos, Zhi Chen, Anil Gaba, Ilia Tsetlin, Robert L. Winkler. (2022). “The M5 uncertainty competition: Results, findings and conclusions”. International Journal of Forecasting.
- Syama Sundar Rangapuram, Lucien D Werner, Konstantinos Benidis, Pedro Mercado, Jan Gasthaus, Tim Januschowski. (2021). “End-to-End Learning of Coherent Probabilistic Forecasts for Hierarchical Time Series”. Proceedings of the 38th International Conference on Machine Learning (ICML).
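One way to read "percentage-weighted absolute deviations" is sketched below: the quantile-averaged pinball loss is summed over all series and horizon steps, doubled to approximate the CRPS, and divided by the total absolute value of the observations. The name scaled_crps_sketch and this exact normalization are assumptions for illustration, not a statement of how the library computes the score.

```python
import numpy as np

def scaled_crps_sketch(y, y_hat, quantiles):
    """Illustrative scaled CRPS: 2 * summed mean pinball loss / sum(|y|)."""
    diff = y[..., None] - y_hat                    # (n_series, horizon, n_quantiles)
    pinball = np.maximum(quantiles * diff, (quantiles - 1.0) * diff)
    crps_approx = 2.0 * pinball.mean(axis=-1)      # average over quantiles
    eps = np.finfo(float).eps                      # avoids division by zero
    return float(crps_approx.sum() / (np.abs(y).sum() + eps))

y = np.array([[10.0]])
y_hat = np.array([[[8.0, 10.0, 13.0]]])
quantiles = np.array([0.1, 0.5, 0.9])
scrps = scaled_crps_sketch(y, y_hat, quantiles)
```

Dividing by the sum of absolute observations makes the score comparable across hierarchy levels whose series differ greatly in scale.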
function energy_score
Energy score between the observations y and independent multivariate samples y_sample1 and y_sample2. The energy score generalizes the CRPS (beta=1) to the multivariate setting.
Here y_sample1 and y_sample2 are independent samples drawn from the predictive distribution.
Args:
- y (np.ndarray): Actual values of size (n_series,horizon).
- y_sample1 (np.ndarray): Predictive distribution sample of size (n_series,horizon,n_samples).
- y_sample2 (np.ndarray): Predictive distribution sample of size (n_series,horizon,n_samples).
- beta (float, optional): Float in (0,2], defines the energy score's power for the Euclidean metric. Default is 2.
Returns:
- float: score.
- Gneiting, Tilmann, and Adrian E. Raftery. (2007). “Strictly proper scoring rules, prediction and estimation”. Journal of the American Statistical Association.
- Anastasios Panagiotelis, Puwasala Gamakumara, George Athanasopoulos, Rob J. Hyndman. (2022). “Probabilistic forecast reconciliation: Properties, evaluation and score optimisation”. European Journal of Operational Research.
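The sample-based form of the energy score from Gneiting and Raftery (2007) can be sketched as the expected distance between a sample and the observation, minus half the expected distance between two independent samples. The name energy_score_sketch and the choice to take the Euclidean norm across the series axis are assumptions for illustration; the library may differ in details.

```python
import numpy as np

def energy_score_sketch(y, y_sample1, y_sample2, beta=2.0):
    """Illustrative Monte Carlo energy score (lower is better)."""
    if not 0 < beta <= 2:
        raise ValueError("beta must lie in (0, 2]")
    # Euclidean norm across series (axis 0) per horizon step and sample.
    dist_to_obs = np.linalg.norm(y[:, :, None] - y_sample1, axis=0) ** beta
    dist_between = np.linalg.norm(y_sample1 - y_sample2, axis=0) ** beta
    return float(np.mean(dist_to_obs - 0.5 * dist_between))

y = np.array([[0.0]])            # (n_series, horizon)
sample1 = np.array([[[2.0]]])    # (n_series, horizon, n_samples)
sample2 = np.array([[[3.0]]])
es_beta1 = energy_score_sketch(y, sample1, sample2, beta=1.0)
es_beta2 = energy_score_sketch(y, sample1, sample2, beta=2.0)
```

The second sample set only enters through the spread term, which rewards predictive distributions that are not overly concentrated.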
function log_score
Args:
- y (np.ndarray): Actual values of size (n_series,horizon).
- y_hat (np.ndarray): Predicted values of size (n_series,horizon).
- cov (np.ndarray): Covariance of the predicted values, of size (n_series,n_series,horizon).
- allow_singular (bool, optional): If True, allows a singular covariance. Default is True.
Returns:
- float: score.
function evaluate
Args:
- df (pandas, polars, dask or spark DataFrame): Forecasts to evaluate. Must have id_col, time_col, target_col and models' predictions.
- metrics (list of callable): Functions with arguments df, models, id_col, target_col and optionally train_df.
- tags (dict): Each key is a level in the hierarchy and its value contains tags associated to that level.
- models (list of str, optional): Names of the models to evaluate. If None, every column in the dataframe is used after removing id, time and target.
- train_df (pandas, polars, dask or spark DataFrame, optional): Training set. Used to evaluate metrics such as mase.
- level (list of int, optional): Prediction interval levels. Used to compute losses that rely on quantiles.
- id_col (str): Column that identifies each series.
- time_col (str): Column that identifies each timestep; its values can be timestamps or integers.
- target_col (str): Column that contains the target.
- agg_fn (str, optional): Statistic to compute on the scores by id to reduce them to a single number.
- benchmark (str, optional): If passed, evaluators are scaled by the error of this benchmark model.
Returns:
- pandas or polars DataFrame: Metrics with one row per (id, metric) combination and one column per model. If agg_fn is not None, there is only one row per metric.
class HierarchicalEvaluation
Hierarchical Evaluation Class.
You can use your own metrics to evaluate the performance of each level in the structure. The metrics receive y and y_hat as arguments and they are numpy arrays of size (series, horizon). Consider, for example, the function rmse that calculates the root mean squared error.
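A custom metric of this kind might look like the following sketch. It assumes y and y_hat are arrays of size (series, horizon) as described above, computes one root mean squared error per series, and averages them; whether to average per-series values or pool all errors is a design choice made here for illustration.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error per series, averaged across series."""
    per_series = np.sqrt(np.mean((y - y_hat) ** 2, axis=1))
    return float(np.mean(per_series))

y = np.zeros((2, 2))
y_hat = np.array([[3.0, 3.0], [4.0, 4.0]])
value = rmse(y, y_hat)

# Such a function would be passed through the evaluators argument, e.g.
# HierarchicalEvaluation(evaluators=[rmse]), assuming the library is installed.
```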
This class facilitates measurements across the hierarchy, defined by the tags list. See also the aggregate method.
Args:
- evaluators (list[Callable]): Functions with arguments y, y_hat (numpy arrays).
method __init__
method evaluate
Args:
- Y_hat_df (Frame): Forecasts with columns 'unique_id', 'ds' and models to evaluate.
- Y_test_df (Frame): Observed values with columns ['unique_id', 'ds', 'y'].
- tags (dict[str, np.ndarray]): Each str key is a level and its value contains tags associated to that level.
- Y_df (Optional[Frame], optional): Training set of base time series with columns ['unique_id', 'ds', 'y']. Default is None.
- benchmark (Optional[str], optional): If passed, evaluators are scaled by the error of this benchmark. Default is None.
- id_col (str, optional): Column that identifies each series. Default is 'unique_id'.
- time_col (str, optional): Column that identifies each timestep; its values can be timestamps or integers. Default is 'ds'.
- target_col (str, optional): Column that contains the target. Default is 'y'.
Returns:
- FrameT: DataFrame with accuracy measurements across hierarchical levels.

