Hierarchical Evaluation
To assist in the evaluation of hierarchical forecasting systems, we make available accuracy metrics along with the `HierarchicalEvaluation` module, which facilitates measuring forecast accuracy across the levels of the hierarchy.
The available metrics include point and probabilistic multivariate scoring rules that were used in previous hierarchical forecasting studies.
Accuracy Measurements
Relative Mean Squared Error
rel_mse
Relative Mean Squared Error

Computes the relative mean squared error (RelMSE), as proposed by Hyndman & Koehler (2006) as an alternative to percentage errors, to avoid measure instability.
Parameters:
- `y`: numpy array, actual values of size (`n_series`, `horizon`).
- `y_hat`: numpy array, predicted values of size (`n_series`, `horizon`).
- `mask`: numpy array, specifies the date stamps per series to consider in the loss.

Returns:
- `loss`: float.
References:
- Hyndman, R. J. and Koehler, A. B. (2006). "Another look at measures of forecast accuracy". International Journal of Forecasting, Volume 22, Issue 4.
- Kin G. Olivares, O. Nganba Meetei, Ruijun Ma, Rohan Reddy, Mengfei Cao, Lee Dicker. "Probabilistic Hierarchical Forecasting with Deep Poisson Mixtures". Submitted to the International Journal of Forecasting, working paper available at arXiv.
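As a minimal sketch of the idea (not the library's implementation; the helper name is hypothetical), RelMSE divides the forecast MSE by the MSE of a benchmark forecast, such as a naive prediction:

```python
import numpy as np

def rel_mse_sketch(y, y_hat, y_hat_benchmark):
    # RelMSE: forecast MSE scaled by the MSE of a benchmark forecast.
    # y, y_hat, y_hat_benchmark: arrays of size (n_series, horizon).
    mse = np.mean((y - y_hat) ** 2)
    mse_benchmark = np.mean((y - y_hat_benchmark) ** 2)
    return mse / mse_benchmark
```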
Mean Squared Scaled Error
msse
Mean Squared Scaled Error

Computes the mean squared scaled error (MSSE), as proposed by Hyndman & Koehler (2006) as an alternative to percentage errors, to avoid measure instability.

$$\mathrm{MSSE}(\mathbf{y}, \hat{\mathbf{y}}, \mathbf{y}^{train}) = \frac{\frac{1}{h} \sum^{h}_{\tau=1} (y_{\tau} - \hat{y}_{\tau})^{2}}{\frac{1}{n-1} \sum^{n}_{t=2} (y_{t} - y_{t-1})^{2}}$$

where $n$ is the size of the training data, and $h$ is the forecasting horizon (`horizon`).
Parameters:
- `y`: numpy array, actual values of size (`n_series`, `horizon`).
- `y_hat`: numpy array, predicted values of size (`n_series`, `horizon`).
- `y_train`: numpy array, training values of size (`n_series`, `n`).
- `mask`: numpy array, specifies the date stamps per series to consider in the loss.

Returns:
- `loss`: float.
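A rough numpy sketch under the same shape assumptions (the helper name is hypothetical, and the `mask` argument is ignored for brevity):

```python
import numpy as np

def msse_sketch(y, y_hat, y_train):
    # Numerator: out-of-sample MSE per series, averaged over the horizon.
    mse = np.mean((y - y_hat) ** 2, axis=1)
    # Denominator: in-sample MSE of the one-step naive forecast.
    scale = np.mean(np.diff(y_train, axis=1) ** 2, axis=1)
    return np.mean(mse / scale)
```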
Scaled CRPS
scaled_crps
Scaled Continuous Ranked Probability Score

Calculates a scaled variation of the CRPS, as proposed by Rangapuram et al. (2021), to measure the accuracy of predicted quantiles `y_hat` compared to the observation `y`.

This metric averages percentual weighted absolute deviations as defined by the quantile losses.

$$\mathrm{sCRPS}(\hat{F}_{\tau}, \mathbf{y}_{\tau}) = \frac{2}{N} \sum_{i} \int^{1}_{0} \frac{\mathrm{QL}(\hat{F}_{i,\tau}, y_{i,\tau})_{q}}{\sum_{i} |y_{i,\tau}|} \, dq$$

where $\hat{F}_{\tau}$ is an estimated multivariate distribution, and $y_{i,\tau}$ are its realizations.
Parameters:
- `y`: numpy array, actual values of size (`n_series`, `horizon`).
- `y_hat`: numpy array, predicted quantiles of size (`n_series`, `horizon`, `n_quantiles`).
- `quantiles`: numpy array of size (`n_quantiles`,), quantiles to estimate from the distribution of `y`.

Returns:
- `loss`: float.
References:
- Gneiting, Tilmann. (2011). "Quantiles as optimal point forecasts". International Journal of Forecasting.
- Spyros Makridakis, Evangelos Spiliotis, Vassilios Assimakopoulos, Zhi Chen, Anil Gaba, Ilia Tsetlin, Robert L. Winkler. (2022). "The M5 uncertainty competition: Results, findings and conclusions". International Journal of Forecasting.
- Syama Sundar Rangapuram, Lucien D Werner, Konstantinos Benidis, Pedro Mercado, Jan Gasthaus, Tim Januschowski. (2021). "End-to-End Learning of Coherent Probabilistic Forecasts for Hierarchical Time Series". Proceedings of the 38th International Conference on Machine Learning (ICML).
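A rough sketch in plain numpy, approximating the integral over $q$ by an average over the supplied quantile grid (hypothetical helper, not the library's implementation; exact normalization conventions may differ):

```python
import numpy as np

def scaled_crps_sketch(y, y_hat, quantiles):
    # y: (n_series, horizon); y_hat: (n_series, horizon, n_quantiles).
    eps = np.finfo(float).eps
    norm = np.sum(np.abs(y))  # scale: total absolute value of observations
    loss = 0.0
    for i, q in enumerate(quantiles):
        delta = y - y_hat[:, :, i]
        # Pinball (quantile) loss at level q, summed over series and horizon.
        loss += np.sum(np.maximum(q * delta, (q - 1) * delta))
    return 2 * loss / (len(quantiles) * (norm + eps))
```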
Energy Score
energy_score
Energy Score

Calculates Gneiting's Energy Score sample approximation for `y` and independent multivariate samples `y_sample1` and `y_sample2`. The Energy Score generalizes the CRPS (`beta`=1) in the multivariate setting.

$$\mathrm{ES}(\mathbf{y}, \hat{F}) = \mathbb{E}_{\hat{F}} \big[ \|\mathbf{y} - \hat{\mathbf{y}}\|^{\beta} \big] - \frac{1}{2} \mathbb{E}_{\hat{F}} \big[ \|\hat{\mathbf{y}} - \hat{\mathbf{y}}'\|^{\beta} \big]$$

where $\hat{\mathbf{y}}$ and $\hat{\mathbf{y}}'$ are independent samples drawn from the predictive distribution $\hat{F}$.
Parameters:
- `y`: numpy array, actual values of size (`n_series`, `horizon`).
- `y_sample1`: numpy array, predictive distribution sample of size (`n_series`, `horizon`, `n_samples`).
- `y_sample2`: numpy array, predictive distribution sample of size (`n_series`, `horizon`, `n_samples`).
- `beta`: float in (0, 2], defines the energy score's power for the Euclidean metric.

Returns:
- `score`: float.
References:
- Gneiting, Tilmann, and Adrian E. Raftery. (2007). "Strictly proper scoring rules, prediction and estimation". Journal of the American Statistical Association.
- Anastasios Panagiotelis, Puwasala Gamakumara, George Athanasopoulos, Rob J. Hyndman. (2022). "Probabilistic forecast reconciliation: Properties, evaluation and score optimisation". European Journal of Operational Research.
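A minimal sample-based sketch (hypothetical helper, not the library's implementation; the norm is taken across series, so each horizon step is scored as one multivariate observation):

```python
import numpy as np

def energy_score_sketch(y, y_sample1, y_sample2, beta=2):
    # y: (n_series, horizon); y_sample1, y_sample2: (n_series, horizon, n_samples).
    # E||y - Y||^beta, estimated from the first batch of samples.
    term1 = np.mean(np.linalg.norm(y[:, :, None] - y_sample1, axis=0) ** beta)
    # 0.5 * E||Y - Y'||^beta, estimated from the two independent batches.
    term2 = 0.5 * np.mean(np.linalg.norm(y_sample1 - y_sample2, axis=0) ** beta)
    return term1 - term2
```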
Log Score

log_score

Log Score.

One of the simplest multivariate probability scoring rules, it evaluates the negative logarithm of the predictive density at the value of the realisation.

$$\mathrm{LS}(\mathbf{y}_{\tau}, \hat{P}) = -\log \hat{f}(\mathbf{y}_{\tau})$$

where $\hat{P}$ is a parametric distribution and $\hat{f}$ represents its density. For the moment we only support the multivariate normal log score.
Parameters:
- `y`: numpy array, actual values of size (`n_series`, `horizon`).
- `y_hat`: numpy array, predicted values of size (`n_series`, `horizon`).
- `cov`: numpy matrix, predicted values' covariance of size (`n_series`, `n_series`, `horizon`).
- `allow_singular`: bool=True, if true allows a singular covariance.

Returns:
- `score`: float.
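A sketch of the multivariate normal case using scipy (hypothetical helper; whether scores are summed or averaged over the horizon is a library implementation detail):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_score_sketch(y, y_hat, cov, allow_singular=True):
    # y, y_hat: (n_series, horizon); cov: (n_series, n_series, horizon).
    score = 0.0
    for t in range(y.shape[1]):
        # Negative log density of the observation under a multivariate
        # normal centered at the forecast with the predicted covariance.
        dist = multivariate_normal(
            mean=y_hat[:, t], cov=cov[:, :, t], allow_singular=allow_singular
        )
        score -= dist.logpdf(y[:, t])
    return score
```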
Hierarchical Evaluation
HierarchicalEvaluation
Hierarchical Evaluation Class.

You can use your own metrics to evaluate the performance of each level in the structure. The metrics receive `y` and `y_hat` as arguments, both numpy arrays of size (`series`, `horizon`). Consider, for example, a function `rmse` that calculates the root mean squared error, as sketched below.

This class facilitates measurements across the hierarchy, defined by the `tags` list. See also the `aggregate` method.
Parameters:
- `evaluators`: functions with arguments `y`, `y_hat` (numpy arrays).
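For instance, a minimal sketch of a custom `rmse` evaluator (assuming the class is importable from `hierarchicalforecast.evaluation`):

```python
import numpy as np
from hierarchicalforecast.evaluation import HierarchicalEvaluation

def rmse(y, y_hat):
    # y and y_hat arrive as numpy arrays of size (series, horizon).
    return np.sqrt(np.mean((y - y_hat) ** 2))

evaluator = HierarchicalEvaluation(evaluators=[rmse])
```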
HierarchicalEvaluation.evaluate
Hierarchical Evaluation Method.

Parameters:
- `Y_hat_df`: pd.DataFrame, forecasts indexed by `'unique_id'` with column `'ds'` and models to evaluate.
- `Y_test_df`: pd.DataFrame, true values with columns `['ds', 'y']`.
- `tags`: dict, each str key is a level and its value contains the tags associated with that level.
- `Y_df`: pd.DataFrame, training set of base time series with columns `['ds', 'y']`, indexed by `unique_id`.
- `benchmark`: str, if passed, evaluators are scaled by the error of this benchmark.

Returns:
- `evaluation`: pd.DataFrame with accuracy measurements across hierarchical levels.
Example
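The snippet below sketches the end-to-end evaluation workflow on a toy two-level hierarchy. The data is synthetic, and the import path and dataframe layout follow the documented `evaluate` signature; treat it as an illustration rather than a verbatim recipe:

```python
import numpy as np
import pandas as pd
from hierarchicalforecast.evaluation import HierarchicalEvaluation

# Toy hierarchy: 'total' = 'a' + 'b', evaluated over a 2-step horizon.
ds = list(pd.to_datetime(['2023-01-01', '2023-01-02'])) * 3
idx = pd.Index(['total', 'total', 'a', 'a', 'b', 'b'], name='unique_id')
Y_test_df = pd.DataFrame({'ds': ds, 'y': [10.0, 12.0, 6.0, 7.0, 4.0, 5.0]}, index=idx)
Y_hat_df = pd.DataFrame({'ds': ds, 'Naive': [9.0, 9.0, 5.5, 5.5, 3.5, 3.5]}, index=idx)
tags = {'Total': np.array(['total']), 'Bottom': np.array(['a', 'b'])}

def rmse(y, y_hat):
    # Root mean squared error over all series of a hierarchy level.
    return np.sqrt(np.mean((y - y_hat) ** 2))

evaluator = HierarchicalEvaluation(evaluators=[rmse])
evaluation = evaluator.evaluate(Y_hat_df=Y_hat_df, Y_test_df=Y_test_df, tags=tags)
print(evaluation)  # accuracy measurements per hierarchy level and overall
```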
References
- Gneiting, Tilmann, and Adrian E. Raftery. (2007). “Strictly proper scoring rules, prediction and estimation”. Journal of the American Statistical Association.
- Gneiting, Tilmann. (2011). “Quantiles as optimal point forecasts”. International Journal of Forecasting.
- Spyros Makridakis, Evangelos Spiliotis, Vassilios Assimakopoulos, Zhi Chen, Anil Gaba, Ilia Tsetlin, Robert L. Winkler. (2022). “The M5 uncertainty competition: Results, findings and conclusions”. International Journal of Forecasting.
- Anastasios Panagiotelis, Puwasala Gamakumara, George Athanasopoulos, Rob J. Hyndman. (2022). “Probabilistic forecast reconciliation: Properties, evaluation and score optimisation”. European Journal of Operational Research.
- Syama Sundar Rangapuram, Lucien D Werner, Konstantinos Benidis, Pedro Mercado, Jan Gasthaus, Tim Januschowski. (2021). “End-to-End Learning of Coherent Probabilistic Forecasts for Hierarchical Time Series”. Proceedings of the 38th International Conference on Machine Learning (ICML).
- Kin G. Olivares, O. Nganba Meetei, Ruijun Ma, Rohan Reddy, Mengfei Cao, Lee Dicker (2022). "Probabilistic Hierarchical Forecasting with Deep Poisson Mixtures". Submitted to the International Journal of Forecasting, working paper available at arXiv.
- Makridakis, S., Spiliotis E., and Assimakopoulos V. (2022). “M5 Accuracy Competition: Results, Findings, and Conclusions.”, International Journal of Forecasting, Volume 38, Issue 4.