To assist the evaluation of hierarchical forecasting systems, we make accuracy metrics available along with the HierarchicalEvaluation module, which facilitates measuring forecast accuracy across the levels of the hierarchy.

The available metrics include point and probabilistic multivariate scoring rules that were used in previous hierarchical forecasting studies.

Accuracy Measurements

Relative Mean Squared Error


source

rel_mse

 rel_mse (y, y_hat, y_train, mask=None)

Relative Mean Squared Error

Computes the relative mean squared error (RelMSE), proposed by Hyndman & Koehler (2006) as an alternative to percentage errors to avoid measure instability.

\mathrm{RelMSE}(\mathbf{y}, \mathbf{\hat{y}}, \mathbf{\hat{y}}^{naive1}) = \frac{\mathrm{MSE}(\mathbf{y}, \mathbf{\hat{y}})}{\mathrm{MSE}(\mathbf{y}, \mathbf{\hat{y}}^{naive1})}

Parameters:
y: numpy array, Actual values of size (n_series, horizon).
y_hat: numpy array, Predicted values of size (n_series, horizon).
y_train: numpy array, Actual in-sample values of size (n_series, n), used to build the naive1 benchmark.
mask: numpy array, Specifies date stamps per series to consider in the loss.

Returns:
loss: float.

References:
- Hyndman, R. J. and Koehler, A. B. (2006). “Another look at measures of forecast accuracy”, International Journal of Forecasting, Volume 22, Issue 4.
- Kin G. Olivares, O. Nganba Meetei, Ruijun Ma, Rohan Reddy, Mengfei Cao, Lee Dicker. “Probabilistic Hierarchical Forecasting with Deep Poisson Mixtures”. Submitted to the International Journal of Forecasting, working paper available at arXiv.
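
A minimal usage sketch on synthetic arrays; the import path hierarchicalforecast.evaluation is assumed from this page's context:

import numpy as np
from hierarchicalforecast.evaluation import rel_mse

rng = np.random.default_rng(0)
n_series, horizon, n = 4, 12, 48
y_train = rng.normal(size=(n_series, n))           # in-sample actuals
y = rng.normal(size=(n_series, horizon))           # out-of-sample actuals
y_hat = y + rng.normal(scale=0.1, size=y.shape)    # forecasts to evaluate
loss = rel_mse(y=y, y_hat=y_hat, y_train=y_train)  # values below 1.0 beat naive1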

Mean Squared Scaled Error


source

msse

 msse (y, y_hat, y_train, mask=None)

Mean Squared Scaled Error

Computes the mean squared scaled error (MSSE), proposed by Hyndman & Koehler (2006) as an alternative to percentage errors to avoid measure instability.

\mathrm{MSSE}(\mathbf{y}, \mathbf{\hat{y}}, \mathbf{y}^{in-sample}) = \frac{\frac{1}{h} \sum^{n+h}_{\tau=n+1} (y_{\tau} - \hat{y}_{\tau})^2}{\frac{1}{n-1} \sum^{n}_{\tau=2} (y_{\tau} - y_{\tau-1})^2}

where n is the size of the training data (the argument n) and h is the forecasting horizon (the argument horizon).

Parameters:
y: numpy array, Actual values of size (n_series, horizon).
y_hat: numpy array, Predicted values of size (n_series, horizon).
y_train: numpy array, Actual in-sample values of size (n_series, n).
mask: numpy array, Specifies date stamps per series to consider in the loss.

Returns:
loss: float.

References:
- Hyndman, R. J. and Koehler, A. B. (2006). “Another look at measures of forecast accuracy”, International Journal of Forecasting, Volume 22, Issue 4.
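
A usage sketch mirroring the one for rel_mse above (import path assumed):

import numpy as np
from hierarchicalforecast.evaluation import msse

rng = np.random.default_rng(0)
y_train = rng.normal(size=(4, 48))              # (n_series, n)
y = rng.normal(size=(4, 12))                    # (n_series, horizon)
y_hat = y + rng.normal(scale=0.1, size=y.shape)
loss = msse(y=y, y_hat=y_hat, y_train=y_train)  # scaled by the in-sample naive1 MSE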

Scaled CRPS


source

scaled_crps

 scaled_crps (y, y_hat, quantiles)

Scaled Continuous Ranked Probability Score

Calculates a scaled variation of the CRPS, as proposed by Rangapuram (2021), to measure the accuracy of the predicted quantiles y_hat relative to the observations y.

This metric averages the weighted absolute deviations defined by the quantile losses, scaled by the aggregate absolute value of the observations.

\mathrm{sCRPS}(\hat{F}_{\tau}, \mathbf{y}_{\tau}) = \frac{2}{N} \sum_{i} \int^{1}_{0} \frac{\mathrm{QL}(\hat{F}_{i,\tau}, y_{i,\tau})_{q}}{\sum_{i} | y_{i,\tau} |} dq

where \hat{F}_{\tau} is an estimated multivariate distribution and y_{i,\tau} are its realizations.

Parameters:
y: numpy array, Actual values of size (n_series, horizon).
y_hat: numpy array, Predicted quantiles of size (n_series, horizon, n_quantiles).
quantiles: numpy array of size (n_quantiles), Quantiles to estimate from the distribution of y.

Returns:
loss: float.

References:
- Gneiting, Tilmann. (2011). “Quantiles as optimal point forecasts”. International Journal of Forecasting.
- Spyros Makridakis, Evangelos Spiliotis, Vassilios Assimakopoulos, Zhi Chen, Anil Gaba, Ilia Tsetlin, Robert L. Winkler. (2022). “The M5 uncertainty competition: Results, findings and conclusions”. International Journal of Forecasting.
- Syama Sundar Rangapuram, Lucien D Werner, Konstantinos Benidis, Pedro Mercado, Jan Gasthaus, Tim Januschowski. (2021). “End-to-End Learning of Coherent Probabilistic Forecasts for Hierarchical Time Series”. Proceedings of the 38th International Conference on Machine Learning (ICML).
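
A usage sketch with toy quantile forecasts; the import path and the per-quantile offsets are illustrative assumptions:

import numpy as np
from hierarchicalforecast.evaluation import scaled_crps

rng = np.random.default_rng(0)
quantiles = np.array([0.1, 0.5, 0.9])
y = rng.normal(size=(4, 12))                          # (n_series, horizon)
# toy quantile forecasts: offset the actuals per quantile
y_hat = y[:, :, None] + np.array([-1.28, 0.0, 1.28])  # (n_series, horizon, n_quantiles)
loss = scaled_crps(y=y, y_hat=y_hat, quantiles=quantiles)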

Energy Score


source

energy_score

 energy_score (y, y_sample1, y_sample2, beta=2)

Energy Score

Calculates Gneiting’s Energy Score sample approximation for y and independent multivariate samples y_sample1 and y_sample2. With beta=1, the Energy Score generalizes the CRPS to the multivariate setting.

\mathrm{ES}(\mathbf{y}_{\tau}, \mathbf{\hat{y}}_{\tau}, \mathbf{\hat{y}}_{\tau}') = \mathbb{E}_{\hat{P}} \left[ ||\mathbf{y}_{\tau} - \mathbf{\hat{y}}_{\tau}||^{\beta} \right] - \frac{1}{2} \mathbb{E}_{\hat{P}} \left[ ||\mathbf{\hat{y}}_{\tau} - \mathbf{\hat{y}}_{\tau}'||^{\beta} \right] \quad \beta \in (0,2]

where \mathbf{\hat{y}}_{\tau} and \mathbf{\hat{y}}_{\tau}' are independent samples drawn from \hat{P}.

Parameters:
y: numpy array, Actual values of size (n_series, horizon).
y_sample1: numpy array, predictive distribution sample of size (n_series, horizon, n_samples).
y_sample2: numpy array, predictive distribution sample of size (n_series, horizon, n_samples).
beta: float in (0,2], defines the power of the Euclidean norm in the energy score.

Returns:
score: float.

References:
- Gneiting, Tilmann, and Adrian E. Raftery. (2007). “Strictly proper scoring rules, prediction and estimation”. Journal of the American Statistical Association.
- Anastasios Panagiotelis, Puwasala Gamakumara, George Athanasopoulos, Rob J. Hyndman. (2022). “Probabilistic forecast reconciliation: Properties, evaluation and score optimisation”. European Journal of Operational Research.
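
A usage sketch with two independent batches of samples from the same predictive distribution (import path assumed):

import numpy as np
from hierarchicalforecast.evaluation import energy_score

rng = np.random.default_rng(0)
y = rng.normal(size=(4, 12))               # (n_series, horizon)
y_sample1 = rng.normal(size=(4, 12, 200))  # (n_series, horizon, n_samples)
y_sample2 = rng.normal(size=(4, 12, 200))  # independent second batch
score = energy_score(y=y, y_sample1=y_sample1, y_sample2=y_sample2, beta=2)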


source

log_score

 log_score (y, y_hat, cov, allow_singular=True)

Log Score.

One of the simplest multivariate probabilistic scoring rules, it evaluates the negative log density at the value of the realization.

\mathrm{LS}(\mathbf{y}_{\tau}, \mathbf{P}(\theta_{\tau})) = - \log(f(\mathbf{y}_{\tau}, \theta_{\tau}))

where \mathbf{P}(\theta_{\tau}) is a parametric distribution and f(\mathbf{y}_{\tau}, \theta_{\tau}) is its density. For the moment we only support the multivariate normal log score:

f(\mathbf{y}_{\tau}, \theta_{\tau}) = (2\pi)^{-k/2} \det(\boldsymbol{\Sigma})^{-1/2} \exp\left( -\frac{1}{2} (\mathbf{y}_{\tau} - \hat{\mathbf{y}}_{\tau})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{y}_{\tau} - \hat{\mathbf{y}}_{\tau}) \right)

Parameters:
y: numpy array, Actual values of size (n_series, horizon).
y_hat: numpy array, Predicted values (n_series, horizon).
cov: numpy matrix, Predicted covariance matrices of size (n_series, n_series, horizon).
allow_singular: bool=True, if True, allows a singular covariance matrix.

Returns:
score: float.
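
The short snippet below sanity-checks scipy’s multivariate_normal density, which the multivariate normal log score relies on (imports added so it runs standalone):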

import numpy as np
from scipy.stats import multivariate_normal

# evaluate a univariate normal density on a grid of points
x = np.linspace(0, 5, 10, endpoint=False)
y = multivariate_normal.pdf(x, mean=2.5, cov=0.5)
y
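
And a minimal log_score usage sketch on synthetic values, with one covariance matrix per horizon step; the import path and the diagonal covariances are illustrative assumptions:

import numpy as np
from hierarchicalforecast.evaluation import log_score

rng = np.random.default_rng(0)
n_series, horizon = 3, 2
y = rng.normal(size=(n_series, horizon))
y_hat = y + rng.normal(scale=0.1, size=y.shape)
# stack one (n_series, n_series) covariance per step -> (n_series, n_series, horizon)
cov = np.stack([np.eye(n_series)] * horizon, axis=-1)
score = log_score(y=y, y_hat=y_hat, cov=cov, allow_singular=True)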

Hierarchical Evaluation


source

HierarchicalEvaluation

 HierarchicalEvaluation (evaluators:List[Callable])

Hierarchical Evaluation Class.

You can use your own metrics to evaluate the performance of each level in the structure. The metrics receive y and y_hat as arguments; both are numpy arrays of size (series, horizon). Consider, for example, the function rmse that calculates the root mean squared error, sketched below.

This class facilitates measurements across the hierarchy, defined by the tags dictionary. See also the aggregate method.

Parameters:
evaluators: functions with arguments y, y_hat (numpy arrays).

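A minimal evaluator sketch matching this contract (the rmse definition below is illustrative, not part of the library):

import numpy as np
from hierarchicalforecast.evaluation import HierarchicalEvaluation

def rmse(y, y_hat):
    # y and y_hat arrive as numpy arrays of size (series, horizon)
    return np.sqrt(np.mean((y - y_hat) ** 2))

evaluator = HierarchicalEvaluation(evaluators=[rmse])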


source

HierarchicalEvaluation.evaluate

 HierarchicalEvaluation.evaluate (Y_hat_df:pandas.core.frame.DataFrame,
                                  Y_test_df:pandas.core.frame.DataFrame,
                                  tags:Dict[str,numpy.ndarray],
                                  Y_df:Optional[pandas.core.frame.DataFrame]=None,
                                  benchmark:Optional[str]=None)

Hierarchical Evaluation Method.

Parameters:
Y_hat_df: pd.DataFrame, Forecasts indexed by 'unique_id' with column 'ds' and one column per model to evaluate.
Y_test_df: pd.DataFrame, True values with columns ['ds', 'y'].
tags: dict, each str key is a level and its value contains the tags associated with that level.
Y_df: pd.DataFrame, Training set of base time series with columns ['ds', 'y'], indexed by unique_id.
benchmark: str, If passed, the evaluators are scaled by the error of this benchmark model.

Returns:
evaluation: pd.DataFrame with accuracy measurements across hierarchical levels.

Example
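
A minimal end-to-end sketch with a hypothetical two-level hierarchy in which 'total' aggregates series 'a' and 'b'; the series names, values, and the 'Naive' model column are illustrative only, and the dataframes follow the indexed-by-unique_id convention described above:

import numpy as np
import pandas as pd
from hierarchicalforecast.evaluation import HierarchicalEvaluation

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

# hierarchy tags: each level maps to the unique_ids it contains
tags = {'Top': np.array(['total']), 'Bottom': np.array(['a', 'b'])}

ds = list(pd.date_range('2023-01-01', periods=2, freq='D'))
idx = pd.Index(['total', 'total', 'a', 'a', 'b', 'b'], name='unique_id')
Y_test_df = pd.DataFrame({'ds': ds * 3, 'y': [10.0, 12.0, 4.0, 5.0, 6.0, 7.0]}, index=idx)

# hypothetical coherent forecasts for a model named 'Naive'
Y_hat_df = pd.DataFrame({'ds': ds * 3, 'Naive': [9.0, 9.0, 4.0, 4.0, 5.0, 5.0]}, index=idx)

evaluator = HierarchicalEvaluation(evaluators=[rmse])
evaluation = evaluator.evaluate(Y_hat_df=Y_hat_df, Y_test_df=Y_test_df, tags=tags)
evaluation  # accuracy per hierarchy level (plus 'Overall') and metric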
