module hierarchicalforecast.evaluation


function mse

mse(
    y: ndarray,
    y_hat: ndarray,
    weights: Optional[ndarray] = None,
    axis: Optional[int] = None
) → Union[float, ndarray]
Mean Squared Error

Calculates the Mean Squared Error (MSE) between `y` and `y_hat`. MSE measures the prediction accuracy of a forecasting method by computing the squared deviation between the prediction and the true value at a given time, and averaging these deviations over the length of the series.

$$\mathrm{MSE}(\mathbf{y}_{\tau}, \mathbf{\hat{y}}_{\tau}) = \frac{1}{H} \sum^{t+H}_{\tau=t+1} (y_{\tau} - \hat{y}_{\tau})^{2}$$

Args:
  • y (np.ndarray): Actual values.
  • y_hat (np.ndarray): Predicted values.
  • weights (Optional[np.ndarray], optional): Specifies date stamps per series to consider in the loss. Default is None.
  • axis (Optional[int], optional): Axis along which to compute the metric. Default is None.
Returns:
  • Union[float, np.ndarray]: MSE loss; a single value if `axis` is None, otherwise an array.
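
For intuition, here is a minimal NumPy sketch of the computation (an illustration of the formula above, not necessarily the library's exact implementation):

```python
import numpy as np

def mse_sketch(y, y_hat, weights=None, axis=None):
    # Squared deviations between observations and predictions.
    delta_sq = (y - y_hat) ** 2
    # np.average covers both the plain and the weighted mean,
    # reducing over `axis` (all axes when axis is None).
    return np.average(delta_sq, weights=weights, axis=axis)

y = np.array([[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]])      # (n_series, horizon)
y_hat = np.array([[1.5, 2.0, 2.0], [2.0, 1.0, 2.0]])
print(mse_sketch(y, y_hat))           # single value over all entries
print(mse_sketch(y, y_hat, axis=1))   # one value per series
```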

function mqloss

mqloss(
    y: ndarray,
    y_hat: ndarray,
    quantiles: ndarray,
    weights: Optional[ndarray] = None,
    axis: Optional[int] = None
) → Union[float, ndarray]
Multi-Quantile Loss

Calculates the Multi-Quantile loss (MQL) between `y` and `y_hat`. MQL computes the average multi-quantile loss for a given set of quantiles, based on the absolute difference between predicted quantiles and observed values.

$$\mathrm{MQL}(\mathbf{y}_{\tau}, [\mathbf{\hat{y}}^{(q_{1})}_{\tau}, ..., \mathbf{\hat{y}}^{(q_{n})}_{\tau}]) = \frac{1}{n} \sum_{q_{i}} \mathrm{QL}(\mathbf{y}_{\tau}, \mathbf{\hat{y}}^{(q_{i})}_{\tau})$$

In the limit, MQL measures the accuracy of a full predictive distribution $\mathbf{\hat{F}}_{\tau}$ through the continuous ranked probability score (CRPS). This can be achieved with a numerical integration technique that discretizes the quantiles and treats the CRPS integral with a left Riemann approximation, averaging over uniformly spaced quantiles.

$$\mathrm{CRPS}(y_{\tau}, \mathbf{\hat{F}}_{\tau}) = \int^{1}_{0} \mathrm{QL}(y_{\tau}, \hat{y}^{(q)}_{\tau}) \, dq$$

Args:
  • y (np.ndarray): Actual values.
  • y_hat (np.ndarray): Predicted values.
  • quantiles (np.ndarray): Quantiles between 0 and 1, of size (n_quantiles), at which to perform the evaluation.
  • weights (Optional[np.ndarray], optional): Specifies date stamps per series to consider in the loss. Default is None.
  • axis (Optional[int], optional): Axis along which to compute the metric. Default is None.
Returns:
  • Union[float, np.ndarray]: MQL loss; a single value if `axis` is None, otherwise an array.
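
A NumPy sketch of the average pinball loss that the MQL formula describes; with uniformly spaced quantiles this is also the left-Riemann CRPS approximation mentioned above (illustrative only, not necessarily the library's exact implementation):

```python
import numpy as np

def mqloss_sketch(y, y_hat, quantiles):
    # y: (n_series, horizon); y_hat: (n_series, horizon, n_quantiles)
    error = y[..., None] - y_hat                 # broadcast over quantiles
    # Pinball loss: q*error when under-predicting, (q-1)*error when over-predicting.
    ql = np.maximum(quantiles * error, (quantiles - 1) * error)
    return ql.mean()                             # average over all axes

y = np.array([[10.0, 12.0]])                     # one series, two steps
y_hat = np.stack([y - 1.0, y, y + 1.0], axis=-1) # forecasts for q=0.1, 0.5, 0.9
print(mqloss_sketch(y, y_hat, np.array([0.1, 0.5, 0.9])))
```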

function rel_mse

rel_mse(y, y_hat, y_train, mask=None)
Relative Mean Squared Error

Computes the Relative Mean Squared Error (RelMSE), proposed by Hyndman & Koehler (2006) as an alternative to percentage errors that avoids measure instability.

$$\mathrm{RelMSE}(\mathbf{y}, \mathbf{\hat{y}}, \mathbf{\hat{y}}^{naive1}) = \frac{\mathrm{MSE}(\mathbf{y}, \mathbf{\hat{y}})}{\mathrm{MSE}(\mathbf{y}, \mathbf{\hat{y}}^{naive1})}$$

Args:
  • y (np.ndarray): Actual values of size (n_series, horizon).
  • y_hat (np.ndarray): Predicted values of size (n_series, horizon).
  • y_train (np.ndarray): Training values, used to construct the naive-1 benchmark forecast.
  • mask (Optional[np.ndarray], optional): Specifies date stamps per series to consider in the loss. Default is None.
Returns:
  • float: loss.
References:
  • [Hyndman, R. J. and Koehler, A. B. (2006). "Another look at measures of forecast accuracy", International Journal of Forecasting, Volume 22, Issue 4.](https://www.sciencedirect.com/science/article/pii/S0169207006000239)
  • [Kin G. Olivares, O. Nganba Meetei, Ruijun Ma, Rohan Reddy, Mengfei Cao, Lee Dicker. "Probabilistic Hierarchical Forecasting with Deep Poisson Mixtures." Submitted to the International Journal of Forecasting, working paper available at arXiv.](https://arxiv.org/pdf/2110.13179.pdf)
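
A short sketch of the ratio above, under the assumption that the naive-1 benchmark repeats each series' last training value over the horizon (illustrative, not the library's exact implementation):

```python
import numpy as np

def rel_mse_sketch(y, y_hat, y_train):
    # Naive-1 benchmark: carry each series' last observed training value forward.
    y_naive1 = np.repeat(y_train[:, -1:], y.shape[1], axis=1)
    mse = lambda a, b: np.mean((a - b) ** 2)
    # A ratio below 1 means the forecast beats the naive benchmark.
    return mse(y, y_hat) / mse(y, y_naive1)

y = np.array([[3.0, 4.0, 5.0]])
y_hat = np.array([[2.5, 4.5, 5.0]])
y_train = np.array([[1.0, 2.0, 3.0]])
print(rel_mse_sketch(y, y_hat, y_train))   # 0.1: far better than naive
```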

function msse

msse(y, y_hat, y_train, mask=None)
Mean Squared Scaled Error

Computes the Mean Squared Scaled Error (MSSE), proposed by Hyndman & Koehler (2006) as an alternative to percentage errors that avoids measure instability.

$$\mathrm{MSSE}(\mathbf{y}, \mathbf{\hat{y}}, \mathbf{y}^{in-sample}) = \frac{\frac{1}{h} \sum^{t+h}_{\tau=t+1} (y_{\tau} - \hat{y}_{\tau})^2}{\frac{1}{t-1} \sum^{t}_{\tau=2} (y_{\tau} - y_{\tau-1})^2}$$

where $n$ is the size of the training data and $h$ is the forecasting horizon (`horizon`).

Args:
  • y (np.ndarray): Actual values of size (n_series, horizon).
  • y_hat (np.ndarray): Predicted values of size (n_series, horizon).
  • y_train (np.ndarray): Training values of size (n_series, n).
  • mask (Optional[np.ndarray], optional): Specifies date stamps per series to consider in the loss. Default is None.
Returns:
  • float: loss.
References:
  • [Hyndman, R. J. and Koehler, A. B. (2006). "Another look at measures of forecast accuracy", International Journal of Forecasting, Volume 22, Issue 4.](https://www.sciencedirect.com/science/article/pii/S0169207006000239)
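
A sketch of the scaling: per the formula above, the denominator is the in-sample MSE of the one-step naive forecast (illustrative only, not the library's exact implementation):

```python
import numpy as np

def msse_sketch(y, y_hat, y_train):
    # Numerator: out-of-sample MSE of the forecasts.
    num = np.mean((y - y_hat) ** 2)
    # Denominator: in-sample MSE of the naive forecast y_hat_t = y_{t-1}.
    den = np.mean(np.diff(y_train, axis=1) ** 2)
    return num / den

y = np.array([[3.0, 4.0]])
y_hat = np.array([[3.5, 4.5]])
y_train = np.array([[1.0, 2.0, 2.5, 3.0]])
print(msse_sketch(y, y_hat, y_train))   # 0.25 / 0.5 = 0.5
```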

function scaled_crps

scaled_crps(y, y_hat, quantiles)
Scaled Continuous Ranked Probability Score

Calculates a scaled variation of the CRPS, as proposed by Rangapuram (2021), to measure the accuracy of predicted quantiles `y_hat` relative to the observations `y`. This metric averages percentage-weighted absolute deviations as defined by the quantile losses.

$$\mathrm{sCRPS}(\hat{F}_{\tau}, \mathbf{y}_{\tau}) = \frac{2}{N} \sum_{i} \int^{1}_{0} \frac{\mathrm{QL}(\hat{F}_{i,\tau}, y_{i,\tau})_{q}}{\sum_{i} |y_{i,\tau}|} \, dq$$

where $\hat{F}_{\tau}$ is the estimated multivariate distribution and $y_{i,\tau}$ are its realizations.

Args:
  • y (np.ndarray): Actual values of size (n_series, horizon).
  • y_hat (np.ndarray): Predicted quantiles of size (n_series, horizon, n_quantiles).
  • quantiles (np.ndarray): Quantiles of size (n_quantiles) estimated from the distribution of `y`.
Returns:
  • float: loss.
References:
  • [Gneiting, Tilmann. (2011). "Quantiles as optimal point forecasts". International Journal of Forecasting.](https://www.sciencedirect.com/science/article/pii/S0169207010000063)
  • [Spyros Makridakis, Evangelos Spiliotis, Vassilios Assimakopoulos, Zhi Chen, Anil Gaba, Ilia Tsetlin, Robert L. Winkler. (2022). "The M5 uncertainty competition: Results, findings and conclusions". International Journal of Forecasting.](https://www.sciencedirect.com/science/article/pii/S0169207021001722)
  • [Syama Sundar Rangapuram, Lucien D Werner, Konstantinos Benidis, Pedro Mercado, Jan Gasthaus, Tim Januschowski. (2021). "End-to-End Learning of Coherent Probabilistic Forecasts for Hierarchical Time Series". Proceedings of the 38th International Conference on Machine Learning (ICML).](https://proceedings.mlr.press/v139/rangapuram21a.html)
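
A sketch connecting sCRPS to the multi-quantile loss above, assuming uniformly spaced quantiles so the integral becomes an average; the normalization follows the formula and is illustrative, not necessarily the library's exact code:

```python
import numpy as np

def scaled_crps_sketch(y, y_hat, quantiles):
    # Mean pinball loss over series, horizon and quantiles:
    # a left-Riemann approximation of the CRPS integral.
    error = y[..., None] - y_hat
    ql = np.maximum(quantiles * error, (quantiles - 1) * error).mean()
    # Scale by the total absolute value of the observations.
    return 2 * ql * y.size / np.sum(np.abs(y))

y = np.array([[10.0, 12.0]])
y_hat = np.stack([y - 2.0, y, y + 2.0], axis=-1)   # q = 0.25, 0.5, 0.75
print(scaled_crps_sketch(y, y_hat, np.array([0.25, 0.5, 0.75])))
```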

function energy_score

energy_score(y, y_sample1, y_sample2, beta=2)
Energy Score

Calculates Gneiting's Energy Score sample approximation for `y` and independent multivariate samples `y_sample1` and `y_sample2`. The Energy Score generalizes the CRPS (`beta`=1) to the multivariate setting.

$$\mathrm{ES}(\mathbf{y}_{\tau}, \mathbf{\hat{y}}_{\tau}, \mathbf{\hat{y}}_{\tau}') = \frac{1}{2} \mathbb{E}_{\hat{P}} \left[ ||\mathbf{\hat{y}}_{\tau} - \mathbf{\hat{y}}_{\tau}'||^{\beta} \right] - \mathbb{E}_{\hat{P}} \left[ ||\mathbf{y}_{\tau} - \mathbf{\hat{y}}_{\tau}||^{\beta} \right] \quad \beta \in (0,2]$$

where $\mathbf{\hat{y}}_{\tau}, \mathbf{\hat{y}}_{\tau}'$ are independent samples drawn from $\hat{P}$.

Args:
  • y (np.ndarray): Actual values of size (n_series, horizon).
  • y_sample1 (np.ndarray): Predictive distribution sample of size (n_series, horizon, n_samples).
  • y_sample2 (np.ndarray): Predictive distribution sample of size (n_series, horizon, n_samples).
  • beta (float, optional): Float in (0,2], defines the energy score's power for the Euclidean metric. Default is 2.
Returns:
  • float: score.
References:
  • [Gneiting, Tilmann, and Adrian E. Raftery. (2007). "Strictly proper scoring rules, prediction and estimation". Journal of the American Statistical Association.](https://sites.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf)
  • [Anastasios Panagiotelis, Puwasala Gamakumara, George Athanasopoulos, Rob J. Hyndman. (2022). "Probabilistic forecast reconciliation: Properties, evaluation and score optimisation". European Journal of Operational Research.](https://www.sciencedirect.com/science/article/pii/S0377221722006087)

function log_score

log_score(y, y_hat, cov, allow_singular=True)
Log Score

One of the simplest multivariate probability scoring rules; it evaluates the negative density at the value of the realisation.

$$\mathrm{LS}(\mathbf{y}_{\tau}, \mathbf{P}(\theta_{\tau})) = - \log(f(\mathbf{y}_{\tau}, \theta_{\tau}))$$

where $f$ is the density, $\mathbf{P}(\theta_{\tau})$ is a parametric distribution, and $f(\mathbf{y}_{\tau}, \theta_{\tau})$ represents its density. For the moment only the multivariate normal log score is supported:

$$f(\mathbf{y}_{\tau}, \theta_{\tau}) = (2\pi)^{-k/2} \det(\boldsymbol{\Sigma})^{-1/2} \exp\left( -\frac{1}{2} (\mathbf{y}_{\tau} - \hat{\mathbf{y}}_{\tau})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{y}_{\tau} - \hat{\mathbf{y}}_{\tau}) \right)$$

Args:
  • y (np.ndarray): Actual values of size (n_series, horizon).
  • y_hat (np.ndarray): Predicted values of size (n_series, horizon).
  • cov (np.ndarray): Predicted values covariance of size (n_series, n_series, horizon).
  • allow_singular (bool, optional): If True, allows a singular covariance. Default is True.
Returns:
  • float: score.
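
A minimal sketch of the multivariate normal log score at a single horizon step, using `scipy.stats.multivariate_normal`; the per-step covariance slice mirrors the `cov` shape above (illustrative only):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_score_step(y_step, y_hat_step, cov_step, allow_singular=True):
    # Negative log-density of the realisation under N(y_hat_step, cov_step).
    dist = multivariate_normal(mean=y_hat_step, cov=cov_step,
                               allow_singular=allow_singular)
    return -dist.logpdf(y_step)

y = np.array([1.0, 2.0])       # realisations for n_series at one step
y_hat = np.array([1.1, 1.9])   # predicted means
cov = 0.5 * np.eye(2)          # predicted covariance for that step
print(log_score_step(y, y_hat, cov))
```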

function evaluate

evaluate(
    df: ~FrameT,
    metrics: list[Callable],
    tags: dict[str, ndarray],
    models: Optional[list[str]] = None,
    train_df: Optional[~FrameT] = None,
    level: Optional[list[int]] = None,
    id_col: str = 'unique_id',
    time_col: str = 'ds',
    target_col: str = 'y',
    agg_fn: Optional[str] = 'mean',
    benchmark: Optional[str] = None
) → ~FrameT
Evaluate hierarchical forecast using different metrics.

Args:
  • df (pandas, polars, dask or spark DataFrame): Forecasts to evaluate. Must have `id_col`, `time_col`, `target_col` and the models' predictions.
  • metrics (list of Callable): Functions with arguments `df`, `models`, `id_col`, `target_col` and optionally `train_df`.
  • tags (dict[str, np.ndarray]): Each key is a level in the hierarchy and its value contains the tags associated to that level.
  • models (list of str, optional): Names of the models to evaluate. If None, will use every column in the dataframe after removing id, time and target. Default is None.
  • train_df (pandas, polars, dask or spark DataFrame, optional): Training set. Used to evaluate metrics such as `mase`. Default is None.
  • level (list of int, optional): Prediction interval levels. Used to compute losses that rely on quantiles. Default is None.
  • id_col (str, optional): Column that identifies each series. Default is 'unique_id'.
  • time_col (str, optional): Column that identifies each timestep; its values can be timestamps or integers. Default is 'ds'.
  • target_col (str, optional): Column that contains the target. Default is 'y'.
  • agg_fn (str, optional): Statistic to compute on the scores by id to reduce them to a single number. Default is 'mean'.
  • benchmark (str, optional): If passed, evaluators are scaled by the error of this benchmark model. Default is None.
Returns:
  • pandas or polars DataFrame: Metrics with one row per (id, metric) combination and one column per model. If `agg_fn` is not None, there is only one row per metric.

class HierarchicalEvaluation

Hierarchical Evaluation Class. You can use your own metrics to evaluate the performance of each level in the structure. The metrics receive `y` and `y_hat` as arguments; they are numpy arrays of size (series, horizon). Consider, for example, the function `rmse` that calculates the root mean squared error. This class facilitates measurements across the hierarchy, defined by the `tags` list. See also the [aggregate method](https://nixtla.github.io/hierarchicalforecast/utils.html#aggregate).

Args:
  • evaluators (list[Callable]): Functions with arguments `y`, `y_hat` (numpy arrays).

method __init__

__init__(evaluators: list[Callable])

method evaluate

evaluate(
    Y_hat_df: Union['DataFrame[Any]', 'LazyFrame[Any]'],
    Y_test_df: Union['DataFrame[Any]', 'LazyFrame[Any]'],
    tags: dict[str, ndarray],
    Y_df: Optional[Union['DataFrame[Any]', 'LazyFrame[Any]']] = None,
    benchmark: Optional[str] = None,
    id_col: str = 'unique_id',
    time_col: str = 'ds',
    target_col: str = 'y'
) → ~FrameT
Hierarchical Evaluation Method.

Args:
  • Y_hat_df (Frame): Forecasts with columns 'unique_id', 'ds' and models to evaluate.
  • Y_test_df (Frame): Observed values with columns ['unique_id', 'ds', 'y'].
  • tags (dict[str, np.ndarray]): Each str key is a level and its value contains the tags associated to that level.
  • Y_df (Optional[Frame], optional): Training set of base time series with columns ['unique_id', 'ds', 'y']. Default is None.
  • benchmark (Optional[str], optional): If passed, evaluators are scaled by the error of this benchmark. Default is None.
  • id_col (str, optional): Column that identifies each series. Default is 'unique_id'.
  • time_col (str, optional): Column that identifies each timestep; its values can be timestamps or integers. Default is 'ds'.
  • target_col (str, optional): Column that contains the target. Default is 'y'.
Returns:
  • FrameT: Evaluation DataFrame with accuracy measurements across hierarchical levels.
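
A hedged end-to-end sketch of `evaluate` on a tiny two-level hierarchy. The hierarchy, the model column `Naive`, and the metric import are illustrative assumptions; any metric following the callable signature described above should work (e.g. the losses in `utilsforecast.losses`):

```python
import numpy as np
import pandas as pd
from hierarchicalforecast.evaluation import evaluate
from utilsforecast.losses import rmse  # metric with (df, models, ...) signature

# Toy hierarchy: 'total' aggregates the two bottom series 'a' and 'b'.
ds = pd.to_datetime(["2024-01-01", "2024-01-02"])
df = pd.DataFrame({
    "unique_id": np.repeat(["total", "a", "b"], 2),
    "ds": np.tile(ds, 3),
    "y":     [10.0, 12.0, 6.0, 7.0, 4.0, 5.0],   # observed values
    "Naive": [ 9.0, 11.0, 5.5, 6.5, 3.5, 4.5],   # one model's forecasts
})
tags = {
    "Country":       np.array(["total"]),
    "Country/Store": np.array(["a", "b"]),
}

# With the default agg_fn='mean', scores are aggregated across ids,
# yielding one row per (level, metric) and one column per model.
print(evaluate(df=df, metrics=[rmse], tags=tags))
```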