
module hierarchicalforecast.evaluation


function mse

mse(
    y: ndarray,
    y_hat: ndarray,
    weights: Optional[ndarray] = None,
    axis: Optional[int] = None
) → Union[float, ndarray]
Mean Squared Error

Calculates the Mean Squared Error (MSE) between y and y_hat. MSE measures the relative prediction accuracy of a forecasting method by calculating the squared deviation of the prediction from the true value at a given time, and averaging these deviations over the length of the series.

$$\mathrm{MSE}(\mathbf{y}_{\tau}, \mathbf{\hat{y}}_{\tau}) = \frac{1}{H} \sum^{t+H}_{\tau=t+1} (y_{\tau} - \hat{y}_{\tau})^{2}$$

Args:
  • y (np.ndarray): numpy array, Actual values.
  • y_hat (np.ndarray): numpy array, Predicted values.
  • weights (Optional[np.ndarray], optional): numpy array, Specifies date stamps per series to consider in the loss. Default is None.
  • axis (Optional[int], optional): Axis along which to compute the metric. Default is None.
Returns:
  • Union[float, np.ndarray]: MSE loss; a float when axis is None, otherwise an np.ndarray.
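
Example (array values are illustrative):

import numpy as np

from hierarchicalforecast.evaluation import mse

y = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])      # actual values, (n_series, horizon)
y_hat = np.array([[1.5, 2.0, 2.5], [2.0, 3.0, 7.0]])  # predictions, same shape

mse(y, y_hat)          # single float averaged over all entries
mse(y, y_hat, axis=1)  # one value per series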

function mqloss

mqloss(
    y: ndarray,
    y_hat: ndarray,
    quantiles: ndarray,
    weights: Optional[ndarray] = None,
    axis: Optional[int] = None
) → Union[float, ndarray]
Multi-Quantile Loss

Calculates the Multi-Quantile Loss (MQL) between y and y_hat. MQL computes the average multi-quantile loss for a given set of quantiles, based on the absolute difference between predicted quantiles and observed values.

$$\mathrm{MQL}(\mathbf{y}_{\tau}, [\mathbf{\hat{y}}^{(q_{1})}_{\tau}, ..., \mathbf{\hat{y}}^{(q_{n})}_{\tau}]) = \frac{1}{n} \sum_{q_{i}} \mathrm{QL}(\mathbf{y}_{\tau}, \mathbf{\hat{y}}^{(q_{i})}_{\tau})$$

In the limit, MQL measures the accuracy of a full predictive distribution $\mathbf{\hat{F}}_{\tau}$ via the continuous ranked probability score (CRPS). This can be achieved through numerical integration: discretize the quantiles and treat the CRPS integral with a left Riemann approximation, averaging over uniformly spaced quantiles.

$$\mathrm{CRPS}(y_{\tau}, \mathbf{\hat{F}}_{\tau}) = \int^{1}_{0} \mathrm{QL}(y_{\tau}, \hat{y}^{(q)}_{\tau}) \, dq$$

Args:
  • y (np.ndarray): numpy array, Actual values.
  • y_hat (np.ndarray): numpy array, Predicted values.
  • quantiles (np.ndarray): numpy array of size (n_quantiles), quantiles between 0 and 1 to evaluate.
  • weights (Optional[np.ndarray], optional): numpy array, Specifies date stamps per series to consider in the loss. Default is None.
  • axis (Optional[int], optional): Axis along which to compute the metric. Default is None.
Returns:
  • Union[float, np.ndarray]: MQL loss; a float when axis is None, otherwise an np.ndarray.
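Example of a call consistent with the signature above; it assumes y_hat stacks the predicted quantiles along a trailing axis of size n_quantiles, which the argument descriptions do not spell out:

import numpy as np

from hierarchicalforecast.evaluation import mqloss

quantiles = np.array([0.1, 0.5, 0.9])
y = np.array([[1.0, 2.0], [3.0, 4.0]])            # (n_series, horizon)
y_hat = np.stack([y - 0.5, y, y + 0.5], axis=-1)  # assumed shape (n_series, horizon, n_quantiles)

mqloss(y, y_hat, quantiles)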

function rel_mse

rel_mse(y, y_hat, y_train, mask=None)
Relative Mean Squared Error

Computes the Relative Mean Squared Error (RelMSE), proposed by Hyndman & Koehler (2006) as an alternative to percentage errors, to avoid the instability of those measures.

$$\mathrm{RelMSE}(\mathbf{y}, \mathbf{\hat{y}}, \mathbf{\hat{y}}^{naive1}) = \frac{\mathrm{MSE}(\mathbf{y}, \mathbf{\hat{y}})}{\mathrm{MSE}(\mathbf{y}, \mathbf{\hat{y}}^{naive1})}$$

Args:
  • y (np.ndarray): numpy array, Actual values of size (n_series, horizon).
  • y_hat (np.ndarray): numpy array, Predicted values (n_series, horizon).
  • y_train (np.ndarray): numpy array, Training values.
  • mask (Optional[np.ndarray], optional): numpy array, Specifies date stamps per series to consider in the loss. Default is None.
Returns:
  • float: loss.
References:
  • Hyndman, R. J. & Koehler, A. B. (2006). "Another look at measures of forecast accuracy", International Journal of Forecasting, 22(4), 679-688.
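
Example (illustrative values; the naive1 benchmark in the denominator is derived internally from y_train):

import numpy as np

from hierarchicalforecast.evaluation import rel_mse

y_train = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # (n_series, n)
y = np.array([[3.5, 4.0], [6.5, 7.0]])                  # (n_series, horizon)
y_hat = np.array([[3.4, 4.2], [6.4, 7.1]])              # (n_series, horizon)

rel_mse(y, y_hat, y_train)  # values below 1 mean y_hat beats the naive benchmark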

function msse

msse(y, y_hat, y_train, mask=None)
Mean Squared Scaled Error

Computes the Mean Squared Scaled Error (MSSE), proposed by Hyndman & Koehler (2006) as an alternative to percentage errors, to avoid the instability of those measures.

$$\mathrm{MSSE}(\mathbf{y}, \mathbf{\hat{y}}, \mathbf{y}^{in-sample}) = \frac{\frac{1}{h} \sum^{t+h}_{\tau=t+1} (y_{\tau} - \hat{y}_{\tau})^2}{\frac{1}{t-1} \sum^{t}_{\tau=2} (y_{\tau} - y_{\tau-1})^2}$$

where $n$ is the size of the training data and $h$ is the forecasting horizon.

Args:
  • y (np.ndarray): numpy array, Actual values of size (n_series, horizon).
  • y_hat (np.ndarray): numpy array, Predicted values (n_series, horizon).
  • y_train (np.ndarray): numpy array, Training values of size (n_series, n).
  • mask (Optional[np.ndarray], optional): numpy array, Specifies date stamps per series to consider in the loss. Default is None.
Returns:
  • float: loss.
References:
  • Hyndman, R. J. & Koehler, A. B. (2006). "Another look at measures of forecast accuracy", International Journal of Forecasting, 22(4), 679-688.
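
A sketch that checks the formula by hand; it assumes the mask-free call reduces to the ratio of the global means in the equation above:

import numpy as np

from hierarchicalforecast.evaluation import msse

y_train = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # (n_series, n)
y = np.array([[3.5, 4.0], [6.5, 7.0]])                  # (n_series, horizon)
y_hat = np.array([[3.4, 4.2], [6.4, 7.1]])              # (n_series, horizon)

num = np.mean((y - y_hat) ** 2)               # forecast MSE
den = np.mean(np.diff(y_train, axis=1) ** 2)  # one-step naive in-sample MSE
msse(y, y_hat, y_train)  # expected to match num / den, per the formula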

function scaled_crps

scaled_crps(y, y_hat, quantiles)
Scaled Continuous Ranked Probability Score

Calculates a scaled variation of the CRPS, as proposed by Rangapuram (2021), to measure the accuracy of predicted quantiles y_hat compared to the observations y. This metric averages percentage-weighted absolute deviations as defined by the quantile losses.

$$\mathrm{sCRPS}(\hat{F}_{\tau}, \mathbf{y}_{\tau}) = \frac{2}{N} \sum_{i} \int^{1}_{0} \frac{\mathrm{QL}(\hat{F}_{i,\tau}, y_{i,\tau})_{q}}{\sum_{i} |y_{i,\tau}|} \, dq$$

where $\hat{F}_{\tau}$ is an estimated multivariate distribution, and $y_{i,\tau}$ are its realizations.

Args:
  • y (np.ndarray): numpy array, Actual values of size (n_series, horizon).
  • y_hat (np.ndarray): numpy array, Predicted quantiles of size (n_series, horizon, n_quantiles).
  • quantiles (np.ndarray): numpy array of size (n_quantiles), quantiles to estimate from the distribution of y.
Returns:
  • float: loss.
References:
  • Rangapuram, S. et al. (2021). "End-to-End Learning of Coherent Probabilistic Forecasts for Hierarchical Time Series", ICML.
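
Example with the documented shapes (values are illustrative):

import numpy as np

from hierarchicalforecast.evaluation import scaled_crps

quantiles = np.array([0.1, 0.5, 0.9])
y = np.array([[1.0, 2.0], [3.0, 4.0]])            # (n_series, horizon)
y_hat = np.stack([y - 0.5, y, y + 0.5], axis=-1)  # (n_series, horizon, n_quantiles)

scaled_crps(y, y_hat, quantiles)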

function energy_score

energy_score(y, y_sample1, y_sample2, beta=2)
Energy Score

Calculates Gneiting's Energy Score sample approximation for y and independent multivariate samples y_sample1 and y_sample2. The Energy Score generalizes the CRPS (beta=1) to the multivariate setting.

$$\mathrm{ES}(\mathbf{y}_{\tau}, \mathbf{\hat{y}}_{\tau}, \mathbf{\hat{y}}_{\tau}') = \mathbb{E}_{\hat{P}} \left[ ||\mathbf{y}_{\tau} - \mathbf{\hat{y}}_{\tau}||^{\beta} \right] - \frac{1}{2} \mathbb{E}_{\hat{P}} \left[ ||\mathbf{\hat{y}}_{\tau} - \mathbf{\hat{y}}_{\tau}'||^{\beta} \right] \quad \beta \in (0,2]$$

where $\mathbf{\hat{y}}_{\tau}$ and $\mathbf{\hat{y}}_{\tau}'$ are independent samples drawn from $\hat{P}$.

Args:
  • y (np.ndarray): numpy array, Actual values of size (n_series, horizon).
  • y_sample1 (np.ndarray): numpy array, predictive distribution sample of size (n_series, horizon, n_samples).
  • y_sample2 (np.ndarray): numpy array, predictive distribution sample of size (n_series, horizon, n_samples).
  • beta (float, optional): float in (0,2], defines the power of the Euclidean norm in the energy score. Default is 2.
Returns:
  • float: score.
References:
  • Gneiting, T. & Raftery, A. E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation", Journal of the American Statistical Association, 102(477), 359-378.
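
Example with two independent sample sets (values are illustrative):

import numpy as np

from hierarchicalforecast.evaluation import energy_score

rng = np.random.default_rng(0)
y = rng.normal(size=(3, 4))               # (n_series, horizon)
y_sample1 = rng.normal(size=(3, 4, 100))  # (n_series, horizon, n_samples)
y_sample2 = rng.normal(size=(3, 4, 100))  # independent second sample

energy_score(y, y_sample1, y_sample2, beta=2)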

function log_score

log_score(y, y_hat, cov, allow_singular=True)
Log Score

One of the simplest multivariate probability scoring rules, it evaluates the negative log-density at the value of the realisation.

$$\mathrm{LS}(\mathbf{y}_{\tau}, \mathbf{P}(\theta_{\tau})) = - \log(f(\mathbf{y}_{\tau}, \theta_{\tau}))$$

where $\mathbf{P}(\theta_{\tau})$ is a parametric distribution and $f(\mathbf{y}_{\tau}, \theta_{\tau})$ is its density. For the moment only the multivariate normal log score is supported:

$$f(\mathbf{y}_{\tau}, \theta_{\tau}) = (2\pi)^{-k/2} \det(\boldsymbol{\Sigma})^{-1/2} \exp\left( -\frac{1}{2} (\mathbf{y}_{\tau} - \hat{\mathbf{y}}_{\tau})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{y}_{\tau} - \hat{\mathbf{y}}_{\tau}) \right)$$

Args:
  • y (np.ndarray): numpy array, Actual values of size (n_series, horizon).
  • y_hat (np.ndarray): numpy array, Predicted values (n_series, horizon).
  • cov (np.ndarray): numpy matrix, Predicted values covariance (n_series, n_series, horizon).
  • allow_singular (bool, optional): if true allows singular covariance. Default is True.
Returns:
  • float: score.
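
Example with the documented covariance layout, one covariance matrix per horizon step (values are illustrative):

import numpy as np

from hierarchicalforecast.evaluation import log_score

n_series, horizon = 3, 2
y = np.zeros((n_series, horizon))
y_hat = np.zeros((n_series, horizon))
cov = np.stack([np.eye(n_series)] * horizon, axis=-1)  # (n_series, n_series, horizon)

log_score(y, y_hat, cov)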

function evaluate

evaluate(
    df: ~FrameT,
    metrics: list[Callable],
    tags: dict[str, ndarray],
    models: Optional[list[str]] = None,
    train_df: Optional[~FrameT] = None,
    level: Optional[list[int]] = None,
    id_col: str = 'unique_id',
    time_col: str = 'ds',
    target_col: str = 'y',
    agg_fn: Optional[str] = 'mean',
    benchmark: Optional[str] = None
) → ~FrameT
Evaluate hierarchical forecasts using different metrics.

Args:
  • df (pandas, polars, dask or spark DataFrame): Forecasts to evaluate. Must have id_col, time_col, target_col and models’ predictions.
  • metrics (list of callable): Functions with arguments df, models, id_col, target_col and optionally train_df.
  • tags (dict): Each key is a level in the hierarchy and its value contains the tags associated to that level.
  • models (list of str, optional): Names of the models to evaluate. If None, will use every column in the dataframe after removing id, time and target.
  • train_df (pandas, polars, dask or spark DataFrame, optional): Training set. Used to evaluate metrics such as mase.
  • level (list of int, optional): Prediction interval levels. Used to compute losses that rely on quantiles.
  • id_col (str): Column that identifies each series.
  • time_col (str): Column that identifies each timestep; its values can be timestamps or integers.
  • target_col (str): Column that contains the target.
  • agg_fn (str, optional): Statistic to compute on the scores by id to reduce them to a single number.
  • benchmark (str, optional): If passed, evaluators are scaled by the error of this benchmark model.
Returns:
  • pandas, polars DataFrame: Metrics with one row per (id, metric) combination and one column per model. If agg_fn is not None, there is only one row per metric.
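
For example, using pandas and a dataframe-based metric from utilsforecast.losses, which follows the df/models/id_col/target_col signature described above (model name and values are illustrative):

import numpy as np
import pandas as pd

from hierarchicalforecast.evaluation import evaluate
from utilsforecast.losses import mse  # dataframe-based metric

df = pd.DataFrame({
    "unique_id": ["total", "total", "a", "a", "b", "b"],
    "ds": [1, 2, 1, 2, 1, 2],
    "y": [10.0, 12.0, 6.0, 7.0, 4.0, 5.0],
    "Naive": [9.0, 11.0, 5.5, 6.5, 3.5, 4.5],  # a model column
})
tags = {"level0": np.array(["total"]), "level1": np.array(["a", "b"])}

evaluation = evaluate(df=df, metrics=[mse], tags=tags)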

class HierarchicalEvaluation

Hierarchical Evaluation Class.

You can use your own metrics to evaluate the performance of each level in the structure. The metrics receive y and y_hat as arguments, which are numpy arrays of size (series, horizon). Consider, for example, a function rmse that calculates the root mean squared error, as sketched after the arguments below. This class facilitates measurements across the hierarchy defined by the tags list. See also the aggregate method.

Args:
  • evaluators (list[Callable]): functions with arguments y, y_hat (numpy arrays).
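
A minimal evaluator matching this contract, e.g. the rmse mentioned above:

import numpy as np

def rmse(y: np.ndarray, y_hat: np.ndarray) -> float:
    # y and y_hat are numpy arrays of size (series, horizon)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))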

method __init__

__init__(evaluators: list[Callable])

method evaluate

evaluate(
    Y_hat_df: Union[ForwardRef('DataFrame[Any]'), ForwardRef('LazyFrame[Any]')],
    Y_test_df: Union[ForwardRef('DataFrame[Any]'), ForwardRef('LazyFrame[Any]')],
    tags: dict[str, ndarray],
    Y_df: Optional[Union[ForwardRef('DataFrame[Any]'), ForwardRef('LazyFrame[Any]')]] = None,
    benchmark: Optional[str] = None,
    id_col: str = 'unique_id',
    time_col: str = 'ds',
    target_col: str = 'y'
) → ~FrameT
Hierarchical Evaluation Method.

Args:
  • Y_hat_df (Frame): DataFrame, Forecasts with columns 'unique_id', 'ds' and models to evaluate.
  • Y_test_df (Frame): DataFrame, Observed values with columns ['unique_id', 'ds', 'y'].
  • tags (dict[str, np.ndarray]): Each key is a hierarchy level and its value contains the tags associated to that level.
  • Y_df (Optional[Frame], optional): DataFrame, Training set of base time series with columns ['unique_id', 'ds', 'y']. Default is None.
  • benchmark (Optional[str], optional): If passed, evaluators are scaled by the error of this benchmark model. Default is None.
  • id_col (str, optional): Column that identifies each series. Default is 'unique_id'.
  • time_col (str, optional): Column that identifies each timestep; its values can be timestamps or integers. Default is 'ds'.
  • target_col (str, optional): Column that contains the target. Default is 'y'.
Returns:
  • FrameT: DataFrame with accuracy measurements across hierarchical levels.
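
Example combining the class with an evaluator such as the rmse sketched earlier (dataframes and values are illustrative):

import numpy as np
import pandas as pd

from hierarchicalforecast.evaluation import HierarchicalEvaluation

Y_test_df = pd.DataFrame({
    "unique_id": ["total", "a", "b"],
    "ds": [1, 1, 1],
    "y": [10.0, 6.0, 4.0],
})
Y_hat_df = pd.DataFrame({
    "unique_id": ["total", "a", "b"],
    "ds": [1, 1, 1],
    "Naive": [9.0, 5.5, 3.5],
})
tags = {"level0": np.array(["total"]), "level1": np.array(["a", "b"])}

evaluator = HierarchicalEvaluation(evaluators=[rmse])  # rmse defined above
evaluation = evaluator.evaluate(Y_hat_df=Y_hat_df, Y_test_df=Y_test_df, tags=tags)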