# <kbd>module</kbd> `hierarchicalforecast.evaluation`
---

## <kbd>function</kbd> `mse`

Mean Squared Error.

Calculates the Mean Squared Error between `y` and `y_hat`. MSE measures the relative prediction accuracy of a forecasting method by calculating the squared deviation of the prediction and the true value at a given time, and averages these deviations over the length of the series.

**Args:**

- <b>`y`</b> (np.ndarray): numpy array, Actual values.
- <b>`y_hat`</b> (np.ndarray): numpy array, Predicted values.
- <b>`weights`</b> (Optional[np.ndarray], optional): numpy array, Specifies date stamps per serie to consider in the loss. Default is None.
- <b>`axis`</b> (Optional[int], optional): Axis along which to compute the metric. Default is None.

**Returns:**

- <b>`Union[float, np.ndarray]`</b>: MSE loss, a single value or a numpy array depending on `axis`.
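For reference, with a forecast horizon of length $H$ this matches the standard definition (notation assumed here, chosen to be consistent with the other metrics on this page):

$$ \mathrm{MSE}(\mathbf{y}_{\tau}, \mathbf{\hat{y}}_{\tau}) = \frac{1}{H} \sum^{t+H}_{\tau=t+1} (y_{\tau} - \hat{y}_{\tau})^{2} $$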
---

## <kbd>function</kbd> `mqloss`

Multi-Quantile Loss.

Calculates the Multi-Quantile loss (MQL) between `y` and `y_hat`. MQL calculates the average multi-quantile loss for a given set of quantiles, based on the absolute difference between predicted quantiles and observed values.
The limit behavior of MQL makes it possible to measure the accuracy of a full predictive distribution with the continuous ranked probability score (CRPS). This can be achieved through a numerical integration technique that discretizes the quantiles and treats the CRPS integral with a left Riemann approximation, averaging over uniformly distanced quantiles.
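Concretely, MQL averages the quantile losses $\mathrm{QL}$ over the quantile grid, and the CRPS is its continuous limit (standard definitions, included here for reference):

$$ \mathrm{MQL}(\mathbf{y}_{\tau}, [\mathbf{\hat{y}}^{(q_{1})}_{\tau}, \ldots, \mathbf{\hat{y}}^{(q_{n})}_{\tau}]) = \frac{1}{n} \sum_{q_{i}} \mathrm{QL}(\mathbf{y}_{\tau}, \mathbf{\hat{y}}^{(q_{i})}_{\tau}) $$

$$ \mathrm{CRPS}(F_{\tau}, \mathbf{y}_{\tau}) = \int^{1}_{0} \mathrm{QL}(F_{\tau}, y_{\tau})_{q} \, dq $$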
**Args:**

- <b>`y`</b> (np.ndarray): numpy array, Actual values.
- <b>`y_hat`</b> (np.ndarray): numpy array, Predicted values.
- <b>`quantiles`</b> (np.ndarray): numpy array, Quantiles between 0 and 1 to perform the evaluation upon, of size (`n_quantiles`).
- <b>`weights`</b> (Optional[np.ndarray], optional): numpy array, Specifies date stamps per serie to consider in the loss. Default is None.
- <b>`axis`</b> (Optional[int], optional): Axis along which to compute the metric. Default is None.

**Returns:**

- <b>`Union[float, np.ndarray]`</b>: MQL loss, a single value or a numpy array depending on `axis`.

References:
- [Roger Koenker and Gilbert Bassett, Jr., "Regression Quantiles".](https://www.jstor.org/stable/1913643)
- [James E. Matheson and Robert L. Winkler, "Scoring Rules for Continuous Probability Distributions".](https://www.jstor.org/stable/2629907)

---

## <kbd>function</kbd> `rel_mse`

Relative Mean Squared Error.

Computes the Relative Mean Squared Error (RelMSE), proposed by Hyndman & Koehler (2006) as an alternative to percentage errors to avoid measure instability: the MSE of the forecast is scaled by the MSE of a naive benchmark forecast.
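The scaling takes the following form, where $\mathbf{\hat{y}}^{naive1}$ denotes the naive one-step benchmark forecast (a reconstruction consistent with the Hyndman & Koehler reference below):

$$ \mathrm{RelMSE}(\mathbf{y}, \mathbf{\hat{y}}, \mathbf{\hat{y}}^{naive1}) = \frac{\mathrm{MSE}(\mathbf{y}, \mathbf{\hat{y}})}{\mathrm{MSE}(\mathbf{y}, \mathbf{\hat{y}}^{naive1})} $$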
**Args:**

- <b>`y`</b> (np.ndarray): numpy array, Actual values of size (`n_series`, `horizon`).
- <b>`y_hat`</b> (np.ndarray): numpy array, Predicted values of size (`n_series`, `horizon`).
- <b>`y_train`</b> (np.ndarray): numpy array, Training values.
- <b>`mask`</b> (Optional[np.ndarray], optional): numpy array, Specifies date stamps per serie to consider in the loss. Default is None.

**Returns:**

- <b>`float`</b>: loss.

References:
- [Hyndman, R. J. and Koehler, A. B. (2006). "Another look at measures of forecast accuracy". International Journal of Forecasting, Volume 22, Issue 4.](https://www.sciencedirect.com/science/article/pii/S0169207006000239)
- [Kin G. Olivares, O. Nganba Meetei, Ruijun Ma, Rohan Reddy, Mengfei Cao, Lee Dicker. "Probabilistic Hierarchical Forecasting with Deep Poisson Mixtures". Submitted to the International Journal of Forecasting, working paper available at arXiv.](https://arxiv.org/pdf/2110.13179.pdf)

---

## <kbd>function</kbd> `msse`

Mean Squared Scaled Error.

Computes the Mean Squared Scaled Error (MSSE), proposed by Hyndman & Koehler (2006) as an alternative to percentage errors to avoid measure instability: squared forecast errors are scaled by the in-sample MSE of a naive one-step forecast.
$$ \mathrm{MSSE}(\mathbf{y}, \mathbf{\hat{y}}, \mathbf{y}^{in\text{-}sample}) = \frac{\frac{1}{h} \sum^{n+h}_{\tau=n+1} (y_{\tau} - \hat{y}_{\tau})^{2}}{\frac{1}{n-1} \sum^{n}_{\tau=2} (y_{\tau} - y_{\tau-1})^{2}} $$

where $n$ (`n`) is the size of the training data, and $h$ is the forecasting horizon (`horizon`).
**Args:**

- <b>`y`</b> (np.ndarray): numpy array, Actual values of size (`n_series`, `horizon`).
- <b>`y_hat`</b> (np.ndarray): numpy array, Predicted values of size (`n_series`, `horizon`).
- <b>`y_train`</b> (np.ndarray): numpy array, Training values of size (`n_series`, `n`).
- <b>`mask`</b> (Optional[np.ndarray], optional): numpy array, Specifies date stamps per serie to consider in the loss. Default is None.

**Returns:**

- <b>`float`</b>: loss.

References:
- [Hyndman, R. J. and Koehler, A. B. (2006). "Another look at measures of forecast accuracy". International Journal of Forecasting, Volume 22, Issue 4.](https://www.sciencedirect.com/science/article/pii/S0169207006000239)

---

## <kbd>function</kbd> `scaled_crps`

Scaled Continuous Ranked Probability Score.

Calculates a scaled variation of the CRPS, as proposed by Rangapuram (2021), to measure the accuracy of predicted quantiles `y_hat` compared to the observation `y`.
This metric averages percentage-weighted absolute deviations as defined by the quantile losses:

$$ \mathrm{sCRPS}(\hat{F}_{\tau}, \mathbf{y}_{\tau}) = \frac{2}{N} \sum_{i} \int^{1}_{0} \frac{\mathrm{QL}(\hat{F}_{i,\tau}, y_{i,\tau})_{q}}{\sum_{i} |y_{i,\tau}|} \, dq $$

where $\hat{F}_{\tau}$ is the estimated multivariate distribution, and $y_{i,\tau}$ are its realizations.
**Args:**

- <b>`y`</b> (np.ndarray): numpy array, Actual values of size (`n_series`, `horizon`).
- <b>`y_hat`</b> (np.ndarray): numpy array, Predicted quantiles of size (`n_series`, `horizon`, `n_quantiles`).
- <b>`quantiles`</b> (np.ndarray): numpy array of size (`n_quantiles`), Quantiles to estimate from the distribution of `y`.

**Returns:**

- <b>`float`</b>: loss.

References:
- [Gneiting, Tilmann. (2011). "Quantiles as optimal point forecasts". International Journal of Forecasting.](https://www.sciencedirect.com/science/article/pii/S0169207010000063)
- [Spyros Makridakis, Evangelos Spiliotis, Vassilios Assimakopoulos, Zhi Chen, Anil Gaba, Ilia Tsetlin, Robert L. Winkler. (2022). "The M5 uncertainty competition: Results, findings and conclusions". International Journal of Forecasting.](https://www.sciencedirect.com/science/article/pii/S0169207021001722)
- [Syama Sundar Rangapuram, Lucien D. Werner, Konstantinos Benidis, Pedro Mercado, Jan Gasthaus, Tim Januschowski. (2021). "End-to-End Learning of Coherent Probabilistic Forecasts for Hierarchical Time Series". Proceedings of the 38th International Conference on Machine Learning (ICML).](https://proceedings.mlr.press/v139/rangapuram21a.html)

---

## <kbd>function</kbd> `energy_score`

Energy Score.

Calculates Gneiting's Energy Score sample approximation for
`y` and independent multivariate samples `y_sample1` and `y_sample2`. The Energy Score generalizes the CRPS (`beta`=1) in the multivariate setting.
$$ \mathrm{ES}(\mathbf{y}_{\tau}, \mathbf{\hat{y}}_{\tau}, \mathbf{\hat{y}}_{\tau}') = \frac{1}{2} \mathbb{E}_{\hat{P}} \left[ ||\mathbf{\hat{y}}_{\tau} - \mathbf{\hat{y}}_{\tau}'||^{\beta} \right] - \mathbb{E}_{\hat{P}} \left[ ||\mathbf{y}_{\tau} - \mathbf{\hat{y}}_{\tau}||^{\beta} \right] \quad \beta \in (0,2] $$

where $\mathbf{\hat{y}}_{\tau}, \mathbf{\hat{y}}_{\tau}'$ are independent samples drawn from $\hat{P}$.
**Args:**
- <b>`y`</b> (np.ndarray): numpy array, Actual values of size (`n_series`, `horizon`).
- <b>`y_sample1`</b> (np.ndarray): numpy array, predictive distribution sample of size (`n_series`, `horizon`, `n_samples`).
- <b>`y_sample2`</b> (np.ndarray): numpy array, predictive distribution sample of size (`n_series`, `horizon`, `n_samples`).
- <b>`beta`</b> (float, optional): float in (0,2], defines the Energy Score's power for the Euclidean metric. Default is 2.
**Returns:**
- <b>`float`</b>: score.
References:
- [Gneiting, Tilmann, and Adrian E. Raftery. (2007). "Strictly proper scoring rules, prediction and estimation". Journal of the American Statistical Association.](https://sites.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf)
- [Anastasios Panagiotelis, Puwasala Gamakumara, George Athanasopoulos, Rob J. Hyndman. (2022). "Probabilistic forecast reconciliation: Properties, evaluation and score optimisation". European Journal of Operational Research.](https://www.sciencedirect.com/science/article/pii/S0377221722006087)
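A minimal sketch of calling the sample approximation. The shapes follow the Args above; the random draws are placeholders for samples from an actual predictive distribution:

```python
import numpy as np

from hierarchicalforecast.evaluation import energy_score

rng = np.random.default_rng(0)
n_series, horizon, n_samples = 3, 12, 200

# Observations and two independent sample batches from the predictive distribution.
y = rng.normal(size=(n_series, horizon))
y_sample1 = rng.normal(size=(n_series, horizon, n_samples))
y_sample2 = rng.normal(size=(n_series, horizon, n_samples))

score = energy_score(y, y_sample1, y_sample2, beta=2)  # single float
```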
---
## <kbd>function</kbd> `log_score`
```python
log_score(y, y_hat, cov, allow_singular=True)
```
Log Score.
One of the simplest multivariate probability scoring rules, it evaluates the negative log-density of the predictive distribution at the value of the realisation.
$$ \mathrm{LS}(\mathbf{y}_{\tau}, \mathbf{P}(\theta_{\tau})) = - \log(f(\mathbf{y}_{\tau}, \theta_{\tau})) $$
where $f$ is the density, $\mathbf{P}(\theta_{\tau})$ is a parametric distribution and $f(\mathbf{y}_{\tau}, \theta_{\tau})$ represents its density. For the moment we only support multivariate normal log score.
$$ f(\mathbf{y}_{\tau}, \theta_{\tau}) = (2\pi )^{-k/2}\det({\boldsymbol{\Sigma }})^{-1/2} \,\exp \left(
-{\frac {1}{2}}(\mathbf{y}_{\tau} -\hat{\mathbf{y}}_{\tau})^{\!{\mathsf{T}}} {\boldsymbol{\Sigma }}^{-1} (\mathbf{y}_{\tau} -\hat{\mathbf{y}}_{\tau}) \right) $$
**Args:**
- <b>`y`</b> (np.ndarray): numpy array, Actual values of size (`n_series`, `horizon`).
- <b>`y_hat`</b> (np.ndarray): numpy array, Predicted values (`n_series`, `horizon`).
- <b>`cov`</b> (np.ndarray): numpy matrix, Predicted values covariance (`n_series`, `n_series`, `horizon`).
- <b>`allow_singular`</b> (bool, optional): If True, allows a singular covariance matrix. Default is True.
**Returns:**
- <b>`float`</b>: score.
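A minimal sketch of calling the multivariate normal log score. The shapes follow the Args above; the identity covariance is a placeholder:

```python
import numpy as np

from hierarchicalforecast.evaluation import log_score

rng = np.random.default_rng(0)
n_series, horizon = 3, 12

y = rng.normal(size=(n_series, horizon))
y_hat = rng.normal(size=(n_series, horizon))
# One (n_series, n_series) covariance per horizon step, stacked along the last axis.
cov = np.stack([np.eye(n_series)] * horizon, axis=-1)

score = log_score(y, y_hat, cov, allow_singular=True)
```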
---
## <kbd>function</kbd> `evaluate`
```python
evaluate(
df: ~FrameT,
metrics: list[Callable],
tags: dict[str, ndarray],
models: Optional[list[str]] = None,
train_df: Optional[~FrameT] = None,
level: Optional[list[int]] = None,
id_col: str = 'unique_id',
time_col: str = 'ds',
target_col: str = 'y',
agg_fn: Optional[str] = 'mean',
benchmark: Optional[str] = None
) → ~FrameT
```
Evaluate hierarchical forecast using different metrics.
Parameters
----------
df : pandas, polars, dask or spark DataFrame
    Forecasts to evaluate. Must have `id_col`, `time_col`, `target_col` and models' predictions.
metrics : list of callable
    Functions with arguments `df`, `models`, `id_col`, `target_col` and optionally `train_df`.
tags : dict
    Each key is a level in the hierarchy and its value contains tags associated to that level.
models : list of str, optional (default=None)
    Names of the models to evaluate. If `None` will use every column in the dataframe after removing id, time and target.
train_df : pandas, polars, dask or spark DataFrame, optional (default=None)
    Training set. Used to evaluate metrics such as `mase`.
level : list of int, optional (default=None)
    Prediction interval levels. Used to compute losses that rely on quantiles.
id_col : str (default='unique_id')
    Column that identifies each serie.
time_col : str (default='ds')
    Column that identifies each timestep, its values can be timestamps or integers.
target_col : str (default='y')
    Column that contains the target.
agg_fn : str, optional (default='mean')
    Statistic to compute on the scores by id to reduce them to a single number.
benchmark : str, optional (default=None)
    If passed, evaluators are scaled by the error of this benchmark model.

Returns
-------
pandas or polars DataFrame
    Metrics with one row per (id, metric) combination and one column per model. If `agg_fn` is not `None`, there is only one row per metric.
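A minimal usage sketch. `Y_hat_df`, `Y_test_df`, `Y_df` and `tags` are assumed to come from an aggregation and reconciliation pipeline (see `hierarchicalforecast.utils.aggregate`); the metric callables come from `utilsforecast.losses`, which match the signature `evaluate` expects, and `seasonality=12` is an assumption for monthly data:

```python
from functools import partial

from hierarchicalforecast.evaluation import evaluate
from utilsforecast.losses import mase, rmse

evaluation = evaluate(
    df=Y_hat_df.merge(Y_test_df, on=["unique_id", "ds"]),  # forecasts joined with actuals
    metrics=[rmse, partial(mase, seasonality=12)],
    tags=tags,
    train_df=Y_df,  # required by scale-dependent metrics such as `mase`
    agg_fn="mean",  # one aggregated row per (level, metric)
)
```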
---
## <kbd>class</kbd> `HierarchicalEvaluation`
Hierarchical Evaluation Class.
You can use your own metrics to evaluate the performance of each level in the structure. The metrics receive `y` and `y_hat` as arguments and they are numpy arrays of size `(series, horizon)`. Consider, for example, the function `rmse` that calculates the root mean squared error.
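A minimal sketch of such an evaluator; the function name and per-series reduction are illustrative, not part of the library:

```python
import numpy as np

def rmse(y: np.ndarray, y_hat: np.ndarray) -> np.ndarray:
    # `y` and `y_hat` have shape (series, horizon); reducing over axis=1
    # yields one root mean squared error per series.
    return np.sqrt(np.mean((y - y_hat) ** 2, axis=1))
```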
This class facilitates measurements across the hierarchy, defined by the `tags` dictionary. See also the [aggregate method](https://nixtla.github.io/hierarchicalforecast/utils.html#aggregate).
**Args:**
- <b>`evaluators`</b> (list[Callable]): functions with arguments `y`, `y_hat` (numpy arrays).
### <kbd>method</kbd> `__init__`
```python
__init__(evaluators: list[Callable])
```
---
### <kbd>method</kbd> `evaluate`
```python
evaluate(
Y_hat_df: ~FrameT,
Y_test_df: ~FrameT,
tags: dict[str, ndarray],
Y_df: Optional[~FrameT] = None,
benchmark: Optional[str] = None,
id_col: str = 'unique_id',
time_col: str = 'ds',
target_col: str = 'y'
) → ~FrameT
```
Hierarchical Evaluation Method.
**Args:**
- <b>`Y_hat_df`</b> (Frame): DataFrame, Forecasts with columns `'unique_id'`, `'ds'` and models to evaluate.
- <b>`Y_test_df`</b> (Frame): DataFrame, Observed values with columns `['unique_id', 'ds', 'y']`.
- <b>`tags`</b> (dict[str, np.ndarray]): np.array, each str key is a level and its value contains tags associated to that level.
- <b>`Y_df`</b> (Optional[Frame], optional): DataFrame, Training set of base time series with columns `['unique_id', 'ds', 'y']`. Default is None.
- <b>`benchmark`</b> (Optional[str], optional): If passed, evaluators are scaled by the error of this benchmark. Default is None.
- <b>`id_col`</b> (str, optional): Column that identifies each serie. Default is "unique_id".
- <b>`time_col`</b> (str, optional): Column that identifies each timestep, its values can be timestamps or integers. Default is "ds".
- <b>`target_col`</b> (str, optional): Column that contains the target. Default is "y".
**Returns:**
- <b>`FrameT`</b>: DataFrame with accuracy measurements across hierarchical levels.
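A minimal usage sketch, assuming `Y_hat_df`, `Y_test_df` and `tags` come from an aggregation and reconciliation pipeline, and reusing the `rmse` evaluator sketched above:

```python
from hierarchicalforecast.evaluation import HierarchicalEvaluation

evaluator = HierarchicalEvaluation(evaluators=[rmse])
evaluation = evaluator.evaluate(
    Y_hat_df=Y_hat_df,    # forecasts: 'unique_id', 'ds' and one column per model
    Y_test_df=Y_test_df,  # observed values: ['unique_id', 'ds', 'y']
    tags=tags,            # hierarchy level -> member series
)
```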