Understanding and debugging hierarchical forecast reconciliation
After reconciling hierarchical forecasts, practitioners often need to answer questions like:
  • How incoherent were my base forecasts? Did they significantly violate the hierarchical constraints?
  • How much did reconciliation change the forecasts? Which levels were adjusted the most?
  • Did reconciliation introduce problems, such as negative values where they shouldn't exist?
  • Are the reconciled forecasts numerically coherent within acceptable tolerance?
The HierarchicalReconciliation class provides an optional diagnostics=True parameter that generates a comprehensive report answering these questions. This notebook demonstrates the diagnostics feature through three practical use cases. You can run these experiments on CPU or GPU in Google Colab.

Setup

!pip install hierarchicalforecast statsforecast datasetsforecast
import numpy as np
import pandas as pd

from datasetsforecast.hierarchical import HierarchicalData, HierarchicalInfo
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, Naive

from hierarchicalforecast.core import HierarchicalReconciliation
from hierarchicalforecast.methods import BottomUp, TopDown, MinTrace

Load Data

We’ll use the TourismSmall dataset, which has a 4-level hierarchy:
  • Country (1 node)
  • Country/Purpose (4 nodes)
  • Country/Purpose/State (28 nodes)
  • Country/Purpose/State/CityNonCity (56 nodes, the bottom level)
group_name = 'TourismSmall'
group = HierarchicalInfo.get_group(group_name)
Y_df, S_df, tags = HierarchicalData.load('./data', group_name)
S_df = S_df.reset_index(names="unique_id")
Y_df['ds'] = pd.to_datetime(Y_df['ds'])

# Train/test split
Y_test_df = Y_df.groupby('unique_id').tail(group.horizon)
Y_train_df = Y_df.drop(Y_test_df.index)

print(f"Hierarchy levels: {list(tags.keys())}")
print(f"Total series: {len(S_df)}")
print(f"Bottom series: {S_df.shape[1] - 1}")
Hierarchy levels: ['Country', 'Country/Purpose', 'Country/Purpose/State', 'Country/Purpose/State/CityNonCity']
Total series: 89
Bottom series: 56
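
Before generating forecasts, it helps to see what "coherence" means concretely: every series should equal the S-weighted sum of the bottom-level series. Here is a minimal sketch (assuming the column layout produced by the load step above) that checks this on the training data, which is coherent by construction:

# Coherence means y = S @ y_bottom for every timestamp. The historical data
# satisfies this by construction, so the violation printed below should be ~0.
Y_wide = Y_train_df.pivot(index='ds', columns='unique_id', values='y')
S = S_df.set_index('unique_id')  # rows: all 89 series, columns: 56 bottom series
violation = Y_wide[S.index] - Y_wide[S.columns] @ S.T
print(f"Max coherence violation in training data: {violation.abs().max().max():.6f}")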

Generate Base Forecasts

fcst = StatsForecast(
    models=[AutoARIMA(season_length=group.seasonality), Naive()],
    freq="QE",
    n_jobs=-1
)
Y_hat_df = fcst.forecast(df=Y_train_df, h=group.horizon)
Y_hat_df.head()
| | unique_id | ds | AutoARIMA | Naive |
|---|---|---|---|---|
| 0 | bus | 2006-03-31 | 8918.478516 | 11547.0 |
| 1 | bus | 2006-06-30 | 9581.925781 | 11547.0 |
| 2 | bus | 2006-09-30 | 11194.676758 | 11547.0 |
| 3 | bus | 2006-12-31 | 10678.958008 | 11547.0 |
| 4 | hol | 2006-03-31 | 42805.347656 | 26418.0 |

Use Case 1: Verifying Reconciliation Quality

Scenario: You’ve just run reconciliation and want to verify that it worked correctly: that base forecasts were indeed incoherent and reconciliation fixed them. The diagnostics report answers:
  • Were the base forecasts incoherent? (coherence residuals before > 0)
  • Are the reconciled forecasts coherent? (coherence residuals after ≈ 0)
  • Is numerical coherence satisfied within tolerance?
# Run reconciliation with diagnostics
hrec = HierarchicalReconciliation(reconcilers=[BottomUp(), MinTrace(method='ols')])
Y_rec_df = hrec.reconcile(
    Y_hat_df=Y_hat_df,
    Y_df=Y_train_df,
    S_df=S_df,
    tags=tags,
    diagnostics=True  # Enable diagnostics
)
# View overall coherence verification
coherence_metrics = hrec.diagnostics.query(
    "level == 'Overall' and metric in "
    "['coherence_residual_mae_before', 'coherence_residual_mae_after', 'is_coherent', 'coherence_max_violation']"
)
coherence_metrics
| | level | metric | AutoARIMA/BottomUp | Naive/BottomUp | AutoARIMA/MinTrace_method-ols | Naive/MinTrace_method-ols |
|---|---|---|---|---|---|---|
| 48 | Overall | coherence_residual_mae_before | 91.123692 | 0.0 | 91.123692 | 0.0 |
| 50 | Overall | coherence_residual_mae_after | 0.000000 | 0.0 | 0.000000 | 0.0 |
| 60 | Overall | is_coherent | 1.000000 | 1.0 | 1.000000 | 1.0 |
| 61 | Overall | coherence_max_violation | 0.000000 | 0.0 | 0.000000 | 0.0 |
Interpretation:
  • coherence_residual_mae_before > 0: Base forecasts violated hierarchical constraints
  • coherence_residual_mae_after ≈ 0: Reconciliation fixed the incoherence
  • is_coherent = 1.0: Reconciled forecasts satisfy constraints within tolerance
  • coherence_max_violation: Maximum deviation from perfect coherence (should be tiny)
# View coherence residuals by hierarchy level
residuals_by_level = hrec.diagnostics.query(
    "metric in ['coherence_residual_mae_before', 'coherence_residual_mae_after']"
).pivot(index='level', columns='metric')
residuals_by_level
(before = coherence_residual_mae_before, after = coherence_residual_mae_after)

| level | AutoARIMA/BottomUp (before) | AutoARIMA/BottomUp (after) | Naive/BottomUp (before) | Naive/BottomUp (after) | AutoARIMA/MinTrace_method-ols (before) | AutoARIMA/MinTrace_method-ols (after) | Naive/MinTrace_method-ols (before) | Naive/MinTrace_method-ols (after) |
|---|---|---|---|---|---|---|---|---|
| Country | 1551.154858 | 0.0 | 0.0 | 0.0 | 1551.154858 | 0.0 | 0.0 | 0.0 |
| Country/Purpose | 996.859118 | 0.0 | 0.0 | 0.0 | 996.859118 | 0.0 | 0.0 | 0.0 |
| Country/Purpose/State | 91.836329 | 0.0 | 0.0 | 0.0 | 91.836329 | 0.0 | 0.0 | 0.0 |
| Country/Purpose/State/CityNonCity | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| Overall | 91.123692 | 0.0 | 0.0 | 0.0 | 91.123692 | 0.0 | 0.0 | 0.0 |
Note that bottom-level series always have 0 coherence residual (they define the hierarchy), while aggregate levels show how much they deviated from the sum of their children.
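To build intuition for these numbers, we can recompute one of them by hand. A short sketch, assuming the coherence residual for each series is ŷ − (S ŷ_bottom), which is consistent with the bottom-level residuals being exactly zero:

# Recompute the Country-level residual MAE for the base AutoARIMA forecasts.
# This should roughly match coherence_residual_mae_before for 'Country' (~1551.15).
S = S_df.set_index('unique_id')
Y_hat_wide = Y_hat_df.pivot(index='ds', columns='unique_id', values='AutoARIMA')
residual = Y_hat_wide[S.index] - Y_hat_wide[S.columns] @ S.T
print(f"Manual Country residual MAE: {residual[tags['Country']].abs().mean().mean():.6f}")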

Use Case 2: Comparing Reconciliation Methods

Scenario: You want to understand how different reconciliation methods affect your forecasts. Which method makes smaller adjustments? Which levels are most impacted? The diagnostics report helps compare:
  • Adjustment magnitude (MAE, RMSE, max) across methods
  • Which hierarchy levels each method adjusts the most
# Run multiple reconciliation methods
hrec_compare = HierarchicalReconciliation(reconcilers=[
    BottomUp(),
    TopDown(method='forecast_proportions'),
    MinTrace(method='ols'),
    MinTrace(method='wls_struct'),
])
Y_rec_compare = hrec_compare.reconcile(
    Y_hat_df=Y_hat_df,
    Y_df=Y_train_df,
    S_df=S_df,
    tags=tags,
    diagnostics=True
)
# Compare adjustment magnitude across methods (Overall level)
adjustment_comparison = hrec_compare.diagnostics.query(
    "level == 'Overall' and metric in ['adjustment_mae', 'adjustment_rmse', 'adjustment_max']"
)
adjustment_comparison
| | level | metric | AutoARIMA/BottomUp | Naive/BottomUp | AutoARIMA/TopDown_method-forecast_proportions | Naive/TopDown_method-forecast_proportions | AutoARIMA/MinTrace_method-ols | Naive/MinTrace_method-ols | AutoARIMA/MinTrace_method-wls_struct | Naive/MinTrace_method-wls_struct |
|---|---|---|---|---|---|---|---|---|---|---|
| 52 | Overall | adjustment_mae | 91.123692 | 0.0 | 152.381830 | 0.0 | 125.796357 | 7.790422e-13 | 92.567005 | 3.649316e-13 |
| 53 | Overall | adjustment_rmse | 361.699708 | 0.0 | 327.852747 | 0.0 | 235.618628 | 1.956331e-12 | 297.653444 | 7.211469e-13 |
| 54 | Overall | adjustment_max | 3563.736473 | 0.0 | 2354.425237 | 0.0 | 1367.921921 | 1.455192e-11 | 2621.788616 | 3.637979e-12 |
Key insights:
  • BottomUp only adjusts aggregate levels (the bottom level has 0 adjustment), as the sanity check below confirms
  • TopDown only adjusts bottom levels (the top level has 0 adjustment)
  • MinTrace methods distribute adjustments across all levels, typically with smaller overall adjustments
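We can sanity-check the BottomUp pattern directly on the reconciled frame, assuming (as the tables above suggest) that reconcile returns the base model columns alongside the Model/Reconciler columns:

# Bottom-level forecasts should be untouched by BottomUp reconciliation.
bottom_ids = tags['Country/Purpose/State/CityNonCity']
bottom = Y_rec_compare[Y_rec_compare['unique_id'].isin(bottom_ids)]
max_change = (bottom['AutoARIMA/BottomUp'] - bottom['AutoARIMA']).abs().max()
print(f"Max bottom-level change under BottomUp: {max_change:.6f}")  # expect ~0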
# Compare adjustments by hierarchy level for AutoARIMA forecasts
adjustment_by_level = hrec_compare.diagnostics.query("metric == 'adjustment_mae'")

# Pivot for easier comparison: keep the AutoARIMA columns and strip the model prefix
adjustment_pivot = adjustment_by_level.set_index('level').drop(columns=['metric'])
adjustment_pivot = adjustment_pivot[[c for c in adjustment_pivot.columns if c.startswith('AutoARIMA/')]]
adjustment_pivot.columns = [c.replace('AutoARIMA/', '') for c in adjustment_pivot.columns]
adjustment_pivot
| level | BottomUp | TopDown_method-forecast_proportions | MinTrace_method-ols | MinTrace_method-wls_struct |
|---|---|---|---|---|
| Country | 1551.154858 | 0.000000 | 924.028186 | 1953.754301 |
| Country/Purpose | 996.859118 | 1106.796143 | 875.789096 | 666.870396 |
| Country/Purpose/State | 91.836329 | 151.248239 | 114.460983 | 61.695544 |
| Country/Purpose/State/CityNonCity | 0.000000 | 87.497279 | 63.638995 | 33.745576 |
| Overall | 91.123692 | 152.381830 | 125.796357 | 92.567005 |
This shows how each method distributes adjustments across hierarchy levels. BottomUp concentrates changes at aggregate levels, TopDown at bottom levels, and MinTrace spreads adjustments more evenly.
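A quick bar chart (using matplotlib, which is not required by the library) makes the pattern easy to see:

import matplotlib.pyplot as plt

# Plot mean absolute adjustment per hierarchy level for each method.
ax = adjustment_pivot.drop(index='Overall').plot.bar(figsize=(10, 4), rot=30)
ax.set_ylabel('adjustment_mae')
ax.set_title('Mean absolute adjustment by level (AutoARIMA base forecasts)')
plt.tight_layout()
plt.show()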

Use Case 3: Detecting Negative Value Issues

Scenario: Your forecasts represent quantities that cannot be negative (e.g., sales, visitors). You need to check if reconciliation introduced negative values. The diagnostics report tracks:
  • negative_count_before/after: Count of negative values before and after reconciliation
  • negative_introduced: Negatives created by reconciliation
  • negative_removed: Negatives fixed by reconciliation
# Create forecasts with some negative values to demonstrate
Y_hat_with_negatives = Y_hat_df.copy()

# Introduce negative base forecasts in the first 10 bottom-level series
bottom_ids = tags['Country/Purpose/State/CityNonCity']
mask = Y_hat_with_negatives['unique_id'].isin(bottom_ids[:10])
Y_hat_with_negatives.loc[mask, 'AutoARIMA'] -= 5000
Y_hat_with_negatives.loc[mask, 'Naive'] -= 5000

print(f"Negative forecasts introduced for AutoARIMA: {(Y_hat_with_negatives['AutoARIMA'] < 0).sum()}")
print(f"Negative forecasts introduced for Naive: {(Y_hat_with_negatives['Naive'] < 0).sum()}")
Negative forecasts introduced for AutoARIMA: 33
Negative forecasts introduced for Naive: 36
# Run reconciliation with diagnostics
hrec_neg = HierarchicalReconciliation(reconcilers=[
    BottomUp(),
    MinTrace(method='ols'),
    MinTrace(method='ols', nonnegative=True),  # Non-negative constraint
])
Y_rec_neg = hrec_neg.reconcile(
    Y_hat_df=Y_hat_with_negatives,
    Y_df=Y_train_df,
    S_df=S_df,
    tags=tags,
    diagnostics=True
)
# Check negative value metrics at Overall level
negative_metrics = hrec_neg.diagnostics.query(
    "level == 'Overall' and metric in "
    "['negative_count_before', 'negative_count_after', 'negative_introduced', 'negative_removed']"
)
negative_metrics
| | level | metric | AutoARIMA/BottomUp | Naive/BottomUp | AutoARIMA/MinTrace_method-ols | Naive/MinTrace_method-ols | AutoARIMA/MinTrace_method-ols_nonnegative-True | Naive/MinTrace_method-ols_nonnegative-True |
|---|---|---|---|---|---|---|---|---|
| 56 | Overall | negative_count_before | 33.0 | 36.0 | 33.0 | 36.0 | 33.0 | 36.0 |
| 57 | Overall | negative_count_after | 55.0 | 60.0 | 3.0 | 4.0 | 0.0 | 0.0 |
| 58 | Overall | negative_introduced | 22.0 | 24.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 59 | Overall | negative_removed | 0.0 | 0.0 | 30.0 | 32.0 | 33.0 | 36.0 |
Interpretation:
  • negative_count_before: Negatives in base forecasts
  • negative_count_after: Negatives after reconciliation
  • negative_introduced: New negatives created by reconciliation (bad!)
  • negative_removed: Negatives fixed by reconciliation (good!)
Notice how MinTrace with nonnegative=True eliminates all negative values.
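We can confirm this directly on the reconciled forecasts; the column names below follow the Model/Reconciler pattern shown in the table above:

# Count remaining negatives per reconciler for the AutoARIMA base forecasts.
for col in ['AutoARIMA/BottomUp', 'AutoARIMA/MinTrace_method-ols',
            'AutoARIMA/MinTrace_method-ols_nonnegative-True']:
    print(f"{col}: {(Y_rec_neg[col] < 0).sum()} negative forecasts")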
# Check which levels have negative value issues
negatives_by_level = hrec_neg.diagnostics.query(
    "metric in ['negative_count_before', 'negative_count_after']"
).pivot(index='level', columns='metric')
negatives_by_level
negative_count_before:

| level | AutoARIMA/BottomUp | Naive/BottomUp | AutoARIMA/MinTrace_method-ols | Naive/MinTrace_method-ols | AutoARIMA/MinTrace_method-ols_nonnegative-True | Naive/MinTrace_method-ols_nonnegative-True |
|---|---|---|---|---|---|---|
| Country | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Country/Purpose | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Country/Purpose/State | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Country/Purpose/State/CityNonCity | 15.0 | 16.0 | 15.0 | 16.0 | 15.0 | 16.0 |
| Overall | 15.0 | 16.0 | 15.0 | 16.0 | 15.0 | 16.0 |

negative_count_after:

| level | AutoARIMA/BottomUp | Naive/BottomUp | AutoARIMA/MinTrace_method-ols | Naive/MinTrace_method-ols | AutoARIMA/MinTrace_method-ols_nonnegative-True | Naive/MinTrace_method-ols_nonnegative-True |
|---|---|---|---|---|---|---|
| Country | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Country/Purpose | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Country/Purpose/State | 7.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Country/Purpose/State/CityNonCity | 15.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Overall | 22.0 | 24.0 | 0.0 | 0.0 | 0.0 | 0.0 |
This shows that BottomUp propagates negatives from bottom to aggregate levels, while standard MinTrace may spread negatives further. The nonnegative MinTrace variant addresses this.

Exporting Diagnostics

The diagnostics DataFrame can be exported to CSV for CI pipelines, benchmarks, or sharing with stakeholders.
# Export full diagnostics report
# hrec.diagnostics.to_csv("reconciliation_diagnostics.csv", index=False)

# Or export a summary
summary = hrec.diagnostics.query("level == 'Overall'").copy()
summary
| | level | metric | AutoARIMA/BottomUp | Naive/BottomUp | AutoARIMA/MinTrace_method-ols | Naive/MinTrace_method-ols |
|---|---|---|---|---|---|---|
| 48 | Overall | coherence_residual_mae_before | 91.123692 | 0.0 | 91.123692 | 0.000000e+00 |
| 49 | Overall | coherence_residual_rmse_before | 361.699708 | 0.0 | 361.699708 | 0.000000e+00 |
| 50 | Overall | coherence_residual_mae_after | 0.000000 | 0.0 | 0.000000 | 0.000000e+00 |
| 51 | Overall | coherence_residual_rmse_after | 0.000000 | 0.0 | 0.000000 | 0.000000e+00 |
| 52 | Overall | adjustment_mae | 91.123692 | 0.0 | 125.796357 | 7.790422e-13 |
| 53 | Overall | adjustment_rmse | 361.699708 | 0.0 | 235.618628 | 1.956331e-12 |
| 54 | Overall | adjustment_max | 3563.736473 | 0.0 | 1367.921921 | 1.455192e-11 |
| 55 | Overall | adjustment_mean | 29.283713 | 0.0 | 46.279825 | -5.114311e-13 |
| 56 | Overall | negative_count_before | 0.000000 | 0.0 | 0.000000 | 0.000000e+00 |
| 57 | Overall | negative_count_after | 0.000000 | 0.0 | 2.000000 | 0.000000e+00 |
| 58 | Overall | negative_introduced | 0.000000 | 0.0 | 2.000000 | 0.000000e+00 |
| 59 | Overall | negative_removed | 0.000000 | 0.0 | 0.000000 | 0.000000e+00 |
| 60 | Overall | is_coherent | 1.000000 | 1.0 | 1.000000 | 1.000000e+00 |
| 61 | Overall | coherence_max_violation | 0.000000 | 0.0 | 0.000000 | 0.000000e+00 |
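
As a sketch of the CI use mentioned above (the thresholds and failure policy here are illustrative choices, not part of the library), a pipeline step could flag any method that introduced negatives or broke coherence:

# Flag methods that introduced negatives or failed the coherence check.
overall = hrec.diagnostics.query("level == 'Overall'").set_index('metric')
for method in [c for c in overall.columns if c != 'level']:
    if overall.loc['negative_introduced', method] > 0:
        print(f"WARNING: {method} introduced negative forecasts")
    if overall.loc['is_coherent', method] < 1.0:
        print(f"WARNING: {method} is not coherent within tolerance")

On the run above, this would flag AutoARIMA/MinTrace_method-ols, which introduced two negative values.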

Summary of Diagnostic Metrics

| Metric | Description | Interpretation |
|---|---|---|
| coherence_residual_mae_before | Mean absolute incoherence before reconciliation | Higher = more incoherent base forecasts |
| coherence_residual_mae_after | Mean absolute incoherence after reconciliation | Should be ~0 |
| coherence_residual_rmse_before/after | RMSE variant of above | More sensitive to large violations |
| adjustment_mae | Mean absolute change made by reconciliation | Higher = more forecast modification |
| adjustment_rmse | RMSE of adjustments | More sensitive to large changes |
| adjustment_max | Maximum absolute adjustment | Identifies extreme changes |
| adjustment_mean | Mean adjustment (signed) | Shows directional bias |
| negative_count_before | Count of negatives in base forecasts | - |
| negative_count_after | Count of negatives after reconciliation | Should be 0 for non-negative data |
| negative_introduced | Negatives created by reconciliation | Warning sign if > 0 |
| negative_removed | Negatives fixed by reconciliation | Good if > 0 |
| is_coherent | Whether forecasts satisfy constraints (Overall only) | 1.0 = coherent |
| coherence_max_violation | Maximum coherence violation (Overall only) | Should be < tolerance |
