Understanding and debugging hierarchical forecast reconciliation
After reconciling hierarchical forecasts, practitioners often need to answer questions like:
  • How incoherent were my base forecasts? Did they significantly violate the hierarchical constraints?
  • How much did reconciliation change the forecasts? Which levels were adjusted the most?
  • Did reconciliation introduce problems, such as negative values where they shouldn't exist?
  • Are the reconciled forecasts numerically coherent within acceptable tolerance?
The HierarchicalReconciliation class provides an optional diagnostics=True parameter that generates a comprehensive report answering these questions. This notebook demonstrates the diagnostics feature through three practical use cases. You can run these experiments on CPU or GPU in Google Colab.

Setup

!pip install hierarchicalforecast statsforecast datasetsforecast
import numpy as np
import pandas as pd

from datasetsforecast.hierarchical import HierarchicalData, HierarchicalInfo
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, Naive

from hierarchicalforecast.core import HierarchicalReconciliation
from hierarchicalforecast.methods import BottomUp, TopDown, MinTrace

Load Data

We’ll use the TourismSmall dataset, which has a 4-level hierarchy:
  • Country (1 node)
  • Country/Purpose (4 nodes)
  • Country/Purpose/State (28 nodes)
  • Country/Purpose/State/CityNonCity (56 nodes, the bottom level)
group_name = 'TourismSmall'
group = HierarchicalInfo.get_group(group_name)
Y_df, S_df, tags = HierarchicalData.load('./data', group_name)
S_df = S_df.reset_index(names="unique_id")
Y_df['ds'] = pd.to_datetime(Y_df['ds'])

# Train/test split
Y_test_df = Y_df.groupby('unique_id').tail(group.horizon)
Y_train_df = Y_df.drop(Y_test_df.index)

print(f"Hierarchy levels: {list(tags.keys())}")
print(f"Total series: {len(S_df)}")
print(f"Bottom series: {S_df.shape[1] - 1}")
Hierarchy levels: ['Country', 'Country/Purpose', 'Country/Purpose/State', 'Country/Purpose/State/CityNonCity']
Total series: 89
Bottom series: 56
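
Before generating forecasts, it helps to see what "coherence" means concretely: every series should equal the S-weighted sum of the bottom-level series. Here is a minimal sketch (assuming the column layout produced by the load step above) that checks this on the training data, which is coherent by construction:

# Coherence means y = S @ y_bottom for every timestamp. The historical data
# satisfies this by construction, so the violation printed below should be ~0.
Y_wide = Y_train_df.pivot(index='ds', columns='unique_id', values='y')
S = S_df.set_index('unique_id')  # rows: all 89 series, columns: 56 bottom series
violation = Y_wide[S.index] - Y_wide[S.columns] @ S.T
print(f"Max coherence violation in training data: {violation.abs().max().max():.6f}")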

Generate Base Forecasts

fcst = StatsForecast(
    models=[AutoARIMA(season_length=group.seasonality), Naive()],
    freq="QE",
    n_jobs=-1
)
Y_hat_df = fcst.forecast(df=Y_train_df, h=group.horizon)
Y_hat_df.head()
| | unique_id | ds | AutoARIMA | Naive |
|---|---|---|---|---|
| 0 | bus | 2006-03-31 | 8918.478516 | 11547.0 |
| 1 | bus | 2006-06-30 | 9581.925781 | 11547.0 |
| 2 | bus | 2006-09-30 | 11194.676758 | 11547.0 |
| 3 | bus | 2006-12-31 | 10678.958008 | 11547.0 |
| 4 | hol | 2006-03-31 | 42805.347656 | 26418.0 |

Use Case 1: Verifying Reconciliation Quality

Scenario: You’ve just run reconciliation and want to verify that it worked correctly: that base forecasts were indeed incoherent and reconciliation fixed them. The diagnostics report answers:
  • Were the base forecasts incoherent? (coherence residuals before > 0)
  • Are the reconciled forecasts coherent? (coherence residuals after ≈ 0)
  • Is numerical coherence satisfied within tolerance?
# Run reconciliation with diagnostics
hrec = HierarchicalReconciliation(reconcilers=[BottomUp(), MinTrace(method='ols')])
Y_rec_df = hrec.reconcile(
    Y_hat_df=Y_hat_df,
    Y_df=Y_train_df,
    S_df=S_df,
    tags=tags,
    diagnostics=True  # Enable diagnostics
)
# View overall coherence verification
coherence_metrics = hrec.diagnostics.query(
    "level == 'Overall' and metric in "
    "['coherence_residual_mae_before', 'coherence_residual_mae_after', 'is_coherent', 'coherence_max_violation']"
)
coherence_metrics
| | level | metric | AutoARIMA/BottomUp | Naive/BottomUp | AutoARIMA/MinTrace_method-ols | Naive/MinTrace_method-ols |
|---|---|---|---|---|---|---|
| 48 | Overall | coherence_residual_mae_before | 91.123692 | 0.0 | 91.123692 | 0.0 |
| 50 | Overall | coherence_residual_mae_after | 0.000000 | 0.0 | 0.000000 | 0.0 |
| 60 | Overall | is_coherent | 1.000000 | 1.0 | 1.000000 | 1.0 |
| 61 | Overall | coherence_max_violation | 0.000000 | 0.0 | 0.000000 | 0.0 |
Interpretation:
  • coherence_residual_mae_before > 0: Base forecasts violated hierarchical constraints
  • coherence_residual_mae_after ≈ 0: Reconciliation fixed the incoherence
  • is_coherent = 1.0: Reconciled forecasts satisfy constraints within tolerance
  • coherence_max_violation: Maximum deviation from perfect coherence (should be tiny)
# View coherence residuals by hierarchy level
residuals_by_level = hrec.diagnostics.query(
    "metric in ['coherence_residual_mae_before', 'coherence_residual_mae_after']"
).pivot(index='level', columns='metric')
residuals_by_level
(before = coherence_residual_mae_before, after = coherence_residual_mae_after)

| level | AutoARIMA/BottomUp (before) | AutoARIMA/BottomUp (after) | Naive/BottomUp (before) | Naive/BottomUp (after) | AutoARIMA/MinTrace_method-ols (before) | AutoARIMA/MinTrace_method-ols (after) | Naive/MinTrace_method-ols (before) | Naive/MinTrace_method-ols (after) |
|---|---|---|---|---|---|---|---|---|
| Country | 1551.154858 | 0.0 | 0.0 | 0.0 | 1551.154858 | 0.0 | 0.0 | 0.0 |
| Country/Purpose | 996.859118 | 0.0 | 0.0 | 0.0 | 996.859118 | 0.0 | 0.0 | 0.0 |
| Country/Purpose/State | 91.836329 | 0.0 | 0.0 | 0.0 | 91.836329 | 0.0 | 0.0 | 0.0 |
| Country/Purpose/State/CityNonCity | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 |
| Overall | 91.123692 | 0.0 | 0.0 | 0.0 | 91.123692 | 0.0 | 0.0 | 0.0 |
Note that bottom-level series always have 0 coherence residual (they define the hierarchy), while aggregate levels show how much they deviated from the sum of their children.
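To build intuition for these numbers, we can recompute one of them by hand. A short sketch, assuming the coherence residual for each series is ŷ − (S ŷ_bottom), which is consistent with the bottom-level residuals being exactly zero:

# Recompute the Country-level residual MAE for the base AutoARIMA forecasts.
# This should roughly match coherence_residual_mae_before for 'Country' (~1551.15).
S = S_df.set_index('unique_id')
Y_hat_wide = Y_hat_df.pivot(index='ds', columns='unique_id', values='AutoARIMA')
residual = Y_hat_wide[S.index] - Y_hat_wide[S.columns] @ S.T
print(f"Manual Country residual MAE: {residual[tags['Country']].abs().mean().mean():.6f}")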

Use Case 2: Comparing Reconciliation Methods

Scenario: You want to understand how different reconciliation methods affect your forecasts. Which method makes smaller adjustments? Which levels are most impacted? The diagnostics report helps compare:
  • Adjustment magnitude (MAE, RMSE, max) across methods
  • Which hierarchy levels each method adjusts the most
# Run multiple reconciliation methods
hrec_compare = HierarchicalReconciliation(reconcilers=[
    BottomUp(),
    TopDown(method='forecast_proportions'),
    MinTrace(method='ols'),
    MinTrace(method='wls_struct'),
])
Y_rec_compare = hrec_compare.reconcile(
    Y_hat_df=Y_hat_df,
    Y_df=Y_train_df,
    S_df=S_df,
    tags=tags,
    diagnostics=True
)
# Compare adjustment magnitude across methods (Overall level)
adjustment_comparison = hrec_compare.diagnostics.query(
    "level == 'Overall' and metric in ['adjustment_mae', 'adjustment_rmse', 'adjustment_max']"
)
adjustment_comparison
| | level | metric | AutoARIMA/BottomUp | Naive/BottomUp | AutoARIMA/TopDown_method-forecast_proportions | Naive/TopDown_method-forecast_proportions | AutoARIMA/MinTrace_method-ols | Naive/MinTrace_method-ols | AutoARIMA/MinTrace_method-wls_struct | Naive/MinTrace_method-wls_struct |
|---|---|---|---|---|---|---|---|---|---|---|
| 52 | Overall | adjustment_mae | 91.123692 | 0.0 | 152.381830 | 0.0 | 125.796357 | 7.790422e-13 | 92.567005 | 3.649316e-13 |
| 53 | Overall | adjustment_rmse | 361.699708 | 0.0 | 327.852747 | 0.0 | 235.618628 | 1.956331e-12 | 297.653444 | 7.211469e-13 |
| 54 | Overall | adjustment_max | 3563.736473 | 0.0 | 2354.425237 | 0.0 | 1367.921921 | 1.455192e-11 | 2621.788616 | 3.637979e-12 |
Key insights:
  • BottomUp only adjusts aggregate levels (the bottom level has 0 adjustment), as the sanity check below confirms
  • TopDown only adjusts bottom levels (the top level has 0 adjustment)
  • MinTrace methods distribute adjustments across all levels, typically with smaller overall adjustments
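We can sanity-check the BottomUp pattern directly on the reconciled frame, assuming (as the tables above suggest) that reconcile returns the base model columns alongside the Model/Reconciler columns:

# Bottom-level forecasts should be untouched by BottomUp reconciliation.
bottom_ids = tags['Country/Purpose/State/CityNonCity']
bottom = Y_rec_compare[Y_rec_compare['unique_id'].isin(bottom_ids)]
max_change = (bottom['AutoARIMA/BottomUp'] - bottom['AutoARIMA']).abs().max()
print(f"Max bottom-level change under BottomUp: {max_change:.6f}")  # expect ~0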
# Compare adjustments by hierarchy level for AutoARIMA forecasts
adjustment_by_level = hrec_compare.diagnostics.query("metric == 'adjustment_mae'")

# Pivot for easier comparison: keep the AutoARIMA columns and strip the model prefix
adjustment_pivot = adjustment_by_level.set_index('level').drop(columns=['metric'])
adjustment_pivot = adjustment_pivot[[c for c in adjustment_pivot.columns if c.startswith('AutoARIMA/')]]
adjustment_pivot.columns = [c.replace('AutoARIMA/', '') for c in adjustment_pivot.columns]
adjustment_pivot
| level | BottomUp | TopDown_method-forecast_proportions | MinTrace_method-ols | MinTrace_method-wls_struct |
|---|---|---|---|---|
| Country | 1551.154858 | 0.000000 | 924.028186 | 1953.754301 |
| Country/Purpose | 996.859118 | 1106.796143 | 875.789096 | 666.870396 |
| Country/Purpose/State | 91.836329 | 151.248239 | 114.460983 | 61.695544 |
| Country/Purpose/State/CityNonCity | 0.000000 | 87.497279 | 63.638995 | 33.745576 |
| Overall | 91.123692 | 152.381830 | 125.796357 | 92.567005 |
This shows how each method distributes adjustments across hierarchy levels. BottomUp concentrates changes at aggregate levels, TopDown at bottom levels, and MinTrace spreads adjustments more evenly.
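A quick bar chart (using matplotlib, which is not required by the library) makes the pattern easy to see:

import matplotlib.pyplot as plt

# Plot mean absolute adjustment per hierarchy level for each method.
ax = adjustment_pivot.drop(index='Overall').plot.bar(figsize=(10, 4), rot=30)
ax.set_ylabel('adjustment_mae')
ax.set_title('Mean absolute adjustment by level (AutoARIMA base forecasts)')
plt.tight_layout()
plt.show()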

Use Case 3: Detecting Negative Value Issues

Scenario: Your forecasts represent quantities that cannot be negative (e.g., sales, visitors). You need to check if reconciliation introduced negative values. The diagnostics report tracks:
  • negative_count_before/after: Count of negative values before and after reconciliation
  • negative_introduced: Negatives created by reconciliation
  • negative_removed: Negatives fixed by reconciliation
# Create forecasts with some negative values to demonstrate
Y_hat_with_negatives = Y_hat_df.copy()

# Introduce negative base forecasts in the first 10 bottom-level series
bottom_ids = tags['Country/Purpose/State/CityNonCity']
mask = Y_hat_with_negatives['unique_id'].isin(bottom_ids[:10])
Y_hat_with_negatives.loc[mask, 'AutoARIMA'] -= 5000
Y_hat_with_negatives.loc[mask, 'Naive'] -= 5000

print(f"Negative forecasts introduced for AutoARIMA: {(Y_hat_with_negatives['AutoARIMA'] < 0).sum()}")
print(f"Negative forecasts introduced for Naive: {(Y_hat_with_negatives['Naive'] < 0).sum()}")
Negative forecasts introduced for AutoARIMA: 33
Negative forecasts introduced for Naive: 36
# Run reconciliation with diagnostics
hrec_neg = HierarchicalReconciliation(reconcilers=[
    BottomUp(),
    MinTrace(method='ols'),
    MinTrace(method='ols', nonnegative=True),  # Non-negative constraint
])
Y_rec_neg = hrec_neg.reconcile(
    Y_hat_df=Y_hat_with_negatives,
    Y_df=Y_train_df,
    S_df=S_df,
    tags=tags,
    diagnostics=True
)
# Check negative value metrics at Overall level
negative_metrics = hrec_neg.diagnostics.query(
    "level == 'Overall' and metric in "
    "['negative_count_before', 'negative_count_after', 'negative_introduced', 'negative_removed']"
)
negative_metrics
| | level | metric | AutoARIMA/BottomUp | Naive/BottomUp | AutoARIMA/MinTrace_method-ols | Naive/MinTrace_method-ols | AutoARIMA/MinTrace_method-ols_nonnegative-True | Naive/MinTrace_method-ols_nonnegative-True |
|---|---|---|---|---|---|---|---|---|
| 56 | Overall | negative_count_before | 33.0 | 36.0 | 33.0 | 36.0 | 33.0 | 36.0 |
| 57 | Overall | negative_count_after | 55.0 | 60.0 | 3.0 | 4.0 | 0.0 | 0.0 |
| 58 | Overall | negative_introduced | 22.0 | 24.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 59 | Overall | negative_removed | 0.0 | 0.0 | 30.0 | 32.0 | 33.0 | 36.0 |
Interpretation:
  • negative_count_before: Negatives in base forecasts
  • negative_count_after: Negatives after reconciliation
  • negative_introduced: New negatives created by reconciliation (bad!)
  • negative_removed: Negatives fixed by reconciliation (good!)
Notice how MinTrace with nonnegative=True eliminates all negative values.
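We can confirm this directly on the reconciled forecasts; the column names below follow the Model/Reconciler pattern shown in the table above:

# Count remaining negatives per reconciler for the AutoARIMA base forecasts.
for col in ['AutoARIMA/BottomUp', 'AutoARIMA/MinTrace_method-ols',
            'AutoARIMA/MinTrace_method-ols_nonnegative-True']:
    print(f"{col}: {(Y_rec_neg[col] < 0).sum()} negative forecasts")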
# Check which levels have negative value issues
negatives_by_level = hrec_neg.diagnostics.query(
    "metric in ['negative_count_before', 'negative_count_after']"
).pivot(index='level', columns='metric')
negatives_by_level
negative_count_before:

| level | AutoARIMA/BottomUp | Naive/BottomUp | AutoARIMA/MinTrace_method-ols | Naive/MinTrace_method-ols | AutoARIMA/MinTrace_method-ols_nonnegative-True | Naive/MinTrace_method-ols_nonnegative-True |
|---|---|---|---|---|---|---|
| Country | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Country/Purpose | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Country/Purpose/State | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Country/Purpose/State/CityNonCity | 15.0 | 16.0 | 15.0 | 16.0 | 15.0 | 16.0 |
| Overall | 15.0 | 16.0 | 15.0 | 16.0 | 15.0 | 16.0 |

negative_count_after:

| level | AutoARIMA/BottomUp | Naive/BottomUp | AutoARIMA/MinTrace_method-ols | Naive/MinTrace_method-ols | AutoARIMA/MinTrace_method-ols_nonnegative-True | Naive/MinTrace_method-ols_nonnegative-True |
|---|---|---|---|---|---|---|
| Country | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Country/Purpose | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Country/Purpose/State | 7.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Country/Purpose/State/CityNonCity | 15.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Overall | 22.0 | 24.0 | 0.0 | 0.0 | 0.0 | 0.0 |
This shows that BottomUp propagates negatives from bottom to aggregate levels, while standard MinTrace may spread negatives further. The nonnegative MinTrace variant addresses this.

Exporting Diagnostics

The diagnostics DataFrame can be exported to CSV for CI pipelines, benchmarks, or sharing with stakeholders.
# Export full diagnostics report
# hrec.diagnostics.to_csv("reconciliation_diagnostics.csv", index=False)

# Or export a summary
summary = hrec.diagnostics.query("level == 'Overall'").copy()
summary
| | level | metric | AutoARIMA/BottomUp | Naive/BottomUp | AutoARIMA/MinTrace_method-ols | Naive/MinTrace_method-ols |
|---|---|---|---|---|---|---|
| 48 | Overall | coherence_residual_mae_before | 91.123692 | 0.0 | 91.123692 | 0.000000e+00 |
| 49 | Overall | coherence_residual_rmse_before | 361.699708 | 0.0 | 361.699708 | 0.000000e+00 |
| 50 | Overall | coherence_residual_mae_after | 0.000000 | 0.0 | 0.000000 | 0.000000e+00 |
| 51 | Overall | coherence_residual_rmse_after | 0.000000 | 0.0 | 0.000000 | 0.000000e+00 |
| 52 | Overall | adjustment_mae | 91.123692 | 0.0 | 125.796357 | 7.790422e-13 |
| 53 | Overall | adjustment_rmse | 361.699708 | 0.0 | 235.618628 | 1.956331e-12 |
| 54 | Overall | adjustment_max | 3563.736473 | 0.0 | 1367.921921 | 1.455192e-11 |
| 55 | Overall | adjustment_mean | 29.283713 | 0.0 | 46.279825 | -5.114311e-13 |
| 56 | Overall | negative_count_before | 0.000000 | 0.0 | 0.000000 | 0.000000e+00 |
| 57 | Overall | negative_count_after | 0.000000 | 0.0 | 2.000000 | 0.000000e+00 |
| 58 | Overall | negative_introduced | 0.000000 | 0.0 | 2.000000 | 0.000000e+00 |
| 59 | Overall | negative_removed | 0.000000 | 0.0 | 0.000000 | 0.000000e+00 |
| 60 | Overall | is_coherent | 1.000000 | 1.0 | 1.000000 | 1.000000e+00 |
| 61 | Overall | coherence_max_violation | 0.000000 | 0.0 | 0.000000 | 0.000000e+00 |
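
As a sketch of the CI use mentioned above (the thresholds and failure policy here are illustrative choices, not part of the library), a pipeline step could flag any method that introduced negatives or broke coherence:

# Flag methods that introduced negatives or failed the coherence check.
overall = hrec.diagnostics.query("level == 'Overall'").set_index('metric')
for method in [c for c in overall.columns if c != 'level']:
    if overall.loc['negative_introduced', method] > 0:
        print(f"WARNING: {method} introduced negative forecasts")
    if overall.loc['is_coherent', method] < 1.0:
        print(f"WARNING: {method} is not coherent within tolerance")

On the run above, this would flag AutoARIMA/MinTrace_method-ols, which introduced two negative values.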

Summary of Diagnostic Metrics

| Metric | Description | Interpretation |
|---|---|---|
| coherence_residual_mae_before | Mean absolute incoherence before reconciliation | Higher = more incoherent base forecasts |
| coherence_residual_mae_after | Mean absolute incoherence after reconciliation | Should be ~0 |
| coherence_residual_rmse_before/after | RMSE variant of above | More sensitive to large violations |
| adjustment_mae | Mean absolute change made by reconciliation | Higher = more forecast modification |
| adjustment_rmse | RMSE of adjustments | More sensitive to large changes |
| adjustment_max | Maximum absolute adjustment | Identifies extreme changes |
| adjustment_mean | Mean adjustment (signed) | Shows directional bias |
| negative_count_before | Count of negatives in base forecasts | - |
| negative_count_after | Count of negatives after reconciliation | Should be 0 for non-negative data |
| negative_introduced | Negatives created by reconciliation | Warning sign if > 0 |
| negative_removed | Negatives fixed by reconciliation | Good if > 0 |
| is_coherent | Whether forecasts satisfy constraints (Overall only) | 1.0 = coherent |
| coherence_max_violation | Maximum coherence violation (Overall only) | Should be < tolerance |
