> ## Documentation Index
> Fetch the complete documentation index at: https://nixtlaverse.nixtla.io/llms.txt
> Use this file to discover all available pages before exploring further.

# Exogenous Variables

> Aggregating Exogenous Variables for Hierarchical Forecasting

When building forecasting models with exogenous (external) features, it
is important that these features are also coherently aggregated across
the hierarchy — just like the target variable. For example, if
`marketing_spend` is available at the bottom level, the aggregated
series at the state level should contain the *sum* of marketing spend
across its child regions; while a rate-like variable such as `cpi`
should be *averaged*.

The `aggregate` function in `HierarchicalForecast` supports this through
its `exog_vars` parameter: a dictionary mapping column names to their
aggregation strategy (e.g., `'sum'`, `'mean'`, `'min'`, `'max'`). This
lets you aggregate exogenous variables alongside the target in a single
call, and then use them as inputs to models that support exogenous
regressors (e.g., `AutoARIMA`).

In this notebook we demonstrate this workflow on the Australian Domestic
Tourism dataset and compare forecasting performance with and without
exogenous variables.

You can run these experiments using CPU or GPU with Google Colab.

<a href="https://colab.research.google.com/github/Nixtla/hierarchicalforecast/blob/main/nbs/examples/ExogenousVariables.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" />
</a>

```python theme={null}
!pip install hierarchicalforecast statsforecast
```

## 1. Load Data and Add Exogenous Variables

We start from the [Tourism](https://otexts.com/fpp3/tourism.html)
dataset and add two synthetic exogenous variables that illustrate
different aggregation needs:

* **`cpi`** — Consumer Price Index. This is a national-level economic
  indicator, identical across regions within each quarter. Because it
  is a rate (not a count), it should be *averaged* when aggregating.
* **`marketing_spend`** — Regional marketing budget. This varies by
  state and grows over time. Because it represents a total amount, it
  should be *summed* when aggregating.

```python theme={null}
import numpy as np
import pandas as pd
```

```python theme={null}
Y_df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/tourism.csv')
Y_df = Y_df.rename({'Trips': 'y', 'Quarter': 'ds'}, axis=1)
Y_df.insert(0, 'Country', 'Australia')
Y_df = Y_df[['Country', 'Region', 'State', 'Purpose', 'ds', 'y']]
Y_df['ds'] = Y_df['ds'].str.replace(r'(\d+) (Q\d)', r'\1-\2', regex=True)
Y_df['ds'] = pd.PeriodIndex(Y_df['ds'], freq='Q').to_timestamp()
Y_df.head()
```

|   | Country   | Region   | State           | Purpose  | ds         | y          |
| - | --------- | -------- | --------------- | -------- | ---------- | ---------- |
| 0 | Australia | Adelaide | South Australia | Business | 1998-01-01 | 135.077690 |
| 1 | Australia | Adelaide | South Australia | Business | 1998-04-01 | 109.987316 |
| 2 | Australia | Adelaide | South Australia | Business | 1998-07-01 | 166.034687 |
| 3 | Australia | Adelaide | South Australia | Business | 1998-10-01 | 127.160464 |
| 4 | Australia | Adelaide | South Australia | Business | 1999-01-01 | 137.448533 |

```python theme={null}
np.random.seed(42)

# CPI: a national-level index, same across all bottom-level series per quarter
dates = Y_df['ds'].unique()
cpi_values = 100 + np.cumsum(np.random.normal(0.5, 0.3, len(dates)))
cpi_map = dict(zip(dates, cpi_values))
Y_df['cpi'] = Y_df['ds'].map(cpi_map)

# Marketing spend: varies by state, grows over time
state_spend_base = {s: np.random.uniform(50, 200) for s in Y_df['State'].unique()}
Y_df['marketing_spend'] = Y_df.apply(
    lambda r: state_spend_base[r['State']] * (1 + 0.02 * (r['ds'].year - 1998)) + np.random.normal(0, 5), 
    axis=1
).round(1)

Y_df.head()
```

|   | Country   | Region   | State           | Purpose  | ds         | y          | cpi        | marketing\_spend |
| - | --------- | -------- | --------------- | -------- | ---------- | ---------- | ---------- | ---------------- |
| 0 | Australia | Adelaide | South Australia | Business | 1998-01-01 | 135.077690 | 100.649014 | 190.7            |
| 1 | Australia | Adelaide | South Australia | Business | 1998-04-01 | 109.987316 | 101.107535 | 187.8            |
| 2 | Australia | Adelaide | South Australia | Business | 1998-07-01 | 166.034687 | 101.801842 | 183.5            |
| 3 | Australia | Adelaide | South Australia | Business | 1998-10-01 | 127.160464 | 102.758750 | 188.7            |
| 4 | Australia | Adelaide | South Australia | Business | 1999-01-01 | 137.448533 | 103.188504 | 190.3            |

## 2. Aggregate with Exogenous Variables

The `exog_vars` parameter of `aggregate` takes a dictionary where: -
**keys** are the column names of the exogenous variables. - **values**
are the aggregation function(s) to apply — any valid Pandas aggregation
function name such as `'sum'`, `'mean'`, `'min'`, `'max'`, `'std'`.

The aggregated columns are named `{column}_{function}` (e.g.,
`cpi_mean`, `marketing_spend_sum`).

```python theme={null}
spec = [
    ['Country'],
    ['Country', 'State'], 
    ['Country', 'Purpose'], 
    ['Country', 'State', 'Region'], 
    ['Country', 'State', 'Purpose'], 
    ['Country', 'State', 'Region', 'Purpose']
]
```

```python theme={null}
from hierarchicalforecast.utils import aggregate

Y_agg, S_df, tags = aggregate(
    Y_df, 
    spec, 
    exog_vars={
        'cpi': 'mean',              # rate variable -> average
        'marketing_spend': 'sum',   # additive variable -> sum
    }
)
```

The result is a `Y_agg` DataFrame that contains both the aggregated
target (`y`) and the aggregated exogenous variables (`cpi_mean`,
`marketing_spend_sum`) for every level of the hierarchy.

```python theme={null}
# Top level: all of Australia
Y_agg[Y_agg['unique_id'] == 'Australia'].head()
```

|   | unique\_id | ds         | y            | cpi\_mean  | marketing\_spend\_sum |
| - | ---------- | ---------- | ------------ | ---------- | --------------------- |
| 0 | Australia  | 1998-01-01 | 23182.197269 | 100.649014 | 36528.4               |
| 1 | Australia  | 1998-04-01 | 20323.380067 | 101.107535 | 36438.2               |
| 2 | Australia  | 1998-07-01 | 19826.640511 | 101.801842 | 36613.1               |
| 3 | Australia  | 1998-10-01 | 20830.129891 | 102.758750 | 36648.8               |
| 4 | Australia  | 1999-01-01 | 22087.353380 | 103.188504 | 37338.3               |

```python theme={null}
# State level: Victoria
Y_agg[Y_agg['unique_id'] == 'Australia/Victoria'].head()
```

|     | unique\_id         | ds         | y           | cpi\_mean  | marketing\_spend\_sum |
| --- | ------------------ | ---------- | ----------- | ---------- | --------------------- |
| 560 | Australia/Victoria | 1998-01-01 | 6010.424491 | 100.649014 | 13711.9               |
| 561 | Australia/Victoria | 1998-04-01 | 4795.246755 | 101.107535 | 13640.1               |
| 562 | Australia/Victoria | 1998-07-01 | 4316.845170 | 101.801842 | 13742.9               |
| 563 | Australia/Victoria | 1998-10-01 | 4674.829118 | 102.758750 | 13768.8               |
| 564 | Australia/Victoria | 1999-01-01 | 5304.334195 | 103.188504 | 14047.9               |

```python theme={null}
# Bottom level: a specific Region/Purpose combination
Y_agg[Y_agg['unique_id'] == 'Australia/Victoria/Melbourne/Holiday'].head()
```

|       | unique\_id                           | ds         | y          | cpi\_mean  | marketing\_spend\_sum |
| ----- | ------------------------------------ | ---------- | ---------- | ---------- | --------------------- |
| 29920 | Australia/Victoria/Melbourne/Holiday | 1998-01-01 | 427.827800 | 100.649014 | 160.2                 |
| 29921 | Australia/Victoria/Melbourne/Holiday | 1998-04-01 | 424.191925 | 101.107535 | 171.2                 |
| 29922 | Australia/Victoria/Melbourne/Holiday | 1998-07-01 | 449.551508 | 101.801842 | 168.8                 |
| 29923 | Australia/Victoria/Melbourne/Holiday | 1998-10-01 | 450.699582 | 102.758750 | 172.7                 |
| 29924 | Australia/Victoria/Melbourne/Holiday | 1999-01-01 | 412.358542 | 103.188504 | 175.8                 |

Notice that: - `cpi_mean` is the same at every level (since CPI is
identical across bottom series within each quarter). -
`marketing_spend_sum` grows as we move up the hierarchy, because it sums
the spend of all child series.

### Multiple aggregations per variable

You can also apply multiple aggregation functions to a single variable
by passing a list of strings.

```python theme={null}
Y_multi, _, _ = aggregate(
    Y_df, 
    spec, 
    exog_vars={
        'marketing_spend': ['sum', 'mean', 'std'],
    }
)

Y_multi[Y_multi['unique_id'] == 'Australia/Victoria'][
    ['unique_id', 'ds', 'y', 'marketing_spend_sum', 'marketing_spend_mean', 'marketing_spend_std']
].head()
```

|     | unique\_id         | ds         | y           | marketing\_spend\_sum | marketing\_spend\_mean | marketing\_spend\_std |
| --- | ------------------ | ---------- | ----------- | --------------------- | ---------------------- | --------------------- |
| 560 | Australia/Victoria | 1998-01-01 | 6010.424491 | 13711.9               | 163.236905             | 4.878618              |
| 561 | Australia/Victoria | 1998-04-01 | 4795.246755 | 13640.1               | 162.382143             | 5.116737              |
| 562 | Australia/Victoria | 1998-07-01 | 4316.845170 | 13742.9               | 163.605952             | 4.361285              |
| 563 | Australia/Victoria | 1998-10-01 | 4674.829118 | 13768.8               | 163.914286             | 5.038578              |
| 564 | Australia/Victoria | 1999-01-01 | 5304.334195 | 14047.9               | 167.236905             | 5.759074              |

## 3. Train/Test Split

We use the final two years (8 quarters) as test set. We also aggregate a
version **without** exogenous variables to compare later.

```python theme={null}
Y_agg_no_exog, S_df_no_exog, tags_no_exog = aggregate(Y_df, spec)
```

```python theme={null}
horizon = 8

# With exogenous
Y_test_df = Y_agg.groupby('unique_id', as_index=False).tail(horizon)
Y_train_df = Y_agg.drop(Y_test_df.index)
X_test_df = Y_test_df[['unique_id', 'ds', 'cpi_mean', 'marketing_spend_sum']]

# Without exogenous
Y_test_no_exog = Y_agg_no_exog.groupby('unique_id', as_index=False).tail(horizon)
Y_train_no_exog = Y_agg_no_exog.drop(Y_test_no_exog.index)

print(f'Train: {Y_train_df.shape[0]:,} rows, Test: {Y_test_df.shape[0]:,} rows')
```

```text theme={null}
Train: 30,600 rows, Test: 3,400 rows
```

## 4. Base Forecasts

We use `AutoARIMA` from `StatsForecast`, which supports exogenous
regressors. Any column in the training data besides `unique_id`, `ds`,
and `y` is treated as an exogenous feature. Future values are passed via
`X_df`.

We produce base forecasts **with** and **without** the exogenous
variables for comparison.

```python theme={null}
from statsforecast.models import AutoARIMA
from statsforecast.core import StatsForecast
```

```text theme={null}
/home/osprangers/Repositories/hierarchicalforecast/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
```

```python theme={null}
# With exogenous variables
fcst_exog = StatsForecast(models=[AutoARIMA(season_length=4)], freq='QS', n_jobs=-1)
Y_hat_exog = fcst_exog.forecast(df=Y_train_df, h=horizon, X_df=X_test_df, fitted=True)
Y_fitted_exog = fcst_exog.forecast_fitted_values()

Y_hat_exog.head()
```

|   | unique\_id | ds         | AutoARIMA    |
| - | ---------- | ---------- | ------------ |
| 0 | Australia  | 2016-01-01 | 26166.873073 |
| 1 | Australia  | 2016-04-01 | 24907.791984 |
| 2 | Australia  | 2016-07-01 | 24537.521142 |
| 3 | Australia  | 2016-10-01 | 25631.143759 |
| 4 | Australia  | 2017-01-01 | 26987.070921 |

```python theme={null}
# Without exogenous variables
fcst_no_exog = StatsForecast(models=[AutoARIMA(season_length=4)], freq='QS', n_jobs=-1)
Y_hat_no_exog = fcst_no_exog.forecast(df=Y_train_no_exog, h=horizon, fitted=True)
Y_fitted_no_exog = fcst_no_exog.forecast_fitted_values()

Y_hat_no_exog.head()
```

|   | unique\_id | ds         | AutoARIMA    |
| - | ---------- | ---------- | ------------ |
| 0 | Australia  | 2016-01-01 | 26212.553553 |
| 1 | Australia  | 2016-04-01 | 25033.667125 |
| 2 | Australia  | 2016-07-01 | 24507.027198 |
| 3 | Australia  | 2016-10-01 | 25598.928613 |
| 4 | Australia  | 2017-01-01 | 26982.576796 |

## 5. Reconcile Forecasts

Reconciliation works exactly the same way regardless of whether
exogenous variables were used to produce the base forecasts. The
reconciliation step adjusts the base forecasts to be coherent across the
hierarchy.

```python theme={null}
from hierarchicalforecast.methods import BottomUp, MinTrace
from hierarchicalforecast.core import HierarchicalReconciliation

reconcilers = [
    BottomUp(),
    MinTrace(method='mint_shrink'),
]
```

```python theme={null}
# With exogenous
hrec_exog = HierarchicalReconciliation(reconcilers=reconcilers)
Y_rec_exog = hrec_exog.reconcile(
    Y_hat_df=Y_hat_exog, Y_df=Y_fitted_exog, S_df=S_df, tags=tags
)

Y_rec_exog.head()
```

|   | unique\_id | ds         | AutoARIMA    | AutoARIMA/BottomUp | AutoARIMA/MinTrace\_method-mint\_shrink |
| - | ---------- | ---------- | ------------ | ------------------ | --------------------------------------- |
| 0 | Australia  | 2016-01-01 | 26166.873073 | 24461.292072       | 25436.706581                            |
| 1 | Australia  | 2016-04-01 | 24907.791984 | 22795.931817       | 24076.317511                            |
| 2 | Australia  | 2016-07-01 | 24537.521142 | 22151.772435       | 23272.424691                            |
| 3 | Australia  | 2016-10-01 | 25631.143759 | 22936.268850       | 24404.217321                            |
| 4 | Australia  | 2017-01-01 | 26987.070921 | 24292.353932       | 25663.420421                            |

```python theme={null}
# Without exogenous
hrec_no_exog = HierarchicalReconciliation(reconcilers=reconcilers)
Y_rec_no_exog = hrec_no_exog.reconcile(
    Y_hat_df=Y_hat_no_exog, Y_df=Y_fitted_no_exog, S_df=S_df_no_exog, tags=tags_no_exog
)

Y_rec_no_exog.head()
```

|   | unique\_id | ds         | AutoARIMA    | AutoARIMA/BottomUp | AutoARIMA/MinTrace\_method-mint\_shrink |
| - | ---------- | ---------- | ------------ | ------------------ | --------------------------------------- |
| 0 | Australia  | 2016-01-01 | 26212.553553 | 24495.919988       | 25322.734082                            |
| 1 | Australia  | 2016-04-01 | 25033.667125 | 22660.036262       | 23873.068038                            |
| 2 | Australia  | 2016-07-01 | 24507.027198 | 22307.887506       | 23396.655072                            |
| 3 | Australia  | 2016-10-01 | 25598.928613 | 22853.294863       | 24303.043112                            |
| 4 | Australia  | 2017-01-01 | 26982.576796 | 24264.065071       | 25520.638165                            |

## 6. Evaluation

We compare RMSE and MASE across hierarchy levels for the models **with**
and **without** exogenous variables.

```python theme={null}
from hierarchicalforecast.evaluation import evaluate
from utilsforecast.losses import rmse, mase
from functools import partial

eval_tags = {
    'Total': tags['Country'],
    'Purpose': tags['Country/Purpose'],
    'State': tags['Country/State'],
    'Regions': tags['Country/State/Region'],
    'Bottom': tags['Country/State/Region/Purpose'],
}
```

```python theme={null}
# Evaluate with exogenous (drop exog columns before merge to avoid them being treated as models)
df_exog = Y_rec_exog.merge(Y_test_df[['unique_id', 'ds', 'y']], on=['unique_id', 'ds'])
eval_exog = evaluate(
    df=df_exog,
    tags=eval_tags,
    train_df=Y_train_df[['unique_id', 'ds', 'y']],
    metrics=[rmse, partial(mase, seasonality=4)],
)
eval_exog.columns = ['level', 'metric', 'Base (w/ exog)', 'BottomUp (w/ exog)', 'MinTrace (w/ exog)']
```

```python theme={null}
# Evaluate without exogenous
df_no_exog = Y_rec_no_exog.merge(Y_test_no_exog, on=['unique_id', 'ds'])
eval_no_exog = evaluate(
    df=df_no_exog,
    tags=eval_tags,
    train_df=Y_train_no_exog,
    metrics=[rmse, partial(mase, seasonality=4)],
)
eval_no_exog.columns = ['level', 'metric', 'Base (no exog)', 'BottomUp (no exog)', 'MinTrace (no exog)']
```

```python theme={null}
comparison = eval_exog.merge(eval_no_exog, on=['level', 'metric'])
comparison = comparison[[
    'level', 'metric', 
    'Base (no exog)', 'Base (w/ exog)', 
    'BottomUp (no exog)', 'BottomUp (w/ exog)',
    'MinTrace (no exog)', 'MinTrace (w/ exog)',
]]

numeric_cols = comparison.select_dtypes(include='number').columns
comparison[numeric_cols] = comparison[numeric_cols].map('{:.2f}'.format).astype(np.float64)
```

### RMSE

```python theme={null}
comparison.query('metric == "rmse"')
```

|    | level   | metric | Base (no exog) | Base (w/ exog) | BottomUp (no exog) | BottomUp (w/ exog) | MinTrace (no exog) | MinTrace (w/ exog) |
| -- | ------- | ------ | -------------- | -------------- | ------------------ | ------------------ | ------------------ | ------------------ |
| 0  | Total   | rmse   | 773.97         | 731.03         | 3247.04            | 3218.18            | 2029.46            | 1946.96            |
| 2  | Purpose | rmse   | 471.46         | 516.06         | 871.70             | 859.13             | 571.76             | 545.22             |
| 4  | State   | rmse   | 257.90         | 251.32         | 434.51             | 431.99             | 308.33             | 298.49             |
| 6  | Regions | rmse   | 56.53          | 55.53          | 58.19              | 57.79              | 46.43              | 46.08              |
| 8  | Bottom  | rmse   | 20.39          | 20.44          | 20.39              | 20.44              | 18.13              | 18.22              |
| 10 | Overall | rmse   | 38.72          | 38.78          | 53.00              | 52.71              | 40.26              | 39.58              |

### MASE

```python theme={null}
comparison.query('metric == "mase"')
```

|    | level   | metric | Base (no exog) | Base (w/ exog) | BottomUp (no exog) | BottomUp (w/ exog) | MinTrace (no exog) | MinTrace (w/ exog) |
| -- | ------- | ------ | -------------- | -------------- | ------------------ | ------------------ | ------------------ | ------------------ |
| 1  | Total   | mase   | 0.76           | 0.73           | 3.39               | 3.36               | 2.01               | 1.91               |
| 3  | Purpose | mase   | 1.14           | 1.41           | 2.48               | 2.49               | 1.45               | 1.48               |
| 5  | State   | mase   | 1.46           | 1.26           | 1.97               | 1.93               | 1.42               | 1.31               |
| 7  | Regions | mase   | 1.21           | 1.20           | 1.25               | 1.24               | 1.02               | 1.00               |
| 9  | Bottom  | mase   | 1.02           | 1.02           | 1.02               | 1.02               | 0.94               | 0.95               |
| 11 | Overall | mase   | 1.06           | 1.06           | 1.10               | 1.10               | 0.97               | 0.97               |

### References

* [Hyndman, R.J., & Athanasopoulos, G. (2021). “Forecasting:
  principles and practice, 3rd edition: Chapter 11: Forecasting
  hierarchical and grouped series.”. OTexts: Melbourne, Australia.
  OTexts.com/fpp3 Accessed on July
  2022.](https://otexts.com/fpp3/hierarchical.html)
