Learn how to select the best forecasting models when you care about more than one metric — without manually ranking and comparing every combination.

What you’ll learn

  • Why single-metric model selection can be misleading
  • What Pareto dominance means and when a model is “dominated”
  • How to use evaluate() correctly as input to ParetoFrontier
  • How to visualize the 2D Pareto frontier across any two metrics
  • How to handle cross-validation output for multi-objective selection

The problem: picking one winner across multiple metrics

After training several forecasting models, a common question is: which one should I deploy? If you optimize for a single metric — say MAE — the answer is straightforward: pick the lowest MAE. But real-world requirements rarely reduce to a single number. You might care about:
  • Accuracy (MAE, RMSE): how close are point forecasts to actuals?
  • Relative error (MAPE, sMAPE): how large is the error relative to the scale of the series?
  • Bias (bias, CFE): does the model systematically over- or under-forecast?
When metrics disagree — Model A has the best MAE, Model B has the best MAPE — a simple ranking breaks down. Pareto analysis offers a principled solution: instead of collapsing everything to a single score, identify which models are not dominated. A model is dominated when another model is at least as good on every metric and strictly better on at least one. Non-dominated models form the Pareto frontier — the set of trade-off-optimal choices.
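To make the rule concrete, here is a tiny sketch of the dominance check (the model names and numbers are made up, and every metric is treated as lower-is-better):

# Toy illustration of Pareto dominance between two hypothetical metric vectors
model_a = {'mae': 0.12, 'rmse': 0.15, 'mape': 0.34}  # hypothetical scores
model_b = {'mae': 0.16, 'rmse': 0.20, 'mape': 0.34}  # hypothetical scores

def dominates(a, b):
    # a dominates b if it is at least as good on every metric
    # and strictly better on at least one
    return all(a[m] <= b[m] for m in a) and any(a[m] < b[m] for m in a)

print(dominates(model_a, model_b))  # True  -> model_b is dominated
print(dominates(model_b, model_a))  # False

ParetoFrontier applies this rule across all models and metrics for you.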

Install libraries

%%capture
%pip install -U utilsforecast statsforecast
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, MSTL, SeasonalNaive

from utilsforecast.data import generate_series
from utilsforecast.losses import mae, mape, rmse, smape
from utilsforecast.evaluation import evaluate
from utilsforecast.model_selection import ParetoFrontier

Generate synthetic time series

We use generate_series to create a panel of daily time series with weekly seasonality. Each series contains between 100 and 150 observations, giving models enough history for a meaningful fit.
series = generate_series(
    n_series=8,
    freq='D',
    min_length=100,
    max_length=150,
    seed=42,
)
# StatsForecast requires string or integer unique_id, not Categorical
series['unique_id'] = series['unique_id'].astype(str)
series.head()
   unique_id          ds         y
0          0  2000-01-01  0.049987
1          0  2000-01-02  1.229624
2          0  2000-01-03  2.166854
3          0  2000-01-04  3.071433
4          0  2000-01-05  4.325444
We hold out the last 14 days of each series as the evaluation window and use the rest for training.
HORIZON = 14
SEASON  = 7

test_mask = series.groupby('unique_id').cumcount(ascending=False) < HORIZON
train = series[~test_mask].reset_index(drop=True)
test  = series[test_mask].reset_index(drop=True)

print(f'Train: {len(train)} rows | Test: {len(test)} rows')
Train: 893 rows | Test: 112 rows

Fit models and generate forecasts

We compare three models that cover a range of complexity:
  • SeasonalNaive — repeats the last observed season. Fast, transparent, surprisingly hard to beat.
  • AutoARIMA — fits a SARIMA model selected automatically by AIC. More flexible but slower.
  • MSTL — decomposes the series into trend and seasonal components using STL, then forecasts each part separately. Good at capturing multiple seasonal patterns.
sf = StatsForecast(
    models=[
        SeasonalNaive(season_length=SEASON),
        AutoARIMA(season_length=SEASON),
        MSTL(season_length=SEASON),
    ],
    freq='D',
    n_jobs=1,
)
sf.fit(train)
preds = sf.predict(h=HORIZON)
preds.head()
   unique_id          ds  SeasonalNaive  AutoARIMA      MSTL
0          0  2000-05-04       5.453414   5.296218  5.283464
1          0  2000-05-05       6.136066   6.198051  6.193499
2          0  2000-05-06       0.323845   0.277377  0.280454
3          0  2000-05-07       1.000260   1.234950  1.244538
4          0  2000-05-08       2.176284   2.252924  2.278188
Merge predictions with the held-out actuals to get a single DataFrame ready for evaluate().
eval_df = test.merge(preds, on=['unique_id', 'ds'], how='left')
eval_df.head()
   unique_id          ds         y  SeasonalNaive  AutoARIMA      MSTL
0          0  2000-05-04  5.267045       5.453414   5.296218  5.283464
1          0  2000-05-05  6.242415       6.136066   6.198051  6.193499
2          0  2000-05-06  0.346218       0.323845   0.277377  0.280454
3          0  2000-05-07  1.134706       1.000260   1.234950  1.244538
4          0  2000-05-08  2.122063       2.176284   2.252924  2.278188

Evaluate models across multiple metrics

evaluate() computes any combination of loss functions from utilsforecast.losses and returns a tidy DataFrame with one row per (unique_id, metric) and one column per model. For Pareto analysis we need one scalar per metric per model — a single number that summarises performance across all series. The agg_fn='mean' argument collapses the per-series rows into a single mean, giving a (n_metrics, n_models) table.
scores = evaluate(
    df=eval_df,
    metrics=[mae, rmse, mape, smape],
    agg_fn='mean',
)
scores
  metric  SeasonalNaive  AutoARIMA      MSTL
0    mae       0.162020   0.120070  0.119955
1   rmse       0.196461   0.145129  0.144143
2   mape       0.354416   0.359234  0.340032
3  smape       0.087060   0.065840  0.064982
In this particular run MSTL posts the lowest value on all four metrics, so a simple ranking would already pick it. That outcome is not guaranteed in general: SeasonalNaive is often competitive on relative metrics for series with strong weekly patterns, and as soon as the per-metric winners disagree a single ranking breaks down. Pareto analysis covers both cases with the same workflow, keeping every model that is not dominated.
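To see which model has the lowest score on each individual metric, a quick check on the aggregated scores table works (all four metrics here are minimized):

# Per-metric winner: the model with the lowest aggregated score on each row
winners = scores.set_index('metric').idxmin(axis=1)
print(winners)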

Find the Pareto frontier

ParetoFrontier.find_non_dominated() takes the aggregated scores table and returns only the columns corresponding to non-dominated models — dropping any model for which another model is at least as good on every metric and strictly better on at least one.
pareto = ParetoFrontier.find_non_dominated(scores)
pareto
  metric      MSTL
0    mae  0.119955
1   rmse  0.144143
2   mape  0.340032
3  smape  0.064982
The models that survive are the Pareto-optimal set. Dropping the rest is safe: for every eliminated model there is at least one surviving model that is at least as good on every metric and strictly better on at least one.
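If you want to double-check that claim, the short sketch below (not part of the library) re-applies the dominance rule to every dropped model using nothing but the scores table:

# Sanity check: every dropped model should be dominated by a frontier model
frontier_models = [c for c in pareto.columns if c != 'metric']
dropped_models = [c for c in scores.columns if c not in frontier_models + ['metric']]

for loser in dropped_models:
    is_dominated = any(
        (scores[winner] <= scores[loser]).all() and (scores[winner] < scores[loser]).any()
        for winner in frontier_models
    )
    print(f'{loser}: dominated = {is_dominated}')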

Focusing on a subset of metrics

You can restrict the comparison to only the metrics that matter for your use case by passing a metrics list.
# Only consider MAE and RMSE for dominance — ignore MAPE and sMAPE
pareto_subset = ParetoFrontier.find_non_dominated(scores, metrics=['mae', 'rmse'])
pareto_subset
  metric      MSTL
0    mae  0.119955
1   rmse  0.144143

Maximization metrics

By default all metrics are minimized (lower is better). If a metric should be maximized — for example, a custom R² score — pass its name in maximization.
# Hypothetical: minimize MAE but maximize some score column
# ParetoFrontier.find_non_dominated(scores, maximization=['score'])

# With the current metrics, this is equivalent to the default:
pareto_min = ParetoFrontier.find_non_dominated(scores, metrics=['mae', 'rmse', 'mape', 'smape'])
pareto_min
  metric      MSTL
0    mae  0.119955
1   rmse  0.144143
2   mape  0.340032
3  smape  0.064982
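To make the maximization path concrete, here is a hypothetical sketch: it appends a made-up 'skill' row (higher is better, values invented purely for illustration) to the scores table and declares it as a maximization metric.

# Hypothetical example: a made-up 'skill' metric where higher is better
skill_row = pd.DataFrame(
    {'metric': ['skill'], 'SeasonalNaive': [0.61], 'AutoARIMA': [0.74], 'MSTL': [0.75]}
)
scores_with_skill = pd.concat([scores, skill_row], ignore_index=True)

# 'skill' is maximized; the other metrics keep the default (minimized)
ParetoFrontier.find_non_dominated(scores_with_skill, maximization=['skill'])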

Visualize the 2D Pareto frontier

When comparing two metrics, ParetoFrontier.plot_pareto_2d() renders a scatter plot where dominated models appear in grey and Pareto-optimal models appear in red, connected by a dashed frontier line.
ParetoFrontier.plot_pareto_2d(
    scores,
    metric_x='mae',
    metric_y='mape',
    title='MAE vs MAPE — Pareto Frontier',
)
The plot accepts maximize_x and maximize_y flags for metrics where larger is better, and show_dominated=False to declutter the chart when many models are present.
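For example, a decluttered variant of the plot above (same scores table, only the flags described here changed) might look like:

# Hide dominated models and plot a different metric pair
ParetoFrontier.plot_pareto_2d(
    scores,
    metric_x='mae',
    metric_y='smape',
    show_dominated=False,
    title='MAE vs sMAPE (frontier only)',
)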

Cross-validation: multi-window model selection

A single held-out window can be noisy. StatsForecast's cross_validation() evaluates each model over multiple windows, giving a more robust picture of performance. The cross-validation output has a cutoff column, which evaluate() keeps, so passing agg_fn='mean' would only average across series within each cutoff rather than across all windows at once. To collapse everything into a single row per metric for Pareto analysis, follow evaluate() with an explicit groupby over the metric column, as in the two steps below.
cv = sf.cross_validation(df=series, h=HORIZON, n_windows=3)
cv['unique_id'] = cv['unique_id'].astype(str)
cv.head()
   unique_id          ds      cutoff         y  SeasonalNaive  AutoARIMA      MSTL
0          0  2000-05-02  2000-05-01  3.152391       3.061044   3.200568  3.200882
1          0  2000-05-03  2000-05-01  4.082328       4.178149   4.194904  4.197371
2          0  2000-05-04  2000-05-01  5.267045       5.453414   5.294926  5.284707
3          0  2000-05-05  2000-05-01  6.242415       6.136066   6.198686  6.194711
4          0  2000-05-06  2000-05-01  0.346218       0.323845   0.277010  0.282018
MODELS = [c for c in cv.columns if c not in {'unique_id', 'ds', 'cutoff', 'y'}]

# Step 1: compute per-(series, cutoff) scores
cv_scores = evaluate(
    df=cv,
    metrics=[mae, rmse, mape, smape],
)

# Step 2: average across both series and cutoffs
cv_scores_agg = cv_scores.groupby('metric', sort=False)[MODELS].mean().reset_index()
cv_scores_agg
  metric  SeasonalNaive  AutoARIMA      MSTL
0    mae       0.163304   0.119660  0.119076
1   rmse       0.196949   0.144753  0.143648
2   mape       0.349638   0.326070  0.317805
3  smape       0.086262   0.064798  0.063181
pareto_cv = ParetoFrontier.find_non_dominated(cv_scores_agg)
pareto_cv
  metric      MSTL
0    mae  0.119076
1   rmse  0.143648
2   mape  0.317805
3  smape  0.063181
ParetoFrontier.plot_pareto_2d(
    cv_scores_agg,
    metric_x='mae',
    metric_y='mape',
    title='MAE vs MAPE — Cross-Validated Pareto Frontier',
)

Custom column names

If your pipeline uses column names different from the defaults (unique_id, cutoff), pass id_col and cutoff_col to both evaluate() and find_non_dominated() so the Pareto analysis correctly identifies which columns are model predictions.
# Rename to simulate a custom pipeline
eval_df_custom = eval_df.rename(columns={'unique_id': 'series_id'})

scores_custom = evaluate(
    df=eval_df_custom,
    metrics=[mae, rmse],
    id_col='series_id',
    agg_fn='mean',
)

# Pass the same id_col so find_non_dominated() excludes it from model columns
ParetoFrontier.find_non_dominated(scores_custom, id_col='series_id')
  metric      MSTL
0    mae  0.119955
1   rmse  0.144143

Key takeaways

  • Single-metric selection discards information. When metrics disagree, there is no universally correct answer — only trade-offs worth making explicit.
  • Pareto dominance is a lossless filter. Every eliminated model is objectively outperformed; no information about the surviving frontier is lost.
  • Always aggregate before calling find_non_dominated(). Pass agg_fn='mean' to evaluate() so the input has exactly one row per metric. For cross-validation output, follow evaluate() with a groupby over the metric column so the averaging also collapses across cutoffs.
  • plot_pareto_2d() makes the trade-off tangible. Pick any two metrics on the axes to see which models sit on the frontier and which ones are dominated.
  • Custom column names are supported. Pass id_col and cutoff_col consistently across evaluate() and find_non_dominated() when your data uses non-default names.