Prerequisites
- We assume you have neuralforecast already installed.
- Explanations are obtained with Captum, an open-source library for model interpretability in PyTorch. Make sure to install the package with `pip install captum` to use the features demonstrated below.
- You can optionally install SHAP to access its visualization capabilities with `pip install shap`.
Load libraries
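As a minimal sketch, the imports used throughout this tutorial could look like the following (assuming a standard neuralforecast installation, plus `captum` and, optionally, `shap`):

```python
import numpy as np

from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS
from neuralforecast.losses.pytorch import MQLoss  # only needed for the probabilistic example later
from neuralforecast.utils import AirPassengersPanel, AirPassengersStatic

import shap  # optional, only used for the SHAP plots below
```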
Load the data
We demonstrate the explainability capabilities with the AirPassengers dataset. This dataset has:
- 2 unique series
- a future exogenous variable (`trend`)
- a historical exogenous variable (`y_lag[12]`)
- a static exogenous variable (`Airline1`)
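A minimal loading sketch, assuming the `AirPassengersPanel` and `AirPassengersStatic` utilities from `neuralforecast.utils` correspond to the dataset described above; we also hold out the last 12 months of each series so they can later provide the future exogenous values:

```python
Y_df = AirPassengersPanel.copy()
static_df = AirPassengersStatic.copy()

# Hold out the last 12 months of each series; the held-out rows provide the
# future exogenous values needed at prediction/explanation time.
horizon = 12
Y_test_df = Y_df.groupby("unique_id").tail(horizon)
Y_train_df = Y_df.drop(Y_test_df.index)

print(Y_df["unique_id"].nunique())  # 2 unique series
print(Y_df.columns.tolist())        # inspect the exact exogenous column names
```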
Basic usage
Train a model
Before explaining forecasts, we need to train a forecasting model. Here, we use the NHITS model, but you can use any univariate model. Multivariate models are not supported yet; this feature will be implemented soon.
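A training sketch under these assumptions (the hyperparameters and exogenous column names below are illustrative, taken from the dataset description rather than from the original notebook):

```python
model = NHITS(
    h=horizon,
    input_size=2 * horizon,
    futr_exog_list=["trend"],       # future exogenous feature
    hist_exog_list=["y_lag[12]"],   # historical exogenous feature (use the exact column name in Y_df)
    stat_exog_list=["Airline1"],    # static exogenous feature (use the exact column name in static_df)
    scaler_type="robust",           # window-level scaling; see the scaler discussion further below
    max_steps=100,
)

nf = NeuralForecast(models=[model], freq="M")  # monthly data; use "ME" with recent pandas versions
nf.fit(df=Y_train_df, static_df=static_df)
```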
Get feature attributions
Once the model is trained, we can get feature attributions using the `nf.explain` method.
This method takes the following parameters:
- `horizons`: List of horizons to explain. If None, all horizons are explained. Defaults to None.
- `outputs`: List of outputs to explain for models with multiple outputs. Defaults to [0] (first output). This is useful when we have models trained with a probabilistic loss. We will explore that later in the tutorial.
- `explainer`: Name of the explainer to use. Options are 'IntegratedGradients', 'ShapleyValueSampling', and 'InputXGradient'. Defaults to 'IntegratedGradients'.
- `df` (pandas, polars or spark DataFrame): DataFrame with columns [`unique_id`, `ds`, `y`] and exogenous variables. If a DataFrame is passed, it is used to generate forecasts. Defaults to None.
- `static_df` (pandas, polars or spark DataFrame): DataFrame with columns [`unique_id`] and static exogenous variables. Defaults to None. Only use it if you trained your model with static exogenous features.
- `futr_df` (pandas, polars or spark DataFrame): DataFrame with [`unique_id`, `ds`] columns and `df`'s future exogenous variables. Defaults to None. Only use it if you trained your model with future exogenous features.
- `verbose`: Print warnings. Defaults to True.
- `engine`: Distributed engine for inference. Only used if `df` is a spark DataFrame or if fit was called on a spark DataFrame.
- `level`: Confidence levels between 0 and 100. Defaults to None.
- `quantiles`: Alternative to `level`; target quantiles to predict. Defaults to None.
- `data_kwargs`: Extra arguments to be passed to the dataset within each model.
`df` and onwards act exactly the same way as in the `nf.predict()` method.
In this case, let's explain each horizon step, so we keep `horizons=None`. Since our model used a point loss, there is only one output, so we also keep the default value `outputs=[0]`. Finally, we choose the "IntegratedGradients" explainer, as it is one of the fastest methods for interpretability in deep learning.
`nf.explain()` returns two values:
- A dataframe with the forecasts from the fitted models
- A dictionary with the feature attributions for each model
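For example, continuing with the `nf` object and the data split from above (the future exogenous column name is an assumption):

```python
# Explain every horizon step of the point forecast with Integrated Gradients.
fcst_df, explanations = nf.explain(
    explainer="IntegratedGradients",                   # default, shown here for clarity
    horizons=None,                                     # explain all horizon steps
    outputs=[0],                                       # single output for a point loss
    futr_df=Y_test_df[["unique_id", "ds", "trend"]],   # future exogenous values, as in nf.predict()
)
```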
The attributions for each model are accessed by model name, e.g. `explanations["NHITS"]`. Note that if you used an alias when initializing the model, then the key is the value of the alias. Each entry contains:
- `insample`: contains the attributions for past lags and the availability mask
- `futr_exog`: contains the attributions for future exogenous features
- `hist_exog`: contains the attributions for historical exogenous features
- `stat_exog`: contains the attributions for static exogenous features
- `baseline_predictions`: contains the baseline prediction of the model if none of the features above were available. Note that if the selected explainer does not have the additivity property, this value is set to None.
`IntegratedGradients` has the additive property, meaning that taking the sum of baseline predictions and feature attributions results in the final forecast made by the model.
Now, because we are using Captum, we work directly with tensors, keeping
the entire process fast, efficient, and allowing us to leverage GPUs
when available. As such, the attributions are also stored as tensors as
shown below.
- `insample`: [batch_size, horizon, n_series, n_output, input_size, 2 (y attribution, mask attribution)]
- `futr_exog`: [batch_size, horizon, n_series, n_output, input_size+horizon, n_futr_features]
- `hist_exog`: [batch_size, horizon, n_series, n_output, input_size, n_hist_features]
- `stat_exog`: [batch_size, horizon, n_series, n_output, n_static_features]
- `baseline_predictions`: [batch_size, horizon, n_series, n_output]

Here, `batch_size` is 2 for all tensors because we are explaining two different series, while `n_series` is 1 because NHITS is a univariate model. Also note that for `insample`, the last dimension is always 2, because we score the attribution of both the past lag values and their availability mask.
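For instance, a quick way to inspect these shapes, assuming the dictionary layout described above:

```python
attr = explanations["NHITS"]  # key is the model name, or its alias if you set one

for name in ["insample", "futr_exog", "hist_exog", "stat_exog", "baseline_predictions"]:
    tensor = attr[name]
    print(name, None if tensor is None else tuple(tensor.shape))
```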
At this point, we have all the information needed to analyze the
attribution scores and make visualizations.
Plotting feature attributions
You can now use any method you want to plot feature attributions. You can make plots manually using any visualization library like `matplotlib` or `seaborn`, but `shap` has dedicated plots for explainability, so let's see how we can use them.
Basically, with the information we have, we can easily create a `shap.Explanation` object that can then be used to create different plots from the `shap` package.
Specifically, a `shap.Explanation` object needs:
- `values`: the attribution scores
- `base_values`: the baseline predictions of the model
- `feature_names`: a list to display nice feature names

We can then use this object with any plot from `shap`. For example, we can do a simple bar plot as shown below.
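A sketch of how this could look, assuming the tensor layout listed earlier. We aggregate attributions over the time dimension to get one score per feature for the first series, first horizon step and first output; the feature names are taken from the dataset description and may need adjusting.

```python
import numpy as np
import shap

attr = explanations["NHITS"]
b, h, s, o = 0, 0, 0, 0  # first series (batch), first horizon step, single series, first output

# One aggregated attribution score per feature group.
insample_attr = attr["insample"][b, h, s, o].sum().item()                    # past target values (+ availability mask)
futr_attr = attr["futr_exog"][b, h, s, o].sum(dim=0).detach().cpu().numpy()  # one score per future exogenous feature
hist_attr = attr["hist_exog"][b, h, s, o].sum(dim=0).detach().cpu().numpy()  # one score per historical exogenous feature
stat_attr = attr["stat_exog"][b, h, s, o].detach().cpu().numpy()             # one score per static exogenous feature

values = np.concatenate([[insample_attr], futr_attr, hist_attr, stat_attr])
feature_names = ["y (past lags)", "trend", "y_lag[12]", "Airline1"]  # assumed from the dataset description
base_value = attr["baseline_predictions"][b, h, s, o].item()

explanation = shap.Explanation(
    values=values,
    base_values=base_value,
    feature_names=feature_names,
)
shap.plots.bar(explanation)
```

Other `shap` plots, such as the waterfall plot, accept the same `Explanation` object.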


Verifying additivity
As mentioned above, "IntegratedGradients" has the additive property, meaning that when we sum the baseline predictions with the total attribution scores of each feature, we get the final forecasts made by the model. We can verify this as shown below.
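A sketch of this check, assuming the tensor layout listed earlier: summing every attribution tensor over its feature and time dimensions and adding the baseline predictions should reproduce the NHITS forecasts, up to a small numerical error from the Riemann-sum approximation.

```python
attr = explanations["NHITS"]

reconstructed = (
    attr["baseline_predictions"]
    + attr["insample"].sum(dim=(-2, -1))
    + attr["futr_exog"].sum(dim=(-2, -1))
    + attr["hist_exog"].sum(dim=(-2, -1))
    + attr["stat_exog"].sum(dim=-1)
)  # shape: [batch_size, horizon, n_series, n_output]

# Compare with the NHITS forecasts returned in fcst_df.
print(reconstructed.squeeze())
```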
Advanced concepts
Choosing an explainer
In this section, we outline the different explainers supported in neuralforecast. Different algorithms will produce different attribution scores, and so we must choose which applies best to our scenario.

| Explainer | Local/Global | Additivity Property | Speed |
|---|---|---|---|
| Integrated Gradients | Local | Yes | Fast |
| Shapley Value Sampling | Local | Yes | Very slow |
| Input X Gradient | Local | No | Very fast |
- Local/Global: All explainers are local, because they only explain how a specific input affects a specific forecast.
- Additivity Property: Whether the sum of the feature attributions and baseline predictions results in the final forecast.
- Speed:
- Very fast: Single gradient computation
- Fast: Multiple gradient computations (Integrated Gradients)
- Medium: Multiple model evaluations
- Slow: Many model evaluations for sampling-based methods
- Very Slow: Exponential complexity in worst case (exact Shapley values)
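Switching explainers only requires changing the `explainer` argument of `nf.explain`; for example, using the sampling-based method below (expect it to be noticeably slower than Integrated Gradients):

```python
# Same call as before, only the explainer changes.
_, explanations_svs = nf.explain(
    explainer="ShapleyValueSampling",
    futr_df=Y_test_df[["unique_id", "ds", "trend"]],
)
```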
Integrated Gradients
Integrated Gradients computes attributions by integrating gradients along the straight-line path from a chosen baseline input (e.g., black image, zero embedding) to the actual input. The method calculates the path integral, which is approximated using a Riemann sum with typically 20-300 gradient computations. Learn more in the original paper.
Advantages
- Theoretically grounded: satisfies the axioms of sensitivity (features that affect the output get non-zero attribution) and implementation invariance (functionally equivalent networks produce identical attributions)
- Has the additivity property
Limitations
- Relies on choosing an appropriate baseline that represents "absence of signal". By default, we use an all-zero input as the baseline.
Shapley Value Sampling
Shapley Value Sampling approximates Shapley values using Monte Carlo sampling of feature permutations. The method randomly samples different orderings of features and computes how much each feature contributes by comparing model predictions when that feature is included versus excluded from the subset. The approach simulates "missing" features by drawing random values from the training data distribution. Learn more in the original paper.
Advantages
- All subsets of input features are perturbed, so interactions and redundancies between features are taken into account
- Uses simple permutation sampling that is easy to understand
Limitations
- High computational cost: requires many model evaluations (typically hundreds to thousands) to achieve reasonable approximation accuracy
- Very slow due to the high number of model evaluations
- Simulates missing features by sampling from marginal distributions, which may create unrealistic data instances when features are correlated
Input X Gradient
Input X Gradient computes feature attributions by simply multiplying each input value by the gradient of the model output with respect to that input. This corresponds to a first-order Taylor approximation of how the output would change if the input were set to zero. This means each time step's input values are multiplied by the gradients to show which historical observations most influence the prediction. Learn more in the original paper.
Advantages
- Computational efficiency: it requires only a single pass through the model
- No approximations, as it uses the exact gradient of the model
Limitations
- Does not have the additivity property
- Can be problematic with the ReLU activation, because its gradient can be 0 while the input still carries information
- Functions like tanh or sigmoid can have very small gradients even when the input is significant, which is problematic for LSTM and GRU models
Explaining models with different loss functions
Currently, explanations are supported for models trained with:
- Point loss functions (MAE, MSE, RMSE, etc.)
- Non-parametric probabilistic losses (IQLoss, MQLoss, etc.)
Explaining a model with a probabilistic loss function
If you are explaining a model with a non-parametric loss function, then by default, we only explain the median forecast. This is controlled by the `outputs` parameter. Let's see an example.
With `outputs=[0]`, which is the default value, we only explain the median forecast. However, we can explain up to three outputs:
- Median forecast
- Lower bound
- Upper bound
To explain all three, set `outputs=[0,1,2]`, as in the sketch below.
Explaining models with a scaler (`local_scaler_type`)
If you specify a `local_scaler_type` in your `NeuralForecast` object, note that the attribution scores will be scaled. This is because the data is scaled before the training process. The relative importance is still relevant, but note that additivity will not hold.
If additivity is important, then you must use `scaler_type` when initializing the model, as we do in this tutorial. This scales each window of data during training, so we can easily inverse transform the attribution scores.
Again, no matter which approach you choose, the relative attribution scores are still valid and comparable. It's only additivity that is impacted. If you specify a `local_scaler_type`, then a warning is issued about additivity. The sketch below contrasts the two options.
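For illustration, the two options could look like this (hyperparameters are placeholders):

```python
# Option 1: scale each window inside the model; attributions can be inverse-transformed and additivity holds.
nf_window = NeuralForecast(
    models=[NHITS(h=horizon, input_size=2 * horizon, scaler_type="robust", max_steps=100)],
    freq="M",
)

# Option 2: scale the whole dataset in the NeuralForecast object; relative importance stays meaningful,
# but attributions remain on the scaled data and additivity does not hold (a warning is issued).
nf_global = NeuralForecast(
    models=[NHITS(h=horizon, input_size=2 * horizon, max_steps=100)],
    freq="M",
    local_scaler_type="standard",
)
```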
Explaining recurrent models
You can explain recurrent models (LSTM, GRU). Just note that if you set `recurrent=True`, then the Integrated Gradients explainer is not supported. If `recurrent=False`, you can use any explainer.
References
- M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic Attribution for Deep Networks.” Available: https://arxiv.org/pdf/1703.01365
- S. M. Lundberg and S.-I. Lee, “A Unified Approach to Interpreting Model Predictions,” Nov. 2017. Available: https://arxiv.org/pdf/1705.07874
- J. Castro, D. Gómez, and J. Tejada, “Polynomial calculation of the Shapley value based on sampling,” Computers & Operations Research, vol. 36, no. 5, pp. 1726–1730, May 2009, doi: https://doi.org/10.1016/j.cor.2008.04.004.
- A. Shrikumar, P. Greenside, A. Shcherbina, and A. Kundaje, “Not Just a Black Box: Learning Important Features Through Propagating Activation Differences,” arXiv:1605.01713 [cs], Apr. 2017, Available: https://arxiv.org/abs/1605.01713