The objective of the following article is to obtain a step-by-step guide on building Prediction intervals in forecasting models using mlforecast.

During this walkthrough, we will become familiar with the main MlForecast class and some relevant methods such as, MLForecast.predict and MLForecast.cross_validation in other.

Let’s start!!!

Table of contents

  1. Introduction
  2. Forecasts and prediction intervals
  3. Installing mlforecast
  4. Loading libraries and data
  5. Explore Data with the plot method
  6. Split the data into training and testing
  7. Modeling with mlforecast
  8. References


The target of our prediction is something unknown (otherwise we wouldn’t be making a prediction), so we can think of it as a random variable. For example, the total sales for the next month could have different possible values, and we won’t know what the exact value will be until we get the actual sales at the end of the month. Until next month’s sales are known, this is a random amount.

By the time the next month draws near, we usually have a pretty good idea of possible sales values. However, if we are forecasting sales for the same month next year, the possible values can vary much more. In most forecasting cases, the variability associated with what we are forecasting reduces as we get closer to the event. In other words, the further back in time we make the prediction, the more uncertainty there is.

We can imagine many possible future scenarios, each yielding a different value for what we are trying to forecast.

When we obtain a forecast, we are estimating the middle of the range of possible values the random variable could take. Often, a forecast is accompanied by a prediction interval giving a range of values the random variable could take with relatively high probability. For example, a 95% prediction interval contains a range of values which should include the actual future value with probability 95%.

Rather than plotting individual possible futures , we usually show these prediction intervals instead.

When we generate a forecast, we usually produce a single value known as the point forecast. This value, however, doesn’t tell us anything about the uncertainty associated with the forecast. To have a measure of this uncertainty, we need prediction intervals.

A prediction interval is a range of values that the forecast can take with a given probability. Hence, a 95% prediction interval should contain a range of values that include the actual future value with probability 95%. Probabilistic forecasting aims to generate the full forecast distribution. Point forecasting, on the other hand, usually returns the mean or the median or said distribution. However, in real-world scenarios, it is better to forecast not only the most probable future outcome, but many alternative outcomes as well.

The problem is that some timeseries models provide forecast distributions, but some other ones only provide point forecasts. How can we then estimate the uncertainty of predictions?

Forecasts and prediction intervals

There are at least four sources of uncertainty in forecasting using time series models:

  1. The random error term;
  2. The parameter estimates;
  3. The choice of model for the historical data;
  4. The continuation of the historical data generating process into the future.

When we produce prediction intervals for time series models, we generally only take into account the first of these sources of uncertainty. It would be possible to account for 2 and 3 using simulations, but that is almost never done because it would take too much time to compute. As computing speeds increase, it might become a viable approach in the future.

Even if we ignore the model uncertainty and the DGP uncertainty (sources 3 and 4), and just try to allow for parameter uncertainty as well as the random error term (sources 1 and 2), there are no closed form solutions apart from some simple special cases. see full article Rob J Hyndman

Forecast distributions

We use forecast distributions to express the uncertainty in our predictions. These probability distributions describe the probability of observing different future values using the fitted model. The point forecast corresponds to the mean of this distribution. Most time series models generate forecasts that follow a normal distribution, which implies that we assume that possible future values follow a normal distribution. However, later in this section we will look at some alternatives to normal distributions.

Importance of Confidence Interval Prediction in Time Series:

  1. Uncertainty Estimation: The confidence interval provides a measure of the uncertainty associated with time series predictions. It enables variability and the range of possible future values to be quantified, which is essential for making informed decisions.

  2. Precision evaluation: By having a confidence interval, the precision of the predictions can be evaluated. If the interval is narrow, it indicates that the forecast is more accurate and reliable. On the other hand, if the interval is wide, it indicates greater uncertainty and less precision in the predictions.

  3. Risk management: The confidence interval helps in risk management by providing information about possible future scenarios. It allows identifying the ranges in which the real values could be located and making decisions based on those possible scenarios.

  4. Effective communication: The confidence interval is a useful tool for communicating predictions clearly and accurately. It allows the variability and uncertainty associated with the predictions to be conveyed to the stakeholders, avoiding a wrong or overly optimistic interpretation of the results.

Therefore, confidence interval prediction in time series is essential to understand and manage uncertainty, assess the accuracy of predictions, and make informed decisions based on possible future scenarios.

Prediction intervals

A prediction interval gives us a range in which we expect yty_t to lie with a specified probability. For example, if we assume that the distribution of future observations follows a normal distribution, a 95% prediction interval for the forecast of step h would be represented by the range

y^T+hT±1.96σ^h,\hat{y}_{T+h|T} \pm 1.96 \hat\sigma_h,

Where σ^h\hat\sigma_h is an estimate of the standard deviation of the h -step forecast distribution.

More generally, a prediction interval can be written as

y^T+hT±cσ^h\hat{y}_{T+h|T} \pm c \hat\sigma_h

In this context, the term “multiplier c” is associated with the probability of coverage. In this article, intervals of 80% and 95% are typically calculated, but any other percentage can be used. The table below shows the values of c corresponding to different coverage probabilities, assuming a normal forecast distribution.


Prediction intervals are valuable because they reflect the uncertainty in the predictions. If we only generate point forecasts, we cannot assess how accurate those forecasts are. However, by providing prediction intervals, the amount of uncertainty associated with each forecast becomes apparent. For this reason, point forecasts may lack significant value without the inclusion of corresponding forecast intervals.

One-step prediction intervals

When making a prediction for a future step, it is possible to estimate the standard deviation of the forecast distribution using the standard deviation of the residuals, which is calculated by

where KK is the number of parameters estimated in the forecasting method, and MM is the number of missing values in the residuals. (For example, M=1M=1 for a naive forecast, because we can’t forecast the first observation.)

Multi-step prediction intervals

A typical feature of forecast intervals is that they tend to increase in length as the forecast horizon lengthens. As we move further out in time, there is greater uncertainty associated with the prediction, resulting in wider prediction intervals. In general, σh tends to increase as h increases (although there are some nonlinear forecasting methods that do not follow this property).

To generate a prediction interval, it is necessary to have an estimate of σh. As mentioned above, for one-step forecasts (h=1), equation (1) provides a good estimate of the standard deviation of the forecast, σ1. However, for multi-step forecasts, a more complex calculation method is required. These calculations assume that the residuals are uncorrelated with each other.

Benchmark methods

For the four benchmark methods, it is possible to mathematically derive the forecast standard deviation under the assumption of uncorrelated residuals. If σ^h\hat{\sigma}_h denotes the standard deviation of the hh -step forecast distribution, and σ^\hat{\sigma} is the residual standard deviation given by (1), then we can use the expressions shown in next Table. Note that when h=1h=1 and TT is large, these all give the same approximate value σ^\hat{\sigma}.

Methodh-step forecast standard deviation
Mean forecastsσ^h=σ^1+1/T\hat\sigma_h = \hat\sigma\sqrt{1 + 1/T}
Naïve forecastsσ^h=σ^h\hat\sigma_h = \hat\sigma\sqrt{h}
Seasonal naïve forecastsσ^h=σ^k+1\hat\sigma_h = \hat\sigma\sqrt{k+1}
Drift forecastsσ^h=σ^h(1+h/T)\hat\sigma_h = \hat\sigma\sqrt{h(1+h/T)}

Note that when h=1h=1 and TT is large, these all give the same approximate value σ^\hat{\sigma}.

Prediction intervals from bootstrapped residuals

When a normal distribution for the residuals is an unreasonable assumption, one alternative is to use bootstrapping, which only assumes that the residuals are uncorrelated with constant variance. We will illustrate the procedure using a naïve forecasting method.

A one-step forecast error is defined as et=yty^tt1e_t = y_t - \hat{y}_{t|t-1}. For a naïve forecasting method, y^tt1=yt1\hat{y}_{t|t-1} = y_{t-1}, so we can rewrite this as yt=yt1+et.y_t = y_{t-1} + e_t.

Assuming future errors will be similar to past errors, when t>Tt>T we can replace ete_{t} by sampling from the collection of errors we have seen in the past (i.e., the residuals). So we can simulate the next observation of a time series using

yT+1=yT+eT+1y^*_{T+1} = y_{T} + e^*_{T+1}

where eT+1e^*_{T+1} is a randomly sampled error from the past, and yT+1y^*_{T+1} is the possible future value that would arise if that particular error value occurred. We use We use a * to indicate that this is not the observed yT+1y_{T+1} value, but one possible future that could occur. Adding the new simulated observation to our data set, we can repeat the process to obtain

yT+2=yT+1+eT+2,y^*_{T+2} = y_{T+1}^* + e^*_{T+2},

where eT+2e^*_{T+2} is another draw from the collection of residuals. Continuing in this way, we can simulate an entire set of future values for our time series.

Conformal Prediction

Multi-quantile losses and statistical models can provide provide prediction intervals, but the problem is that these are uncalibrated, meaning that the actual frequency of observations falling within the interval does not align with the confidence level associated with it. For example, a calibrated 95% prediction interval should contain the true value 95% of the time in repeated sampling. An uncalibrated 95% prediction interval, on the other hand, might contain the true value only 80% of the time, or perhaps 99% of the time. In the first case, the interval is too narrow and underestimates the uncertainty, while in the second case, it is too wide and overestimates the uncertainty.

Statistical methods also assume normality. Here, we talk about another method called conformal prediction that doesn’t require any distributional assumptions.

Conformal prediction intervals use cross-validation on a point forecaster model to generate the intervals. This means that no prior probabilities are needed, and the output is well-calibrated. No additional training is needed, and the model is treated as a black box. The approach is compatible with any model

mlforecast now supports Conformal Prediction on all available models.

Installing mlforecast

  • using pip:

    • pip install mlforecast
  • using with conda:

    • conda install -c conda-forge mlforecast

Loading libraries and data

# Handling and processing of Data
# ==============================================================================
import numpy as np
import pandas as pd

import scipy.stats as stats

# Handling and processing of Data for Date (time)
# ==============================================================================
import datetime
import time
from datetime import datetime, timedelta

# ==============================================================================
from statsmodels.tsa.stattools import adfuller
import statsmodels.api as sm
import statsmodels.tsa.api as smt
from statsmodels.tsa.seasonal import seasonal_decompose 
# ==============================================================================
from utilsforecast.plotting import plot_series
import xgboost as xgb

from mlforecast import MLForecast
from mlforecast.lag_transforms import ExpandingMean, ExponentiallyWeightedMean, RollingMean
from mlforecast.target_transforms import Differences
from mlforecast.utils import PredictionIntervals
# Plot
# ==============================================================================
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from import plot_acf, plot_pacf

Read Data

data_url = ""
df = pd.read_csv(data_url, parse_dates=["timestamp"])
02014-07-01 00:00:0010844
12014-07-01 00:30:008127
22014-07-01 01:00:006210
32014-07-01 01:30:004656
42014-07-01 02:00:003820

The input to MlForecast is always a data frame in long format with three columns: unique_id, ds and y:

  • The unique_id (string, int or category) represents an identifier for the series.

  • The ds (datestamp) column should be of a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp.

  • The y (numeric) represents the measurement we wish to forecast.

df["unique_id"] = "1"
df.columns=["ds", "y", "unique_id"]
02014-07-01 00:00:00108441
12014-07-01 00:30:0081271
22014-07-01 01:00:0062101
32014-07-01 01:30:0046561
42014-07-01 02:00:0038201
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10320 entries, 0 to 10319
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   ds         10320 non-null  datetime64[ns]
 1   y          10320 non-null  int64         
 2   unique_id  10320 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 242.0+ KB

Explore Data with the plot method

Plot some series using the plot method from the StatsForecast class. This method prints 8 random series from the dataset and is useful for basic EDA.

fig = plot_series(df)

The Augmented Dickey-Fuller Test

An Augmented Dickey-Fuller (ADF) test is a type of statistical test that determines whether a unit root is present in time series data. Unit roots can cause unpredictable results in time series analysis. A null hypothesis is formed in the unit root test to determine how strongly time series data is affected by a trend. By accepting the null hypothesis, we accept the evidence that the time series data is not stationary. By rejecting the null hypothesis or accepting the alternative hypothesis, we accept the evidence that the time series data is generated by a stationary process. This process is also known as stationary trend. The values of the ADF test statistic are negative. Lower ADF values indicate a stronger rejection of the null hypothesis.

Augmented Dickey-Fuller Test is a common statistical test used to test whether a given time series is stationary or not. We can achieve this by defining the null and alternate hypothesis.

  • Null Hypothesis: Time Series is non-stationary. It gives a time-dependent trend.

  • Alternate Hypothesis: Time Series is stationary. In another term, the series doesn’t depend on time.

  • ADF or t Statistic < critical values: Reject the null hypothesis, time series is stationary.

  • ADF or t Statistic > critical values: Failed to reject the null hypothesis, time series is non-stationary.

def augmented_dickey_fuller_test(series , column_name):
    print (f'Dickey-Fuller test results for columns: {column_name}')
    dftest = adfuller(series, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','No Lags Used','Number of observations used'])
    for key,value in dftest[4].items():
       dfoutput['Critical Value (%s)'%key] = value
    print (dfoutput)
    if dftest[1] <= 0.05:
        print("Reject the null hypothesis")
        print("The data is stationary")
        print("The null hypothesis cannot be rejected")
        print("The data is not stationary")
Dickey-Fuller test results for columns: Ads
Test Statistic                -1.076452e+01
p-value                        2.472132e-19
No Lags Used                   3.900000e+01
Number of observations used    1.028000e+04
Critical Value (1%)           -3.430986e+00
Critical Value (5%)           -2.861821e+00
Critical Value (10%)          -2.566920e+00
dtype: float64
Reject the null hypothesis
The data is stationary

Autocorrelation plots

Autocorrelation Function

Definition 1. Let {xt;1tn}\{x_t;1 ≤ t ≤ n\} be a time series sample of size n from {Xt}\{X_t\}. 1. xˉ=t=1nxtn\bar x = \sum_{t=1}^n \frac{x_t}{n} is called the sample mean of {Xt}\{X_t\}. 2. ck=t=1nk(xt+kxˉ)(xtxˉ)/nc_k =\sum_{t=1}^{n−k} (x_{t+k}- \bar x)(x_t−\bar x)/n is known as the sample autocovariance function of {Xt}\{X_t\}. 3. rk=ck/c0r_k = c_k /c_0 is said to be the sample autocorrelation function of {Xt}\{X_t\}.

Note the following remarks about this definition:

  • Like most literature, this guide uses ACF to denote the sample autocorrelation function as well as the autocorrelation function. What is denoted by ACF can easily be identified in context.

  • Clearly c0 is the sample variance of {Xt}\{X_t\}. Besides, r0=c0/c0=1r_0 = c_0/c_0 = 1 and for any integer k,rk1k, |r_k| ≤ 1.

  • When we compute the ACF of any sample series with a fixed length nn, we cannot put too much confidence in the values of rkr_k for large k’s, since fewer pairs of (xt+k,xt)(x_{t +k }, x_t ) are available for calculating rkr_k as kk is large. One rule of thumb is not to estimate rkr_k for k>n/3k > n/3, and another is n50,kn/4n ≥ 50, k ≤ n/4. In any case, it is always a good idea to be careful.

  • We also compute the ACF of a nonstationary time series sample by Definition 1. In this case, however, the ACF or rkr_k very slowly or hardly tapers off as kk increases.

  • Plotting the ACF (rk)(r_k) against lag kk is easy but very helpful in analyzing time series sample. Such an ACF plot is known as a correlogram.

  • If {Xt}\{X_t\} is stationary with E(Xt)=0E(X_t)=0 and ρk=0\rho_k =0 for all k0k \neq 0,thatis,itisa white noise series, then the sampling distribution of rkr_k is asymptotically normal with the mean 0 and the variance of 1/n1/n. Hence, there is about 95% chance that rkr_k falls in the interval [1.96/n,1.96/n][−1.96/√n, 1.96/√n].

Now we can give a summary that (1) if the time series plot of a time series clearly shows a trend or/and seasonality, it is surely nonstationary; (2) if the ACF rkr_k very slowly or hardly tapers off as lag kk increases, the time series should also be nonstationary.

fig, axs = plt.subplots(nrows=1, ncols=2)

plot_acf(df["y"],  lags=30, ax=axs[0],color="fuchsia")

# Grafico
plot_pacf(df["y"],  lags=30, ax=axs[1],color="lime")
axs[1].set_title('Partial Autocorrelation')
plt.savefig("../../figs/prediction_intervals_in_forecasting_models__autocorrelation.png", bbox_inches='tight')

Decomposition of the time series

How to decompose a time series and why?

In time series analysis to forecast new values, it is very important to know past data. More formally, we can say that it is very important to know the patterns that values follow over time. There can be many reasons that cause our forecast values to fall in the wrong direction. Basically, a time series consists of four components. The variation of those components causes the change in the pattern of the time series. These components are:

  • Level: This is the primary value that averages over time.
  • Trend: The trend is the value that causes increasing or decreasing patterns in a time series.
  • Seasonality: This is a cyclical event that occurs in a time series for a short time and causes short-term increasing or decreasing patterns in a time series.
  • Residual/Noise: These are the random variations in the time series.

Combining these components over time leads to the formation of a time series. Most time series consist of level and noise/residual and trend or seasonality are optional values.

If seasonality and trend are part of the time series, then there will be effects on the forecast value. As the pattern of the forecasted time series may be different from the previous time series.

The combination of the components in time series can be of two types: * Additive * multiplicative

Additive time series

If the components of the time series are added to make the time series. Then the time series is called the additive time series. By visualization, we can say that the time series is additive if the increasing or decreasing pattern of the time series is similar throughout the series. The mathematical function of any additive time series can be represented by: y(t)=level+Trend+seasonality+noisey(t) = level + Trend + seasonality + noise

Multiplicative time series

If the components of the time series are multiplicative together, then the time series is called a multiplicative time series. For visualization, if the time series is having exponential growth or decline with time, then the time series can be considered as the multiplicative time series. The mathematical function of the multiplicative time series can be represented as.

y(t)=LevelTrendseasonalityNoisey(t) = Level * Trend * seasonality * Noise


a = seasonal_decompose(df["y"], model = "additive", period=24).plot()
a.savefig('../../figs/prediction_intervals_in_forecasting_models__seasonal_decompose_aditive.png', bbox_inches='tight')


b = seasonal_decompose(df["y"], model = "Multiplicative", period=24).plot()
b.savefig('../../figs/prediction_intervals_in_forecasting_models__seasonal_decompose_multiplicative.png', bbox_inches='tight')

Split the data into training and testing

Let’s divide our data into sets 1. Data to train our model. 2. Data to test our model

For the test data we will use the last 500 hours to test and evaluate the performance of our model.

train = df[df.ds<='2015-01-21 13:30:00'] 
test = df[df.ds>'2015-01-21 13:30:00']
train.shape, test.shape
((9820, 3), (500, 3))

Now let’s plot the training data and the test data.

fig = plot_series(train,test)

Modeling with mlforecast

Building Model

We define the model that we want to use, for our example we are going to use the XGBoost model.

model1 = [xgb.XGBRegressor()]

We can use the MLForecast.preprocess method to explore different transformations.

If it is true that the series we are working with is a stationary series see (Dickey fuller test), however for the sake of practice and instruction in this guide, we will apply the difference to our series, we will do this using the target_transforms parameter and calling the diff function like: mlforecast.target_transforms.Differences

mlf = MLForecast(models=model1,

It is important to take into account when we use the parameter target_transforms=[Differences([1])] in case the series is stationary we can use a difference, or in the case that the series is not stationary, we can use more than one difference so that the series is constant over time, that is, that it is constant in mean and in variance.

prep = mlf.preprocess(df)
12014-07-01 00:30:00-2717.01
22014-07-01 01:00:00-1917.01
32014-07-01 01:30:00-1554.01
42014-07-01 02:00:00-836.01
52014-07-01 02:30:00-947.01
103152015-01-31 21:30:00951.01
103162015-01-31 22:00:001051.01
103172015-01-31 22:30:001588.01
103182015-01-31 23:00:00-718.01
103192015-01-31 23:30:00-303.01

This has subtacted the lag 1 from each value, we can see what our series look like now.

fig = plot_series(prep)

Adding features


Looks like the seasonality is gone, we can now try adding some lag features.

mlf = MLForecast(models=model1,
prep = mlf.preprocess(df)
252014-07-01 12:30:00-22.01445.0-2717.0
262014-07-01 13:00:00-708.01-22.0-1917.0
272014-07-01 13:30:001281.01-708.0-1554.0
282014-07-01 14:00:0087.011281.0-836.0
292014-07-01 14:30:001045.0187.0-947.0
103152015-01-31 21:30:00951.01428.04642.0
103162015-01-31 22:00:001051.01951.0-519.0
103172015-01-31 22:30:001588.011051.02411.0
103182015-01-31 23:00:00-718.011588.0214.0
103192015-01-31 23:30:00-303.01-718.02595.0
prep.drop(columns=['unique_id', 'ds']).corr()['y']
y        1.000000
lag1     0.663082
lag24    0.155366
Name: y, dtype: float64

Lag transforms

Lag transforms are defined as a dictionary where the keys are the lags and the values are lists of the transformations that we want to apply to that lag. You can refer to the lag transformations guide for more details.

mlf = MLForecast(models=model1,
                 lag_transforms={1: [ExpandingMean()], 24: [RollingMean(window_size=7)]},
prep = mlf.preprocess(df)
312014-07-01 15:30:00-836.01-1211.0-305.0284.533325-1254.285767
322014-07-01 16:00:00-2316.01-836.0157.0248.387100-843.714294
332014-07-01 16:30:00-1215.01-2316.0-63.0168.250000-578.857117
342014-07-01 17:00:002190.01-1215.0357.0126.333336-305.857147
352014-07-01 17:30:002322.012190.01849.0187.02941977.714287
103152015-01-31 21:30:00951.01428.04642.01.2483032064.285645
103162015-01-31 22:00:001051.01951.0-519.01.3403781873.428589
103172015-01-31 22:30:001588.011051.02411.01.4421292179.000000
103182015-01-31 23:00:00-718.011588.0214.01.5959101888.714233
103192015-01-31 23:30:00-303.01-718.02595.01.5261682071.714355

You can see that both approaches get to the same result, you can use whichever one you feel most comfortable with.

Date features

If your time column is made of timestamps then it might make sense to extract features like week, dayofweek, quarter, etc. You can do that by passing a list of strings with pandas time/date components. You can also pass functions that will take the time column as input, as we’ll show here.

mlf = MLForecast(models=model1,
                 lag_transforms={1: [ExpandingMean()], 24: [RollingMean(window_size=7)]},
                 date_features=["year", "month", "day", "hour"]) # Seasonal data
prep = mlf.preprocess(df)
312014-07-01 15:30:00-836.01-1211.0-305.0284.533325-1254.28576720147115
322014-07-01 16:00:00-2316.01-836.0157.0248.387100-843.71429420147116
332014-07-01 16:30:00-1215.01-2316.0-63.0168.250000-578.85711720147116
342014-07-01 17:00:002190.01-1215.0357.0126.333336-305.85714720147117
352014-07-01 17:30:002322.012190.01849.0187.02941977.71428720147117
103152015-01-31 21:30:00951.01428.04642.01.2483032064.285645201513121
103162015-01-31 22:00:001051.01951.0-519.01.3403781873.428589201513122
103172015-01-31 22:30:001588.011051.02411.01.4421292179.000000201513122
103182015-01-31 23:00:00-718.011588.0214.01.5959101888.714233201513123
103192015-01-31 23:30:00-303.01-718.02595.01.5261682071.714355201513123

Fit the Model

# fit the models,  
prediction_intervals=PredictionIntervals(n_windows=5, h=30, method="conformal_distribution" )  )
MLForecast(models=[XGBRegressor], freq=30min, lag_features=['lag1', 'lag24', 'expanding_mean_lag1', 'rolling_mean_lag24_window_size7'], date_features=['year', 'month', 'day', 'hour'], num_threads=1)

Let’s see the results of our model in this case the XGBoost model. We can observe it with the following instruction:

Let us now visualize the fitted values of our models.

result = mlf.forecast_fitted_values()
result = result.set_index("unique_id")
12014-07-01 15:30:0018544.018441.443359
12014-07-01 16:00:0016228.016391.152344
12014-07-01 16:30:0015013.015260.714844
12014-07-01 17:00:0017203.017066.148438
12014-07-01 17:30:0019525.019714.404297
12015-01-31 21:30:0024670.024488.646484
12015-01-31 22:00:0025721.025868.865234
12015-01-31 22:30:0027309.027290.125000
12015-01-31 23:00:0026591.027123.226562
12015-01-31 23:30:0026288.026241.205078
from statsmodels.stats.diagnostic import normal_ad
from scipy import stats
sw_result = stats.shapiro(result["XGBRegressor"])
ad_result = normal_ad(np.array(result["XGBRegressor"]), axis=0)
dag_result = stats.normaltest(result["XGBRegressor"], axis=0, nan_policy='propagate')

It’s important to note that we can only use this method if we assume that the residuals of our validation predictions are normally distributed. To see if this is the case, we will use a PP-plot and test its normality with the Anderson-Darling, Kolmogorov-Smirnov, and D’Agostino K^2 tests.

The PP-plot(Probability-to-Probability) plots the data sample against the normal distribution plot in such a way that if normally distributed, the data points will form a straight line.

The three normality tests determine how likely a data sample is from a normally distributed population using p-values. The null hypothesis for each test is that “the sample came from a normally distributed population”. This means that if the resulting p-values are below a chosen alpha value, then the null hypothesis is rejected. Thus there is evidence to suggest that the data comes from a non-normal distribution. For this article, we will use an Alpha value of 0.01.

fig, axs = plt.subplots(nrows=2, ncols=2)

# plot[1,1]
axs[0,0].set_title("Residuals model");

# plot
axs[0,1].hist(result["XGBRegressor"], density=True,bins=50, alpha=0.5 )
axs[0,1].set_title("Density plot - Residual");

# plot
stats.probplot(result["XGBRegressor"], dist="norm", plot=axs[1,0])
axs[1,0].set_title('Plot Q-Q')
axs[1,0].annotate("SW p-val: {:.4f}".format(sw_result[1]), xy=(0.05,0.9), xycoords='axes fraction', fontsize=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray", pad=0.6))

axs[1,0].annotate("AD p-val: {:.4f}".format(ad_result[1]), xy=(0.05,0.8), xycoords='axes fraction', fontsize=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray", pad=0.6))

axs[1,0].annotate("DAG p-val: {:.4f}".format(dag_result[1]), xy=(0.05,0.7), xycoords='axes fraction', fontsize=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray", pad=0.6))
# plot
plot_acf(result["XGBRegressor"],  lags=35, ax=axs[1,1],color="fuchsia")

plt.savefig("../../figs/prediction_intervals_in_forecasting_models__plot_residual_model.png", bbox_inches='tight')

Predict method with prediction intervals

To generate forecasts use the predict method.

forecast_df = mlf.predict(h=30, level=[80,95])
012015-02-01 00:00:0026320.29882825559.88424125680.22836926960.36928727080.713416
112015-02-01 00:30:0026446.47265624130.42961425195.46162127697.48369128762.515698
212015-02-01 01:00:0024909.97070323094.95053723579.58339826240.35800826724.990869
312015-02-01 01:30:0024405.40234421548.62829622006.66259826804.14209027262.176392
412015-02-01 02:00:0022292.39062520666.73696321130.21543023454.56582023918.044287

Plot prediction intervals

Now let’s visualize the result of our forecast and the historical data of our time series, also let’s draw the confidence interval that we have obtained when making the prediction with 95% confidence.

fig = plot_series(df, forecast_df, level=[80,95], max_insample_length=200,engine="matplotlib")
fig.get_axes()[0].set_title("Prediction intervals")
fig.savefig('../../figs/prediction_intervals_in_forecasting_models__plot_forecasting_intervals.png', bbox_inches='tight')

The confidence interval is a range of values that has a high probability of containing the true value of a variable. In machine learning time series models, the confidence interval is used to estimate the uncertainty in the predictions.

One of the main benefits of using the confidence interval is that it allows users to understand the accuracy of the predictions. For example, if the confidence interval is very wide, it means that the prediction is less accurate. Conversely, if the confidence interval is very narrow, it means that the prediction is more accurate.

Another benefit of the confidence interval is that it helps users make informed decisions. For example, if a prediction is within the confidence interval, it means that it is likely to come true. Conversely, if a prediction is outside the confidence interval, it means that it is less likely to come true.

In general, the confidence interval is an important tool for machine learning time series models. It helps users understand the accuracy of the forecasts and make informed decisions.


  1. Changquan Huang • Alla Petukhina. Springer series (2022). Applied Time Series Analysis and Forecasting with Python.
  2. Ivan Svetunkov. Forecasting and Analytics with the Augmented Dynamic Adaptive Model (ADAM)
  3. James D. Hamilton. Time Series Analysis Princeton University Press, Princeton, New Jersey, 1st Edition, 1994.
  4. Nixtla Parameters for Mlforecast.
  5. Pandas available frequencies.
  6. Rob J. Hyndman and George Athanasopoulos (2018). “Forecasting principles and practice, Time series cross-validation”..
  7. Seasonal periods- Rob J Hyndman.