In this notebook, we'll implement anomaly detection in time series data.
Prerequisites: This tutorial assumes basic familiarity with StatsForecast. For a minimal example, visit the Quick Start.
Important: Once an anomaly has been identified, we must decide what to do with it. For example, we could remove it or replace it with another value. The correct course of action is context-dependent and beyond this notebook's scope. Removing an anomaly will likely improve the accuracy of the forecast, but it can also lead the model to underestimate the amount of randomness in the data.
Tip: You can use Colab to run this notebook interactively.
pip install statsforecast
| | unique_id | ds | y |
|---|---|---|---|
| 0 | H1 | 1 | 605.0 |
| 1 | H1 | 2 | 586.0 |
| 2 | H1 | 3 | 586.0 |
| 3 | H1 | 4 | 559.0 |
| 4 | H1 | 5 | 511.0 |
The input dataframe has three columns: unique_id, ds and y.

unique_id: (string, int or category) A unique identifier for the series.
ds: (timestamp or int) A timestamp in format YYYY-MM-DD or YYYY-MM-DD HH:MM:SS, or an integer indexing time.
y: (numeric) The measurement we wish to forecast.

n_series.
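As a minimal sketch of the long format described above, the following builds a small dataframe by hand with pandas. The values are the first five rows of the H1 series shown earlier; the helper name check_long_format is hypothetical and is not part of StatsForecast.

```python
import pandas as pd

# First five rows of the H1 series shown in the table above.
df = pd.DataFrame(
    {
        "unique_id": ["H1"] * 5,
        "ds": [1, 2, 3, 4, 5],
        "y": [605.0, 586.0, 586.0, 559.0, 511.0],
    }
)


def check_long_format(frame: pd.DataFrame) -> None:
    """Hypothetical helper: check the three required columns and that y is numeric."""
    missing = {"unique_id", "ds", "y"} - set(frame.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if not pd.api.types.is_numeric_dtype(frame["y"]):
        raise TypeError("y must be numeric")


check_long_format(df)
```

A check like this is optional; it only catches the most common formatting mistakes before the data reaches the model.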
The plots use the plot_series function from the utilsforecast package. This function has multiple parameters, and the required ones to generate the plots in this notebook are explained below.

df: A pandas dataframe with columns [unique_id, ds, y].
forecasts_df: A pandas dataframe with columns [unique_id, ds] and models.
ids: A list with the ids of the time series we want to plot.
level: Prediction interval levels to plot.
plot_anomalies: Whether or not to include the anomalies for each prediction interval.

We first import the MSTL model from statsforecast.models
and then we need to instantiate it. Since we're using hourly data, we have two
seasonal periods: one every 24 hours (daily) and one every 24*7 hours
(weekly). Hence, we need to set season_length = [24, 24*7].
The StatsForecast object takes the following parameters:

models: The list of models defined in the previous step.
freq: A string or integer indicating the frequency of the data. See pandas' available frequencies.
n_jobs: An integer that indicates the number of jobs used in parallel processing. Use -1 to select all cores.

The forecasts and prediction intervals are generated with the forecast method, which requires the following arguments:
df: The dataframe with the training data.
h: The forecasting horizon.
level: The confidence levels of the prediction intervals.
fitted: Return insample predictions.

We specify level and set fitted=True, since we'll need the insample forecasts and their prediction intervals to detect the anomalies.
| | unique_id | ds | MSTL | MSTL-lo-99 | MSTL-hi-99 |
|---|---|---|---|---|---|
| 0 | H1 | 749 | 607.607223 | 587.173250 | 628.041196 |
| 1 | H1 | 750 | 552.364253 | 521.069710 | 583.658796 |
| 2 | H1 | 751 | 506.785334 | 465.894977 | 547.675691 |
| 3 | H1 | 752 | 472.906141 | 423.114088 | 522.698195 |
| 4 | H1 | 753 | 452.240231 | 394.064394 | 510.416067 |
We can plot the forecasts with the plot_series function from before.
The insample forecasts can be recovered with the forecast_fitted_values method.
| | unique_id | ds | y | MSTL | MSTL-lo-99 | MSTL-hi-99 |
|---|---|---|---|---|---|---|
| 0 | H1 | 1 | 605.0 | 605.098607 | 584.678408 | 625.518805 |
| 1 | H1 | 2 | 586.0 | 588.496673 | 568.076474 | 608.916872 |
| 2 | H1 | 3 | 586.0 | 585.586856 | 565.166657 | 606.007054 |
| 3 | H1 | 4 | 559.0 | 554.012377 | 533.592178 | 574.432576 |
| 4 | H1 | 5 | 511.0 | 510.153508 | 489.733309 | 530.573707 |
Observations outside the 99% prediction interval:

| | unique_id | ds | y | MSTL | MSTL-lo-99 | MSTL-hi-99 |
|---|---|---|---|---|---|---|
| 42 | H1 | 43 | 613.0 | 649.404871 | 628.984672 | 669.825069 |
| 47 | H1 | 48 | 683.0 | 662.245526 | 641.825328 | 682.665725 |
| 48 | H1 | 49 | 687.0 | 655.382320 | 634.962122 | 675.802519 |
| 100 | H1 | 101 | 507.0 | 484.934230 | 464.514031 | 505.354428 |
| 110 | H1 | 111 | 451.0 | 474.899006 | 454.478808 | 495.319205 |
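Every row in the table above has y outside the interval bounded by MSTL-lo-99 and MSTL-hi-99. One way to express that filter in pandas is sketched below, using rows copied from the two tables above; the variable names are assumptions, and this is not necessarily how the notebook computes it.

```python
import pandas as pd

# Rows copied from the two insample tables above (series H1).
insample_forecasts = pd.DataFrame(
    {
        "unique_id": ["H1"] * 10,
        "ds": [1, 2, 3, 4, 5, 43, 48, 49, 101, 111],
        "y": [605.0, 586.0, 586.0, 559.0, 511.0, 613.0, 683.0, 687.0, 507.0, 451.0],
        "MSTL-lo-99": [
            584.678408, 568.076474, 565.166657, 533.592178, 489.733309,
            628.984672, 641.825328, 634.962122, 464.514031, 454.478808,
        ],
        "MSTL-hi-99": [
            625.518805, 608.916872, 606.007054, 574.432576, 530.573707,
            669.825069, 682.665725, 675.802519, 505.354428, 495.319205,
        ],
    }
)

# An observation counts as inside when lo <= y <= hi (bounds inclusive).
inside = insample_forecasts["y"].between(
    insample_forecasts["MSTL-lo-99"], insample_forecasts["MSTL-hi-99"]
)

# Anomalies are the observations that fall outside the interval.
anomalies = insample_forecasts[~inside]
print(anomalies["ds"].tolist())  # [43, 48, 49, 101, 111]
```

The first five rows fall inside their intervals and are dropped; the remaining five are the ones listed in the table.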
The anomalies can be plotted by passing the level and plot_anomalies arguments of the plot_series function.
To take a closer look, use the ids argument to select one particular time series, for example, H10.