> ## Documentation Index
> Fetch the complete documentation index at: https://nixtlaverse.nixtla.io/llms.txt
> Use this file to discover all available pages before exploring further.

# Amazon Forecast vs StatsForecast

> Amazon’s AutoML vs open source statistical methods

## Data

We will make use of the [M5
competition](https://mofc.unic.ac.cy/m5-competition/) dataset provided
by Walmart. This dataset is interesting for its scale but also the fact
that it features many timeseries with infrequent occurances. Such
timeseries are common in retail scenarios and are difficult for
traditional timeseries forecasting techniques to address.

The data are ready for download at the following URLs:

* Train set:
  `https://m5-benchmarks.s3.amazonaws.com/data/train/target.parquet`
* Temporal exogenous variables (used by AmazonForecast):
  `https://m5-benchmarks.s3.amazonaws.com/data/train/temporal.parquet`
* Static exogenous variables (used by AmazonForecast):
  `https://m5-benchmarks.s3.amazonaws.com/data/train/static.parquet`

A more detailed description of the data can be found
[here](../../../datasetsforecast/m5).

> **Warning**
>
> The M5 competition is hierarchical. That is, forecasts are required
> for different levels of aggregation: national, state, store, etc. In
> this experiment, we only generate forecasts using the bottom-level
> data. The evaluation is performed using the bottom-up reconciliation
> method to obtain the forecasts for the higher hierarchies.

## Amazon Forecast

Amazon Forecast is a fully automated solution for time series
forecasting. The solution can take the time series to forecast and
exogenous variables (temporal and static). For this experiment, we used
the AutoPredict functionality of Amazon Forecast following the steps of
[this
tutorial](https://docs.aws.amazon.com/forecast/latest/dg/gs-console.html).
A detailed description of the particular steps for this dataset can be
found [here](./amazonstatsforecast.html).

Amazon Forecast creates predictors with AutoPredictor, which involves
applying the optimal combination of algorithms to each time series in
your datasets. The predictor is an Amazon Forecast model that is trained
using your target time series, related time series, item metadata, and
any additional datasets you include.

Included algorithms range from commonly used statistical algorithms like
Autoregressive Integrated Moving Average (ARIMA), to complex neural
network algorithms like CNN-QR and DeepAR+.: CNN-QR, DeepAR+, Prophet,
NPTS, ARIMA, and ETS.

To leverage the probabilistic features of Amazon Forecast and enable
confidence intervals for further analysis we forecasted the following
quantiles: 0.1 | 0.5 | 0.9.

The full pipeline of Amazon Forecast took 4.1 hours and the results can
be found here: `s3://m5-benchmarks/forecasts/amazonforecast-m5.parquet`

## Nixtla’s StatsForecast

### Install necessary libraries

We assume you have StatsForecast already installed. Check this guide for
instructions on [how to install
StatsForecast](../getting-started/installation.html).

Additionally, we will install `s3fs` to read from the S3 Filesystem of
AWS. (If you don’t want to use a cloud storage provider, you can read
your files locally using pandas)

```python theme={null}
%%capture
!pip install statsforecast s3fs
```

### Input format

We will use pandas to read the data set stored in a parquet file for
efficiency. You can use ordinary pandas operations to read your data in
other formats likes `.csv`.

The input to StatsForecast is always a data frame in [long
format](https://www.theanalysisfactor.com/wide-and-long-data/) with
three columns: `unique_id`, `ds` and `y`:

* The `unique_id` (string, int or category) represents an identifier
  for the series.

* The `ds` (datestamp) column should be of a format expected by
  Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a
  timestamp.

* The `y` (numeric) represents the measurement we wish to forecast. We
  will rename the

So we will rename the original columns to make it compatible with
StatsForecast.

Depending on your internet connection, this step should take around 20
seconds.

> **Warning**
>
> We are reading a file from S3, so you need to install the s3fs
> library. To install it, run `! pip install s3fs`

### Read data

```python theme={null}
import pandas as pd

Y_df_m5 = pd.read_parquet('https://m5-benchmarks.s3.amazonaws.com/data/train/target.parquet') 

Y_df_m5 = Y_df_m5.rename(columns={
    'item_id': 'unique_id', 
    'timestamp': 'ds', 
    'demand': 'y'
}) 

Y_df_m5.head()
```

|   | unique\_id           | ds         | y   |
| - | -------------------- | ---------- | --- |
| 0 | FOODS\_1\_001\_CA\_1 | 2011-01-29 | 3.0 |
| 1 | FOODS\_1\_001\_CA\_1 | 2011-01-30 | 0.0 |
| 2 | FOODS\_1\_001\_CA\_1 | 2011-01-31 | 0.0 |
| 3 | FOODS\_1\_001\_CA\_1 | 2011-02-01 | 1.0 |
| 4 | FOODS\_1\_001\_CA\_1 | 2011-02-02 | 4.0 |

### Train statistical models

We fit the model by instantiating a new `StatsForecast` object with the
following parameters:

* `models`: a list of models. Select the models you want from
  [models](../../src/core/models.html) and import them. For this
  example, we will use `AutoETS` and `DynamicOptimizedTheta`. We set
  `season_length` to 7 because we expect seasonal effects every week.
  (See: [Seasonal
  periods](https://robjhyndman.com/hyndsight/seasonal-periods/))

* `freq`: a string indicating the frequency of the data. (See [panda’s
  available
  frequencies](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases).)

* `n_jobs`: n\_jobs: int, number of jobs used in the parallel
  processing, use -1 for all cores.

* `fallback_model`: a model to be used if a model fails.

Any settings are passed into the constructor. Then you call its fit
method and pass in the historical data frame.

* `AutoETS`: Exponential Smoothing model. Automatically selects the
  best ETS (Error, Trend, Seasonality) model using an information
  criterion. Ref: `AutoETS`.

* `SeasonalNaive`: Memory Efficient Seasonal Naive predictions. Ref:
  `SeasonalNaive`.

* `DynamicOptimizedTheta`: fit two theta lines to a deseasonalized
  time series, using different techniques to obtain and combine the
  two theta lines to produce the final forecasts. Ref:
  `DynamicOptimizedTheta`.

```python theme={null}
from statsforecast import StatsForecast
from statsforecast.models import (
    AutoETS,
    DynamicOptimizedTheta,
    SeasonalNaive
)

# Create list of models
models = [
    AutoETS(season_length=7),
    DynamicOptimizedTheta(season_length=7),
]

# Instantiate StatsForecast class
sf = StatsForecast( 
    models=models,
    freq='D', 
    n_jobs=-1,
    fallback_model=SeasonalNaive(season_length=7)
)
```

```text theme={null}
/home/ubuntu/fede/statsforecast/statsforecast/core.py:21: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
  from tqdm.autonotebook import tqdm
```

The `forecast` method takes two arguments: forecasts the next `h` (for
horizon) and `level`.

* `h` (int): represents the forecast h steps into the future. In this
  case, 12 months ahead.

* `level` (list of floats): this optional parameter is used for
  probabilistic forecasting. Set the `level` (or confidence
  percentile) of your prediction interval. For example, `level=[90]`
  means that the model expects the real value to be inside that
  interval 90% of the times.

The forecast object here is a new data frame that includes a column with
the name of the model and the y hat values, as well as columns for the
uncertainty intervals.

> **Note**
>
> The `forecast` is inteded to be compatible with distributed clusters,
> so it does not store any model parameters. If you want to store
> parameter for everymodel you can use the `fit` and `predict` methods.
> However, those methods are not defined for distrubed engines like
> Spark, Ray or Dask.

```python theme={null}
from time import time
```

```python theme={null}
init = time()
forecasts_df = sf.forecast(df=Y_df_m5, h=28)
end = time()
print(f'Statsforecast time M5 {(end - init) / 60}')
```

```text theme={null}
Statsforecast time M5 14.274124479293823
```

Store the results for further evaluation.

```python theme={null}
forecasts_df['ThETS'] = forecasts_df[['DynamicOptimizedTheta', 'AutoETS']].clip(0).median(axis=1, numeric_only=True)
forecasts_df.to_parquet('s3://m5-benchmarks/forecasts/statsforecast-m5.parquet')
```

## Evaluation

This section evaluates the performance of `StatsForecast` and
`AmazonForecast`. To do this, we first need to install
[datasetsforecast](https://github.com/Nixtla/datasetsforecast), a Python
library developed by Nixtla that includes a large battery of benchmark
datasets and evaluation utilities. The library will allow us to
calculate the performance of the models using the original evaluation
used in the competition.

```python theme={null}
%%capture
!pip install datasetsforecast
```

```python theme={null}
from datasetsforecast.m5 import M5, M5Evaluation
```

The following function will allow us to evaluate a specific model
included in the input dataframe. The function is useful for evaluating
different models.

```python theme={null}
from datasetsforecast.m5 import M5, M5Evaluation
from statsforecast import StatsForecast

### Evaluator
def evaluate_forecasts(df, model, model_name):
    Y_hat = df.set_index('ds', append=True)[model].unstack()
    *_, S_df = M5.load('data')
    Y_hat = S_df.merge(Y_hat, how='left', on=['unique_id'])
    eval_ = M5Evaluation.evaluate(y_hat=Y_hat, directory='./data')
    eval_ = eval_.rename(columns={'wrmsse': f'{model_name}_{model}_wrmsse'})
    return eval_
```

Now let’s read the forecasts generated for each solution.

```python theme={null}
### Read Forecasts
statsforecasts_df = pd.read_parquet('s3://m5-benchmarks/forecasts/statsforecast-m5.parquet')
amazonforecasts_df = pd.read_parquet('s3://m5-benchmarks/forecasts/amazonforecast-m5.parquet')

### Amazon Forecast wrangling
amazonforecasts_df = amazonforecasts_df.rename(columns={'item_id': 'unique_id', 'date': 'ds'})
# amazon forecast returns the unique_id column in lower case
# we need to transform it to upper case to ensure proper merging
amazonforecasts_df['unique_id'] = amazonforecasts_df['unique_id'].str.upper()
amazonforecasts_df = amazonforecasts_df.set_index('unique_id')
# parse datestamp
amazonforecasts_df['ds'] = pd.to_datetime(amazonforecasts_df['ds']).dt.tz_localize(None)
```

Finally, let’s use our predefined function to compute the performance of
each model.

```python theme={null}
### Evaluate performances
m5_eval_df = pd.concat([
    evaluate_forecasts(statsforecasts_df, 'ThETS', 'StatsForecast'),
    evaluate_forecasts(statsforecasts_df, 'AutoETS', 'StatsForecast'),
    evaluate_forecasts(statsforecasts_df, 'DynamicOptimizedTheta', 'StatsForecast'),
    evaluate_forecasts(amazonforecasts_df, 'p50', 'AmazonForecast'),
], axis=1)
m5_eval_df.T
```

|                                              | Total    | Level1   | Level2   | Level3   | Level4   | Level5   | Level6   | Level7   | Level8   | Level9   | Level10  | Level11  | Level12  |
| -------------------------------------------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| StatsForecast\_ThETS\_wrmsse                 | 0.669606 | 0.424331 | 0.515777 | 0.580670 | 0.474098 | 0.552459 | 0.578092 | 0.651079 | 0.642446 | 0.725324 | 1.009390 | 0.967537 | 0.914068 |
| StatsForecast\_AutoETS\_wrmsse               | 0.672404 | 0.430474 | 0.516340 | 0.580736 | 0.482090 | 0.559721 | 0.579939 | 0.655362 | 0.643638 | 0.727967 | 1.010596 | 0.968168 | 0.913820 |
| StatsForecast\_DynamicOptimizedTheta\_wrmsse | 0.675333 | 0.429670 | 0.521640 | 0.589278 | 0.478730 | 0.557520 | 0.584278 | 0.656283 | 0.650613 | 0.731735 | 1.013910 | 0.971758 | 0.918576 |
| AmazonForecast\_p50\_wrmsse                  | 1.617815 | 1.912144 | 1.786991 | 1.736382 | 1.972658 | 2.010498 | 1.805926 | 1.819329 | 1.667225 | 1.619216 | 1.156432 | 1.012942 | 0.914040 |

The results (including processing time and costs) can be summarized in
the following table.

<img width="637" alt="image" src="https://user-images.githubusercontent.com/10517170/206330119-48be0a7c-9ff6-412e-a52b-59a181c2a9d9.png" />