# Favorita

> Favorita dataset

### `FavoritaData`

Favorita Data.

The processed Favorita grocery dataset contains the daily sales history of items, with additional
information on promotions, items, stores, and holidays. It comprises 371,312 series from
January 2013 to August 2017, with a geographic hierarchy of states, cities, and stores.
The wrangling matches that of the DPMN paper.

References:

* [Kin G. Olivares, O. Nganba Meetei, Ruijun Ma, Rohan Reddy, Mengfei Cao,
  Lee Dicker (2022). "Probabilistic Hierarchical Forecasting with Deep Poisson
  Mixtures". International Journal of Forecasting, special
  issue.](https://doi.org/10.1016/j.ijforecast.2023.04.007)

#### `FavoritaData.load`

```python theme={null}
load(directory, group, cache=True, verbose=False)
```

Load Favorita forecasting benchmark dataset.

In contrast with other hierarchical datasets, this dataset contains a geographic
hierarchy for each individual grocery item series, identified by the 'item\_id' column.
The geographic hierarchy is captured by the 'hier\_id' column.

For this reason, minor wrangling is needed to adapt it for use with the HierarchicalForecast
and StatsForecast libraries.

**Parameters:**

| Name        | Type                       | Description                                                             | Default            |
| ----------- | -------------------------- | ----------------------------------------------------------------------- | ------------------ |
| `directory` | <code>[str](#str)</code>   | Directory where data will be downloaded and saved.                      | *required*         |
| `group`     | <code>[str](#str)</code>   | Dataset group name in 'Favorita200', 'Favorita500', 'FavoritaComplete'. | *required*         |
| `cache`     | <code>[bool](#bool)</code> | If True, saves the data and reloads it from disk. Defaults to True.     | <code>True</code>  |
| `verbose`   | <code>[bool](#bool)</code> | Whether to print partial outputs. Defaults to False.                    | <code>False</code> |

**Returns:**

| Name    | Type | Description                                                                                                                                                                                                                                                              |
| ------- | ---- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `tuple` |      | A tuple containing: (1) Y\_df (pd.DataFrame): target base time series with columns \['item\_id', 'hier\_id', 'ds', 'y']; (2) S\_df (pd.DataFrame): hierarchical constraints dataframe of size (base, bottom); (3) tags (dict): dictionary with hierarchical level information. |

Example:

```python theme={null}
# Qualitative evaluation of hierarchical data
from datasetsforecast.favorita import FavoritaData
from hierarchicalforecast.utils import HierarchicalPlot

group = 'Favorita200' # 'Favorita500', 'FavoritaComplete'
directory = './data/favorita'
Y_df, S_df, tags = FavoritaData.load(directory=directory, group=group)

Y_item_df = Y_df[Y_df.item_id==1916577] # 112830, 1501570, 1916577
Y_item_df = Y_item_df.rename(columns={'hier_id': 'unique_id'})
Y_item_df = Y_item_df.set_index('unique_id')
del Y_item_df['item_id']

hplots = HierarchicalPlot(S=S_df, tags=tags)
hplots.plot_hierarchically_linked_series(
    Y_df=Y_item_df, bottom_series='store_[40]',
)
```

#### `FavoritaData.load_preprocessed`

```python theme={null}
load_preprocessed(directory, group, cache=True, verbose=False)
```

Load Favorita group datasets.

For the exploration of more complex models, we make available the entire information
including data at the bottom level of the items sold in Favorita stores, in addition
to the aggregate/national level information for the items.

**Parameters:**

| Name        | Type                       | Description                                                             | Default            |
| ----------- | -------------------------- | ----------------------------------------------------------------------- | ------------------ |
| `directory` | <code>[str](#str)</code>   | Directory where data will be downloaded and saved.                      | *required*         |
| `group`     | <code>[str](#str)</code>   | Dataset group name in 'Favorita200', 'Favorita500', 'FavoritaComplete'. | *required*         |
| `cache`     | <code>[bool](#bool)</code> | If True, saves the data and reloads it from disk. Defaults to True.     | <code>True</code>  |
| `verbose`   | <code>[bool](#bool)</code> | Whether to print partial outputs. Defaults to False.                    | <code>False</code> |

**Returns:**

| Name    | Type                                                                                                                                                                 | Description                                                                                                                                                                                                                                                                                                                         |
| ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `tuple` | <code>[Tuple](#typing.Tuple)\[[DataFrame](#pandas.DataFrame), [DataFrame](#pandas.DataFrame), [DataFrame](#pandas.DataFrame), [DataFrame](#pandas.DataFrame)]</code> | A tuple containing: (1) static\_bottom (pd.DataFrame): static variables of bottom level series; (2) static\_agg (pd.DataFrame): static variables of aggregate level series; (3) temporal\_bottom (pd.DataFrame): temporal variables of bottom level series; (4) temporal\_agg (pd.DataFrame): temporal variables of aggregate level series. |

## Auxiliary Functions

These auxiliary functions are used to efficiently create and wrangle
Favorita’s series.

## Numpy Wrangling

### `numpy_balance`

```python theme={null}
numpy_balance(*arrs)
```

Fast NumPy implementation of the 'balance' operation.

Useful to create a balanced panel dataset, i.e., a dataset with all the
combinations of 'unique\_id' and 'ds'.

**Parameters:**

| Name    | Type | Description   | Default         |
| ------- | ---- | ------------- | --------------- |
| `*arrs` |      | NumPy arrays. | <code>()</code> |

**Returns:**

| Type                                   | Description                             |
| -------------------------------------- | --------------------------------------- |
| <code>[ndarray](#numpy.ndarray)</code> | NumPy array with balanced combinations. |

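As a rough illustration of what a balanced panel looks like, the sketch below builds every combination of two small arrays with `np.meshgrid`. The helper name `balance_sketch` and the toy values are hypothetical, and the code is not taken from the library; it only assumes that the inputs are 1-D arrays.

```python
import numpy as np

def balance_sketch(*arrs):
    """Return every combination of the input 1-D arrays, one combination per row."""
    # With 'ij' indexing, meshgrid yields one N-dimensional grid per input array
    grids = np.meshgrid(*arrs, indexing="ij")
    # Flatten each grid and stack them as columns: shape (prod of lengths, N)
    return np.stack([g.ravel() for g in grids], axis=1)

unique_ids = np.array([10, 20])  # hypothetical series identifiers
ds = np.array([1, 2, 3])         # hypothetical time steps
panel = balance_sketch(unique_ids, ds)
print(panel)
```

Each of the two identifiers is paired with each of the three time steps, so the result has six rows and two columns.
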
### `numpy_ffill`

```python theme={null}
numpy_ffill(arr)
```

Fast NumPy implementation of `ffill` that fills missing values.

Fills missing values in an array by propagating the last non-missing value forward.

For example, if the array has the following values:

```
0  1  2    3
1  2  NaN  4
```

The `ffill` method would fill the missing values as follows:

```
0  1  2  3
1  2  2  4
```

**Parameters:**

| Name  | Type                                   | Description  | Default    |
| ----- | -------------------------------------- | ------------ | ---------- |
| `arr` | <code>[ndarray](#numpy.ndarray)</code> | NumPy array. | *required* |

**Returns:**

| Type                                   | Description                             |
| -------------------------------------- | --------------------------------------- |
| <code>[ndarray](#numpy.ndarray)</code> | NumPy array with forward-filled values. |

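One common way to forward-fill without a Python loop is to track, for every cell, the column index of the most recent non-missing value. The sketch below reproduces the example above under the assumption that the input is a 2-D float array filled along each row; `ffill_rows_sketch` is a hypothetical name, not the library function.

```python
import numpy as np

def ffill_rows_sketch(arr):
    """Forward-fill NaNs along each row of a 2-D float array."""
    mask = np.isnan(arr)
    # Column index of each valid cell, 0 for missing cells
    idx = np.where(~mask, np.arange(arr.shape[1]), 0)
    # The running maximum gives the column of the most recent valid value
    idx = np.maximum.accumulate(idx, axis=1)
    # Gather the values at those columns, row by row
    return arr[np.arange(arr.shape[0])[:, None], idx]

arr = np.array([[0.0, 1.0, 2.0, 3.0],
                [1.0, 2.0, np.nan, 4.0]])
filled = ffill_rows_sketch(arr)
print(filled)
```

In this sketch, a NaN at the start of a row has no earlier value to copy and therefore stays NaN.
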
### `numpy_bfill`

```python theme={null}
numpy_bfill(arr)
```

Fast NumPy implementation of `bfill` that fills missing values.

Fills missing values in an array by propagating the next non-missing value backward.

For example, if the array has the following values:

```
0  1  2    3
1  2  NaN  4
```

The `bfill` method would fill the missing values as follows:

```
0  1  2  3
1  2  4  4
```

**Parameters:**

| Name  | Type                                   | Description  | Default    |
| ----- | -------------------------------------- | ------------ | ---------- |
| `arr` | <code>[ndarray](#numpy.ndarray)</code> | NumPy array. | *required* |

**Returns:**

| Type                                   | Description                              |
| -------------------------------------- | ---------------------------------------- |
| <code>[ndarray](#numpy.ndarray)</code> | NumPy array with backward-filled values. |

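The backward fill can be sketched with the same index trick, this time taking a running minimum from the right to find the next non-missing value. As before, this assumes a 2-D float array filled along each row, and `bfill_rows_sketch` is a hypothetical name rather than the library function.

```python
import numpy as np

def bfill_rows_sketch(arr):
    """Backward-fill NaNs along each row of a 2-D float array."""
    mask = np.isnan(arr)
    n_cols = arr.shape[1]
    # Column index of each valid cell, last column index for missing cells
    idx = np.where(~mask, np.arange(n_cols), n_cols - 1)
    # The running minimum from the right gives the column of the next valid value
    idx = np.minimum.accumulate(idx[:, ::-1], axis=1)[:, ::-1]
    # Gather the values at those columns, row by row
    return arr[np.arange(arr.shape[0])[:, None], idx]

arr = np.array([[0.0, 1.0, 2.0, 3.0],
                [1.0, 2.0, np.nan, 4.0]])
filled = bfill_rows_sketch(arr)
print(filled)
```

In this sketch, a NaN at the end of a row has no later value to copy and therefore stays NaN.
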
### `one_hot_encoding`

```python theme={null}
one_hot_encoding(df, index_col)
```

One-hot encodes a DataFrame's categorical variables, skipping the index column.

**Parameters:**

| Name        | Type                                        | Description                         | Default    |
| ----------- | ------------------------------------------- | ----------------------------------- | ---------- |
| `df`        | <code>[DataFrame](#pandas.DataFrame)</code> | DataFrame with categorical columns. | *required* |
| `index_col` | <code>[str](#str)</code>                    | The index column to avoid encoding. | *required* |

**Returns:**

| Type                                        | Description                                         |
| ------------------------------------------- | --------------------------------------------------- |
| <code>[DataFrame](#pandas.DataFrame)</code> | DataFrame with one hot encoded categorical columns. |

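For readers unfamiliar with the operation, a comparable result can be obtained with `pandas.get_dummies`. This is only a sketch of the idea: `one_hot_sketch` and the toy `stores` frame are hypothetical, and the library function may differ in column naming and dtypes.

```python
import pandas as pd

def one_hot_sketch(df, index_col):
    """One-hot encode every column of `df` except `index_col`."""
    encode_cols = [c for c in df.columns if c != index_col]
    # dtype=int keeps the dummy columns numeric across pandas versions
    return pd.get_dummies(df, columns=encode_cols, dtype=int)

stores = pd.DataFrame({
    "store_id": [1, 2, 3],                    # index column, left untouched
    "city": ["Quito", "Guayaquil", "Quito"],  # categorical column to encode
})
encoded = one_hot_sketch(stores, index_col="store_id")
print(encoded)
```

Note that `get_dummies` orders the dummy columns by sorted category value, which is acceptable for flat categories.
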
### `nested_one_hot_encoding`

```python theme={null}
nested_one_hot_encoding(df, index_col)
```

One-hot encodes a DataFrame's hierarchically-nested categorical variables.

Skips the index column. Nested categorical variables (for example, the geographic levels
country>state) require the dummy features to preserve the encoding order, so that they
reflect the hierarchy of the categorical variables.

**Parameters:**

| Name        | Type                                        | Description                                               | Default    |
| ----------- | ------------------------------------------- | --------------------------------------------------------- | ---------- |
| `df`        | <code>[DataFrame](#pandas.DataFrame)</code> | DataFrame with hierarchically-nested categorical columns. | *required* |
| `index_col` | <code>[str](#str)</code>                    | The index column to avoid encoding.                       | *required* |

**Returns:**

| Type                                        | Description                                                               |
| ------------------------------------------- | ------------------------------------------------------------------------- |
| <code>[DataFrame](#pandas.DataFrame)</code> | DataFrame with one hot encoded hierarchically-nested categorical columns. |

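One way to preserve the encoding order is to fix each column's categories in order of first appearance before creating the dummies, so that rows sorted by the hierarchy produce dummy columns in the same order. The sketch below rests on that assumption; `nested_one_hot_sketch` and the toy data are hypothetical, and the library may implement the ordering differently.

```python
import pandas as pd

def nested_one_hot_sketch(df, index_col):
    """One-hot encode nested columns, keeping categories in order of appearance."""
    out = df[[index_col]].copy()
    for col in df.columns:
        if col == index_col:
            continue
        # unique() returns the categories in order of first appearance
        ordered = pd.Categorical(df[col], categories=df[col].unique())
        dummies = pd.get_dummies(
            pd.Series(ordered, index=df.index), prefix=col, dtype=int
        )
        out = pd.concat([out, dummies], axis=1)
    return out

# Rows are sorted by the hierarchy state > city (toy data)
stores = pd.DataFrame({
    "store_id": [1, 2, 3],
    "state": ["Pichincha", "Pichincha", "Guayas"],
    "city": ["Quito", "Cayambe", "Guayaquil"],
})
encoded = nested_one_hot_sketch(stores, index_col="store_id")
print(list(encoded.columns))
```

An alphabetical encoding would place `state_Guayas` before `state_Pichincha`, which would no longer line up with the row order of the hierarchy.
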
### `get_levels_from_S_df`

```python theme={null}
get_levels_from_S_df(S_df)
```

Get the hierarchical index levels implied by the aggregation constraints dataframe.

Creates levels from the summing matrix of size (base, bottom).
Goes through the rows until all the bottom level series are 'covered'
by the aggregation constraints, to discover the blocks/hierarchy levels.

**Parameters:**

| Name   | Type                                        | Description                                                  | Default    |
| ------ | ------------------------------------------- | ------------------------------------------------------------ | ---------- |
| `S_df` | <code>[DataFrame](#pandas.DataFrame)</code> | Summing matrix of size (base, bottom), see aggregate method. | *required* |

**Returns:**

| Name     | Type                       | Description                                                    |
| -------- | -------------------------- | -------------------------------------------------------------- |
| `levels` | <code>[list](#list)</code> | Hierarchical aggregation indexes, where each entry is a level. |

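The 'covering' idea can be sketched with a cumulative sum: if the rows are ordered level by level and each level covers every bottom series exactly once, the running count of covered series reaches a multiple of the number of bottom series exactly at the end of each level. The function name, the plain NumPy input, and the toy matrix below are assumptions made for illustration.

```python
import numpy as np

def levels_from_S_sketch(S, row_names):
    """Group the rows of a summing matrix S (n_series x n_bottom) into levels."""
    n_bottom = S.shape[1]
    # Running count of bottom series covered by the rows seen so far
    covered = S.sum(axis=1).cumsum()
    # A level ends wherever the count is a multiple of n_bottom
    ends = np.flatnonzero(covered % n_bottom == 0) + 1
    starts = np.concatenate(([0], ends[:-1]))
    return [list(row_names[s:e]) for s, e in zip(starts, ends)]

S = np.array([
    [1, 1, 1],  # total
    [1, 1, 0],  # state A
    [0, 0, 1],  # state B
    [1, 0, 0],  # store 1
    [0, 1, 0],  # store 2
    [0, 0, 1],  # store 3
])
names = ["total", "A", "B", "s1", "s2", "s3"]
levels = levels_from_S_sketch(S, names)
print(levels)
```

The sketch relies on a strict hierarchy; rows that are out of level order or levels that overlap would break the multiple-of-`n_bottom` property.
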
### `distance_to_holiday`

```python theme={null}
distance_to_holiday(holiday_dates, dates)
```

### `make_holidays_distance_df`

```python theme={null}
make_holidays_distance_df(holidays_df, dates)
```

### `CodeTimer`

```python theme={null}
CodeTimer(name=None, verbose=True)
```

### `Favorita200`

```python theme={null}
Favorita200(freq='D', horizon=34, seasonality=7, test_size=34, tags_names=('Country', 'Country/State', 'Country/State/City', 'Country/State/City/Store'))
```

### `Favorita500`

```python theme={null}
Favorita500(freq='D', horizon=34, seasonality=7, test_size=34, tags_names=('Country', 'Country/State', 'Country/State/City', 'Country/State/City/Store'))
```

### `FavoritaComplete`

### `FavoritaRawData`

Favorita Raw Data.

Raw subset datasets from the Favorita 2018 Kaggle competition.
This class contains utilities to download, load and filter portions of the dataset.

If you prefer, you can also download the original dataset directly from Kaggle:

```
pip install kaggle --upgrade
kaggle competitions download -c favorita-grocery-sales-forecasting
```

#### `FavoritaRawData.download`

```python theme={null}
download(directory)
```

Downloads Favorita Competition Dataset.

The dataset weighs 980 MB, and its download is not currently robust to
brief interruptions of the process. Running it on a stable connection
is recommended.

**Parameters:**

| Name        | Type                     | Description                              | Default    |
| ----------- | ------------------------ | ---------------------------------------- | ---------- |
| `directory` | <code>[str](#str)</code> | Directory where data will be downloaded. | *required* |

Examples:

```python theme={null}
from datasetsforecast.favorita import FavoritaRawData

verbose = True
group = 'Favorita200'  # 'Favorita500', 'FavoritaComplete'
directory = './data/favorita'  # directory = 's3://favorita'

# Load the raw data of the selected group together with its item, store, and date filters
filter_items, filter_stores, filter_dates, raw_group_data = (
    FavoritaRawData._load_raw_group_data(
        directory=directory, group=group, verbose=verbose
    )
)

n_items = len(filter_items)
n_stores = len(filter_stores)
n_dates = len(filter_dates)

# Report the sizes of the filtered panel
print('\n')
print('n_stores: \t', n_stores)
print('n_items: \t', n_items)
print('n_dates: \t', n_dates)
print('n_items * n_dates: \t\t', n_items * n_dates)
print('n_items * n_stores: \t\t', n_items * n_stores)
print('n_items * n_dates * n_stores: \t', n_items * n_dates * n_stores)
```
