Auxiliary Functions

These auxiliary functions are used to efficiently create and wrangle Favorita’s series.

NumPy Wrangling


source

numpy_balance

 numpy_balance (*arrs)

Fast NumPy implementation of the ‘balance’ operation, useful to create a balanced panel dataset, i.e., a dataset containing all combinations of ‘unique_id’ and ‘ds’.

Parameters:
arrs: NumPy arrays.

Returns:
out: NumPy array.
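
As a rough illustration, the balance operation amounts to a cartesian product of the input arrays. A minimal NumPy sketch of the idea (illustrative only, not the library’s exact implementation):

import numpy as np

def balance_sketch(*arrs):
    # Pair every value of each input array with every value of the others.
    grids = np.meshgrid(*arrs, indexing='ij')
    return [g.ravel() for g in grids]

ids, ds = balance_sketch(np.array([1, 2]),
                         np.array(['2013-01-01', '2013-01-02']))
print(ids)  # [1 1 2 2]
print(ds)   # ['2013-01-01' '2013-01-02' '2013-01-01' '2013-01-02']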


source

numpy_ffill

 numpy_ffill (arr)

Fast NumPy implementation of ffill that fills missing values in an array by propagating the last non-missing value forward.

For example, if the array has the following values:

index:  0  1  2    3
values: 1  2  NaN  4

the ffill method would fill the missing value as follows:

index:  0  1  2  3
values: 1  2  2  4

Parameters:
arr: NumPy array.

Returns:
out: NumPy array.
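
A minimal NumPy forward-fill sketch (illustrative only, assuming a 2D float array filled along the last axis):

import numpy as np

def ffill_2d(arr):
    # Index of the last non-NaN column seen so far in each row;
    # leading NaNs keep index 0 and therefore remain NaN.
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(arr.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(arr.shape[0])[:, None], idx]

print(ffill_2d(np.array([[1.0, 2.0, np.nan, 4.0]])))  # [[1. 2. 2. 4.]]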


source

numpy_bfill

 numpy_bfill (arr)

Fast NumPy implementation of bfill that fills missing values in an array by propagating the next non-missing value backward.

For example, if the array has the following values:

index:  0  1  2    3
values: 1  2  NaN  4

the bfill method would fill the missing value as follows:

index:  0  1  2  3
values: 1  2  4  4

Parameters:
arr: NumPy array.

Returns:
out: NumPy array.
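
Since bfill is ffill applied along a reversed axis, a sketch can reuse the ffill_2d helper from the previous sketch:

def bfill_2d(arr):
    # Reverse, forward fill, reverse back.
    return ffill_2d(arr[:, ::-1])[:, ::-1]

print(bfill_2d(np.array([[1.0, 2.0, np.nan, 4.0]])))  # [[1. 2. 4. 4.]]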

Pandas Wrangling


source

one_hot_encoding

 one_hot_encoding (df, index_col)

Encodes the DataFrame df’s categorical variables, skipping index_col.

Parameters:
df: pd.DataFrame with categorical columns.
index_col: str, the index column to avoid encoding.

Returns:
one_hot_concat_df: pd.DataFrame with one hot encoded categorical columns.
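
A minimal pandas sketch of the operation (hypothetical column names; details may differ from the library’s implementation):

import pandas as pd

df = pd.DataFrame({'unique_id': ['s1', 's2', 's3'],
                   'family': ['GROCERY', 'DAIRY', 'GROCERY']})

index_col = 'unique_id'
cat_cols = [col for col in df.columns if col != index_col]
# Keep the index column as-is and one-hot encode the rest.
one_hot_concat_df = pd.concat([df[[index_col]],
                               pd.get_dummies(df[cat_cols])], axis=1)
print(one_hot_concat_df)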


source

nested_one_hot_encoding

 nested_one_hot_encoding (df, index_col)

Encodes the DataFrame df’s hierarchically-nested categorical variables, skipping index_col.

Nested categorical variables (e.g., geographic levels such as country > state) require the dummy features to preserve the encoding order, so that the resulting columns reflect the hierarchy of the categorical variables.

Parameters:
df: pd.DataFrame with hierarchically-nested categorical columns.
index_col: str, the index column to avoid encoding.

Returns:
one_hot_concat_df: pd.DataFrame with one hot encoded hierarchically-nested categorical columns.
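
A minimal sketch of the order-preserving idea, assuming the hierarchy columns are given parent-to-child (hypothetical names, not the library’s exact code):

import pandas as pd

df = pd.DataFrame({'unique_id': ['s1', 's2', 's3'],
                   'country': ['EC', 'EC', 'EC'],
                   'state': ['Pichincha', 'Guayas', 'Guayas']})

# Encode each hierarchy level separately and concatenate parent-to-child,
# so the dummy columns' order mirrors the country > state nesting.
encoded = [df[['unique_id']]]
for col in ['country', 'state']:
    encoded.append(pd.get_dummies(df[col], prefix=col))
one_hot_concat_df = pd.concat(encoded, axis=1)
print(one_hot_concat_df)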


source

get_levels_from_S_df

 get_levels_from_S_df (S_df)

Get the hierarchical index levels implied by the aggregation constraints dataframe S_df.

Creates the levels from the summation matrix of size (base, bottom). It iterates through the rows until all the bottom-level series are ‘covered’ by the aggregation constraints, discovering the blocks/hierarchy levels.

Parameters:
S_df: pd.DataFrame with summing matrix of size (base, bottom), see aggregate method.

Returns:
levels: list, with hierarchical aggregation indexes, where each entry is a level.
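
A toy illustration of the scan described above (a sketch of the idea, not the library’s exact implementation): rows of the summation matrix are visited top-down, and a level closes each time the accumulated rows cover all bottom series.

import pandas as pd

S_df = pd.DataFrame(
    [[1, 1, 1, 1],   # 'total' covers every bottom series
     [1, 1, 0, 0],   # 'state_A'
     [0, 0, 1, 1],   # 'state_B'
     [1, 0, 0, 0],   # identity block: the bottom series themselves
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]],
    index=['total', 'state_A', 'state_B', 's1', 's2', 's3', 's4'],
    columns=['s1', 's2', 's3', 's4'])

levels, current, covered = [], [], 0
for idx, row in S_df.iterrows():
    current.append(idx)
    covered += row.sum()
    if covered == S_df.shape[1]:  # all bottom series covered: close the level
        levels.append(current)
        current, covered = [], 0
print(levels)  # [['total'], ['state_A', 'state_B'], ['s1', 's2', 's3', 's4']]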

Favorita Dataset

Favorita Raw


source

FavoritaRawData

 FavoritaRawData ()

Favorita Raw Data

Raw subset datasets from the Favorita 2018 Kaggle competition. This class contains utilities to download, load and filter portions of the dataset.

If you prefer, you can also download the original dataset directly from Kaggle:
pip install kaggle --upgrade
kaggle competitions download -c favorita-grocery-sales-forecasting


source

FavoritaRawData._load_raw_group_data

 FavoritaRawData._load_raw_group_data (directory, group, verbose=False)

Load raw group data.

Reads, filters and sorts the Favorita subset dataset.

Parameters:
directory: str, directory where data will be downloaded.
group: str, dataset group name, one of ‘Favorita200’, ‘Favorita500’, ‘FavoritaComplete’.
verbose: bool=False, whether or not to print partial outputs.

Returns:
filter_items: ordered list with unique items identifiers in the Favorita subset.
filter_stores: ordered list with unique store identifiers in the Favorita subset.
filter_dates: ordered list with dates in the Favorita subset.
raw_group_data: dictionary with the original raw Favorita pd.DataFrames: temporal, oil, items, store_info, holidays, transactions.

Favorita Raw Usage Example

from datasetsforecast.favorita import FavoritaRawData

verbose = True
group = 'Favorita200' # 'Favorita500', 'FavoritaComplete'
directory = './data/favorita' # directory = f's3://favorita'

filter_items, filter_stores, filter_dates, raw_group_data = \
    FavoritaRawData._load_raw_group_data(directory=directory, group=group, verbose=verbose)
n_items  = len(filter_items)
n_stores = len(filter_stores)
n_dates  = len(filter_dates)

print('\n')
print('n_stores: \t', n_stores)
print('n_items: \t', n_items)
print('n_dates: \t', n_dates)
print('n_items * n_dates: \t\t',n_items * n_dates)
print('n_items * n_stores: \t\t',n_items * n_stores)
print('n_items * n_dates * n_stores: \t', n_items * n_dates * n_stores)

FavoritaData


source

FavoritaData

 FavoritaData ()

Favorita Data

The processed Favorita grocery dataset contains daily item sales history with additional information on promotions, items, stores, and holidays. It comprises 371,312 series from January 2013 to August 2017, organized in a geographic hierarchy of states, cities, and stores. This wrangling matches that of the DPMN paper.


source

FavoritaData.load_preprocessed

 FavoritaData.load_preprocessed (directory:str, group:str,
                                 cache:bool=True, verbose:bool=False)

Load Favorita group datasets.

For the exploration of more complex models, we make the entire information available, including data at the bottom level of the items sold in Favorita stores, in addition to the aggregate/national-level information for the items.

Parameters:
directory: str, directory where data will be downloaded and saved.
group: str, dataset group name, one of ‘Favorita200’, ‘Favorita500’, ‘FavoritaComplete’.
cache: bool=True, if True saves and loads the processed datasets.
verbose: bool=False, whether or not to print partial outputs.

Returns:
static_agg: pd.DataFrame, with static variables of aggregate-level series.
static_bottom: pd.DataFrame, with static variables of bottom-level series.
temporal_agg: pd.DataFrame, with temporal variables of aggregate-level series.
temporal_bottom: pd.DataFrame, with temporal variables of bottom-level series.
S_df: pd.DataFrame, hierarchical constraints dataframe of size (base, bottom).
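
A short usage sketch, mirroring the Favorita Raw example above:

from datasetsforecast.favorita import FavoritaData

group = 'Favorita200' # 'Favorita500', 'FavoritaComplete'
directory = './data/favorita'
static_agg, static_bottom, temporal_agg, temporal_bottom, S_df = \
    FavoritaData.load_preprocessed(directory=directory, group=group, verbose=True)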


source

FavoritaData.load

 FavoritaData.load (directory:str, group:str, cache:bool=True,
                    verbose:bool=False)

Load Favorita forecasting benchmark dataset.

In contrast with other hierarchical datasets, this dataset contains a geographic hierarchy for each individual grocery item series, identified by the ‘item_id’ column. The geographic hierarchy is captured by the ‘hier_id’ column.

For this reason, minor wrangling is needed to adapt it for use with the HierarchicalForecast and StatsForecast libraries.

Parameters:
directory: str, directory where data will be downloaded and saved.
group: str, dataset group name, one of ‘Favorita200’, ‘Favorita500’, ‘FavoritaComplete’.
cache: bool=True, if True saves and loads the processed datasets.
verbose: bool=False, whether or not to print partial outputs.

Returns:
Y_df: pd.DataFrame, target base time series with columns [‘item_id’, ‘hier_id’, ‘ds’, ‘y’].
S_df: pd.DataFrame, hierarchical constraints dataframe of size (base, bottom).
tags: dict, with the series identifiers grouped by hierarchical level.


Favorita Usage Example

# Qualitative evaluation of hierarchical data
from datasetsforecast.favorita import FavoritaData
from hierarchicalforecast.utils import HierarchicalPlot

group = 'Favorita200' # 'Favorita500', 'FavoritaComplete'
directory = './data/favorita'
Y_df, S_df, tags = FavoritaData.load(directory=directory, group=group)

Y_item_df = Y_df[Y_df.item_id==1916577] # 112830, 1501570, 1916577
Y_item_df = Y_item_df.rename(columns={'hier_id': 'unique_id'})
Y_item_df = Y_item_df.set_index('unique_id')
del Y_item_df['item_id']

hplots = HierarchicalPlot(S=S_df, tags=tags)
hplots.plot_hierarchically_linked_series(
    Y_df=Y_item_df, bottom_series='store_[40]',
)