
FavoritaData

Favorita Data. The processed Favorita grocery dataset contains daily item-sales history with additional information on promotions, items, stores, and holidays. It comprises 371,312 series from January 2013 to August 2017, organized in a geographic hierarchy of states, cities, and stores. The wrangling matches that of the DPMN paper.

FavoritaData.load

load(directory, group, cache=True, verbose=False)
Load the Favorita forecasting benchmark dataset. In contrast with other hierarchical datasets, this dataset contains a geographic hierarchy for each individual grocery item series, identified by the 'item_id' column; the geographic hierarchy itself is captured by the 'hier_id' column. For this reason, minor wrangling is needed to adapt it for use with the HierarchicalForecast and StatsForecast libraries. Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| directory | str | Directory where data will be downloaded and saved. | required |
| group | str | Dataset group name: 'Favorita200', 'Favorita500', 'FavoritaComplete'. | required |
| cache | bool | If True, saves and loads the dataset from cache. | True |
| verbose | bool | Whether or not to print partial outputs. | False |
Returns a tuple containing:
- Y_df (pd.DataFrame): Target base time series with columns ['item_id', 'hier_id', 'ds', 'y'].
- S_df (pd.DataFrame): Hierarchical constraints dataframe of size (base, bottom).
- tags (dict): Dictionary with hierarchical level information.
Example:
# Qualitative evaluation of hierarchical data
from datasetsforecast.favorita import FavoritaData
from hierarchicalforecast.utils import HierarchicalPlot

group = 'Favorita200' # 'Favorita500', 'FavoritaComplete'
directory = './data/favorita'
Y_df, S_df, tags = FavoritaData.load(directory=directory, group=group)

Y_item_df = Y_df[Y_df.item_id==1916577] # 112830, 1501570, 1916577
Y_item_df = Y_item_df.rename(columns={'hier_id': 'unique_id'})
Y_item_df = Y_item_df.set_index('unique_id')
del Y_item_df['item_id']

hplots = HierarchicalPlot(S=S_df, tags=tags)
hplots.plot_hierarchically_linked_series(
    Y_df=Y_item_df, bottom_series='store_[40]',
)

FavoritaData.load_preprocessed

load_preprocessed(directory, group, cache=True, verbose=False)
Load Favorita group datasets. For the exploration of more complex models, the entire information is made available, including data at the bottom level of the items sold in Favorita stores, in addition to the aggregate/national-level information for the items. Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| directory | str | Directory where data will be downloaded and saved. | required |
| group | str | Dataset group name: 'Favorita200', 'Favorita500', 'FavoritaComplete'. | required |
| cache | bool | If True, saves and loads the dataset from cache. | True |
| verbose | bool | Whether or not to print partial outputs. | False |
Returns a Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame] containing:
- static_bottom (pd.DataFrame): Static variables of the bottom-level series.
- static_agg (pd.DataFrame): Static variables of the aggregate-level series.
- temporal_bottom (pd.DataFrame): Temporal variables of the bottom-level series.
- temporal_agg (pd.DataFrame): Temporal variables of the aggregate-level series.

Example

# Load static and temporal exogenous variables
from datasetsforecast.favorita import FavoritaData

group = 'Favorita200' # 'Favorita500', 'FavoritaComplete'
directory = './data/favorita'
static_bottom, static_agg, temporal_bottom, temporal_agg = \
    FavoritaData.load_preprocessed(directory=directory, group=group)

Auxiliary Functions

These auxiliary functions are used to efficiently create and wrangle Favorita's series.

Numpy Wrangling

numpy_balance

numpy_balance(*arrs)
Fast NumPy implementation of the 'balance' operation. Useful to create a balanced panel dataset, i.e., a dataset with all the interactions of 'unique_id' and 'ds'. Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| *arrs | | NumPy arrays. | () |

Returns:
- np.ndarray: NumPy array with balanced combinations.
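To illustrate what such a balance operation does, here is a hypothetical minimal sketch (the `balance` name and body below are illustrative, not the library's actual code): it builds the Cartesian product of the input arrays so that every 'unique_id' is paired with every 'ds'.

```python
import numpy as np

def balance(*arrs):
    """Illustrative sketch: one row per combination of the input 1-D arrays."""
    grids = np.meshgrid(*arrs, indexing="ij")  # one grid per input array
    return np.stack([g.ravel() for g in grids], axis=1)

ids = np.array([1, 2])          # e.g. two unique_id values
dates = np.array([10, 20, 30])  # e.g. three ds values
panel = balance(ids, dates)
# panel.shape == (6, 2): every (id, date) pair appears exactly once
```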

numpy_ffill

numpy_ffill(arr)
Fast NumPy implementation of ffill. Fills missing values in an array by propagating the last non-missing value forward. For example, if the array has the following values:
0  1  2    3
1  2  NaN  4
The ffill method would fill the missing values as follows:
0  1  2  3
1  2  2  4
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| arr | np.ndarray | NumPy array. | required |

Returns:
- np.ndarray: NumPy array with forward-filled values.
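The standard vectorized trick for a NumPy forward fill is to carry the index of the last valid entry with `np.maximum.accumulate`. A hypothetical self-contained sketch (not necessarily the library's exact code):

```python
import numpy as np

def ffill(arr):
    """Illustrative forward fill for a 1-D float array with NaNs."""
    mask = np.isnan(arr)
    # index of the most recent non-NaN position, carried forward
    idx = np.where(~mask, np.arange(len(arr)), 0)
    np.maximum.accumulate(idx, out=idx)
    return arr[idx]

filled = ffill(np.array([1.0, 2.0, np.nan, 4.0]))
# filled -> array([1., 2., 2., 4.])
```

Note that a leading NaN has no earlier value to propagate and therefore stays NaN.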

numpy_bfill

numpy_bfill(arr)
Fast NumPy implementation of bfill. Fills missing values in an array by propagating the next non-missing value backward. For example, if the array has the following values:
0  1  2    3
1  2  NaN  4
The bfill method would fill the missing values as follows:
0  1  2  3
1  2  4  4
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| arr | np.ndarray | NumPy array. | required |

Returns:
- np.ndarray: NumPy array with backward-filled values.
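A backward fill can be sketched as the mirror image of the forward-fill trick, carrying the index of the next valid entry from right to left with `np.minimum.accumulate` (again an illustration, not necessarily the library's exact code):

```python
import numpy as np

def bfill(arr):
    """Illustrative backward fill for a 1-D float array with NaNs."""
    mask = np.isnan(arr)
    n = len(arr)
    # index of the next non-NaN position, carried backward (right to left)
    idx = np.where(~mask, np.arange(n), n - 1)
    idx = np.minimum.accumulate(idx[::-1])[::-1]
    return arr[idx]

filled = bfill(np.array([1.0, 2.0, np.nan, 4.0]))
# filled -> array([1., 2., 4., 4.])
```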

one_hot_encoding

one_hot_encoding(df, index_col)
Encodes a DataFrame's categorical variables, skipping the index column. Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | pd.DataFrame | DataFrame with categorical columns. | required |
| index_col | str | The index column to avoid encoding. | required |

Returns:
- pd.DataFrame: DataFrame with one-hot encoded categorical columns.
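A minimal sketch of this kind of encoding with `pd.get_dummies` (the `one_hot` helper below is illustrative, not the library's implementation): select every categorical column except the index column and encode only those.

```python
import pandas as pd

def one_hot(df, index_col):
    """Illustrative sketch: one-hot encode every object column except index_col."""
    cat_cols = [c for c in df.columns
                if c != index_col and df[c].dtype == object]
    return pd.get_dummies(df, columns=cat_cols)

df = pd.DataFrame({"unique_id": ["a", "b"], "state": ["NY", "CA"]})
encoded = one_hot(df, index_col="unique_id")
# encoded columns: ['unique_id', 'state_CA', 'state_NY']
```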

nested_one_hot_encoding

nested_one_hot_encoding(df, index_col)
Encodes a DataFrame's hierarchically-nested categorical variables, skipping the index column. Nested categorical variables (e.g., geographic levels country>state) require the dummy features to preserve encoding order, reflecting the hierarchy of the categorical variables. Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | pd.DataFrame | DataFrame with hierarchically-nested categorical columns. | required |
| index_col | str | The index column to avoid encoding. | required |

Returns:
- pd.DataFrame: DataFrame with one-hot encoded hierarchically-nested categorical columns.
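One way such an order-preserving encoding could be arranged (a hypothetical sketch; the `nested_one_hot` helper below is illustrative and may differ from the library's implementation) is to encode each level's column separately, in hierarchy order, so coarser-level dummies always precede finer-level ones:

```python
import pandas as pd

def nested_one_hot(df, index_col):
    """Illustrative sketch: encode nested columns one by one, preserving the
    column order so coarse levels (e.g. country) precede fine ones (state)."""
    pieces = [df[[index_col]]]
    for col in df.columns:
        if col != index_col:
            pieces.append(pd.get_dummies(df[col], prefix=col))
    return pd.concat(pieces, axis=1)

df = pd.DataFrame({"unique_id": ["s1", "s2"],
                   "country": ["EC", "EC"],
                   "state": ["Pichincha", "Guayas"]})
encoded = nested_one_hot(df, index_col="unique_id")
# country dummies come before state dummies in encoded.columns
```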

get_levels_from_S_df

get_levels_from_S_df(S_df)
Get the hierarchical index levels implied by the aggregation constraints dataframe. Creates levels from the summation matrix of size (base, bottom): it goes through the rows until all the bottom-level series are 'covered' by the aggregation constraints, discovering blocks/hierarchy levels. Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| S_df | pd.DataFrame | Summing matrix of size (base, bottom); see the aggregate method. | required |

Returns:
- levels (list): Hierarchical aggregation indexes, where each entry is a level.
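The row-scanning idea can be sketched as follows (a hypothetical `get_levels` illustration of the described algorithm, not the library's exact code): accumulate rows of the summing matrix and close a level each time the accumulated rows cover every bottom series.

```python
import numpy as np
import pandas as pd

def get_levels(S_df):
    """Illustrative sketch: split summing-matrix rows into hierarchy levels."""
    levels, block = [], []
    coverage = np.zeros(S_df.shape[1])
    for idx, row in zip(S_df.index, S_df.values):
        block.append(idx)
        coverage += row
        if (coverage >= 1).all():  # every bottom series covered: close the level
            levels.append(block)
            block, coverage = [], np.zeros(S_df.shape[1])
    return levels

# Toy summing matrix: total, two groups, three bottom series
S_df = pd.DataFrame(
    [[1, 1, 1],
     [1, 1, 0],
     [0, 0, 1],
     [1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]],
    index=["total", "g1", "g2", "b1", "b2", "b3"],
    columns=["b1", "b2", "b3"],
)
# get_levels(S_df) -> [['total'], ['g1', 'g2'], ['b1', 'b2', 'b3']]
```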

distance_to_holiday

distance_to_holiday(holiday_dates, dates)

make_holidays_distance_df

make_holidays_distance_df(holidays_df, dates)

CodeTimer

CodeTimer(name=None, verbose=True)

Favorita200

Favorita200(freq='D', horizon=34, seasonality=7, test_size=34, tags_names=('Country', 'Country/State', 'Country/State/City', 'Country/State/City/Store'))

Favorita500

Favorita500(freq='D', horizon=34, seasonality=7, test_size=34, tags_names=('Country', 'Country/State', 'Country/State/City', 'Country/State/City/Store'))

FavoritaComplete

FavoritaRawData

Favorita Raw Data. Raw subset datasets from the Favorita 2018 Kaggle competition. This class contains utilities to download, load, and filter portions of the dataset. If you prefer, you can also download the original dataset from Kaggle directly:
pip install kaggle --upgrade
kaggle competitions download -c favorita-grocery-sales-forecasting

FavoritaRawData.download

download(directory)
Downloads the Favorita competition dataset. The dataset weighs 980MB, and its download is not currently robust to brief interruptions of the process, so executing it with a good connection is recommended. Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| directory | str | Directory where data will be downloaded. | required |
Examples:
from datasetsforecast.favorita import FavoritaRawData
verbose = True
group = 'Favorita200'  # 'Favorita500', 'FavoritaComplete'
directory = './data/favorita'  # directory = f's3://favorita'
filter_items, filter_stores, filter_dates, raw_group_data = FavoritaRawData._load_raw_group_data(directory=directory, group=group, verbose=verbose)
n_items = len(filter_items)
n_stores = len(filter_stores)
n_dates = len(filter_dates)
print('\n')
print('n_stores: \t', n_stores)
print('n_items: \t', n_items)
print('n_dates: \t', n_dates)
print('n_items * n_dates: \t\t', n_items * n_dates)
print('n_items * n_stores: \t\t', n_items * n_stores)
print('n_items * n_dates * n_stores: \t', n_items * n_dates * n_stores)