Favorita
Auxiliary Functions
These auxiliary functions are used to efficiently create and wrangle Favorita’s series.
Numpy Wrangling
numpy_balance
*Fast NumPy implementation of the ‘balance’ operation, which creates a balanced panel dataset, i.e. a dataset with every combination of ‘unique_id’ and ‘ds’.
Parameters:
arrs
: NumPy arrays.
Returns:
out
: NumPy array.*
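As an illustration of what balancing means, here is a minimal sketch that builds every combination of two identifier arrays. The name `balance_sketch` and the toy inputs are assumptions made for this example; it is not the library's implementation.

```python
import numpy as np

def balance_sketch(*arrs):
    """Return every combination of the 1-D input arrays, one combination per row."""
    # One N-dimensional grid per input array; 'ij' keeps the first array slowest-varying.
    grids = np.meshgrid(*arrs, indexing="ij")
    # Flatten each grid and place the flattened grids side by side as columns.
    return np.stack([g.ravel() for g in grids], axis=1)

unique_ids = np.array([10, 20])   # hypothetical series identifiers
ds = np.array([1, 2, 3])          # hypothetical time stamps
panel = balance_sketch(unique_ids, ds)
print(panel)
```

Each row of `panel` pairs one identifier with one time stamp, so a series with missing dates can be left-joined against it to obtain a balanced panel.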
numpy_ffill
*Fast NumPy implementation of ffill, which fills missing values in an array by propagating the last non-missing value forward.
For example, if the array has the following values:
0 1 2   3
1 2 NaN 4
The ffill method would fill the missing values as follows:
0 1 2 3
1 2 2 4
Parameters:
arr
: NumPy array.
Returns:
out
: NumPy array.*
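The forward fill described above can be sketched with a running maximum over column indices. This is a hedged illustration for 2-D float arrays filled along each row; `ffill_sketch` is a name chosen for this example, and the library function may differ in detail.

```python
import numpy as np

def ffill_sketch(arr):
    """Forward-fill NaNs along each row of a 2-D float array."""
    mask = np.isnan(arr)
    # Column index of every observed value; 0 where the value is missing.
    idx = np.where(~mask, np.arange(arr.shape[1]), 0)
    # The running maximum yields, per position, the index of the latest observed value.
    idx = np.maximum.accumulate(idx, axis=1)
    # Gather the values at those indices, row by row.
    return arr[np.arange(arr.shape[0])[:, None], idx]

x = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 2.0, np.nan, 4.0]])
filled = ffill_sketch(x)
print(filled)
```

In this sketch, a NaN at the start of a row stays NaN because there is no earlier value to propagate.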
numpy_bfill
*Fast NumPy implementation of bfill, which fills missing values in an array by propagating the next non-missing value backward.
For example, if the array has the following values:
0 1 2   3
1 2 NaN 4
The bfill method would fill the missing values as follows:
0 1 2 3
1 2 4 4
Parameters:
arr
: NumPy array.
Returns:
out
: NumPy array.*
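A backward fill can be sketched the same way as a forward fill, using a running minimum computed from right to left. The function name `bfill_sketch` is an assumption for this example, and the code is an illustration for 2-D float arrays rather than the library's exact implementation.

```python
import numpy as np

def bfill_sketch(arr):
    """Backward-fill NaNs along each row of a 2-D float array."""
    mask = np.isnan(arr)
    last = arr.shape[1] - 1
    # Column index of every observed value; the last column index where missing.
    idx = np.where(~mask, np.arange(arr.shape[1]), last)
    # A running minimum taken from the right gives the index of the next observed value.
    idx = np.minimum.accumulate(idx[:, ::-1], axis=1)[:, ::-1]
    # Gather the values at those indices, row by row.
    return arr[np.arange(arr.shape[0])[:, None], idx]

x = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 2.0, np.nan, 4.0]])
filled = bfill_sketch(x)
print(filled)
```

In this sketch, a NaN at the end of a row stays NaN because there is no later value to propagate.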
Pandas Wrangling
one_hot_encoding
*Encodes the categorical variables of DataFrame df, skipping index_col.
Parameters:
df
: pd.DataFrame with categorical columns.
index_col
: str, the index column to avoid encoding.
Returns:
one_hot_concat_df
: pd.DataFrame with one-hot encoded categorical columns.
*
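To make the idea concrete, the following sketch one-hot encodes every column except an index column using `pd.get_dummies`. The function name `one_hot_sketch`, the toy store table, and the resulting column names are assumptions for illustration; the library function may name its dummy columns differently.

```python
import pandas as pd

def one_hot_sketch(df, index_col):
    """One-hot encode all columns of df except index_col (illustrative only)."""
    cat_cols = [c for c in df.columns if c != index_col]
    # get_dummies creates one indicator column per (column, category) pair.
    dummies = pd.get_dummies(df[cat_cols], columns=cat_cols, dtype=float)
    # Keep the index column untouched as the first column.
    return pd.concat([df[[index_col]], dummies], axis=1)

# Hypothetical static store table.
stores = pd.DataFrame({
    "store_id": [1, 2, 3],
    "city": ["Quito", "Guayaquil", "Quito"],
    "type": ["A", "B", "A"],
})
encoded = one_hot_sketch(stores, index_col="store_id")
print(encoded)
```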
nested_one_hot_encoding
*Encodes the hierarchically nested categorical variables of DataFrame df, skipping index_col.
Nested categorical variables (for example, the geographic levels country > state) require the dummy features to preserve the encoding order, so that they reflect the hierarchy of the categorical variables.
Parameters:
df
: pd.DataFrame with hierarchically-nested
categorical columns.
index_col
: str, the index column to avoid
encoding.
Returns:
one_hot_concat_df
: pd.DataFrame with one hot encoded
hierarchically-nested categorical columns.
*
get_levels_from_S_df
*Get the hierarchical index levels implied by the aggregation constraints dataframe S_df.
Creates levels from the summing matrix of size (base, bottom). It goes through the rows until all the bottom-level series are ‘covered’ by the aggregation constraints, which reveals the blocks that make up each hierarchy level.
Parameters:
S_df
: pd.DataFrame with the summing matrix of size (base, bottom); see the aggregate method.
Returns:
levels
: list, with hierarchical aggregation indexes,
where each entry is a level.*
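The row-scanning idea can be sketched on a small summing matrix. This is a hedged illustration that assumes rows are ordered from the top level down and that every level covers each bottom series exactly once; `levels_from_S_sketch`, the matrix, and the row names are made up for this example.

```python
import numpy as np

def levels_from_S_sketch(S, row_names):
    """Group consecutive rows of S into levels that each cover all bottom series."""
    n_bottom = S.shape[1]
    levels, current, covered = [], [], 0
    for name, row in zip(row_names, S):
        current.append(name)
        covered += int(row.sum())     # bottom series covered so far in this level
        if covered == n_bottom:       # level complete: start a new block
            levels.append(current)
            current, covered = [], 0
    return levels

# Hypothetical hierarchy: total > 2 states > 4 stores.
S = np.array([
    [1, 1, 1, 1],   # total
    [1, 1, 0, 0],   # state A
    [0, 0, 1, 1],   # state B
    [1, 0, 0, 0],   # store 1
    [0, 1, 0, 0],   # store 2
    [0, 0, 1, 0],   # store 3
    [0, 0, 0, 1],   # store 4
])
names = ["total", "A", "B", "s1", "s2", "s3", "s4"]
levels = levels_from_S_sketch(S, names)
print(levels)
```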
Favorita Dataset
Favorita Raw
FavoritaRawData
*Favorita Raw Data
Raw subset datasets from the Favorita 2018 Kaggle competition. This class contains utilities to download, load, and filter portions of the dataset.
If you prefer, you can also download the original dataset directly from Kaggle:
pip install kaggle --upgrade
kaggle competitions download -c favorita-grocery-sales-forecasting
*
FavoritaRawData._load_raw_group_data
*Load raw group data.
Reads, filters and sorts Favorita subset dataset.
Parameters:
directory
: str, directory where data will be downloaded.
group
: str, dataset group name, one of ‘Favorita200’, ‘Favorita500’, ‘FavoritaComplete’.
verbose
: bool=False, whether or not to print partial outputs.
Returns:
filter_items
: ordered list with unique item identifiers in the Favorita subset.
filter_stores
: ordered list with unique store identifiers in the Favorita subset.
filter_dates
: ordered list with dates in the Favorita subset.
raw_group_data
: dictionary with the original raw Favorita pd.DataFrames: temporal, oil, items, store_info, holidays, transactions.
*
Favorita Raw Usage example
FavoritaData
FavoritaData
*Favorita Data
The processed Favorita grocery dataset contains the daily sales history of items, with additional information on promotions, items, stores, and holidays. It contains 371,312 series from January 2013 to August 2017, with a geographic hierarchy of states, cities, and stores. This wrangling matches that of the DPMN paper.*
FavoritaData.load_preprocessed
*Load Favorita group datasets.
To support the exploration of more complex models, we make all of the information available, including bottom-level data for the items sold in Favorita stores, in addition to the aggregate/national-level information for the items.
Parameters:
directory
: str, directory where data will be downloaded and saved.
group
: str, dataset group name, one of ‘Favorita200’, ‘Favorita500’, ‘FavoritaComplete’.
cache
: bool=False, if True, saves and loads.
verbose
: bool=False, whether or not to print partial outputs.
Returns:
static_bottom
: pd.DataFrame, with static variables of bottom-level series.
static_agg
: pd.DataFrame, with static variables of aggregate-level series.
temporal_bottom
: pd.DataFrame, with temporal variables of bottom-level series.
temporal_agg
: pd.DataFrame, with temporal variables of aggregate-level series.
*
FavoritaData.load
*Load Favorita forecasting benchmark dataset.
In contrast with other hierarchical datasets, this dataset contains a geographic hierarchy for each individual grocery item series, identified by the ‘item_id’ column. The geographic hierarchy is captured by the ‘hier_id’ column.
For this reason, minor wrangling is needed to adapt it for use with the HierarchicalForecast and StatsForecast libraries.
Parameters:
directory
: str, directory where data will be downloaded and saved.
group
: str, dataset group name, one of ‘Favorita200’, ‘Favorita500’, ‘FavoritaComplete’.
cache
: bool=False, if True, saves and loads.
verbose
: bool=False, whether or not to print partial outputs.
Returns:
Y_df
: pd.DataFrame, target base time series with columns [‘item_id’, ‘hier_id’, ‘ds’, ‘y’].
S_df
: pd.DataFrame, hierarchical constraints dataframe of size (base, bottom).
*
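One plausible form of the minor wrangling mentioned above is to select a single item and rename ‘hier_id’ to the ‘unique_id’ column that a long-format (unique_id, ds, y) table would use. The sketch below works on a hand-made stand-in for Y_df; the helper `select_item`, the item identifiers, and the hierarchy labels are all invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Y_df returned by FavoritaData.load.
Y_df = pd.DataFrame({
    "item_id": [101, 101, 101, 102],
    "hier_id": ["Ecuador", "Ecuador/Pichincha", "Ecuador", "Ecuador"],
    "ds": pd.to_datetime(["2017-08-01", "2017-08-01", "2017-08-02", "2017-08-01"]),
    "y": [5.0, 3.0, 6.0, 1.0],
})

def select_item(Y_df, item_id):
    """Keep one item's geographic hierarchy and expose it as unique_id/ds/y."""
    item_df = Y_df[Y_df["item_id"] == item_id]
    return (item_df.drop(columns="item_id")
                   .rename(columns={"hier_id": "unique_id"})
                   .reset_index(drop=True))

Y_item_df = select_item(Y_df, item_id=101)
print(Y_item_df)
```

Filtering by item first keeps one geographic hierarchy per table, which matches the single S_df of size (base, bottom) described above.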