Using Large Datasets
Tutorial on how to train NeuralForecast models on datasets that cannot fit into memory
The standard DataLoader class used by NeuralForecast expects the dataset to be represented by a single DataFrame, which is loaded into memory in its entirety when fitting the model. When the dataset is too large for this, we can instead use the custom large-scale DataLoader. This loader assumes that each timeseries is split across a collection of Parquet files, and it ensures that only one batch is ever loaded into memory at a given time.
In this notebook, we will demonstrate the expected format of these files, how to train the model, and how to perform inference using this large-scale DataLoader.
Load libraries
Data
Each timeseries should be stored in a directory named unique_id=timeseries_id. Within this directory, the timeseries can be entirely contained in a single Parquet file or split across multiple Parquet files. Regardless of the format, the timeseries must be ordered by time.
For example, the following code splits the AirPassengers DataFrame (in which each timeseries is already sorted by time) into the below format:
```
data
├── unique_id=Airline1
│   └── a59945617fdb40d1bc6caa4aadad881c-0.parquet
└── unique_id=Airline2
    └── a59945617fdb40d1bc6caa4aadad881c-0.parquet
```
We then simply input a list of the paths to these directories.
|     | unique_id | ds         | y     | trend | y_[lag12] |
|-----|-----------|------------|-------|-------|-----------|
| 0   | Airline1  | 1949-01-31 | 112.0 | 0     | 112.0     |
| 1   | Airline1  | 1949-02-28 | 118.0 | 1     | 118.0     |
| 2   | Airline1  | 1949-03-31 | 132.0 | 2     | 132.0     |
| 3   | Airline1  | 1949-04-30 | 129.0 | 3     | 129.0     |
| 4   | Airline1  | 1949-05-31 | 121.0 | 4     | 121.0     |
| …   | …         | …          | …     | …     | …         |
| 283 | Airline2  | 1960-08-31 | 906.0 | 283   | 859.0     |
| 284 | Airline2  | 1960-09-30 | 808.0 | 284   | 763.0     |
| 285 | Airline2  | 1960-10-31 | 761.0 | 285   | 707.0     |
| 286 | Airline2  | 1960-11-30 | 690.0 | 286   | 662.0     |
| 287 | Airline2  | 1960-12-31 | 732.0 | 287   | 705.0     |
You can also create this directory structure from a Spark DataFrame using the following:
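A sketch with PySpark, assuming a Spark DataFrame `spark_df` with the same columns as above; sortWithinPartitions keeps each series ordered by time before writing:

```python
# `spark_df` is assumed to hold the same columns as the pandas DataFrame above.
(
    spark_df
    .repartition('unique_id')
    .sortWithinPartitions('unique_id', 'ds')  # the loader requires time order
    .write
    .mode('overwrite')
    .partitionBy('unique_id')
    .parquet('data')
)
```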
The DataLoader class still expects the static data to be passed in as a single DataFrame with one row per timeseries.
|   | id_col   | airline1 | airline2 |
|---|----------|----------|----------|
| 0 | Airline1 | 0        | 1        |
| 1 | Airline2 | 1        | 0        |
Model training
We now train an NHITS model on the above dataset. Note that NeuralForecast currently does not support scaling when using this DataLoader: if you want to scale the timeseries, this should be done before passing it in to the fit method.
Forecasting
When working with large datasets, we need to provide a single DataFrame containing the input timesteps of all the timeseries for which we wish to generate predictions. If we have future exogenous features, we should also include their future values in the separate futr_df DataFrame.
For the prediction below, we assume we only want to predict the next 12 timesteps for Airline2.
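A sketch of that call, assuming the fitted `nf` object from the training step, that trend and y_[lag12] were declared as future exogenous features, and a hypothetical hold-out where 1960 is forecast from the preceding observations:

```python
from neuralforecast.utils import AirPassengersPanel, AirPassengersStatic

# Hypothetical split: observations before 1960 form the input timesteps,
# the 12 months of 1960 are the horizon to forecast.
airline2 = AirPassengersPanel[AirPassengersPanel['unique_id'] == 'Airline2']
df_input = airline2[airline2['ds'] < '1960-01-01']
futr_df = airline2[airline2['ds'] >= '1960-01-01'].drop(columns=['y'])

preds = nf.predict(df=df_input, futr_df=futr_df, static_df=AirPassengersStatic)
```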
|    | id_col   | ds         | NHITS      |
|----|----------|------------|------------|
| 0  | Airline2 | 1960-01-31 | 710.602417 |
| 1  | Airline2 | 1960-02-29 | 688.900879 |
| 2  | Airline2 | 1960-03-31 | 758.637573 |
| 3  | Airline2 | 1960-04-30 | 748.974365 |
| 4  | Airline2 | 1960-05-31 | 753.558655 |
| 5  | Airline2 | 1960-06-30 | 801.517822 |
| 6  | Airline2 | 1960-07-31 | 863.835449 |
| 7  | Airline2 | 1960-08-31 | 847.854980 |
| 8  | Airline2 | 1960-09-30 | 797.115845 |
| 9  | Airline2 | 1960-10-31 | 748.879761 |
| 10 | Airline2 | 1960-11-30 | 707.076233 |
| 11 | Airline2 | 1960-12-31 | 747.851685 |
Evaluation
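A sketch of how the metrics below can be computed with the evaluate helper from utilsforecast (an assumption of this sketch; any evaluation routine works), joining the predictions with the held-out actuals for 1960:

```python
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mae, rmse, smape

# `preds` comes from the Forecasting step; `airline2` holds the actuals.
actuals = airline2[airline2['ds'] >= '1960-01-01'][['unique_id', 'ds', 'y']]
res = preds.merge(actuals, on=['unique_id', 'ds'])
print(evaluate(res, metrics=[mae, rmse, smape], agg_fn='mean'))
```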
|   | metric | NHITS     |
|---|--------|-----------|
| 0 | mae    | 23.693777 |
| 1 | rmse   | 29.992256 |
| 2 | smape  | 0.014734  |