Preprocessing - Nixtla

id_time_grid

 id_time_grid (df:~DFType, freq:Union[str,int],
               start:Union[str,int,datetime.date,datetime.datetime]='per_serie',
               end:Union[str,int,datetime.date,datetime.datetime]='global',
               id_col:str='unique_id', time_col:str='ds')

Generate all expected combinations of ids and times.

	Type	Default	Details
df	DFType		Input data
freq	Union		Series' frequency
start	Union	per_serie	Initial timestamp for the series. 'per_serie' uses each series' first timestamp; 'global' uses the first timestamp seen in the data; a specific timestamp or integer is also accepted, e.g. '2000-01-01', 2000 or datetime(2000, 1, 1)
end	Union	global	Final timestamp for the series. 'per_serie' uses each series' last timestamp; 'global' uses the last timestamp seen in the data; a specific timestamp or integer is also accepted, e.g. '2000-01-01', 2000 or datetime(2000, 1, 1)
id_col	str	unique_id	Column that identifies each series.
time_col	str	ds	Column that identifies each timestamp.
Returns	DFType		Dataframe with expected ids and times.
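
No example accompanies id_time_grid on this page, so here is a minimal pure-Python sketch of the idea for integer timestamps. It is not the library's implementation: `sketch_id_time_grid` and its input layout (a list of `(unique_id, ds)` pairs) are hypothetical names introduced for illustration, and only the documented meaning of `start` and `end` is reproduced.

```python
def sketch_id_time_grid(rows, freq, start="per_serie", end="global"):
    """Return every expected (id, time) pair for integer timestamps.

    rows is an iterable of (unique_id, ds) pairs. Illustrative only,
    this is not utilsforecast's code.
    """
    # Group the observed timestamps by series id.
    times_by_id = {}
    for uid, ds in rows:
        times_by_id.setdefault(uid, []).append(ds)
    all_times = [ds for times in times_by_id.values() for ds in times]

    def resolve(bound, per_serie_value, global_value):
        if bound == "per_serie":
            return per_serie_value
        if bound == "global":
            return global_value
        return bound  # an explicit integer timestamp

    grid = []
    for uid in sorted(times_by_id):
        times = times_by_id[uid]
        first = resolve(start, min(times), min(all_times))
        last = resolve(end, max(times), max(all_times))
        # Both bounds are inclusive.
        grid.extend((uid, t) for t in range(first, last + 1, freq))
    return grid


rows = [(0, 2020), (0, 2021), (0, 2023), (1, 2021), (1, 2022)]
grid = sketch_id_time_grid(rows, freq=1)
print(grid)
```

With the defaults, series 0 spans 2020 to 2023 and series 1 spans 2021 to 2023: each series keeps its own start and shares the global end.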

fill_gaps

 fill_gaps (df:~DFType, freq:Union[str,int],
            start:Union[str,int,datetime.date,datetime.datetime]='per_serie',
            end:Union[str,int,datetime.date,datetime.datetime]='global',
            id_col:str='unique_id', time_col:str='ds')

Enforce start and end datetimes for dataframe.

	Type	Default	Details
df	DFType		Input data
freq	Union		Series' frequency
start	Union	per_serie	Initial timestamp for the series. 'per_serie' uses each series' first timestamp; 'global' uses the first timestamp seen in the data; a specific timestamp or integer is also accepted, e.g. '2000-01-01', 2000 or datetime(2000, 1, 1)
end	Union	global	Final timestamp for the series. 'per_serie' uses each series' last timestamp; 'global' uses the last timestamp seen in the data; a specific timestamp or integer is also accepted, e.g. '2000-01-01', 2000 or datetime(2000, 1, 1)
id_col	str	unique_id	Column that identifies each series.
time_col	str	ds	Column that identifies each timestamp.
Returns	DFType		Dataframe with gaps filled.

import numpy as np
import pandas as pd

from utilsforecast.preprocessing import fill_gaps

df = pd.DataFrame(
    {
        'unique_id': [0, 0, 0, 1, 1],
        'ds': pd.to_datetime(['2020', '2021', '2023', '2021', '2022']),
        'y': np.arange(5),
    }
)
df

	unique_id	ds	y
0	0	2020-01-01	0
1	0	2021-01-01	1
2	0	2023-01-01	2
3	1	2021-01-01	3
4	1	2022-01-01	4

By default, each series keeps its current start and only the end date is extended so that it is the same for all series.

fill_gaps(
    df,
    freq='YS',
)

	unique_id	ds	y
0	0	2020-01-01	0.0
1	0	2021-01-01	1.0
2	0	2022-01-01	NaN
3	0	2023-01-01	2.0
4	1	2021-01-01	3.0
5	1	2022-01-01	4.0
6	1	2023-01-01	NaN

We can also specify end='per_serie' to fill only the gaps inside each series, without extending any of them.

fill_gaps(
    df,
    freq='YS',
    end='per_serie',
)

	unique_id	ds	y
0	0	2020-01-01	0.0
1	0	2021-01-01	1.0
2	0	2022-01-01	NaN
3	0	2023-01-01	2.0
4	1	2021-01-01	3.0
5	1	2022-01-01	4.0
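
To check which timestamps are absent inside each series before filling them, a small pure-Python sketch along the following lines may help. It assumes yearly data on January 1st, as in the example above, and `missing_year_starts` is a hypothetical helper, not part of the library.

```python
from datetime import date


def missing_year_starts(rows):
    """Map each id to the year-start dates absent between its first and last date."""
    # Collect the observed dates of every series.
    dates_by_id = {}
    for uid, ds in rows:
        dates_by_id.setdefault(uid, set()).add(ds)

    missing = {}
    for uid, seen in dates_by_id.items():
        first, last = min(seen), max(seen)
        # Every January 1st between the series' own first and last date.
        expected = [date(year, 1, 1) for year in range(first.year, last.year + 1)]
        missing[uid] = [d for d in expected if d not in seen]
    return missing


rows = [
    (0, date(2020, 1, 1)), (0, date(2021, 1, 1)), (0, date(2023, 1, 1)),
    (1, date(2021, 1, 1)), (1, date(2022, 1, 1)),
]
gaps = missing_year_starts(rows)
print(gaps)
```

Series 0 lacks only 2022-01-01 and series 1 has no internal gaps, which matches the single NaN row in the output above.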

We can also specify an end date in the future.

fill_gaps(
    df,
    freq='YS',
    end='2024',
)

	unique_id	ds	y
0	0	2020-01-01	0.0
1	0	2021-01-01	1.0
2	0	2022-01-01	NaN
3	0	2023-01-01	2.0
4	0	2024-01-01	NaN
5	1	2021-01-01	3.0
6	1	2022-01-01	4.0
7	1	2023-01-01	NaN
8	1	2024-01-01	NaN

We can set all series to start at the same time.

fill_gaps(
    df,
    freq='YS',
    start='global'
)

	unique_id	ds	y
0	0	2020-01-01	0.0
1	0	2021-01-01	1.0
2	0	2022-01-01	NaN
3	0	2023-01-01	2.0
4	1	2020-01-01	NaN
5	1	2021-01-01	3.0
6	1	2022-01-01	4.0
7	1	2023-01-01	NaN

We can also set a common start date for all series (which can be earlier than their current starts).

fill_gaps(
    df,
    freq='YS',
    start='2019',
)

	unique_id	ds	y
0	0	2019-01-01	NaN
1	0	2020-01-01	0.0
2	0	2021-01-01	1.0
3	0	2022-01-01	NaN
4	0	2023-01-01	2.0
5	1	2019-01-01	NaN
6	1	2020-01-01	NaN
7	1	2021-01-01	3.0
8	1	2022-01-01	4.0
9	1	2023-01-01	NaN

If the times are integers, the frequency, start and end must also be integers.

df = pd.DataFrame(
    {
        'unique_id': [0, 0, 0, 1, 1],
        'ds': [2020, 2021, 2023, 2021, 2022],
        'y': np.arange(5),
    }
)
df

	unique_id	ds	y
0	0	2020	0
1	0	2021	1
2	0	2023	2
3	1	2021	3
4	1	2022	4

fill_gaps(
    df,
    freq=1,
    start=2019,
    end=2024,
)

	unique_id	ds	y
0	0	2019	NaN
1	0	2020	0.0
2	0	2021	1.0
3	0	2022	NaN
4	0	2023	2.0
5	0	2024	NaN
6	1	2019	NaN
7	1	2020	NaN
8	1	2021	3.0
9	1	2022	4.0
10	1	2023	NaN
11	1	2024	NaN
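
The integer case makes the mechanics easy to see: build every timestamp from start to end in steps of freq for each id, then look up the observed value and leave the rest empty. The sketch below illustrates this with plain Python under those assumptions. `sketch_fill_gaps` is a hypothetical name, it only handles explicit integer bounds, and it uses `None` where pandas shows NaN.

```python
def sketch_fill_gaps(rows, freq, start, end):
    """Fill missing (id, time) pairs with None for explicit integer bounds.

    rows is an iterable of (unique_id, ds, y) triples. Illustrative only,
    this is not utilsforecast's code.
    """
    # Index the observed values by (id, time) for constant-time lookup.
    observed = {(uid, ds): y for uid, ds, y in rows}
    ids = sorted({uid for uid, _, _ in rows})

    filled = []
    for uid in ids:
        for t in range(start, end + 1, freq):  # end is inclusive
            filled.append((uid, t, observed.get((uid, t))))
    return filled


rows = [(0, 2020, 0), (0, 2021, 1), (0, 2023, 2), (1, 2021, 3), (1, 2022, 4)]
filled = sketch_fill_gaps(rows, freq=1, start=2019, end=2024)
for row in filled:
    print(row)
```

The result has the same twelve rows as the table above, six per series, with the five observed values kept in place.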

The function also accepts polars dataframes.

from datetime import date, datetime

import polars as pl

df = pl.DataFrame(
    {
        'unique_id': [0, 0, 0, 1, 1],
        'ds': [
            datetime(2020, 1, 1), datetime(2022, 1, 1), datetime(2023, 1, 1),
            datetime(2021, 1, 1), datetime(2022, 1, 1)],
        'y': np.arange(5),
    }
)
df

unique_id	ds	y
i64	datetime[μs]	i64
0	2020-01-01 00:00:00	0
0	2022-01-01 00:00:00	1
0	2023-01-01 00:00:00	2
1	2021-01-01 00:00:00	3
1	2022-01-01 00:00:00	4

polars_ms = fill_gaps(
    df.with_columns(pl.col('ds').cast(pl.Datetime(time_unit='ms'))),
    freq='1y',
    start=datetime(2019, 1, 1),
    end=datetime(2024, 1, 1),
)
assert polars_ms.schema['ds'].time_unit == 'ms'
polars_ms

unique_id	ds	y
i64	datetime[ms]	i64
0	2019-01-01 00:00:00	null
0	2020-01-01 00:00:00	0
0	2021-01-01 00:00:00	null
0	2022-01-01 00:00:00	1
0	2023-01-01 00:00:00	2
…	…	…
1	2020-01-01 00:00:00	null
1	2021-01-01 00:00:00	3
1	2022-01-01 00:00:00	4
1	2023-01-01 00:00:00	null
1	2024-01-01 00:00:00	null

df = pl.DataFrame(
    {
        'unique_id': [0, 0, 0, 1, 1],
        'ds': [
            date(2020, 1, 1), date(2022, 1, 1), date(2023, 1, 1),
            date(2021, 1, 1), date(2022, 1, 1)],
        'y': np.arange(5),
    }
)
df

unique_id	ds	y
i64	date	i64
0	2020-01-01	0
0	2022-01-01	1
0	2023-01-01	2
1	2021-01-01	3
1	2022-01-01	4

fill_gaps(
    df,
    freq='1y',
    start=date(2020, 1, 1),
    end=date(2024, 1, 1),
)

unique_id	ds	y
i64	date	i64
0	2020-01-01	0
0	2021-01-01	null
0	2022-01-01	1
0	2023-01-01	2
0	2024-01-01	null
1	2020-01-01	null
1	2021-01-01	3
1	2022-01-01	4
1	2023-01-01	null
1	2024-01-01	null

df = pl.DataFrame(
    {
        'unique_id': [0, 0, 0, 1, 1],
        'ds': [2020, 2021, 2023, 2021, 2022],
        'y': np.arange(5),
    }
)
df

unique_id	ds	y
i64	i64	i64
0	2020	0
0	2021	1
0	2023	2
1	2021	3
1	2022	4

fill_gaps(
    df,
    freq=1,
    start=2019,
    end=2024,
)

unique_id	ds	y
i64	i64	i64
0	2019	null
0	2020	0
0	2021	1
0	2022	null
0	2023	2
…	…	…
1	2020	null
1	2021	3
1	2022	4
1	2023	null
1	2024	null
