Spark
Run TimeGPT in a distributed manner on top of Spark
Spark is an open-source distributed computing framework designed for large-scale data processing. In this guide, we will explain how to use TimeGPT on top of Spark.
1. Installation
Install Spark through Fugue. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of several distributed computing frameworks, including Spark.
Note
You can install fugue with pip.
If executing on a distributed Spark cluster, ensure that the nixtla library is installed across all the workers.
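As a quick sanity check before continuing, the sketch below reports which of the packages used in this guide cannot be imported in the current environment. The helper name `missing_packages` and the `REQUIRED` list are illustrative and not part of any library, and the pip command in the comment is an assumption based on fugue's `spark` extra.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


# Packages this guide relies on (illustrative list).
# Assumed install command: pip install "fugue[spark]" nixtla
REQUIRED = ["fugue", "pyspark", "nixtla"]

if __name__ == "__main__":
    print("Missing packages:", missing_packages(REQUIRED))
```

On a cluster, the same check would need to run on each worker, since a package installed only on the driver is not enough.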
2. Load Data
You can load your data as a pandas DataFrame. In this tutorial, we will use a dataset that contains hourly electricity prices from different markets.
| | unique_id | ds | y |
|---|---|---|---|
| 0 | BE | 2016-10-22 00:00:00 | 70.00 |
| 1 | BE | 2016-10-22 01:00:00 | 37.10 |
| 2 | BE | 2016-10-22 02:00:00 | 37.10 |
| 3 | BE | 2016-10-22 03:00:00 | 44.75 |
| 4 | BE | 2016-10-22 04:00:00 | 37.10 |
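To illustrate the expected input shape, here is a minimal sketch that rebuilds the five rows shown above by hand, assuming pandas is installed. In practice you would more likely read the full dataset from a file, for example with `pd.read_csv` and a path of your own.

```python
import pandas as pd

# The first rows of the electricity price table, typed in by hand.
rows = [
    ("BE", "2016-10-22 00:00:00", 70.00),
    ("BE", "2016-10-22 01:00:00", 37.10),
    ("BE", "2016-10-22 02:00:00", 37.10),
    ("BE", "2016-10-22 03:00:00", 44.75),
    ("BE", "2016-10-22 04:00:00", 37.10),
]

# unique_id identifies the series, ds is the timestamp, y is the target value.
df = pd.DataFrame(rows, columns=["unique_id", "ds", "y"])
print(df.head())
```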
3. Initialize Spark
Initialize Spark and convert the pandas DataFrame to a Spark DataFrame.
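One way this step might look, assuming pyspark is installed and a local or cluster Spark runtime is reachable. The helper name `to_spark` and the application name are illustrative choices, not something the library prescribes.

```python
def to_spark(pandas_df, app_name="timegpt-on-spark"):
    """Start (or reuse) a Spark session and convert a pandas DataFrame."""
    # Imported lazily so this module can be loaded on machines without pyspark.
    from pyspark.sql import SparkSession

    # getOrCreate() returns the active session if one already exists.
    spark = SparkSession.builder.appName(app_name).getOrCreate()
    spark_df = spark.createDataFrame(pandas_df)
    return spark, spark_df
```

A call such as `spark, spark_df = to_spark(df)` followed by `spark_df.show(5)` would then display the first rows of the converted data.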
4. Use TimeGPT on Spark
Using TimeGPT on top of Spark is almost identical to the non-distributed case. The only difference is that you need to use a Spark DataFrame.
First, instantiate the NixtlaClient class.
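A minimal sketch of the instantiation, assuming the `nixtla` package is installed and that the API key is either passed in or stored in the `NIXTLA_API_KEY` environment variable. The wrapper `make_client` is a hypothetical helper added here only to fail early when no key is available.

```python
import os


def make_client(api_key=None):
    """Build a NixtlaClient from an explicit key or the NIXTLA_API_KEY variable."""
    key = api_key or os.environ.get("NIXTLA_API_KEY")
    if not key:
        raise ValueError("No API key given and NIXTLA_API_KEY is not set.")

    # Imported lazily so the missing-key check works without nixtla installed.
    from nixtla import NixtlaClient

    return NixtlaClient(api_key=key)
```

With a key in place, `nixtla_client = make_client()` gives an object that can be used in the steps that follow.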
👍 Use an Azure AI endpoint
To use an Azure AI endpoint, set the base_url argument:
nixtla_client = NixtlaClient(base_url="your azure ai endpoint", api_key="your api_key")
Then use any method from the NixtlaClient class, such as forecast or cross_validation.
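The calls might look like the sketch below, which takes an already constructed client and a Spark DataFrame. The function name `forecast_and_validate`, the horizon of 14, and the two validation windows are illustrative assumptions; any extra keyword arguments (for example `freq`) are passed through unchanged.

```python
def forecast_and_validate(client, spark_df, horizon=14, n_windows=2, **kwargs):
    """Run a forecast and a cross-validation on the same Spark DataFrame."""
    # Forecast the next `horizon` steps for every series in the DataFrame.
    fcst_df = client.forecast(df=spark_df, h=horizon, **kwargs)

    # Evaluate on `n_windows` historical windows of length `horizon`.
    cv_df = client.cross_validation(
        df=spark_df, h=horizon, n_windows=n_windows, **kwargs
    )
    return fcst_df, cv_df
```

For example, `fcst_df, cv_df = forecast_and_validate(nixtla_client, spark_df)` followed by `fcst_df.show(5)` would print the first forecast rows.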
📘 Available models in Azure AI
If you are using an Azure AI endpoint, please be sure to set model="azureai":
nixtla_client.forecast(..., model="azureai")
For the public API, we support two models: timegpt-1 and timegpt-1-long-horizon. By default, timegpt-1 is used. Please see this tutorial on how and when to use timegpt-1-long-horizon.
You can also use exogenous variables with TimeGPT on top of Spark. To do this, please refer to the Exogenous Variables tutorial. Just keep in mind that you need to use a Spark DataFrame instead of a pandas DataFrame.
5. Stop Spark
When you are done, stop the Spark session.
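Stopping the session comes down to calling `stop()` on it. The sketch below wraps that call in a small context manager so the session is stopped even if an earlier step raises; the helper name `stopping` is hypothetical, and it only assumes the object passed in has a `stop()` method, as a Spark session does.

```python
from contextlib import contextmanager


@contextmanager
def stopping(spark):
    """Yield the session and always call spark.stop() on exit."""
    try:
        yield spark
    finally:
        # Runs on normal exit and when the body raises.
        spark.stop()
```

Used as `with stopping(spark) as session: ...`, the cluster resources are released as soon as the block ends.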