Spark is an open-source distributed computing framework designed for large-scale data processing. This guide explains how to use TimeGPT on top of Spark.

Outline:
  1. Installation
  2. Load Your Data
  3. Initialize Spark
  4. Use TimeGPT on Spark
  5. Stop Spark

1. Installation

Install Spark through Fugue. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of several distributed computing frameworks, including Spark.
Note: you can install Fugue with pip:
pip install "fugue[spark]"
If you are running on a distributed Spark cluster, make sure the nixtla library is installed on every worker.
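One way to check whether a package is importable in a given Python environment is sketched below. The helper name has_module is hypothetical and not part of Fugue, Spark, or nixtla. The commented-out cluster check is only an assumption about how you might run it on executors, and Spark does not guarantee that the tasks land on every worker.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


# Local check in the current (driver) environment.
print("nixtla importable here:", has_module("nixtla"))

# Hypothetical cluster check, assuming a SparkSession named `spark` exists.
# Each task reports whether nixtla is importable where it happened to run:
# results = (
#     spark.sparkContext
#     .parallelize(range(8), 8)
#     .map(lambda _: has_module("nixtla"))
#     .collect()
# )
```

If any entry in results were False, at least one executor environment would be missing the library.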

2. Load Your Data

You can load your data as a pandas DataFrame. In this tutorial, we will use a dataset that contains hourly electricity prices from different markets.
import pandas as pd
df = pd.read_csv(
    'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv',
    parse_dates=['ds'],
) 
df.head()
  unique_id                   ds      y
0        BE  2016-10-22 00:00:00  70.00
1        BE  2016-10-22 01:00:00  37.10
2        BE  2016-10-22 02:00:00  37.10
3        BE  2016-10-22 03:00:00  44.75
4        BE  2016-10-22 04:00:00  37.10
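Before handing the data to Spark, it can help to confirm that the expected columns are present. The sketch below assumes the three column names shown in the df.head() output (unique_id, ds, y). The helper missing_columns is hypothetical and works on any iterable of column names.

```python
REQUIRED_COLUMNS = ("unique_id", "ds", "y")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Example with plain lists of column names.
print(missing_columns(["unique_id", "ds", "y"]))
print(missing_columns(["unique_id", "timestamp", "y"]))
```

With the objects from this guide, missing_columns(df.columns) or missing_columns(spark_df.columns) would apply the same check to the pandas or Spark DataFrame.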

3. Initialize Spark

Initialize Spark and convert the pandas DataFrame to a Spark DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(df)
spark_df.show(5)

4. Use TimeGPT on Spark

Using TimeGPT on top of Spark is almost identical to the non-distributed case. The only difference is that you need to use a Spark DataFrame. First, instantiate the NixtlaClient class.
from nixtla import NixtlaClient
nixtla_client = NixtlaClient(
    # defaults to os.environ.get("NIXTLA_API_KEY")
    api_key = 'my_api_key_provided_by_nixtla'
)
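As the comment in the snippet above notes, the key can come from the NIXTLA_API_KEY environment variable instead of being hard-coded. The following is a minimal sketch of reading it explicitly so that a missing key fails early with a clear message. The helper get_api_key is hypothetical and not part of the nixtla library.

```python
import os


def get_api_key(env_var: str = "NIXTLA_API_KEY") -> str:
    """Read the API key from the environment, failing early if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage with the client from this guide (left commented out here because it
# needs a real key):
# nixtla_client = NixtlaClient(api_key=get_api_key())
```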
Tip: to use an Azure AI endpoint, set the base_url argument:
nixtla_client = NixtlaClient(base_url="your azure ai endpoint", api_key="your api_key")
Then use any method from the NixtlaClient class such as forecast or cross_validation.
fcst_df = nixtla_client.forecast(spark_df, h=12)
fcst_df.show(5)
Note: if you are using an Azure AI endpoint, be sure to set model="azureai":
nixtla_client.forecast(..., model="azureai")
For the public API, two models are supported: timegpt-1 and timegpt-1-long-horizon. By default, timegpt-1 is used. See the timegpt-1-long-horizon tutorial for how and when to use that model.
cv_df = nixtla_client.cross_validation(spark_df, h=12, n_windows=5, step_size=2)
cv_df.show(5)
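To see how h, n_windows, and step_size interact, the arithmetic can be sketched for a single series. This assumes the common rolling-origin convention in which the last window ends at the final observation and each earlier window starts step_size points sooner. The guide itself does not state this, so treat it as an assumption. The helper cv_cutoffs and the series length of 100 are hypothetical.

```python
from typing import List


def cv_cutoffs(n_obs: int, h: int, n_windows: int, step_size: int) -> List[int]:
    """Return, per window, how many leading observations are used for training."""
    first = n_obs - h - step_size * (n_windows - 1)
    if first <= 0:
        raise ValueError("series too short for this h / n_windows / step_size")
    return [first + i * step_size for i in range(n_windows)]


# Same parameters as the cross_validation call above, on a 100-point series.
h = 12
for i, cutoff in enumerate(cv_cutoffs(n_obs=100, h=h, n_windows=5, step_size=2)):
    print(f"window {i}: train on first {cutoff} points, "
          f"evaluate on positions {cutoff}..{cutoff + h - 1}")
```

Under that assumption, a step_size smaller than h means consecutive evaluation windows overlap.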
You can also use exogenous variables with TimeGPT on top of Spark. To do this, refer to the Exogenous Variables tutorial, keeping in mind that you need to pass a Spark DataFrame instead of a pandas DataFrame.

5. Stop Spark

When you are done, stop the Spark session.
spark.stop()
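If an earlier step raises an exception, spark.stop() may never be reached. One possible pattern, sketched here under the assumption that the session object only needs its stop() method called, is a small context manager. The name stopping is hypothetical and not part of PySpark.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield `session` and call its stop() method on exit, even after an error."""
    try:
        yield session
    finally:
        session.stop()


# Usage with the objects from this guide (commented out, needs pyspark):
# with stopping(SparkSession.builder.getOrCreate()) as spark:
#     spark_df = spark.createDataFrame(df)
#     fcst_df = nixtla_client.forecast(spark_df, h=12)
#     fcst_df.show(5)
```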