1. MLP
MLP
Module
Multi-Layer Perceptron for time series forecasting.
A feedforward neural network with configurable depth and width. The network consists of an input layer, multiple hidden layers with activation functions and dropout, and an output layer. All hidden layers have the same dimensionality.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| in_features | int | Dimension of input features. | required |
| out_features | int | Dimension of output features. | required |
| activation | str | Activation function name. Must be one of the supported activations in the ACTIVATIONS list (e.g., 'ReLU', 'Tanh', 'GELU', 'ELU'). | required |
| hidden_size | int | Number of units in each hidden layer. All hidden layers share the same dimensionality. | required |
| num_layers | int | Total number of layers, including the input and output layers. Must be at least 2. For example, num_layers=3 creates an input layer, one hidden layer, and an output layer. | required |
| dropout | float | Dropout probability applied after each hidden layer's activation. Should be in the range [0.0, 1.0]. Not applied to the output layer. | required |
Returns:
| Type | Description |
|---|---|
| Tensor | Transformed output tensor of shape [..., out_features]. |
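A minimal sketch of how such a network might be assembled from the parameters above (the class name and body are illustrative, not the library's exact implementation):

```python
import torch
import torch.nn as nn

class MLPSketch(nn.Module):
    """Illustrative MLP: input layer, (num_layers - 2) hidden layers, output layer."""

    def __init__(self, in_features, out_features, activation, hidden_size, num_layers, dropout):
        super().__init__()
        assert num_layers >= 2, "num_layers counts the input and output layers"
        act = getattr(nn, activation)  # e.g. 'ReLU', 'Tanh', 'GELU', 'ELU'
        layers = [nn.Linear(in_features, hidden_size), act(), nn.Dropout(dropout)]
        for _ in range(num_layers - 2):
            layers += [nn.Linear(hidden_size, hidden_size), act(), nn.Dropout(dropout)]
        layers += [nn.Linear(hidden_size, out_features)]  # no activation/dropout on output
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # [..., in_features] -> [..., out_features]

mlp = MLPSketch(24, 12, "ReLU", hidden_size=64, num_layers=3, dropout=0.1)
print(mlp(torch.randn(8, 24)).shape)  # torch.Size([8, 12])
```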
2. Temporal Convolutions
For a long time in deep learning, sequence modelling was synonymous with recurrent networks, yet several papers have shown that simple convolutional architectures can outperform canonical recurrent networks such as LSTMs while exhibiting longer effective memory.

References
- van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. Computing Research Repository, abs/1609.03499. URL: http://arxiv.org/abs/1609.03499.
- Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. Computing Research Repository, abs/1803.01271. URL: https://arxiv.org/abs/1803.01271.

Chomp1d
Module
Temporal trimming layer for 1D sequences.
Removes the rightmost horizon timesteps from a 3D tensor. This is commonly used to trim padding added by convolution operations, ensuring the output sequence has the desired length. The operation trims the temporal dimension: [N, C, T] -> [N, C, T-horizon]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| horizon | int | Number of timesteps to remove from the end of the temporal dimension. | required |
Returns:
| Type | Description |
|---|---|
| Tensor | Trimmed tensor of shape [N, C, T-horizon]. |
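The trim itself is a single slice; a minimal sketch (names are illustrative):

```python
import torch
import torch.nn as nn

class Chomp1dSketch(nn.Module):
    def __init__(self, horizon):
        super().__init__()
        self.horizon = horizon

    def forward(self, x):
        # x: [N, C, T] -> [N, C, T - horizon], dropping the rightmost steps
        return x[:, :, : -self.horizon].contiguous()

x = torch.randn(2, 3, 10)
print(Chomp1dSketch(horizon=4)(x).shape)  # torch.Size([2, 3, 6])
```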
CausalConv1d
Module
Causal 1D convolution.
Receives an input x of shape [N, C_in, T] and computes a causal convolution along the time dimension, skipping the H steps of the forecast horizon through its dilation. Consider a batch of one element; the dilated convolution operation at time step $t$ is defined as:

$$\mathrm{Conv1D}(\mathbf{x}, \mathbf{w})(t) = \sum_{k=1}^{K} w_{k}\, x_{t-dk}$$

where $d$ is the dilation factor, $K$ is the kernel size, and $t-dk$ is the index of the considered past observation. The dilation effectively applies a filter with skip connections. If $d=1$, one recovers a normal convolution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| in_channels | int | Number of channels of the input x. | required |
| out_channels | int | Number of channels of the output. | required |
| activation | str | Name of a PyTorch activation function. | required |
| padding | int | Number of zeros used for left padding. | required |
| kernel_size | int | Convolution's kernel size. | required |
| dilation | int | Dilation factor of the convolution. | required |
Returns:
| Type | Description |
|---|---|
| Tensor | Output tensor of shape [N, C_out, T], computed as activation(conv1d(inputs, kernel) + bias). |
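Causality can be obtained by padding on the left and trimming the overhang so that the output at time t never depends on inputs after t; a minimal sketch assuming this padding-plus-trim construction (illustrative, not necessarily the library's exact layer):

```python
import torch
import torch.nn as nn

class CausalConv1dSketch(nn.Module):
    def __init__(self, in_channels, out_channels, activation, padding, kernel_size, dilation):
        super().__init__()
        # nn.Conv1d pads both ends; trimming `padding` steps from the right
        # afterwards leaves a purely left-padded (causal) convolution.
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=padding, dilation=dilation)
        self.padding = padding
        self.activation = getattr(nn, activation)()

    def forward(self, x):
        # x: [N, C_in, T] -> [N, C_out, T] when padding = (kernel_size - 1) * dilation
        y = self.conv(x)
        y = y[:, :, : -self.padding] if self.padding > 0 else y
        return self.activation(y)

x = torch.randn(4, 1, 32)
conv = CausalConv1dSketch(1, 8, "ReLU", padding=4, kernel_size=3, dilation=2)
print(conv(x).shape)  # torch.Size([4, 8, 32])
```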
TemporalConvolutionEncoder
3. Transformers
References
- Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., & Zhang, W. (2021). Informer: Beyond efficient transformer for long sequence time-series forecasting.
- Wu, H., Xu, J., Wang, J., & Long, M. (2021). Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting.
TransEncoder
Module
Transformer Encoder.
A stack of transformer encoder layers that processes input sequences through multiple self-attention and feed-forward layers. Optionally includes convolutional layers between attention layers for distillation and a final normalization layer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| attn_layers | list of TransEncoderLayer | List of transformer encoder layers to stack. | required |
| conv_layers | list of nn.Module | List of convolutional layers applied between attention layers. Must have length len(attn_layers) - 1 if provided. Used for distillation in models like Informer. | None |
| norm_layer | Module | Normalization layer applied to the final output. Typically nn.LayerNorm. | None |
Returns:
| Type | Description |
|---|---|
| Tensor | Encoded output tensor of shape [batch, seq_len, hidden_size] after passing through all encoder layers and optional normalization. |
| list[torch.Tensor] | List of attention weights from each encoder layer, each of shape [batch, n_heads, seq_len, seq_len] (or None if not computed). |
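A minimal sketch of how such a stack might run its forward pass, assuming each attention layer returns an (output, attention_weights) pair and that conv_layers, when given, interleave between consecutive attention layers (names are illustrative):

```python
import torch.nn as nn

class TransEncoderSketch(nn.Module):
    def __init__(self, attn_layers, conv_layers=None, norm_layer=None):
        super().__init__()
        self.attn_layers = nn.ModuleList(attn_layers)
        self.conv_layers = nn.ModuleList(conv_layers) if conv_layers is not None else None
        self.norm = norm_layer

    def forward(self, x, attn_mask=None):
        # x: [batch, seq_len, hidden_size]
        attns = []
        if self.conv_layers is not None:
            # len(conv_layers) == len(attn_layers) - 1: a distillation conv
            # follows every attention layer except the last one
            for attn_layer, conv_layer in zip(self.attn_layers, self.conv_layers):
                x, attn = attn_layer(x, attn_mask=attn_mask)
                x = conv_layer(x)
                attns.append(attn)
            x, attn = self.attn_layers[-1](x)
            attns.append(attn)
        else:
            for attn_layer in self.attn_layers:
                x, attn = attn_layer(x, attn_mask=attn_mask)
                attns.append(attn)
        if self.norm is not None:
            x = self.norm(x)
        return x, attns
```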
TransEncoderLayer
Module
Transformer Encoder Layer.
A single layer of the transformer encoder that applies self-attention followed by a position-wise feed-forward network with residual connections and layer normalization. Dropout is applied after the self-attention output and twice in the feed-forward network (after each convolution) before the residual connections for regularization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| attention | AttentionLayer | Self-attention mechanism to apply. | required |
| hidden_size | int | Dimension of the model's hidden representations. | required |
| conv_hidden_size | int | Dimension of the feed-forward network's hidden layer. Defaults to 4 * hidden_size if not specified. | None |
| dropout | float | Dropout probability applied after attention and feed-forward layers. | 0.1 |
| activation | str | Activation function to use in the feed-forward network. Either "relu" or "gelu". | 'relu' |
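A minimal sketch of the layer's data flow, assuming the position-wise feed-forward network is realized with two kernel-size-1 convolutions, as is common in Informer-style implementations (illustrative, not the library's exact code):

```python
import torch.nn as nn
import torch.nn.functional as F

class TransEncoderLayerSketch(nn.Module):
    def __init__(self, attention, hidden_size, conv_hidden_size=None,
                 dropout=0.1, activation="relu"):
        super().__init__()
        conv_hidden_size = conv_hidden_size or 4 * hidden_size
        self.attention = attention  # e.g. an AttentionLayer instance
        self.conv1 = nn.Conv1d(hidden_size, conv_hidden_size, kernel_size=1)
        self.conv2 = nn.Conv1d(conv_hidden_size, hidden_size, kernel_size=1)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu

    def forward(self, x, attn_mask=None):
        # Self-attention sub-layer: dropout before the residual connection
        new_x, attn = self.attention(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + self.dropout(new_x))
        # Position-wise feed-forward: dropout after each convolution
        y = self.dropout(self.activation(self.conv1(x.transpose(-1, 1))))
        y = self.dropout(self.conv2(y).transpose(-1, 1))
        return self.norm2(x + y), attn
```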
TransDecoder
Module
Transformer decoder module for sequence-to-sequence forecasting.
Stacks multiple TransDecoderLayer modules to process decoder inputs with self-attention and cross-attention mechanisms. Optionally applies layer normalization and a final projection layer to produce output predictions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| layers | list of TransDecoderLayer | List of transformer decoder layers to stack. | required |
| norm_layer | Module | Normalization layer applied to the final output. | None |
| projection | Module | Final projection layer mapping hidden states to the output dimension. | None |
Returns:
| Type | Description |
|---|---|
| Tensor | Decoded output tensor. If projection is provided, returns a tensor of shape [batch, target_seq_len, output_dim]; otherwise, returns a tensor of shape [batch, target_seq_len, hidden_size]. |
TransDecoderLayer
Module
Transformer Decoder Layer.
A single layer of the transformer decoder that applies masked self-attention, cross-attention with encoder outputs, and a position-wise feed-forward network with residual connections and layer normalization. Dropout is applied after each sub-layer (self-attention, cross-attention, and twice in the feed-forward network) before the residual connection for regularization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| self_attention | AttentionLayer | Masked self-attention mechanism for the decoder. | required |
| cross_attention | AttentionLayer | Cross-attention mechanism to attend to encoder outputs. | required |
| hidden_size | int | Dimension of the model's hidden representations. | required |
| conv_hidden_size | int | Dimension of the feed-forward network's hidden layer. Defaults to 4 * hidden_size if not specified. | None |
| dropout | float | Dropout probability applied after attention and feed-forward layers. | 0.1 |
| activation | str | Activation function to use in the feed-forward network. Either "relu" or "gelu". | 'relu' |
Returns:
| Type | Description |
|---|---|
| Tensor | Output tensor of shape [batch, target_seq_len, hidden_size] after applying masked self-attention, cross-attention, and feed-forward transformations. |
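The decoder layer repeats the same residual pattern with an extra cross-attention sub-layer; a minimal sketch under the same assumptions as the encoder-layer sketch above:

```python
import torch.nn as nn
import torch.nn.functional as F

class TransDecoderLayerSketch(nn.Module):
    def __init__(self, self_attention, cross_attention, hidden_size,
                 conv_hidden_size=None, dropout=0.1, activation="relu"):
        super().__init__()
        conv_hidden_size = conv_hidden_size or 4 * hidden_size
        self.self_attention = self_attention
        self.cross_attention = cross_attention
        self.conv1 = nn.Conv1d(hidden_size, conv_hidden_size, kernel_size=1)
        self.conv2 = nn.Conv1d(conv_hidden_size, hidden_size, kernel_size=1)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(hidden_size) for _ in range(3))
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu

    def forward(self, x, cross, x_mask=None, cross_mask=None):
        # Masked self-attention over the decoder inputs
        x = self.norm1(x + self.dropout(self.self_attention(x, x, x, attn_mask=x_mask)[0]))
        # Cross-attention: queries from the decoder, keys/values from the encoder
        x = self.norm2(x + self.dropout(
            self.cross_attention(x, cross, cross, attn_mask=cross_mask)[0]))
        # Position-wise feed-forward, dropout after each convolution
        y = self.dropout(self.activation(self.conv1(x.transpose(-1, 1))))
        y = self.dropout(self.conv2(y).transpose(-1, 1))
        return self.norm3(x + y)  # [batch, target_seq_len, hidden_size]
```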
AttentionLayer
Module
Multi-head attention layer wrapper.
This layer wraps an attention mechanism and handles the linear projections for queries, keys, and values in multi-head attention. It projects inputs to multiple heads, applies the inner attention mechanism, and projects back to the original hidden dimension.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| attention | Module | Inner attention mechanism (e.g., FullAttention, ProbAttention) that computes attention scores and outputs. | required |
| hidden_size | int | Dimension of the model's hidden states. | required |
| n_heads | int | Number of attention heads. | required |
| d_keys | int | Dimension of keys per head. If None, defaults to hidden_size // n_heads. | None |
| d_values | int | Dimension of values per head. If None, defaults to hidden_size // n_heads. | None |
Returns:
| Type | Description |
|---|---|
| Tensor | Output tensor of shape [batch, seq_len, hidden_size] after applying multi-head attention. |
| torch.Tensor or None | Attention weights of shape [batch, n_heads, seq_len, seq_len] if output_attention is True in the inner attention mechanism, otherwise None. |
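A minimal sketch of the projection logic, assuming the inner attention accepts per-head tensors of shape [batch, seq_len, n_heads, head_dim] and returns an (output, weights) pair (names are illustrative):

```python
import torch.nn as nn

class AttentionLayerSketch(nn.Module):
    def __init__(self, attention, hidden_size, n_heads, d_keys=None, d_values=None):
        super().__init__()
        d_keys = d_keys or hidden_size // n_heads
        d_values = d_values or hidden_size // n_heads
        self.inner_attention = attention
        self.query_projection = nn.Linear(hidden_size, d_keys * n_heads)
        self.key_projection = nn.Linear(hidden_size, d_keys * n_heads)
        self.value_projection = nn.Linear(hidden_size, d_values * n_heads)
        self.out_projection = nn.Linear(d_values * n_heads, hidden_size)
        self.n_heads = n_heads

    def forward(self, queries, keys, values, attn_mask=None):
        B, L, _ = queries.shape
        _, S, _ = keys.shape
        H = self.n_heads
        # Project and split into heads: [batch, seq_len, n_heads, head_dim]
        queries = self.query_projection(queries).view(B, L, H, -1)
        keys = self.key_projection(keys).view(B, S, H, -1)
        values = self.value_projection(values).view(B, S, H, -1)
        out, attn = self.inner_attention(queries, keys, values, attn_mask)
        # Merge heads and project back to hidden_size
        out = out.reshape(B, L, -1)
        return self.out_projection(out), attn
```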
FullAttention
Module
Full attention mechanism with scaled dot-product attention.
Implements standard multi-head attention using scaled dot-product attention. Supports both efficient computation via PyTorch's scaled_dot_product_attention and explicit attention computation when attention weights are needed. Optional causal masking prevents attention to future positions in autoregressive models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mask_flag | bool | If True, applies causal masking to prevent attention to future positions. | True |
| factor | int | Attention factor parameter (unused in FullAttention, kept for API compatibility with ProbAttention). | 5 |
| scale | float | Custom scaling factor for attention scores. If None, uses 1/sqrt(d_k), where d_k is the key dimension. | None |
| attention_dropout | float | Dropout rate applied to attention weights. | 0.1 |
| output_attention | bool | If True, returns attention weights along with the output. If False, uses efficient flash attention. | False |
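When output_attention is False, the computation can be delegated to PyTorch's scaled_dot_product_attention; when the weights are needed, the explicit path looks roughly like the following sketch (shape convention [batch, seq_len, n_heads, head_dim]; names are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def full_attention_sketch(q, k, v, causal=False, scale=None, dropout_p=0.0):
    # q: [batch, q_len, n_heads, head_dim]; k, v: [batch, kv_len, n_heads, head_dim]
    B, L, H, E = q.shape
    scale = scale if scale is not None else 1.0 / math.sqrt(E)
    scores = torch.einsum("blhe,bshe->bhls", q, k)
    if causal:
        # Mask out j > i so position i only attends to positions <= i
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool, device=q.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    attn = F.dropout(torch.softmax(scale * scores, dim=-1), p=dropout_p)
    out = torch.einsum("bhls,bshe->blhe", attn, v)
    return out, attn  # attn: [batch, n_heads, q_len, kv_len]

q = k = v = torch.randn(2, 6, 4, 8)
out, attn = full_attention_sketch(q, k, v, causal=True)
print(out.shape, attn.shape)  # torch.Size([2, 6, 4, 8]) torch.Size([2, 4, 6, 6])
```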
TriangularCausalMask
Module
Triangular causal mask for self-attention.
Prevents queries from attending to future positions in the sequence. This ensures causality in autoregressive models, where predictions at time t should only depend on positions before t. The mask is created using torch.triu with diagonal=1, resulting in a mask where positions (i, j) are True when j > i, effectively masking out future positions during attention computation.
Parameters:
Attributes:
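The construction described above is a one-liner; a minimal sketch:

```python
import torch

seq_len = 5
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])
# True entries (j > i) are filled with -inf in the attention scores
# before the softmax, so position i never attends to the future.
```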
DataEmbedding_inverted
Module
Inverted data embedding module for variate-as-token transformer architectures.
Transforms time series data by treating each variate (channel) as a token rather than each time step. The input is permuted from [Batch, Time, Variate] to [Batch, Variate, Time], then a linear layer projects the time dimension to the hidden dimension. Optionally concatenates temporal covariates along the variate dimension.
Parameters:
Returns:
| Type | Description |
|---|---|
| Tensor | Inverted embeddings of shape [batch, n_variates, hidden_size], or [batch, n_variates + n_temporal_features, hidden_size] if x_mark is provided. |
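A minimal sketch of the inverted embedding, assuming the linear layer maps the time dimension (seq_len) to hidden_size (names are illustrative):

```python
import torch
import torch.nn as nn

class InvertedEmbeddingSketch(nn.Module):
    def __init__(self, seq_len, hidden_size, dropout=0.1):
        super().__init__()
        self.value_embedding = nn.Linear(seq_len, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, x_mark=None):
        # x: [batch, time, variate] -> [batch, variate, time]
        x = x.permute(0, 2, 1)
        if x_mark is not None:
            # Concatenate temporal covariates along the variate axis
            x = torch.cat([x, x_mark.permute(0, 2, 1)], dim=1)
        # Project the time dimension to the hidden dimension
        return self.dropout(self.value_embedding(x))  # [batch, variate(+marks), hidden]

x = torch.randn(32, 96, 7)                     # [batch, time, variate]
emb = InvertedEmbeddingSketch(seq_len=96, hidden_size=512)
print(emb(x).shape)                            # torch.Size([32, 7, 512])
```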
DataEmbedding
Module
Data embedding module combining value, positional, and temporal embeddings.
Transforms time series data into high-dimensional embeddings by combining:
- Value embeddings: Convolutional encoding of the time series values
- Positional embeddings: Sinusoidal encodings for relative position within window
- Temporal embeddings: Linear projection of absolute calendar features (optional)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| c_in | int | Number of input channels (variates) in the time series. | required |
| exog_input_size | int | Number of exogenous/temporal features. If 0, temporal embeddings are disabled. | required |
| hidden_size | int | Dimension of the embedding vectors. | required |
| pos_embedding | bool | Whether to include positional embeddings. | True |
| dropout | float | Dropout rate applied to the final embeddings. | 0.1 |
Returns:
| Type | Description |
|---|---|
| Tensor | Combined embeddings of shape [batch, seq_len, hidden_size] after applying dropout to the sum of value, positional, and temporal embeddings. |
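A minimal sketch of the combination logic; for brevity the sub-embeddings are passed in as ready-made modules, whereas the module described above builds them from c_in, exog_input_size, and hidden_size:

```python
import torch.nn as nn

class DataEmbeddingSketch(nn.Module):
    """Combines value, positional, and (optional) temporal embeddings by summation."""

    def __init__(self, value_embedding, position_embedding=None,
                 temporal_embedding=None, dropout=0.1):
        super().__init__()
        self.value_embedding = value_embedding
        self.position_embedding = position_embedding
        self.temporal_embedding = temporal_embedding
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, x_mark=None):
        out = self.value_embedding(x)  # [batch, seq_len, hidden_size]
        if self.position_embedding is not None:
            out = out + self.position_embedding(x)       # relative position in window
        if self.temporal_embedding is not None and x_mark is not None:
            out = out + self.temporal_embedding(x_mark)  # absolute calendar features
        return self.dropout(out)
```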
TemporalEmbedding
Module
Temporal embedding module for encoding calendar-based time features.
Creates learnable or fixed embeddings for temporal features including month, day, weekday, hour, and optionally minute. These embeddings are summed to produce a combined temporal representation.
Parameters:
Returns:
| Type | Description |
|---|---|
| Tensor | Combined temporal embeddings of shape [batch, seq_len, d_model], representing the sum of all temporal component embeddings. |
FixedEmbedding
Module
Fixed sinusoidal embedding for categorical temporal features.
Creates non-trainable embeddings using sine and cosine functions at different frequencies. Unlike PositionalEmbedding, which encodes continuous positions, FixedEmbedding is designed for discrete categorical inputs (e.g., hour of day, day of month, month of year). The embeddings are precomputed and frozen, making them non-learnable parameters.
Parameters:
Returns:
| Type | Description |
|---|---|
| Tensor | Fixed embeddings of shape [batch, seq_len, d_model], detached from the computation graph. |
TimeFeatureEmbedding
Module
Linear embedding for temporal/calendar features.
Transforms time-based features (e.g., hour, day, month) into embeddings using a single linear projection without bias. This embedding is typically used to incorporate calendar information into transformer models, providing absolute temporal context that complements positional encodings.
Parameters:
Returns:
| Type | Description |
|---|---|
| Tensor | Time feature embeddings of shape [batch, seq_len, hidden_size]. |
PositionalEmbedding
Module
Sinusoidal positional embedding for transformer models.
Generates fixed sinusoidal positional encodings using sine and cosine functions at different frequencies. These encodings provide position information to transformer models, allowing them to understand the relative or absolute position of tokens in a sequence. The encodings are precomputed and stored as a buffer, making them non-trainable.
Parameters:
Returns:
| Type | Description |
|---|---|
| Tensor | Positional encodings of shape [1, seq_len, hidden_size], where seq_len is the length of the input sequence. |
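A minimal sketch of the standard sinusoidal table, stored as a non-trainable buffer and sliced to the input length (assumes an even hidden_size; FixedEmbedding builds the same kind of table but wraps it in a frozen nn.Embedding indexed by categorical values):

```python
import math
import torch
import torch.nn as nn

class PositionalEmbeddingSketch(nn.Module):
    def __init__(self, hidden_size, max_len=5000):
        super().__init__()
        position = torch.arange(max_len).float().unsqueeze(1)
        div_term = torch.exp(torch.arange(0, hidden_size, 2).float()
                             * -(math.log(10000.0) / hidden_size))
        pe = torch.zeros(max_len, hidden_size)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dims: sine
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dims: cosine
        # Registered as a buffer: moves with the module but is never trained
        self.register_buffer("pe", pe.unsqueeze(0))   # [1, max_len, hidden_size]

    def forward(self, x):
        # x: [batch, seq_len, ...]; return encodings for the first seq_len positions
        return self.pe[:, : x.size(1)]

pe = PositionalEmbeddingSketch(hidden_size=64)
print(pe(torch.randn(8, 36, 64)).shape)  # torch.Size([1, 36, 64])
```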
SeriesDecomp
Module
Series decomposition block for trend-residual decomposition.
Decomposes time series into trend and residual components using moving average
filtering. The trend is extracted via a moving average filter, and the residual
is computed as the difference between the input and the trend.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| kernel_size | int | Size of the moving average window for trend extraction. | required |
MovingAvg
Module
Moving average block to highlight the trend of time series.
Applies a moving average filter using 1D average pooling to smooth time series data and extract trend components. The input is padded on both ends by repeating the first and last values to maintain the original sequence length.
Parameters:
Returns:
| Type | Description |
|---|---|
| Tensor | Smoothed time series of shape [batch, seq_len, channels], representing the trend component after applying the moving average. |
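A minimal sketch of both blocks together, assuming edge-replication padding and an odd kernel_size so the sequence length is preserved (names are illustrative):

```python
import torch
import torch.nn as nn

class MovingAvgSketch(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=1, padding=0)

    def forward(self, x):
        # x: [batch, seq_len, channels]; pad both ends by repeating edge values
        front = x[:, 0:1, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        end = x[:, -1:, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        x = torch.cat([front, x, end], dim=1)
        # AvgPool1d expects [batch, channels, time]
        return self.avg(x.permute(0, 2, 1)).permute(0, 2, 1)

class SeriesDecompSketch(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.moving_avg = MovingAvgSketch(kernel_size)

    def forward(self, x):
        trend = self.moving_avg(x)       # smooth trend component
        residual = x - trend             # what remains after removing the trend
        return residual, trend

x = torch.randn(2, 48, 3)
res, trend = SeriesDecompSketch(kernel_size=25)(x)
print(res.shape, trend.shape)  # torch.Size([2, 48, 3]) torch.Size([2, 48, 3])
```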
RevIN
Module
Reversible Instance Normalization for time series forecasting.
Normalizes time series data by removing the mean (or last value) and scaling by the standard deviation. The normalization can be reversed after model predictions to restore the original scale. Optionally includes learnable affine parameters for additional transformation flexibility.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| num_features | int | The number of features or channels in the time series. | required |
| eps | float | A value added for numerical stability. | 1e-05 |
| affine | bool | If True, RevIN has learnable affine parameters (weight and bias). | False |
| subtract_last | bool | If True, subtracts the last value instead of the mean in normalization. | False |
| non_norm | bool | If True, no normalization is performed (identity operation). | False |
Returns:
| Type | Description |
|---|---|
| Tensor | Normalized tensor (if mode="norm") or denormalized tensor (if mode="denorm") of the same shape as the input [batch, seq_len, num_features]. |
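A minimal sketch of the norm/denorm round trip, omitting the subtract_last and non_norm options (names are illustrative):

```python
import torch
import torch.nn as nn

class RevINSketch(nn.Module):
    def __init__(self, num_features, eps=1e-5, affine=False):
        super().__init__()
        self.eps = eps
        self.affine = affine
        if affine:
            self.weight = nn.Parameter(torch.ones(num_features))
            self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x, mode):
        # x: [batch, seq_len, num_features]; statistics per instance over time
        if mode == "norm":
            self.mean = x.mean(dim=1, keepdim=True).detach()
            self.stdev = torch.sqrt(
                x.var(dim=1, keepdim=True, unbiased=False) + self.eps).detach()
            x = (x - self.mean) / self.stdev
            if self.affine:
                x = x * self.weight + self.bias
        elif mode == "denorm":
            if self.affine:
                x = (x - self.bias) / (self.weight + self.eps)  # eps guards division
            x = x * self.stdev + self.mean
        return x

x = torch.randn(8, 36, 7)
revin = RevINSketch(num_features=7)
x_norm = revin(x, mode="norm")      # before the model
# ... the model produces predictions in the normalized space ...
y = revin(x_norm, mode="denorm")    # restore the original scale (y ~= x here)
```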
RevINMultivariate
Module
Reversible Instance Normalization for multivariate time series models.
Normalizes multivariate time series data using batch statistics computed across the time dimension. The normalization can be reversed after model predictions to restore the original scale. Optionally includes learnable affine parameters for additional transformation flexibility.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| num_features | int | The number of features or channels in the time series. | required |
| eps | float | A value added for numerical stability. | 1e-05 |
| affine | bool | If True, RevINMultivariate has learnable affine parameters (weight and bias). | False |
| subtract_last | bool | Not used in this implementation (kept for API compatibility). | False |
| non_norm | bool | Not used in this implementation (kept for API compatibility). | False |
Returns:
| Type | Description |
|---|---|
| Tensor | Normalized tensor (if mode="norm") or denormalized tensor (if mode="denorm") of the same shape as the input [batch, seq_len, num_features]. |

