
1. MLP

Multi-Layer Perceptron

MLP

MLP(in_features, out_features, activation, hidden_size, num_layers, dropout)
Bases: Module

Multi-Layer Perceptron for time series forecasting. A feedforward neural network with configurable depth and width. The network consists of an input layer, multiple hidden layers with activation functions and dropout, and an output layer. All hidden layers have the same dimensionality.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| in_features | int | Dimension of input features. | required |
| out_features | int | Dimension of output features. | required |
| activation | str | Activation function name. Must be one of the supported activations in the ACTIVATIONS list (e.g., 'ReLU', 'Tanh', 'GELU', 'ELU'). | required |
| hidden_size | int | Number of units in each hidden layer. All hidden layers share the same dimensionality. | required |
| num_layers | int | Total number of layers, including input and output layers. Must be at least 2. For example, num_layers=3 creates an input layer, one hidden layer, and an output layer. | required |
| dropout | float | Dropout probability applied after each hidden layer's activation. Should be in the range [0.0, 1.0]. Not applied to the output layer. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Transformed output tensor of shape [..., out_features]. |
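A minimal sketch of the documented behavior, assuming the class composes standard nn.Linear blocks as described (the class name MLPSketch and its internals are illustrative, not the library's source):

```python
import torch
import torch.nn as nn

class MLPSketch(nn.Module):
    """Illustrative re-implementation of the documented contract."""

    def __init__(self, in_features, out_features, activation, hidden_size,
                 num_layers, dropout):
        super().__init__()
        assert num_layers >= 2, "num_layers counts the input and output layers"
        act = getattr(nn, activation)  # e.g. 'ReLU', 'Tanh', 'GELU', 'ELU'
        layers = [nn.Linear(in_features, hidden_size), act(), nn.Dropout(dropout)]
        for _ in range(num_layers - 2):  # hidden layers share one width
            layers += [nn.Linear(hidden_size, hidden_size), act(), nn.Dropout(dropout)]
        layers.append(nn.Linear(hidden_size, out_features))  # no activation/dropout here
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

mlp = MLPSketch(24, 12, "ReLU", hidden_size=64, num_layers=3, dropout=0.1)
print(mlp(torch.randn(8, 24)).shape)  # torch.Size([8, 12])
```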

2. Temporal Convolutions

For a long time in deep learning, sequence modelling was synonymous with recurrent networks, yet several papers have shown that simple convolutional architectures can outperform canonical recurrent networks like LSTMs by demonstrating longer effective memory.

References

- van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. Computing Research Repository, abs/1609.03499. URL: http://arxiv.org/abs/1609.03499.
- Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. Computing Research Repository, abs/1803.01271. URL: https://arxiv.org/abs/1803.01271.

Chomp1d

Chomp1d(horizon)
Bases: Module

Temporal trimming layer for 1D sequences. Removes the rightmost horizon timesteps from a 3D tensor. This is commonly used to trim padding added by convolution operations, ensuring the output sequence has the desired length.

The operation trims the temporal dimension: [N, C, T] -> [N, C, T-horizon]

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| horizon | int | Number of timesteps to remove from the end of the temporal dimension. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Trimmed tensor of shape [N, C, T-horizon]. |
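Functionally this is a single slice; a minimal sketch of what the layer computes:

```python
import torch

x = torch.randn(2, 8, 100)   # [N, C, T]
horizon = 4
y = x[:, :, :-horizon]       # what Chomp1d(horizon) computes
print(y.shape)               # torch.Size([2, 8, 96])
```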

CausalConv1d

CausalConv1d(
    in_channels,
    out_channels,
    kernel_size,
    padding,
    dilation,
    activation,
    stride=1,
)
Bases: Module

Causal 1D convolution. Receives an input x of shape [N, C_in, T] and computes a causal convolution along the time dimension, skipping the H steps of the forecast horizon through its dilation. For a batch of one element, the dilated convolution operation at time step $t$ is defined as:

$$\mathrm{Conv1D}(\mathbf{x},\mathbf{w})(t) = (\mathbf{x}_{[*d]} \mathbf{w})(t) = \sum^{K}_{k=1} w_{k} \mathbf{x}_{t-dk}$$

where $d$ is the dilation factor, $K$ is the kernel size, and $t-dk$ is the index of the considered past observation. The dilation effectively applies a filter with skip connections; if $d=1$, one recovers a normal convolution.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| in_channels | int | Dimension of the x input's initial channels. | required |
| out_channels | int | Dimension of the x output's channels. | required |
| activation | str | Name identifying an activation from PyTorch's activations. | required |
| padding | int | Number of zero-padding steps applied to the left. | required |
| kernel_size | int | Convolution's kernel size. | required |
| dilation | int | Dilation factor for the skip connections. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Tensor of shape [N, C_out, T], computed as activation(conv1d(inputs, kernel) + bias). |
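A sketch of the usual causal-convolution pattern these pieces implement, assuming the standard composition of symmetric conv padding followed by a Chomp1d-style right trim (the composition shown is an assumption, not the library's exact source):

```python
import torch
import torch.nn as nn

# Padding of (kernel_size - 1) * dilation, then trimming the right end,
# makes the output at time t depend only on inputs at times <= t.
kernel_size, dilation = 3, 2
padding = (kernel_size - 1) * dilation

conv = nn.Conv1d(in_channels=8, out_channels=16,
                 kernel_size=kernel_size, padding=padding, dilation=dilation)

x = torch.randn(4, 8, 50)          # [N, C_in, T]
out = conv(x)[:, :, :-padding]     # chomp the extra right-side steps
print(out.shape)                   # torch.Size([4, 16, 50]) == [N, C_out, T]
```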

TemporalConvolutionEncoder

3. Transformers


TransEncoder

TransEncoder(attn_layers, conv_layers=None, norm_layer=None)
Bases: Module

Transformer encoder. A stack of transformer encoder layers that processes input sequences through multiple self-attention and feed-forward layers. Optionally includes convolutional layers between attention layers for distillation, and a final normalization layer.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| attn_layers | list of TransEncoderLayer | List of transformer encoder layers to stack. | required |
| conv_layers | list of nn.Module | List of convolutional layers applied between attention layers. Must have length len(attn_layers) - 1 if provided. Used for distillation in models like Informer. | None |
| norm_layer | Module | Normalization layer applied to the final output. Typically nn.LayerNorm. | None |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Encoded output tensor of shape [batch, seq_len, hidden_size] after passing through all encoder layers and optional normalization. |
| list[torch.Tensor] | List of attention weights from each encoder layer, each of shape [batch, n_heads, seq_len, seq_len] (or None if not computed). |
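A sketch of the forward pass this contract implies, with distillation convs interleaved between attention layers (function and argument names are illustrative, not the library's source):

```python
def encoder_forward(x, attn_layers, conv_layers=None, norm_layer=None, attn_mask=None):
    """Illustrative forward pass for the documented stacking contract."""
    attns = []
    if conv_layers is not None:
        # len(conv_layers) == len(attn_layers) - 1: a conv follows every
        # attention layer except the last (Informer-style distillation)
        for attn_layer, conv_layer in zip(attn_layers, conv_layers):
            x, attn = attn_layer(x, attn_mask=attn_mask)
            x = conv_layer(x)
            attns.append(attn)
        x, attn = attn_layers[-1](x)
        attns.append(attn)
    else:
        for attn_layer in attn_layers:
            x, attn = attn_layer(x, attn_mask=attn_mask)
            attns.append(attn)
    if norm_layer is not None:
        x = norm_layer(x)
    return x, attns
```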

TransEncoderLayer

TransEncoderLayer(
    attention,
    hidden_size,
    conv_hidden_size=None,
    dropout=0.1,
    activation="relu",
)
Bases: Module

Transformer encoder layer. A single layer of the transformer encoder that applies self-attention followed by a position-wise feed-forward network with residual connections and layer normalization. Dropout is applied after the self-attention output and twice in the feed-forward network (after each convolution) before the residual connections, for regularization.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| attention | AttentionLayer | Self-attention mechanism to apply. | required |
| hidden_size | int | Dimension of the model's hidden representations. | required |
| conv_hidden_size | int | Dimension of the feed-forward network's hidden layer. Defaults to 4 * hidden_size if not specified. | None |
| dropout | float | Dropout probability applied after attention and feed-forward layers. | 0.1 |
| activation | str | Activation function to use in the feed-forward network. Either "relu" or "gelu". | 'relu' |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Output tensor of shape [batch, seq_len, hidden_size] after applying self-attention and feed-forward transformations. |
| Tensor or None | Attention weights of shape [batch, n_heads, seq_len, seq_len] if output_attention is True in the attention layer, otherwise None. |
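A sketch of the sublayer ordering the description implies: dropout before each residual add, with 1x1 convolutions as the position-wise feed-forward network (the helper names are illustrative, not the library's source):

```python
import torch.nn.functional as F

def encoder_layer_forward(x, attention, conv1, conv2, norm1, norm2,
                          dropout, attn_mask=None):
    """Illustrative sublayer ordering for the documented contract."""
    new_x, attn = attention(x, x, x, attn_mask=attn_mask)
    x = norm1(x + dropout(new_x))                   # residual + layer norm
    y = dropout(F.relu(conv1(x.transpose(-1, 1))))  # [batch, hidden, seq] for Conv1d
    y = dropout(conv2(y).transpose(-1, 1))          # back to [batch, seq, hidden]
    return norm2(x + y), attn
```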

TransDecoder

TransDecoder(layers, norm_layer=None, projection=None)
Bases: Module

Transformer decoder module for sequence-to-sequence forecasting. Stacks multiple TransDecoderLayer modules to process decoder inputs with self-attention and cross-attention mechanisms. Optionally applies layer normalization and a final projection layer to produce output predictions.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| layers | list | List of TransDecoderLayer instances to stack sequentially. | required |
| norm_layer | Module | Layer normalization module applied after all decoder layers. | None |
| projection | Module | Final projection layer (typically nn.Linear) to map hidden representations to the output dimension. | None |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Decoded output tensor. If projection is provided, returns a tensor of shape [batch, target_seq_len, output_dim]; otherwise, a tensor of shape [batch, target_seq_len, hidden_size]. |

TransDecoderLayer

TransDecoderLayer(
    self_attention,
    cross_attention,
    hidden_size,
    conv_hidden_size=None,
    dropout=0.1,
    activation="relu",
)
Bases: Module

Transformer decoder layer. A single layer of the transformer decoder that applies masked self-attention, cross-attention with encoder outputs, and a position-wise feed-forward network with residual connections and layer normalization. Dropout is applied after each sub-layer (self-attention, cross-attention, and twice in the feed-forward network) before the residual connection, for regularization.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| self_attention | AttentionLayer | Masked self-attention mechanism for the decoder. | required |
| cross_attention | AttentionLayer | Cross-attention mechanism to attend to encoder outputs. | required |
| hidden_size | int | Dimension of the model's hidden representations. | required |
| conv_hidden_size | int | Dimension of the feed-forward network's hidden layer. Defaults to 4 * hidden_size if not specified. | None |
| dropout | float | Dropout probability applied after attention and feed-forward layers. | 0.1 |
| activation | str | Activation function to use in the feed-forward network. Either "relu" or "gelu". | 'relu' |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Output tensor of shape [batch, target_seq_len, hidden_size] after applying masked self-attention, cross-attention, and feed-forward transformations. |
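A sketch of the sub-layer ordering described above (illustrative helper names, not the library's source):

```python
import torch.nn.functional as F

def decoder_layer_forward(x, cross, self_attn, cross_attn, conv1, conv2,
                          norm1, norm2, norm3, dropout,
                          x_mask=None, cross_mask=None):
    """Illustrative sub-layer ordering for the documented contract."""
    # masked self-attention over decoder inputs
    x = norm1(x + dropout(self_attn(x, x, x, attn_mask=x_mask)[0]))
    # cross-attention: queries from the decoder, keys/values from encoder output
    x = norm2(x + dropout(cross_attn(x, cross, cross, attn_mask=cross_mask)[0]))
    # position-wise feed-forward via 1x1 convolutions, dropout before each residual
    y = dropout(F.relu(conv1(x.transpose(-1, 1))))
    y = dropout(conv2(y).transpose(-1, 1))
    return norm3(x + y)
```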

AttentionLayer

AttentionLayer(attention, hidden_size, n_heads, d_keys=None, d_values=None)
Bases: Module

Multi-head attention layer wrapper. This layer wraps an attention mechanism and handles the linear projections for queries, keys, and values in multi-head attention. It projects inputs to multiple heads, applies the inner attention mechanism, and projects back to the original hidden dimension.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| attention | Module | Inner attention mechanism (e.g., FullAttention, ProbAttention) that computes attention scores and outputs. | required |
| hidden_size | int | Dimension of the model's hidden states. | required |
| n_heads | int | Number of attention heads. | required |
| d_keys | int | Dimension of keys per head. If None, defaults to hidden_size // n_heads. | None |
| d_values | int | Dimension of values per head. If None, defaults to hidden_size // n_heads. | None |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Output tensor of shape [batch, seq_len, hidden_size] after applying multi-head attention. |
| Tensor or None | Attention weights of shape [batch, n_heads, seq_len, seq_len] if output_attention is True in the inner attention mechanism, otherwise None. |
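A shape walk-through of the documented projections, assuming the default per-head dimensions; PyTorch's scaled_dot_product_attention stands in here for the inner attention mechanism:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, L, hidden_size, n_heads = 2, 10, 64, 4
d_keys = d_values = hidden_size // n_heads      # documented defaults

x = torch.randn(B, L, hidden_size)
# project and split into heads: [B, L, hidden] -> [B, L, n_heads, d]
q = nn.Linear(hidden_size, d_keys * n_heads)(x).view(B, L, n_heads, d_keys)
k = nn.Linear(hidden_size, d_keys * n_heads)(x).view(B, L, n_heads, d_keys)
v = nn.Linear(hidden_size, d_values * n_heads)(x).view(B, L, n_heads, d_values)

# inner attention (SDPA expects [B, H, L, E])
heads = F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
).transpose(1, 2)                               # back to [B, L, H, D]

# merge heads and project back to the hidden dimension
out = nn.Linear(d_values * n_heads, hidden_size)(heads.reshape(B, L, -1))
print(out.shape)                                # torch.Size([2, 10, 64])
```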

FullAttention

FullAttention(
    mask_flag=True,
    factor=5,
    scale=None,
    attention_dropout=0.1,
    output_attention=False,
)
Bases: Module

Full attention mechanism with scaled dot-product attention. Implements standard multi-head attention using scaled dot-product attention. Supports both efficient computation via PyTorch's scaled_dot_product_attention and explicit attention computation when attention weights are needed. Optional causal masking prevents attention to future positions in autoregressive models.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| mask_flag | bool | If True, applies causal masking to prevent attention to future positions. | True |
| factor | int | Attention factor parameter (unused in FullAttention; kept for API compatibility with ProbAttention). | 5 |
| scale | float | Custom scaling factor for attention scores. If None, uses 1/sqrt(d_k), where d_k is the key dimension. | None |
| attention_dropout | float | Dropout rate applied to attention weights. | 0.1 |
| output_attention | bool | If True, returns attention weights along with the output. If False, uses efficient flash attention. | False |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Attention output of shape [batch, seq_len, n_heads, head_dim]. |
| Tensor or None | Attention weights of shape [batch, n_heads, seq_len, seq_len] if output_attention is True, otherwise None. |
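A sketch of the explicit attention path under the documented shapes and default scaling (illustrative, not the library's source):

```python
import math
import torch

def full_attention(q, k, v, scale=None, mask=None):
    """Explicit scaled dot-product attention over [batch, seq, heads, dim] inputs."""
    B, L, H, E = q.shape
    scale = scale or 1.0 / math.sqrt(E)               # documented default: 1/sqrt(d_k)
    scores = torch.einsum("blhe,bshe->bhls", q, k) * scale
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))  # hide future positions
    A = torch.softmax(scores, dim=-1)                 # [B, H, L, S]
    out = torch.einsum("bhls,bshd->blhd", A, v)       # [B, L, H, D]
    return out, A

q = k = v = torch.randn(2, 10, 4, 16)
out, A = full_attention(q, k, v)
print(out.shape, A.shape)  # torch.Size([2, 10, 4, 16]) torch.Size([2, 4, 10, 10])
```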

TriangularCausalMask

TriangularCausalMask(B, L, device='cpu')
Triangular causal mask for autoregressive attention. Creates an upper triangular boolean mask that prevents attention mechanisms from attending to future positions in the sequence. This ensures causality in autoregressive models, where predictions at time t should only depend on positions before t.

The mask is created using torch.triu with diagonal=1, resulting in a mask where position (i, j) is True when j > i, effectively masking out future positions during attention computation.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| B | int | Batch size. | required |
| L | int | Sequence length. | required |
| device | str | Device to place the mask tensor on. | 'cpu' |

Attributes:

| Name | Type | Description |
|------|------|-------------|
| _mask | Tensor | Boolean mask tensor of shape [B, 1, L, L] where True values indicate positions to mask (future positions). |
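The documented construction in one line, shown for a short sequence:

```python
import torch

B, L = 2, 4
mask = torch.triu(torch.ones(B, 1, L, L, dtype=torch.bool), diagonal=1)
print(mask[0, 0].int())
# tensor([[0, 1, 1, 1],
#         [0, 0, 1, 1],
#         [0, 0, 0, 1],
#         [0, 0, 0, 0]], dtype=torch.int32)
```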

DataEmbedding_inverted

DataEmbedding_inverted(c_in, hidden_size, dropout=0.1)
Bases: Module

Inverted data embedding module for variate-as-token transformer architectures. Transforms time series data by treating each variate (channel), rather than each time step, as a token. The input is permuted from [Batch, Time, Variate] to [Batch, Variate, Time], then a linear layer projects the time dimension to the hidden dimension. Optionally concatenates temporal covariates along the variate dimension.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| c_in | int | Number of input time steps (sequence length). | required |
| hidden_size | int | Dimension of the embedding vectors. | required |
| dropout | float | Dropout rate applied to the embeddings. | 0.1 |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Inverted embeddings of shape [batch, n_variates, hidden_size], or [batch, n_variates + n_temporal_features, hidden_size] if x_mark is provided. |
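A sketch of the permute-then-project step described above (illustrative shapes):

```python
import torch
import torch.nn as nn

batch, seq_len, n_variates, hidden_size = 2, 96, 7, 64
x = torch.randn(batch, seq_len, n_variates)   # [Batch, Time, Variate]

proj = nn.Linear(seq_len, hidden_size)        # note: c_in is the sequence length
tokens = proj(x.permute(0, 2, 1))             # each variate becomes one token
print(tokens.shape)                           # torch.Size([2, 7, 64])
```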

DataEmbedding

DataEmbedding(
    c_in, exog_input_size, hidden_size, pos_embedding=True, dropout=0.1
)
Bases: Module

Data embedding module combining value, positional, and temporal embeddings. Transforms time series data into high-dimensional embeddings by combining:

  • Value embeddings: convolutional encoding of the time series values
  • Positional embeddings: sinusoidal encodings for relative position within the window
  • Temporal embeddings: linear projection of absolute calendar features (optional)

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| c_in | int | Number of input channels (variates) in the time series. | required |
| exog_input_size | int | Number of exogenous/temporal features. If 0, temporal embeddings are disabled. | required |
| hidden_size | int | Dimension of the embedding vectors. | required |
| pos_embedding | bool | Whether to include positional embeddings. | True |
| dropout | float | Dropout rate applied to the final embeddings. | 0.1 |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Combined embeddings of shape [batch, seq_len, hidden_size] after applying dropout to the sum of value, positional, and temporal embeddings. |
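A sketch of how the components combine, assuming a Conv1d token embedding for the values; the kernel size and circular padding here are assumptions, not confirmed by the docs:

```python
import torch
import torch.nn as nn

c_in, exog_input_size, hidden_size = 7, 4, 64
# value embedding: Conv1d over the time axis (kernel/padding are assumptions)
value_emb = nn.Conv1d(c_in, hidden_size, kernel_size=3, padding=1,
                      padding_mode="circular")
temporal_emb = nn.Linear(exog_input_size, hidden_size, bias=False)

x = torch.randn(2, 96, c_in)                   # [batch, seq_len, c_in]
x_mark = torch.randn(2, 96, exog_input_size)   # calendar features
values = value_emb(x.permute(0, 2, 1)).transpose(1, 2)  # [batch, seq_len, hidden]
# positional term omitted for brevity; see PositionalEmbedding below
out = nn.Dropout(0.1)(values + temporal_emb(x_mark))
print(out.shape)                               # torch.Size([2, 96, 64])
```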

TemporalEmbedding

TemporalEmbedding(d_model, embed_type='fixed', freq='h')
Bases: Module

Temporal embedding module for encoding calendar-based time features. Creates learnable or fixed embeddings for temporal features including month, day, weekday, hour, and optionally minute. These embeddings are summed to produce a combined temporal representation.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| d_model | int | Dimension of the embedding vectors. | required |
| embed_type | str | Type of embedding to use. Options are "fixed" for FixedEmbedding (sinusoidal) or "learned" for nn.Embedding (learnable). | 'fixed' |
| freq | str | Frequency of the time series data. If "t", includes minute embeddings. | 'h' |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Combined temporal embeddings of shape [batch, seq_len, d_model], representing the sum of all temporal component embeddings. |

FixedEmbedding

FixedEmbedding(c_in, d_model)
Bases: Module

Fixed sinusoidal embedding for categorical temporal features. Creates non-trainable embeddings using sine and cosine functions at different frequencies. Unlike PositionalEmbedding, which encodes continuous positions, FixedEmbedding is designed for discrete categorical inputs (e.g., hour of day, day of month, month of year). The embeddings are precomputed and frozen, making them non-learnable parameters.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| c_in | int | Number of categories (e.g., 24 for hours, 32 for days). | required |
| d_model | int | Dimension of the embedding vectors. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Fixed embeddings of shape [batch, seq_len, d_model], detached from the computation graph. |

TimeFeatureEmbedding

TimeFeatureEmbedding(input_size, hidden_size)
Bases: Module

Linear embedding for temporal/calendar features. Transforms time-based features (e.g., hour, day, month) into embeddings using a single linear projection without bias. This embedding is typically used to incorporate calendar information into transformer models, providing absolute temporal context that complements positional encodings.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input_size | int | Number of input temporal features (e.g., 5 for month, day, weekday, hour, minute). | required |
| hidden_size | int | Dimension of the output embeddings, matching the model's hidden dimension. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Time feature embeddings of shape [batch, seq_len, hidden_size]. |
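The documented projection amounts to a single bias-free linear map:

```python
import torch
import torch.nn as nn

embed = nn.Linear(5, 64, bias=False)   # 5 features: month, day, weekday, hour, minute
x_mark = torch.randn(2, 96, 5)         # [batch, seq_len, input_size]
print(embed(x_mark).shape)             # torch.Size([2, 96, 64])
```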

PositionalEmbedding

PositionalEmbedding(hidden_size, max_len=5000)
Bases: Module

Sinusoidal positional embedding for transformer models. Generates fixed sinusoidal positional encodings using sine and cosine functions at different frequencies. These encodings provide position information to transformer models, allowing them to understand the relative or absolute position of tokens in a sequence. The encodings are precomputed and stored as a buffer, making them non-trainable.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| hidden_size | int | Dimension of the model's hidden states. Must be even for proper sine/cosine pairing. | required |
| max_len | int | Maximum sequence length to precompute encodings for. | 5000 |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Positional encodings of shape [1, seq_len, hidden_size], where seq_len is the length of the input sequence. |
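The standard sinusoidal table this describes, precomputed once (a sketch following the usual Transformer construction):

```python
import math
import torch

hidden_size, max_len = 64, 5000
position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, hidden_size, 2, dtype=torch.float)
                     * -(math.log(10000.0) / hidden_size))
pe = torch.zeros(max_len, hidden_size)
pe[:, 0::2] = torch.sin(position * div_term)   # even dims: sine
pe[:, 1::2] = torch.cos(position * div_term)   # odd dims: cosine
pe = pe.unsqueeze(0)                           # [1, max_len, hidden_size]
# forward would slice pe[:, :seq_len] and keep pe as a non-trainable buffer
```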

SeriesDecomp

SeriesDecomp(kernel_size)
Bases: Module

Series decomposition block for trend-residual decomposition. Decomposes a time series into trend and residual components using moving average filtering. The trend is extracted via a moving average filter, and the residual is computed as the difference between the input and the trend.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| kernel_size | int | Size of the moving average window for trend extraction. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Residual component of shape [batch, seq_len, channels], computed as the input minus the trend. |
| Tensor | Trend component of shape [batch, seq_len, channels], extracted using the moving average filter. |

MovingAvg

MovingAvg(kernel_size, stride)
Bases: Module

Moving average block to highlight the trend of a time series. Applies a moving average filter using 1D average pooling to smooth time series data and extract trend components. The input is padded on both ends by repeating the first and last values to maintain the original sequence length.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| kernel_size | int | Size of the moving average window. | required |
| stride | int | Stride for the average pooling operation. | required |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Smoothed time series of shape [batch, seq_len, channels], representing the trend component after applying the moving average. |
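A sketch of MovingAvg and SeriesDecomp together: edge-replication padding keeps seq_len intact, average pooling extracts the trend, and the residual is what remains (illustrative, assuming an odd kernel_size and stride=1):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 48, 7)                     # [batch, seq_len, channels]
kernel_size = 25                              # odd window, stride 1

# MovingAvg: pad both ends by repeating edge values, then average-pool
pad = (kernel_size - 1) // 2
front = x[:, :1, :].repeat(1, pad, 1)
end = x[:, -1:, :].repeat(1, pad, 1)
padded = torch.cat([front, x, end], dim=1)    # [batch, seq_len + 2*pad, channels]
trend = F.avg_pool1d(padded.permute(0, 2, 1), kernel_size, stride=1).permute(0, 2, 1)

# SeriesDecomp: residual = input - trend
residual = x - trend
print(trend.shape, residual.shape)            # both torch.Size([2, 48, 7])
```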

RevIN

RevIN(
    num_features, eps=1e-05, affine=False, subtract_last=False, non_norm=False
)
Bases: Module

Reversible Instance Normalization for time series forecasting. Normalizes time series data by removing the mean (or the last value) and scaling by the standard deviation. The normalization can be reversed after model predictions to restore the original scale. Optionally includes learnable affine parameters for additional transformation flexibility.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| num_features | int | The number of features or channels in the time series. | required |
| eps | float | A value added for numerical stability. | 1e-05 |
| affine | bool | If True, RevIN has learnable affine parameters (weight and bias). | False |
| subtract_last | bool | If True, subtracts the last value instead of the mean during normalization. | False |
| non_norm | bool | If True, no normalization is performed (identity operation). | False |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Normalized tensor (if mode="norm") or denormalized tensor (if mode="denorm") of the same shape as the input, [batch, seq_len, num_features]. |
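A sketch of the two-phase usage the mode argument implies: normalize per instance before the model, then reverse with the stored statistics (the statistics handling shown is illustrative):

```python
import torch

x = torch.randn(4, 36, 3)                 # [batch, seq_len, num_features]
eps = 1e-5

# mode="norm": per-instance statistics over the time dimension
mean = x.mean(dim=1, keepdim=True)
std = torch.sqrt(x.var(dim=1, keepdim=True, unbiased=False) + eps)
x_norm = (x - mean) / std                 # fed to the forecasting model

# ... the model predicts y_norm on the normalized scale ...
y_norm = x_norm                           # stand-in for model output

# mode="denorm": reverse with the stored statistics
y = y_norm * std + mean
torch.testing.assert_close(y, x)          # round-trips back to the input
```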

RevINMultivariate

RevINMultivariate(
    num_features, eps=1e-05, affine=False, subtract_last=False, non_norm=False
)
Bases: Module

Reversible Instance Normalization for multivariate time series models. Normalizes multivariate time series data using batch statistics computed across the time dimension. The normalization can be reversed after model predictions to restore the original scale. Optionally includes learnable affine parameters for additional transformation flexibility.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| num_features | int | The number of features or channels in the time series. | required |
| eps | float | A value added for numerical stability. | 1e-05 |
| affine | bool | If True, RevINMultivariate has learnable affine parameters (weight and bias). | False |
| subtract_last | bool | Not used in this implementation (kept for API compatibility). | False |
| non_norm | bool | Not used in this implementation (kept for API compatibility). | False |

Returns:

| Type | Description |
|------|-------------|
| Tensor | Normalized tensor (if mode="norm") or denormalized tensor (if mode="denorm") of the same shape as the input, [batch, seq_len, num_features]. |