Transformer Hawkes Process (2002.09291v5)

Published 21 Feb 2020 in cs.LG and stat.ML

Abstract: Modern data acquisition routinely produce massive amounts of event sequence data in various domains, such as social media, healthcare, and financial markets. These data often exhibit complicated short-term and long-term temporal dependencies. However, most of the existing recurrent neural network based point process models fail to capture such dependencies, and yield unreliable prediction performance. To address this issue, we propose a Transformer Hawkes Process (THP) model, which leverages the self-attention mechanism to capture long-term dependencies and meanwhile enjoys computational efficiency. Numerical experiments on various datasets show that THP outperforms existing models in terms of both likelihood and event prediction accuracy by a notable margin. Moreover, THP is quite general and can incorporate additional structural knowledge. We provide a concrete example, where THP achieves improved prediction performance for learning multiple point processes when incorporating their relational information.

Citations (258)

Summary

  • The paper introduces the Transformer Hawkes Process to model continuous-time events by leveraging self-attention for capturing long-range dependencies.
  • It employs learned event embeddings combined with temporal encoding and causal masked multi-head attention to compute a continuous-time intensity function.
  • Empirical results show that THP outperforms RNN-based methods, achieving higher log-likelihood and prediction accuracy with greater computational efficiency.

The Transformer Hawkes Process (THP) (Transformer Hawkes Process, 2020) is a point process model that adapts the Transformer architecture to modeling continuous-time event sequences, addressing the limitations of prior recurrent neural network (RNN) based approaches such as the Neural Hawkes Process (NHP), particularly with respect to long-term dependencies and computational efficiency.

Model Architecture and Mechanism

THP models the conditional intensity function $\lambda(t|\mathcal{H}_t)$, where $\mathcal{H}_t$ denotes the event history up to time $t$. Unlike traditional Hawkes processes, which rely on predefined kernel functions, or RNN-based models, which process events sequentially, THP leverages the self-attention mechanism to capture intricate dependencies between events regardless of their temporal distance.

The core input to the Transformer architecture in THP consists of embeddings for each historical event $(t_j, k_j)$, where $t_j$ is the event time and $k_j$ is the event type. These embeddings are a combination of learned event type embeddings and a temporal encoding function. The temporal encoding is crucial because events occur at continuous times, not at discrete positions like tokens in natural language. This encoding is typically a deterministic function of the absolute time $t_j$, analogous to the positional encoding in the original Transformer.
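
As an illustration, the following is a minimal PyTorch sketch of such a temporal encoding, applying sinusoidal functions of fixed frequencies directly to continuous timestamps; it is one plausible realization of the TimeEmbedding module referenced in the implementation sketch further below, not the authors' exact code:

import torch
import torch.nn as nn

class TimeEmbedding(nn.Module):
    """Sinusoidal encoding of continuous event times (illustrative sketch)."""
    def __init__(self, embed_dim):
        super().__init__()
        # Fixed frequencies in the spirit of the original Transformer's positional
        # encoding, applied to real-valued timestamps instead of integer positions.
        freqs = torch.pow(10000.0, torch.arange(0, embed_dim, 2).float() / embed_dim)
        self.register_buffer("freqs", freqs)

    def forward(self, event_times):
        # event_times: (batch_size, seq_len) real-valued timestamps
        scaled = event_times.unsqueeze(-1) / self.freqs                       # (batch, seq_len, embed_dim / 2)
        return torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)      # (batch, seq_len, embed_dim)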

The self-attention mechanism computes attention scores between all pairs of events in the history. For an event at time $t_j$, its representation is a weighted sum of the representations of all preceding events at times $t_i < t_j$, where the weights are derived from the attention scores. This direct connection, in contrast to the sequential information flow in RNNs, enables the model to attend to any relevant past event, facilitating the capture of long-range dependencies. A causal mask is applied to the attention scores to prevent attending to future events. Multiple self-attention heads allow the model to capture different types of dependencies concurrently. The output of the self-attention layers is passed through a position-wise feed-forward network to generate rich hidden representations $\mathbf{h}(t_j)$ for each event.

To define a continuous-time intensity function based on these discrete-time hidden representations, THP proposes the following intensity for event type $k$ at time $t \in [t_j, t_{j+1})$, given the history up to the last event at $t_j$:

$$\lambda_k(t|\mathcal{H}_t) = f_k \Big( \alpha_k \frac{t - t_j}{t_j} + \mathbf{w}_k^\top \mathbf{h}(t_j) + b_k \Big)$$

where $f_k$ is a non-negative activation function (typically the softplus), $\alpha_k$ is a learned parameter scaling the influence of the elapsed time since the last event, $\mathbf{w}_k$ is a learned weight vector, and $b_k$ is a bias term. This formulation ensures the intensity is influenced by the historical context $\mathbf{h}(t_j)$ derived from the Transformer and interpolates between event times.
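
As a concrete illustration, the following is a minimal PyTorch sketch of this intensity computation; it is one plausible realization of the IntensityFunction module referenced in the implementation sketch further below (names and signatures are assumptions, not the authors' code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class IntensityFunction(nn.Module):
    """Per-type intensity lambda_k(t | H_t) from the hidden state of the last event (sketch)."""
    def __init__(self, num_event_types, embed_dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((num_event_types,), -0.1))   # alpha_k
        self.linear = nn.Linear(embed_dim, num_event_types)               # w_k^T h(t_j) + b_k

    def forward(self, hidden_last, t, t_last):
        # hidden_last: (..., embed_dim) hidden state h(t_j) of the most recent event
        # t, t_last:   (...) query time and time of the most recent event
        elapsed = (t - t_last) / t_last.clamp(min=1e-8)                    # (t - t_j) / t_j
        pre_activation = self.alpha * elapsed.unsqueeze(-1) + self.linear(hidden_last)
        return F.softplus(pre_activation)                                  # non-negative, one value per event type

Helper routines for evaluating the intensity at the observed event times and for numerically integrating it over the observation window (both used in the loss below) can be built on top of this forward pass.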

Implementation and Training

THP is trained by maximizing the log-likelihood of the observed event sequences, which is a standard objective for point processes. The log-likelihood function involves an integral of the intensity function over intervals where no events occur. Due to the non-linearity of the softplus function in the intensity, this integral typically does not have a closed-form solution. Practical implementations often resort to numerical approximation methods, such as Monte Carlo integration or the trapezoidal rule, to estimate the integral term during training.
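
Concretely, for a sequence $\{(t_j, k_j)\}_{j=1}^{L}$ observed on an interval $[0, T]$ with $K$ event types, the objective takes the standard point-process form

$$\ell = \sum_{j=1}^{L} \log \lambda_{k_j}(t_j|\mathcal{H}_{t_j}) - \int_0^T \lambda(t|\mathcal{H}_t)\, dt, \qquad \lambda(t|\mathcal{H}_t) = \sum_{k=1}^{K} \lambda_k(t|\mathcal{H}_t),$$

where the second term is the non-event integral discussed above.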

Optimization is typically performed with gradient-based methods such as Adam. Layer normalization and residual connections, standard in Transformer architectures, are employed to facilitate stable training of deep models.

From an implementation perspective, building a THP model requires defining:

  1. Event Embedding Layer: Combines learned event type embeddings and temporal encodings.
  2. Transformer Encoder Layer(s): Comprising multi-head self-attention with causal masking and position-wise feed-forward networks.
  3. Intensity Function Module: Computes the intensity $\lambda_k(t|\mathcal{H}_t)$ based on the output of the Transformer encoder and the time elapsed since the last event.
  4. Log-Likelihood Loss Function: Calculates the log-likelihood, incorporating numerical integration for the integral term.

A typical implementation sketch using a deep learning library such as PyTorch might look as follows; TimeEmbedding and IntensityFunction refer to custom modules along the lines of the sketches above, and TransformerLayer stands in for a standard block of causally masked multi-head self-attention plus a position-wise feed-forward network:

import torch
import torch.nn as nn

class TransformerHawkesProcess(nn.Module):
    def __init__(self, num_event_types, embed_dim, num_heads, num_layers):
        super().__init__()
        self.event_type_embedding = nn.Embedding(num_event_types, embed_dim)
        self.temporal_encoding = TimeEmbedding(embed_dim)  # custom temporal encoding module (see sketch above)
        self.transformer_layers = nn.ModuleList([
            TransformerLayer(embed_dim, num_heads)  # masked multi-head attention + feed-forward block
            for _ in range(num_layers)
        ])
        self.intensity_func = IntensityFunction(num_event_types, embed_dim)  # custom intensity module (see sketch above)

    def forward(self, event_types, event_times):
        # event_types: tensor of event types (batch_size, seq_len)
        # event_times: tensor of event times (batch_size, seq_len)

        type_embed = self.event_type_embedding(event_types)
        time_embed = self.temporal_encoding(event_times)
        embeddings = type_embed + time_embed

        # Causal mask: event j may only attend to events i with i <= j.
        seq_len = event_times.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, device=event_times.device), diagonal=1
        ).bool()
        # In a real implementation, a padding mask must also be combined with this mask.

        hidden_states = embeddings
        for layer in self.transformer_layers:
            hidden_states = layer(hidden_states, mask=causal_mask)

        # Return the hidden state h(t_j) of every event; the intensity on
        # [t_j, t_{j+1}) is computed from the corresponding h(t_j).
        return hidden_states  # (batch_size, seq_len, embed_dim)

def thp_log_likelihood(model, event_types, event_times, observation_interval):
    hidden_states = model(event_types, event_times)

    # lambda_{k_j}(t_j | H_{t_j}) for each observed event (placeholder helper method).
    intensities_at_events = model.intensity_func.compute_intensity_at_times(
        hidden_states, event_times, event_types
    )

    # Integral of the total intensity over [0, T], approximated numerically (placeholder helper method).
    integral_of_intensity = model.intensity_func.integrate_intensity(
        hidden_states, event_times, observation_interval
    )

    log_likelihood = torch.sum(torch.log(intensities_at_events)) - integral_of_intensity
    return log_likelihood

The numerical integration step is a critical computational consideration. Monte Carlo integration can have high variance and requires sampling many points. Numerical rules like the trapezoidal rule are biased but can be faster. The choice depends on the desired trade-off between accuracy and computational cost.
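
As an illustration, the non-event integral can be approximated interval by interval with Monte Carlo sampling. The sketch below assumes an intensity module like the one outlined earlier, i.e. one that maps a hidden state, a query time, and the last event time to per-type intensities; it ignores padding for brevity and is not the authors' implementation:

import torch

def monte_carlo_integral(intensity_func, hidden_states, event_times, num_samples=20):
    """Approximate sum_j of the integral of lambda(t|H_t) over [t_j, t_{j+1}) by Monte Carlo (sketch).

    hidden_states: (batch, seq_len, embed_dim), event_times: (batch, seq_len).
    """
    t_start = event_times[:, :-1]                 # t_j
    t_end = event_times[:, 1:]                    # t_{j+1}
    widths = t_end - t_start                      # inter-event interval lengths
    total = torch.zeros_like(t_start)
    for _ in range(num_samples):
        # Uniformly sample one query time per interval.
        t_sample = t_start + torch.rand_like(t_start) * widths
        lam = intensity_func(
            hidden_states[:, :-1, :],             # h(t_j) governing [t_j, t_{j+1})
            t_sample,
            t_start,
        ).sum(dim=-1)                             # total intensity over event types
        total = total + lam
    # Average over samples, scale by interval widths, sum over intervals and batch.
    return ((total / num_samples) * widths).sum()

A trapezoidal-rule variant would instead evaluate the intensity at fixed grid points within each interval and weight the evaluations accordingly.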

Performance and Advantages

Empirical results presented in the paper (Transformer Hawkes Process, 2020) demonstrate that THP outperforms existing models, including RNN-based methods like NHP, in terms of both log-likelihood and event prediction accuracy across various datasets. This performance gain is attributed primarily to the self-attention mechanism's ability to effectively capture long-range temporal dependencies, which is a known challenge for RNNs due to the vanishing gradient problem and sequential processing.

A key advantage of THP is its computational efficiency during training. The parallelizable nature of the self-attention computation contrasts with the sequential state updates in RNNs, allowing THP to leverage modern hardware accelerators like GPUs more effectively. The paper reports faster training times for THP compared to NHP, even achieving better performance with fewer parameters in some cases. This suggests better parameter efficiency and scalability.

Furthermore, the non-recurrent structure of the Transformer, coupled with architectural elements like layer normalization and residual connections, facilitates the training of deeper models compared to deep RNNs, potentially enabling THP to learn more complex temporal patterns.

Extensions and Applications

The THP framework is general and can be extended to incorporate additional structural information. An example provided in (Transformer Hawkes Process, 2020) is the Structured-THP (THP-S), designed for modeling multiple point processes on a graph (e.g., events occurring at different locations or involving different entities with known relationships).

THP-S extends the event embedding to include a learned vertex embedding and modifies the attention mechanism to incorporate graph structure. The attention score between events ii and jj is influenced by the learned similarity between their respective vertices. A regularization term is added to the objective function to encourage learned vertex similarities to align with the known graph structure. This allows the model to leverage relational information to improve event sequence modeling and prediction.
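
As an illustration only, the sketch below shows one plausible way to couple learned vertex embeddings with a known adjacency matrix through a regularization term added to the loss; the exact parameterization and regularizer used in the paper may differ:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VertexSimilarity(nn.Module):
    """Learned vertex embeddings with a graph-alignment regularizer (illustrative sketch)."""
    def __init__(self, num_vertices, embed_dim):
        super().__init__()
        self.vertex_embedding = nn.Embedding(num_vertices, embed_dim)

    def similarity(self):
        v = self.vertex_embedding.weight              # (num_vertices, embed_dim)
        return torch.sigmoid(v @ v.t())               # pairwise similarities in (0, 1)

    def graph_regularizer(self, adjacency):
        # Penalize disagreement between learned similarities and the known graph;
        # adjacency: (num_vertices, num_vertices) binary matrix.
        return F.binary_cross_entropy(self.similarity(), adjacency.float())

The resulting similarities can then also modulate the attention scores between events occurring at the corresponding vertices, as described above.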

This structured extension highlights the adaptability of THP. It can be applied to various domains where event sequences are associated with entities having known relationships, such as:

  • Social Networks: Modeling user activity sequences with consideration for social connections.
  • Urban Systems: Predicting events (e.g., traffic incidents, emergency calls) across different locations with geographical or functional relationships.
  • Financial Markets: Analyzing trading activities involving related assets.

The practical implementation of Structured-THP involves integrating a graph representation into the model architecture, modifying the attention mechanism, and incorporating graph-based regularization during training.

Conclusion

The Transformer Hawkes Process represents a significant advancement in modeling continuous-time event sequences by leveraging the strengths of the Transformer architecture. Its ability to effectively capture long-range dependencies, coupled with improved computational efficiency compared to RNN-based methods, makes it a powerful tool for analyzing and predicting complex event data in diverse domains. The framework's generality also allows for the incorporation of external structural information, further enhancing its modeling capabilities.
