
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers (2106.05392v2)

Published 9 Jun 2021 in cs.CV

Abstract: In video transformers, the time dimension is often treated in the same way as the two spatial dimensions. However, in a scene where objects or the camera may move, a physical point imaged at one location in frame $t$ may be entirely unrelated to what is found at that location in frame $t+k$. These temporal correspondences should be modeled to facilitate learning about dynamic scenes. To this end, we propose a new drop-in block for video transformers -- trajectory attention -- that aggregates information along implicitly determined motion paths. We additionally propose a new method to address the quadratic dependence of computation and memory on the input size, which is particularly important for high resolution or long videos. While these ideas are useful in a range of settings, we apply them to the specific task of video action recognition with a transformer model and obtain state-of-the-art results on the Kinetics, Something--Something V2, and Epic-Kitchens datasets. Code and models are available at: https://github.com/facebookresearch/Motionformer

Citations (247)

Summary

  • The paper introduces trajectory attention, which aggregates information along implicitly determined motion paths to improve temporal reasoning on video data.
  • It proposes the Orthoformer algorithm to approximate self-attention, reducing computational and memory requirements.
  • Empirical results on benchmarks like Kinetics and Something–Something V2 show up to a 2% top-1 accuracy improvement.

Analyzing Trajectory Attention in Video Transformers

This paper introduces an innovative attention mechanism for video transformers, termed trajectory attention, which has been shown to significantly enhance performance on video recognition tasks. Traditional video transformers handle the spatial and temporal dimensions uniformly; however, because objects and cameras in a dynamic scene move, a spatial point in one frame may be entirely unrelated to the same location in subsequent frames. Trajectory attention addresses this by aggregating information along motion paths that are determined implicitly within the scene.

Key Contributions

The authors' primary contribution is the trajectory attention mechanism, which dynamically follows motion paths throughout a video sequence. This distinguishes it from previous methods that pool features over the entire space-time volume, or independently along each axis, and therefore lack a motion-centered inductive bias. The work posits that pooling along motion trajectories yields a more coherent aggregation of information, accounting for object dynamics while reducing the influence of camera motion.
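
To make the two-stage aggregation concrete, below is a minimal, single-head PyTorch sketch of the idea: spatial attention within each frame produces one trajectory token per frame for every query, and a second attention pools those tokens over time. The tensor layout, the reuse of the original queries in the temporal stage, and the absence of batching and multi-head projections are simplifying assumptions for illustration, not the Motionformer implementation.

```python
import torch


def trajectory_attention(q, k, v, S, T, scale):
    """Simplified, single-head sketch of trajectory attention.

    q, k, v: (N, D) tensors, where N = S * T space-time tokens,
             ordered frame-major (all locations of frame 0, then frame 1, ...).
    S, T:    number of spatial locations and number of frames.
    scale:   typically 1 / sqrt(D).

    Stage 1: each query attends over space separately within every frame,
             producing one "trajectory token" per frame.
    Stage 2: a second attention pools the trajectory tokens over time,
             so information is aggregated along the implied motion path.
    """
    N, D = q.shape

    # Stage 1: per-frame spatial attention logits for every query, shaped (N, T, S)
    logits = (q @ k.t()).view(N, T, S) * scale
    spatial_attn = logits.softmax(dim=-1)  # normalize within each frame
    traj = torch.einsum('nts,tsd->ntd', spatial_attn, v.view(T, S, D))  # (N, T, D)

    # Stage 2: temporal pooling along the trajectory
    # (the paper re-projects the trajectory tokens before this step;
    #  this sketch reuses the original queries for brevity)
    temporal_logits = torch.einsum('nd,ntd->nt', q, traj) * scale
    temporal_attn = temporal_logits.softmax(dim=-1)
    return torch.einsum('nt,ntd->nd', temporal_attn, traj)  # (N, D)
```

In the paper, the trajectory tokens are re-projected with fresh query/key projections before the temporal pooling; the sketch skips that step to keep the two stages easy to see.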

The paper also introduces an approximation scheme to manage the quadratic computational and memory complexity of self-attention in transformers. This is achieved through an algorithm named Orthoformer, inspired by the Nyström method, which approximates self-attention in a way that reduces both FLOPs and memory requirements, enabling effective training on high-resolution or long videos without prohibitive resource demands.
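
As a rough illustration of how a landmark-based approximation cuts the cost from O(N²) to O(N·R), here is a hedged sketch: a small set of approximately orthogonal queries is selected greedily and used to mediate attention through two thin maps. The greedy selection rule, normalization, and composition shown here are assumptions in the spirit of Orthoformer rather than a faithful reimplementation.

```python
import torch
import torch.nn.functional as F


def landmark_attention(q, k, v, num_landmarks, scale):
    """Sketch of a landmark-based approximation to self-attention.

    q, k, v: (N, D) tensors; num_landmarks: R << N.
    Instead of the full (N x N) attention map, R landmark queries mediate
    the interaction, so cost scales as O(N * R) rather than O(N^2).
    """
    N, D = q.shape
    qn = F.normalize(q, dim=-1)

    # Greedily pick approximately orthogonal (mutually dissimilar) queries
    idx = [0]
    for _ in range(num_landmarks - 1):
        sims = (qn @ qn[idx].t()).abs().max(dim=-1).values  # similarity to chosen set
        sims[idx] = float('inf')                            # never re-pick a landmark
        idx.append(int(sims.argmin()))
    landmarks = q[idx]                                      # (R, D)

    # Two thin attention maps replace the full N x N map
    q_to_l = ((q @ landmarks.t()) * scale).softmax(dim=-1)  # (N, R)
    l_to_k = ((landmarks @ k.t()) * scale).softmax(dim=-1)  # (R, N)
    return q_to_l @ (l_to_k @ v)                            # (N, D)
```

The key point is that only (N × R) and (R × N) matrices are ever materialized, which is what makes high-resolution or long-video inputs tractable.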

Performance and Numerical Results

The proposed model, Motionformer, achieves state-of-the-art results across notable benchmarks, including the Kinetics, Something–Something V2, and Epic-Kitchens datasets. Empirically, trajectory attention yields performance gains especially on datasets such as Something–Something V2, which lean heavily on motion cues. In direct comparisons, the design outperforms existing video transformer architectures that employ joint space-time or divided space-time attention, improving top-1 accuracy by up to 2% on these datasets and underscoring its efficacy at capturing fine-grained motion details.

Architectural Innovations and Implications

The trajectory attention model reshapes the standard attention mechanism in transformers to focus explicitly on video dynamics. By structuring attention over motion paths, the framework naturally aligns with the underlying video structure, leading to enhanced temporal reasoning and reduced sensitivity to irrelevant spatial data.
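
Since the paper describes trajectory attention as a drop-in block, the following sketch shows how such a block could slot into a standard pre-norm transformer layer, reusing the `trajectory_attention` sketch above. The projection layout, MLP ratio, and lack of batching and multiple heads are assumptions for illustration, not the Motionformer architecture.

```python
import torch.nn as nn

# Assumes the `trajectory_attention` sketch defined earlier is in scope.


class TrajectoryBlock(nn.Module):
    """Hedged sketch: trajectory attention dropped into a pre-norm transformer layer."""

    def __init__(self, dim, S, T, mlp_ratio=4):
        super().__init__()
        self.S, self.T = S, T
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):  # x: (N, dim) with N = S * T space-time tokens
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        attn = trajectory_attention(q, k, v, self.S, self.T,
                                    scale=q.shape[-1] ** -0.5)
        x = x + self.proj(attn)          # residual over the attention block
        x = x + self.mlp(self.norm2(x))  # residual over the MLP
        return x
```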

The implications of this approach are multifaceted. Practically, it enables video models to process dynamic scenes more accurately and efficiently, which is particularly valuable in domains requiring temporal reasoning such as surveillance, autonomous navigation, and interactive media. Theoretically, it opens avenues for attention mechanisms tailored to other temporal domains, such as time-series analysis in financial markets or biological signal processing.

Future Directions

The work points to promising future research avenues, notably the application of trajectory attention to tasks beyond video action classification. Opportunities exist in extending the model to object tracking, temporal localization, and online detection, where understanding object movement over time is critical. Furthermore, the proposed attention approximation might be optimized further for even more efficient processing.

In conclusion, this paper makes substantial strides in the design of attention mechanisms for video transformers, offering a model that not only excels on empirical benchmarks but also provides a robust framework for handling temporal structure in video data. By advancing attention approximations, the work counters the computational limits traditionally associated with large-scale video processing, setting a strong precedent for future developments in the domain.