Multiview Transformers for Video Recognition

Published 12 Jan 2022 in cs.CV and cs.LG | (2201.04288v4)

Abstract: Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state-of-the-art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on six standard datasets, and improve even further with large-scale pretraining. Code and checkpoints are available at: https://github.com/google-research/scenic/tree/main/scenic/projects/mtv.

Summary

The paper presents MTV, which tokenizes a video into multiple views and fuses them through lateral connections between separate transformer encoders.
Experiments on datasets such as Kinetics, Epic-Kitchens, and Something-Something V2 show higher accuracy at lower computational cost than prior models.
The study finds Cross-View Attention particularly effective for fusing views, with potential applications in dynamic environments such as autonomous driving and surveillance.

Multiview Transformers for Video Recognition

The paper "Multiview Transformers for Video Recognition" introduces MTV, a transformer architecture that models video at several spatiotemporal resolutions by using a separate encoder for each view of the input. Its key contribution is the set of lateral connections among these encoders, which fuse information across views so that short fine-grained motions and events spanning longer durations can inform each other.

Model Architecture

MTV builds upon the ViViT model by introducing multiview tokenization. The model extracts tokens spanning different temporal durations; each set of tokens is one view, and together the views form a multiscale representation of the video. Each view is processed by its own transformer encoder, whose capacity is chosen to suit that view. Lateral connections between the encoders fuse information across views. Unlike previous pyramid-based approaches, this design processes multiscale context directly, without subsampling.
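The tokenization step can be sketched in a few lines of NumPy. This is only an illustration, not the authors' implementation: the clip size, the two tubelet sizes, and the names `tubelet_tokens`, `view_sizes`, and `views` are all assumptions made for the example, and the learned linear projection that would normally follow is left out.

```python
import numpy as np


def tubelet_tokens(video, t, h, w):
    """Split a (T, H, W, C) clip into non-overlapping t x h x w tubelets.

    Returns an array of shape (num_tokens, t * h * w * C), one flattened
    tubelet per row. The learned linear projection is omitted here.
    """
    T, H, W, C = video.shape
    if T % t or H % h or W % w:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    # Expose the tubelet grid: (T/t, t, H/h, h, W/w, w, C).
    grid = video.reshape(T // t, t, H // h, h, W // w, w, C)
    # Bring the three grid axes to the front, then flatten each tubelet.
    grid = grid.transpose(0, 2, 4, 1, 3, 5, 6)
    return grid.reshape(-1, t * h * w * C)


# Hypothetical clip: 8 frames of 32 x 32 RGB.
video = np.arange(8 * 32 * 32 * 3, dtype=np.float32).reshape(8, 32, 32, 3)

# Hypothetical views that differ only in temporal extent.
view_sizes = {"fine": (2, 16, 16), "coarse": (4, 16, 16)}
views = {name: tubelet_tokens(video, *size) for name, size in view_sizes.items()}

for name, tokens in views.items():
    print(name, tokens.shape)
```

With these sizes the "fine" view yields 16 tokens and the "coarse" view 8, so the view covering longer durations is also the cheaper one to encode.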

Cross-View Fusion

The paper explores several methods for cross-view fusion within the transformer architecture. It finds Cross-View Attention (CVA) particularly effective: tokens in one view attend to tokens in another, which transfers information between resolutions. The architecture retains fine-grained temporal detail while the views are still encoded in parallel.
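A single-head version of this kind of fusion might look like the sketch below. It is a generic cross-attention layer under stated assumptions, not the paper's exact formulation: the direction of fusion (a coarse view querying a fine view), the residual connection, the token counts, the hidden sizes, and all weight names are chosen for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_view_attention(target, source, w_q, w_k, w_v, w_out):
    """Update `target` tokens with information attended from `source` tokens.

    target: (n_target, d_target) tokens of the view being updated.
    source: (n_source, d_source) tokens of the view supplying information.
    The views may have different widths; the projections map both into a
    shared attention width, and w_out maps back to d_target.
    """
    q = target @ w_q                                   # (n_target, d_attn)
    k = source @ w_k                                   # (n_source, d_attn)
    v = source @ w_v                                   # (n_source, d_attn)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_target, n_source)
    # Residual connection (an assumption of this sketch).
    return target + (weights @ v) @ w_out


rng = np.random.default_rng(0)
coarse = rng.normal(size=(8, 32))   # hypothetical: 8 tokens, width 32
fine = rng.normal(size=(16, 64))    # hypothetical: 16 tokens, width 64
d_attn = 16

w_q = rng.normal(size=(32, d_attn)) * 0.1
w_k = rng.normal(size=(64, d_attn)) * 0.1
w_v = rng.normal(size=(64, d_attn)) * 0.1
w_out = rng.normal(size=(d_attn, 32)) * 0.1

fused = cross_view_attention(coarse, fine, w_q, w_k, w_v, w_out)
print(fused.shape)
```

The output keeps the shape of the target view, so a layer like this can sit between two encoders without changing what either one expects as input.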

Experimental Validation

The model is evaluated on six datasets: Kinetics 400, 600, and 700, Moments in Time, Epic-Kitchens-100, and Something-Something V2. It achieves state-of-the-art accuracy across these datasets and a better accuracy/computation trade-off than existing methods such as ViViT and SlowFast. On Kinetics 400, for instance, MTV outperforms the baselines at substantially lower computational cost.

Results and Implications

The gains in accuracy and efficiency hold across model sizes, from the "Small" to the "Huge" variant. Performance improves further with large-scale pretraining on datasets such as JFT and Weak Textual Supervision (WTS).

Because MTV models several temporal resolutions at once, it is well suited to applications involving complex, dynamic scenes. The approach could support more capable video understanding systems in fields such as autonomous driving, surveillance, and human-computer interaction.

Future Directions

The paper also highlights limitations and directions for future work, including reducing the reliance on large-scale pretraining and extending the approach to other multiscale transformer architectures such as MViT and Swin. These directions could improve both the efficiency and the applicability of multiview video models.
