Revisiting 3D ResNets for Video Recognition

Published 3 Sep 2021 in cs.CV, cs.LG, and eess.IV | (2109.01696v1)

Abstract: A recent work from Bello shows that training and scaling strategies may be more significant than model architectures for visual recognition. This short note studies effective training and scaling strategies for video recognition models. We propose a simple scaling strategy for 3D ResNets, in combination with improved training strategies and minor architectural changes. The resulting models, termed 3D ResNet-RS, attain competitive performance of 81.0 on Kinetics-400 and 83.8 on Kinetics-600 without pre-training. When pre-trained on a large Web Video Text dataset, our best model achieves 83.5 and 84.3 on Kinetics-400 and Kinetics-600. The proposed scaling rule is further evaluated in a self-supervised setup using contrastive learning, demonstrating improved performance. Code is available at: https://github.com/tensorflow/models/tree/master/official.


Authors (6)



Summary

The paper demonstrates that refined training methods and scaling strategies, rather than complex architecture modifications, can significantly boost video recognition performance.
It incorporates optimized elements such as the ResNet-D stem, squeeze-and-excitation modules, and self-gating to enhance 3D ResNet models.
Quantitative results show top-1 accuracies of up to 83.8% on the Kinetics benchmarks without pre-training (84.3% with pre-training), supporting the effectiveness of improved augmentation and temporal resolution scaling.

Revisiting 3D ResNets for Video Recognition: An Expert Overview

The paper "Revisiting 3D ResNets for Video Recognition" by Du et al. presents a re-evaluation of 3D ResNet architectures, emphasizing the significance of training and scaling strategies over architectural complexity in the context of video recognition. The authors propose a suite of techniques that enhance the performance of 3D ResNets, resulting in the new models termed 3D ResNet-RS.

Core Contributions

The work builds on Bello et al.’s previous findings, illustrating that improvements in model performance can be achieved through modern training methods rather than drastic architectural changes. Key contributions include:

Enhanced Architectural Elements: The architecture is refined by incorporating the ResNet-D stem and Squeeze-and-Excitation modules and by adding self-gating.
Improved Training and Data Augmentation: Data augmentation, label smoothing, and stochastic depth are employed to boost model robustness. The augmentation strategy is applied uniformly to all frames of a clip, which adapts it to video inputs.
Scaling Strategies: A simple scaling rule is proposed that increases model depth together with the temporal resolution of the video input. The authors demonstrate that increasing the temporal resolution yields larger improvements than scaling up the spatial resolution.
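
One way to read "applied uniformly to all frames" is that the random augmentation parameters are drawn once per clip and then reused for every frame, so the clip stays temporally consistent. The following is a minimal sketch of that idea using a random crop and horizontal flip. The function names, the choice of crop and flip, and the toy clip are illustrative assumptions and are not taken from the paper or its released code.

```python
import random


def sample_clip_params(height, width, crop, rng):
    """Draw ONE set of augmentation parameters for a whole clip (assumed scheme)."""
    top = rng.randint(0, height - crop)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop)
    flip = rng.random() < 0.5
    return top, left, flip


def augment_frame(frame, top, left, crop, flip):
    """Crop a frame (a list of pixel rows) and optionally mirror it horizontally."""
    rows = [row[left:left + crop] for row in frame[top:top + crop]]
    return [row[::-1] for row in rows] if flip else rows


def augment_clip(clip, crop, rng):
    """Apply the same crop and flip to every frame of a clip."""
    height, width = len(clip[0]), len(clip[0][0])
    top, left, flip = sample_clip_params(height, width, crop, rng)
    return [augment_frame(frame, top, left, crop, flip) for frame in clip]


# Toy clip: 4 identical 6x6 frames whose pixel value encodes (row, col).
frame = [[10 * r + c for c in range(6)] for r in range(6)]
clip = [[row[:] for row in frame] for _ in range(4)]
out = augment_clip(clip, crop=4, rng=random.Random(0))
```

Because the parameters are sampled once, identical input frames stay identical after augmentation. Sampling per frame would instead introduce artificial jitter between consecutive frames.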

Quantitative Results

The experiments show that the proposed 3D ResNet-RS models achieve competitive top-1 accuracies of 81.0% on Kinetics-400 and 83.8% on Kinetics-600 when trained from scratch. Pre-training on the larger Web Video Text dataset further improves performance, to 83.5% and 84.3%, respectively.

The models improve on the baseline R3D-50 model by 3.8%, with further gains from scaling up to R3D-RS-200 with 48 input frames. Ablation studies indicate that techniques such as label smoothing and squeeze-and-excitation contribute significantly to these gains.
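
Label smoothing, one of the techniques named in the ablations, can be illustrated with a short sketch: the one-hot target is mixed with a uniform distribution over classes before the cross-entropy is computed. The smoothing value `eps=0.1`, the helper names, and the toy logits are assumptions chosen for illustration. The summary does not state the value the authors used.

```python
import math


def smooth_labels(label, num_classes, eps=0.1):
    """Mix a one-hot target with a uniform distribution (eps is an assumed value)."""
    off = eps / num_classes
    return [off + (1.0 - eps) * (1.0 if k == label else 0.0)
            for k in range(num_classes)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))


logits = [2.0, 0.5, -1.0, 0.0]            # toy scores for 4 classes
hard = smooth_labels(0, 4, eps=0.0)       # plain one-hot target
soft = smooth_labels(0, 4, eps=0.1)       # smoothed target
loss_hard = cross_entropy(logits, hard)
loss_soft = cross_entropy(logits, soft)
```

With the smoothed target, a confident correct prediction still incurs some loss on the other classes, which discourages over-confident logits.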

Broader Implications

This research underscores the potential paradigm shift from complex architectural designs towards leveraging training refinements and scaling techniques for better performance in video recognition tasks. The methodologies proposed may guide future advancements in video action recognition and related applications.

Future Directions

Looking forward, this approach opens several avenues:

Extending these strategies to other domains within computer vision or modalities like audio and text.
Examining the interplay of additional architectural modifications with advanced training techniques.
Investigating the scalability of these methods on more extensive and diverse datasets.

In summary, this paper contributes significantly to the ongoing discussion about the relative importance of architecture versus training techniques in deep learning, specifically within video recognition. The findings provide a robust framework for developing high-performance video models, emphasizing efficiency and practical applicability.
