RAFT: Recurrent All-Pairs Field Transforms for Optical Flow (2003.12039v3)

Published 26 Mar 2020 in cs.CV

Abstract: We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow. RAFT extracts per-pixel features, builds multi-scale 4D correlation volumes for all pairs of pixels, and iteratively updates a flow field through a recurrent unit that performs lookups on the correlation volumes. RAFT achieves state-of-the-art performance. On KITTI, RAFT achieves an F1-all error of 5.10%, a 16% error reduction from the best published result (6.10%). On Sintel (final pass), RAFT obtains an end-point-error of 2.855 pixels, a 30% error reduction from the best published result (4.098 pixels). In addition, RAFT has strong cross-dataset generalization as well as high efficiency in inference time, training speed, and parameter count. Code is available at https://github.com/princeton-vl/RAFT.

Authors (2)
  1. Zachary Teed (10 papers)
  2. Jia Deng (93 papers)
Citations (2,232)

Summary

  • The paper introduces a novel deep network that employs recurrent updates and all-pairs correlation volumes to estimate optical flow with high precision.
  • It maintains a single high-resolution flow field throughout iterative refinement, leading to improved accuracy on challenging datasets like KITTI and Sintel.
  • The approach reduces parameter count to 2.7M and achieves real-time performance, making it ideal for video analysis and autonomous systems.

RAFT: Recurrent All-Pairs Field Transforms for Optical Flow

Abstract and Introduction

The paper introduces Recurrent All-Pairs Field Transforms (RAFT), a novel deep network architecture for optical flow estimation. Optical flow is the task of estimating per-pixel motion between a pair of video frames; historically it has been approached with hand-crafted optimization objectives. Inherent challenges such as fast-moving objects, large displacements, and occlusions impose significant limitations on these traditional methods. RAFT proposes a different paradigm in which the features and the update rule are learned rather than hand-designed.

Core Components and Methodology

RAFT's architecture can be divided into three primary components:

  1. Feature Encoder: This module extracts per-pixel features from the two input images at 1/8 of the input resolution.
  2. Correlation Layer: This layer builds a multi-scale 4D correlation volume over all pairs of pixels from the feature maps, capturing visual similarity between pixels at multiple scales (see the sketch after this list).
  3. Update Operator: A recurrent GRU-based operator that iteratively refines the flow field via lookups on the correlation volumes.
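
To make the correlation layer concrete, the following is a minimal PyTorch sketch of the all-pairs construction, assuming feature maps of shape (B, D, H, W) from the feature encoder; the function and argument names are illustrative, not the authors' API. Each entry of the level-0 volume is the dot product between a feature vector in frame 1 and one in frame 2, and coarser levels are produced by average-pooling the last two dimensions.

```python
import torch
import torch.nn.functional as F

def all_pairs_correlation(fmap1, fmap2, num_levels=4):
    """Build a pyramid of 4D correlation volumes from two feature maps.

    fmap1, fmap2: (B, D, H, W) feature maps.
    Level 0 has shape (B, H, W, H, W); each further level average-pools
    the last two (frame-2) dimensions by a factor of 2.
    """
    B, D, H, W = fmap1.shape
    f1 = fmap1.view(B, D, H * W)
    f2 = fmap2.view(B, D, H * W)
    # Dot product between every pair of pixels, scaled by sqrt(D)
    corr = torch.einsum('bdm,bdn->bmn', f1, f2) / D**0.5
    pyramid = [corr.view(B, H, W, H, W)]
    for _ in range(num_levels - 1):
        b, h, w, h2, w2 = pyramid[-1].shape
        pooled = F.avg_pool2d(pyramid[-1].reshape(b * h * w, 1, h2, w2), 2)
        pyramid.append(pooled.view(b, h, w, h2 // 2, w2 // 2))
    return pyramid
```

Pooling only the frame-2 dimensions keeps the first image at full (1/8-scale) resolution, which is what lets a single volume serve both small and large displacements.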

The essential innovation lies in operating all three components on a single high-resolution flow field, unlike previous architectures that employ a coarse-to-fine strategy. By maintaining and refining one high-resolution estimate throughout the iterative process, RAFT avoids the error propagation of coarse-to-fine pipelines and achieves higher precision, especially in challenging scenarios involving small, fast-moving objects and large displacements.
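
The lookup step can be sketched as follows: given the current flow estimate, each pixel samples a (2r+1) x (2r+1) window of correlation values around its mapped location at every pyramid level, so a constant-size window covers a progressively larger displacement range at coarser levels. This is a simplified sketch built on the pyramid from the previous example; grid_sample performs the bilinear interpolation.

```python
import torch
import torch.nn.functional as F

def corr_lookup(corr_pyramid, flow, radius=4):
    """Sample correlation features around the current flow estimate.

    corr_pyramid: list of (B, H, W, H_l, W_l) volumes (see earlier sketch).
    flow:         (B, 2, H, W) current flow estimate.
    Returns (B, num_levels * (2*radius+1)**2, H, W) lookup features.
    """
    B, _, H, W = flow.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    coords = torch.stack([xs, ys], dim=0).float()[None] + flow  # (B, 2, H, W)

    d = torch.arange(-radius, radius + 1).float()
    dy, dx = torch.meshgrid(d, d, indexing='ij')
    window = torch.stack([dx, dy], dim=-1).view(1, 2*radius+1, 2*radius+1, 2)

    out = []
    for lvl, corr in enumerate(corr_pyramid):
        _, _, _, Hl, Wl = corr.shape
        # Centers of the lookup windows, in level-l coordinates
        centers = coords.permute(0, 2, 3, 1).reshape(B * H * W, 1, 1, 2) / 2**lvl
        grid = centers + window
        # Normalize sampling locations to [-1, 1] for grid_sample
        gx = 2 * grid[..., 0] / max(Wl - 1, 1) - 1
        gy = 2 * grid[..., 1] / max(Hl - 1, 1) - 1
        vol = corr.reshape(B * H * W, 1, Hl, Wl)
        sampled = F.grid_sample(vol, torch.stack([gx, gy], dim=-1),
                                align_corners=True)
        out.append(sampled.view(B, H, W, -1).permute(0, 3, 1, 2))
    return torch.cat(out, dim=1)
```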

Architectural and Operational Advantages

This design brings several operational advantages. RAFT processes high-resolution 1088x436 video at 10 frames per second on a 1080Ti GPU. The recurrent update operator is notably lightweight, comprising only 2.7M parameters, and can be applied more than 100 times during inference without divergence. This contrasts sharply with approaches such as IRR, which are constrained by their pyramid structure or by parameter count.
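
A minimal version of the GRU cell that keeps the operator this small might look like the following. This is a simplified sketch: RAFT's released code uses separable 1x5/5x1 convolutions, whereas plain 3x3 convolutions and illustrative channel dimensions are used here for brevity.

```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    """Convolutional GRU cell, a simplified stand-in for RAFT's update core."""
    def __init__(self, hidden_dim=128, input_dim=256):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))    # update gate
        r = torch.sigmoid(self.convr(hx))    # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q           # blend old and candidate state
```

Because the same cell, with the same weights, is applied at every iteration, the number of refinement steps can be increased at inference time without adding parameters.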

Numerical Results and Performance Analysis

RAFT demonstrates significant performance improvements over prior methods. On the KITTI benchmark, RAFT achieves an F1-all error of 5.10%, a 16% error reduction from the previous best published result of 6.10%. Similarly, on Sintel (final pass), RAFT records an end-point error (EPE) of 2.855 pixels, a 30% reduction from the previous best published EPE of 4.098 pixels.
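
For reference, the two metrics can be computed as below; this sketch ignores the validity masks that the real benchmarks apply. EPE is the mean L2 distance between predicted and ground-truth flow, and KITTI's F1-all counts a pixel as an outlier when its error exceeds both 3 pixels and 5% of the ground-truth flow magnitude.

```python
import torch

def epe(flow_pred, flow_gt):
    """End-point error: mean L2 distance between predicted and true flow.

    flow_pred, flow_gt: (B, 2, H, W) flow fields.
    """
    return torch.norm(flow_pred - flow_gt, p=2, dim=1).mean()

def f1_all(flow_pred, flow_gt):
    """KITTI F1-all: percentage of pixels whose end-point error exceeds
    both 3 px and 5% of the ground-truth flow magnitude."""
    err = torch.norm(flow_pred - flow_gt, p=2, dim=1)
    mag = torch.norm(flow_gt, p=2, dim=1)
    outliers = (err > 3.0) & (err > 0.05 * mag)
    return 100.0 * outliers.float().mean()
```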

Theoretical and Practical Implications

RAFT's theoretical foundation is grounded in the principles of traditional optimization, enhanced by learned features and learned iterative updates: the update operator acts like the steps of a first-order optimizer, with the correlation volumes used to propose descent directions. This combination, together with the single high-resolution flow field, underpins RAFT's robust and efficient performance.
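
The optimization analogy can be made concrete with a sketch of the refinement loop, where each step emits a residual update f_{k+1} = f_k + Δf. The gru, motion_encoder, and flow_head arguments are hypothetical stand-ins for the corresponding learned modules, and corr_lookup is the earlier sketch; this is not the authors' API.

```python
import torch

def refine_flow(corr_pyramid, context, gru, motion_encoder, flow_head,
                num_iters=32):
    """Iteratively refine a flow field, RAFT-style (hypothetical modules).

    context:        (B, C, H, W) output of the context encoder.
    gru:            recurrent cell, e.g. the ConvGRU sketched above.
    motion_encoder: maps (correlation features, flow) to motion features.
    flow_head:      maps the hidden state to a 2-channel residual flow.
    """
    B, _, H, W = context.shape
    h = torch.tanh(context)                    # initial hidden state
    flow = torch.zeros(B, 2, H, W, device=context.device)
    predictions = []
    for _ in range(num_iters):
        corr_feat = corr_lookup(corr_pyramid, flow)   # earlier sketch
        x = motion_encoder(corr_feat, flow)
        h = gru(h, x)                          # recurrent update
        flow = flow + flow_head(h)             # residual step: f <- f + Δf
        predictions.append(flow)
    return predictions
```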

Practically, RAFT's efficiency in inference time and parameter count signifies a leap forward in practical implementations, enabling the application of optical flow estimation in real-time scenarios such as video analysis, autonomous driving, and robotics.

Ablation Studies and Comparisons

Extensive ablation studies justify the design choices. In particular, tying the update operator's weights across iterations improves generalization and performance, as verified on Sintel and KITTI. The context encoder and multi-scale correlation volumes, despite their simplicity, prove critical to the observed performance gains. Additionally, RAFT's performance on high-resolution videos from the DAVIS dataset showcases its scalability.

Future Directions

Future developments could further optimize the recurrent update operator, potentially integrating more sophisticated recurrent units such as LSTMs for even better convergence properties. Improving generalization across more diverse real and synthetic datasets would further cement RAFT's applicability to varied real-world tasks.

In conclusion, RAFT introduces a methodologically novel and practically efficient solution to the long-standing problem of optical flow estimation, achieving state-of-the-art results through innovative yet effective design choices. The implications of this research are profound, potentially influencing future architectures in optical flow and related computer vision tasks.
