RAFT-3D: Scene Flow using Rigid-Motion Embeddings

Published 1 Dec 2020 in cs.CV | (2012.00726v2)

Abstract: We address the problem of scene flow: given a pair of stereo or RGB-D video frames, estimate pixelwise 3D motion. We introduce RAFT-3D, a new deep architecture for scene flow. RAFT-3D is based on the RAFT model developed for optical flow but iteratively updates a dense field of pixelwise SE3 motion instead of 2D motion. A key innovation of RAFT-3D is rigid-motion embeddings, which represent a soft grouping of pixels into rigid objects. Integral to rigid-motion embeddings is Dense-SE3, a differentiable layer that enforces geometric consistency of the embeddings. Experiments show that RAFT-3D achieves state-of-the-art performance. On FlyingThings3D, under the two-view evaluation, we improved the best published accuracy (d < 0.05) from 34.3% to 83.7%. On KITTI, we achieve an error of 5.77, outperforming the best published method (6.31), despite using no object instance supervision. Code is available at https://github.com/princeton-vl/RAFT-3D.




Summary

The paper introduces RAFT-3D, which extends RAFT's iterative 2D optical flow updates to a dense field of pixelwise SE3 motion, using rigid-motion embeddings and a differentiable Dense-SE3 layer.
On FlyingThings3D (two-view evaluation), accuracy at d < 0.05 rises from the best published 34.3% to 83.7%; on KITTI, error falls from 6.31 to 5.77.
The method uses no object instance supervision and is differentiable end to end, which is relevant to applications such as autonomous driving, AR, and robotics.

Overview of RAFT-3D: Scene Flow using Rigid-Motion Embeddings

The paper introduces RAFT-3D, a deep architecture that estimates pixelwise 3D motion (scene flow) from a pair of stereo or RGB-D video frames. It builds on RAFT, an optical flow model based on iterative refinement, but iteratively updates a dense field of pixelwise SE3 motion instead of 2D motion. The main advance is the rigid-motion embedding, which represents a soft grouping of pixels into rigidly moving objects.

Central to the architecture is Dense-SE3, a differentiable layer that enforces geometric consistency of the embeddings. Over successive iterations, the model updates the embeddings so that pixels with similar embeddings tend toward the same rigid-body transformation, which refines the scene flow estimate. This lets the network exploit the fact that a scene made up of rigidly moving objects has a piecewise-constant SE3 motion field.
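To make the idea of a soft grouping concrete, the following is a minimal NumPy sketch, not the Dense-SE3 layer itself. It assumes an affinity of the form 2·sigmoid(−‖v_i − v_j‖²) between pixel embeddings, and it uses a plain affinity-weighted average of translation vectors as a stand-in for the geometric optimization over SE3. The function names `embedding_affinity` and `smooth_translations` are hypothetical and chosen only for illustration.

```python
import numpy as np

def embedding_affinity(emb):
    """Pairwise soft-grouping weights for N pixel embeddings of size D.

    Assumed form: a_ij = 2 * sigmoid(-||v_i - v_j||^2), which is 1 for
    identical embeddings and decays toward 0 as they move apart.
    """
    diff = emb[:, None, :] - emb[None, :, :]      # (N, N, D)
    sq_dist = np.sum(diff ** 2, axis=-1)          # (N, N)
    return 2.0 / (1.0 + np.exp(sq_dist))          # equals 2 * sigmoid(-sq_dist)

def smooth_translations(trans, affinity):
    """Illustrative stand-in only: affinity-weighted mean of translations.

    A real layer would optimize over full SE3 motions; this just shows how
    pixels with similar embeddings are pulled toward a shared motion.
    """
    weights = affinity / affinity.sum(axis=1, keepdims=True)
    return weights @ trans

# Two pixels per "object": embeddings cluster, noisy translations get shared.
emb = np.array([[0.0, 0.0], [0.0, 0.0], [10.0, 10.0], [10.0, 10.0]])
trans = np.array([[1.0, 0, 0], [3.0, 0, 0], [0, 5.0, 0], [0, 7.0, 0]])
smoothed = smooth_translations(trans, embedding_affinity(emb))
```

In this toy input the two embedding clusters are far apart, so cross-cluster affinities are effectively zero and each pair of pixels ends up with the mean translation of its own cluster.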

Technical Achievements and Comparisons

RAFT-3D reports state-of-the-art results on standard benchmarks. On FlyingThings3D under the two-view evaluation, it raises the best published accuracy (d < 0.05) from 34.3% to 83.7%. On KITTI, it attains an error of 5.77, compared with 6.31 for the best previously published method. These results are obtained without object instance supervision.
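A threshold accuracy of this kind can be computed as the fraction of pixels whose error falls below the threshold. The sketch below assumes that d is the Euclidean end-point error between predicted and ground-truth 3D motion vectors and that every pixel is valid; the exact evaluation protocol (units, masking) is not specified here, and the function name `threshold_accuracy` is hypothetical.

```python
import numpy as np

def threshold_accuracy(pred, gt, tau=0.05):
    """Fraction of points whose 3D end-point error is below tau.

    pred, gt: arrays of shape (..., 3) holding 3D motion vectors.
    Assumption: the error is the Euclidean norm of pred - gt.
    """
    err = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(err < tau))

# Four points with errors 0.01, 0.04, 0.06 and 1.0: two fall below 0.05.
gt = np.zeros((4, 3))
pred = np.array([[0.01, 0, 0], [0, 0.04, 0], [0, 0, 0.06], [1.0, 0, 0]])
acc = threshold_accuracy(pred, gt)  # 0.5
```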

Methodological Contribution

The methodological contribution of RAFT-3D rests on rigid-motion embeddings and the Dense-SE3 layer. Together they make RAFT-3D an end-to-end differentiable system that softly groups pixels into rigid objects without explicit instance supervision. Because it maintains a dense field of SE3 transformations, RAFT-3D can recover both optical flow and depth change, which together give the 3D motion of each pixel. This contrasts with methods that rely on object detection or segmentation as a separate stage, which typically involves non-differentiable or supervised components.
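The link between a dense SE3 field and the recovered optical flow and depth change can be sketched with standard pinhole geometry. The example below is an illustration under stated assumptions, not the paper's code: it assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, a depth map for the first frame, and a per-pixel rotation `R` and translation `t`. The function name `induced_flow` is hypothetical.

```python
import numpy as np

def induced_flow(depth, R, t, fx, fy, cx, cy):
    """Optical flow, depth change and 3D motion induced by a per-pixel SE3 field.

    depth: (H, W) depth of the first frame
    R:     (H, W, 3, 3) per-pixel rotation matrices
    t:     (H, W, 3) per-pixel translations
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=float), np.arange(H, dtype=float))
    # Back-project every pixel to a 3D point in the camera frame.
    X = np.stack([(u - cx) / fx * depth, (v - cy) / fy * depth, depth], axis=-1)
    # Apply each pixel's own rigid transform.
    X2 = np.einsum("hwij,hwj->hwi", R, X) + t
    # Re-project the moved points into the image.
    u2 = fx * X2[..., 0] / X2[..., 2] + cx
    v2 = fy * X2[..., 1] / X2[..., 2] + cy
    flow = np.stack([u2 - u, v2 - v], axis=-1)   # 2D optical flow in pixels
    depth_change = X2[..., 2] - depth            # change along the optical axis
    scene_flow = X2 - X                          # full 3D motion per pixel
    return flow, depth_change, scene_flow

# A fronto-parallel plane at depth 2 shifted 0.5 units along x.
H, W = 4, 6
depth = np.full((H, W), 2.0)
R = np.broadcast_to(np.eye(3), (H, W, 3, 3))
t = np.broadcast_to(np.array([0.5, 0.0, 0.0]), (H, W, 3))
flow, depth_change, scene_flow = induced_flow(depth, R, t, 100.0, 100.0, 3.0, 2.0)
```

With a focal length of 100 and depth 2, the 0.5-unit sideways shift projects to a uniform horizontal flow of 25 pixels and no depth change, which shows how a single rigid motion explains the 2D flow of every pixel it covers.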

Implications and Future Directions

RAFT-3D is relevant to areas that require precise motion estimation, such as autonomous driving, augmented reality, and robotics. Its ability to deliver state-of-the-art results without object instance supervision is particularly useful in real-world applications, where labeled data is often limited.

Future research could examine how RAFT-3D scales to larger scenes and more diverse datasets. Extending it to scenes that mix rigid and non-rigid motion could broaden its practical applicability. Reducing its runtime and memory usage would also help in deploying the approach on edge devices.

In conclusion, RAFT-3D is a notable step forward in scene flow estimation, bridging 2D optical flow and 3D scene understanding with an iterative, geometrically consistent approach.
