Emergent Mind

EgoLifter: Open-world 3D Segmentation for Egocentric Perception

(2403.18118)
Published Mar 26, 2024 in cs.CV

Abstract

In this paper, we present EgoLifter, a novel system that can automatically segment scenes captured from egocentric sensors into a complete decomposition of individual 3D objects. The system is specifically designed for egocentric data where scenes contain hundreds of objects captured from natural (non-scanning) motion. EgoLifter adopts 3D Gaussians as the underlying representation of 3D scenes and objects and uses segmentation masks from the Segment Anything Model (SAM) as weak supervision to learn flexible and promptable definitions of object instances free of any specific object taxonomy. To handle the challenge of dynamic objects in egocentric videos, we design a transient prediction module that learns to filter out dynamic objects in the 3D reconstruction. The result is a fully automatic pipeline that is able to reconstruct 3D object instances as collections of 3D Gaussians that collectively compose the entire scene. We created a new benchmark on the Aria Digital Twin dataset that quantitatively demonstrates EgoLifter's state-of-the-art performance in open-world 3D segmentation from natural egocentric input. We also run EgoLifter on several egocentric activity datasets, demonstrating the method's promise for 3D egocentric perception at scale.

EgoLifter achieves simultaneous 3D reconstruction and open-world segmentation from egocentric videos using 3D Gaussian Splatting.

Overview

  • EgoLifter introduces a novel system for automatically segmenting scenes from egocentric sensors into individual 3D objects, using 3D Gaussians as the scene representation and segmentation masks from the Segment Anything Model (SAM) as weak supervision.

  • The system addresses challenges unique to egocentric data, such as incomplete coverage of scenes and the dynamic nature of objects encountered.

  • EgoLifter's architecture includes innovations like flexible object decomposition, 3D Gaussian representation, and transient object handling to improve reconstruction and segmentation accuracy.

  • It demonstrates state-of-the-art performance in open-world 3D segmentation, with significant implications for AR/VR and autonomous systems, setting a foundation for future advancements in 3D perception.

EgoLifter: Advancing Egocentric Perception through Open-world 3D Segmentation

Introduction to EgoLifter

EgoLifter is a system engineered to automatically segment scenes captured by egocentric sensors into individual 3D objects. This is significant because it caters specifically to the nuances of egocentric data: scenes containing hundreds of objects, captured through natural, non-scanning motion. By representing 3D scenes and objects with 3D Gaussians, and employing segmentation masks from the Segment Anything Model (SAM) as weak supervision, EgoLifter handles the challenge of dynamic objects in egocentric videos with a transient prediction module that filters out these dynamic elements during 3D reconstruction.
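To make the role of the transient prediction module concrete, here is a minimal sketch of how a per-pixel transient probability could down-weight the photometric reconstruction loss. This is an illustrative simplification, not the paper's exact formulation: the function name, the `(1 - P)` weighting, and the regularizer that discourages the trivial "everything is transient" solution are all assumptions.

```python
import numpy as np

def transient_weighted_loss(rendered, observed, transient_prob, reg_weight=0.1):
    """Down-weight photometric error at pixels predicted as transient.

    rendered, observed: (H, W, 3) float arrays in [0, 1]
    transient_prob:     (H, W) per-pixel probability that a pixel shows a
                        dynamic object; hypothetical output of a transient
                        prediction network run on the input frame.
    """
    per_pixel_err = np.mean((rendered - observed) ** 2, axis=-1)  # (H, W)
    # Static pixels (low transient probability) dominate the loss, so the
    # 3D Gaussians are fit to the static parts of the scene.
    weighted = (1.0 - transient_prob) * per_pixel_err
    # Penalize predicting everything as transient (the trivial solution).
    regularizer = reg_weight * np.mean(transient_prob)
    return float(np.mean(weighted) + regularizer)
```

With this weighting, a moving hand that contradicts the static reconstruction contributes little gradient, while stationary regions are reconstructed as usual.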

Challenges in Egocentric 3D Perception

Egocentric perception must confront two main challenges. First, the coverage problem: unlike datasets captured with deliberate scanning motions, egocentric videos may not fully cover the scene, leading to incomplete multi-view observations. Second, the dynamic nature of objects encountered in such videos, which demands robust recognition and reconstruction under frequent human-object interactions.

EgoLifter's Architectural Innovations

EgoLifter employs several key innovations to address these challenges:

  • 3D Gaussian Representation: By adopting 3D Gaussians, EgoLifter captures the geometry of scenes and objects efficiently, facilitating accurate photometric reconstruction.
  • Flexible Object Decomposition: Leveraging SAM for object identification and employing contrastive learning, EgoLifter benefits from a flexible object instance definition without relying on a predefined object taxonomy.
  • Transient Object Handling: The inclusion of a transient prediction network underlines EgoLifter's capability to selectively focus on static parts of the scene, thereby enhancing both the reconstruction and segmentation accuracy of static objects.
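The flexible object decomposition above can be sketched as a pairwise contrastive objective: per-pixel features rendered from the 3D Gaussians are pulled together when two pixels fall in the same SAM mask and pushed apart otherwise. This is a hedged simplification under stated assumptions: the random pair sampling, sigmoid similarity, and function names are illustrative, not the paper's exact loss.

```python
import numpy as np

def contrastive_mask_loss(features, mask_ids, temperature=0.1,
                          n_pairs=256, seed=0):
    """Contrastive loss lifting 2D SAM masks into a 3D feature field.

    features: (N, D) per-pixel feature vectors rendered from the Gaussians
    mask_ids: (N,) integer SAM mask id for each pixel
    Pixels sharing a mask id are treated as positive pairs, pixels from
    different masks as negatives.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(features), n_pairs)
    j = rng.integers(0, len(features), n_pairs)
    # Scaled dot-product similarity between sampled pixel pairs.
    sim = np.sum(features[i] * features[j], axis=-1) / temperature
    same = (mask_ids[i] == mask_ids[j]).astype(float)
    # Binary cross-entropy: similarity should be high iff same mask.
    prob = 1.0 / (1.0 + np.exp(-sim))
    eps = 1e-8
    return float(-np.mean(same * np.log(prob + eps)
                          + (1.0 - same) * np.log(1.0 - prob + eps)))
```

Because supervision comes only from mask membership, no category labels or taxonomy are needed: any region SAM can segment becomes a queryable instance in 3D.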

Prominent Results and Applications

EgoLifter delivers strong quantitative results, particularly in open-world 3D segmentation. By achieving state-of-the-art performance on the Aria Digital Twin benchmark, EgoLifter demonstrates its utility in real-world scenarios. Its ability to decompose 3D scenes into object instances underscores its potential impact on numerous applications, including Augmented Reality (AR) and Virtual Reality (VR).

Practical Implications and Future Directions

The practical implications of EgoLifter span a wide range of domains, from enhancing user interaction in AR/VR environments to improving autonomous systems' understanding of their surroundings. By offering a bridge between 2D understanding and 3D perception, EgoLifter sets the stage for more intuitive and interaction-rich computing experiences. Future developments may focus on refining transient object discrimination, expanding object classification breadth, and optimizing scalability to accommodate increasingly large and complex datasets.

Conclusion

EgoLifter represents a significant step forward in the field of egocentric perception, enabling accurate open-world 3D segmentation and reconstruction from naturally captured egocentric videos. Its innovative approach to handling the dynamic nature of egocentric data, coupled with the flexibility of its object instance definitions, positions it as a promising tool for advancing 3D understanding in both academic and practical contexts.
