- The paper introduces a novel system that leverages 3D Gaussian representations and SAM-based segmentation masks to reconstruct egocentric scenes.
- It employs flexible object decomposition and contrastive learning to cope with the incomplete scene coverage typical of egocentric videos.
- The method incorporates a transient prediction network to filter dynamic objects, enhancing segmentation and reconstruction accuracy on real-world datasets.
EgoLifter: Advancing Egocentric Perception through Open-world 3D Segmentation
Introduction to EgoLifter
EgoLifter is a system that automatically segments scenes captured by egocentric sensors into individual 3D objects. This matters because egocentric data has its own character: it is recorded under natural motion and routinely contains hundreds of objects. EgoLifter represents scenes and objects with 3D Gaussians and uses segmentation masks from the Segment Anything Model (SAM) as weak supervision. To handle the dynamic objects common in egocentric videos, a transient prediction module filters these elements out during 3D reconstruction.
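To make the transient filtering concrete, below is a minimal sketch of how a per-pixel transient probability map, predicted by such a network, can down-weight the photometric reconstruction loss. It is written in PyTorch under stated assumptions: the function name, the plain L1 error, and the regularization weight are illustrative choices, not the paper's exact formulation.

```python
import torch

def transient_weighted_photometric_loss(rendered, target, transient_prob, reg_weight=0.01):
    """Down-weight reconstruction error where pixels are predicted transient.

    rendered, target: (3, H, W) rendered and ground-truth images.
    transient_prob:   (H, W) values in [0, 1], output of a transient
                      prediction network (its architecture is not shown here).
    """
    per_pixel = (rendered - target).abs().mean(dim=0)          # (H, W) L1 error
    static_loss = ((1.0 - transient_prob) * per_pixel).mean()  # ignore dynamic pixels
    # Regularizer: without it, the network could mark every pixel transient
    # and zero out the photometric loss entirely.
    reg = transient_prob.mean()
    return static_loss + reg_weight * reg
```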
Challenges in Egocentric 3D Perception
Egocentric 3D perception confronts two main challenges. First, coverage: unlike datasets captured with deliberate scanning motions, egocentric videos may not observe the whole scene, leaving incomplete multi-view observations. Second, dynamics: objects in such videos are frequently moved and manipulated, so the system must recognize and reconstruct them robustly under frequent human-object interactions.
EgoLifter's Architectural Innovations
EgoLifter employs several key innovations to address these challenges:
- 3D Gaussian Representation: By adopting 3D Gaussians, EgoLifter captures the geometry of scenes and objects efficiently, facilitating accurate photometric reconstruction.
- Flexible Object Decomposition: EgoLifter uses SAM masks to identify object instances and lifts them into 3D with contrastive learning, so no predefined object taxonomy is required (a sketch of one such contrastive objective follows this list).
- Transient Object Handling: A transient prediction network lets EgoLifter focus on the static parts of the scene, improving both the reconstruction and the segmentation of static objects.
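As referenced in the decomposition item above, here is a minimal sketch of one common contrastive formulation for lifting view-inconsistent 2D masks into instance features: per-Gaussian embeddings are rasterized into a 2D feature map, and sampled pixel pairs are pulled together when they fall inside the same SAM mask and pushed apart otherwise. The pair sampling, temperature, and binary cross-entropy form are assumptions for illustration; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def mask_contrastive_loss(feat_map, sam_masks, temperature=0.1, n_pairs=4096):
    """Contrastive objective over a rendered per-pixel feature map.

    feat_map:  (D, H, W) feature image rasterized from per-Gaussian embeddings.
    sam_masks: (H, W) integer map of SAM instance ids for this frame.
    Pixel pairs sharing a mask id are pulled together; pairs from different
    masks are pushed apart. Background handling is omitted for brevity.
    """
    D, H, W = feat_map.shape
    feats = F.normalize(feat_map.reshape(D, -1).T, dim=1)  # (H*W, D), unit norm
    ids = sam_masks.reshape(-1)
    # Sample random pixel pairs so the loss stays tractable for large images.
    idx_a = torch.randint(0, H * W, (n_pairs,))
    idx_b = torch.randint(0, H * W, (n_pairs,))
    logits = (feats[idx_a] * feats[idx_b]).sum(dim=1) / temperature
    same = (ids[idx_a] == ids[idx_b]).float()
    # Binary contrastive loss: high cosine similarity iff same SAM mask.
    return F.binary_cross_entropy_with_logits(logits, same)
```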
Prominent Results and Applications
EgoLifter delivers strong quantitative results in open-world 3D segmentation, achieving state-of-the-art performance on the Aria Digital Twin dataset and demonstrating its utility in real-world scenarios. Its ability to decompose 3D scenes into object instances points to a broad range of applications, including Augmented Reality (AR) and Virtual Reality (VR).
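One way such a decomposition supports open-world use: once every Gaussian carries a learned instance embedding, an object can be selected at query time by feature similarity rather than by class label. A hypothetical sketch (function name and threshold are illustrative):

```python
import torch
import torch.nn.functional as F

def segment_by_query(gaussian_feats, query_feat, threshold=0.8):
    """Select the Gaussians of one object by instance-feature similarity.

    gaussian_feats: (N, D) per-Gaussian embeddings learned contrastively.
    query_feat:     (D,) query feature, e.g. read off a clicked pixel in a
                    rendered feature map.
    Returns a boolean mask over the N Gaussians.
    """
    g = F.normalize(gaussian_feats, dim=1)
    q = F.normalize(query_feat, dim=0)
    similarity = g @ q             # (N,) cosine similarity per Gaussian
    return similarity > threshold  # the queried object's Gaussians
```

Because selection happens in feature space instead of against a fixed class list, objects outside any predefined taxonomy can still be segmented, which is what makes the approach open-world.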
Practical Implications and Future Directions
The practical implications of EgoLifter span a wide range of domains, from richer user interaction in AR/VR environments to better scene understanding for autonomous systems. By bridging 2D understanding and 3D perception, EgoLifter sets the stage for more intuitive, interaction-rich computing experiences. Future work may refine transient object discrimination, broaden the range of object instances handled, and improve scalability to ever larger and more complex datasets.
Conclusion
EgoLifter represents a significant step forward in the field of egocentric perception, enabling accurate open-world 3D segmentation and reconstruction from naturally captured egocentric videos. Its innovative approach to handling the dynamic nature of egocentric data, coupled with the flexibility of its object instance definitions, positions it as a promising tool for advancing 3D understanding in both academic and practical contexts.