Abstract

Pre-training on large-scale image data has become the de-facto approach for learning robust 2D representations. In contrast, due to expensive data acquisition and annotation, the paucity of large-scale 3D datasets severely hinders the learning of high-quality 3D features. In this paper, we propose an alternative that obtains superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE. Through self-supervised pre-training, we leverage the well-learned 2D knowledge to guide 3D masked autoencoding, which reconstructs the masked point tokens with an encoder-decoder architecture. Specifically, we first utilize off-the-shelf 2D models to extract multi-view visual features of the input point cloud, and then conduct two types of image-to-point learning schemes on top. For one, we introduce a 2D-guided masking strategy that keeps semantically important point tokens visible to the encoder. Compared to random masking, the network can better concentrate on significant 3D structures and recover the masked tokens from key spatial cues. For another, we enforce these visible tokens to reconstruct the corresponding multi-view 2D features after the decoder. This enables the network to effectively inherit high-level 2D semantics learned from rich image data for discriminative 3D modeling. Aided by our image-to-point pre-training, the frozen I2P-MAE, without any fine-tuning, achieves 93.4% accuracy for linear SVM on ModelNet40, competitive with the fully trained results of existing methods. By further fine-tuning on ScanObjectNN's hardest split, I2P-MAE attains the state-of-the-art 90.11% accuracy, +3.68% over the second-best, demonstrating superior transfer capability. Code will be available at https://github.com/ZrrSkywalker/I2P-MAE.

Leverage 2D pre-trained models to enhance 3D masked autoencoder pre-training, reducing the reliance on large-scale 3D datasets.

Overview

  • The paper introduces Image-to-Point Masked Autoencoders (I2P-MAE), a novel method that leverages pre-trained 2D models to improve 3D representation learning.

  • I2P-MAE uses an encoder-decoder architecture and employs two key strategies: 2D-guided masking and 2D-semantic reconstruction, enabling high-quality 3D representations without the need for extensive 3D datasets.

  • Experimental results on tasks like classification and part segmentation show that I2P-MAE outperforms existing methods, demonstrating superior performance and faster convergence rates.

Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders

Introduction

The acquisition of robust 3D representations remains a significant challenge in computer vision due to the scarcity of large-scale 3D datasets. The authors introduce Image-to-Point Masked Autoencoders (I2P-MAE), an approach that leverages pre-trained 2D models within an encoder-decoder architecture for self-supervised pre-training on 3D point clouds. This methodology yields high-quality 3D representations without requiring extensive 3D datasets, transferring rich 2D knowledge into the 3D domain.

Methodology and Architecture

Basic 3D Architecture

The core framework of I2P-MAE aligns closely with existing MAE methodologies for 3D point clouds. It comprises a token embedding module, an encoder-decoder transformer, and a reconstruction head for masked 3D coordinates. Given an input point cloud, points are downsampled and aggregated into tokens representing local spatial regions. A high masking ratio is then applied, and only the visible tokens are fed to the transformer encoder; a lightweight asymmetric decoder subsequently reconstructs the masked points from the encoded visible tokens.
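This tokenization-and-masking pipeline is common to point-cloud MAEs and can be pictured with a minimal sketch, assuming farthest point sampling for token centers and k-nearest-neighbour grouping for local patches; the shapes, helper names, and 80% masking ratio below are illustrative rather than the authors' exact implementation.

```python
import numpy as np

def farthest_point_sample(xyz, n_tokens):
    """Greedy farthest point sampling: pick n_tokens well-spread centers from (N, 3) points."""
    N = xyz.shape[0]
    centers = np.zeros(n_tokens, dtype=np.int64)
    dist = np.full(N, np.inf)
    farthest = 0                                   # start from an arbitrary point
    for i in range(n_tokens):
        centers[i] = farthest
        d = np.sum((xyz - xyz[farthest]) ** 2, axis=1)
        dist = np.minimum(dist, d)                 # distance to the closest chosen center
        farthest = int(np.argmax(dist))            # next center = farthest remaining point
    return centers

def group_knn(xyz, center_idx, k=32):
    """Group the k nearest neighbours of each center into one local 'token' region."""
    centers = xyz[center_idx]                                            # (T, 3)
    d = np.linalg.norm(xyz[None, :, :] - centers[:, None, :], axis=-1)   # (T, N)
    knn_idx = np.argsort(d, axis=1)[:, :k]                               # (T, k)
    patches = xyz[knn_idx] - centers[:, None, :]                         # center-normalised local patches
    return centers, patches

def random_mask(n_tokens, mask_ratio=0.8, rng=None):
    """Baseline random masking: keep only (1 - mask_ratio) of the tokens visible."""
    rng = rng or np.random.default_rng(0)
    n_visible = int(n_tokens * (1.0 - mask_ratio))
    perm = rng.permutation(n_tokens)
    return perm[:n_visible], perm[n_visible:]      # visible indices, masked indices

# Toy usage: 2048 input points -> 64 tokens, 80% of them masked before the encoder.
points = np.random.rand(2048, 3).astype(np.float32)
center_idx = farthest_point_sample(points, n_tokens=64)
centers, patches = group_knn(points, center_idx, k=32)
visible_idx, masked_idx = random_mask(len(center_idx), mask_ratio=0.8)
print(patches.shape, len(visible_idx), len(masked_idx))   # (64, 32, 3) 12 52
```

In I2P-MAE, the random masking step above is replaced by the 2D-guided masking described later.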

2D Pre-trained Representations

To bridge the gap between 2D and 3D data, the authors project the point cloud onto multiple image planes, creating depth maps. These projections are passed through pre-trained 2D models to extract multi-view 2D features and saliency maps. The 2D features encapsulate high-level semantics learned from large-scale image data, whereas the saliency maps indicate semantic significance across different regions.
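The 2D bridge can be pictured with a small sketch that orthographically projects the normalised cloud onto a few axis-aligned planes and rasterises a simple z-buffered depth map per view; a frozen 2D backbone would then consume each map to produce multi-view features and saliency maps. The view set, resolution, and projection scheme here are illustrative assumptions, not necessarily the paper's exact choices.

```python
import numpy as np

def depth_maps_from_points(xyz, views=("xy", "yz", "xz"), res=224):
    """Orthographically project a point cloud onto axis-aligned planes,
    keeping the nearest depth per pixel (a simple z-buffer)."""
    axes = {"xy": (0, 1, 2), "yz": (1, 2, 0), "xz": (0, 2, 1)}   # (u, v, depth) axes per view
    # Normalise into [0, 1] so pixel coordinates and depths are comparable across shapes.
    p = (xyz - xyz.min(0)) / (xyz.max(0) - xyz.min(0) + 1e-8)
    maps = []
    for v in views:
        u_ax, v_ax, d_ax = axes[v]
        u = np.clip((p[:, u_ax] * (res - 1)).astype(int), 0, res - 1)
        w = np.clip((p[:, v_ax] * (res - 1)).astype(int), 0, res - 1)
        depth = np.ones((res, res), dtype=np.float32)            # 1.0 = far / empty background
        for ui, wi, di in zip(u, w, p[:, d_ax]):
            if di < depth[wi, ui]:                               # keep the closest point per pixel
                depth[wi, ui] = di
        maps.append(depth)
    return np.stack(maps)                                        # (n_views, res, res)

# Toy usage; a frozen 2D model (e.g. an ImageNet- or CLIP-style backbone) would then take
# each depth map, repeated to 3 channels, and output per-view features and saliency maps.
points = np.random.rand(2048, 3).astype(np.float32)
views = depth_maps_from_points(points)
print(views.shape)   # (3, 224, 224)
```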

Image-to-Point Learning Schemes

Two primary schemes are employed for the effective transfer of 2D knowledge:

  1. 2D-guided Masking: This strategy uses 2D saliency maps to guide which point tokens are visible during masking. Tokens with higher semantic importance are prioritized to remain visible, aiding the network in focusing on significant 3D structures.
  2. 2D-semantic Reconstruction: Beyond reconstructing 3D coordinates, visible tokens are used to reconstruct aggregated 2D semantics from multi-view features, blending low-level spatial understanding with high-level semantic knowledge (both schemes are sketched below).
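Both schemes can be summarised in a short sketch: token visibility is sampled in proportion to a per-token saliency score back-projected from the 2D saliency maps, and the visible tokens additionally regress an aggregated multi-view 2D feature alongside the usual coordinate reconstruction. The scoring rule, simple L2 losses, and weighting below are illustrative assumptions rather than the authors' exact recipe (point MAEs typically use a Chamfer-style loss for the coordinate term).

```python
import numpy as np

def saliency_guided_mask(token_saliency, mask_ratio=0.8, rng=None):
    """2D-guided masking: tokens with higher (2D back-projected) saliency are more likely
    to stay visible, so the encoder concentrates on semantically important structures."""
    rng = rng or np.random.default_rng(0)
    T = len(token_saliency)
    n_visible = int(T * (1.0 - mask_ratio))
    prob = token_saliency / token_saliency.sum()              # sampling weights
    visible = rng.choice(T, size=n_visible, replace=False, p=prob)
    masked = np.setdiff1d(np.arange(T), visible)
    return visible, masked

def i2p_losses(pred_xyz, gt_xyz, pred_2d_feat, target_2d_feat, w_2d=1.0):
    """Combined objective: (i) coordinate reconstruction for the masked tokens and
    (ii) 2D-semantic reconstruction for the visible tokens (plain L2 here; the
    coordinate term stands in for a Chamfer-style loss)."""
    loss_3d = np.mean((pred_xyz - gt_xyz) ** 2)
    loss_2d = np.mean((pred_2d_feat - target_2d_feat) ** 2)
    return loss_3d + w_2d * loss_2d

# Toy usage with 64 tokens and 512-dim aggregated multi-view 2D features.
rng = np.random.default_rng(0)
saliency = rng.random(64) + 1e-3                              # per-token saliency scores
visible, masked = saliency_guided_mask(saliency, mask_ratio=0.8, rng=rng)
loss = i2p_losses(pred_xyz=rng.random((len(masked), 32, 3)),
                  gt_xyz=rng.random((len(masked), 32, 3)),
                  pred_2d_feat=rng.random((len(visible), 512)),
                  target_2d_feat=rng.random((len(visible), 512)))
print(len(visible), len(masked), float(loss))
```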

Experimental Results

Pre-training and Evaluation

The authors conducted extensive experiments, using ShapeNet for self-supervised pre-training and then evaluating the frozen encoder's features with a linear SVM on the ModelNet40 and ScanObjectNN classification benchmarks. The frozen I2P-MAE reaches 93.4% linear-SVM accuracy on ModelNet40 without any fine-tuning, exhibiting the highest transfer capability and significantly faster convergence than Point-MAE and Point-M2AE.
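The linear-SVM protocol itself is standard: freeze the pre-trained encoder, extract one global feature vector per shape, and fit a linear classifier on top. A minimal scikit-learn sketch, with random arrays standing in for the frozen encoder's pooled features (the 384-dimensional feature size and the SVM hyperparameters are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Stand-ins for features produced by the frozen pre-trained encoder
# (e.g. max/mean-pooled token embeddings) and their class labels.
rng = np.random.default_rng(0)
train_feats, train_labels = rng.random((9843, 384)), rng.integers(0, 40, 9843)   # ModelNet40 train size
test_feats, test_labels = rng.random((2468, 384)), rng.integers(0, 40, 2468)     # ModelNet40 test size

# Fit a linear SVM on the frozen features; no gradients reach the 3D encoder.
clf = LinearSVC(C=0.01, max_iter=10000)
clf.fit(train_feats, train_labels)
acc = accuracy_score(test_labels, clf.predict(test_feats))
print(f"linear SVM accuracy: {acc:.4f}")
```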

Downstream Task Performance

On real-world classification tasks using ScanObjectNN, I2P-MAE achieved a leading 90.11% accuracy on the hardest split, surpassing Point-M2AE by +3.68%. For synthetic 3D classification on ModelNet40, I2P-MAE also outperformed existing methods both before and after fine-tuning.

Furthermore, for part segmentation on ShapeNetPart, I2P-MAE attained state-of-the-art performance with a class mIoU (mIoU_C) of 85.15% and an instance mIoU (mIoU_I) of 86.76%, indicating its proficiency in understanding fine-grained 3D structures.
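The two metrics differ only in how per-shape IoUs are averaged: instance mIoU averages over all test shapes, while class mIoU first averages within each object category and then across categories. A short sketch with toy numbers makes the distinction concrete:

```python
import numpy as np

def miou_instance_and_class(per_shape_iou, shape_category):
    """per_shape_iou: mean part IoU of each test shape; shape_category: its object class.
    Instance mIoU averages over shapes; class mIoU averages the per-category means."""
    per_shape_iou = np.asarray(per_shape_iou, dtype=float)
    shape_category = np.asarray(shape_category)
    miou_i = per_shape_iou.mean()
    miou_c = np.mean([per_shape_iou[shape_category == c].mean()
                      for c in np.unique(shape_category)])
    return miou_c, miou_i

# Toy example: three airplanes with high IoU, one mug with low IoU.
ious = [0.90, 0.92, 0.88, 0.60]
cats = ["airplane", "airplane", "airplane", "mug"]
miou_c, miou_i = miou_instance_and_class(ious, cats)
print(f"class mIoU = {miou_c:.3f}, instance mIoU = {miou_i:.3f}")   # 0.750 vs 0.825
```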

Ablation Studies

The ablation studies demonstrated the efficacy of the proposed 2D-guided masking and 2D-semantic reconstruction strategies. The 2D guidance significantly improved the model performance, highlighting the importance of preserving semantically important 3D structures and leveraging high-level 2D semantics.

Implications and Future Directions

The proposed methodology, I2P-MAE, holds significant implications for advancing 3D representation learning. The ability to utilize pre-trained 2D models effectively mitigates the challenges posed by the paucity of large-scale 3D datasets. The improved convergence rates and superior performance on diverse downstream tasks underline the practical applicability of this approach.

Future research could explore more sophisticated mechanisms for image-to-point knowledge transfer, potentially expanding to tasks such as 3D object detection and visual grounding. The integration of additional 2D and 3D pre-training paradigms may further enhance the efficacy and applicability of the proposed framework.

Conclusion

This paper presents a substantial contribution to the field of 3D representation learning by introducing a novel point cloud pre-training framework that leverages the rich semantic knowledge of pre-trained 2D models. The results affirm that I2P-MAE significantly advances the quality of learned 3D representations, offering a robust alternative amidst the challenges of acquiring extensive 3D datasets. The proposed methodologies promise to steer future research towards more efficient and semantically enriched 3D representation learning frameworks.
