PointCLIP: Point Cloud Understanding by CLIP

(2112.02413)
Published Dec 4, 2021 in cs.CV , cs.AI , and cs.RO

Abstract

Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP), which learns to match images with their corresponding texts in open-vocabulary settings, have shown inspiring performance on 2D visual recognition. However, it remains underexplored whether CLIP, pre-trained on large-scale 2D image-text pairs, can be generalized to 3D recognition. In this paper, we show that such a setting is feasible by proposing PointCLIP, which conducts alignment between CLIP-encoded point clouds and 3D category texts. Specifically, we encode a point cloud by projecting it into multi-view depth maps without rendering, and aggregate the view-wise zero-shot predictions to achieve knowledge transfer from 2D to 3D. On top of that, we design an inter-view adapter to better extract the global feature and adaptively fuse the few-shot knowledge learned from 3D into CLIP pre-trained in 2D. By fine-tuning just the lightweight adapter in few-shot settings, the performance of PointCLIP can be largely improved. In addition, we observe a complementary property between PointCLIP and classical 3D-supervised networks. By simple ensembling, PointCLIP boosts the baseline's performance and even surpasses state-of-the-art models. Therefore, PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP, with low resource cost and in low-data regimes. We conduct thorough experiments on the widely adopted ModelNet10 and ModelNet40 and the challenging ScanObjectNN to demonstrate the effectiveness of PointCLIP. The code is released at https://github.com/ZrrSkywalker/PointCLIP.

PointCLIP projects point clouds into multi-view depth maps for 3D recognition with CLIP, offering both zero-shot and few-shot options.

Overview

  • PointCLIP extends the capabilities of Contrastive Vision-Language Pre-training (CLIP) from 2D visual recognition to 3D point cloud understanding, demonstrating cross-modality knowledge transfer.

  • The approach combines multi-view projection that converts 3D point clouds into 2D depth maps, zero-shot classification that leverages CLIP's pre-trained visual and textual encoders, and few-shot learning through a lightweight inter-view adapter.

  • Experimental results show that PointCLIP achieves substantial performance gains on 3D point cloud datasets like ModelNet10 and ModelNet40, outperforming traditional 3D networks in few-shot settings.

PointCLIP: Point Cloud Understanding by CLIP

This paper introduces PointCLIP, a novel approach that extends Contrastive Vision-Language Pre-training (CLIP) to 3D point cloud recognition. CLIP has been employed with significant success in 2D visual recognition tasks, but its application to 3D recognition had remained largely unexplored. PointCLIP addresses this gap by leveraging CLIP's knowledge, pre-trained on 2D image-text pairs, to perform 3D point cloud understanding, demonstrating the potential for cross-modality knowledge transfer.

Key Methodologies and Approach

Multi-View Projection

PointCLIP bridges the modal gap between 3D point clouds and 2D images by projecting each point cloud into multi-view depth maps. The projection involves no rendering and incurs minimal computational overhead. By representing the 3D point cloud from multiple perspectives, it converts the sparse, irregularly distributed 3D data into a format readily processed by CLIP's 2D visual encoder.
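
The projection can be as simple as binning points onto an image plane. The sketch below is a minimal, illustrative version of such a rendering-free projection, assuming the point cloud is normalized to [-1, 1]; the function name, grid resolution, and depth convention are our own choices, not the released code.

```python
import numpy as np

def project_to_depth_map(points, view_axis=2, resolution=128):
    """Project a normalized point cloud onto one depth map without rendering.

    points: (N, 3) array of xyz coordinates, assumed to lie in [-1, 1].
    view_axis: axis treated as depth for this view (0, 1, or 2).
    Returns a (resolution, resolution) float32 depth image (0 = empty pixel).
    """
    plane_axes = [a for a in range(3) if a != view_axis]  # axes that index pixels
    uv = points[:, plane_axes]
    depth = points[:, view_axis] + 1.0                    # shift depth to (0, 2]

    # Map normalized plane coordinates to pixel indices.
    pix = np.clip(((uv + 1.0) / 2.0 * (resolution - 1)).astype(int), 0, resolution - 1)

    # For each pixel, keep the point closest to a camera looking along +view_axis.
    depth_map = np.zeros((resolution, resolution), dtype=np.float32)
    for (u, v), d in zip(pix, depth):
        depth_map[v, u] = max(depth_map[v, u], d)
    return depth_map
```

Repeating this for several view directions yields the multi-view depth maps that are fed to CLIP's visual encoder; in practice each map would be converted to an image (e.g., via PIL) before CLIP's standard preprocessing.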

Zero-Shot Classification

For zero-shot classification, PointCLIP generates visual features for each view using CLIP's pre-trained visual encoder. Category names are framed in a hand-crafted textual template and encoded by CLIP's textual encoder, forming a zero-shot classifier. The final classification probabilities are then obtained by aggregating the predictions from each view, weighted according to hyperparameters that designate the importance of each view. This approach enables PointCLIP to classify 3D objects without any additional 3D training, solely based on pre-trained 2D knowledge.
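
Below is a minimal sketch of this zero-shot pipeline using the openly released CLIP package (https://github.com/openai/CLIP). The backbone choice, prompt wording, class names, and uniform view weights are illustrative assumptions, not the paper's exact configuration.

```python
import clip   # pip install git+https://github.com/openai/CLIP.git
import torch

@torch.no_grad()
def zero_shot_classify(depth_views, class_names, view_weights=None, device="cpu"):
    """Classify a point cloud from its projected depth maps with frozen CLIP (sketch).

    depth_views: list of PIL images, one per projected view.
    class_names: candidate 3D category names.
    view_weights: per-view importance hyperparameters (uniform if None).
    """
    model, preprocess = clip.load("RN50", device=device)  # backbone is an illustrative choice

    # Hand-crafted prompt template; the exact wording here is an assumption.
    prompts = clip.tokenize([f"point cloud depth map of a {c}." for c in class_names]).to(device)
    text_feat = model.encode_text(prompts)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    if view_weights is None:
        view_weights = [1.0] * len(depth_views)

    # Encode each view independently and aggregate the weighted view-wise logits.
    logits = 0.0
    for w, img in zip(view_weights, depth_views):
        image = preprocess(img).unsqueeze(0).to(device)
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        logits = logits + w * (100.0 * img_feat @ text_feat.t())

    return class_names[logits.argmax(dim=-1).item()]
```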

Few-Shot Learning with Inter-View Adapter

To improve performance in few-shot learning settings, PointCLIP introduces an inter-view adapter. This lightweight three-layer Multi-layer Perceptron (MLP), comprising bottleneck linear layers, is fine-tuned on few-shot 3D datasets. The adapter aggregates multi-view features to construct a global representation of the point cloud. Adapted features are then generated for each view and fused with the original CLIP-encoded features. This design enables effective integration of few-shot 3D knowledge with pre-existing 2D priors, significantly enhancing classification accuracy without overfitting.
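
A minimal PyTorch sketch of such an adapter is given below; the number of views, feature dimension, bottleneck width, and residual ratio are assumptions made for illustration rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class InterViewAdapter(nn.Module):
    """Bottleneck MLP that fuses multi-view CLIP features (minimal sketch)."""

    def __init__(self, num_views=6, feat_dim=1024, bottleneck=256, alpha=0.5):
        super().__init__()
        self.num_views = num_views
        self.alpha = alpha  # residual ratio between adapted and original features
        # Three linear layers: concatenated views -> bottleneck -> global -> per-view features.
        self.fc1 = nn.Linear(num_views * feat_dim, bottleneck)
        self.fc2 = nn.Linear(bottleneck, feat_dim)
        self.fc3 = nn.Linear(feat_dim, num_views * feat_dim)
        self.act = nn.ReLU(inplace=True)

    def forward(self, view_feats):                      # (B, num_views, feat_dim)
        b = view_feats.size(0)
        # Global representation aggregated across all views.
        global_feat = self.act(self.fc2(self.act(self.fc1(view_feats.flatten(1)))))
        # Generate adapted per-view features from the global representation.
        adapted = self.fc3(global_feat).view(b, self.num_views, -1)
        # Residual fusion with the original (frozen) CLIP-encoded features.
        return self.alpha * adapted + (1 - self.alpha) * view_feats
```

During few-shot training, only the adapter's parameters would be updated while both CLIP encoders stay frozen, which keeps the number of trainable weights small.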

Experimental Validation

Zero-Shot Performance

The zero-shot experiments on ModelNet10, ModelNet40, and ScanObjectNN demonstrate the feasibility of applying CLIP's pre-trained 2D representations to 3D point clouds. PointCLIP achieves a notable 30.23% accuracy on ModelNet10 without any 3D training, indicating successful cross-modality knowledge transfer.

Few-Shot Performance

In few-shot settings, PointCLIP significantly outperforms classical 3D networks, including PointNet and PointNet++. For instance, on ModelNet40, PointCLIP shows a 12.29% improvement over CurveNet with just one shot per category, demonstrating its robustness and efficiency in low-data regimes. The addition of the inter-view adapter markedly elevates performance, achieving results comparable to models trained on the full datasets.

Implications and Future Directions

PointCLIP’s methodologies have several implications for the field of 3D point cloud recognition:

  1. Cross-Modality Knowledge Transfer: It showcases the practicality and effectiveness of transferring 2D pre-trained models to 3D recognition tasks, paving the way for future innovations in utilizing large-scale 2D datasets for other 3D applications.
  2. Efficiency in Few-Shot Learning: The inter-view adapter exemplifies an efficient strategy to enhance few-shot learning, ensuring robustness without the risk of overfitting through lightweight fine-tuning.
  3. Advancements in Multi-Source Inference: As demonstrated, PointCLIP can complement existing 3D models through ensembling, achieving state-of-the-art performance by integrating diverse knowledge sources (a minimal ensembling sketch follows this list).
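
As a rough illustration of the ensembling mentioned in point 3, predictions can be blended at the probability level; the mixing weight below is an illustrative hyperparameter, not a value reported in the paper.

```python
import torch

def ensemble_predictions(pointclip_logits, net3d_logits, weight=0.5):
    """Blend PointCLIP with a supervised 3D network at the prediction level (sketch).

    Both inputs are (batch, num_classes) logits; `weight` is an illustrative mixing ratio.
    """
    probs_clip = torch.softmax(pointclip_logits, dim=-1)
    probs_3d = torch.softmax(net3d_logits, dim=-1)
    return weight * probs_clip + (1.0 - weight) * probs_3d
```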

Future research could explore extending PointCLIP to other 3D domain tasks, such as object detection and segmentation, by leveraging the contrastive vision-language pre-training paradigm. Additionally, investigating adaptive multi-modal fusion techniques could further augment the efficacy of cross-modality learning in increasingly complex environments.

In conclusion, PointCLIP provides a promising and effective approach to 3D point cloud understanding by utilizing pre-trained 2D knowledge from CLIP, achieving significant advancements in zero-shot and few-shot learning, and outperforming conventional 3D models through strategic ensembling. Its methodologies and findings contribute valuable insights and opportunities for future exploration in the field of 3D computer vision.
