Emergent Mind

Towards Transferable Multi-modal Perception Representation Learning for Autonomy: NeRF-Supervised Masked AutoEncoder

(2311.13750)

Published Nov 23, 2023 in cs.CV , cs.AI , cs.LG , and cs.RO

This work proposes a unified self-supervised pre-training framework for transferable multi-modal perception representation learning via masked multi-modal reconstruction in Neural Radiance Field (NeRF), namely NeRF-Supervised Masked AutoEncoder (NS-MAE). Specifically, conditioned on certain view directions and locations, multi-modal embeddings extracted from corrupted multi-modal input signals, i.e., Lidar point clouds and images, are rendered into projected multi-modal feature maps via neural rendering. Then, original multi-modal signals serve as reconstruction targets for the rendered multi-modal feature maps to enable self-supervised representation learning. Extensive experiments show that the representation learned via NS-MAE shows promising transferability for diverse multi-modal and single-modal (camera-only and Lidar-only) perception models on diverse 3D perception downstream tasks (3D object detection and BEV map segmentation) with diverse amounts of fine-tuning labeled data. Moreover, we empirically find that NS-MAE enjoys the synergy of both the mechanism of masked autoencoder and neural radiance field. We hope this study can inspire exploration of more general multi-modal representation learning for autonomous agents.

We're not able to analyze this paper right now due to high demand.

Please check back later (sorry!).

Generate a summary of this paper on our Pro plan:

We ran into a problem analyzing this paper.