xMUDA: Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation

Published 28 Nov 2019 in cs.CV | (1911.12676v2)

Abstract: Unsupervised Domain Adaptation (UDA) is crucial to tackle the lack of annotations in a new domain. There are many multi-modal datasets, but most UDA approaches are uni-modal. In this work, we explore how to learn from multi-modality and propose cross-modal UDA (xMUDA) where we assume the presence of 2D images and 3D point clouds for 3D semantic segmentation. This is challenging as the two input spaces are heterogeneous and can be impacted differently by domain shift. In xMUDA, modalities learn from each other through mutual mimicking, disentangled from the segmentation objective, to prevent the stronger modality from adopting false predictions from the weaker one. We evaluate on new UDA scenarios including day-to-night, country-to-country and dataset-to-dataset, leveraging recent autonomous driving datasets. xMUDA brings large improvements over uni-modal UDA on all tested scenarios, and is complementary to state-of-the-art UDA techniques. Code is available at https://github.com/valeoai/xmuda.


Authors (5)

Citations (182)


Summary

The paper presents a novel cross-modal UDA framework that enhances 3D semantic segmentation by integrating information from 2D images and 3D point clouds.
It utilizes a dual-stream architecture with a KL divergence-based cross-modal loss to effectively mitigate domain shifts across scenarios like day-to-night transitions and geographical variations.
Experiments on datasets such as nuScenes, A2D2, and SemanticKITTI demonstrate significant improvements over uni-modal approaches, especially when combined with pseudo-labeling.

The paper addresses Unsupervised Domain Adaptation (UDA) for 3D semantic segmentation with an approach termed cross-modal UDA (xMUDA). The method uses both 2D images and 3D point clouds to counter the domain shift that arises when a segmentation model trained on a labeled source domain is applied to a target domain without annotations. The work focuses on autonomous driving scenarios, though the approach may be relevant wherever robust 3D scene understanding from paired image and point-cloud data is needed.

A key innovation of xMUDA is its cross-modal learning framework, which lets the 2D and 3D modalities exchange information. This is achieved through mutual mimicry: each modality learns to mimic the other's outputs. The mimicry objective is disentangled from the segmentation objective, which prevents the stronger modality from adopting false predictions from the weaker one. The model is evaluated on several real-to-real adaptation scenarios: day-to-night shifts, geographical shifts (country-to-country), and changes in sensor setup (dataset-to-dataset).

The architecture underpinning xMUDA is a two-stream network in which the 2D and 3D streams remain independent, so each modality keeps its own specialized network design, while their outputs are aligned through a cross-modal loss based on KL divergence. This design is intended to keep segmentation robust under considerable domain shift, such as changes in environmental conditions or geographical location.
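As a rough illustration of a KL-divergence mimicry loss, the sketch below compares per-point class distributions from the two streams. It assumes, as one way to keep mimicry disentangled from segmentation, that each stream exposes a main output and a separate mimicry output; the function names, the list-of-lists layout, and the equal weighting of the two directions are all hypothetical and are not taken from the released code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one point's class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, estimate, eps=1e-12):
    """D_KL(target || estimate) for two discrete class distributions."""
    return sum(
        t * math.log((t + eps) / (e + eps))
        for t, e in zip(target, estimate)
        if t > 0
    )

def cross_modal_loss(main_2d, mimic_2d, main_3d, mimic_3d):
    """Mean two-way mimicry loss over points (hypothetical layout).

    Each argument is a list with one list of class logits per point,
    with the same points in the same order. The mimicry output of one
    stream is pulled toward the main output of the other stream.
    """
    total = 0.0
    for m2, q2, m3, q3 in zip(main_2d, mimic_2d, main_3d, mimic_3d):
        # In real training the target would be treated as a constant,
        # so gradients only flow into the mimicking stream.
        total += kl_divergence(softmax(m3), softmax(q2))  # 2D mimics 3D
        total += kl_divergence(softmax(m2), softmax(q3))  # 3D mimics 2D
    return total / len(main_2d)
```

Because the loss needs no ground-truth labels, a term of this kind can in principle be computed on unlabeled target-domain data as well as on the source domain.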

The reported results show substantial improvements over existing uni-modal UDA methods. xMUDA is compared against self-training with pseudo-labels and against existing state-of-the-art UDA techniques, and it improves on them across the tested scenarios. The variant that combines xMUDA with pseudo-labeling, referred to as xMUDA_PL, is reported to perform best.
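The pseudo-labeling step can be pictured with the minimal sketch below. It assumes a single fixed confidence threshold and an ignore label for rejected points; the summary does not state the actual selection rule, so `threshold`, `ignore_label`, and the function name are illustrative assumptions only.

```python
def select_pseudo_labels(probs, threshold=0.9, ignore_label=-100):
    """Turn per-point class probabilities into pseudo-labels.

    probs: one list of class probabilities per target-domain point.
    Points whose top probability is below `threshold` receive
    `ignore_label` so a later training pass can skip them.
    """
    labels = []
    for p in probs:
        best = max(range(len(p)), key=lambda c: p[c])
        labels.append(best if p[best] >= threshold else ignore_label)
    return labels

# Example: only the confident points keep a label.
example = select_pseudo_labels([[0.95, 0.05], [0.6, 0.4], [0.1, 0.9]])
```

In a self-training setup of this kind, the selected labels would be produced once by a trained model and then used as supervision on the target domain in a second training round.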

Experimentally, the approach is validated on recent autonomous driving datasets, including nuScenes, A2D2, and SemanticKITTI, which provide the paired image and point-cloud data that xMUDA's training requires. Results are reported as mIoU (mean intersection over union) and show improvements in every tested scenario, which supports the framework's ability to handle domain shift.
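For readers unfamiliar with the metric, the following is a minimal sketch of mIoU over per-point labels. It uses the standard definition (per-class intersection over union, averaged over classes that occur); skipping absent classes and the `ignore_label` value are assumptions and may differ from the paper's evaluation protocol.

```python
def mean_iou(pred, target, num_classes, ignore_label=-100):
    """Mean intersection-over-union across classes for per-point labels."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # unlabeled point, excluded from the metric
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[t] += 1  # false negative for the true class
    # Average only over classes that appear in predictions or labels.
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else float("nan")

# Example: class 0 has IoU 1/2 and class 1 has IoU 2/3.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```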

An extension to fusion is also discussed, in which xMUDA is applied to architectures that fuse the 2D and 3D streams rather than keeping the two modalities fully separate. The results suggest that cross-modal learning can also raise the accuracy of such fusion architectures and make their performance more consistent across environments.

The implications of this research are manifold. Practically, xMUDA's framework paves the way for more robust and adaptable 3D semantic segmentation systems, particularly in dynamic environments like autonomous vehicles. Theoretically, it enriches the discourse on the integration of multi-modality in machine learning, advocating for a more aligned, cooperative learning strategy among heterogeneous data inputs. Future developments might explore extending xMUDA to include other sensing modalities or its application in other domains requiring high-fidelity environmental understanding under domain shift constraints.
