A Sim2Real Deep Learning Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in Bird's Eye View

Published 8 May 2020 in cs.CV and cs.LG | (2005.04078v1)

Abstract: Accurate environment perception is essential for automated driving. When using monocular cameras, the distance estimation of elements in the environment poses a major challenge. Distances can be more easily estimated when the camera perspective is transformed to a bird's eye view (BEV). For flat surfaces, Inverse Perspective Mapping (IPM) can accurately transform images to a BEV. Three-dimensional objects such as vehicles and vulnerable road users are distorted by this transformation making it difficult to estimate their position relative to the sensor. This paper describes a methodology to obtain a corrected 360° BEV image given images from multiple vehicle-mounted cameras. The corrected BEV image is segmented into semantic classes and includes a prediction of occluded areas. The neural network approach does not rely on manually labeled data, but is trained on a synthetic dataset in such a way that it generalizes well to real-world data. By using semantically segmented images as input, we reduce the reality gap between simulated and real-world data and are able to show that our method can be successfully applied in the real world. Extensive experiments conducted on the synthetic data demonstrate the superiority of our approach compared to IPM. Source code and datasets are available at https://github.com/ika-rwth-aachen/Cam2BEV

Summary

The paper introduces a deep learning framework that transforms vehicle-mounted camera images into semantically segmented BEV images without the need for manual labeling.
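It trains on synthetic data and compares two architectures: a single-input DeepLabv3+ fed an IPM-precomputed image, and the multi-input uNetXST, which aligns its input streams with in-network spatial transformers. Both aim to improve spatial consistency and to predict occluded areas.
On the synthetic validation set, uNetXST achieves the highest MIoU of the tested configurations, demonstrating the potential of sim-to-real transfer to improve the environment perception of automated vehicles.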

A Sim2Real Deep Learning Approach for Vehicle-Mounted Camera Data Transformation to Bird's Eye View

The research paper presents a novel methodology designed to transform images captured by multiple vehicle-mounted cameras into semantically segmented images in a bird's eye view (BEV). This transformation is crucial in the context of automated vehicles (AVs), which rely heavily on precise environment perception for safety and operational efficacy. Traditional approaches, such as Inverse Perspective Mapping (IPM), while effective for flat surfaces, introduce significant distortions when applied to three-dimensional structures. This paper's contribution lies in addressing these limitations through a deep learning approach that bridges the sim-to-real gap.
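To see why IPM is exact for the road surface but distorts raised objects, consider a minimal sketch under simplifying assumptions that are not taken from the paper: an ideal pinhole camera with a horizontal optical axis, mounted at a known height above a flat road. The function names and all numeric values below are hypothetical and chosen only for illustration.

```python
def project_to_pixel(x_right, height, z_forward, f, cx, cy, cam_height):
    """Pinhole projection of a world point (metres) to pixel coordinates.

    Camera frame convention: x to the right, y downwards, z forwards.
    `height` is the point's height above the road surface.
    """
    y_down = cam_height - height
    u = cx + f * x_right / z_forward
    v = cy + f * y_down / z_forward
    return u, v


def ipm_pixel_to_ground(u, v, f, cx, cy, cam_height):
    """Inverse perspective mapping under the flat-world assumption.

    Every pixel is assumed to show a point on the road plane, so the
    viewing ray is intersected with the plane y = cam_height.
    """
    if v <= cy:
        raise ValueError("pixel lies on or above the horizon")
    z_forward = cam_height * f / (v - cy)
    x_right = z_forward * (u - cx) / f
    return x_right, z_forward


# Hypothetical camera: focal length 1000 px, principal point (640, 360),
# mounted 1.5 m above the road.
CAM = dict(f=1000.0, cx=640.0, cy=360.0, cam_height=1.5)

# A point on the road, 10 m ahead and 2 m to the right, is recovered exactly.
road_pixel = project_to_pixel(2.0, 0.0, 10.0, **CAM)
road_estimate = ipm_pixel_to_ground(*road_pixel, **CAM)

# A point 1 m above the road at the same distance (e.g. on a vehicle body)
# is placed much farther away by IPM, because its ray meets the road plane
# well behind the object.
raised_pixel = project_to_pixel(0.0, 1.0, 10.0, **CAM)
raised_estimate = ipm_pixel_to_ground(*raised_pixel, **CAM)
```

With these numbers the road point comes back at 10 m, while the raised point is mapped to roughly 30 m. This is the kind of stretching of three-dimensional objects that the learned approach is meant to correct.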

Methodology Overview

The authors propose a convolutional neural network-based methodology that does not depend on manually labeled real-world data. Instead, it capitalizes on synthetic datasets to generalize well to real-world scenarios. The transformation process incorporates semantic segmentation as a preprocessing step, which helps reduce the reality gap. The approach involves creating a corrected 360-degree BEV image using semantically segmented inputs, accurately predicting occluded areas, and employing IPM to guide spatial consistency during network learning.
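Feeding semantically segmented images to a network usually means turning each class-ID map into per-class channels. The sketch below shows one common way to do this (one-hot encoding); the summary does not specify the exact input encoding, so the function name and the three-class example are assumptions for illustration only.

```python
def one_hot_encode(label_map, num_classes):
    """Convert an H x W map of integer class IDs into an
    H x W x num_classes nested list with a single 1 per pixel."""
    return [
        [[1 if cls == k else 0 for k in range(num_classes)] for cls in row]
        for row in label_map
    ]


# Hypothetical 2 x 2 segmentation with classes 0 (road), 1 (vehicle), 2 (other).
segmented = [[0, 2],
             [1, 1]]
encoded = one_hot_encode(segmented, num_classes=3)
```

Because simulated and real cameras both produce the same class IDs after segmentation, an input of this kind looks the same to the network regardless of whether the underlying image was rendered or recorded.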

Two variations of network architectures are explored. The first is a single-input model that precomputes a homography image using IPM, thus enhancing spatial consistency between inputs and outputs. The DeepLabv3+ architecture is employed here with variations in network backbones. The second is a multi-input model, uNetXST, which incorporates multiple input streams aligned via in-network spatial transformers to rectify the spatial inconsistencies without distorting feature maps.
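The core operation behind both variants is a projective warp: the single-input model applies it to the images before the network, while uNetXST applies it to feature maps inside the network. The following is a minimal sketch of such a warp using inverse mapping with nearest-neighbour sampling. The sampling scheme, the function name, and the toy translation matrix are assumptions; the actual layer in the paper may interpolate differently.

```python
def warp_feature_map(feature, h_inv, out_h, out_w, fill=0.0):
    """Warp a 2D feature map with a 3 x 3 homography.

    `h_inv` maps output coordinates (column, row) back to input
    coordinates, so every output cell is filled by one lookup.
    Cells whose source falls outside the input receive `fill`.
    """
    in_h, in_w = len(feature), len(feature[0])
    out = [[fill] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            x = h_inv[0][0] * c + h_inv[0][1] * r + h_inv[0][2]
            y = h_inv[1][0] * c + h_inv[1][1] * r + h_inv[1][2]
            w = h_inv[2][0] * c + h_inv[2][1] * r + h_inv[2][2]
            if w == 0:
                continue  # point at infinity, no valid source cell
            src_c, src_r = round(x / w), round(y / w)
            if 0 <= src_c < in_w and 0 <= src_r < in_h:
                out[r][c] = feature[src_r][src_c]
    return out


# Toy example: a pure translation that reads each output cell from the
# input cell one column to its right.
shift_right_lookup = [[1, 0, 1],
                      [0, 1, 0],
                      [0, 0, 1]]
features = [[1, 2, 3],
            [4, 5, 6]]
warped = warp_feature_map(features, shift_right_lookup, out_h=2, out_w=3)
```

In a real setup the matrix would be the IPM homography of each camera rather than a translation, and the warp would run per channel on learned feature maps.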

Experimental Insights

The methodology was validated using a comprehensive synthetic dataset generated in a simulation environment, Virtual Test Drive (VTD). The dataset encompasses both realistic and semantically segmented images of a 360° surround view. Various network configurations were trained and compared using Intersection-over-Union (IoU) scores as the evaluation metric, focusing both on individual classes and the overall Mean IoU (MIoU).
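For reference, per-class IoU and MIoU can be computed as in the sketch below. It operates on flattened sequences of integer class IDs; the function names and the tiny example are hypothetical, and the paper's own evaluation code may treat absent classes differently.

```python
def per_class_iou(pred, truth, num_classes):
    """Return one IoU per class: |pred ∩ truth| / |pred ∪ truth|.

    Classes that occur in neither sequence get None instead of a score.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, truth):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            union[p] += 1
            union[t] += 1
    return [i / u if u else None for i, u in zip(inter, union)]


def mean_iou(ious):
    """Average the IoU over all classes that have a defined score."""
    valid = [x for x in ious if x is not None]
    if not valid:
        raise ValueError("no class has a defined IoU")
    return sum(valid) / len(valid)


# Six pixels, three classes.
predicted = [0, 0, 1, 1, 2, 2]
ground_truth = [0, 1, 1, 1, 2, 0]
ious = per_class_iou(predicted, ground_truth, num_classes=3)
miou = mean_iou(ious)
```

Here the per-class scores are 1/3, 2/3 and 1/2, which average to an MIoU of 0.5. Because MIoU weights every class equally, small classes such as pedestrians count as much as large ones such as road.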

The uNetXST model notably achieved the highest MIoU on the validation set, outperforming other network settings, including DeepLabv3+ with Xception and MobileNetV2 backbones. This demonstrates uNetXST’s ability to extract meaningful features from non-transformed images, averting early errors from IPM. The results affirm the potential of deep learning approaches in improving upon classical geometric methods, achieving substantially better accuracy and localization of dynamic objects.

Practical and Theoretical Implications

The presented work has significant implications for advancing AV technology. By overcoming the limitations of IPM through a robust learning framework, AVs can achieve more reliable environment perception, a cornerstone for real-world navigation and safety. The successful application of the methodology to real-world scenarios without relying on extensive manual labeling highlights the practicality of sim-to-real transfer. Moreover, the prediction of occluded areas makes explicit which parts of the scene the cameras cannot observe, which supports reasoning about complex scene geometry and dynamic scenes.

Future Directions

The research opens pathways for incorporating additional data inputs, such as depth information, which could be derived from stereo cameras or LiDAR systems to further enhance BEV transformations. Additionally, real-world testing with a full 360° multi-camera rig would provide further insights into the robustness and scalability of the proposed solution in dynamic real-world conditions. Addressing these aspects could cement the role of this methodology in forthcoming AV systems, paving the way for more sophisticated perception frameworks.

In summary, this paper contributes a significant advance in transforming vehicle camera data into actionable BEV representations, vital for enhancing automated driving systems. The proposed models effectively bridge the gap between simulation and real-world application, a meaningful step forward for automated driving.
