NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario

Published 24 May 2023 in cs.CV | (2305.14836v2)

Abstract: We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in autonomous driving scenario presents more challenges. Firstly, the raw visual data are multi-modal, including images and point clouds captured by camera and LiDAR, respectively. Secondly, the data are multi-frame due to the continuous, real-time acquisition. Thirdly, the outdoor scenes exhibit both moving foreground and static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. Subsequently, the question-answer pairs are generated programmatically based on these templates. Comprehensive statistics prove that our NuScenes-QA is a balanced large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Codes and dataset are available at https://github.com/qiantianwen/NuScenes-QA.


Authors (5)

Citations (82)


Summary

The paper presents NuScenes-QA, a novel multi-modal VQA benchmark specifically designed for autonomous driving scenarios.
It comprises 34,149 visual scenes and approximately 460,000 question-answer pairs, combining camera images and LiDAR point clouds to capture dynamic, real-world driving conditions.
Empirical evaluations show that multi-modal fusion improves accuracy, yet all baselines still trail models that are given ground-truth object inputs.

Overview of the NuScenes-QA Benchmark for Autonomous Driving VQA

In this study, the authors introduce NuScenes-QA, a Visual Question Answering (VQA) benchmark designed to address the complexities of autonomous driving scenarios. It differs from earlier VQA work in three respects: the data are multi-modal (camera images and LiDAR point clouds), multi-frame (captured continuously), and outdoor (moving foreground objects against a static background). Traditional VQA benchmarks typically pair questions with single images, and earlier 3D VQA benchmarks such as ScanQA cover static indoor scenes, so neither captures dynamic real-world driving environments.

Dataset and Methodology

NuScenes-QA was motivated by the inability of existing VQA datasets to represent the autonomous driving setting. The benchmark comprises 34,149 visual scenes and approximately 460,000 question-answer pairs, generated from scene graphs that are derived from existing 3D detection annotations. It is significantly larger than previous 3D VQA efforts such as ScanQA, which relies on handcrafted questions over a smaller set of indoor scenes.

Questions are generated from manually designed templates covering five types: existence, counting, object recognition, status querying, and comparison. Each template requires either zero-hop or one-hop reasoning. Scene graphs are built automatically from the 3D detection annotations, and question-answer pairs are then produced programmatically by filling the templates from those graphs, which keeps the question set large, balanced, and varied in format.
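The template-filling step can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `SceneObject` fields, the two template wordings, and the toy scene are assumptions made for illustration, not the paper's actual scene-graph schema or templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    # Hypothetical scene-graph node; field names are assumptions.
    category: str   # e.g. "car", "pedestrian"
    status: str     # e.g. "moving", "parked"
    relation: str   # position relative to the ego car, e.g. "front"


def select(objects, category=None, status=None, relation=None):
    """Return the objects matching every attribute that is not None."""
    return [
        o for o in objects
        if (category is None or o.category == category)
        and (status is None or o.status == status)
        and (relation is None or o.relation == relation)
    ]


def exist_question(objects, category, status):
    """Fill an illustrative existence template and compute its answer."""
    question = f"Are there any {status} {category}s?"
    answer = "yes" if select(objects, category=category, status=status) else "no"
    return question, answer


def count_question(objects, category, relation):
    """Fill an illustrative counting template and compute its answer."""
    question = f"How many {category}s are to the {relation} of me?"
    answer = str(len(select(objects, category=category, relation=relation)))
    return question, answer


# Toy scene, invented for this example.
scene = [
    SceneObject("car", "moving", "front"),
    SceneObject("car", "parked", "front"),
    SceneObject("pedestrian", "moving", "back left"),
]

print(exist_question(scene, "car", "parked"))
print(count_question(scene, "car", "front"))
```

Because each answer is computed from the same structured annotations that fill the question, no manual labeling of individual question-answer pairs is needed once the templates exist.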

Baseline Models and Evaluation

The authors provide multiple baselines built on existing 3D perception and VQA techniques, spanning image-based, point cloud-based, and multi-modal fusion approaches. For feature extraction, BEVDet serves as the camera-based detector, CenterPoint as the LiDAR-based detector, and MSMDFusion as the camera-LiDAR fusion detector, so results can be compared across modalities.

Empirical evaluations reveal that multi-modal systems, which integrate image and LiDAR data, achieve the highest accuracy, underscoring how the two modalities complement each other in understanding complex street scenes. However, performance lags significantly behind that of models given perfect ground-truth object inputs, indicating substantial room for improvement in real-world VQA tasks.
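As a rough illustration of how such accuracy figures could be tabulated, the sketch below computes exact-match accuracy overall and per question type. The per-type breakdown, the `accuracy_by_type` helper, and the record layout are assumptions for illustration, and the toy records are invented rather than taken from the paper's results.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Compute exact-match accuracy per question type and overall.

    `records` is an iterable of (question_type, predicted, expected)
    tuples; this layout is an assumption made for the example.
    """
    records = list(records)
    if not records:
        return {}, 0.0

    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, predicted, expected in records:
        total[qtype] += 1
        correct[qtype] += int(predicted == expected)

    per_type = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_type, overall


# Invented toy predictions, not results from the paper.
records = [
    ("exist", "yes", "yes"),
    ("exist", "no", "yes"),
    ("count", "2", "2"),
    ("count", "3", "2"),
    ("status", "moving", "moving"),
]

per_type, overall = accuracy_by_type(records)
print(per_type)
print(overall)
```

Running the same function on predictions from detector-based features and from ground-truth object inputs would give one way to quantify the gap described above.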

Implications and Future Directions

NuScenes-QA has implications for both research and practical applications in AI and autonomous driving. The results motivate work on better multi-modal fusion strategies that exploit the distinct strengths of image and point cloud data. Furthermore, the gap between detector-based baselines and models given ground-truth object inputs points to opportunities in improving 3D detection and reasoning models for vehicle autonomy.

Beyond benchmarking, the work opens several directions for future research: developing QA-head architectures tailored to outdoor dynamics, increasing the textual diversity of the question set, and integrating perceptual tasks such as object tracking to broaden the dataset's practical utility.

In conclusion, NuScenes-QA is the first VQA benchmark for the autonomous driving scenario. It challenges existing visual reasoning methods with multi-modal, multi-frame outdoor data and gives researchers a common basis for measuring progress in question answering over real-world driving scenes.
