DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Published 19 Feb 2024 in cs.CV | (2402.12289v5)

Abstract: A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.

Summary

The paper introduces DriveVLM and DriveVLM-Dual architectures that merge Vision-Language Models with traditional planning to enhance scene understanding.
It applies a Chain-of-Thought reasoning process alongside high-frequency planning modules, improving real-time performance and 3D spatial reasoning.
The study offers new evaluation metrics via the SUP-AD dataset, paving the way for more interpretable and robust autonomous driving systems.

Overview of DriveVLM: The Convergence of Autonomous Driving and Vision-Language Models

Autonomous driving in complex urban environments is still held back by scene understanding, particularly in unpredictable, long-tail scenarios such as adverse weather, intricate road layouts, and unusual human behaviors. Recent advances in Vision-Language Models (VLMs) offer a way to extend autonomous vehicles beyond traditional perception and planning systems. In this context, the paper "DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models" introduces DriveVLM and DriveVLM-Dual and shows how VLMs can be used to improve scene understanding and planning in autonomous driving.

DriveVLM aims to improve decision-making in autonomous vehicles by applying Vision-Language Models to scene understanding and planning. Its architecture is built around a Chain-of-Thought (CoT) reasoning process comprising three modules: scene description, scene analysis, and hierarchical planning. The process is designed to identify critical objects in the driving environment and assess their influence on the ego vehicle. By describing the scene in language and predicting interactions at the decision level rather than only at the trajectory level, DriveVLM is intended to help autonomous vehicles handle complex and dynamic driving scenarios more effectively.
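The three-stage reasoning chain described above can be illustrated with a minimal sketch. This is not the paper's implementation: the `VLM` callable, the `CoTResult` type, the `run_cot` function, and the prompt wording are all hypothetical, and a real system would pass camera images rather than a text observation. The sketch only shows the chaining, where each stage receives the output of the earlier stages.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a VLM call: text prompt in, text out.
# Image inputs are omitted to keep the sketch self-contained.
VLM = Callable[[str], str]


@dataclass
class CoTResult:
    description: str  # output of the scene description stage
    analysis: str     # output of the scene analysis stage
    plan: str         # output of the hierarchical planning stage


def run_cot(vlm: VLM, observation: str) -> CoTResult:
    """Chain the three stages so each one conditions on the earlier outputs."""
    # Stage 1: describe the scene in language.
    description = vlm(f"Describe the driving scene: {observation}")

    # Stage 2: identify critical objects and their influence on the ego vehicle.
    analysis = vlm(
        f"Scene: {description}\n"
        "Identify critical objects and how each affects the ego vehicle."
    )

    # Stage 3: plan at the decision level first, then at the trajectory level.
    plan = vlm(
        f"Scene: {description}\n"
        f"Analysis: {analysis}\n"
        "Give a high-level driving decision, then trajectory waypoints."
    )
    return CoTResult(description, analysis, plan)
```

Any function that maps a prompt string to a response string can be passed as `vlm`, which makes the chain easy to exercise with a stub before wiring in an actual model.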

To address the limitations of VLMs, such as their computational intensity and challenges in spatial reasoning, the paper proposes the hybrid DriveVLM-Dual system. This system combines DriveVLM with traditional high-frequency planning modules to improve real-time capabilities and 3D spatial understanding without compromising the robustness of VLMs in scene comprehension. Experimental results indicate that DriveVLM-Dual outperforms existing end-to-end motion planning approaches, particularly in challenging driving conditions as demonstrated on the nuScenes dataset and the novel SUP-AD dataset.
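The division of labor in such a hybrid system can be sketched as a loop in which a slow planner updates a reference only occasionally while a fast planner runs on every tick using the most recent reference. This is an illustrative assumption about how the two rates might be combined, not the paper's design: `run_dual`, `slow_plan`, `fast_refine`, and `slow_period` are hypothetical names, and the loop is synchronous for simplicity, whereas a deployed system would likely run the two planners asynchronously.

```python
from typing import Callable, List, Optional, Tuple

Waypoint = Tuple[float, float]
Trajectory = List[Waypoint]

# Hypothetical interfaces: the slow planner stands in for the VLM branch,
# the fast planner for the traditional high-frequency planning module.
SlowPlanner = Callable[[int], Trajectory]
FastPlanner = Callable[[int, Optional[Trajectory]], Trajectory]


def run_dual(
    slow_plan: SlowPlanner,
    fast_refine: FastPlanner,
    n_ticks: int,
    slow_period: int,
) -> List[Trajectory]:
    """Run the fast planner every tick and the slow planner every slow_period ticks."""
    reference: Optional[Trajectory] = None
    outputs: List[Trajectory] = []
    for tick in range(n_ticks):
        # Low-frequency branch: refresh the coarse reference trajectory.
        if tick % slow_period == 0:
            reference = slow_plan(tick)
        # High-frequency branch: refine using the latest available reference.
        outputs.append(fast_refine(tick, reference))
    return outputs
```

The point of the structure is that the expensive model never blocks the control loop: between slow updates, the fast planner keeps producing trajectories from the last reference it received.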

The research has implications for the direction of autonomous vehicle technology. Integrating large Vision-Language Models into the autonomous driving stack suggests a shift towards more interpretable and flexible models that can understand and react to complex driving environments. Moreover, the cooperative approach in DriveVLM-Dual could serve as a framework for future real-time autonomous driving systems, since it addresses both the computational demands and the interpretability constraints of existing models.

The SUP-AD dataset introduced in the paper, created through an innovative data mining and annotation process, provides an invaluable resource for evaluating autonomous driving systems in diverse and challenging scenarios. By offering new evaluation metrics for scene understanding and planning tasks, this research advances the field's ability to gauge the efficacy of models like DriveVLM in handling real-world complexities.

In conclusion, this work makes a case for the use of Vision-Language Models in autonomous driving and demonstrates their potential to change how machines comprehend and navigate the world. As large VLMs continue to evolve, their application in domains such as autonomous driving underscores the need for interdisciplinary approaches and robust evaluation methods. Future research could build on this foundation by refining the integration techniques and exploring more general applications of VLMs across other autonomous systems.