VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

(2312.03275)
Published Dec 6, 2023 in cs.RO and cs.AI

Abstract

Understanding how humans leverage semantic knowledge to navigate unfamiliar environments and decide where to explore next is pivotal for developing robots capable of human-like search behaviors. We introduce a zero-shot navigation approach, Vision-Language Frontier Maps (VLFM), which is inspired by human reasoning and designed to navigate towards unseen semantic objects in novel environments. VLFM builds occupancy maps from depth observations to identify frontiers, and leverages RGB observations and a pre-trained vision-language model to generate a language-grounded value map. VLFM then uses this map to identify the most promising frontier to explore for finding an instance of a given target object category. We evaluate VLFM in photo-realistic environments from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D) datasets within the Habitat simulator. Remarkably, VLFM achieves state-of-the-art results on all three datasets as measured by success weighted by path length (SPL) for the Object Goal Navigation task. Furthermore, we show that VLFM's zero-shot nature enables it to be readily deployed on real-world robots such as the Boston Dynamics Spot mobile manipulation platform. We deploy VLFM on Spot and demonstrate its capability to efficiently navigate to target objects within an office building in the real world, without any prior knowledge of the environment. The accomplishments of VLFM underscore the promising potential of vision-language models in advancing the field of semantic navigation. Videos of real-world deployment can be viewed at naoki.io/vlfm.

Overview

  • The paper presents a novel zero-shot navigation approach, Vision-Language Frontier Maps (VLFM), which enables AI and robotics systems to navigate in unknown environments without prior knowledge or task-specific training.

  • VLFM constructs occupancy maps from depth observations to identify frontiers, and uses a pre-trained vision-language model on RGB observations to build a language-grounded value map for ranking those frontiers.

  • The robot initializes by spinning in place to build its maps, then explores by updating the maps and selecting waypoints, and finally transitions to goal navigation once the target object is detected.

  • VLFM's effectiveness is showcased through benchmarking, achieving state-of-the-art results in photorealistic simulation environments and a real-world test with a Boston Dynamics Spot robot.

  • Compared to both other zero-shot methods and models trained directly on the ObjectNav task, it improves success rate and efficiency as measured by Success weighted by Path Length (SPL).

Introduction to VLFM

In exploring unfamiliar environments, humans draw upon a wealth of semantic knowledge to navigate towards specific objects without any prior knowledge of the surroundings. Developing similar capabilities in AI and robotics systems is challenging yet pivotal for creating autonomous agents capable of navigating complex spaces. This is the domain of object goal navigation (ObjectNav), a task in which an agent must find an instance of a given object category in an unknown environment.

Vision-Language Frontier Maps (VLFM)

The paper introduces Vision-Language Frontier Maps (VLFM), a novel zero-shot navigation approach designed to harness the power of pre-trained vision-language models (VLMs). VLFM does not rely on pre-built maps, task-specific training, or prior knowledge of the environment. Instead, it constructs occupancy maps from depth observations to identify the frontiers of explored space: regions where known space meets the unknown, making them candidates for exploration. VLFM then employs a pre-trained vision-language model to generate a language-grounded value map, interpreting visual semantic cues to assess which of these frontiers are most likely to be fruitful in the search for the target object.
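
As a rough illustration of these two ingredients, the sketch below shows how frontier cells can be extracted from an occupancy grid and ranked by a value map. The grid conventions, function names, and scoring scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Assumed cell convention for the occupancy grid: 0 = unknown, 1 = free, 2 = obstacle.
UNKNOWN, FREE, OBSTACLE = 0, 1, 2

def find_frontiers(grid: np.ndarray) -> np.ndarray:
    """Return (row, col) indices of free cells that border at least one unknown cell."""
    free = grid == FREE
    unknown = grid == UNKNOWN
    neighbor_unknown = np.zeros_like(unknown)
    neighbor_unknown[1:, :] |= unknown[:-1, :]   # cell above is unknown
    neighbor_unknown[:-1, :] |= unknown[1:, :]   # cell below is unknown
    neighbor_unknown[:, 1:] |= unknown[:, :-1]   # cell to the left is unknown
    neighbor_unknown[:, :-1] |= unknown[:, 1:]   # cell to the right is unknown
    return np.argwhere(free & neighbor_unknown)

def rank_frontiers(frontiers: np.ndarray, value_map: np.ndarray) -> np.ndarray:
    """Sort frontier cells by their language-grounded value, highest first."""
    scores = value_map[frontiers[:, 0], frontiers[:, 1]]
    return frontiers[np.argsort(-scores)]
```

In VLFM, the value map itself is filled by projecting a per-frame VLM score (roughly, a cosine similarity between the current RGB image and a text prompt mentioning the target object) onto the cells visible in that frame; the sketch above assumes such a map is already available.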

How VLFM Functions

The approach can be broken down into initialization, exploration, and goal navigation phases, as sketched below. During initialization, the robot spins in place to build its initial frontier and value maps. During exploration, these maps are updated with each new observation, producing candidate frontier waypoints, from which the robot selects the one with the highest value for the sought-after object. Once the target is detected, the robot transitions into the goal navigation phase, navigating to the detected object and signaling completion upon reaching it.
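
A compact way to read this pipeline is as a three-phase decision rule. The sketch below is a simplified interpretation of that description rather than the authors' controller; the stop radius, the full-turn initialization criterion, and the input abstractions (detected target position, best frontier) are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from enum import Enum, auto

class Phase(Enum):
    INITIALIZE = auto()   # spin in place to seed the maps
    EXPLORE = auto()      # drive to the most promising frontier
    GO_TO_GOAL = auto()   # target detected: navigate to it

@dataclass
class Decision:
    phase: Phase
    waypoint: tuple | None  # (x, y) for the local planner; None means keep turning
    done: bool = False

def next_decision(phase, rotated_rad, detection_xy, best_frontier_xy,
                  robot_xy, stop_radius=1.0):
    """One high-level decision step, following the initialize -> explore -> goal phasing."""
    # A confirmed detection overrides exploration regardless of the current phase.
    if detection_xy is not None:
        reached = math.dist(robot_xy, detection_xy) < stop_radius
        return Decision(Phase.GO_TO_GOAL, detection_xy, done=reached)
    # Keep spinning until one full turn has populated the frontier and value maps.
    if phase is Phase.INITIALIZE and rotated_rad < 2 * math.pi:
        return Decision(Phase.INITIALIZE, None)
    # Otherwise head toward the frontier with the highest language-grounded value.
    return Decision(Phase.EXPLORE, best_frontier_xy)
```

In practice, the selected waypoint would be handed to a low-level point-goal policy or local planner that produces the actual motion commands.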

A Leap in the Field of Semantic Navigation

The paper presents evidence of VLFM's effectiveness through benchmarking in photorealistic simulation environments and in a real-world office space using a Boston Dynamics Spot robot. The technique achieved state-of-the-art results across three major datasets, improving both success rate and efficiency as measured by Success weighted by Path Length (SPL). These gains hold against both other zero-shot methods and models trained directly on the ObjectNav task, underlining VLFM's potential to open new frontiers in semantic navigation.
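
For reference, SPL weights each successful episode by how close the agent's path came to the shortest possible path, so success alone is not enough:

```latex
\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i,\ \ell_i)}
```

where S_i indicates success on episode i, l_i is the shortest (geodesic) path length from the start to the nearest valid goal, and p_i is the length of the path the agent actually took. An agent that finds the target but travels twice the shortest distance scores 0.5 on that episode.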
