VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

(2312.03275)
Published Dec 6, 2023 in cs.RO and cs.AI

Abstract

Understanding how humans leverage semantic knowledge to navigate unfamiliar environments and decide where to explore next is pivotal for developing robots capable of human-like search behaviors. We introduce a zero-shot navigation approach, Vision-Language Frontier Maps (VLFM), which is inspired by human reasoning and designed to navigate towards unseen semantic objects in novel environments. VLFM builds occupancy maps from depth observations to identify frontiers, and leverages RGB observations and a pre-trained vision-language model to generate a language-grounded value map. VLFM then uses this map to identify the most promising frontier to explore for finding an instance of a given target object category. We evaluate VLFM in photo-realistic environments from the Gibson, Habitat-Matterport 3D (HM3D), and Matterport 3D (MP3D) datasets within the Habitat simulator. Remarkably, VLFM achieves state-of-the-art results on all three datasets as measured by success weighted by path length (SPL) for the Object Goal Navigation task. Furthermore, we show that VLFM's zero-shot nature enables it to be readily deployed on real-world robots such as the Boston Dynamics Spot mobile manipulation platform. We deploy VLFM on Spot and demonstrate its capability to efficiently navigate to target objects within an office building in the real world, without any prior knowledge of the environment. The accomplishments of VLFM underscore the promising potential of vision-language models in advancing the field of semantic navigation. Videos of real-world deployment can be viewed at naoki.io/vlfm.

Overview

  • The paper presents a novel zero-shot navigation approach, Vision-Language Frontier Maps (VLFM), which enables AI and robotics systems to navigate in unknown environments without prior knowledge or task-specific training.

  • VLFM constructs occupancy maps from depth observations to identify frontiers, and uses a pre-trained vision-language model on RGB observations to build a language-grounded value map for ranking those frontiers.

  • The robot initializes by spinning in place to build its maps, then explores by updating the maps and selecting waypoints, and finally transitions to goal navigation once the target object is detected.

  • VLFM's effectiveness is showcased through benchmarking, achieving state-of-the-art results in photorealistic simulation environments and a real-world test with a Boston Dynamics Spot robot.

  • Compared to both other zero-shot methods and models trained directly on the ObjectNav task, it improves success rate and efficiency as measured by Success weighted by Path Length (SPL).

Introduction to VLFM

In exploring unfamiliar environments, humans draw upon a wealth of semantic knowledge to navigate towards specific objects without any prior knowledge of the surroundings. Developing similar capabilities in AI and robotics systems is challenging yet pivotal for creating autonomous agents capable of navigating complex spaces. This is the domain of object goal navigation (ObjectNav), a task in which an agent must find an instance of a given object category in an unknown environment.

Vision-Language Frontier Maps (VLFM)

The paper introduces Vision-Language Frontier Maps (VLFM), a novel zero-shot navigation approach designed to harness the power of pre-trained vision-language models (VLMs). VLFM does not rely on pre-built maps, task-specific training, or prior knowledge of the environment. Instead, it constructs occupancy maps from depth observations to identify the frontiers of explored space: regions where known space meets the unknown, making them candidates for exploration. VLFM then employs a pre-trained vision-language model to generate a language-grounded value map, interpreting visual semantic cues to assess which of these frontiers are most likely to be fruitful in the search for the target object.
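
As a rough illustration of these two ingredients, the sketch below shows how frontier cells can be extracted from an occupancy grid and ranked by a value map. The grid conventions, function names, and scoring scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Assumed cell convention for the occupancy grid: 0 = unknown, 1 = free, 2 = obstacle.
UNKNOWN, FREE, OBSTACLE = 0, 1, 2

def find_frontiers(grid: np.ndarray) -> np.ndarray:
    """Return (row, col) indices of free cells that border at least one unknown cell."""
    free = grid == FREE
    unknown = grid == UNKNOWN
    neighbor_unknown = np.zeros_like(unknown)
    neighbor_unknown[1:, :] |= unknown[:-1, :]   # cell above is unknown
    neighbor_unknown[:-1, :] |= unknown[1:, :]   # cell below is unknown
    neighbor_unknown[:, 1:] |= unknown[:, :-1]   # cell to the left is unknown
    neighbor_unknown[:, :-1] |= unknown[:, 1:]   # cell to the right is unknown
    return np.argwhere(free & neighbor_unknown)

def rank_frontiers(frontiers: np.ndarray, value_map: np.ndarray) -> np.ndarray:
    """Sort frontier cells by their language-grounded value, highest first."""
    scores = value_map[frontiers[:, 0], frontiers[:, 1]]
    return frontiers[np.argsort(-scores)]
```

In VLFM, the value map itself is filled by projecting a per-frame VLM score (roughly, a cosine similarity between the current RGB image and a text prompt mentioning the target object) onto the cells visible in that frame; the sketch above assumes such a map is already available.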

How VLFM Functions

The approach can be broken down into initialization, exploration, and goal navigation phases, as sketched below. During initialization, the robot spins in place to build its initial frontier and value maps. During exploration, these maps are updated with each new observation, producing candidate frontier waypoints, from which the robot selects the one with the highest value for the sought-after object. Once the target is detected, the robot transitions into the goal navigation phase, navigating to the detected object and signaling completion upon reaching it.
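
A compact way to read this pipeline is as a three-phase decision rule. The sketch below is a simplified interpretation of that description rather than the authors' controller; the stop radius, the full-turn initialization criterion, and the input abstractions (detected target position, best frontier) are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from enum import Enum, auto

class Phase(Enum):
    INITIALIZE = auto()   # spin in place to seed the maps
    EXPLORE = auto()      # drive to the most promising frontier
    GO_TO_GOAL = auto()   # target detected: navigate to it

@dataclass
class Decision:
    phase: Phase
    waypoint: tuple | None  # (x, y) for the local planner; None means keep turning
    done: bool = False

def next_decision(phase, rotated_rad, detection_xy, best_frontier_xy,
                  robot_xy, stop_radius=1.0):
    """One high-level decision step, following the initialize -> explore -> goal phasing."""
    # A confirmed detection overrides exploration regardless of the current phase.
    if detection_xy is not None:
        reached = math.dist(robot_xy, detection_xy) < stop_radius
        return Decision(Phase.GO_TO_GOAL, detection_xy, done=reached)
    # Keep spinning until one full turn has populated the frontier and value maps.
    if phase is Phase.INITIALIZE and rotated_rad < 2 * math.pi:
        return Decision(Phase.INITIALIZE, None)
    # Otherwise head toward the frontier with the highest language-grounded value.
    return Decision(Phase.EXPLORE, best_frontier_xy)
```

In practice, the selected waypoint would be handed to a low-level point-goal policy or local planner that produces the actual motion commands.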

A Leap in the Field of Semantic Navigation

The paper presents evidence of VLFM's effectiveness through benchmarking in photorealistic simulation environments and in a real-world office space using a Boston Dynamics Spot robot. The technique achieved state-of-the-art results across three major datasets, improving both success rate and efficiency as measured by Success weighted by Path Length (SPL). These gains hold against both other zero-shot methods and models trained directly on the ObjectNav task, underlining VLFM's potential to open new frontiers in semantic navigation.
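
For reference, SPL weights each successful episode by how close the agent's path came to the shortest possible path, so success alone is not enough:

```latex
\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i,\ \ell_i)}
```

where S_i indicates success on episode i, l_i is the shortest (geodesic) path length from the start to the nearest valid goal, and p_i is the length of the path the agent actually took. An agent that finds the target but travels twice the shortest distance scores 0.5 on that episode.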
