Bridging Zero-shot Object Navigation and Foundation Models through Pixel-Guided Navigation Skill

Published 19 Sep 2023 in cs.RO | (2309.10309v2)

Abstract: Zero-shot object navigation is a challenging task for home-assistance robots. This task emphasizes visual grounding, commonsense inference and locomotion abilities, where the first two are inherent in foundation models. But for the locomotion part, most works still depend on map-based planning approaches. The gap between RGB space and map space makes it difficult to directly transfer the knowledge from foundation models to navigation tasks. In this work, we propose a Pixel-guided Navigation skill (PixNav), which bridges the gap between the foundation models and the embodied navigation task. It is straightforward for recent foundation models to indicate an object by pixels, and with pixels as the goal specification, our method becomes a versatile navigation policy towards all different kinds of objects. Besides, our PixNav is a pure RGB-based policy that can reduce the cost of home-assistance robots. Experiments demonstrate the robustness of the PixNav which achieves 80+% success rate in the local path-planning task. To perform long-horizon object navigation, we design an LLM-based planner to utilize the commonsense knowledge between objects and rooms to select the best waypoint. Evaluations across both photorealistic indoor simulators and real-world environments validate the effectiveness of our proposed navigation strategy. Code and video demos are available at https://github.com/wzcai99/Pixel-Navigator.


Summary

The paper presents PixNav, a novel pixel-guided navigation policy that replaces map-based methods with a simple, efficient RGB-only input.
It integrates foundation models and large language models to transform visual data into actionable textual plans for effective zero-shot navigation.
Empirical results demonstrate competitive success rates and robust performance in long-horizon, real-world environments with cost-effective hardware.

The paper under review introduces a novel approach to zero-shot object navigation, a task of considerable importance in the development of home-assistance robots. The focus lies on bridging the gap between foundation models, which are known for their visual and language perception capabilities, and robot locomotion, which has traditionally relied on map-based planning methods. The proposed solution, termed Pixel-guided Navigation skill (PixNav), is a pure RGB-based navigation policy. It stands in contrast to map-based systems, which require depth sensing and can be cost-prohibitive.

Overview of Contributions

The core contributions of this paper are primarily centered around three areas:

Pixel Navigation: The authors propose a pixel-guided navigation policy as a substitute for traditional path-planning methods in map-based navigation tasks. PixNav relies solely on RGB input, simplifying hardware requirements without sacrificing navigational efficacy.
Integration with Foundation Models: The research leverages the strong zero-shot recognition capabilities of foundation models for navigation tasks. The proposed system pairs the foundation models' robust visual perception with the pixel navigation policy: the foundation model indicates the target object by pixels, and those pixels become the navigation goal.
Utilization of LLMs: A hierarchical policy is introduced where LLMs serve as planners, utilizing commonsense priors to enhance the robot's path-planning capabilities. This involves transforming visual data into textual inputs, enabling sophisticated decision-making processes.

Methodological Details

PixNav transforms object navigation into pixel-targeting, where the task of navigating to an object is redefined as reaching a designated pixel. This approach effectively leverages RGB-based data, circumventing the need for depth perception. The acquisition of training data for PixNav is notably more efficient, as it can generate diverse trajectories by specifying navigation goals with different pixels, as opposed to the single trajectory constraint of object goal navigation.
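To make the pixel-goal idea concrete, the sketch below turns a binary object mask (such as a segmentation or detection model might produce) into a single goal pixel. This is only an illustration under assumptions: the function name `mask_to_goal_pixel` and the centroid-based rule are hypothetical and are not taken from the paper, which may select goal pixels differently.

```python
from typing import List, Tuple


def mask_to_goal_pixel(mask: List[List[int]]) -> Tuple[int, int]:
    """Pick one goal pixel (row, col) from a binary object mask.

    Hypothetical helper: returns the mask pixel closest to the mask
    centroid, so the goal always lies on the object even when the
    mask is non-convex (a plain centroid could fall outside it).
    """
    # Collect coordinates of all pixels that belong to the object.
    coords = [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value
    ]
    if not coords:
        raise ValueError("mask contains no object pixels")

    # Centroid of the object pixels (may be fractional).
    centroid_r = sum(r for r, _ in coords) / len(coords)
    centroid_c = sum(c for _, c in coords) / len(coords)

    # Snap to the nearest pixel that is actually part of the mask.
    return min(
        coords,
        key=lambda p: (p[0] - centroid_r) ** 2 + (p[1] - centroid_c) ** 2,
    )


if __name__ == "__main__":
    example_mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
    ]
    print(mask_to_goal_pixel(example_mask))
```

Because any pixel in the frame can serve as a goal, the same interface would also let a data-collection script sample many different goals from one scene, which is the efficiency argument made above.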

In practical implementations, PixNav is coupled with a vision-language model, LLaMA-Adapter, which converts panoramic visual observations into detailed textual descriptions. These descriptions help an LLM craft a navigation plan. The planning framework summarizes and clusters the spatial environment into a structured format that guides efficient room-to-room navigation.
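The description-to-plan step can be sketched as prompt construction plus reply parsing. The code below is a minimal illustration, not the authors' prompt: the function names, the prompt wording, and the `Direction: <name>` reply format are all assumptions, and the call to an actual LLM is deliberately left out.

```python
import re
from typing import Dict, Optional


def build_planner_prompt(target: str, descriptions: Dict[str, str]) -> str:
    """Assemble a text prompt from per-direction scene descriptions.

    `descriptions` maps a direction label (e.g. "front") to the text a
    vision-language model produced for that view. Hypothetical format.
    """
    lines = [
        f"You are guiding a robot that must find a {target}.",
        "Each line describes what is visible in one direction:",
    ]
    for direction, text in descriptions.items():
        lines.append(f"- {direction}: {text}")
    lines.append(
        "Reply with the direction most likely to lead to the target, "
        "formatted as 'Direction: <name>'."
    )
    return "\n".join(lines)


def parse_direction(reply: str, descriptions: Dict[str, str]) -> Optional[str]:
    """Extract the chosen direction from the LLM reply.

    Returns the matching key from `descriptions`, or None when the reply
    does not follow the assumed format or names an unknown direction.
    """
    match = re.search(r"Direction:\s*(\w+)", reply, re.IGNORECASE)
    if match is None:
        return None
    # Compare case-insensitively, but return the caller's original key.
    by_lower = {key.lower(): key for key in descriptions}
    return by_lower.get(match.group(1).lower())


if __name__ == "__main__":
    views = {
        "front": "a hallway with a doorway to a tiled room",
        "left": "a sofa and a coffee table",
    }
    print(build_planner_prompt("toilet", views))
    print(parse_direction("Direction: Front", views))
```

Returning `None` on a malformed reply lets the caller fall back to a default exploration behavior instead of acting on an unparsed answer.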

Evaluation and Findings

Empirical evaluations on the HM3D dataset demonstrate the proposed method's competence. The abstract reports a success rate above 80% on the local path-planning task. PixNav's ability to generalize across varying RGB camera settings indicates robustness and potential applicability in varied real-world contexts. Compared to conventional zero-shot object navigation baselines, PixNav achieves competitive success rates and promising SPL (Success weighted by Path Length) scores.
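For readers unfamiliar with SPL, the metric averages, over episodes, the success indicator multiplied by the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length. The sketch below computes it under that standard definition; the function and parameter names are illustrative, and the numbers in the usage example are made up rather than results from the paper.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Per episode: success * shortest / max(agent_path, shortest).
    A failed episode contributes 0; a successful episode contributes
    at most 1, reached when the agent follows a shortest path.
    """
    n = len(successes)
    if n == 0:
        raise ValueError("at least one episode is required")
    if len(shortest_lengths) != n or len(path_lengths) != n:
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        denominator = max(taken, shortest)
        # Skip failed episodes and degenerate zero-length episodes.
        if success and denominator > 0:
            total += shortest / denominator
    return total / n


if __name__ == "__main__":
    # Illustrative episodes: optimal success, 2x-long success, failure.
    print(spl([True, True, False], [5.0, 5.0, 5.0], [5.0, 10.0, 7.0]))
```

SPL is always at most the plain success rate, so a method can match a baseline on success while trailing on SPL if it takes longer routes.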

In the context of long-horizon navigation, the integration of an LLM-based planner proves beneficial. Through methodical prompting, the LLM draws on commonsense knowledge about how objects relate to rooms to select the next waypoint, which lets the robot explore complex environments more purposefully.

Implications and Future Directions

The implications of this research are multifaceted. Practically, PixNav offers an accessible and cost-effective navigation solution by eliminating the need for complex sensory inputs beyond RGB. Theoretically, this work opens avenues for further exploration of non-traditional sensory inputs in robotics, particularly the role of pixels as a target in navigation systems.

Future developments might focus on fine-tuning data-driven policies on large-scale, diverse datasets, potentially enhancing the long-horizon navigational capabilities of PixNav. Additionally, extending the methodology to other modalities such as LiDAR or multispectral imaging for complex navigational environments could yield valuable insights.

In summary, the paper presents a compelling argument for the viability of pixel-guided navigation in zero-shot object navigation tasks. It leverages the capabilities of foundation models and LLMs, presenting a significant step forward in the quest for efficient, versatile, and scalable navigation systems for home-assistance robots.
