ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments

Published 6 Apr 2023 in cs.CV, cs.CL, and cs.RO | (2304.03047v3)

Abstract: Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It becomes increasingly crucial in the field of embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we propose to address a more practical yet challenging counterpart setting - vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along a traversed path, without prior environmental experience. It privileges the agent to break down the navigation procedure into high-level planning and low-level control. Concurrently, ETPNav utilizes a transformer-based cross-modal planner to generate navigation plans based on topological maps and instructions. The plan is then performed through an obstacle-avoiding controller that leverages a trial-and-error heuristic to prevent navigation from getting stuck in obstacles. Experimental results demonstrate the effectiveness of the proposed method. ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets, respectively. Our code is available at https://github.com/MarSaKi/ETPNav.

Summary

The paper presents a novel framework that dynamically constructs topological maps to enable real-time vision-language navigation in continuous environments.
It leverages a transformer-based cross-modal planner with a rotate-then-forward control scheme and trial-and-error obstacle avoidance.
ETPNav improves over the prior state of the art by more than 10% on the R2R-CE benchmark and by more than 20% on RxR-CE.

The paper "ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments" introduces ETPNav, a novel framework specifically designed to address the challenges posed by vision-language navigation (VLN) in continuous environments. This task is a crucial component of embodied AI systems, aiming to enable autonomous entities to interpret and execute natural language instructions for navigation in complex, real-world terrains. The paper acknowledges the limitations of existing VLN solutions, which often simplify navigation by restricting it to predefined discrete graphs, thereby failing to reflect the intricacies of real-world navigation.

ETPNav addresses these challenges with an architecture that uses topological maps for navigation planning and control in continuous spaces. The framework comprises three modules: an online topological mapping process, a transformer-based cross-modal navigation planner, and a controller designed to avoid obstacles.

Methodology

Topological Mapping:

The ETPNav framework constructs a topological map online, a process inspired by cognitive science principles. This map abstracts visited or observed locations into graph representations with nodes and edges, reflecting place connectivity and distance.
Unlike previous approaches that require either predefined graphs or pre-explored environment data, ETPNav builds these maps online by self-organizing predicted waypoints. The waypoints are predicted from depth input alone, which emphasizes spatial accessibility rather than RGB semantics. This design improves generalization to new environments.
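
The self-organizing idea can be illustrated with a small sketch. This is not the authors' implementation: the `TopoMap` and `Node` classes, the `merge_radius` value, and the running-average merge rule are assumptions chosen for illustration. A predicted waypoint that falls close to an existing node is merged into it, and any other waypoint becomes a new unvisited node connected to the agent's current node.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    pos: tuple             # (x, y) position in metres
    visited: bool = False  # True once the agent has stood at this node
    hits: int = 1          # number of predictions averaged into pos


@dataclass
class TopoMap:
    merge_radius: float = 0.5                # hypothetical threshold, not from the paper
    nodes: list = field(default_factory=list)
    edges: set = field(default_factory=set)  # undirected pairs (i, j) with i < j
    current: Optional[int] = None            # index of the node the agent occupies

    def _nearest(self, pos):
        """Return (index, distance) of the closest node, or (None, inf) if empty."""
        best, best_d = None, math.inf
        for i, node in enumerate(self.nodes):
            d = math.dist(node.pos, pos)
            if d < best_d:
                best, best_d = i, d
        return best, best_d

    def _add(self, pos):
        self.nodes.append(Node(tuple(pos)))
        return len(self.nodes) - 1

    def _connect(self, i, j):
        if i != j:
            self.edges.add((min(i, j), max(i, j)))

    def update(self, agent_pos, waypoints):
        """Register the agent's location and the waypoints predicted around it."""
        cur, d = self._nearest(agent_pos)
        if cur is None or d > self.merge_radius:
            cur = self._add(agent_pos)
        self.nodes[cur].visited = True
        if self.current is not None:
            self._connect(self.current, cur)
        self.current = cur

        for wp in waypoints:
            j, d = self._nearest(wp)
            if d > self.merge_radius:
                j = self._add(wp)                 # unseen place: new unvisited node
            elif not self.nodes[j].visited:
                node = self.nodes[j]              # seen before: average the estimate
                k = node.hits
                node.pos = tuple((p * k + q) / (k + 1) for p, q in zip(node.pos, wp))
                node.hits = k + 1
            self._connect(cur, j)
        return cur
```

Calling `update` once per step with the agent's position and the current waypoint predictions grows the graph along the traversed path, with no map required in advance.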

Cross-modal Planning:

The planner integrates language and visual inputs via a cross-modal graph encoder that uses a Graph-Aware Self-Attention mechanism. This improves the model's ability to capture the spatial layout and connectivity information needed for navigation.
Navigation is decomposed into two stages: the planner first generates a long-term plan over the topological map, and the plan is then executed as a sequence of subgoals that guide the agent to the destination.
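
The decomposition into a long-term goal and a chain of subgoals can be sketched as follows. This is a simplified illustration, not the paper's planner: the per-node scores stand in for the output of the cross-modal transformer, and the function names, the adjacency format, and the use of Dijkstra's algorithm are assumptions.

```python
import heapq


def shortest_path(adj, start, goal):
    """Dijkstra over {node: {neighbor: edge_length}}; returns a node list or None."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def plan_subgoals(adj, current, node_scores):
    """Pick the best-scoring node as the long-term goal and list the subgoals to it."""
    goal = max(node_scores, key=node_scores.get)
    path = shortest_path(adj, current, goal)
    return [] if path is None else path[1:]


# Hypothetical four-node map; scores are placeholders for planner output.
adj = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"A": 1.0, "C": 1.5, "D": 5.0},
    "C": {"A": 4.0, "B": 1.5, "D": 1.0},
    "D": {"B": 5.0, "C": 1.0},
}
scores = {"A": 0.05, "B": 0.10, "C": 0.15, "D": 0.70}
subgoals = plan_subgoals(adj, "A", scores)  # ["B", "C", "D"]
```

Each subgoal is then handed to the low-level controller in turn, so the planner reasons over graph nodes while the controller deals with continuous motion.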

Control Mechanism:

ETPNav employs a rotate-then-forward control scheme complemented by a trial-and-error heuristic for obstacle avoidance. This heuristic, referred to as Tryout, is crucial in sliding-forbidden scenarios, where the agent cannot slide along an obstacle on contact and is therefore prone to getting stuck.
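
A toy version of this controller is sketched below. It is an illustration under stated assumptions rather than the released code: the 15 degree turn, the 0.25 m step, the `TRY_OFFSETS` headings, and the `ToyEnv` class are all hypothetical. The agent first turns toward the subgoal, then issues forward actions. When a forward action fails to move it, the agent tries alternative headings for a single step, returns to its original heading, and continues.

```python
import math

TURN_DEG = 15.0   # assumed turn angle per low-level action
STEP_M = 0.25     # assumed distance per FORWARD action
TRY_OFFSETS = (-90, -60, -30, 30, 60, 90)  # hypothetical tryout headings (degrees)


class ToyEnv:
    """Hypothetical 2D world without sliding: a blocked FORWARD leaves the agent in place."""

    def __init__(self, pos, heading_deg, is_free):
        self.pos, self.heading, self.is_free = pos, heading_deg, is_free

    def rotate(self, deg):
        # Positive angles turn counter-clockwise.
        self.heading = (self.heading + deg) % 360.0

    def forward(self):
        r = math.radians(self.heading)
        nxt = (self.pos[0] + STEP_M * math.cos(r), self.pos[1] + STEP_M * math.sin(r))
        if not self.is_free(nxt):
            return False
        self.pos = nxt
        return True


def rotate_then_forward(pos, heading_deg, subgoal):
    """Convert a subgoal into (signed number of turns, number of forward steps)."""
    dx, dy = subgoal[0] - pos[0], subgoal[1] - pos[1]
    target = math.degrees(math.atan2(dy, dx))
    delta = (target - heading_deg + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    return round(delta / TURN_DEG), round(math.hypot(dx, dy) / STEP_M)


def tryout(env):
    """Try one FORWARD at each alternative heading, restoring the heading afterwards."""
    for offset in TRY_OFFSETS:
        env.rotate(offset)
        moved = env.forward()
        env.rotate(-offset)
        if moved:
            return True
    return False


def execute(env, turns, steps):
    """Run the plan; return False only if the agent is stuck at every tried heading."""
    env.rotate(turns * TURN_DEG)
    for _ in range(steps):
        if not env.forward() and not tryout(env):
            return False
    return True


# A small obstacle sits directly on the straight line to the subgoal.
def is_free(p):
    return not (0.4 <= p[0] <= 0.6 and -0.1 <= p[1] <= 0.1)


env = ToyEnv((0.0, 0.0), 0.0, is_free)
turns, steps = rotate_then_forward(env.pos, env.heading, (1.0, 0.0))
ok = execute(env, turns, steps)
```

In this example the second forward action is blocked, the sidestep at the first tried heading succeeds, and the remaining forward actions proceed past the obstacle.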

Evaluation and Results

The paper reports substantial advancements over prior state-of-the-art methods across several benchmarks:

On the R2R-CE (Room-to-Room in Continuous Environments) dataset, ETPNav reports an improvement of more than 10% over the prior state of the art, with gains in Success Rate (SR) and Success weighted by (normalized inverse) Path Length (SPL).
Similarly, on the RxR-CE (Room-Across-Room in Continuous Environments) dataset, a multilingual and more challenging benchmark, ETPNav achieves significant gains in primary metrics such as normalized Dynamic Time Warping (nDTW) and Success weighted by normalized Dynamic Time Warping (SDTW).
These improvements underscore ETPNav's ability to handle the dynamics of continuous environments and complex path finding, facilitated by its robust topological planning.
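
For readers unfamiliar with these metrics, the sketch below computes them for a single episode using their commonly used definitions. It makes two simplifying assumptions that differ from the benchmarks: distances are Euclidean rather than geodesic, and the 3 m success threshold is passed in as a default parameter. The function names are illustrative.

```python
import math


def path_length(path):
    """Total length of a polyline given as a list of (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def dtw(query, reference):
    """Dynamic time warping cost between two point sequences."""
    n, m = len(query), len(reference)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(query[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def episode_metrics(query, reference, shortest_dist, threshold=3.0):
    """Return success, SPL, nDTW and SDTW for one episode.

    query: path the agent took; reference: annotated path ending at the goal;
    shortest_dist: shortest start-to-goal distance; threshold: success radius (m).
    """
    success = 1.0 if math.dist(query[-1], reference[-1]) <= threshold else 0.0
    taken = path_length(query)
    denom = max(taken, shortest_dist)
    spl = success * (shortest_dist / denom if denom > 0 else 1.0)
    ndtw = math.exp(-dtw(query, reference) / (len(reference) * threshold))
    return {"success": success, "spl": spl, "ndtw": ndtw, "sdtw": success * ndtw}


reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
perfect = episode_metrics(reference, reference, shortest_dist=2.0)
```

Benchmark-level SR, SPL, nDTW and SDTW are the averages of these per-episode values. SPL penalizes detours on successful episodes, while nDTW rewards staying close to the annotated path even when the episode fails.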

Implications and Future Work

The introduction of ETPNav has strong implications for advancing embodied AI systems, particularly those requiring seamless integration of visual and linguistic data for real-time navigation in unconstrained environments. By enhancing long-range planning capabilities and obstacle avoidance mechanisms, ETPNav paves the way for more practical deployment of autonomous systems in real-world scenarios.

Future work could address noise in sensor readings, a notable consideration in real-world navigation. Further development of perception and localization strategies could also improve robustness, particularly in varied and dynamic environments outside the training data distribution.

In summary, ETPNav represents a step forward in bringing the deployment of language-guided navigation agents closer to real-world applications. By generating comprehensive topological maps in real-time and effectively planning and executing complex navigational tasks, it sets a strong precedent for future research in this domain.
