Yell At Your Robot: Improving On-the-Fly from Language Corrections (2403.12910v1)

Published 19 Mar 2024 in cs.RO, cs.AI, and cs.LG

Abstract: Hierarchical policies that combine language and low-level control have been shown to perform impressively long-horizon robotic tasks, by leveraging either zero-shot high-level planners like pretrained language and vision-language models (LLMs/VLMs) or models trained on annotated robotic demonstrations. However, for complex and dexterous skills, attaining high success rates on long-horizon tasks still represents a major challenge -- the longer the task is, the more likely it is that some stage will fail. Can humans help the robot to continuously improve its long-horizon task performance through intuitive and natural feedback? In this paper, we make the following observation: high-level policies that index into sufficiently rich and expressive low-level language-conditioned skills can be readily supervised with human feedback in the form of language corrections. We show that even fine-grained corrections, such as small movements ("move a bit to the left"), can be effectively incorporated into high-level policies, and that such corrections can be readily obtained from humans observing the robot and making occasional suggestions. This framework enables robots not only to rapidly adapt to real-time language feedback, but also to incorporate this feedback into an iterative training scheme that improves the high-level policy's ability to correct errors in both low-level execution and high-level decision-making purely from verbal feedback. Our evaluation on real hardware shows that this leads to significant performance improvement in long-horizon, dexterous manipulation tasks without the need for any additional teleoperation. Videos and code are available at https://yay-robot.github.io/.


Summary

  • The paper introduces a hierarchical framework that leverages natural language corrections to enhance on-the-fly robotic manipulation in long-horizon tasks.
  • It pairs a high-level policy, built on a Vision Transformer with additional transformer layers, with a language-conditioned low-level policy that uses DistilBERT instruction embeddings to produce precise motor actions, including corrective ones.
  • Empirical results indicate a 15–50% improvement in task success rates and a 20–45% gain via policy finetuning, outperforming flat imitation learning baselines.

YAY Robot: Hierarchical Language-Guided Correction and Continuous Improvement for Long-Horizon Robotic Manipulation

Introduction

The paper introduces YAY Robot, a hierarchical framework for robotic manipulation that leverages natural language corrections to improve both real-time and autonomous performance on long-horizon, dexterous tasks. The system is designed to address the compounding error problem in multi-stage tasks by enabling human users to provide intuitive, fine-grained verbal feedback, which is then incorporated into the robot's high-level policy through iterative post-training. The approach is evaluated on three challenging bimanual manipulation tasks—bag packing, trail mix preparation, and plate cleaning—using real hardware.

Figure 1: Overview of YAY Robot's hierarchical setup, enabling human intervention via language corrections and subsequent high-level policy finetuning.

Hierarchical Policy Architecture

YAY Robot operates with a two-level policy hierarchy:

  • High-Level Policy: Generates language instructions based on visual observations and temporal context. It is implemented using a Vision Transformer (ViT) backbone initialized with CLIP weights, followed by Transformer and MLP layers to produce language embeddings. Temporal context is encoded via sinusoidal position embeddings over sequences of images.
  • Low-Level Policy: Executes fine-grained motor actions conditioned on both visual input and language instructions. The policy uses Action Chunking with Transformers (ACT) with EfficientNet-b3 for visual encoding and FiLM layers for multimodal fusion. Language instructions are embedded using DistilBERT.

    Figure 2: Policy architecture showing the flow from RGB images and joint positions through ViT and ACT modules to motor actions, mediated by language embeddings.

The hierarchical design allows the high-level policy to orchestrate complex sequences by composing primitive skills, while the low-level policy provides the flexibility to execute a diverse set of behaviors, including corrective actions.
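
To make the division of labor concrete, here is a minimal PyTorch sketch of the two-level hierarchy. The encoders, layer sizes, instruction bank, and the 14-dimensional action space are illustrative stand-ins (small CNNs and a learned temporal embedding in place of the paper's CLIP-initialized ViT, EfficientNet-b3, DistilBERT, and full ACT decoder); only the overall flow, predicting an instruction embedding, retrieving the nearest known instruction, and conditioning an action-chunk policy on it via FiLM, follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HighLevelPolicy(nn.Module):
    """Maps a short history of images to a normalized instruction embedding."""

    def __init__(self, embed_dim: int = 512, history: int = 4):
        super().__init__()
        # Per-frame image encoder: a small CNN stand-in for the CLIP-initialized ViT.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, embed_dim),
        )
        # Learned temporal embedding in place of sinusoidal position embeddings.
        self.time_embed = nn.Parameter(torch.zeros(history, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        b, t = images.shape[:2]                      # (batch, history, 3, H, W)
        feats = self.encoder(images.flatten(0, 1)).view(b, t, -1) + self.time_embed
        return F.normalize(self.head(self.temporal(feats).mean(dim=1)), dim=-1)


class FiLM(nn.Module):
    """Feature-wise linear modulation: language scales and shifts visual features."""

    def __init__(self, lang_dim: int, feat_dim: int):
        super().__init__()
        self.gamma = nn.Linear(lang_dim, feat_dim)
        self.beta = nn.Linear(lang_dim, feat_dim)

    def forward(self, feats, lang):
        return self.gamma(lang) * feats + self.beta(lang)


class LowLevelPolicy(nn.Module):
    """Language-conditioned action-chunk predictor, a simplified stand-in for ACT."""

    def __init__(self, lang_dim: int = 512, act_dim: int = 14, chunk: int = 20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 256),
        )
        self.film = FiLM(lang_dim, 256)
        self.decoder = nn.Sequential(
            nn.Linear(256 + act_dim, 512), nn.ReLU(), nn.Linear(512, chunk * act_dim),
        )
        self.chunk, self.act_dim = chunk, act_dim

    def forward(self, image, joints, lang_embedding):
        feats = self.film(self.encoder(image), lang_embedding)
        out = self.decoder(torch.cat([feats, joints], dim=-1))
        return out.view(-1, self.chunk, self.act_dim)  # chunk of future joint targets


if __name__ == "__main__":
    hi, lo = HighLevelPolicy(), LowLevelPolicy()
    imgs, joints = torch.randn(1, 4, 3, 96, 96), torch.randn(1, 14)
    # Hypothetical instruction bank; in the system these embeddings would come
    # from a frozen text encoder applied to the annotated skill vocabulary.
    names = ["pick up the sharpie", "move a bit to the left", "open the bag wider"]
    bank = F.normalize(torch.randn(len(names), 512), dim=-1)
    query = hi(imgs)                              # predicted instruction embedding
    choice = (query @ bank.T).argmax(dim=-1)      # retrieve the closest known skill
    actions = lo(imgs[:, -1], joints, bank[choice])
    print(names[choice.item()], actions.shape)    # -> (1, 20, 14)
```

Retrieving the nearest instruction from the annotated skill vocabulary keeps the high-level output grounded in commands the low-level policy has actually been trained to execute.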

Data Collection and Annotation

Efficient data collection is achieved through live narration, where operators verbally annotate skill segments during teleoperation. Audio is transcribed using Whisper and synchronized with robot trajectories. To distinguish between instructions and corrections, operators use foot pedals, enabling rapid filtering of suboptimal segments. Correction skills are iteratively expanded based on observed failure modes during policy rollouts, ensuring coverage of relevant recovery behaviors.
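
A rough sketch of how live narration could be turned into labeled training segments is shown below, assuming each spoken instruction applies from its start time until the next utterance begins (or the episode ends) and that the foot-pedal state is recorded as a per-utterance flag. The field names and the segmentation rule are illustrative, not the paper's exact pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    start: float          # seconds, from the speech-recognition timestamps
    end: float
    text: str             # transcribed instruction, e.g. "move a bit to the left"
    is_correction: bool   # hypothetical per-utterance foot-pedal flag


@dataclass
class LabeledSegment:
    t_start: float
    t_end: float
    instruction: str
    is_correction: bool


def segment_trajectory(utterances: List[Utterance],
                       episode_end: float) -> List[LabeledSegment]:
    """Attach each narrated instruction to the span of robot states it covers.

    Assumption: an instruction is in effect from the moment it is spoken until
    the next utterance begins, which matches live narration during teleoperation.
    """
    utterances = sorted(utterances, key=lambda u: u.start)
    segments = []
    for i, u in enumerate(utterances):
        t_end = utterances[i + 1].start if i + 1 < len(utterances) else episode_end
        segments.append(LabeledSegment(u.start, t_end, u.text, u.is_correction))
    return segments


if __name__ == "__main__":
    narration = [
        Utterance(1.2, 2.0, "pick up the sharpie", False),
        Utterance(6.5, 7.1, "move a bit to the left", True),   # pedal pressed
        Utterance(9.0, 9.8, "put it in the bag", False),
    ]
    for seg in segment_trajectory(narration, episode_end=15.0):
        tag = "correction" if seg.is_correction else "instruction"
        print(f"[{seg.t_start:5.1f}-{seg.t_end:5.1f}s] {tag}: {seg.instruction}")
```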

On-the-Fly Adaptation and Continuous Improvement

During deployment, human users can override the high-level policy by issuing verbal corrections, which are directly fed to the low-level policy for immediate behavioral adjustment. These interventions are logged and used to finetune the high-level policy, aligning its predictions with human feedback and improving autonomous performance over time. The iterative post-training process is conceptually analogous to Human-Gated DAgger, but operates over the space of language instructions rather than low-level actions; a simplified version of this loop is sketched after Figure 3.

Figure 3: Real-world task rollouts illustrating sub-tasks, failure modes, verbal corrections, and resulting robot behaviors for three manipulation tasks.
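
The intervention-and-finetune cycle can be summarized in a few lines of Python. Every function below is a placeholder (the real system runs learned policies on hardware and finetunes with gradient updates); what the sketch captures is the HG-DAgger-style control flow described above: a spoken correction overrides the predicted instruction, the (observation, correction) pair is logged, and the logged pairs later supervise the high-level policy.

```python
import random


# Placeholder components; names and behaviors are illustrative only.
def high_level_predict(observation):
    return random.choice(["pick up the sharpie", "put it in the bag"])


def human_correction(observation):
    """Returns a spoken correction if the operator intervenes, else None."""
    return "move a bit to the left" if random.random() < 0.1 else None


def low_level_execute(observation, instruction):
    return {"next_obs": observation, "done": random.random() < 0.02}


def finetune_high_level(corrections):
    print(f"finetuning on {len(corrections)} (observation, instruction) pairs")


def deployment_round(max_steps: int = 200):
    """One deployment round: the human may override the high-level policy with
    language; logged interventions are used for post-training afterwards."""
    corrections, obs = [], {}
    for _ in range(max_steps):
        spoken = human_correction(obs)
        if spoken is not None:
            # Only human interventions are logged for finetuning, mirroring the
            # HG-DAgger analogy above (a simplification of the actual recipe).
            corrections.append((obs, spoken))
        instruction = spoken if spoken is not None else high_level_predict(obs)
        step = low_level_execute(obs, instruction)
        obs = step["next_obs"]
        if step["done"]:
            break
    finetune_high_level(corrections)


if __name__ == "__main__":
    for round_idx in range(3):   # repeated rounds of feedback and finetuning
        deployment_round()
```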

Experimental Results

Quantitative Performance

YAY Robot demonstrates substantial improvements in task success rates:

  • On-the-fly corrections: Real-time language interventions yield 15–50% increases in success rates across all tasks.
  • Autonomous improvement: Finetuning the high-level policy with correction data leads to 20–45% higher success rates compared to the base policy.

    Figure 4: Quantitative evaluations showing a 20% improvement in success rates over the base policy due to language corrections and policy finetuning.

Iterative post-training enables the high-level policy to autonomously generate corrective instructions, with performance approaching that of an oracle policy as more feedback is incorporated.

Figure 5: Success rates for packing different numbers of items improve with each iteration of user feedback and policy finetuning.

Hierarchical vs. Flat Policies

Hierarchical policies consistently outperform flat imitation learning baselines (ACT trained without hierarchy), especially in later stages of long-horizon tasks, indicating superior robustness to compounding errors.

Ablation Studies

  • Scripted High-Level Policy: Replacing the learned high-level policy with a fixed sequence of instructions results in up to 30% lower performance, highlighting the necessity of dynamic, context-aware correction.
  • Vision-Language Models (VLMs): Off-the-shelf VLMs such as GPT-4V fail to reliably reason about spatial relationships and manipulation states, even with optimal camera inputs.
  • Language vs. One-Hot Encoding: Substituting language embeddings with one-hot skill encodings degrades performance, underscoring the importance of semantic compositionality in language-conditioned policies (a minimal sketch of the two conditioning variants follows Figure 6 below).
  • Data Quality: Training on filtered, high-quality data yields more stable and higher performance than using larger, mixed-quality datasets.

    Figure 6: Ablation results showing the impact of scripted policies, VLMs, and one-hot encodings on performance.
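
To illustrate what the one-hot ablation gives up, here is a small sketch of the two conditioning variants. The hashed bag-of-words text encoder is a stand-in for DistilBERT, and the skill strings and dimensions are illustrative. The point is structural: the one-hot conditioner can only represent phrasings enumerated in advance, while the language conditioner maps any correction into a shared embedding space where related phrasings stay close.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SKILLS = ["pick up the sharpie", "put it in the bag", "move a bit to the left"]


class OneHotConditioner(nn.Module):
    """Ablation variant: instructions are opaque indices; an unseen phrasing
    such as "move slightly to the left" simply has no representation."""

    def __init__(self, skills, dim: int = 512):
        super().__init__()
        self.index = {s: i for i, s in enumerate(skills)}
        self.proj = nn.Linear(len(skills), dim)

    def forward(self, commands):
        ids = torch.tensor([self.index[c] for c in commands])
        return self.proj(F.one_hot(ids, num_classes=len(self.index)).float())


class LanguageConditioner(nn.Module):
    """Full-method variant: any string is mapped to an embedding by a text
    encoder (DistilBERT in the paper; a hashed bag-of-words stand-in here)."""

    def __init__(self, vocab_buckets: int = 1024, dim: int = 512):
        super().__init__()
        self.buckets = vocab_buckets
        self.proj = nn.Linear(vocab_buckets, dim)

    def forward(self, commands):
        bags = torch.zeros(len(commands), self.buckets)
        for row, cmd in enumerate(commands):
            for word in cmd.lower().split():
                bags[row, hash(word) % self.buckets] += 1.0
        return self.proj(bags)


if __name__ == "__main__":
    lang = LanguageConditioner()
    a, b = lang(["move a bit to the left", "move slightly to the left"])
    print(F.cosine_similarity(a, b, dim=0))  # related phrasings share structure
    onehot = OneHotConditioner(SKILLS)
    print(onehot(["move a bit to the left"]).shape)
    # onehot(["move slightly to the left"]) would raise KeyError: unseen phrasing
```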

Policy Proficiency and Behavioral Analysis

Fine-tuning with human feedback leads to broader and more effective coverage in tasks such as plate cleaning, as visualized by heatmaps of wiping efficacy.

Figure 7: Heatmaps showing increased cleaning coverage after policy finetuning with human feedback.

The ratio of corrective to non-corrective commands shifts markedly after finetuning, resulting in more targeted and effective behaviors.

Figure 8: Shift from non-correction to correction commands post-finetuning, enhancing task coverage and success.

Language Command Diversity

The dataset contains a large and diverse set of language instructions, with correction skills being more varied but less frequent than task-oriented commands.

Figure 9: Word cloud of the most frequent 200 commands in the bag packing dataset, illustrating the diversity of language instructions.

Implications and Future Directions

YAY Robot demonstrates that natural language corrections can be effectively leveraged for both immediate adaptation and continuous improvement in hierarchical robotic systems. The results suggest that robust high-level policies must be tightly coupled with expressive, language-conditioned low-level skills. The approach is limited by the capabilities of the low-level policy; further gains will require advances in large-scale language-conditioned imitation learning and multimodal policy architectures. Extending the framework to incorporate non-verbal feedback (e.g., gestures, pointing) and integrating pretrained VLMs with post-training on interaction data are promising avenues for future research.

Conclusion

YAY Robot provides a scalable, user-friendly mechanism for improving robotic manipulation through verbal corrections, achieving significant gains in long-horizon task performance. The hierarchical language-guided approach enables both on-the-fly adaptation and autonomous improvement, with strong empirical results and clear evidence for the necessity of dynamic, context-aware high-level policies. The framework sets a foundation for future systems that learn from natural human supervision, with potential extensions to multimodal feedback and broader task domains.
