Yell At Your Robot: Improving On-the-Fly from Language Corrections (2403.12910v1)

Published 19 Mar 2024 in cs.RO, cs.AI, and cs.LG

Abstract: Hierarchical policies that combine language and low-level control have been shown to perform impressively long-horizon robotic tasks, by leveraging either zero-shot high-level planners like pretrained language and vision-language models (LLMs/VLMs) or models trained on annotated robotic demonstrations. However, for complex and dexterous skills, attaining high success rates on long-horizon tasks still represents a major challenge -- the longer the task is, the more likely it is that some stage will fail. Can humans help the robot to continuously improve its long-horizon task performance through intuitive and natural feedback? In this paper, we make the following observation: high-level policies that index into sufficiently rich and expressive low-level language-conditioned skills can be readily supervised with human feedback in the form of language corrections. We show that even fine-grained corrections, such as small movements ("move a bit to the left"), can be effectively incorporated into high-level policies, and that such corrections can be readily obtained from humans observing the robot and making occasional suggestions. This framework enables robots not only to rapidly adapt to real-time language feedback, but also to incorporate this feedback into an iterative training scheme that improves the high-level policy's ability to correct errors in both low-level execution and high-level decision-making purely from verbal feedback. Our evaluation on real hardware shows that this leads to significant performance improvement in long-horizon, dexterous manipulation tasks without the need for any additional teleoperation. Videos and code are available at https://yay-robot.github.io/.


Summary

  • The paper introduces a hierarchical framework that leverages natural language corrections to enhance on-the-fly robotic manipulation in long-horizon tasks.
  • It pairs a high-level policy, built on a Vision Transformer with additional transformer layers, with a language-conditioned low-level policy that uses DistilBERT instruction embeddings to produce precise motor actions, including corrective ones.
  • Empirical results indicate a 15–50% improvement in task success rates and a 20–45% gain via policy finetuning, outperforming flat imitation learning baselines.

YAY Robot: Hierarchical Language-Guided Correction and Continuous Improvement for Long-Horizon Robotic Manipulation

Introduction

The paper introduces YAY Robot, a hierarchical framework for robotic manipulation that leverages natural language corrections to improve both real-time and autonomous performance on long-horizon, dexterous tasks. The system is designed to address the compounding error problem in multi-stage tasks by enabling human users to provide intuitive, fine-grained verbal feedback, which is then incorporated into the robot's high-level policy through iterative post-training. The approach is evaluated on three challenging bimanual manipulation tasks—bag packing, trail mix preparation, and plate cleaning—using real hardware.

Figure 1: Overview of YAY Robot's hierarchical setup, enabling human intervention via language corrections and subsequent high-level policy finetuning.

Hierarchical Policy Architecture

YAY Robot operates with a two-level policy hierarchy:

  • High-Level Policy: Generates language instructions based on visual observations and temporal context. It is implemented using a Vision Transformer (ViT) backbone initialized with CLIP weights, followed by Transformer and MLP layers to produce language embeddings. Temporal context is encoded via sinusoidal position embeddings over sequences of images.
  • Low-Level Policy: Executes fine-grained motor actions conditioned on both visual input and language instructions. The policy uses Action Chunking with Transformers (ACT) with EfficientNet-b3 for visual encoding and FiLM layers for multimodal fusion. Language instructions are embedded using DistilBERT.

    Figure 2: Policy architecture showing the flow from RGB images and joint positions through ViT and ACT modules to motor actions, mediated by language embeddings.

The hierarchical design allows the high-level policy to orchestrate complex sequences by composing primitive skills, while the low-level policy provides the flexibility to execute a diverse set of behaviors, including corrective actions.
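
To make the division of labor concrete, here is a minimal PyTorch sketch of the two-level hierarchy. The encoders, layer sizes, instruction bank, and the 14-dimensional action space are illustrative stand-ins (small CNNs and a learned temporal embedding in place of the paper's CLIP-initialized ViT, EfficientNet-b3, DistilBERT, and full ACT decoder); only the overall flow, predicting an instruction embedding, retrieving the nearest known instruction, and conditioning an action-chunk policy on it via FiLM, follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HighLevelPolicy(nn.Module):
    """Maps a short history of images to a normalized instruction embedding."""

    def __init__(self, embed_dim: int = 512, history: int = 4):
        super().__init__()
        # Per-frame image encoder: a small CNN stand-in for the CLIP-initialized ViT.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, embed_dim),
        )
        # Learned temporal embedding in place of sinusoidal position embeddings.
        self.time_embed = nn.Parameter(torch.zeros(history, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        b, t = images.shape[:2]                      # (batch, history, 3, H, W)
        feats = self.encoder(images.flatten(0, 1)).view(b, t, -1) + self.time_embed
        return F.normalize(self.head(self.temporal(feats).mean(dim=1)), dim=-1)


class FiLM(nn.Module):
    """Feature-wise linear modulation: language scales and shifts visual features."""

    def __init__(self, lang_dim: int, feat_dim: int):
        super().__init__()
        self.gamma = nn.Linear(lang_dim, feat_dim)
        self.beta = nn.Linear(lang_dim, feat_dim)

    def forward(self, feats, lang):
        return self.gamma(lang) * feats + self.beta(lang)


class LowLevelPolicy(nn.Module):
    """Language-conditioned action-chunk predictor, a simplified stand-in for ACT."""

    def __init__(self, lang_dim: int = 512, act_dim: int = 14, chunk: int = 20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 256),
        )
        self.film = FiLM(lang_dim, 256)
        self.decoder = nn.Sequential(
            nn.Linear(256 + act_dim, 512), nn.ReLU(), nn.Linear(512, chunk * act_dim),
        )
        self.chunk, self.act_dim = chunk, act_dim

    def forward(self, image, joints, lang_embedding):
        feats = self.film(self.encoder(image), lang_embedding)
        out = self.decoder(torch.cat([feats, joints], dim=-1))
        return out.view(-1, self.chunk, self.act_dim)  # chunk of future joint targets


if __name__ == "__main__":
    hi, lo = HighLevelPolicy(), LowLevelPolicy()
    imgs, joints = torch.randn(1, 4, 3, 96, 96), torch.randn(1, 14)
    # Hypothetical instruction bank; in the system these embeddings would come
    # from a frozen text encoder applied to the annotated skill vocabulary.
    names = ["pick up the sharpie", "move a bit to the left", "open the bag wider"]
    bank = F.normalize(torch.randn(len(names), 512), dim=-1)
    query = hi(imgs)                              # predicted instruction embedding
    choice = (query @ bank.T).argmax(dim=-1)      # retrieve the closest known skill
    actions = lo(imgs[:, -1], joints, bank[choice])
    print(names[choice.item()], actions.shape)    # -> (1, 20, 14)
```

Retrieving the nearest instruction from the annotated skill vocabulary keeps the high-level output grounded in commands the low-level policy has actually been trained to execute.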

Data Collection and Annotation

Efficient data collection is achieved through live narration, where operators verbally annotate skill segments during teleoperation. Audio is transcribed using Whisper and synchronized with robot trajectories. To distinguish between instructions and corrections, operators use foot pedals, enabling rapid filtering of suboptimal segments. Correction skills are iteratively expanded based on observed failure modes during policy rollouts, ensuring coverage of relevant recovery behaviors.
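
A rough sketch of how live narration could be turned into labeled training segments is shown below, assuming each spoken instruction applies from its start time until the next utterance begins (or the episode ends) and that the foot-pedal state is recorded as a per-utterance flag. The field names and the segmentation rule are illustrative, not the paper's exact pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    start: float          # seconds, from the speech-recognition timestamps
    end: float
    text: str             # transcribed instruction, e.g. "move a bit to the left"
    is_correction: bool   # hypothetical per-utterance foot-pedal flag


@dataclass
class LabeledSegment:
    t_start: float
    t_end: float
    instruction: str
    is_correction: bool


def segment_trajectory(utterances: List[Utterance],
                       episode_end: float) -> List[LabeledSegment]:
    """Attach each narrated instruction to the span of robot states it covers.

    Assumption: an instruction is in effect from the moment it is spoken until
    the next utterance begins, which matches live narration during teleoperation.
    """
    utterances = sorted(utterances, key=lambda u: u.start)
    segments = []
    for i, u in enumerate(utterances):
        t_end = utterances[i + 1].start if i + 1 < len(utterances) else episode_end
        segments.append(LabeledSegment(u.start, t_end, u.text, u.is_correction))
    return segments


if __name__ == "__main__":
    narration = [
        Utterance(1.2, 2.0, "pick up the sharpie", False),
        Utterance(6.5, 7.1, "move a bit to the left", True),   # pedal pressed
        Utterance(9.0, 9.8, "put it in the bag", False),
    ]
    for seg in segment_trajectory(narration, episode_end=15.0):
        tag = "correction" if seg.is_correction else "instruction"
        print(f"[{seg.t_start:5.1f}-{seg.t_end:5.1f}s] {tag}: {seg.instruction}")
```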

On-the-Fly Adaptation and Continuous Improvement

During deployment, human users can override the high-level policy by issuing verbal corrections, which are directly fed to the low-level policy for immediate behavioral adjustment. These interventions are logged and used to finetune the high-level policy, aligning its predictions with human feedback and improving autonomous performance over time. The iterative post-training process is conceptually analogous to Human-Gated DAgger, but operates over the space of language instructions rather than low-level actions; a simplified version of this loop is sketched after Figure 3.

Figure 3: Real-world task rollouts illustrating sub-tasks, failure modes, verbal corrections, and resulting robot behaviors for three manipulation tasks.
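
The intervention-and-finetune cycle can be summarized in a few lines of Python. Every function below is a placeholder (the real system runs learned policies on hardware and finetunes with gradient updates); what the sketch captures is the HG-DAgger-style control flow described above: a spoken correction overrides the predicted instruction, the (observation, correction) pair is logged, and the logged pairs later supervise the high-level policy.

```python
import random


# Placeholder components; names and behaviors are illustrative only.
def high_level_predict(observation):
    return random.choice(["pick up the sharpie", "put it in the bag"])


def human_correction(observation):
    """Returns a spoken correction if the operator intervenes, else None."""
    return "move a bit to the left" if random.random() < 0.1 else None


def low_level_execute(observation, instruction):
    return {"next_obs": observation, "done": random.random() < 0.02}


def finetune_high_level(corrections):
    print(f"finetuning on {len(corrections)} (observation, instruction) pairs")


def deployment_round(max_steps: int = 200):
    """One deployment round: the human may override the high-level policy with
    language; logged interventions are used for post-training afterwards."""
    corrections, obs = [], {}
    for _ in range(max_steps):
        spoken = human_correction(obs)
        if spoken is not None:
            # Only human interventions are logged for finetuning, mirroring the
            # HG-DAgger analogy above (a simplification of the actual recipe).
            corrections.append((obs, spoken))
        instruction = spoken if spoken is not None else high_level_predict(obs)
        step = low_level_execute(obs, instruction)
        obs = step["next_obs"]
        if step["done"]:
            break
    finetune_high_level(corrections)


if __name__ == "__main__":
    for round_idx in range(3):   # repeated rounds of feedback and finetuning
        deployment_round()
```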

Experimental Results

Quantitative Performance

YAY Robot demonstrates substantial improvements in task success rates:

  • On-the-fly corrections: Real-time language interventions yield 15–50% increases in success rates across all tasks.
  • Autonomous improvement: Finetuning the high-level policy with correction data leads to 20–45% higher success rates compared to the base policy.

    Figure 4: Quantitative evaluations showing a 20% improvement in success rates over the base policy due to language corrections and policy finetuning.

Iterative post-training enables the high-level policy to autonomously generate corrective instructions, with performance approaching that of an oracle policy as more feedback is incorporated.

Figure 5: Success rates for packing different numbers of items improve with each iteration of user feedback and policy finetuning.

Hierarchical vs. Flat Policies

Hierarchical policies consistently outperform flat imitation learning baselines (ACT trained without hierarchy), especially in later stages of long-horizon tasks, indicating superior robustness to compounding errors.

Ablation Studies

  • Scripted High-Level Policy: Replacing the learned high-level policy with a fixed sequence of instructions results in up to 30% lower performance, highlighting the necessity of dynamic, context-aware correction.
  • Vision-Language Models (VLMs): Off-the-shelf VLMs such as GPT-4V fail to reliably reason about spatial relationships and manipulation states, even with optimal camera inputs.
  • Language vs. One-Hot Encoding: Substituting language embeddings with one-hot skill encodings degrades performance, underscoring the importance of semantic compositionality in language-conditioned policies (a minimal sketch of the two conditioning variants follows Figure 6 below).
  • Data Quality: Training on filtered, high-quality data yields more stable and higher performance than using larger, mixed-quality datasets.

    Figure 6: Ablation results showing the impact of scripted policies, VLMs, and one-hot encodings on performance.
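
To illustrate what the one-hot ablation gives up, here is a small sketch of the two conditioning variants. The hashed bag-of-words text encoder is a stand-in for DistilBERT, and the skill strings and dimensions are illustrative. The point is structural: the one-hot conditioner can only represent phrasings enumerated in advance, while the language conditioner maps any correction into a shared embedding space where related phrasings stay close.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SKILLS = ["pick up the sharpie", "put it in the bag", "move a bit to the left"]


class OneHotConditioner(nn.Module):
    """Ablation variant: instructions are opaque indices; an unseen phrasing
    such as "move slightly to the left" simply has no representation."""

    def __init__(self, skills, dim: int = 512):
        super().__init__()
        self.index = {s: i for i, s in enumerate(skills)}
        self.proj = nn.Linear(len(skills), dim)

    def forward(self, commands):
        ids = torch.tensor([self.index[c] for c in commands])
        return self.proj(F.one_hot(ids, num_classes=len(self.index)).float())


class LanguageConditioner(nn.Module):
    """Full-method variant: any string is mapped to an embedding by a text
    encoder (DistilBERT in the paper; a hashed bag-of-words stand-in here)."""

    def __init__(self, vocab_buckets: int = 1024, dim: int = 512):
        super().__init__()
        self.buckets = vocab_buckets
        self.proj = nn.Linear(vocab_buckets, dim)

    def forward(self, commands):
        bags = torch.zeros(len(commands), self.buckets)
        for row, cmd in enumerate(commands):
            for word in cmd.lower().split():
                bags[row, hash(word) % self.buckets] += 1.0
        return self.proj(bags)


if __name__ == "__main__":
    lang = LanguageConditioner()
    a, b = lang(["move a bit to the left", "move slightly to the left"])
    print(F.cosine_similarity(a, b, dim=0))  # related phrasings share structure
    onehot = OneHotConditioner(SKILLS)
    print(onehot(["move a bit to the left"]).shape)
    # onehot(["move slightly to the left"]) would raise KeyError: unseen phrasing
```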

Policy Proficiency and Behavioral Analysis

Fine-tuning with human feedback leads to broader and more effective coverage in tasks such as plate cleaning, as visualized by heatmaps of wiping efficacy.

Figure 7: Heatmaps showing increased cleaning coverage after policy finetuning with human feedback.

The ratio of corrective to non-corrective commands shifts markedly after finetuning, resulting in more targeted and effective behaviors.

Figure 8: Shift from non-correction to correction commands post-finetuning, enhancing task coverage and success.

Language Command Diversity

The dataset contains a large and diverse set of language instructions, with correction skills being more varied but less frequent than task-oriented commands.

Figure 9: Word cloud of the most frequent 200 commands in the bag packing dataset, illustrating the diversity of language instructions.

Implications and Future Directions

YAY Robot demonstrates that natural language corrections can be effectively leveraged for both immediate adaptation and continuous improvement in hierarchical robotic systems. The results suggest that robust high-level policies must be tightly coupled with expressive, language-conditioned low-level skills. The approach is limited by the capabilities of the low-level policy; further gains will require advances in large-scale language-conditioned imitation learning and multimodal policy architectures. Extending the framework to incorporate non-verbal feedback (e.g., gestures, pointing) and integrating pretrained VLMs with post-training on interaction data are promising avenues for future research.

Conclusion

YAY Robot provides a scalable, user-friendly mechanism for improving robotic manipulation through verbal corrections, achieving significant gains in long-horizon task performance. The hierarchical language-guided approach enables both on-the-fly adaptation and autonomous improvement, with strong empirical results and clear evidence for the necessity of dynamic, context-aware high-level policies. The framework sets a foundation for future systems that learn from natural human supervision, with potential extensions to multimodal feedback and broader task domains.
