MuTT: A Multimodal Trajectory Transformer for Robot Skills (2407.15660v2)
Abstract: High-level robot skills represent an increasingly popular paradigm in robot programming. However, configuring the skills' parameters for a specific task remains a manual and time-consuming endeavor. Existing approaches for learning or optimizing these parameters often require numerous real-world executions or do not work in dynamic environments. To address these challenges, we propose MuTT, a novel encoder-decoder transformer architecture designed to predict environment-aware executions of robot skills by integrating vision, trajectory, and robot skill parameters. Notably, we pioneer the fusion of vision and trajectory, introducing a novel trajectory projection. Furthermore, we illustrate MuTT's efficacy as a predictor when combined with a model-based robot skill optimizer. This approach facilitates the optimization of robot skill parameters for the current environment, without the need for real-world executions during optimization. Designed for compatibility with any representation of robot skills, MuTT demonstrates its versatility across three comprehensive experiments, showcasing superior performance across two different skill representations.
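The abstract describes MuTT only at a high level. The sketch below illustrates the general idea it names: an encoder-decoder transformer that fuses image patches, a prior trajectory, and robot skill parameters into one token sequence and decodes an environment-aware trajectory prediction. All module names, dimensions, the decoder queries, and the simple linear trajectory projection here are illustrative assumptions for a minimal sketch, not the paper's actual architecture or its proposed trajectory projection.

```python
# Minimal, illustrative sketch of a MuTT-style multimodal encoder-decoder.
# Hypothetical names and shapes; positional encodings and the paper's
# trajectory projection are omitted/simplified on purpose.
import torch
import torch.nn as nn


class MuTTSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4,
                 patch_dim=16 * 16 * 3, traj_dim=7, param_dim=12):
        super().__init__()
        # Modality-specific linear projections into a shared token space.
        self.patch_proj = nn.Linear(patch_dim, d_model)   # image patches (ViT-style)
        self.traj_proj = nn.Linear(traj_dim, d_model)     # prior/nominal trajectory samples
        self.param_proj = nn.Linear(param_dim, d_model)   # robot skill parameters
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.out_head = nn.Linear(d_model, traj_dim)      # per-token trajectory prediction

    def forward(self, patches, prior_traj, skill_params, queries):
        # patches:      (B, N_patches, patch_dim)
        # prior_traj:   (B, T_in, traj_dim)
        # skill_params: (B, 1, param_dim)
        # queries:      (B, T_out, d_model) decoder query tokens
        tokens = torch.cat([
            self.patch_proj(patches),
            self.traj_proj(prior_traj),
            self.param_proj(skill_params),
        ], dim=1)                                          # fused multimodal encoder sequence
        decoded = self.transformer(src=tokens, tgt=queries)
        return self.out_head(decoded)                      # predicted execution trajectory


# Usage with random tensors (shapes are illustrative only).
model = MuTTSketch()
patches = torch.randn(2, 196, 16 * 16 * 3)
prior_traj = torch.randn(2, 50, 7)
skill_params = torch.randn(2, 1, 12)
queries = torch.randn(2, 50, 256)
pred_traj = model(patches, prior_traj, skill_params, queries)  # (2, 50, 7)
```

Under this framing, a model-based skill optimizer could query such a predictor repeatedly with candidate parameter values and score the predicted trajectories, avoiding real-world executions during optimization; the optimization loop itself is not shown here.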