We survey applications of pretrained foundation models in robotics. Traditional deep learning models in robotics are trained on small datasets tailored for specific tasks, which limits their adaptability across diverse applications. In contrast, foundation models pretrained on internet-scale data appear to have superior generalization capabilities, and in some instances display an emergent ability to find zero-shot solutions to problems that are not present in the training data. Foundation models may hold the potential to enhance various components of the robot autonomy stack, from perception to decision-making and control. For example, LLMs can generate code or provide common-sense reasoning, while vision-language models enable open-vocabulary visual recognition. However, significant open research challenges remain, particularly around the scarcity of robot-relevant training data, safety guarantees and uncertainty quantification, and real-time execution. In this survey, we study papers that have used or built foundation models to solve robotics problems. We explore how foundation models contribute to improving robot capabilities in the domains of perception, decision-making, and control. We discuss the challenges hindering the adoption of foundation models in robot autonomy and provide opportunities and potential pathways for future advancements. The GitHub project corresponding to this paper (a preliminary release; we are committed to further enhancing and updating this work to ensure its quality and relevance) can be found here: https://github.com/robotics-survey/Awesome-Robotics-Foundation-Models
Foundation models are large-scale machine learning models pre-trained on diverse datasets; they can be fine-tuned for a wide range of robotics tasks.
Models such as GPT-3, CLIP, and DALL-E can enhance decision-making, control, perception, and task planning in robotics.
Robotics faces several challenges in integrating foundation models, including data scarcity, safety concerns, uncertainty in decision-making, and the need for real-time processing.
Techniques such as learning from unstructured play data, uncertainty quantification methods, and high-fidelity simulators are proposed to address these challenges.
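To make the uncertainty quantification point concrete, the sketch below applies split conformal prediction, one standard, model-agnostic technique for turning a classifier's per-label scores into prediction sets with a target coverage level. The classifier and data here are synthetic stand-ins, not drawn from any system surveyed:

```python
# Split conformal prediction: calibrate a score threshold on held-out data,
# then emit prediction SETS that contain the true label with probability
# >= 1 - ALPHA (marginally). The "model" and data below are synthetic toys;
# in a robot stack the scores might come from a pretrained classifier.
import math
import random

random.seed(0)
ALPHA = 0.1  # target miscoverage: sets should contain the truth >= 90% of the time
LABELS = ["cup", "bowl", "plate"]

def toy_model(x):
    """Stand-in for a pretrained classifier: softmax scores over LABELS."""
    logits = [-(x - mu) ** 2 for mu in (0.0, 1.0, 2.0)]
    z = [math.exp(l) for l in logits]
    s = sum(z)
    return [v / s for v in z]

def sample(n):
    """Synthetic labeled data: x clusters around its class index, with noise."""
    data = []
    for _ in range(n):
        y = random.randrange(3)
        data.append((float(y) + random.gauss(0.0, 0.7), y))
    return data

# 1) Calibration: nonconformity score = 1 - model probability of the true label.
cal = sample(500)
scores = sorted(1.0 - toy_model(x)[y] for x, y in cal)
k = math.ceil((len(cal) + 1) * (1.0 - ALPHA)) - 1
qhat = scores[min(k, len(scores) - 1)]

# 2) Prediction sets: include every label whose nonconformity is within qhat.
def predict_set(x):
    probs = toy_model(x)
    return [LABELS[i] for i, p in enumerate(probs) if 1.0 - p <= qhat]

# 3) Empirical coverage on fresh data should be close to 1 - ALPHA.
test = sample(2000)
coverage = sum(LABELS[y] in predict_set(x) for x, y in test) / len(test)
print(f"qhat = {qhat:.3f}, empirical coverage = {coverage:.3f}")
```

A practical appeal of conformal methods for robotics is that the coverage guarantee is distribution-free: it holds for any underlying model, including a black-box foundation model, at the cost of sometimes returning larger (more uncertain) label sets.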
Future research in robotics aims to develop reliable, efficient, and safe robots that can adapt to a broad range of tasks using foundation models.
Foundation models are a type of machine learning model that is pre-trained on massive, diverse data sets, enabling them to learn general-purpose representations and skills. These models can then be fine-tuned or adapted to a wide array of downstream tasks. Examples include BERT for text processing and GPT for text generation, as well as models like CLIP and DALL-E that work across both vision and language. In robotics, these models hold promise for enhancing perception, decision-making, control, and even task planning. They can generate code, provide common-sense reasoning, and recognize visual concepts in an open-ended manner. However, realizing their potential in robotics also presents unique challenges, particularly regarding training data scarcity, safety, uncertainty quantification, and achieving real-time performance.
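As a schematic of how CLIP-style open-vocabulary recognition works, the snippet below embeds an image and a set of free-form text prompts into a shared space and picks the prompt most similar to the image by cosine similarity. The embeddings and the `image_encoder` stub are hand-made illustrations, not real CLIP outputs, which are high-dimensional and produced by learned networks:

```python
# Schematic of CLIP-style open-vocabulary recognition: an image and a set of
# free-form text labels are embedded into a shared space, and the label whose
# embedding is most similar (by cosine) to the image embedding wins.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical shared embedding space (real CLIP embeddings are ~512-D and learned).
TEXT_EMBED = {
    "a photo of a coffee mug":  [0.9, 0.1, 0.0],
    "a photo of a screwdriver": [0.1, 0.9, 0.1],
    "a photo of a power drill": [0.0, 0.2, 0.9],
}

def image_encoder(image_id):
    """Stub: pretend we encoded a camera frame into the shared space."""
    fake_frames = {"frame_042": [0.85, 0.15, 0.05]}
    return fake_frames[image_id]

def classify(image_id, prompts):
    """Open-vocabulary: the label set is just a list of strings chosen at runtime."""
    img = image_encoder(image_id)
    return max(prompts, key=lambda p: cosine(img, TEXT_EMBED[p]))

print(classify("frame_042", list(TEXT_EMBED)))  # → "a photo of a coffee mug"
```

The key property for robotics is that the label set is not fixed at training time: a robot can be asked about new object categories simply by supplying new text prompts.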
Foundation models offer significant advancements for robotics in several areas:
Decision Making and Control: LLMs can generate task plans and executable robot code, contributing common-sense reasoning to high-level decision-making and control.
Perception Capabilities: Vision-language models such as CLIP enable open-vocabulary visual recognition, allowing robots to identify objects and scenes beyond a fixed set of training labels.
Embodied AI and Generalist Agents: Pretraining on internet-scale data points toward generalist agents that can adapt to a broad range of embodied tasks rather than a single narrow application.
Incorporating foundation models into robotics comes with several challenges:
Training Data Scarcity: Robot-relevant data, such as sensorimotor interaction data, is far scarcer than the internet-scale text and image corpora used to pretrain foundation models.
Uncertainty and Safety in Decision Making: Foundation models currently lack the safety guarantees and calibrated uncertainty quantification needed before their outputs can drive physical systems.
Real-Time Performance: Large models are computationally expensive, making the low-latency inference required by robot control loops difficult to achieve.
Variability in Robotic Settings: Robots differ widely in embodiment, sensors, and operating environments, so a single pretrained model may not transfer cleanly across platforms.
Benchmarking and Reproducibility: The field lacks standardized benchmarks for evaluating foundation-model-based robot systems, making results difficult to compare and reproduce.
The integration of foundation models in robotics is an active area of development. Future research directions include creating reliable, real-time capable models, generating robotics-specific training data, and building safety mechanisms for autonomous operations. The ultimate goal is to develop versatile robots that can operate safely and effectively in complex real-world scenarios, leveraging the vast learning potential of foundation models.