
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge (2206.08853v2)

Published 17 Jun 2022 in cs.LG, cs.AI, cs.CL, and cs.CV

Abstract: Autonomous agents have made great strides in specialist domains like Atari games and Go. However, they typically learn tabula rasa in isolated environments with limited and manually conceived objectives, thus failing to generalize across a wide spectrum of tasks and capabilities. Inspired by how humans continually learn and adapt in the open world, we advocate a trinity of ingredients for building generalist agents: 1) an environment that supports a multitude of tasks and goals, 2) a large-scale database of multimodal knowledge, and 3) a flexible and scalable agent architecture. We introduce MineDojo, a new framework built on the popular Minecraft game that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base with Minecraft videos, tutorials, wiki pages, and forum discussions. Using MineDojo's data, we propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function. Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward. We open-source the simulation suite, knowledge bases, algorithm implementation, and pretrained models (https://minedojo.org) to promote research towards the goal of generally capable embodied agents.

Citations (305)

Summary

  • The paper presents an integrated framework for training AI agents on open-ended tasks using an internet-scale, multimodal knowledge base of Minecraft content.
  • It introduces MineCLIP, a contrastive video-language model that serves as a learned reward for reinforcement learning, matching or exceeding manually tuned rewards on several tasks.
  • The approach leverages diverse tasks and extensive media sources, paving the way for scalable, human-like adaptability in generalist AI.

MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge

Introduction

The paper "MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge" presents a framework for developing generally capable embodied AI agents. The framework is built on the popular game Minecraft and aims to let agents learn and adapt to new open-ended tasks by drawing on an internet-scale, multimodal knowledge base.

Figure 1: MineDojo framework architecture combining diverse open-ended tasks and multimodal internet-scale knowledge bases.

Trinity of Generalist Agent Requirements

To truly mimic human-like adaptability and sustained learning, the paper proposes three essential components for fostering generalist agents:

  1. Open-Ended Environment: An environment that supports a wide, effectively unbounded range of tasks and goals, much as Earth's diverse ecosystems nurture the evolution of life forms.
  2. Comprehensive Multimodal Knowledge Database: Just as humans learn from internet resources, agents should draw on extensive video demonstrations, multimedia tutorials, and discussion forums.
  3. Flexible and Scalable Agent Architecture: The framework advocates a unified observation/action space, natural-language task prompts, and Transformer-based pre-training so that large-scale data sources can be turned into usable behavior.

The framework capitalizes on Minecraft's rich world, in which players explore, build, and survive without a fixed objective, as a sandbox for open-ended scenarios. Agents attempt free-form tasks such as "building a house" or "navigating to a treasure site" and are evaluated with a metric designed to agree with human judgment of task success.

Figure 2: Visualization of agent's learned behavior based on task descriptions.

Internet-Scale Knowledge Base

MineDojo's knowledge base comprises over 730K YouTube videos, 6K+ Wiki pages, and a variety of Reddit posts. The collection is meant to give the agent relevant, diverse, domain-specific knowledge, much as a human learns by consuming media.

Two key elements of the knowledge base are:

  • YouTube Videos: Human player content from video streams detailing intricate maneuvers, crafting techniques, and combat strategies.
  • Wiki and Reddit: Structured knowledge containing comprehensive game mechanics explanations, procedural guides for crafting, and community discussions.

Figure 3: Multimodal representation of MineDojo's knowledge base reflecting YouTube and Wiki-derived insights.

Agent Learning Algorithm

The agent learning algorithm hinges on a contrastive video-language model, dubbed MineCLIP. The model scores how well a short video snippet of the agent's behavior matches a language description of the task, and that score serves as a dense reward signal for reinforcement learning (RL).
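To make "contrastive" concrete, here is a minimal sketch of a symmetric InfoNCE objective of the kind commonly used by CLIP-style models. It is not taken from the paper: the function names, the plain-list embeddings, and the temperature value of 0.07 are all assumptions made for illustration, and the actual MineCLIP loss and hyperparameters may differ.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where video i is paired with text i.

    `temperature` is a hypothetical value chosen for this sketch.
    """
    v = [normalize(e) for e in video_embs]
    t = [normalize(e) for e in text_embs]
    n = len(v)
    # logits[i][j]: temperature-scaled cosine similarity of video i and text j.
    logits = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    # Video-to-text direction: the matching text for video i sits at index i.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-video direction: same computation on the transposed matrix.
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return (v2t + t2v) / 2
```

The loss is small when each video embedding is closest to its own caption and large when the pairs are mismatched, which is the property that lets the trained similarity score be reused later as a reward.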

This approach removes the need for manually designed dense rewards, allowing the agent to learn complex, novel tasks within an open-vocabulary, multi-task framework. The same reward model also serves as an evaluation metric for open-ended tasks, where it shows high agreement with human judgment of task success.

Figure 4: MineCLIP architecture design illustrating the training of video-text correlation for deep RL applications.
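The reward computation described in this section can be sketched as follows, under stated assumptions: frame and prompt embeddings are taken as already computed by some encoder, frames are mean-pooled over a sliding window, and the window length of 16 is a placeholder. The class and function names are hypothetical and do not correspond to the released MineDojo or MineCLIP code.

```python
import math
from collections import deque

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector has zero length.
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def mean_pool(frames):
    # Average per-frame embeddings into a single clip embedding.
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

class SimilarityReward:
    """Hypothetical dense reward: similarity of recent frames to the task prompt."""

    def __init__(self, prompt_emb, window=16):
        self.prompt_emb = prompt_emb
        # Keep only the most recent `window` frame embeddings (assumed length).
        self.frames = deque(maxlen=window)

    def step(self, frame_emb):
        # Add the newest frame, pool the window, and score it against the prompt.
        self.frames.append(frame_emb)
        clip_emb = mean_pool(list(self.frames))
        return cosine(clip_emb, self.prompt_emb)
```

In an RL loop, `step` would be called once per environment step with the embedding of the latest observation, and its return value used in place of a hand-written shaping reward. Thresholding the same score is one simple way such a model could double as a success metric, though the threshold would need to be calibrated against human labels.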

Empirical Evaluations

To assess MineCLIP, the experiments cover both programmatic and creative tasks and compare against baseline reward models. Agents trained with the MineCLIP reward perform on par with, or better than, agents trained with manually tuned rewards on various tasks, such as resource harvesting and combat.

The framework's scalability and task generalization are attributed to the diverse task suite and the internet-derived knowledge base. The results suggest a path toward agents that can tackle tasks unseen during training and support the adaptability of the architecture.

Conclusion

MineDojo offers the AI community a platform for developing and assessing open-ended agents in dynamic, complex environments. Its combination of large-scale pre-training with a diverse task suite is presented as a step toward generalist AI with human-like learning and adaptability. The work also opens directions for future research in embodied AI, in particular agents that learn from multimedia internet sources.
