MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World

(2401.08577)
Published Jan 16, 2024 in cs.CV , cs.AI , cs.CL , cs.LG , and cs.RO

Abstract

Human beings possess the capability to multiply a melange of multisensory cues while actively exploring and interacting with the 3D world. Current multi-modal LLMs, however, passively absorb sensory data as inputs, lacking the capacity to actively interact with the objects in the 3D environment and dynamically collect their multisensory information. To usher in the study of this area, we propose MultiPLY, a multisensory embodied large language model that can incorporate multisensory interactive data, including visual, audio, tactile, and thermal information, into LLMs, thereby establishing the correlation among words, actions, and percepts. To this end, we first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k pieces of data, by deploying an LLM-powered embodied agent to engage with the 3D environment. To perform instruction tuning with a pre-trained LLM on such generated data, we first encode the 3D scene as abstracted object-centric representations and then introduce action tokens denoting that the embodied agent takes certain actions within the environment, as well as state tokens that represent the multisensory state observations of the agent at each time step. At inference time, MultiPLY can generate action tokens, instructing the agent to take the action in the environment and obtain the next multisensory state observation. The observation is then appended back to the LLM via state tokens to generate subsequent text or action tokens. We demonstrate that MultiPLY outperforms baselines by a large margin across a diverse set of embodied tasks involving object retrieval, tool use, multisensory captioning, and task decomposition.

MultiPLY encodes scenes abstractly, revealing object details through agent actions and updating via state tokens.

Overview

  • MultiPLY introduces LLMs that interact with a 3D environment, utilizing multiple senses.

  • It utilizes the Multisensory Universe dataset with over half a million multisensory interaction instances.

  • An LLM-powered virtual agent explores 3D environments to collect the sensory data, which the model handles through abstracted object-centric representations and action tokens.

  • State tokens allow the model to update its understanding and decide on further actions, improving task performance.

  • MultiPLY outperforms previous models in tasks like object retrieval and multisensory captioning through iterative multisensory interactions.

Overview of MultiPLY

The recently introduced MultiPLY framework extends LLMs so that they not only absorb multisensory data passively but also actively interact with three-dimensional (3D) environments. This capability makes AI agents markedly more dynamic, allowing them to glean information from the environment through multiple senses: visual, auditory, tactile, and thermal.

Data Collection and Representation

Underpinning this framework is the newly established Multisensory Universe dataset, which provides over half a million instances of multisensory interaction data. To amass this dataset, a virtual agent, powered by an LLM, is deployed within diverse 3D settings to collect observations across several sensory modalities. These 3D environments are abstractly encoded as object-centric representations that inform the LLM of the objects present and their spatial arrangement. Along with this high-level view, the LLM is designed to recognize and employ action tokens that correspond to specific interactions, such as navigating to an object or touching it to acquire tactile information.
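As a rough illustration of this setup, the sketch below shows how an abstracted object-centric scene and a small action-token vocabulary might be serialized into an LLM prompt. The specific token names, classes, and fields here are assumptions made for illustration, not the paper's actual vocabulary or implementation.

```python
# Minimal sketch of an abstracted object-centric scene prompt with action tokens.
# Token names (<NAVIGATE>, <TOUCH>, ...) and the SceneObject fields are
# illustrative assumptions, not MultiPLY's exact interface.
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str          # e.g. "mug"
    position: tuple    # coarse 3D location (x, y, z)
    feature_id: int    # index of the object's abstracted visual feature


ACTION_TOKENS = ["<NAVIGATE>", "<OBSERVE>", "<TOUCH>", "<HIT>", "<PICK-UP>"]


def scene_prompt(objects: list[SceneObject], instruction: str) -> str:
    """Serialize the abstracted scene and task into an LLM prompt.

    Each object appears only as a name, a coarse position, and a feature
    placeholder; detailed multisensory attributes are revealed only after
    the agent interacts with the object.
    """
    lines = ["Scene:"]
    for obj in objects:
        lines.append(f"  <OBJ {obj.feature_id}> {obj.name} at {obj.position}")
    lines.append(f"Task: {instruction}")
    lines.append(f"Available actions: {' '.join(ACTION_TOKENS)}")
    return "\n".join(lines)


objects = [
    SceneObject("mug", (1.2, 0.0, 3.4), 0),
    SceneObject("kettle", (2.0, 0.0, 1.1), 1),
]
print(scene_prompt(objects, "Find the object that is still hot."))
```

The key design point is that the scene enters the LLM in abstracted form; fine-grained sensory detail is deferred until the agent chooses to act on a specific object.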

State Tokens and Inference

After performing an action, the collected multisensory observation is communicated back to the LLM using state tokens, allowing the model to continuously update its understanding of the environment and determine the next action. This cycle repeats, enabling the agent to methodically explore its surroundings and gather comprehensive sensory data to generate text or further action tokens. MultiPLY's performance exceeds existing baselines across various tasks, including object retrieval, tool usage, multisensory captioning, and task decomposition.
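A minimal sketch of this action/state loop at inference time is given below. It assumes a hypothetical `llm` object with a `generate_next` method and an `env` object with an `execute` method; neither reflects MultiPLY's actual interfaces, and the token strings are placeholders.

```python
# Minimal sketch of the action/state token loop, assuming a hypothetical `llm`
# that emits one token at a time and an `env` that executes actions and returns
# multisensory observations. This is not MultiPLY's actual API.

ACTION_TOKENS = {"<NAVIGATE>", "<OBSERVE>", "<TOUCH>", "<HIT>", "<PICK-UP>"}


def run_episode(llm, env, prompt, max_steps=10):
    """Alternate between LLM generation and environment interaction.

    Whenever the LLM emits an action token, the agent executes it in the
    environment; the resulting multisensory observation (visual / audio /
    tactile / thermal) is appended back to the context as a state token so
    the LLM can condition its next text or action tokens on what it just
    perceived.
    """
    context = prompt
    for _ in range(max_steps):
        token = llm.generate_next(context)        # plain text or an action token
        context += token
        if token in ACTION_TOKENS:
            observation = env.execute(token)      # e.g. {"tactile": ..., "thermal": ...}
            context += f"<STATE>{observation}</STATE>"
        elif token == "<END>":
            break
    return context
```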

Experimental Findings

Through its interactive and multisensory capabilities, MultiPLY outperforms previous models that only process passive data and generate one-off outputs. This is particularly evident in object retrieval, where accounting for multiple modalities heavily influences whether the correct object is identified among visually similar candidates. In scenarios that require tool use, MultiPLY's detailed interaction with its environment allows it to reason more effectively about the functionality of objects given their multisensory attributes, yielding more accurate solutions. In multisensory captioning, the model draws on the sensory inputs it has gathered to describe objects more comprehensively. Finally, MultiPLY's iterative interaction approach lends itself well to tasks that involve breaking down complex activities into sequential actions.

By establishing a more intricate, closer-to-human method of environmental interaction, MultiPLY marks a significant stride toward embodied AI. This work not only expands the potential uses of LLMs but also broadens how AI systems can learn from and engage with the world around them.
