Emergent Mind

Embodied Question Answering via Multi-LLM Systems

(2406.10918)
Published Jun 16, 2024 in cs.LG , cs.AI , and cs.CL

Abstract

Embodied Question Answering (EQA) is an important problem, which involves an agent exploring the environment to answer user queries. In the existing literature, EQA has exclusively been studied in single-agent scenarios, where exploration can be time-consuming and costly. In this work, we consider EQA in a multi-agent framework involving multiple Large Language Model (LLM) based agents independently answering queries about a household environment. To generate one answer for each query, we use the individual responses to train a Central Answer Model (CAM) that aggregates responses for a robust answer. Using CAM, we observe a $50\%$ higher EQA accuracy when compared against aggregation methods for ensembles of LLMs, such as voting schemes and debates. CAM does not require any form of agent communication, freeing it from the associated costs. We ablate CAM with various nonlinear (neural network, random forest, decision tree, XGBoost) and linear (logistic regression classifier, SVM) algorithms. Finally, we present a feature importance analysis for CAM via permutation feature importance (PFI), quantifying CAM's reliance on each independent agent and query context.

Figure: Multi-LLM framework. Agents answer a Yes/No question; the central network weighs the first agent's decision more heavily.

Overview

  • The paper introduces a novel Embodied Question Answering (EQA) approach using a Multi-LLM framework: multiple Large Language Model-based agents independently address queries about a household environment, and a Central Answer Model (CAM) aggregates their responses.

  • Through extensive experimentation, the proposed CAM-based method demonstrates up to 50% higher accuracy compared to traditional aggregation methods like majority voting, especially when integrated with SOTA exploration methods such as LGX.

  • The approach enhances the efficiency and accuracy of EQA in dynamic and unstructured environments, with future research aimed at tackling dynamic household settings, subjective queries, and expanding the framework to broader contexts.

Overview of Embodied Question Answering via Multi-LLM Systems

The paper "Embodied Question Answering via Multi-LLM Systems," co-authored by Bhrij Patel, Vishnu Sashank Dorbala, Amrit Singh Bedi, and Dinesh Manocha, tackles the challenging problem of Embodied Question Answering (EQA) within a multi-agent framework. The authors propose a novel approach that utilizes multiple Large Language Model (LLM) based agents to independently address queries about a household environment. They introduce the Central Answer Model (CAM), which is trained to aggregate the responses from these independent agents to generate a robust final answer.

Context and Motivation

Embodied Question Answering involves an autonomous agent that navigates and explores an environment to answer user queries based on its observations. Traditional approaches have limited the scenario to a single agent, leading to high exploration costs and low environmental coverage. The advent of Embodied Artificial Intelligence (EAI) and the success of LLMs in natural language understanding and common-sense reasoning present an opportunity to scale EQA to multi-agent systems. However, achieving consensus from multiple agents' outputs in an embodied setting introduces challenges, particularly when individual agents' answers conflict due to partial observations or varying interpretation capabilities of LLMs.

Proposed Approach

The authors propose a Multi-LLM framework where multiple LLM-based agents explore a household environment. Each agent independently answers a set of binary embodied questions. These responses are then utilized to train a Central Answer Model (CAM) using a variety of machine learning algorithms, both linear (e.g., logistic regression, SVM) and non-linear (e.g., neural networks, random forest, decision tree, XGBoost). CAM aggregates these responses without requiring inter-agent communication, thereby mitigating associated costs.
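The idea can be sketched as follows. This is a minimal illustration on simulated agent responses, not the paper's actual pipeline: the paper trains on real agent answers plus query context and ablates several classifiers (XGBoost, SVMs, etc.), whereas here we use a random forest on synthetic Yes/No data.

```python
# Sketch of training a Central Answer Model (CAM) on agents' binary responses.
# All data here is simulated for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_queries, n_agents = 500, 4

# Ground-truth Yes/No (1/0) answer for each query.
ground_truth = rng.integers(0, 2, size=n_queries)
# Simulate agents that independently answer correctly ~75% of the time.
correct = rng.random((n_queries, n_agents)) < 0.75
agent_answers = np.where(correct, ground_truth[:, None], 1 - ground_truth[:, None])

# CAM: a supervised classifier mapping the vector of agent answers to a
# single robust final answer, with no communication between agents.
cam = RandomForestClassifier(n_estimators=100, random_state=0)
cam.fit(agent_answers[:400], ground_truth[:400])
accuracy = cam.score(agent_answers[400:], ground_truth[400:])
print(f"CAM held-out accuracy: {accuracy:.2f}")
```

Because CAM sees only the agents' final answers (and, in the paper, query context), no agent needs to exchange messages with any other, which is what keeps the communication cost at zero.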

Key Contributions

  1. Novel Multi-LLM EQA Framework: The introduction of CAM for EQA in a multi-agent setting represents a critical innovation. The central model is trained to deliver consistent answers by weighting the reliability of each agent's response, significantly improving accuracy compared to traditional aggregation methods like majority voting or LLM debates.
  2. Integration with Exploration Systems: The authors evaluate their multi-agent framework with state-of-the-art LLM-based exploration methods on Matterport3D environments, demonstrating that CAM works in synergy with these methods.
  3. Feature Importance Analysis: A feature importance analysis using permutation feature importance (PFI) is conducted to quantify CAM's reliance on the responses of individual agents and query context, providing insights into the model's decision-making process.
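The PFI analysis in the third contribution can be sketched with scikit-learn's `permutation_importance`: shuffle one input column at a time and measure the resulting drop in the model's score. The agents and their reliabilities below are simulated assumptions; the paper applies PFI to its trained CAM and query-context features.

```python
# Sketch of permutation feature importance (PFI) for a CAM-style aggregator.
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_queries, n_agents = 500, 3

ground_truth = rng.integers(0, 2, size=n_queries)
# Agent 0 is reliable (90% correct); agents 1 and 2 are noisy (60%).
reliability = np.array([0.9, 0.6, 0.6])
correct = rng.random((n_queries, n_agents)) < reliability
X = np.where(correct, ground_truth[:, None], 1 - ground_truth[:, None])

cam = LogisticRegression().fit(X, ground_truth)
# Permute each agent's column in turn; the accuracy drop is that agent's PFI.
result = permutation_importance(cam, X, ground_truth, n_repeats=20, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"agent {i}: mean importance = {imp:.3f}")
```

In this toy setup the reliable agent's column should show by far the largest importance, mirroring how PFI exposes which agents (and context features) the trained CAM actually relies on.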

Experimental Validation

The performance of CAM was validated through extensive experimentation:

  • Accuracy Improvement: In comparison with majority vote and debate-based aggregation methods, CAM achieved up to a 50% higher accuracy. This improvement underscores the benefit of using supervised learning to weigh the agents' responses effectively.
  • Scalability with Multiple Agents: Tests with varying numbers of agents showed that CAM consistently outperforms traditional methods, with non-linear models (e.g., XGBoost) providing the best results.
  • Practicality with SOTA Exploration Methods: When combined with LGX, a SOTA exploration method, CAM continued to demonstrate superior performance, illustrating its potential for real-world application.
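The intuition behind the accuracy gap over majority voting can be illustrated with a small simulation (illustrative only, not the paper's experiments): when one agent is markedly more reliable than the rest, an unweighted vote is pulled toward the weak majority, while a trained aggregator learns to upweight the reliable agent.

```python
# Illustrative comparison of majority voting vs. a learned aggregator
# on simulated agent answers.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_queries, n_agents = 1000, 5

ground_truth = rng.integers(0, 2, size=n_queries)
# One strong agent (85% correct) and four weak ones (55% correct).
reliability = np.array([0.85, 0.55, 0.55, 0.55, 0.55])
correct = rng.random((n_queries, n_agents)) < reliability
X = np.where(correct, ground_truth[:, None], 1 - ground_truth[:, None])

# Baseline: unweighted majority vote, evaluated on the held-out queries.
vote_pred = (X[800:].mean(axis=1) >= 0.5).astype(int)
vote_acc = (vote_pred == ground_truth[800:]).mean()

# Learned aggregator trained on the first 800 queries.
cam = LogisticRegression().fit(X[:800], ground_truth[:800])
cam_acc = cam.score(X[800:], ground_truth[800:])
print(f"majority vote: {vote_acc:.2f}  learned aggregator: {cam_acc:.2f}")
```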

Implications and Future Directions

The practical implications of this work are significant. By leveraging a multi-agent system and a robust aggregation model, the proposed approach can dramatically improve the efficiency and accuracy of EQA tasks in dynamic and unstructured environments like households. This can accelerate the development of more reliable and intelligent in-home robots and personalized assistants.

Future research could focus on addressing some limitations identified in the study, such as dynamic household environments and queries that are subjective or non-binary. Additionally, expanding the framework to handle non-explicit queries and applying it to other contexts like long video understanding could further broaden its applicability.

Conclusion

This paper presents an innovative and effective solution for Embodied Question Answering using multiple LLM-based agents. By introducing and validating the Central Answer Model, the authors provide a robust method for aggregating responses from multiple agents, significantly enhancing the accuracy and feasibility of EQA tasks. The integration of this method with exploration systems and the detailed feature importance analysis contribute valuable insights for future advancements in multi-agent systems and embodied AI.
