Emergent Mind

Embodied Question Answering via Multi-LLM Systems

(2406.10918)
Published Jun 16, 2024 in cs.LG , cs.AI , and cs.CL

Abstract

Embodied Question Answering (EQA) is an important problem, which involves an agent exploring the environment to answer user queries. In the existing literature, EQA has exclusively been studied in single-agent scenarios, where exploration can be time-consuming and costly. In this work, we consider EQA in a multi-agent framework involving multiple Large Language Model (LLM) based agents independently answering queries about a household environment. To generate one answer for each query, we use the individual responses to train a Central Answer Model (CAM) that aggregates responses for a robust answer. Using CAM, we observe a $50\%$ higher EQA accuracy when compared against aggregation methods for ensembles of LLMs, such as voting schemes and debates. CAM does not require any form of agent communication, freeing it from the associated costs. We ablate CAM with various nonlinear (neural network, random forest, decision tree, XGBoost) and linear (logistic regression classifier, SVM) algorithms. Finally, we present a feature importance analysis for CAM via permutation feature importance (PFI), quantifying CAM's reliance on each independent agent and query context.

Figure: Multi-LLM framework. Agents answer a Yes/No question; the central network weighs the first agent's decision more heavily.

Overview

  • The paper introduces a novel Embodied Question Answering (EQA) approach using a Multi-LLM framework: multiple Large Language Model-based agents independently address queries about a household environment, and a Central Answer Model (CAM) aggregates their responses.

  • Through extensive experimentation, the proposed CAM-based method demonstrates up to 50% higher accuracy compared to traditional aggregation methods like majority voting, especially when integrated with SOTA exploration methods such as LGX.

  • The approach enhances the efficiency and accuracy of EQA in dynamic and unstructured environments, with future research aimed at tackling dynamic household settings, subjective queries, and expanding the framework to broader contexts.

Overview of Embodied Question Answering via Multi-LLM Systems

The paper "Embodied Question Answering via Multi-LLM Systems," co-authored by Bhrij Patel, Vishnu Sashank Dorbala, Amrit Singh Bedi, and Dinesh Manocha, tackles the challenging problem of Embodied Question Answering (EQA) within a multi-agent framework. The authors propose a novel approach that utilizes multiple Large Language Model (LLM) based agents to independently address queries about a household environment. They introduce the Central Answer Model (CAM), which is trained to aggregate the responses from these independent agents to generate a robust final answer.

Context and Motivation

Embodied Question Answering involves an autonomous agent that navigates and explores an environment to answer user queries based on its observations. Traditional approaches have limited the scenario to a single agent, leading to high exploration costs and low environmental coverage. The advent of Embodied Artificial Intelligence (EAI) and the success of LLMs in natural language understanding and common-sense reasoning present an opportunity to scale EQA to multi-agent systems. However, achieving consensus from multiple agents' outputs in an embodied setting introduces challenges, particularly when individual agents' answers conflict due to partial observations or varying interpretation capabilities of LLMs.

Proposed Approach

The authors propose a Multi-LLM framework where multiple LLM-based agents explore a household environment. Each agent independently answers a set of binary embodied questions. These responses are then utilized to train a Central Answer Model (CAM) using a variety of machine learning algorithms, both linear (e.g., logistic regression, SVM) and non-linear (e.g., neural networks, random forest, decision tree, XGBoost). CAM aggregates these responses without requiring inter-agent communication, thereby mitigating associated costs.
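The idea can be sketched as follows. This is a minimal illustration on simulated agent responses, not the paper's actual pipeline: the paper trains on real agent answers plus query context and ablates several classifiers (XGBoost, SVMs, etc.), whereas here we use a random forest on synthetic Yes/No data.

```python
# Sketch of training a Central Answer Model (CAM) on agents' binary responses.
# All data here is simulated for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_queries, n_agents = 500, 4

# Ground-truth Yes/No (1/0) answer for each query.
ground_truth = rng.integers(0, 2, size=n_queries)
# Simulate agents that independently answer correctly ~75% of the time.
correct = rng.random((n_queries, n_agents)) < 0.75
agent_answers = np.where(correct, ground_truth[:, None], 1 - ground_truth[:, None])

# CAM: a supervised classifier mapping the vector of agent answers to a
# single robust final answer, with no communication between agents.
cam = RandomForestClassifier(n_estimators=100, random_state=0)
cam.fit(agent_answers[:400], ground_truth[:400])
accuracy = cam.score(agent_answers[400:], ground_truth[400:])
print(f"CAM held-out accuracy: {accuracy:.2f}")
```

Because CAM sees only the agents' final answers (and, in the paper, query context), no agent needs to exchange messages with any other, which is what keeps the communication cost at zero.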

Key Contributions

  1. Novel Multi-LLM EQA Framework: The introduction of CAM for EQA in a multi-agent setting represents a critical innovation. The central model is trained to deliver consistent answers by weighting the reliability of each agent's response, significantly improving accuracy compared to traditional aggregation methods like majority voting or LLM debates.
  2. Integration with Exploration Systems: The authors evaluate their multi-agent framework with state-of-the-art LLM-based exploration methods on Matterport3D environments, demonstrating that CAM works in synergy with these methods.
  3. Feature Importance Analysis: A feature importance analysis using permutation feature importance (PFI) is conducted to quantify CAM's reliance on the responses of individual agents and query context, providing insights into the model's decision-making process.
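The PFI analysis in the third contribution can be sketched with scikit-learn's `permutation_importance`: shuffle one input column at a time and measure the resulting drop in the model's score. The agents and their reliabilities below are simulated assumptions; the paper applies PFI to its trained CAM and query-context features.

```python
# Sketch of permutation feature importance (PFI) for a CAM-style aggregator.
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_queries, n_agents = 500, 3

ground_truth = rng.integers(0, 2, size=n_queries)
# Agent 0 is reliable (90% correct); agents 1 and 2 are noisy (60%).
reliability = np.array([0.9, 0.6, 0.6])
correct = rng.random((n_queries, n_agents)) < reliability
X = np.where(correct, ground_truth[:, None], 1 - ground_truth[:, None])

cam = LogisticRegression().fit(X, ground_truth)
# Permute each agent's column in turn; the accuracy drop is that agent's PFI.
result = permutation_importance(cam, X, ground_truth, n_repeats=20, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"agent {i}: mean importance = {imp:.3f}")
```

In this toy setup the reliable agent's column should show by far the largest importance, mirroring how PFI exposes which agents (and context features) the trained CAM actually relies on.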

Experimental Validation

The performance of CAM was validated through extensive experimentation:

  • Accuracy Improvement: In comparison with majority vote and debate-based aggregation methods, CAM achieved up to a 50% higher accuracy. This improvement underscores the benefit of using supervised learning to weigh the agents' responses effectively.
  • Scalability with Multiple Agents: Tests with varying numbers of agents showed that CAM consistently outperforms traditional methods, with non-linear models (e.g., XGBoost) providing the best results.
  • Practicality with SOTA Exploration Methods: When combined with LGX, a SOTA exploration method, CAM continued to demonstrate superior performance, illustrating its potential for real-world application.
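The intuition behind the accuracy gap over majority voting can be illustrated with a small simulation (illustrative only, not the paper's experiments): when one agent is markedly more reliable than the rest, an unweighted vote is pulled toward the weak majority, while a trained aggregator learns to upweight the reliable agent.

```python
# Illustrative comparison of majority voting vs. a learned aggregator
# on simulated agent answers.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_queries, n_agents = 1000, 5

ground_truth = rng.integers(0, 2, size=n_queries)
# One strong agent (85% correct) and four weak ones (55% correct).
reliability = np.array([0.85, 0.55, 0.55, 0.55, 0.55])
correct = rng.random((n_queries, n_agents)) < reliability
X = np.where(correct, ground_truth[:, None], 1 - ground_truth[:, None])

# Baseline: unweighted majority vote, evaluated on the held-out queries.
vote_pred = (X[800:].mean(axis=1) >= 0.5).astype(int)
vote_acc = (vote_pred == ground_truth[800:]).mean()

# Learned aggregator trained on the first 800 queries.
cam = LogisticRegression().fit(X[:800], ground_truth[:800])
cam_acc = cam.score(X[800:], ground_truth[800:])
print(f"majority vote: {vote_acc:.2f}  learned aggregator: {cam_acc:.2f}")
```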

Implications and Future Directions

The practical implications of this work are significant. By leveraging a multi-agent system and a robust aggregation model, the proposed approach can dramatically improve the efficiency and accuracy of EQA tasks in dynamic and unstructured environments like households. This can accelerate the development of more reliable and intelligent in-home robots and personalized assistants.

Future research could focus on addressing some limitations identified in the study, such as dynamic household environments and queries that are subjective or non-binary. Additionally, expanding the framework to handle non-explicit queries and applying it to other contexts like long video understanding could further broaden its applicability.

Conclusion

This paper presents an innovative and effective solution for Embodied Question Answering using multiple LLM-based agents. By introducing and validating the Central Answer Model, the authors provide a robust method for aggregating responses from multiple agents, significantly enhancing the accuracy and feasibility of EQA tasks. The integration of this method with exploration systems and the detailed feature importance analysis contribute valuable insights for future advancements in multi-agent systems and embodied AI.
