Emergent Mind

Abstract

Multi-agent interactions between Large Language Model (LLM) agents have shown major improvements on diverse reasoning tasks. However, these involve long generations from multiple models across several rounds, making them expensive. Moreover, these multi-agent approaches fail to provide a final, single model for efficient inference. To address this, we introduce MAGDi, a new method for structured distillation of the reasoning interactions between multiple LLMs into smaller LMs. MAGDi teaches smaller models by representing multi-agent interactions as graphs, augmenting a base student model with a graph encoder, and distilling knowledge using three objective functions: next-token prediction, a contrastive loss between correct and incorrect reasoning, and a graph-based objective to model the interaction structure. Experiments on seven widely-used commonsense and math reasoning benchmarks show that MAGDi improves the reasoning capabilities of smaller models, outperforming several methods that distill from a single teacher and multiple teachers. Moreover, MAGDi also demonstrates an order of magnitude higher efficiency over its teachers. We conduct extensive analyses to show that MAGDi (1) enhances the generalizability to out-of-domain tasks, (2) scales positively with the size and strength of the base student model, and (3) obtains larger improvements (via our multi-teacher training) when applying self-consistency - an inference technique that relies on model diversity.

Distillation method where multiple teacher-LLMs discuss a problem, creating a multi-agent interaction graph.
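
To make this structure concrete, here is a minimal sketch of how such a multi-agent interaction graph could be represented in code. The layout (one node per agent response per round, edges from each response to the following round's responses) and all names such as `ReasoningNode` and `InteractionGraph` are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical representation of a multi-agent interaction graph:
# nodes are individual agent responses, edges connect a response to the
# next-round responses that conditioned on it.
from dataclasses import dataclass, field


@dataclass
class ReasoningNode:
    agent: str          # which teacher LLM produced this response
    round_idx: int      # discussion round the response belongs to
    text: str           # the chain-of-thought reasoning
    is_correct: bool    # whether the final answer matches the gold label


@dataclass
class InteractionGraph:
    question: str
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (parent_idx, child_idx) pairs

    def add_round(self, responses):
        """Append one round of agent responses, connecting each new node
        to every node from the previous round."""
        prev = [i for i, n in enumerate(self.nodes)
                if n.round_idx == responses[0].round_idx - 1]
        for resp in responses:
            self.nodes.append(resp)
            child = len(self.nodes) - 1
            self.edges.extend((p, child) for p in prev)
```

The per-node correctness label is what later lets the distillation step distinguish correct from incorrect reasoning chains.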

Overview

  • The paper introduces Multi-Agent Interaction Graphs Distillation (MAGDi), which distills reasoning interactions from large models into smaller ones.

  • MAGDi employs a graph encoder and three specific objective functions to transfer knowledge effectively.

  • Testing on seven reasoning benchmarks shows MAGDi improves smaller models' reasoning abilities while maintaining efficiency.

  • The method displays generalizability and scalability across different domains, model sizes, and tasks.

  • MAGDi supports diverse outputs and complements self-consistency inference techniques, enriching the response spectrum of models.

Introduction

Multi-agent interaction between LLMs has played a crucial role in enhancing performance on reasoning tasks. Yet its benefits are typically offset by high computational costs: long generations from multiple model instances interacting over several rounds. A further issue is the lack of a final, efficient model for inference, as multi-agent frameworks do not consolidate reasoning skills into a standalone model.

Multi-Agent Distillation

To address these challenges, this study introduces Multi-Agent Interaction Graphs Distillation (MAGDi), a structured approach to distilling the reasoning interactions of multiple LLMs into more compact language models. It augments a student model with a graph encoder and distills knowledge using three tailored objective functions: next-token prediction, a contrastive loss between correct and incorrect reasoning, and a graph-based objective that captures the interaction structure.
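
As a rough illustration of how these three objectives might be combined during training, the sketch below assumes a HuggingFace-style causal student LM and a hypothetical `graph_encoder` GNN; the margin formulation, loss weights, and helper names are assumptions rather than the paper's exact recipe.

```python
import torch.nn.functional as F


def sequence_logprob(model, input_ids, labels):
    """Token-averaged log-likelihood of a reasoning chain under the student LM
    (assumes a HuggingFace-style causal LM with -100 padding in labels)."""
    logits = model(input_ids=input_ids).logits[:, :-1]
    targets = labels[:, 1:]
    logp = F.log_softmax(logits, dim=-1)
    token_lp = logp.gather(-1, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    mask = (targets != -100).float()
    return (token_lp * mask).sum(-1) / mask.sum(-1).clamp(min=1)


def magdi_style_loss(model, graph_encoder, batch,
                     w_contrastive=1.0, w_graph=1.0, margin=1.0):
    pos_lp = sequence_logprob(model, batch["pos_ids"], batch["pos_labels"])
    neg_lp = sequence_logprob(model, batch["neg_ids"], batch["neg_labels"])

    # (1) Next-token prediction on correct reasoning chains.
    ntp = -pos_lp.mean()

    # (2) Contrastive margin: prefer correct chains over incorrect ones.
    contrastive = F.relu(margin - (pos_lp - neg_lp)).mean()

    # (3) Graph objective: a GNN over the interaction graph predicts
    #     per-node correctness (graph_encoder is a hypothetical module).
    node_logits = graph_encoder(batch["node_feats"], batch["edge_index"])
    graph = F.binary_cross_entropy_with_logits(
        node_logits.squeeze(-1), batch["node_labels"].float())

    return ntp + w_contrastive * contrastive + w_graph * graph
```

The key design point is that all three terms are computed from the same interaction graph: correct nodes feed the language-modeling term, correct/incorrect pairs feed the contrastive term, and the graph topology feeds the structural term.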

Experimentation and Results

The efficacy of MAGDi has been tested on seven prominent reasoning benchmarks. The evaluations show that the method not only substantially improves smaller models' reasoning abilities but also delivers efficiency an order of magnitude better than the multi-agent teacher setup. For instance, MAGDi-distilled models reduce token generation by up to 9x at inference time while surpassing all single-teacher distillation baselines in performance.

Scalability and Generalizability

Further analysis shows that MAGDi's benefits carry over to generalizability and scalability across domains and model sizes. When used to construct a universal multi-task model, MAGDi performs comparably on multiple tasks simultaneously and remains competent even on out-of-domain tasks. Moreover, the method scales positively with the underlying student model's size and strength, indicating its long-term applicability as foundation models evolve.

Diversity and Inference Techniques

MAGDi also has the potential to enhance model diversity, as demonstrated by its compatibility with self-consistency, an inference technique that depends on varied model outputs. Student models trained via MAGDi achieve notable performance jumps when combined with such ensemble methods, suggesting that structured distillation imbues models with a richer response spectrum.
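
For reference, self-consistency itself is simple to sketch: sample several reasoning chains and take a majority vote over their final answers. The helpers `generate_chain` and `extract_answer` below are hypothetical stand-ins for the student model's sampling and answer-parsing code.

```python
from collections import Counter


def self_consistency(generate_chain, extract_answer, question,
                     k=10, temperature=0.7):
    """Sample k reasoning chains and return the majority-vote answer."""
    answers = []
    for _ in range(k):
        chain = generate_chain(question, temperature=temperature)  # sampled CoT
        answers.append(extract_answer(chain))
    # Majority vote; ties are broken by insertion order of Counter.
    return Counter(answers).most_common(1)[0][0]
```

This is where a richer response spectrum pays off: the vote only helps when the sampled chains explore genuinely different reasoning paths.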

Conclusion

This paper presents MAGDi as a way to infuse LLMs' reasoning prowess into smaller models without incurring prohibitive computational costs. The empirical results underscore the potential of structured distillation for building efficient, robust reasoning models that transfer across tasks and preserve diversity for advanced inference techniques.
