Emergent Mind

MUSCLE: A Model Update Strategy for Compatible LLM Evolution

(2407.09435)
Published Jul 12, 2024 in cs.AI

Abstract

LLMs are frequently updated due to data or architecture changes to improve their performance. When updating models, developers often focus on increasing overall performance metrics with less emphasis on being compatible with previous model versions. However, users often build a mental model of the functionality and capabilities of a particular machine learning model they are interacting with. They have to adapt their mental model with every update -- a draining task that can lead to user dissatisfaction. In practice, fine-tuned downstream task adapters rely on pretrained LLM base models. When these base models are updated, these user-facing downstream task models experience instance regression or negative flips -- previously correct instances are now predicted incorrectly. This happens even when the downstream task training procedures remain identical. Our work aims to provide seamless model updates to a user in two ways. First, we provide evaluation metrics for a notion of compatibility to prior model versions, specifically for generative tasks but also applicable for discriminative tasks. We observe regression and inconsistencies between different model versions on a diverse set of tasks and model updates. Second, we propose a training strategy to minimize the number of inconsistencies in model updates, involving training of a compatibility model that can enhance task fine-tuned language models. We reduce negative flips -- instances where a prior model version was correct, but a new model incorrect -- by up to 40% from Llama 1 to Llama 2.

Figure: A model update that introduces negative flips is mitigated with the compatibility adapter while maintaining ROUGE-1 score performance.

Overview

  • The paper introduces a strategy called MUSCLE to maintain compatibility between different versions of LLMs, focusing on user experience and reducing regression when models are updated.

  • Key contributions include new metrics for evaluating compatibility, a novel 'compatibility adapter' using knowledge distillation to align old and new model behaviors, and validation of the approach through extensive experiments across multiple LLMs and tasks.

  • The results show up to 40% reduction in negative flips in some scenarios, indicating significant improvements in maintaining user-expected model performance across updates, with practical implications for user satisfaction and cognitive load.


The paper "MUSCLE: A Model Update Strategy for Compatible LLM Evolution" addresses a crucial yet often overlooked challenge in the evolution of LLMs: maintaining compatibility between different versions of the models. This problem is significant as model updates typically focus on enhancing performance metrics without considering how changes might impact user experience, particularly for those who have developed a mental model of the LLM's capabilities and behaviors.

Key Contributions

The authors make several noteworthy contributions to the field:

  1. Compatibility Metrics: They introduce new evaluation metrics for measuring compatibility between different versions of LLMs. This includes extending the traditional negative flip rate (NFR) used for classification tasks to generative tasks as well. Metrics like backward trust compatibility (BTC) and negative flip impact are adapted to capture both positive flips (incorrect to correct) and negative flips (correct to incorrect), as well as inconsistencies where both models are incorrect but produce different predictions.
  2. Compatibility Adapter: The paper proposes a novel training strategy using a compatibility adapter. This adapter is fine-tuned to ensure minimal regression when updating from an older model version to a newer one. By leveraging knowledge distillation, the authors train the adapter to align with both the old and new models, significantly reducing negative flips by aligning more closely with the user's expectations.
  3. Experimental Validation: The authors validate their approach across a diverse set of tasks and models. They show that the compatibility adapter can reduce negative flips by up to 40% in some scenarios (e.g., from Llama 1 to Llama 2), without compromising the overall performance improvements brought by the new model versions.
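To make the compatibility metrics concrete, the following is a minimal sketch of the standard definitions of negative flip rate and backward trust compatibility for a classification task; the paper's exact generative-task formulations may differ, and the function names here are illustrative.

```python
def negative_flip_rate(old_preds, new_preds, labels):
    """Fraction of all instances the old model got right but the new model gets wrong."""
    flips = sum(1 for o, n, y in zip(old_preds, new_preds, labels)
                if o == y and n != y)
    return flips / len(labels)

def backward_trust_compatibility(old_preds, new_preds, labels):
    """Among instances the old model got right, the fraction the new model also gets right."""
    old_correct = [(n, y) for o, n, y in zip(old_preds, new_preds, labels) if o == y]
    if not old_correct:
        return 1.0  # vacuously compatible if the old model was never correct
    return sum(1 for n, y in old_correct if n == y) / len(old_correct)
```

A perfectly compatible update has NFR of 0 and BTC of 1: the new model never regresses on anything the old model handled correctly.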

Experimental Setup

The experiments are comprehensive, considering updates across multiple LLM families, including Llama and Vicuna, and evaluated on various downstream tasks such as HellaSwag, PIQA, GSM8K, and SAMSum. For all tasks, parameter-efficient fine-tuning using Low-Rank Adapters (LoRA) is employed.
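The compatibility adapter is trained by distilling from the old model while still fitting the task. The sketch below shows one common way to combine a hard-label task loss with a temperature-scaled KL distillation term; the exact loss weighting and distillation target used in the paper may differ, and `alpha` and `T` here are illustrative hyperparameters.

```python
import numpy as np

def softmax(x, T=1.0):
    """Numerically stable softmax with optional temperature."""
    z = x / T - np.max(x / T, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def compatibility_loss(new_logits, old_logits, labels, alpha=0.5, T=2.0):
    """Blend task cross-entropy (hard labels) with KL distillation toward the old model.

    alpha=1.0 recovers pure task fine-tuning; alpha=0.0 purely imitates the old model.
    """
    p_new = softmax(new_logits)
    task_loss = -np.mean(np.log(p_new[np.arange(len(labels)), labels] + 1e-12))
    # Temperature-softened distributions for the distillation term.
    p_old_T = softmax(old_logits, T)
    log_p_new_T = np.log(softmax(new_logits, T) + 1e-12)
    kd_loss = np.mean(np.sum(p_old_T * (np.log(p_old_T + 1e-12) - log_p_new_T),
                             axis=-1)) * T ** 2
    return alpha * task_loss + (1 - alpha) * kd_loss
```

Because only a small LoRA adapter is trained against this objective, the update can pull the new model toward the old model's behavior on instances the old model handled correctly without touching the base weights.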

Numerical Results

The results are compelling:

  • HellaSwag: A reduction in negative flip rate (NFR) by 40.60% (Llama 1 to Llama 2) and 38.74% (Vicuna 1.3 to Vicuna 1.5) while also achieving significant accuracy gains.
  • PIQA: Similar improvements are noted with a reduction in NFR by 34.25% (Llama 1 to Llama 2).
  • GSM8K: Up to 29% reduction in NFR when updating from Phi 1.5 to Phi 2, demonstrating the approach's utility in math reasoning tasks.
  • SAMSum: For generative tasks, the compatibility adapter reduces ROUGE-1 score regression by 27.46% for Phi 1.5 to Phi 2 updates.

Theoretical and Practical Implications

The proposed methodology has significant theoretical and practical implications:

  • User-Centric Model Updates: By maintaining compatibility, users can experience more consistent model behaviors, leading to improved satisfaction and reduced cognitive load.
  • Extending to Generative Models: Extending compatibility metrics to generative models broadens the applicability of this approach, making it relevant for a larger array of tasks.

Speculations on Future Developments

This work paves the way for future research focused on achieving seamless model evolution. Key areas for future exploration might include:

  • Tokenization and Vocabulary Changes: Investigating strategies to handle updates involving changes in tokenization or vocabulary size.
  • Bias Mitigation: Ensuring that while compatibility is maintained, biases inherent in older models do not get perpetuated.

In conclusion, this paper provides a robust framework for addressing model compatibility in LLM updates, offering both a theoretical foundation and practical solutions for reducing negative flips and maintaining user trust during model evolution. The introduction of compatibility adapters and extended metrics represents a significant advance in the field, providing a more user-centered approach to LLM development and deployment.
