
Abstract

This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMA-7B. The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective. To address these issues, we propose a data distillation procedure to derive knowledge from an LLM to compress prompts without losing crucial information, and, at the same time, introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one, and use a Transformer encoder as the base architecture to capture all essential information for prompt compression from the full bidirectional context. Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT. We evaluate our method on both in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroSCROLLS, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. Additionally, our model is 3x-6x faster than existing prompt compression methods, while speeding up end-to-end inference by 1.6x-2.9x with compression ratios of 2x-5x.

Figure: Overview of the LLMLingua-2 model.

Overview

  • LLMLingua-2 proposes a data distillation method for prompt compression in LLMs, which retains the essential information while reducing prompt length to mitigate computational and financial burdens.

  • The approach involves generating an extractive text compression dataset via GPT-4, formulating prompt compression as a token classification problem in which each token is labeled as preserve or discard, and training a Transformer encoder on this dataset.

  • The model outperforms existing baselines in both in-domain and out-of-domain evaluations, demonstrating improved efficiency, reduced latency, and lower GPU memory usage, thus showing promise for various real-world applications.

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

Abstract

This document presents an overview and in-depth analysis of "LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression." The paper addresses the challenges of prompt compression in the context of LLMs, presenting LLMLingua-2, a novel approach designed to streamline and enhance the efficiency of prompt compression while ensuring the retention of critical information.

Introduction

Recent advancements in prompting techniques for LLMs, such as Chain-of-Thought (CoT) prompting, In-Context Learning (ICL), and Retrieval-Augmented Generation (RAG), have significantly extended the capabilities of these models. However, lengthy prompts, though rich in information, pose substantial computational and financial burdens. Prompt compression aims to mitigate these issues by reducing prompt length without compromising essential information.

LLMLingua-2 departs from existing methods relying on information entropy by introducing a data distillation procedure that leverages knowledge from an LLM to achieve more efficient and faithful prompt compression. This method is task-agnostic, enhancing generalizability and efficiency across various applications.

Methodology

Data Distillation Procedure

LLMLingua-2's data distillation procedure involves using GPT-4 to generate a text compression dataset composed of original and compressed text pairs. This dataset is constructed by prompting GPT-4 to compress texts according to specific instructions focused on retaining crucial information while eliminating redundancy. The prompt compression task is reframed as a token classification problem, allowing a Transformer encoder to leverage bidirectional context for optimal compression.
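As a rough illustration of this distillation step, the sketch below asks GPT-4 for an extractive compression of each text and stores the resulting (original, compressed) pairs. It assumes the official OpenAI Python client; the instruction string paraphrases the paper's description rather than reproducing the authors' exact prompt, and helper names such as distill_pair are illustrative.

```python
# Sketch of the data-distillation step: ask GPT-4 to compress each original
# text by dropping non-essential words only, then store (original, compressed)
# pairs. The instruction below paraphrases the paper's description and is not
# the exact prompt used by the authors.
import json
from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()

COMPRESSION_INSTRUCTION = (
    "Compress the given text by removing as many non-essential words as "
    "possible, without reordering or rewriting the remaining words, so that "
    "the key information is preserved."
)

def distill_pair(original_text: str, model: str = "gpt-4") -> dict:
    """Return one (original, compressed) training pair."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.3,
        messages=[
            {"role": "system", "content": COMPRESSION_INSTRUCTION},
            {"role": "user", "content": original_text},
        ],
    )
    compressed = response.choices[0].message.content.strip()
    return {"original": original_text, "compressed": compressed}

# Example: build a small extractive-compression dataset from raw text chunks.
if __name__ == "__main__":
    texts = ["<a MeetingBank transcript chunk>", "<another chunk>"]
    with open("compression_pairs.jsonl", "w") as f:
        for t in texts:
            f.write(json.dumps(distill_pair(t)) + "\n")
```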

Extractive Text Compression Dataset

The dataset comprises original texts from MeetingBank and their compressed counterparts, annotated to indicate whether each token should be preserved or discarded. Quality control metrics, such as Variation Rate (VR) and Alignment Gap (AG), ensure the fidelity and effectiveness of the annotation process.
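The sketch below shows one plausible way to turn such a pair into word-level preserve/discard labels and to apply a Variation Rate check. The greedy, order-preserving matching and the VR definition used here (the fraction of compressed words absent from the original) follow a simple reading of the paper; the authors' exact alignment rules and thresholds may differ, and the Alignment Gap metric is not implemented.

```python
# Sketch of word-level annotation and one quality-control check. Labels mark
# whether each original word survives in the GPT-4 compression (greedy,
# order-preserving matching). Pairs with a high Variation Rate (compressed
# words that never occur in the original) can be discarded as unfaithful.
from typing import List, Tuple

def annotate(original: str, compressed: str) -> Tuple[List[str], List[int]]:
    """Label each original word 1 (preserve) or 0 (discard)."""
    orig_words = original.split()
    comp_words = compressed.split()
    labels = [0] * len(orig_words)
    j = 0  # pointer into the compressed text
    for i, w in enumerate(orig_words):
        if j < len(comp_words) and w.lower() == comp_words[j].lower():
            labels[i] = 1
            j += 1
    return orig_words, labels

def variation_rate(original: str, compressed: str) -> float:
    """Fraction of compressed words absent from the original text."""
    orig_vocab = {w.lower() for w in original.split()}
    comp_words = [w.lower() for w in compressed.split()]
    if not comp_words:
        return 0.0
    novel = sum(1 for w in comp_words if w not in orig_vocab)
    return novel / len(comp_words)

words, labels = annotate(
    "the committee approved the budget after a long discussion",
    "committee approved budget after discussion",
)
print(list(zip(words, labels)))  # preserved words receive label 1
print(variation_rate("the committee approved the budget",
                     "committee approved budget"))  # 0.0: fully extractive
```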

Model Architecture and Training

The token classification model employs a Transformer encoder as the feature extractor, followed by a linear classification layer to predict token retention probabilities. The model is trained on the MeetingBank compression dataset using cross-entropy loss. Crucially, this approach guarantees the faithfulness of the compressed prompts by maintaining the original token sequence and leveraging bidirectional context.
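A minimal sketch of this setup, assuming Hugging Face Transformers and PyTorch, is shown below. The checkpoint name, learning rate, and word-to-token label alignment are illustrative assumptions rather than the authors' released configuration; at inference time, compression keeps the highest-scoring words in their original order.

```python
# Minimal sketch of the token-classification compressor, assuming Hugging Face
# Transformers. This is not the authors' released code: the checkpoint,
# hyperparameters, and word-to-token label propagation are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "xlm-roberta-large"  # the paper also reports a multilingual-BERT variant
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)

def training_step(words, word_labels, optimizer):
    """One cross-entropy update on a single annotated example."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    return_tensors="pt")
    # Propagate each word's preserve/discard label to its sub-tokens;
    # special tokens get -100 so the loss ignores them.
    token_labels = [
        -100 if wid is None else word_labels[wid]
        for wid in enc.word_ids(batch_index=0)
    ]
    out = model(**enc, labels=torch.tensor([token_labels]))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

@torch.no_grad()
def compress(text: str, keep_ratio: float = 0.5) -> str:
    """Keep the highest-probability words, preserving their original order."""
    words = text.split()
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    return_tensors="pt")
    probs = model(**enc).logits.softmax(-1)[0, :, 1]  # P(preserve) per token
    # Score each word by the mean probability of its sub-tokens.
    word_scores = [[] for _ in words]
    for pos, wid in enumerate(enc.word_ids(batch_index=0)):
        if wid is not None:
            word_scores[wid].append(probs[pos].item())
    scores = [sum(s) / len(s) if s else 0.0 for s in word_scores]
    k = max(1, int(len(words) * keep_ratio))
    keep = set(sorted(range(len(words)), key=lambda i: -scores[i])[:k])
    return " ".join(w for i, w in enumerate(words) if i in keep)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# training_step(words, labels, optimizer)      # pairs from the annotation step
# print(compress("a long prompt ...", keep_ratio=0.3))  # ~3x compression
```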

Results

In-Domain Evaluation

The model's performance was evaluated on both QA and summarization tasks within the MeetingBank dataset. LLMLingua-2 demonstrated significant improvements over existing baselines, including Selective-Context and the original LLMLingua. Notably, despite using a much smaller compressor than the LLaMA-2-7B underlying these baselines, LLMLingua-2 outperformed them on QA F1 scores and on summarization metrics such as BLEU, ROUGE, and BERTScore.

Out-of-Domain Evaluation

The robustness of LLMLingua-2 was further tested on long-context datasets such as LongBench and ZeroSCROLLS, as well as reasoning benchmarks like GSM8K and BBH. The results underscored LLMLingua-2's superior generalizability, achieving higher performance compared to task-agnostic baselines. Even the smaller LLMLingua-2 model (based on multilingual-BERT) surpassed the performance of LLaMA-2-7B-based models.

Efficiency and Latency

LLMLingua-2's small model size and efficient design yield significant reductions in latency and GPU memory usage. The model speeds up end-to-end inference by 1.6x to 2.9x, offering a compelling advantage in practical deployments. Additionally, its peak GPU memory usage is considerably lower than that of comparative models, further enhancing its applicability in resource-constrained environments.

Implications and Future Directions

LLMLingua-2's approach to prompt compression represents a significant stride in improving the efficiency and reliability of LLM applications. The task-agnostic nature of the model ensures broad applicability, while the data distillation procedure guarantees high-quality compression without sacrificing essential information. Future research could explore extending the dataset to cover a wider range of domains, enhancing the model's generalizability further.

Overall, LLMLingua-2 sets a new standard for prompt compression in LLMs, balancing efficiency and fidelity to meet the demands of diverse real-world applications. The model's integration with existing compression frameworks and potential for expansion points to a promising trajectory for ongoing advancements in AI-driven language processing.
