
Abstract

This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMA-7B. The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective. To address these issues, we propose a data distillation procedure to derive knowledge from an LLM to compress prompts without losing crucial information, and, at the same time, introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one, and use a Transformer encoder as the base architecture to capture all essential information for prompt compression from the full bidirectional context. Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT. We evaluate our method on both in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroSCROLLS, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. Additionally, our model is 3x-6x faster than existing prompt compression methods, while speeding up end-to-end inference by 1.6x-2.9x with compression ratios of 2x-5x.

Figure: Overview of the LLMLingua-2 model.

Overview

  • LLMLingua-2 proposes a data distillation method for prompt compression in LLMs, which retains the essential information while reducing prompt length to mitigate computational and financial burdens.

  • The approach involves generating an extractive text compression dataset via GPT-4, formulating prompt compression as a token classification problem in which each token is labeled as preserve or discard, and training a Transformer encoder on this dataset.

  • The model outperforms existing baselines in both in-domain and out-of-domain evaluations, demonstrating improved efficiency, reduced latency, and lower GPU memory usage, thus showing promise for various real-world applications.

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

Abstract

This document presents an overview and in-depth analysis of "LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression." The paper addresses the challenges of prompt compression in the context of LLMs, presenting LLMLingua-2, a novel approach designed to streamline and enhance the efficiency of prompt compression while ensuring the retention of critical information.

Introduction

Recent advancements in prompting techniques for LLMs, such as Chain-of-Thought (CoT) prompting, In-Context Learning (ICL), and Retrieval-Augmented Generation (RAG), have significantly extended the capabilities of these models. However, lengthy prompts, though rich in information, pose substantial computational and financial burdens. Prompt compression aims to mitigate these issues by reducing prompt length without compromising essential information.

LLMLingua-2 departs from existing methods relying on information entropy by introducing a data distillation procedure that leverages knowledge from an LLM to achieve more efficient and faithful prompt compression. This method is task-agnostic, enhancing generalizability and efficiency across various applications.

Methodology

Data Distillation Procedure

LLMLingua-2's data distillation procedure involves using GPT-4 to generate a text compression dataset composed of original and compressed text pairs. This dataset is constructed by prompting GPT-4 to compress texts according to specific instructions focused on retaining crucial information while eliminating redundancy. The prompt compression task is reframed as a token classification problem, allowing a Transformer encoder to leverage bidirectional context for optimal compression.
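As a rough illustration of this distillation step, the sketch below asks GPT-4 for an extractive compression of each text and stores the resulting (original, compressed) pairs. It assumes the official OpenAI Python client; the instruction string paraphrases the paper's description rather than reproducing the authors' exact prompt, and helper names such as distill_pair are illustrative.

```python
# Sketch of the data-distillation step: ask GPT-4 to compress each original
# text by dropping non-essential words only, then store (original, compressed)
# pairs. The instruction below paraphrases the paper's description and is not
# the exact prompt used by the authors.
import json
from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()

COMPRESSION_INSTRUCTION = (
    "Compress the given text by removing as many non-essential words as "
    "possible, without reordering or rewriting the remaining words, so that "
    "the key information is preserved."
)

def distill_pair(original_text: str, model: str = "gpt-4") -> dict:
    """Return one (original, compressed) training pair."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.3,
        messages=[
            {"role": "system", "content": COMPRESSION_INSTRUCTION},
            {"role": "user", "content": original_text},
        ],
    )
    compressed = response.choices[0].message.content.strip()
    return {"original": original_text, "compressed": compressed}

# Example: build a small extractive-compression dataset from raw text chunks.
if __name__ == "__main__":
    texts = ["<a MeetingBank transcript chunk>", "<another chunk>"]
    with open("compression_pairs.jsonl", "w") as f:
        for t in texts:
            f.write(json.dumps(distill_pair(t)) + "\n")
```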

Extractive Text Compression Dataset

The dataset comprises original texts from MeetingBank and their compressed counterparts, annotated to indicate whether each token should be preserved or discarded. Quality control metrics, such as Variation Rate (VR) and Alignment Gap (AG), ensure the fidelity and effectiveness of the annotation process.
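The sketch below shows one plausible way to turn such a pair into word-level preserve/discard labels and to apply a Variation Rate check. The greedy, order-preserving matching and the VR definition used here (the fraction of compressed words absent from the original) follow a simple reading of the paper; the authors' exact alignment rules and thresholds may differ, and the Alignment Gap metric is not implemented.

```python
# Sketch of word-level annotation and one quality-control check. Labels mark
# whether each original word survives in the GPT-4 compression (greedy,
# order-preserving matching). Pairs with a high Variation Rate (compressed
# words that never occur in the original) can be discarded as unfaithful.
from typing import List, Tuple

def annotate(original: str, compressed: str) -> Tuple[List[str], List[int]]:
    """Label each original word 1 (preserve) or 0 (discard)."""
    orig_words = original.split()
    comp_words = compressed.split()
    labels = [0] * len(orig_words)
    j = 0  # pointer into the compressed text
    for i, w in enumerate(orig_words):
        if j < len(comp_words) and w.lower() == comp_words[j].lower():
            labels[i] = 1
            j += 1
    return orig_words, labels

def variation_rate(original: str, compressed: str) -> float:
    """Fraction of compressed words absent from the original text."""
    orig_vocab = {w.lower() for w in original.split()}
    comp_words = [w.lower() for w in compressed.split()]
    if not comp_words:
        return 0.0
    novel = sum(1 for w in comp_words if w not in orig_vocab)
    return novel / len(comp_words)

words, labels = annotate(
    "the committee approved the budget after a long discussion",
    "committee approved budget after discussion",
)
print(list(zip(words, labels)))  # preserved words receive label 1
print(variation_rate("the committee approved the budget",
                     "committee approved budget"))  # 0.0: fully extractive
```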

Model Architecture and Training

The token classification model employs a Transformer encoder as the feature extractor, followed by a linear classification layer to predict token retention probabilities. The model is trained on the MeetingBank compression dataset using cross-entropy loss. Crucially, this approach guarantees the faithfulness of the compressed prompts by maintaining the original token sequence and leveraging bidirectional context.
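A minimal sketch of this setup, assuming Hugging Face Transformers and PyTorch, is shown below. The checkpoint name, learning rate, and word-to-token label alignment are illustrative assumptions rather than the authors' released configuration; at inference time, compression keeps the highest-scoring words in their original order.

```python
# Minimal sketch of the token-classification compressor, assuming Hugging Face
# Transformers. This is not the authors' released code: the checkpoint,
# hyperparameters, and word-to-token label propagation are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "xlm-roberta-large"  # the paper also reports a multilingual-BERT variant
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)

def training_step(words, word_labels, optimizer):
    """One cross-entropy update on a single annotated example."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    return_tensors="pt")
    # Propagate each word's preserve/discard label to its sub-tokens;
    # special tokens get -100 so the loss ignores them.
    token_labels = [
        -100 if wid is None else word_labels[wid]
        for wid in enc.word_ids(batch_index=0)
    ]
    out = model(**enc, labels=torch.tensor([token_labels]))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

@torch.no_grad()
def compress(text: str, keep_ratio: float = 0.5) -> str:
    """Keep the highest-probability words, preserving their original order."""
    words = text.split()
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    return_tensors="pt")
    probs = model(**enc).logits.softmax(-1)[0, :, 1]  # P(preserve) per token
    # Score each word by the mean probability of its sub-tokens.
    word_scores = [[] for _ in words]
    for pos, wid in enumerate(enc.word_ids(batch_index=0)):
        if wid is not None:
            word_scores[wid].append(probs[pos].item())
    scores = [sum(s) / len(s) if s else 0.0 for s in word_scores]
    k = max(1, int(len(words) * keep_ratio))
    keep = set(sorted(range(len(words)), key=lambda i: -scores[i])[:k])
    return " ".join(w for i, w in enumerate(words) if i in keep)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# training_step(words, labels, optimizer)      # pairs from the annotation step
# print(compress("a long prompt ...", keep_ratio=0.3))  # ~3x compression
```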

Results

In-Domain Evaluation

The model's performance was evaluated on both QA and summarization tasks within the MeetingBank dataset. LLMLingua-2 demonstrated significant improvements over existing baselines, including Selective-Context and the original LLMLingua. Notably, despite using a much smaller compressor than the LLaMA-2-7B underlying these baselines, LLMLingua-2 outperformed them on QA F1 scores and on summarization metrics such as BLEU, ROUGE, and BERTScore.

Out-of-Domain Evaluation

The robustness of LLMLingua-2 was further tested on long-context datasets such as LongBench and ZeroSCROLLS, as well as reasoning benchmarks like GSM8K and BBH. The results underscored LLMLingua-2's superior generalizability, achieving higher performance compared to task-agnostic baselines. Even the smaller LLMLingua-2 model (based on multilingual-BERT) surpassed the performance of LLaMA-2-7B-based models.

Efficiency and Latency

LLMLingua-2's small model size and efficient design yield significant reductions in latency and GPU memory usage. The model speeds up end-to-end inference by 1.6x to 2.9x, offering a compelling advantage in practical deployments. Additionally, its peak GPU memory usage is considerably lower than that of comparative models, further enhancing its applicability in resource-constrained environments.

Implications and Future Directions

LLMLingua-2's approach to prompt compression represents a significant stride in improving the efficiency and reliability of LLM applications. The task-agnostic nature of the model ensures broad applicability, while the data distillation procedure guarantees high-quality compression without sacrificing essential information. Future research could explore extending the dataset to cover a wider range of domains, enhancing the model's generalizability further.

Overall, LLMLingua-2 sets a new standard for prompt compression in LLMs, balancing efficiency and fidelity to meet the demands of diverse real-world applications. The model's integration with existing compression frameworks and potential for expansion points to a promising trajectory for ongoing advancements in AI-driven language processing.
