Abstract

In long context scenarios, LLMs face three main challenges: higher computational/financial cost, longer latency, and inferior performance. Some studies reveal that the performance of LLMs depends on both the density and the position of the key (question-relevant) information in the input prompt. Inspired by these findings, we propose LongLLMLingua, a prompt compression method that improves LLMs' perception of the key information in order to address all three challenges simultaneously. We conduct evaluation on a wide range of long context scenarios including single-/multi-document QA, few-shot learning, summarization, synthetic tasks, and code completion. The experimental results show that prompts compressed by LongLLMLingua achieve higher performance at much lower cost, while also reducing end-to-end latency. For example, on the NaturalQuestions benchmark, LongLLMLingua gains a performance boost of up to 17.1% over the original prompt with ~4x fewer tokens as input to GPT-3.5-Turbo. It yields cost savings of $28.5 and $27.4 per 1,000 samples on the LongBench and ZeroScrolls benchmarks, respectively. Additionally, when compressing prompts of ~10k tokens at compression rates of 2x-10x, LongLLMLingua can speed up end-to-end inference by 1.4x-3.8x. Our code is available at https://aka.ms/LLMLingua.

LLMs' performance drops with noisy prompts; it can be improved by question-aware compression and document reordering strategies.

Overview

  • LongLLMLingua introduces a prompt compression technique to address the challenges LLMs face when processing extensive contexts, thereby reducing computational and financial costs and improving performance.

  • The methodology includes question-aware coarse-to-fine compression, a document reordering mechanism, dynamic compression ratios, and post-compression subsequence recovery to maintain information integrity.

  • Experimental results demonstrated performance improvements, substantial cost savings, and latency reductions, making LongLLMLingua suitable for cost-sensitive and latency-critical applications involving long context scenarios.

LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression

Introduction

The paper "LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression" addresses significant challenges faced by LLMs when processing extensive context. These challenges include higher computational and financial costs, longer latency, and degraded performance. Previous studies have noted that the effectiveness of LLMs is influenced by the density and position of key information within the input. Building on these insights, LongLLMLingua proposes a technique for prompt compression to enhance the perception of key information by LLMs, thus alleviating the identified challenges.

Key Contributions

The paper makes several notable contributions:

Question-Aware Coarse-to-Fine Compression:

  • The authors introduce a question-aware coarse-to-fine compression method. It first applies a high-level, document-level coarse compression and then a fine-grained token-level compression, effectively concentrating the key information relevant to the question.

Document Reordering Mechanism:

  • A document reordering mechanism is proposed to mitigate information loss that frequently occurs when relevant information is placed in the middle of long contexts. By reordering documents based on their relevance scores, derived via coarse-grained compression, the key information is placed at positions where LLMs can more effectively process it.

Dynamic Compression Ratios:

  • To better control the level of compression applied to different documents, the authors present dynamic compression ratios. This allows for adaptive granular control during fine-grained compression, ensuring that more relevant documents retain a higher amount of original content.
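As an illustration of what such adaptive budgeting could look like, here is a hypothetical linear scheduler; the schedule and parameter names (`base_ratio`, `delta`) are ours, not the paper's exact formulation:

```python
def dynamic_keep_ratios(num_docs: int, base_ratio: float = 0.5, delta: float = 0.15) -> list[float]:
    """Hypothetical linear scheduler: the top-ranked document keeps base_ratio + delta
    of its tokens, the lowest-ranked keeps base_ratio - delta, interpolating in between."""
    if num_docs <= 1:
        return [min(1.0, base_ratio + delta)] * num_docs
    ratios = []
    for rank in range(num_docs):                 # rank 0 = most relevant document
        frac = rank / (num_docs - 1)
        r = base_ratio + delta * (1.0 - 2.0 * frac)
        ratios.append(max(0.0, min(1.0, r)))     # clip to a valid keep ratio
    return ratios

# Example: dynamic_keep_ratios(4) -> approximately [0.65, 0.55, 0.45, 0.35]
```

Each ratio would then serve as the keep budget for the corresponding document during fine-grained compression, so more relevant documents retain more of their original content.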

Post-Compression Subsequence Recovery:

  • A subsequence recovery strategy is proposed to restore the integrity of information that may have been compromised during compression. This ensures that key entities and other critical details are accurately preserved in the compressed prompt.

Methodology

Problem Formulation and Approach

The problem is framed as an optimization problem: compress a given prompt, subject to a token budget, so that the distribution of the target LLM's output stays as close as possible to the distribution obtained from the original prompt. The process incorporates both token-level subsequence selection and document reordering.
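In spirit (the notation below is ours, not taken verbatim from the paper), the objective can be stated as:

```latex
\min_{\tilde{x}} \; D\big(P(y \mid \tilde{x}),\, P(y \mid x)\big)
\quad \text{s.t.} \quad \lVert \tilde{x} \rVert_{0} \le \tau \, \lVert x \rVert_{0}
```

where x is the original prompt, x̃ the compressed prompt, D a distance between output distributions (e.g., KL divergence), and τ ∈ (0, 1) the target compression rate.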

Coarse-Grained Compression:

The authors calculate a relevance score r_k for each document in the prompt using question-conditioned perplexities: the lower the perplexity of the question given a document, the more relevant that document is judged to be. Irrelevant documents are discarded to reduce noise in the compressed prompt.
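A minimal sketch of this scoring idea, assuming a small open causal LM (`gpt2` here is only a stand-in for the compressor model) and ignoring context-window limits; the helper names are ours, not LLMLingua's API:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # small stand-in compressor LM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def question_conditioned_ppl(document: str, question: str) -> float:
    """Perplexity of the question given the document as context; lower = more relevant."""
    doc_ids = tokenizer(document, return_tensors="pt").input_ids
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    input_ids = torch.cat([doc_ids, q_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits
    q_len = q_ids.shape[-1]
    preds = logits[0, -q_len - 1:-1, :]      # next-token distributions for the question tokens
    targets = input_ids[0, -q_len:]
    nll = F.cross_entropy(preds, targets)    # mean negative log-likelihood over question tokens
    return torch.exp(nll).item()

def rank_documents(documents: list[str], question: str) -> list[int]:
    # Indices of documents sorted from most to least relevant (lowest perplexity first).
    scores = [question_conditioned_ppl(d, question) for d in documents]
    return sorted(range(len(documents)), key=lambda k: scores[k])
```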

Fine-Grained Compression:

The token-level importance within the documents retained after coarse compression is calculated using contrastive perplexity, i.e., the drop in a token's perplexity when the question is prepended to the context. Tokens whose likelihood rises sharply once the question is added are considered question-relevant and retained, further compressing the prompt.
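A hedged, self-contained sketch of this token filtering, again using `gpt2` as a stand-in compressor model; the function names and the simple top-k keep rule are ours:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def per_token_nll(text: str, prefix: str = "") -> torch.Tensor:
    """Per-token negative log-likelihood of `text`, optionally conditioned on `prefix`."""
    ctx = tokenizer.bos_token + prefix                 # BOS gives the first token a context
    ctx_ids = tokenizer(ctx, return_tensors="pt").input_ids
    text_ids = tokenizer(text, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, text_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits
    n = text_ids.shape[-1]
    preds = logits[0, -n - 1:-1, :]                    # distributions predicting the text tokens
    targets = input_ids[0, -n:]
    return F.cross_entropy(preds, targets, reduction="none")

def keep_question_relevant_tokens(text: str, question: str, keep_ratio: float = 0.5) -> str:
    # Contrastive score: how much easier each token becomes to predict once the
    # question is prepended. High scores mark question-relevant tokens to keep.
    scores = per_token_nll(text) - per_token_nll(text, prefix=question)
    k = max(1, int(keep_ratio * scores.numel()))
    keep = set(scores.topk(k).indices.tolist())
    ids = tokenizer(text).input_ids
    return tokenizer.decode([t for i, t in enumerate(ids) if i in keep])
```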

Document Reordering:

Documents are reordered in descending order of their relevance scores so that the most pertinent information sits toward the beginning of the prompt, where LLMs process it most effectively.
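A minimal sketch of this step, assuming higher scores mean more relevant (if the scores are raw perplexities, sort ascending instead):

```python
def reorder_documents(documents: list[str], relevance: list[float]) -> list[str]:
    # Place the highest-scoring documents first, where long-context LLMs attend to
    # them most reliably, mitigating the "lost in the middle" effect.
    order = sorted(range(len(documents)), key=lambda k: relevance[k], reverse=True)
    return [documents[k] for k in order]
```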

Subsequence Recovery:

During response generation, a token-level subsequence recovery method is employed to correct potential distortions in key information caused by token removal, thus improving the accuracy and reliability of the LLM’s output.
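As an illustration only, the sketch below uses difflib to align spans of the response with the compressed prompt and then expands them to the corresponding spans of the original prompt; the position bookkeeping (`orig_pos`) is our own simplification, not the paper's exact procedure:

```python
import difflib

def recover_subsequences(response_tokens: list[str],
                         compressed_tokens: list[str],
                         orig_pos: list[int],
                         original_tokens: list[str]) -> list[str]:
    """Hedged sketch of subsequence recovery. orig_pos[i] is the position in the
    original prompt of the i-th token kept in the compressed prompt (recorded as a
    by-product of compression in this simplified setup)."""
    matcher = difflib.SequenceMatcher(None, response_tokens, compressed_tokens)
    out, cursor = [], 0
    for a, b, size in matcher.get_matching_blocks():
        if size == 0:
            continue
        out.extend(response_tokens[cursor:a])           # keep non-matching response text
        start, end = orig_pos[b], orig_pos[b + size - 1]
        out.extend(original_tokens[start:end + 1])      # expand to the original span
        cursor = a + size
    out.extend(response_tokens[cursor:])
    return out
```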

Experimental Results

The efficacy of LongLLMLingua was evaluated on several benchmarks encompassing various long context scenarios, including single- and multi-document QA, few-shot learning, summarization, synthetic tasks, and code completion. Key results include:

Performance Improvement:

On the NaturalQuestions benchmark, LongLLMLingua achieved performance gains of up to 17.1% over the original prompt with approximately 4x fewer input tokens to GPT-3.5-Turbo.

Cost Savings:

LongLLMLingua demonstrated substantial financial savings, reducing inference costs by $28.5 and $27.4 per 1,000 samples on the LongBench and ZeroScrolls benchmarks, respectively.

Latency Reduction:

When prompts of approximately 10,000 tokens were compressed at rates between 2x and 10x, end-to-end latency was reduced by 1.4x to 3.8x.

Implications and Future Work

Practical Implications

LongLLMLingua has practical implications for efficiently deploying LLMs in cost-sensitive and latency-critical applications, particularly those involving long context scenarios such as extensive document retrieval, legal text analysis, and scientific literature summarization.

Theoretical Implications

The work provides insights into the importance of information structuring within prompts and suggests further exploration into optimizing information retrieval and alignment techniques. The proposed methods could be extended to other domains, such as employing different kinds of conditioning for relevance score calculations.

Future Directions

Future research may focus on integrating LongLLMLingua with other LLM frameworks to further improve its applicability and efficiency. Additionally, the development of more sophisticated relevance metrics and advanced sequence recovery techniques could enhance performance further.

Conclusion

LongLLMLingua provides a sophisticated approach to managing long contexts in LLMs, addressing both efficiency and performance issues through innovative prompt compression techniques. The experimental results affirm the method's efficacy, and its broader applicability suggests significant potential for optimizing LLM performance in various real-world applications.
