Learning to Compress Prompt in Natural Language Formats (2402.18700v2)
Abstract: Large language models (LLMs) excel at a wide range of natural language processing tasks, but their abilities are constrained by degraded performance on long contexts, slow inference, and high computational cost. Deploying LLMs with precise and informative context helps users process large-scale datasets more effectively and cost-efficiently. Existing works compress long prompt contexts into soft prompts; however, soft prompt compression transfers poorly across different LLMs, especially API-based LLMs. To this end, this work aims to compress lengthy prompts into natural language that remains transferable across LLMs. This poses two challenges: (i) natural language (NL) prompts are incompatible with back-propagation, and (ii) NL prompts lack flexibility in imposing length constraints. We propose the Natural Language Prompt Encapsulation (Nano-Capsulator) framework, which compresses original prompts into NL-formatted Capsule Prompts while maintaining prompt utility and transferability. To tackle the first challenge, Nano-Capsulator is optimized with a reward function that interacts with a proposed semantics-preserving loss; to address the second, the reward function incorporates explicit length constraints. Experimental results demonstrate that Capsule Prompts reduce the original prompt length by 81.4%, decrease inference latency by up to 4.5x, and save 80.1% of budget overheads, while transferring across diverse LLMs and datasets.
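The abstract describes optimizing the compressor with a reward that couples a semantics-preserving objective with a hard length constraint. As a rough illustration only (the paper's actual formulation is not reproduced in this excerpt), the sketch below combines an answer-similarity term with a length-overflow penalty; the function name, length budget, penalty weight, and fallback similarity measure are all assumptions for illustration.

```python
# Hypothetical sketch of a length-constrained reward for prompt compression.
# The exact reward and semantics-preserving loss of Nano-Capsulator are not
# given in this excerpt; the budget, weight, and similarity below are assumed.

def capsule_reward(
    answer_with_original: str,
    answer_with_capsule: str,
    capsule_len: int,
    max_len: int = 256,           # assumed length budget (tokens)
    length_penalty: float = 1.0,  # assumed penalty weight
    similarity=None,              # optional text-similarity scorer in [0, 1]
) -> float:
    """Reward a compressed (Capsule) prompt that preserves the downstream
    answer while staying under a hard length budget."""
    if similarity is None:
        # Fallback: crude token-overlap (Jaccard) similarity, for illustration only.
        a = set(answer_with_original.lower().split())
        b = set(answer_with_capsule.lower().split())
        sim = len(a & b) / max(len(a | b), 1)
    else:
        sim = similarity(answer_with_original, answer_with_capsule)

    # Penalize capsules that exceed the length constraint.
    overflow = max(capsule_len - max_len, 0)
    return sim - length_penalty * overflow / max_len


# Usage example: score a short capsule against the answer it induces.
reward = capsule_reward(
    answer_with_original="The answer is 42 because of the context.",
    answer_with_capsule="The answer is 42.",
    capsule_len=180,
)
print(f"reward = {reward:.3f}")
```

The design intuition, as stated in the abstract, is that the reward must be usable without back-propagating through the NL prompt itself, so it operates only on observable quantities: the downstream answers and the capsule length.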