One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models

Published 30 May 2024 in cs.CL | (2405.19670v4)

Abstract: Retrieval-augmented generation (RAG) is a promising way to improve LLMs for generating more factual, accurate, and up-to-date content. Existing methods either optimize prompts to guide LLMs in leveraging retrieved information or directly fine-tune LLMs to adapt to RAG scenarios. Although fine-tuning can yield better performance, it often compromises the LLMs' general generation capabilities by modifying their parameters. This limitation poses challenges in practical applications, especially when LLMs are already deployed, as parameter adjustments may affect their original functionality. To address this, we propose a novel method that involves learning scalable and pluggable virtual tokens for RAG. By maintaining the LLMs' original parameters and fine-tuning only the embeddings of these pluggable tokens, our approach not only enhances LLMs' performance but also preserves their general generation capabilities. Furthermore, we design several training strategies to improve the scalability, flexibility, and generalizability of our method. Comprehensive experiments across 12 question-answering tasks demonstrate the superiority of our approach.




Summary

The paper proposes SPRING, a novel approach that inserts a few trainable virtual tokens between retrieved documents and user queries to enhance RAG performance.
It reports average gains of +33% EM and +12% F1 over prompt-based methods on QA benchmarks with Mistral-7b, while preserving performance on non-RAG tasks.
The scalable, plug-and-play design updates only the virtual token embeddings, ensuring flexibility and preserving the original LLM's capabilities.

This paper introduces SPRING (Scalable and Pluggable virtual Tokens for Retrieval-augmented Generation), a novel method to enhance the performance of LLMs in Retrieval-Augmented Generation (RAG) scenarios without compromising their general capabilities (2405.19670). Current RAG approaches either use prompt engineering, which can be suboptimal, or fine-tune the LLM (e.g., using LoRA), which improves RAG performance but often degrades performance on non-RAG tasks by altering the model's parameters.

SPRING addresses this by introducing a small number of trainable "virtual tokens" into the input sequence. Specifically, these virtual tokens are inserted between the retrieved documents ($R$) and the user query ($Q$). During training, only the embeddings of these virtual tokens ($\delta$) are updated, while the parameters of the backbone LLM ($\theta$) remain frozen. This makes the method highly parameter-efficient (e.g., adding 50 tokens to Mistral-7b adds only about 0.2M trainable parameters). The input format becomes $[R; t_1, t_2, \dots, t_n; Q]$, where $t_i$ are the virtual tokens.
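The 0.2M figure can be checked with simple arithmetic: each virtual token contributes one embedding vector, so the trainable parameter count is the number of tokens times the model's hidden size. The sketch below assumes a hidden size of 4096 for Mistral-7b; the function name is hypothetical and used only for illustration.

```python
def virtual_token_params(num_tokens: int, hidden_size: int) -> int:
    """Trainable parameters when only the virtual-token embeddings are updated."""
    return num_tokens * hidden_size


# Assumption: Mistral-7b uses a hidden (embedding) size of 4096.
MISTRAL_7B_HIDDEN_SIZE = 4096

n_params = virtual_token_params(50, MISTRAL_7B_HIDDEN_SIZE)
print(f"{n_params} trainable parameters (~{n_params / 1e6:.1f}M)")
```

Under that assumption, 50 tokens give 204,800 parameters, which rounds to the 0.2M quoted above.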

Key features of SPRING include:

Scalability: A unique training strategy is proposed where, for each training sample, a random number $k$ (less than or equal to the total number of virtual tokens $n$) is chosen, and only the first $k$ virtual tokens ($t_{1:k}$) are used. This allows the trained tokens to be effective even when only a subset is used during inference, enabling dynamic adjustment based on context length constraints or desired performance trade-offs. Experiments show that even a single virtual token ($k=1$) can significantly improve RAG performance.
Pluggability: Since the base LLM's parameters are untouched, the learned virtual tokens act as a plug-and-play module. For RAG tasks, the tokens (represented as special tokens like [r1], [r2], etc., added to the vocabulary with the learned embeddings) are included in the input. For non-RAG tasks, they are simply omitted, preserving the LLM's original performance on general tasks.
Effectiveness: Experiments conducted on nine Question Answering (QA) datasets (including TQA, NQ, HQA, SQuAD, PopQA) using various LLMs (Mistral-7b, LLaMA-2-7b, Phi-3-4b, Qwen-1.8b) demonstrate significant improvements in RAG performance (average +33% EM, +12% F1 over prompt-based methods for Mistral-7b). LoRA achieves slightly higher RAG scores, but it drastically degrades performance on non-RAG and general capability benchmarks (BoolQ, CommonsenseQA, GSM8K, MMLU), whereas SPRING preserves the original LLM's performance.
Generalizability: SPRING shows robustness across different retrievers (BM25, BGE-base, E5-base, E5-large) and varying numbers of retrieved passages. Training is performed by mixing data from multiple QA datasets and randomly varying the number of retrieved passages used ($m \in [1,5]$), enhancing adaptability. The method also generalizes well to unseen datasets (PopQA).
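The scalability, pluggability, and passage-count strategies described above can be illustrated with a minimal sketch. This is not the authors' implementation: the newline-joined prompt layout, the uniform sampling of $k$ and $m$, and all function names are assumptions made for illustration. Only the token naming ([r1], [r2], ...), the use of the first $k$ tokens, and the range $m \in [1,5]$ come from the description above.

```python
import random


def virtual_token_names(n):
    """Special-token strings [r1] ... [rn], as named in the summary."""
    return [f"[r{i}]" for i in range(1, n + 1)]


def build_prompt(passages, query, k, n=50):
    """Lay out the input as [R; t_1..t_k; Q].

    k=0 omits the virtual tokens entirely, which corresponds to using
    the unmodified LLM for non-RAG tasks (pluggability).
    The newline separators are an assumption, not the paper's template.
    """
    if not 0 <= k <= n:
        raise ValueError("k must be in [0, n]")
    parts = []
    if passages:
        parts.append("\n".join(passages))
    if k > 0:
        # Always take a prefix of the token list, never an arbitrary subset.
        parts.append("".join(virtual_token_names(n)[:k]))
    parts.append(query)
    return "\n".join(parts)


def sample_training_config(n=50, max_passages=5, rng=random):
    """Per-sample choice of k (tokens used) and m (passages used).

    Uniform sampling is an assumption; the summary only says "random".
    """
    k = rng.randint(1, n)             # inclusive on both ends
    m = rng.randint(1, max_passages)  # m in [1, 5] by default
    return k, m


rng = random.Random(0)
k, m = sample_training_config(rng=rng)
passages = [f"passage {i}" for i in range(1, 6)][:m]
print(build_prompt(passages, "Who wrote Hamlet?", k))
```

At inference time, the same `build_prompt` helper could be called with a smaller `k` when the context budget is tight, or with `k=0` and no passages to fall back to the original model's behavior.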

The authors position SPRING as a lightweight, efficient, and practical solution for enhancing deployed LLMs with RAG capabilities without disrupting their existing functionalities. The code and trained virtual tokens are made publicly available.
