
ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities

(2407.14482)
Published Jul 19, 2024 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract

In this work, we introduce ChatQA 2, a Llama3-based model designed to bridge the gap between open-access LLMs and leading proprietary models (e.g., GPT-4-Turbo) in long-context understanding and retrieval-augmented generation (RAG) capabilities. These two capabilities are essential for LLMs to process large volumes of information that cannot fit into a single prompt and are complementary to each other, depending on the downstream tasks and computational budgets. We present a detailed continued training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens, along with a three-stage instruction tuning process to enhance the model's instruction-following, RAG performance, and long-context understanding capabilities. Our results demonstrate that the Llama3-ChatQA-2-70B model achieves accuracy comparable to GPT-4-Turbo-2024-0409 on many long-context understanding tasks and surpasses it on the RAG benchmark. Interestingly, we find that the state-of-the-art long-context retriever can alleviate the top-k context fragmentation issue in RAG, further improving RAG-based results for long-context understanding tasks. We also provide extensive comparisons between RAG and long-context solutions using state-of-the-art long-context LLMs.

Figure: Evaluating Llama3's ability to identify rare items within a large dataset.

Overview

  • The paper introduces ChatQA 2, a model designed to close the performance gap between open-access LLMs and leading proprietary models, focusing on long-context understanding and retrieval-augmented generation (RAG) capabilities.

  • Key advancements include extending the context window of the Llama3-70B model from 8K to 128K tokens and a comprehensive three-stage instruction-tuning process to enhance instruction-following, RAG performance, and long-context understanding.

  • Evaluation results show that ChatQA 2 achieves comparable or superior performance to proprietary models like GPT-4-Turbo across various long-context and RAG benchmarks.

An Overview of ChatQA 2: Advancements in Long-Context and Retrieval-Augmented Capabilities

The paper "ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities" by Xu et al. introduces a new model, ChatQA 2, designed to close the performance gap between open-access LLMs and leading proprietary models such as GPT-4-Turbo. This paper delineates the methodology and results associated with improving long-context understanding and retrieval-augmented generation (RAG) capabilities, essential for processing large volumes of information and adapting to diverse downstream tasks and computational budgets.

Key Contributions

The authors make the following significant contributions:

  1. Context Window Extension: The authors extend the context window of the Llama3-70B base model from 8K to 128K tokens. This is achieved through continued pretraining on a corpus derived from SlimPajama in which long sequences are upsampled, yielding 10 billion training tokens at a sequence length of 128K.
  2. Three-Stage Instruction Tuning: The paper details a comprehensive three-stage instruction-tuning process aimed at enhancing the model's instruction-following abilities, RAG performance, and long-context understanding capabilities.
  3. Performance Evaluation: The resulting Llama3-ChatQA-2-70B model demonstrates accuracy comparable to GPT-4-Turbo on many long-context understanding tasks and surpasses it on RAG benchmarks.
  4. RAG and Long-Context Solutions: Extensive comparisons are made between RAG and long-context solutions using state-of-the-art long-context LLMs, showcasing the complementary nature of these techniques.

Methodology

Extending Context Window to 128K

The overall recipe is a two-step approach: continued pretraining to extend the context window, followed by instruction tuning. In the first step, the context window of the Llama3-70B base model is extended from 8K to 128K tokens through continued pretraining on a dataset derived from SlimPajama in which long sequences are upsampled. The RoPE base frequency is increased substantially to accommodate the extended context window, and training uses a learning rate of 3e-5 for 2,000 steps.
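
A minimal sketch of how this kind of context-window extension is typically configured in a Hugging Face-style setup is shown below. The enlarged RoPE base value, the config field choices, and the surrounding training setup are illustrative assumptions, not the paper's exact settings or code.

```python
# Sketch: raise the RoPE base frequency and position limit of a Llama-style model
# before continued pretraining on long sequences. Values are illustrative only.
from transformers import AutoConfig, AutoModelForCausalLM

BASE_MODEL = "meta-llama/Meta-Llama-3-70B"  # base model used in the paper

config = AutoConfig.from_pretrained(BASE_MODEL)
config.rope_theta = 150_000_000           # assumed enlarged RoPE base frequency
config.max_position_embeddings = 131_072  # 128K-token context window

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, config=config)

# Continued pretraining would then run over the upsampled long-sequence data
# (SlimPajama-derived, per the paper) with a small learning rate such as 3e-5.
```

Raising the RoPE base alone only rescales positional encodings; it is the continued pretraining on long sequences that teaches the model to actually use the larger window.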

Instruction Tuning

The authors present a three-stage instruction-tuning process (a schematic sketch follows this list):

  1. Initial Training: The model is first fine-tuned on a set of 128K high-quality instruction-following samples.
  2. Conversational QA Data: The model is then trained on conversational QA data with provided context to improve its RAG capabilities.
  3. Long SFT Dataset: Finally, to enhance performance on long-context sequences of up to 128K tokens, the model is fine-tuned on a long SFT dataset that combines existing long-context datasets with synthetic samples built from documents assembled from NarrativeQA.
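
The staged schedule can be pictured as sequential fine-tuning runs, each resuming from the previous stage's checkpoint with a different data mixture. The sketch below is schematic: the dataset names, the per-stage sequence lengths, and the `fine_tune` stub are illustrative placeholders, not the paper's code.

```python
# Schematic of three-stage instruction tuning: each stage resumes from the
# previous checkpoint and swaps in a new data mixture. All names are placeholders.

def fine_tune(checkpoint: str, dataset: str, max_seq_len: int) -> str:
    """Stand-in for a real SFT run; returns the name of the resulting checkpoint."""
    print(f"SFT: {checkpoint} on {dataset} (max_seq_len={max_seq_len})")
    return f"{checkpoint}+{dataset}"

STAGES = [
    ("stage1_instruction_following", 8_192),            # general instruction-following SFT
    ("stage2_conversational_qa_with_context", 8_192),   # context-grounded QA for RAG
    ("stage3_long_sft_mixture", 131_072),                # long-context SFT up to 128K tokens
]

checkpoint = "llama3-70b-base-128k"  # model after continued long-context pretraining
for dataset_name, max_seq_len in STAGES:
    checkpoint = fine_tune(checkpoint, dataset_name, max_seq_len)
```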

Long-Context Retriever Integration

To address limitations of the existing RAG pipeline, such as context fragmentation and inefficiency when many small top-k chunks are retrieved, the authors integrate a state-of-the-art long-context retriever based on the E5-Mistral embedding model. They demonstrate that this integration alleviates many of these issues and further improves RAG-based results on long-context understanding tasks.
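
The general pattern, chunking a long document, embedding the chunks and the query, and keeping only the top-k most similar chunks as context, can be sketched as follows. The model identifier, chunk size, and top-k value are assumptions for illustration, not the paper's exact retrieval configuration.

```python
# Sketch of top-k chunk retrieval with a long-context embedding model: chunk the
# document, embed chunks and query, and keep the k most similar chunks as context.
# The model id, chunk size, and k are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 1200) -> list[str]:
    """Split a long document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

retriever = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

document = open("long_report.txt").read()  # hypothetical long input document
query = "What were the main findings described in the report?"

chunks = chunk(document)
chunk_emb = retriever.encode(chunks, normalize_embeddings=True)
query_emb = retriever.encode([query], normalize_embeddings=True)

scores = (chunk_emb @ query_emb.T).ravel()   # cosine similarity (embeddings are normalized)
top_k = np.argsort(-scores)[:5]              # indices of the 5 most relevant chunks
context = "\n\n".join(chunks[i] for i in sorted(top_k))  # keep original document order
# `context` is then placed in the prompt of the long-context LLM.
```

Retrieving fewer, larger chunks with a long-context retriever reduces how often related passages are split across retrieval boundaries, which is the fragmentation issue the paper targets.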

Evaluation and Results

The model is evaluated on various benchmarks covering short, medium, and long context tasks:

  1. Needle in a Haystack Test: The model achieves perfect accuracy on this synthetic retrieval test, confirming its strong long-context retrieval capabilities (a sketch of how such a test is constructed follows this list).
  2. InfiniteBench (Over 100K Tokens): The model shows competitive performance against state-of-the-art models, achieving an average score of 34.11.
  3. Medium-Long Context Benchmarks (Within 32K Tokens): The model scores 47.37, outperforming several comparable models but slightly trailing GPT-4-Turbo-2024-04-09 and Qwen2-72B-Instruct on certain tasks.
  4. ChatRAG Bench (Within 4K Tokens): The model achieves an average score of 54.81, showing that it remains competitive with other state-of-the-art models on shorter-context conversational QA tasks.
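
For reference, a needle-in-a-haystack evaluation is typically built by burying a short "needle" fact at varying depths inside long filler text and asking the model to recall it. The sketch below illustrates that construction; the needle, filler text, and depths are made up for the example and are not the benchmark's actual content.

```python
# Sketch of needle-in-a-haystack prompt construction: hide a short fact at a
# chosen depth inside long filler text, then ask the model to recall it.
FILLER = "The grass is green. The sky is blue. The sun is bright. " * 4000
NEEDLE = "The secret passphrase is 'violet meridian'."
QUESTION = "What is the secret passphrase mentioned in the document?"

def build_prompt(depth: float) -> str:
    """Insert the needle at `depth` (0.0 = start, 1.0 = end) of the filler text."""
    cut = int(len(FILLER) * depth)
    haystack = FILLER[:cut] + NEEDLE + " " + FILLER[cut:]
    return f"{haystack}\n\nQuestion: {QUESTION}\nAnswer:"

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_prompt(depth)
    # response = long_context_llm.generate(prompt)  # model under test (placeholder)
    # score 1 if 'violet meridian' appears in the response, else 0
```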

Implications and Future Directions

The research presented in this paper has practical implications for both long-context and retrieval-augmented generation tasks. Because the model supports both capabilities, practitioners can trade accuracy against computational cost depending on the downstream task. However, extending short-context models to long contexts without degrading performance on shorter contexts remains a significant challenge and an important direction for future research. The advances demonstrated with ChatQA 2 offer a promising foundation for further work on scalable and versatile open language models.

Conclusion

Through methodical context window extension and a meticulously staged instruction-tuning regimen, Xu et al. successfully elevate the performance of the Llama3-70B model to rival proprietary models like GPT-4-Turbo. The ChatQA 2 model demonstrates significant improvements in both long-context task handling and RAG performance, providing a flexible and robust solution for complex language processing tasks. The detailed technical recipe and evaluation provided make this work not only a valuable addition to the open LLM community but also a reproducible benchmark for future research in this dynamic field.
