LLoCO: Learning Long Contexts Offline (2404.07979v2)
Abstract: Processing long contexts remains a challenge for LLMs due to the quadratic computational and memory overhead of the self-attention mechanism and the substantial KV cache sizes during generation. We propose LLoCO, a novel approach that addresses this problem by learning contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA. Our method enables an LLM to create a concise representation of the original context and to efficiently retrieve relevant information to answer questions accurately. Our approach extends the effective context window of a 4k-token LLaMA2-7B model to handle up to 128k tokens. We evaluate our approach on several long-context question-answering datasets, demonstrating that LLoCO significantly outperforms in-context learning while using $30\times$ fewer tokens during inference. LLoCO achieves up to $7.62\times$ speed-up during inference and $11.52\times$ higher throughput during finetuning, substantially reducing the cost of long-document question answering. This makes it a promising solution for efficient long-context processing. Our code is publicly available at https://github.com/jeffreysijuntan/lloco.
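To make the two-stage pipeline in the abstract concrete (offline context compression plus in-domain LoRA finetuning, followed by cheap online question answering), here is a minimal Python sketch of how such a flow could be wired up. Every name in it (CompressedContext, compress_offline, answer, and the compressor/llm objects with their encode, load_adapter, and generate methods) is a hypothetical placeholder, not the actual LLoCO API; consult the linked repository for the real implementation.

```python
# Hypothetical sketch of an LLoCO-style pipeline as described in the abstract.
# All names below are placeholders for illustration only; they do not reflect
# the real lloco codebase or any specific LLM library API.

from dataclasses import dataclass
from typing import Any, List


@dataclass
class CompressedContext:
    """Compact offline representation of a long document."""
    doc_id: str
    summary_embeddings: List[Any]  # e.g., a short sequence of summary-token embeddings


def compress_offline(doc_id: str, document: str, compressor: Any) -> CompressedContext:
    # Offline stage: encode the full document once into a concise representation
    # and cache it, so inference never has to re-read the 100k+ token original.
    return CompressedContext(
        doc_id=doc_id,
        summary_embeddings=compressor.encode(document),  # assumed compressor interface
    )


def answer(question: str, ctx: CompressedContext, llm: Any, lora_adapter: Any) -> str:
    # Online stage: attach the in-domain LoRA adapter (finetuned over the same
    # compressed representations) and condition generation on the compact context
    # rather than the raw document, which is where the token savings come from.
    llm.load_adapter(lora_adapter)          # assumed adapter-loading hook
    return llm.generate(
        prefix_embeddings=ctx.summary_embeddings,
        prompt=question,
    )
```

The sketch only captures the division of labor implied by the abstract: the expensive work (reading and compressing the document, finetuning the adapter) happens once offline, while each query touches only the compressed representation and a small adapter.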