Long Context Compression with Activation Beacon (2401.03462v3)
Abstract: Long context compression is a critical research problem due to its significance in reducing the high computational and memory costs associated with LLMs. In this paper, we propose Activation Beacon, a plug-in module for transformer-based LLMs that targets effective, efficient, and flexible compression of long contexts. To achieve this, our method introduces the following technical designs. 1) We directly compress the activations (i.e., the keys and values at every layer), rather than relying on soft prompts to relay information (a major bottleneck for encapsulating the complex information within long contexts). 2) We tailor the compression workflow so that each fine-grained input unit is progressively compressed, enabling high-quality compression and efficient computation during both training and inference. 3) We train the model through compression-based auto-regression, making full use of plain texts and instructional data to optimize the model's compression performance. 4) During training, we randomly sample a compression ratio at each step, teaching the model to support a wide range of compression configurations. Extensive evaluations are conducted on various long-context tasks, such as document understanding, few-shot learning, and Needle-in-a-Haystack, whose lengths (e.g., 128K) may far exceed the maximum training length (20K). Whereas existing methods struggle to handle these challenging tasks, Activation Beacon maintains performance comparable to the uncompressed baseline across various scenarios, achieving a 2x acceleration in inference time and an 8x reduction in memory costs for the KV cache. Our data, model, and code have been released at \url{https://github.com/FlagOpen/FlagEmbedding/}.
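The chunk-wise, ratio-randomized compression workflow described in the abstract can be illustrated with a toy script. The following is a minimal sketch, not the authors' implementation: the mean-pooling "compressor", the chunk size, the candidate ratios, and all function names are illustrative assumptions (in Activation Beacon the condensed activations are produced by a learned, trained mechanism, not a fixed pooling rule). Processing the sequence chunk by chunk and carrying forward only the condensed keys/values is what bounds the cached KV footprint and yields the memory savings the abstract reports.

```python
# Toy sketch (NOT the paper's code) of progressive, chunk-wise KV compression
# with a randomly sampled compression ratio per step.
import random

import torch


def compress_chunk(k: torch.Tensor, v: torch.Tensor, ratio: int):
    """Condense a chunk of per-layer keys/values [chunk_len, dim] into
    roughly chunk_len // ratio slots. Mean pooling stands in for the
    learned compression; any tail shorter than one slot is kept as-is."""
    chunk_len, dim = k.shape
    n_slots = chunk_len // ratio
    if n_slots == 0:  # chunk too short to compress at this ratio
        return k, v
    usable = n_slots * ratio
    k_c = k[:usable].view(n_slots, ratio, dim).mean(dim=1)
    v_c = v[:usable].view(n_slots, ratio, dim).mean(dim=1)
    # keep the uncompressed remainder, if any
    return torch.cat([k_c, k[usable:]]), torch.cat([v_c, v[usable:]])


def progressive_compress(keys, values, chunk_size=1024, ratios=(2, 4, 8, 16, 32)):
    """Walk over the sequence chunk by chunk, keeping only the condensed
    keys/values; the compression ratio is sampled once per call, mirroring
    the randomly sampled ratio used during training."""
    ratio = random.choice(ratios)
    k_cache, v_cache = [], []
    for start in range(0, keys.shape[0], chunk_size):
        k_c, v_c = compress_chunk(
            keys[start:start + chunk_size],
            values[start:start + chunk_size],
            ratio,
        )
        k_cache.append(k_c)
        v_cache.append(v_c)
    return torch.cat(k_cache), torch.cat(v_cache), ratio


if __name__ == "__main__":
    seq_len, head_dim = 8192, 128  # stand-in sizes for one attention head
    k = torch.randn(seq_len, head_dim)
    v = torch.randn(seq_len, head_dim)
    k_c, v_c, ratio = progressive_compress(k, v)
    print(f"ratio={ratio}: {seq_len} -> {k_c.shape[0]} cached positions")
```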