Attention Sorting Combats Recency Bias In Long Context Language Models (2310.01427v1)
Abstract: Current LLMs often fail to incorporate long contexts efficiently during generation. We show that a major contributor to this issue is attention priors that are likely learned during pre-training: relevant information located earlier in the context is attended to less on average. Yet even when models fail to use the information from a relevant document in their response, they still pay preferential attention to that document compared to an irrelevant document at the same position. We leverage this fact to introduce "attention sorting": perform one step of decoding, sort the documents by the attention they receive (highest attention going last), repeat the process, and generate the answer with the newly sorted context. We find that attention sorting improves the performance of long context models. Our findings highlight some challenges in using off-the-shelf LLMs for retrieval augmented generation.
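The loop described above can be sketched concretely. The following is a minimal illustration using a Hugging Face causal LM; the model name (`gpt2` as a stand-in), the prompt template, the choice to average attention over the final layer's heads, and the per-document span bookkeeping are all assumptions for illustration, not the paper's exact recipe.

```python
# Minimal sketch of attention sorting: score each document by the attention
# the next-token position pays to it, then reorder so high-attention
# documents sit last (closest to the question), and repeat.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def attention_sort(model, tokenizer, question, docs, iterations=2):
    """Reorder `docs` so the most-attended documents appear last."""
    for _ in range(iterations):
        # Build the prompt and record each document's (approximate) token span.
        # Spans are approximate because tokenization is not strictly
        # prefix-compositional; good enough for a sketch.
        prefix = "Answer the question using the documents below.\n\n"
        spans, text = [], prefix
        for doc in docs:
            start = len(tokenizer(text, add_special_tokens=False).input_ids)
            text += doc + "\n\n"
            end = len(tokenizer(text, add_special_tokens=False).input_ids)
            spans.append((start, end))
        text += f"Question: {question}\nAnswer:"

        inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
        with torch.no_grad():
            out = model(**inputs, output_attentions=True)

        # One "step of decoding": attention from the final position,
        # averaged over the last layer's heads -> shape (seq_len,).
        attn = out.attentions[-1][0, :, -1, :].mean(dim=0)

        # Score each document by the attention mass on its token span.
        scores = [attn[s:e].sum().item() for s, e in spans]

        # Ascending sort: highest-attention documents go last.
        docs = [d for _, d in sorted(zip(scores, docs), key=lambda p: p[0])]
    return docs

if __name__ == "__main__":
    name = "gpt2"  # stand-in model; the paper targets long-context LLMs
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name)
    reordered = attention_sort(
        lm, tok, "Who wrote Hamlet?",
        ["Doc A: irrelevant filler text.",
         "Doc B: Hamlet was written by William Shakespeare."])
    print(reordered)
```

Sorting in ascending order of attention mass places the most-attended documents at the end of the context, i.e. nearest the question, which is where a recency-biased model attends most; the final answer is then generated from this reordered context.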