In-Context Learning State Vector with Inner and Momentum Optimization (2404.11225v2)
Abstract: Large language models (LLMs) have exhibited an impressive ability to perform In-Context Learning (ICL) from only a few examples. Recent work has indicated that the functions learned through ICL can be represented as compressed vectors derived from the transformer. However, the working mechanisms and optimization of these vectors have yet to be thoroughly explored. In this paper, we address this gap by presenting a comprehensive analysis of these compressed vectors, drawing parallels to parameters trained with gradient descent, and introducing the concept of the state vector. Inspired by work on model soups and momentum-based gradient descent, we propose inner and momentum optimization methods that progressively refine the state vector as a form of test-time adaptation. Moreover, we simulate state vector aggregation in the multiple-example setting, where demonstrations comprising numerous examples are usually too lengthy for regular ICL, and further propose a divide-and-conquer aggregation method to address this challenge. We conduct extensive experiments with Llama-2 and GPT-J in both zero-shot and few-shot settings. The results show that our optimization methods effectively enhance the state vector and achieve state-of-the-art performance on diverse tasks. Code is available at https://github.com/HITsz-TMG/ICL-State-Vector.
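To make the abstract's pipeline concrete, the minimal sketch below illustrates, under loose assumptions and not as the authors' implementation, how state vectors treated as plain tensors might be refined with a model-soup-style inner average, smoothed with a momentum-style update over a trajectory of state vectors, and aggregated across many demonstrations in a divide-and-conquer fashion. The function names (`inner_optimize`, `momentum_refine`, `divide_and_conquer_aggregate`), the tensor shapes, the momentum coefficient `mu`, and the group size are illustrative choices, not details taken from the paper.

```python
# Illustrative sketch only: the paper's exact extraction and update rules may differ.
import torch


def inner_optimize(state_vectors: list[torch.Tensor]) -> torch.Tensor:
    """Model-soup-inspired step (assumed): average state vectors from
    several demonstration groups into a single refined vector."""
    return torch.stack(state_vectors).mean(dim=0)


def momentum_refine(states: list[torch.Tensor], mu: float = 0.9) -> torch.Tensor:
    """Momentum-inspired step (assumed): treat successive state vectors as an
    optimization trajectory and accumulate their differences with a
    heavy-ball-style momentum term."""
    velocity = torch.zeros_like(states[0])
    current = states[0]
    for nxt in states[1:]:
        velocity = mu * velocity + (nxt - current)  # smoothed update direction
        current = current + velocity                # take the momentum step
    return current


def divide_and_conquer_aggregate(
    demo_states: list[torch.Tensor], group_size: int = 4
) -> torch.Tensor:
    """Divide-and-conquer aggregation (assumed): average within small groups of
    demonstrations first, then combine the group-level vectors, so no single
    forward pass needs to hold all demonstrations at once."""
    groups = [demo_states[i:i + group_size] for i in range(0, len(demo_states), group_size)]
    group_vectors = [inner_optimize(group) for group in groups]
    return inner_optimize(group_vectors)


if __name__ == "__main__":
    hidden_dim = 4096  # e.g. the hidden size of a 7B-scale transformer
    demo_states = [torch.randn(hidden_dim) for _ in range(16)]  # stand-ins for extracted state vectors
    refined = momentum_refine(demo_states)
    aggregated = divide_and_conquer_aggregate(demo_states)
    print(refined.shape, aggregated.shape)
```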
Authors: Dongfang Li, Zhenyu Liu, Xinshuo Hu, Zetian Sun, Baotian Hu, Min Zhang