SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking (2306.05426v3)
Abstract: In many domains, autoregressive models attain high likelihood on the task of predicting the next observation. However, this maximum-likelihood (MLE) objective does not necessarily match the downstream use case of autoregressively generating high-quality sequences. MLE weights sequences proportionally to their frequency under the data distribution, providing no guidance for the model's behaviour out of distribution (OOD); this leads to compounding error during autoregressive generation. To address this compounding-error problem, we formulate sequence generation as an imitation learning (IL) problem. This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset, including divergences that place weight on OOD generated sequences. The IL framework also lets us incorporate backtracking by introducing a backspace action into the generation process, which further mitigates compounding error by allowing the model to revert a sampled token if it takes the sequence OOD. The resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes. We identify the SequenceMatch-$\chi^2$ divergence as a more suitable training objective for autoregressive models used for generation, and show empirically that SequenceMatch training leads to improvements over MLE on text generation with language models and on arithmetic.
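The generation-time mechanics of the backspace action described above can be sketched as follows. This is only an illustrative toy, not the paper's implementation: `sample_next` here draws uniformly at random over a tiny hypothetical vocabulary, whereas in SequenceMatch the trained model itself assigns probability to the backspace action and learns when to backtrack.

```python
import random

BACKSPACE = "<bksp>"  # hypothetical special token for the backtracking action

def sample_next(prefix):
    # Toy stand-in for a model's next-token distribution. A real model
    # would condition on `prefix`; here we sample uniformly, including
    # the backspace action, purely to demonstrate the control flow.
    vocab = ["a", "b", "c", BACKSPACE]
    return random.choice(vocab)

def generate(max_steps=20, seed=0):
    """Autoregressive generation with a backspace action: sampling the
    backspace token deletes the most recent token instead of appending,
    letting the sampler revert a token that took the sequence OOD."""
    random.seed(seed)
    seq = []
    for _ in range(max_steps):
        tok = sample_next(seq)
        if tok == BACKSPACE:
            if seq:        # revert the last sampled token, if any
                seq.pop()
        else:
            seq.append(tok)
    return seq
```

The key point is that the backspace token never appears in the final output; it only edits the partial sequence during generation, so no architectural change to the underlying autoregressive model is required beyond one extra vocabulary entry.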