We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices to train FIM models. We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research.
The paper introduces a method to add text infilling capability to causal decoder-based models, using a data transformation technique during training, and confirms its effectiveness through extensive experiments.
A key discovery is the "FIM-for-free" property: training with high proportions of FIM-transformed data does not degrade left-to-right generative performance.
The research highlights the efficiency of incorporating FIM during pretraining over finetuning, and explores several hyperparameters affecting the model's infilling capabilities.
Autoregressive language models have seen significant advances, particularly in open-ended text generation. Among these models, causal decoder-based architectures such as the GPT series have demonstrated superior performance compared to other paradigms like encoder-only and encoder-decoder models. However, a crucial capability missing in these models is text infilling—where the model generates text conditioned on both preceding and succeeding context.
This paper introduces a method to imbue causal decoder-based models with fill-in-the-middle (FIM) capabilities. The fundamental approach is a simple data transformation: a middle span of text within a document is relocated to the end, so the model practices infilling during ordinary training. The authors then investigate whether this transformation harms the model's traditional left-to-right generative ability, confirming the method's effectiveness through extensive experiments and benchmarks.
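The data transformation described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' released code: the sentinel strings `<PRE>`, `<SUF>`, `<MID>` stand in for the special tokenizer tokens a real implementation would use, and the `fim_rate` parameter is the transformation frequency ablated in the paper.

```python
import random

# Illustrative sentinels; real implementations use dedicated special tokens
# added to the tokenizer vocabulary, not literal strings.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def fim_transform(doc, fim_rate=0.5, rng=None):
    """With probability `fim_rate`, move a random middle span to the end.

    Produces the prefix-suffix-middle (PSM) ordering:
        <PRE> prefix <SUF> suffix <MID> middle
    Otherwise the document is left in plain left-to-right form, so a mixed
    dataset retains ordinary autoregressive examples.
    """
    rng = rng or random.Random()
    if rng.random() > fim_rate:
        return doc  # keep as plain autoregressive data
    # Pick two split points uniformly at random to define the middle span.
    i, j = sorted(rng.randrange(len(doc) + 1) for _ in range(2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
```

Because the loss is still the standard next-token objective on the rearranged sequence, no architectural change is needed; at inference time the model infills by being prompted with the prefix and suffix and generating the middle.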
In a pivotal discovery, the authors demonstrate what they term the "FIM-for-free" property: training models with a significant proportion of FIM-transformed data does not adversely affect their left-to-right generative performance. This claim is validated by training models with various proportions of FIM transformation (up to 90%) and evaluating them on standard autoregressive benchmarks. The left-to-right test loss of models trained with FIM matched that of models trained without it, indicating that the infilling capability is acquired at essentially no cost.
The authors explore several key hyperparameters: the FIM rate (the fraction of training data transformed), the structure of the transformation (such as the ordering of the prefix, suffix, and middle segments), and the method of selecting the infill span.
A notable insight from the study is the differential efficiency of pretraining versus finetuning for acquiring FIM capability: incorporating FIM during pretraining comes essentially for free, whereas adding it afterward through finetuning requires substantially more compute to reach comparable infilling performance.
The paper closes by proposing several directions for future work on infilling.
This study establishes autoregressive models as efficient generators for diverse text completion tasks, including infilling. The FIM-for-free property offers a compelling argument for adopting FIM training as a new standard, ensuring that language models are equipped with versatile capabilities without sacrificing traditional performance metrics. The findings and methodologies provided pave the way for future exploration and operational deployment of more adaptable and robust language models.