Cramming: Training a Language Model on a Single GPU in One Day

(2212.14034)
Published Dec 28, 2022 in cs.CL and cs.LG

Abstract

Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.

Overview

  • Language models, especially transformer-based ones, scale in performance with increased parameters but require significant computational resources.

  • The paper introduces 'Cramming': training a transformer-based language model from scratch on a single consumer GPU within 24 hours.

  • Every aspect of the training pipeline was reassessed to achieve BERT-like results under resource constraints.

  • The study found that adjusted training strategies could produce near-BERT performance, even with reduced compute capacity.

  • Scaling down models with inventive training approaches allows for broader research opportunities and practical applications with limited computational power.

Introduction to Language Model Training

Training deep learning models with transformer architectures such as BERT has become the dominant route to state-of-the-art performance on natural language processing tasks, from understanding to generation. A key driver of this success is the ability of these models to scale: performance improves predictably as the number of parameters and the amount of training compute grow. This behavior, described by empirical power laws, has fueled a computational arms race toward ever-larger models, raising a barrier to entry for many researchers and practitioners due to the resource-heavy requirements.
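For context, scaling laws of this kind are commonly written as a power law relating test loss to model size. The form below (after Kaplan et al., 2020) is a standard illustration, not an equation taken from this paper; N is the (non-embedding) parameter count, and N_c and alpha_N are empirically fitted constants.

```latex
% Illustrative parameter-count scaling law (Kaplan et al., 2020), not from this paper:
% test loss falls as a power law in model size N, with fitted constants N_c and \alpha_N.
L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}
```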

Training Efficiencies on a Single GPU

The paper presented a focused exploration of training efficiency. It challenged the status quo by asking how well a transformer-based language model can perform when trained entirely from scratch on a single consumer GPU within a 24-hour window. This setup, termed "Cramming," inverts the usual scaling-up regime and asks what downstream performance is attainable under such tight computational constraints. By working within these limits, the study aimed to make pretraining research accessible to broader academic inquiry and to enable practitioners to tackle practical applications without the need for immense computational power.
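As a rough sketch of what such a wall-clock-budgeted setup looks like in code, the loop below pretrains a BERT-shaped masked language model until a fixed 24-hour deadline. It is illustrative only: the data iterator (mlm_batches) is a hypothetical placeholder, the hyperparameters are not the paper's, and details such as learning-rate scheduling and mixed precision are omitted.

```python
import time
import torch
from transformers import BertConfig, BertForMaskedLM

BUDGET_SECONDS = 24 * 60 * 60                       # the hard one-day limit
device = "cuda" if torch.cuda.is_available() else "cpu"

model = BertForMaskedLM(BertConfig()).to(device)    # BERT-base-shaped model, trained from scratch
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

start = time.time()
model.train()
for batch in mlm_batches():                         # hypothetical iterator yielding tokenized batches
    if time.time() - start > BUDGET_SECONDS:
        break                                       # the clock, not a step count, ends pretraining
    input_ids = batch["input_ids"].to(device)
    labels = batch["labels"].to(device)             # -100 everywhere except masked positions
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```

The defining feature is that the stopping criterion is elapsed wall-clock time on one GPU, so every change to the recipe is judged by how much useful progress it buys within the same fixed budget.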

Revisiting Training Components

To achieve BERT-like performance with limited resources, nearly every component of the pretraining pipeline was re-evaluated. The researchers provided modified training strategies that approach the BERT baseline on downstream tasks, on a compute budget far smaller than the original's. The study also examined how architecture and training techniques behave under reduced compute, noting which approaches scaled down effectively.
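The sketch below lists the kinds of pipeline knobs such a re-evaluation touches: model shape, optimizer schedule, effective batch size, and data curation. The specific values and flags are placeholders for illustration, not the paper's final recipe.

```python
# Illustrative configuration of the knobs a budget-constrained pretraining study re-examines.
# Values are placeholders, not the paper's final settings.
cram_recipe = {
    "architecture": {
        "hidden_size": 768,                 # keep a BERT-base-like shape
        "num_layers": 12,
        "dropout": 0.0,                     # assumed: little repetition of data within one day,
                                            # so regularization via dropout is less useful
    },
    "optimization": {
        "optimizer": "AdamW",
        "lr_schedule": "one_cycle",         # ramp up, then decay to zero by the end of the budget
        "peak_lr": 1e-3,
        "grad_accumulation_steps": 16,      # large effective batch on a single GPU
    },
    "data": {
        "tokenizer_vocab_size": 32768,
        "filter_low_quality_sequences": True,  # cheap data curation before training
    },
}
```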

Evaluation and Insights

The outcomes of the training process showed that, even under the strict computational budget, performance closely followed the scaling laws observed in large-compute settings. Notably, despite the evident challenges of scaling down, targeted adjustments to the training recipe still yielded measurable improvements. The final results indicated that models trained under this "Cramming" setup can come close to BERT on several downstream benchmarks, establishing a promising avenue for research and practice where computational resources are a limiting factor.
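Downstream performance in this kind of comparison is typically measured by fine-tuning the pretrained checkpoint on standard benchmarks such as GLUE. The snippet below is a minimal sketch of such an evaluation using the Hugging Face libraries, assuming the crammed model was saved at the hypothetical path ./crammed-bert; it is not the paper's exact evaluation harness.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical local path to the crammed checkpoint in Hugging Face format.
checkpoint = "./crammed-bert"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Fine-tune and evaluate on one GLUE task (MRPC: paraphrase detection).
raw = load_dataset("glue", "mrpc")
encoded = raw.map(
    lambda ex: tokenizer(ex["sentence1"], ex["sentence2"], truncation=True),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())  # compare against a BERT-base baseline fine-tuned the same way
```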
