Cramming: Training a Language Model on a Single GPU in One Day

(2212.14034)
Published Dec 28, 2022 in cs.CL and cs.LG

Abstract

Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.

Overview

  • Language models, especially transformer-based ones, scale in performance with increased parameters but require significant computational resources.

  • The paper introduces 'Cramming': training a transformer-based language model from scratch on a single consumer GPU within 24 hours.

  • Every aspect of the training pipeline was reassessed to achieve BERT-like results under resource constraints.

  • The study found that adjusted training strategies could produce near-BERT performance, even with reduced compute capacity.

  • Scaling down models with inventive training approaches allows for broader research opportunities and practical applications with limited computational power.

Introduction to Language Model Training

Training deep learning models with transformer architectures such as BERT has become the dominant route to state-of-the-art performance on natural language processing tasks, from understanding to generation. A key driver of this success is the ability of these models to scale: performance improves predictably as the number of parameters and the amount of training compute grow. This behavior, described by empirical power laws, has fueled a computational arms race toward ever-larger models, raising a barrier to entry for many researchers and practitioners due to the resource-heavy requirements.
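For context, scaling laws of this kind are commonly written as a power law relating test loss to model size. The form below (after Kaplan et al., 2020) is a standard illustration, not an equation taken from this paper; N is the (non-embedding) parameter count, and N_c and alpha_N are empirically fitted constants.

```latex
% Illustrative parameter-count scaling law (Kaplan et al., 2020), not from this paper:
% test loss falls as a power law in model size N, with fitted constants N_c and \alpha_N.
L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}
```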

Training Efficiencies on a Single GPU

The paper presented a focused exploration of training efficiency. It challenged the status quo by asking how well a transformer-based language model can perform when trained entirely from scratch on a single consumer GPU within a 24-hour window. This setup, termed "Cramming," inverts the usual scaling-up regime and asks what downstream performance is attainable under such tight computational constraints. By working within these limits, the study aimed to make pretraining research accessible to broader academic inquiry and to enable practitioners to tackle practical applications without the need for immense computational power.
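As a rough sketch of what such a wall-clock-budgeted setup looks like in code, the loop below pretrains a BERT-shaped masked language model until a fixed 24-hour deadline. It is illustrative only: the data iterator (mlm_batches) is a hypothetical placeholder, the hyperparameters are not the paper's, and details such as learning-rate scheduling and mixed precision are omitted.

```python
import time
import torch
from transformers import BertConfig, BertForMaskedLM

BUDGET_SECONDS = 24 * 60 * 60                       # the hard one-day limit
device = "cuda" if torch.cuda.is_available() else "cpu"

model = BertForMaskedLM(BertConfig()).to(device)    # BERT-base-shaped model, trained from scratch
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

start = time.time()
model.train()
for batch in mlm_batches():                         # hypothetical iterator yielding tokenized batches
    if time.time() - start > BUDGET_SECONDS:
        break                                       # the clock, not a step count, ends pretraining
    input_ids = batch["input_ids"].to(device)
    labels = batch["labels"].to(device)             # -100 everywhere except masked positions
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```

The defining feature is that the stopping criterion is elapsed wall-clock time on one GPU, so every change to the recipe is judged by how much useful progress it buys within the same fixed budget.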

Revisiting Training Components

To achieve BERT-like performance with limited resources, nearly every component of the pretraining pipeline was re-evaluated. The researchers provided modified training strategies that approach the BERT baseline on downstream tasks, on a compute budget far smaller than the original's. The study also examined how architecture and training techniques behave under reduced compute, noting which approaches scaled down effectively.
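The sketch below lists the kinds of pipeline knobs such a re-evaluation touches: model shape, optimizer schedule, effective batch size, and data curation. The specific values and flags are placeholders for illustration, not the paper's final recipe.

```python
# Illustrative configuration of the knobs a budget-constrained pretraining study re-examines.
# Values are placeholders, not the paper's final settings.
cram_recipe = {
    "architecture": {
        "hidden_size": 768,                 # keep a BERT-base-like shape
        "num_layers": 12,
        "dropout": 0.0,                     # assumed: little repetition of data within one day,
                                            # so regularization via dropout is less useful
    },
    "optimization": {
        "optimizer": "AdamW",
        "lr_schedule": "one_cycle",         # ramp up, then decay to zero by the end of the budget
        "peak_lr": 1e-3,
        "grad_accumulation_steps": 16,      # large effective batch on a single GPU
    },
    "data": {
        "tokenizer_vocab_size": 32768,
        "filter_low_quality_sequences": True,  # cheap data curation before training
    },
}
```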

Evaluation and Insights

The outcomes of the training process showed that, even under the strict computational budget, performance closely followed the scaling laws observed in large-compute settings. Notably, despite the evident challenges of scaling down, targeted adjustments to the training recipe still yielded measurable improvements. The final results indicated that models trained under this "Cramming" setup can come close to BERT on several downstream benchmarks, establishing a promising avenue for research and practice where computational resources are a limiting factor.
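Downstream performance in this kind of comparison is typically measured by fine-tuning the pretrained checkpoint on standard benchmarks such as GLUE. The snippet below is a minimal sketch of such an evaluation using the Hugging Face libraries, assuming the crammed model was saved at the hypothetical path ./crammed-bert; it is not the paper's exact evaluation harness.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical local path to the crammed checkpoint in Hugging Face format.
checkpoint = "./crammed-bert"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Fine-tune and evaluate on one GLUE task (MRPC: paraphrase detection).
raw = load_dataset("glue", "mrpc")
encoded = raw.map(
    lambda ex: tokenizer(ex["sentence1"], ex["sentence2"], truncation=True),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())  # compare against a BERT-base baseline fine-tuned the same way
```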
