Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
The paper presents an in-depth exploration of GPT-3, a 175 billion parameter autoregressive language model, highlighting its impressive few-shot learning capabilities across diverse NLP tasks.
GPT-3 demonstrates significant performance improvements on benchmarks such as LAMBADA, TriviaQA, and SuperGLUE, and performs respectably at machine translation without fine-tuning, particularly when translating into English.
Despite its strengths, GPT-3 has notable limitations: it underperforms on tasks that benefit from bidirectional context, its benchmark results carry a risk of train-test contamination, and it reproduces biases present in its web-scale training data, raising important ethical and practical considerations.
GPT-3 represents a considerable scaling effort over previous non-sparse language models, with roughly 10x more parameters than any predecessor, and this scale enables strong few-shot learning on a diverse set of NLP tasks. This discussion outlines the paper's core contributions, covering the model's performance, limitations, and broader implications for both the research community and society.
The model was pre-trained on a large and diverse corpus of internet text, a weighted mixture of filtered Common Crawl, WebText2, Books1, Books2, and English-language Wikipedia. Notably, training used no task-specific architectures; the approach is entirely task-agnostic, so GPT-3's performance hinges primarily on model size and the scale and quality of its pre-training data.
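The weighted mixture above can be pictured as a sampling scheme over corpora. The sketch below uses the approximate sampling proportions reported in the paper (they are rounded, so they need not sum exactly to 1); the function names and structure are illustrative, not the authors' code.

```python
import random

# Approximate GPT-3 training-mix proportions (rounded, per the paper);
# these are sampling weights, not raw corpus sizes.
DATASET_WEIGHTS = {
    "common_crawl_filtered": 0.60,
    "webtext2": 0.22,
    "books1": 0.08,
    "books2": 0.08,
    "wikipedia": 0.03,
}

def sample_dataset(rng: random.Random) -> str:
    """Pick which corpus the next training document is drawn from.

    random.choices normalizes the weights, so slight rounding error
    in the published percentages is harmless.
    """
    names = list(DATASET_WEIGHTS)
    weights = [DATASET_WEIGHTS[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Draw 10,000 documents and tally which corpus each came from.
rng = random.Random(0)
counts = {name: 0 for name in DATASET_WEIGHTS}
for _ in range(10_000):
    counts[sample_dataset(rng)] += 1
```

Because higher-quality corpora are upweighted, sources such as Wikipedia are seen more often per byte than their raw size would suggest.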
Performance was evaluated across a broad spectrum of benchmarks in zero-shot, one-shot, and few-shot settings, with no gradient-based fine-tuning. The authors provide a systematic comparison across eight model sizes, from 125 million to 175 billion parameters, clarifying how individual capabilities improve with scale.
GPT-3 posts notable gains on NLP benchmarks, particularly under few-shot configurations. Key results include a new state of the art on LAMBADA (86.4% few-shot accuracy), 71.2% on TriviaQA, and a SuperGLUE average that exceeds a fine-tuned BERT-Large.
Noteworthy is GPT-3's in-context learning: the model performs new tasks simply by conditioning on examples supplied in the prompt at test time, with no weight updates. This was evidenced by substantial gains from the zero-shot to the few-shot setting across tasks such as PIQA and reading-comprehension datasets like CoQA.
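The mechanics of in-context learning can be sketched as simple prompt assembly: the zero-, one-, and few-shot settings differ only in how many demonstrations are concatenated before the query. The prompt template below (the "=>" separator, the example task) is illustrative, not the paper's exact format.

```python
def build_prompt(task_description: str,
                 examples: list[tuple[str, str]],
                 query: str) -> str:
    """Assemble an in-context learning prompt.

    examples=[]            -> zero-shot (instruction only)
    len(examples) == 1     -> one-shot
    len(examples) >= 2     -> few-shot
    The model sees this text and must complete it; no gradients flow.
    """
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")          # model completes after "=>"
    return "\n".join(lines)

# Hypothetical few-shot translation prompt with two demonstrations.
demos = [("cheese", "fromage"), ("house", "maison")]
prompt = build_prompt("Translate English to French:", demos, "cat")
```

Conditioning on such a prompt is the entire "learning" step; the same frozen weights handle every task, which is what makes the approach task-agnostic.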
Despite its extensive capabilities, GPT-3 has several limitations that warrant further investigation: it is weaker on tasks that benefit from bidirectional context or sentence comparison, such as WiC and ANLI; its results are potentially inflated by overlap between its web-scraped training data and evaluation sets; and it inherits social biases, including those around gender, race, and religion, from its training corpus.
The implications of GPT-3 extend beyond improved NLP benchmarks. The model has significant potential for both beneficial and harmful applications, from improving automated assistance systems to enabling sophisticated, automated misinformation at scale.
Exploring bidirectional training techniques, refining in-context learning algorithms, and broadening the range of tasks and modalities integrated with language models are promising avenues for advancing the capabilities demonstrated by GPT-3. The ongoing challenge will be to balance scaling benefits with interpretability, fairness, and the responsible use of AI technologies.
In conclusion, GPT-3 signifies a substantial stride in the evolution of language models, demonstrating the powerful potential of scaling up model size and training data. However, it simultaneously unveils new challenges and responsibilities in the development and deployment of AI systems within society. The balance between innovation and ethical practice will be pivotal in steering the future trajectory of AI research and its applications.