Language Model Behavior: A Comprehensive Survey (2303.11504v2)

Published 20 Mar 2023 in cs.CL

Abstract: Transformer LLMs have received widespread public attention, yet their generated text is often surprising even to NLP researchers. In this survey, we discuss over 250 recent studies of English LLM behavior before task-specific fine-tuning. LLMs possess basic capabilities in syntax, semantics, pragmatics, world knowledge, and reasoning, but these capabilities are sensitive to specific inputs and surface features. Despite dramatic increases in generated text quality as models scale to hundreds of billions of parameters, the models are still prone to unfactual responses, commonsense errors, memorized text, and social biases. Many of these weaknesses can be framed as over-generalizations or under-generalizations of learned patterns in text. We synthesize recent results to highlight what is currently known about LLM capabilities, thus providing a resource for applied work and for research in adjacent fields that use LLMs.

Citations (74)

Summary

  • The paper synthesizes over 250 LLM behavior studies, highlighting key insights on generalization and inherent biases.
  • It evaluates language models across syntax, semantics, pragmatics, and reasoning, noting significant sensitivity to input features.
  • The survey underscores that despite scaling, LLMs remain prone to factual errors, memorization, and social biases due to over- or under-generalization.

LLM Behavior: A Comprehensive Survey

This essay summarizes a comprehensive survey (2303.11504) of over 250 studies on English LLM behavior before task-specific fine-tuning. The survey synthesizes recent findings on syntax, semantics, pragmatics, world knowledge, reasoning, memorization, and bias in LLMs. It emphasizes that while LLMs exhibit basic capabilities across these domains, their performance is often sensitive to specific inputs and surface features. Despite scaling, LLMs remain prone to factual errors, commonsense errors, memorization, and social biases, often stemming from over- or under-generalization of learned text patterns.

Transformer LLMs

The Transformer architecture, introduced in 2017, forms the basis for modern LLMs. These models are trained to predict masked or upcoming words from context, using techniques like byte pair encoding (BPE) for tokenization. Token embeddings are passed through a stack of Transformer layers, incorporating self-attention mechanisms to create contextualized representations. Position encoding techniques, such as absolute or relative embeddings, capture word order information. Models typically contain between 100M and 500B parameters and are pre-trained on text corpora ranging from 5B to 1.5T tokens, with optimization batch sizes from 100K to 4M tokens. The survey focuses on pre-trained models, while acknowledging the impact of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) in more recent LLMs. Downstream tasks, such as text classification and generation, are addressed through fine-tuning, zero-shot prompting, or few-shot prompting. Open-ended text generation uses methods like greedy decoding, temperature sampling, top-k sampling, and nucleus sampling.
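
To make the decoding methods concrete, here is a minimal sketch of how temperature, top-k, and nucleus (top-p) sampling truncate and renormalize a next-token distribution. The logits and five-token vocabulary are illustrative placeholders, not taken from any model discussed in the survey.

```python
import numpy as np

def sample_next_token(logits, k=None, p=None, temperature=1.0, rng=None):
    """Sample a token id from next-token logits using temperature scaling,
    optional top-k truncation, and optional nucleus (top-p) truncation."""
    rng = rng or np.random.default_rng()
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]            # token ids sorted by probability
    sorted_probs = probs[order]

    keep = np.ones_like(sorted_probs, dtype=bool)
    if k is not None:                          # top-k: keep only the k most likely tokens
        keep[k:] = False
    if p is not None:                          # nucleus: keep the smallest prefix with mass >= p
        cumulative = np.cumsum(sorted_probs)
        keep &= cumulative - sorted_probs < p

    truncated = np.where(keep, sorted_probs, 0.0)
    truncated /= truncated.sum()               # renormalize over the kept tokens
    return int(rng.choice(order, p=truncated))

# Illustrative 5-token vocabulary (placeholder logits, not from a real model).
logits = np.array([2.0, 1.5, 0.3, -1.0, -2.0])
print(sample_next_token(logits, k=3))      # top-k sampling among the 3 most likely tokens
print(sample_next_token(logits, p=0.9))    # nucleus sampling over 90% of the probability mass
```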

Syntactic Abilities

The survey examines LLM predictions from a syntactic perspective, noting that LLMs generally produce grammatical text. Evaluations compare model probabilities for minimal pair examples differing in grammaticality. Both autoregressive and masked LLMs assign higher probabilities to grammatical tokens and exhibit consistency with hierarchical syntactic structure. The models also recognize licensing, where the grammaticality of a token depends on an upstream licensor token. Performance improves with model and pre-training corpus size. Subject-verb agreement is learned, but the models are sensitive to intervening clauses and specific words. Syntactic rules are acquired relatively early in pre-training, with smaller models achieving reasonable syntactic performance. Notably, word order is not always necessary, as models trained on shuffled words can still perform well, and LLMs can learn word order without explicit position information.
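
The minimal-pair evaluations described above can be sketched roughly as follows: each sentence in a pair is scored by its summed token log-probability under an autoregressive model, and the grammatical variant should score higher. This sketch assumes the Hugging Face transformers library and uses GPT-2 only as a small, freely available stand-in for the models the survey covers.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is used here purely as an illustrative small model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence):
    """Sum of log P(token | preceding tokens) over the sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for tokens 2..n
    targets = ids[:, 1:]
    token_scores = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_scores.sum().item()

# A subject-verb agreement minimal pair (illustrative example).
grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."
print(sentence_logprob(grammatical) > sentence_logprob(ungrammatical))
```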

Semantics, Pragmatics, and Compositionality

The survey explores semantic abilities, focusing on how LLMs construct meaning from text. LLMs demonstrate learning of word meanings and relationships, tracking of entities in described situations, and recognition of basic figurative language, but they struggle with negation and pragmatics. Lexical and compositional semantics are assessed by examining how individual word meanings contribute to phrase meaning. LLMs can predict frequent words from their definitions and vice versa, but struggle with infrequent words. They predict noun hypernyms when prompted with template sentences, with performance degrading for infrequent noun-hypernym pairs, and their confidence in hypernym predictions correlates with human-rated typicality. LLMs also exhibit verb-specific knowledge of how a verb's properties constrain the syntactic and semantic structures it appears in.
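
The hypernym probing just described relies on cloze-style templates. A rough sketch of the idea follows, using a masked language model purely as an illustrative stand-in; the choice of bert-base-uncased and the template wording are assumptions, not specifics from the survey.

```python
from transformers import pipeline

# bert-base-uncased is an illustrative masked LM; the surveyed studies
# probe a range of models with similar cloze templates.
fill = pipeline("fill-mask", model="bert-base-uncased")

def hypernym_candidates(noun, top_k=5):
    """Probe a noun's hypernym with a cloze template such as 'A robin is a [MASK].'"""
    template = f"A {noun} is a [MASK]."
    return [(r["token_str"], round(r["score"], 3)) for r in fill(template, top_k=top_k)]

print(hypernym_candidates("robin"))    # frequent noun: completions tend to be sensible
print(hypernym_candidates("axolotl"))  # infrequent nouns are probed less reliably
```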

Negation poses a challenge, with models often ignoring negated prompts and performing worse as models scale. Situation models are constructed, enabling LLMs to track entities throughout a passage, but performance degrades with multiple nouns or complex inferences. Basic analogies, metaphors, and figurative language are recognized, with improvements based on model size. Pragmatic understanding is limited, as LLMs struggle with implied meanings, presuppositions, and scalar implicatures.

Commonsense and World Knowledge

Beyond syntax and semantics, LLMs exhibit basic world knowledge, including commonsense reasoning and facts, and this knowledge improves with model size. They learn encyclopedic facts and commonsense properties of objects, though not always reliably, and they can infer typical relationships between actions and events. When facts are expressed as sentences, LLMs assign higher probabilities to them than to plausible alternatives. Performance is worse when predicting numeric literals and numerical commonsense. LLMs are sensitive to the context and frequency of facts in the pre-training corpus, and the facts they have learned continue to shift even late in pre-training.

Logical and Numerical Reasoning

LLMs exhibit basic logical reasoning when prompted with instructions or examples, and can perform simple step-by-step and numerical reasoning. Step-by-step reasoning emerges with explicit prompts, as models can combine facts with reasoning to some extent, but complex reasoning remains a challenge. Basic numerical and probabilistic reasoning abilities depend heavily on the specific inputs. GPT-3 can perform addition and subtraction for small numbers but struggles with large numbers, and performance drops when word problems include irrelevant context.
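
The explicit step-by-step prompting mentioned here is typically elicited by appending a cue such as "Let's think step by step." to the question. A minimal sketch of the prompt construction follows; the `complete` function is a hypothetical stand-in for whichever model is being evaluated, and the word problem is only an illustrative example.

```python
# Zero-shot step-by-step prompting: the same question is posed with and
# without an explicit reasoning cue appended to the answer prefix.

def complete(prompt: str) -> str:
    """Hypothetical text-completion call; substitute your LLM of choice."""
    raise NotImplementedError

question = (
    "A juggler has 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
)

direct_prompt = f"Q: {question}\nA:"
cot_prompt = f"Q: {question}\nA: Let's think step by step."

# Per the survey, larger models are far more likely to produce correct
# multi-step reasoning with the second style of prompt.
print(cot_prompt)
```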

Memorization vs. Novel Text

LLMs are likely to generate memorized text from their pre-training corpus. As models scale, they are more likely to generate memorized text, paraphrases of memorized text, and text memorized after fewer observations. Deduplicating the pre-training data can reduce memorization and improve language modeling. However, LLMs can also generate novel text that is consistent with the input context rather than simply reproducing memorized examples. Compared with human-written text, LLM-generated text contains more concrete and frequent words and shallower syntactic structures, although it is generally consistent with any provided input context.
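
Memorization is often measured by checking generations for long verbatim overlaps with the pre-training corpus. A toy sketch of that kind of check appears below; the whitespace tokenization, 8-gram threshold, and miniature corpus are illustrative assumptions, since real studies operate on billions of tokens with model-specific tokenizers.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_memorized(generated: str, corpus_docs, n: int = 8) -> bool:
    """Flag a generation that shares any verbatim n-gram (whitespace tokens)
    with a pre-training document. The n=8 span length is an illustrative choice."""
    gen_ngrams = ngrams(generated.split(), n)
    return any(gen_ngrams & ngrams(doc.split(), n) for doc in corpus_docs)

# Toy corpus; a real check would scan the full pre-training data.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
print(looks_memorized("he said the quick brown fox jumps over the lazy dog near me", corpus))
```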

Bias, Privacy, and Toxicity

Despite their capabilities, LLMs generate biased, offensive, and private text. Models are susceptible to harmful social biases and stereotypes, and they can be "red-teamed" into producing harmful and offensive text such as swearing, harassment, insults, and hate speech. With human-written prompts, LLMs can be induced to generate personally identifiable information (PII) such as phone numbers or email addresses. LLM behavior varies across demographic groups, both in raw performance and in the probability of generating toxic text. LLMs reflect harmful stereotypes based on gender, sexuality, race, religion, and other demographic identities, and their apparent "personality" and politics depend on the input context.

Misinformation, Personality, and Politics

LLMs can generate convincing unfactual text and unsafe advice that is difficult to distinguish from human-generated text, making these models potential vectors for spreading misinformation. The generated text depends on the political leaning and perceived personality of the input context. People are more likely to rate GPT-3 generated tweets as true than human-generated tweets, regardless of whether they are factual.

Effects of Scale

Larger LLMs exhibit substantial performance improvements on text generation tasks, which is why most recent work focuses on model scaling. Scaling results are limited by the available published studies, as most do not evaluate models beyond 175B parameters. Larger models learn syntactic rules more robustly and learn more commonsense properties and facts, but are worse at recognizing negation. Some tasks show unexpectedly large or sudden performance improvements beyond 175B parameters. Larger models are also more likely to mimic political opinions in a given input, but model size appears to have little impact on offensive text generation. Large models can be prompted to generate explicit multi-step reasoning by asking them to "think step by step", but logical reasoning improves only slightly beyond around 10B parameters.

Language Modeling as Generalization

Many strengths and weaknesses of LLMs can be viewed through the lens of text pattern generalization. Over-generalizations and under-generalizations of learned patterns simultaneously explain the models' impressive capabilities and their brittle responses. LLMs are trained to generalize from text examples observed during pre-training to novel examples, but there are infinitely many generalization strategies an LLM could apply, and many model weaknesses can be interpreted as over- or under-generalization. For example, a model that fails to recall a fact and instead defaults to a heuristic, such as predicting tokens semantically similar to the input context, is over-generalizing a learned surface pattern.

Levels of Analysis in Understanding LLMs

This survey focuses on behavioral analyses of LLMs, whereas other studies investigate the internal mechanisms that lead LLMs to their predictions. Mechanistic analyses have probed the linguistic information that can be extracted from LLMs' internal vector representations of tokens, causal links between individual neurons and language modeling predictions, and the functions of individual attention heads. Future work might apply similar techniques to a wider range of LLM behaviors, bridging the gap between behavioral and mechanistic levels of analysis.
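
As one concrete illustration of the representation-probing approach mentioned above, a simple classifier can be fit on a model's hidden states to test whether a linguistic property is linearly decodable. The model choice (bert-base-uncased), layer, and toy noun-vs-verb labels below are illustrative assumptions rather than details from the survey.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

# bert-base-uncased is an illustrative choice; probing studies target many models and layers.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def word_vector(sentence, word, layer=8):
    """Hidden-state vector of the first subword of `word` at a chosen layer."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]    # (seq_len, hidden_dim)
    word_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word)[0])
    position = (enc.input_ids[0] == word_id).nonzero()[0].item()
    return hidden[position].numpy()

# Toy probe: is a noun-vs-verb distinction linearly decodable from layer-8 vectors?
examples = [("the dog barked loudly", "dog", 0), ("the dog barked loudly", "barked", 1),
            ("a cat slept quietly", "cat", 0), ("a cat slept quietly", "slept", 1),
            ("the bird sang sweetly", "bird", 0), ("the bird sang sweetly", "sang", 1)]
X = [word_vector(s, w) for s, w, _ in examples]
y = [label for _, _, label in examples]
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))  # training accuracy on toy data, not a real probing result
```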

Conclusion

The survey synthesizes a wide range of LLM capabilities and weaknesses, finding that LLMs remain sensitive to specific inputs and surface features even as they scale to hundreds of billions of parameters. Many model strengths and weaknesses can be framed as correct or incorrect generalizations of text patterns. By distilling what is currently known about LLM capabilities, the survey aims to inform the deployment and regulation of LLMs and to inspire future LLM analysis research.
