Log Probabilities Are a Reliable Estimate of Semantic Plausibility in Base and Instruction-Tuned Language Models (2403.14859v2)

Published 21 Mar 2024 in cs.CL and cs.AI

Abstract: Semantic plausibility (e.g. knowing that "the actor won the award" is more likely than "the actor won the battle") serves as an effective proxy for general world knowledge. Language models (LMs) capture vast amounts of world knowledge by learning distributional patterns in text, accessible via log probabilities (LogProbs) they assign to plausible vs. implausible outputs. The new generation of instruction-tuned LMs can now also provide explicit estimates of plausibility via prompting. Here, we evaluate the effectiveness of LogProbs and basic prompting to measure semantic plausibility, both in single-sentence minimal pairs (Experiment 1) and short context-dependent scenarios (Experiment 2). We find that (i) in both base and instruction-tuned LMs, LogProbs offer a more reliable measure of semantic plausibility than direct zero-shot prompting, which yields inconsistent and often poor results; (ii) instruction-tuning generally does not alter the sensitivity of LogProbs to semantic plausibility (although it sometimes decreases it); (iii) across models, context mostly modulates LogProbs in expected ways, as measured by three novel metrics of context-sensitive plausibility and their match to explicit human plausibility judgments. We conclude that, even in the era of prompt-based evaluations, LogProbs constitute a useful metric of semantic plausibility, both in base and instruction-tuned LMs.

Summary

The paper finds that log probabilities are a more reliable estimate of semantic plausibility than direct zero-shot prompting, in both base and instruction-tuned models.
It shows that instruction tuning generally leaves the sensitivity of log probabilities to plausibility unchanged and sometimes reduces it, so tuned variants align with human plausibility judgments no better than their base counterparts.
The study demonstrates that context-dependent log probability measurements at the word level closely replicate human sensitivity to semantic cues.

Log Probabilities and Semantic Plausibility in LLMs

The paper "Log Probabilities Are a Reliable Estimate of Semantic Plausibility in Base and Instruction-Tuned Language Models" examines how well log probabilities (abbreviated LL below; the abstract calls them LogProbs) measure semantic plausibility in LLMs. It covers both base models and instruction-tuned variants and tests their plausibility judgments on single-sentence minimal pairs and on short context-dependent scenarios.

Experiment 1: Explicit vs. Implicit Plausibility Judgments

Experiment 1 tests whether LLMs can distinguish plausible from implausible sentences in single-sentence minimal pairs. Both base and instruction-tuned models are scored two ways: implicitly, via the log probabilities they assign to each sentence, and explicitly, via a set of zero-shot prompts that ask for a plausibility judgment.

Figure 1: Results of implicit vs. explicit plausibility judgment performance experiments.

Datasets and Experimental Setup

The experiment uses datasets adapted from previous studies, including EventsAdapt and DTFit, which contain sentences that vary in plausibility. Human plausibility judgments serve as the benchmark for model performance. Several prompt-based methods are compared against implicit LL-based measurements on the same items.
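
As a rough illustration of the implicit LL measurement on minimal pairs, the sketch below sums per-token log probabilities into a sentence LL and counts a pair as correct when the plausible sentence scores higher. The function names and the toy numbers are assumptions made for this example; they are not taken from the paper, and obtaining real token log probabilities from a model is not shown.

```python
import math
from typing import Sequence, Tuple

# Assumption: each sentence is represented by the per-token log probabilities
# that some language model assigned to it. The scoring step itself is omitted.
TokenLogProbs = Sequence[float]


def sentence_ll(token_logprobs: TokenLogProbs) -> float:
    """Sentence log probability as the sum of its token log probabilities."""
    return math.fsum(token_logprobs)


def minimal_pair_accuracy(pairs: Sequence[Tuple[TokenLogProbs, TokenLogProbs]]) -> float:
    """Fraction of (plausible, implausible) pairs where the plausible
    sentence receives the strictly higher sentence LL."""
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(sentence_ll(p) > sentence_ll(i) for p, i in pairs)
    return correct / len(pairs)


# Made-up values for illustration only (not results from the paper).
toy_pairs = [
    # plausible LL = -5.0, implausible LL = -10.0: counted as correct
    ([-2.0, -1.5, -0.5, -1.0], [-2.0, -1.5, -0.5, -6.0]),
    # plausible LL = -6.0, implausible LL = -5.5: counted as incorrect
    ([-1.0, -3.0, -2.0], [-1.0, -3.0, -1.5]),
]
accuracy = minimal_pair_accuracy(toy_pairs)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

Comparing summed log probabilities is the simplest choice when the two sentences in a pair differ by a single word; whether and how to normalize for length or tokenization is a separate design decision that this sketch does not address.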

Key Results

LL vs. Prompting: LL scores consistently outperform prompt-based evaluations in capturing semantic plausibility, whereas direct zero-shot prompting yields inconsistent and often poor results.
Base vs. Instruction-Tuned Models: Instruction tuning generally does not change how sensitive LLs are to plausibility and sometimes decreases it, so instruction-tuned models do not match human judgments better than their base counterparts.
Human Comparison: While LLMs demonstrate above-chance performance, they fall short of human capabilities, especially in complex scenarios involving animate protagonists.

Experiment 2: Context-Dependent Plausibility

Experiment 2 extends the analysis to short context-dependent scenarios, asking whether preceding context modulates the LLs that models assign to a target sentence.

Figure 2: Target word LLs replicate patterns of human sentence sensibility judgments.

Methodology

Context sensitivity is evaluated on a dataset with three context conditions (Control, SemAnom, and Critical). The paper measures LLs for both the target word and the whole sentence to determine how context influences plausibility assessments.
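
The difference between target-word and sentence LLs can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the target position, and all numbers are hypothetical, the values say nothing about the direction of the effects reported in the paper, and the shift computed here is a plain difference from the Control condition rather than one of the paper's three metrics.

```python
import math
from typing import Dict, Sequence


def span_ll(token_logprobs: Sequence[float], start: int, end: int) -> float:
    """Sum of token log probabilities over the half-open span [start, end)."""
    if not 0 <= start < end <= len(token_logprobs):
        raise ValueError("invalid token span")
    return math.fsum(token_logprobs[start:end])


def shift_from_baseline(lls: Dict[str, float], baseline: str = "Control") -> Dict[str, float]:
    """LL of each non-baseline condition minus the LL of the baseline condition."""
    return {name: ll - lls[baseline] for name, ll in lls.items() if name != baseline}


# Hypothetical token log probabilities for the same target sentence after each
# context type. Token index 3 is assumed to be the target word.
toy = {
    "Control": [-1.0, -2.0, -0.5, -8.0, -0.5],
    "SemAnom": [-1.5, -2.0, -0.5, -7.5, -1.0],
    "Critical": [-1.0, -2.5, -0.5, -3.0, -2.0],
}
TARGET_START, TARGET_END = 3, 4

target_ll = {c: span_ll(lp, TARGET_START, TARGET_END) for c, lp in toy.items()}
sentence_ll_by_context = {c: span_ll(lp, 0, len(lp)) for c, lp in toy.items()}

target_shift = shift_from_baseline(target_ll)
sentence_shift = shift_from_baseline(sentence_ll_by_context)
print("target-word shift:", target_shift)
print("sentence shift:", sentence_shift)
```

In this toy data the other tokens add noise that partly masks the change at the target word, which is one reason a word-level measure can respond to context more sharply than a sentence-level one.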

Findings

Word vs. Sentence LLs: Target word LLs show greater modulation by context compared to entire sentences, aligning more closely with human judgment patterns.
Context Sensitivity: Models adjust plausibility judgments based on contextual cues, but their sentence-level predictions are less reliable after they encounter anomalous stimuli.

Figure 3: Replicating the sensibility-judgment task in LLMs using sentence LL measures. Human data from Jouravlev et al. (2019).

Conclusion

The paper concludes that LL is a reliable measure of semantic plausibility in LLMs and a robust metric that surpasses many prompt-based methods, particularly in base models. Instruction tuning does not reliably improve, and sometimes weakens, the alignment of LLs with human plausibility judgments. The findings suggest a continued role for LL-based evaluation in probing LLMs' implicit knowledge.

The paper highlights the need for further research on instruction tuning that preserves or improves agreement with human semantic expectations. Better context sensitivity would also make LLMs more useful in real-world settings where context is essential for accurate language understanding.