Abstract

LLMs have demonstrated remarkable capability in understanding semantics, but they often struggle with pragmatics. To demonstrate this, we release a Pragmatics Understanding Benchmark (PUB) dataset consisting of fourteen tasks across four pragmatics phenomena, namely Implicature, Presupposition, Reference, and Deixis. We curated high-quality test sets for each task, consisting of multiple-choice question answering (MCQA) items. PUB includes a total of 28k data points, 6.1k of which were created by us, with the rest adapted from existing datasets. We evaluated nine models varying in the number of parameters and type of training. Our study indicates that fine-tuning for instruction-following and chat significantly enhances the pragmatics capabilities of smaller language models, whereas for larger models the base versions perform comparably with their chat-adapted counterparts. Additionally, there is a noticeable gap between human and model performance. Furthermore, unlike humans, who perform consistently across tasks, the models vary in proficiency, with performance fluctuating with prompt hints and with task complexity within the same dataset. Overall, the benchmark aims to provide a comprehensive evaluation of LLMs' ability to handle real-world language tasks that require pragmatic reasoning.

Overview

  • The paper introduces the Pragmatics Understanding Benchmark (PUB), designed to evaluate LLMs' ability to understand language pragmatics.

  • 28,000 data entries were used to evaluate LLMs across 14 tasks involving four pragmatic phenomena: Implicature, Presupposition, Reference, and Deixis.

  • Various models, including base and chat-adapted versions, were assessed, revealing that fine-tuning for instruction-following and chat can enhance the pragmatic understanding of smaller LLMs.

  • Instruction-tuned and chat-optimized LLMs showed improvements over their base counterparts, but scale alone did not guarantee better pragmatic understanding; for larger models, base versions performed comparably to their chat-adapted equivalents.

  • The paper highlights LLMs' limitations in pragmatics relative to human understanding, indicating that stronger context-based reasoning is needed for more human-like interaction.

Introduction to Pragmatics in LLMs

The field of NLP has been revolutionized by LLMs capable of performing a wide range of language-based tasks with increasing competence. An important aspect of language understanding is pragmatics: the ability to interpret language based on context, intentions, presuppositions, and implied meanings. Although LLMs excel at understanding semantics, their ability to grasp pragmatics is not as well studied. This paper evaluates that ability by introducing the Pragmatics Understanding Benchmark (PUB).

Evaluating LLMs with PUB

PUB consists of 28,000 data points, specially curated for 14 tasks spanning four pragmatic phenomena: Implicature, Presupposition, Reference, and Deixis. The tasks are framed as multiple-choice question answering (MCQA), simulating real-world language use. In this comprehensive benchmark study, a wide range of models, including base and chat-adapted versions varying in size and training approach, were evaluated. The results show that fine-tuning smaller models for instruction-following and chat is effective in enhancing pragmatic understanding.
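To make the evaluation format concrete, below is a minimal, hypothetical sketch of a zero-shot MCQA evaluation loop in Python. The item fields (context, question, options, label), the prompt wording, the example dialogue, and the stub model are illustrative assumptions rather than the paper's actual data format or code; any LLM wrapper can be plugged in as the generate callable.

```python
# A minimal, hypothetical sketch of the kind of zero-shot MCQA evaluation PUB implies.
# The item fields, prompt wording, and stub "model" below are illustrative assumptions,
# not the paper's actual data format or code.

LETTERS = "ABCD"

def build_prompt(item):
    """Format a pragmatics MCQA item as a lettered multiple-choice prompt."""
    options = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(item["options"]))
    return (
        f"Dialogue:\n{item['context']}\n\n"
        f"Question: {item['question']}\n{options}\n"
        "Answer with a single letter:"
    )

def evaluate(items, generate):
    """Compute accuracy; `generate` is any callable mapping a prompt string to model text."""
    correct = 0
    for item in items:
        reply = generate(build_prompt(item)).strip().upper()
        predicted = next((ch for ch in reply if ch in LETTERS), None)
        correct += int(predicted == LETTERS[item["label"]])
    return correct / len(items)

# Tiny illustrative item (invented, not from PUB) and a stub "model".
example = {
    "context": 'A: "Are you coming to the party?"  B: "I have an exam tomorrow."',
    "question": "What does B most likely mean?",
    "options": ["B will come to the party.", "B will probably not come.",
                "B dislikes parties.", "B has no opinion."],
    "label": 1,
}
print(evaluate([example], lambda prompt: "B"))  # stub model always answers "B" -> 1.0
```

Accuracy over such items is a natural aggregate metric for comparing base and chat-adapted models on each task.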

Interpretation of Pragmatic Phenomena

The benchmark covers distinguishing indirect from direct responses, classifying responses, recovering implicatures in dialogue contexts, and several other tasks involving figurative language, such as sarcasm detection and agreement detection. The study makes it evident that instruction-tuned and chat-optimized LLMs exhibit improved pragmatic capabilities over their base counterparts. However, scale alone does not guarantee superiority in pragmatics: the larger models' base versions perform comparably to their chat-adapted equivalents.
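For illustration only, the invented items below sketch how two of these task framings might look: classifying a reply as direct versus indirect, and recovering the implicature it carries. Neither item is drawn from PUB, and the field names simply follow the hypothetical format used in the sketch above.

```python
# Invented, PUB-style items (not from the actual dataset) illustrating two task framings.

direct_vs_indirect = {
    "context": 'A: "Did you finish the report?"  B: "My laptop died last night."',
    "question": "Is B's reply a direct or an indirect answer?",
    "options": ["Direct answer", "Indirect answer"],
    "label": 1,  # B answers by implying, not by stating
}

implicature_recovery = {
    "context": 'A: "Did you finish the report?"  B: "My laptop died last night."',
    "question": 'What does B imply by "My laptop died last night."?',
    "options": ["The report is finished.", "The report is probably not finished.",
                "B bought a new laptop.", "B dislikes writing reports."],
    "label": 1,
}

# Both items can be posed to a model with the same multiple-choice template sketched earlier.
```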

Insights and Future Directions

Notwithstanding significant progress, LLMs have yet to match human-level pragmatic understanding. Human evaluators perform consistently across tasks, whereas models show varied proficiency, indicating room for improvement. One clear takeaway is the importance of context-based understanding for LLMs to support more nuanced, human-like interactions. PUB substantiates gaps in LLMs' ability to fully comprehend pragmatics and is expected to steer further research toward refining their interactive abilities, moving closer to genuine conversational understanding.
