
Abstract

Explanations form the foundation of knowledge sharing and build on principles of communication, social dynamics, and learning theory. We focus specifically on conversational explanations because the conversational setting is highly adaptive and interactive. Our research leverages previous work on explanatory acts, a framework for understanding the strategies that explainers and explainees employ in a conversation to explain, understand, and engage with the other party. We use the 5-Levels dataset, constructed from the WIRED YouTube series by Wachsmuth et al. and later annotated with explanatory acts by Booshehri et al. (2023). These annotations provide a framework for understanding how explainers and explainees structure their contributions when crafting a response. With the rise of generative AI over the past year, we aim to better understand the capabilities of LLMs and how they can augment expert explainers' capabilities in conversational settings. The annotated 5-Levels dataset allows us to audit the ability of LLMs to engage in explanation dialogues. To evaluate the effectiveness of LLMs in generating explainer responses, we asked human annotators to evaluate three different strategies: the human explainer's response, a standard GPT-4 response, and a GPT-4 response guided by explanatory acts.

Figure: Annotated conversation displaying span-level explanatory acts from Booshehri et al.'s fine-grained framework.

Overview

  • The paper examines the ability of LLMs, specifically GPT-4, to provide explanatory dialogues in comparison to human experts, using empirical data and structured methodologies for evaluation.

  • The study utilizes the 5-Levels dataset from WIRED's YouTube series to analyze explanatory dialogues annotated with explanatory acts, and evaluates responses from human experts and GPT-4 through human annotators' ratings on several dimensions and overall rankings.

  • Results indicate that GPT-4 often surpasses human experts in generating engaging and adaptive explanations, with significant implications for educational and science communication applications.

Evaluating the Explanation Capabilities of LLMs in Conversation Compared to a Human Baseline

The paper, titled "Is ChatGPT a Better Explainer than My Professor?: Evaluating the Explanation Capabilities of LLMs in Conversation Compared to a Human Baseline" by Grace Li, Milad Alshomary, and Smaranda Muresan, contributes to the domain of science communication by focusing on how LLMs, particularly GPT-4, perform in generating explanatory dialogues compared to human experts. This work is pivotal as it provides empirical evidence and methodologies for assessing the quality of LLM-generated explanations in conversational settings.

Background and Framework

The research targets conversational explanations due to their adaptive and interactive nature, leveraging the 5-Levels dataset from WIRED's YouTube series. This dataset, annotated with "explanatory acts", provides a rich ground for examining the diverse strategies used by explainers and explainees in dialogues. The annotations clarify how these dialogues are structured, providing a nuanced understanding of various explanatory tactics.
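To make the annotation structure concrete, here is a minimal sketch of how an annotated turn might be represented in code. The class names, field names, and act labels below are illustrative assumptions for this summary, not the actual schema released by Booshehri et al.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical representation of one annotated dialogue turn. The field names
# and example act labels are illustrative assumptions, not the dataset's
# actual schema.
@dataclass
class ExplanatoryActSpan:
    span_text: str   # the annotated span of the utterance
    act: str         # e.g. "provide_explanation", "check_understanding"

@dataclass
class DialogueTurn:
    speaker: str                                   # "explainer" or "explainee"
    text: str                                      # the full utterance
    acts: List[ExplanatoryActSpan] = field(default_factory=list)

# Example: a single explainer turn carrying two span-level annotations
turn = DialogueTurn(
    speaker="explainer",
    text="A qubit is like a coin that can be heads and tails at once. Does that make sense?",
    acts=[
        ExplanatoryActSpan("A qubit is like a coin that can be heads and tails at once.", "provide_explanation"),
        ExplanatoryActSpan("Does that make sense?", "check_understanding"),
    ],
)
```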

Methodology

The study evaluates three distinct strategies for generating explainer responses:

  1. S1: Baseline - Human expert responses extracted directly from the 5-Levels dataset.
  2. S2: GPT4 Standard - GPT-4 generated responses based solely on the previous conversational context.
  3. S3: GPT4 w/ Explanatory Acts (EAs) - GPT-4 responses constructed using the previous context and a sequence of predefined explanatory acts.

Human annotators were recruited to rate these responses on eight dimensions (coherence, conciseness, conversational nature, acknowledgment, appropriateness, depth, active guidance, and engagement) and to rank them, enabling a comprehensive evaluation.
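The following sketch shows how the two GPT-4 conditions (S2 and S3) might be prompted, assuming the OpenAI Python client; the helper function and prompt wording are illustrative assumptions, not the authors' exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_explainer_response(context_turns, explanatory_acts=None, model="gpt-4"):
    """Generate the next explainer turn.

    S2 (GPT4 Standard): pass only the prior conversation context.
    S3 (GPT4 w/ EAs):   additionally pass the sequence of explanatory acts
                        the response should follow.
    The prompt wording here is illustrative, not the paper's exact prompt.
    """
    history = "\n".join(f"{t['speaker']}: {t['text']}" for t in context_turns)
    instructions = (
        "You are an expert explaining a scientific concept in conversation. "
        "Write the explainer's next turn, adapting to the explainee."
    )
    if explanatory_acts:
        instructions += (
            " Structure your response according to this sequence of "
            "explanatory acts: " + ", ".join(explanatory_acts) + "."
        )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": history},
        ],
    )
    return response.choices[0].message.content

# S2: context only
# s2 = generate_explainer_response(turns)
# S3: context plus a predefined explanatory-act sequence, e.g.:
# s3 = generate_explainer_response(turns, ["acknowledge", "provide_example", "check_understanding"])
```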

Results

The inter-annotator agreement was measured using Krippendorff's alpha for the rating dimensions and Kendall's Tau for the ranking agreement (a sketch of how these measures are computed appears after the list below). Notably, the study found:

  • S2: GPT4 Standard was preferred in 49% of the instances, outperforming both the human baseline and the explanatory-acts-aided approach in terms of overall perceived quality.
  • S3: GPT4 w/ EAs demonstrated value in deepening conversations and maintaining engagement, but it was less preferred because some of its responses were verbose and complex.
  • S1: Baseline received the lowest preference, primarily due to a perceived lack of engagement and adaptability compared to the LLM-generated responses.
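As a rough illustration of how such agreement figures can be computed, here is a sketch using the krippendorff and scipy packages; the toy ratings and rankings are invented for illustration and are not the paper's data.

```python
import krippendorff
import numpy as np
from scipy.stats import kendalltau

# Toy example: 3 annotators rating the same 5 responses on one dimension
# (ordinal 1-5 scale). These numbers are made up for illustration.
ratings = np.array([
    [4, 3, 5, 2, 4],   # annotator 1
    [4, 2, 5, 3, 4],   # annotator 2
    [5, 3, 4, 2, 3],   # annotator 3
])
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")

# Ranking agreement between two annotators over the same 5 responses
rank_a = [1, 3, 2, 5, 4]
rank_b = [2, 3, 1, 5, 4]
tau, p_value = kendalltau(rank_a, rank_b)

print(f"Krippendorff's alpha: {alpha:.2f}, Kendall's tau: {tau:.2f}")
```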

Discussion

The paper's results indicate that LLMs, particularly GPT-4, are capable of not only matching but sometimes exceeding human performance in generating conversational explanations. The effectiveness of LLMs in this context has significant implications:

  1. Practical Implications: LLMs could serve as valuable tools in augmenting education and science communication. They can provide scalable, adaptive, and engaging explanations, potentially bridging the gap between experts and laypersons.
  2. Theoretical Implications: The ability of LLMs to follow structured explanatory acts suggests that LLMs possess an advanced understanding of conversational dynamics. This offers a pathway for future research to explore and refine strategies that optimize LLM performance in educational settings.

Future Directions

The results from this study pave the way for several future research directions. These include:

  • Enhanced Personalization: Developing frameworks that automatically tailor explanations based on the explainee’s preferences and background.
  • Integrative Systems: Exploring hybrid models where human experts and LLMs collaboratively generate explanations, leveraging the strengths of both.
  • Deeper Evaluation Criteria: Expanding evaluation metrics to include longer-term impacts on learning and understanding, as well as exploring the role of multimodal inputs in explanatory dialogues.

Through rigorous experimentation and detailed analysis, this paper makes a substantial contribution to the understanding of how LLMs can enhance science communication. By addressing both the strengths and shortcomings of current LLM capabilities, it lays a robust groundwork for further advancements in the field of AI-driven education and communication systems.
