SELF-EXPLAIN: Teaching LLMs to Reason Complex Questions by Themselves

Abstract

LLMs can generate intermediate reasoning steps. To elicit reliable reasoning, the common practice is to employ few-shot chain-of-thought (CoT) prompting, where several in-context reasoning demonstrations are prepended to the question. However, such chain-of-thought examples are expensive to craft, especially for professional domains, and can vary widely across human annotators. This work therefore investigates whether LLMs can teach themselves to reason without human-crafted demonstrations. We propose SELF-EXPLAIN, which has LLMs generate their own CoT examples, inspired by "encoding specificity" in human memory retrieval. We find that using self-explanations makes LLMs more confident, better calibrated, and less biased when answering complex questions. Moreover, prompting with self-explanations can even significantly outperform prompting with human-crafted CoTs on several complex question answering datasets.

Figure: Framework of SELF-EXPLAIN, which generates self-explanations to serve as in-context exemplars during testing.

Overview

  • The SELF-EXPLAIN framework allows LLMs to generate their own Chain-of-Thought (CoT) examples, enhancing reasoning without human-crafted inputs.

  • Experiments on datasets like MedMCQA and StrategyQA show that SELF-EXPLAIN outperforms human-crafted CoTs, improving test accuracy and model calibration.

  • The method has significant implications for high-stakes domains like healthcare, providing more reliable and accessible AI-driven decision support while reducing biases.

Teaching LLMs to Reason with SELF-EXPLAIN

The paper "SELF-EXPLAIN: Teaching LLMs to Reason Complex Questions by Themselves" by Jiachen Zhao, Zonghai Yao, Zhichao Yang, and Hong Yu makes significant strides in the domain of prompting LLMs like GPT-3.5 to generate intermediate reasoning steps without the need for human-crafted demonstrations. This goal addresses the challenges and limitations associated with the creation and application of human-crafted Chain-of-Thought (CoT) exemplars, which are traditionally employed to enhance the reasoning capabilities of LLMs.

Introduction

The study begins by recognizing that while LLMs can learn patterns from in-context exemplars, known as in-context learning (ICL), eliciting intermediate reasoning steps via CoT prompting often yields higher performance. However, designing CoT examples is labor-intensive, particularly in professional domains such as medicine where domain-specific expertise is required, and variance among human annotators can lead to inconsistencies in the crafted CoTs.
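For concreteness, the sketch below illustrates how a few-shot CoT prompt is typically assembled: each exemplar prepends a reasoning trace before its answer, followed by the test question. The exemplar content and template wording here are invented for illustration and are not taken from the paper.

```python
# Hypothetical few-shot CoT prompt construction. The exemplar and the
# template phrasing are illustrative only; the paper's prompts may differ.
exemplars = [
    {
        "question": "Which vitamin deficiency causes scurvy?",
        "rationale": "Scurvy results from impaired collagen synthesis, "
                     "which requires vitamin C as a cofactor.",
        "answer": "Vitamin C",
    },
]

def build_cot_prompt(exemplars, test_question):
    """Prepend (question, rationale, answer) exemplars to the test question."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Answer: Let's think step by step. {ex['rationale']} "
            f"So the answer is {ex['answer']}.\n"
        )
    parts.append(f"Question: {test_question}\nAnswer: Let's think step by step.")
    return "\n".join(parts)
```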

SELF-EXPLAIN Framework

The proposed method, SELF-EXPLAIN, has LLMs generate their own CoT examples, inspired by the concept of encoding specificity in human memory retrieval. SELF-EXPLAIN enables LLMs to produce explanations that make them more confident, better calibrated, and less biased when handling complex questions. The generated self-explanations serve as in-context CoTs and have demonstrated performance that even surpasses human-crafted CoTs in several knowledge-intensive domains.

The SELF-EXPLAIN framework operates in three primary stages:

  1. Generation of Self-Explanations: Given a training question and its answer, the LLM generates a CoT explanation drawing on its own encoded knowledge.
  2. In-Context Learning: These self-generated CoTs are used as exemplars for ICL during testing (see the sketch after this list).
  3. Performance Comparison: The efficacy of self-explanations is evaluated against human-crafted CoTs.
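A minimal sketch of how the first two stages could be wired together is shown below. The `call_llm` helper is a stand-in for whatever LLM API is used (the paper experiments with GPT-3.5), and the prompt templates are assumptions rather than the paper's exact wording.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for an LLM API call; replace with an actual client."""
    raise NotImplementedError

def generate_self_explanation(question: str, answer: str) -> str:
    """Stage 1: ask the model to explain a known (question, answer) pair
    from the training data, drawing only on its own encoded knowledge."""
    prompt = (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Explain step by step why this answer is correct."
    )
    return call_llm(prompt)

def answer_with_self_explanations(exemplars, test_question: str) -> str:
    """Stage 2: use the self-generated explanations as in-context CoT exemplars."""
    demos = "\n\n".join(
        f"Question: {q}\nExplanation: {expl}\nAnswer: {a}"
        for q, a, expl in exemplars
    )
    prompt = f"{demos}\n\nQuestion: {test_question}\nExplanation:"
    return call_llm(prompt)
```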

Experimental Setup and Results

The authors perform rigorous experiments on datasets that demand intricate reasoning, such as MedMCQA, MedQA, and StrategyQA. These datasets include multiple-choice questions that require deep domain knowledge and logical reasoning.

The results reported in Table 1 show that CoT prompting substantially enhances performance across the datasets. Notably, SELF-EXPLAIN achieves higher accuracy than both zero-shot CoT and Auto-CoT and even surpasses human-crafted CoTs. For instance, test accuracy on MedMCQA improved to 56.6%, compared to 53.1% with human-crafted CoTs, highlighting the potential of self-explanation.

Calibration and Bias

Another key finding is that LLMs exhibit higher confidence and are better calibrated when prompted with self-explanations. Figures 3 and 4 in the study illustrate that self-explanations reduce the intrinsic biases observed when human-crafted CoTs are used. This calibration and reduced bias could be critical in real-world applications where user trust and reliability are vital.
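Calibration here refers to how well the model's stated confidence tracks its actual accuracy. The paper's exact metric is not reproduced here, but a standard way to quantify it is the expected calibration error (ECE), sketched below under the assumption that per-answer confidences are available.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Binned ECE: the average |accuracy - confidence| gap, weighted by bin size.
    `confidences` holds the model's probabilities for its chosen answers;
    `correct` is a boolean array marking whether each answer was right."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```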

Implications

Practical Implications:

  • Healthcare and Specialized Knowledge Domains: In professional domains where expertise is scarce and expensive, SELF-EXPLAIN can significantly lower costs and improve access to high-quality AI guidance in decision-making processes.
  • Model Confidence and Bias: Implementing self-explanation prompts can lead to more reliable AI systems with better-calibrated outputs, which is crucial for user trust in high-stakes environments.

Theoretical Implications:

  • Encoding Specificity Hypothesis: The success of SELF-EXPLAIN supports the encoding specificity hypothesis, suggesting that LLMs retrieve and apply knowledge more effectively when the test-time context aligns with how that knowledge was encoded during pre-training.
  • Generalization: The work challenges the entrenched belief that human-crafted CoTs are superior, proposing that machine-generated CoTs, driven by properly framed prompts, can achieve or even surpass human-generated reasoning in certain contexts.

Future Directions

Future research could expand on several fronts:

  1. Diverse Domains and Models: Extending the SELF-EXPLAIN method to different domains and testing its efficacy on other LLMs could validate its robustness.
  2. Optimization of Prompt Generation: Refining the prompt-generation process so it accommodates varied input-output relationships across tasks.
  3. User Interaction: Investigating user interactions with LLMs using self-explanations to further understand trust and decision reliance on machine-generated insights.

In conclusion, "SELF-EXPLAIN" offers a novel approach that enables LLMs to autonomously generate intermediate reasoning steps, demonstrating not only superior performance to human-crafted CoTs but also potentially shifting the paradigm in how machine intelligence can learn and convey complex information. This paves the way for more accessible, reliable, and cost-effective AI applications across a myriad of domains.
