Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation

Published 21 Feb 2024 in cs.CL, cs.AI, and cs.LG (arXiv:2402.14874v2)

Abstract: We propose a straightforward approach called Distillation Contrastive Decoding (DCD) to enhance the reasoning capabilities of LLMs during inference. In contrast to previous approaches that relied on smaller amateur models or analysis of hidden state differences, DCD employs Contrastive Chain-of-thought Prompting and advanced distillation techniques, including Dropout and Quantization. This approach effectively addresses the limitations of Contrastive Decoding (CD), which typically requires both an expert and an amateur model, thus increasing computational resource demands. By integrating contrastive prompts with distillation, DCD obviates the need for an amateur model and reduces memory usage. Our evaluations demonstrate that DCD significantly enhances LLM performance across a range of reasoning benchmarks, surpassing both CD and existing methods in the GSM8K and StrategyQA datasets.

Summary

The paper presents a novel Distillation Contrastive Decoding (DCD) method that integrates contrastive chain-of-thought prompts with distillation techniques to remove the need for a separate amateur model.
DCD improves Llama2 performance by up to 5.9% on commonsense reasoning (StrategyQA) and by up to 3.79% on arithmetic reasoning (GSM8K).
DCD utilizes controlled dropout and quantization to reduce computational resource demands while remaining adaptable across various LLM architectures.

The paper, "Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation," presents a novel method termed Distillation Contrastive Decoding (DCD) aimed at enhancing reasoning abilities in LLMs. The proposed approach addresses the limitations associated with traditional Contrastive Decoding (CD), which typically relies on both an expert and a smaller, amateur model for effective inference. This reliance not only increases computational demands but also presents challenges when relatively smaller models of the same architecture are unavailable.

Key Contributions:

Distillation Contrastive Decoding (DCD): Unlike traditional CD methods that require the dual loading of models, DCD leverages Contrastive Chain-of-thought (CoT) prompts combined with distillation techniques, such as dropout and quantization, to obtain amateur reasoning information. This integration eliminates the need for a distinct amateur LLM while maintaining or enhancing performance, significantly reducing memory usage during inference.
Improved Reasoning Benchmarks: The method demonstrates superior performance across various reasoning benchmarks, significantly outperforming both CD and Chain-of-thought Prompting (CP) methods. Specifically, on arithmetic reasoning tasks (GSM8K), DCD boosts the performance of Llama2 models by as much as 3.79% and exceeds CD by 1.89%. In commonsense reasoning tasks (StrategyQA), DCD surpasses traditional methods and enhances Llama2 models' performance by up to 5.9%.
Methodology and Abstraction: DCD does not rely on the availability of specific amateur models, making it highly adaptable across different model architectures, including Llama2, Mistral-7B, and DeepSeek-7B. The method capitalizes on distillation techniques to simulate smaller models internally, which not only achieves effective reasoning but also efficiently utilizes computational resources.
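
The contrastive step described in these contributions can be illustrated with a minimal Python sketch. It is not taken from the paper: this summary does not state the scoring formula, so the sketch assumes the commonly used contrastive decoding form, (1 + beta) * log p_expert - beta * log p_amateur, with an alpha plausibility cutoff on the expert distribution. The function names, the alpha and beta defaults, and the toy logits are hypothetical. In DCD, both logit vectors would come from the same model, with the amateur pass run under contrastive prompts plus dropout or quantization.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(expert_logits, amateur_logits, alpha=0.1, beta=0.5):
    """Score each token by contrasting expert and amateur log-probabilities.

    Assumed formulation (not confirmed by the summary):
      score = (1 + beta) * log p_expert - beta * log p_amateur
    Tokens with p_expert < alpha * max(p_expert) are masked out.
    """
    exp_lp = log_softmax(expert_logits)
    ama_lp = log_softmax(amateur_logits)
    cutoff = math.log(alpha) + max(exp_lp)  # plausibility threshold
    scores = []
    for e, a in zip(exp_lp, ama_lp):
        if e < cutoff:
            scores.append(float("-inf"))  # implausible under the expert
        else:
            scores.append((1 + beta) * e - beta * a)
    return scores


def pick_token(expert_logits, amateur_logits, alpha=0.1, beta=0.5):
    """Greedy choice over the contrastive scores."""
    scores = contrastive_scores(expert_logits, amateur_logits, alpha, beta)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary of three tokens (made-up logits).
expert = [2.0, 1.9, -5.0]   # expert slightly prefers token 0
amateur = [2.0, 0.0, 3.0]   # amateur also likes token 0, and token 2 most
print(pick_token(expert, amateur))  # 1
```

In this toy case the expert alone would pick token 0, but because the amateur pass also rates token 0 highly, the contrast shifts the choice to token 1. Token 2 is excluded by the plausibility cutoff, so the amateur's preference for it has no effect.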

Technical Insights:

Contrastive Decoding Limitations: Traditional CD's dependency on an amateur model presents both logistical and computational challenges, especially when a smaller variant of the expert's architecture is unavailable or impractical to load alongside a large expert model.
Contrastive CoT Prompting: DCD leverages various forms of contrastive CoT design, including both correct and incorrect reasoning exemplars, to enhance logical task performance by minimizing inference errors.
Distillation Techniques: The dropout rate applied during inference strongly affects results. Experiments show that a moderate rate, between 0.2 and 0.4, generally yields the best results for both arithmetic and commonsense tasks.
Performance Correlation: The paper notes a correlation between high scores on tasks such as MMLU and the enhancement provided by DCD, indicating that models with a strong foundational knowledge base particularly benefit from the DCD approach.
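
The prompting and dropout ideas in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the summary does not describe the prompt format or where dropout is applied, so the sketch assumes the expert pass sees correct rationales, the amateur pass sees flawed ones, and dropout is the standard inverted form applied to a vector of activations. The names `build_prompts` and `inference_dropout` and the exemplar text are hypothetical.

```python
import random


def build_prompts(question, exemplars):
    """Build an expert prompt and an amateur prompt for the same question.

    exemplars: list of (question, correct_rationale, flawed_rationale).
    Assumption: correct rationales go to the expert pass and flawed
    rationales go to the amateur pass.
    """
    expert = "".join(f"Q: {q}\nA: {good}\n\n" for q, good, _ in exemplars)
    amateur = "".join(f"Q: {q}\nA: {bad}\n\n" for q, _, bad in exemplars)
    tail = f"Q: {question}\nA:"
    return expert + tail, amateur + tail


def inference_dropout(hidden, rate, rng):
    """Inverted dropout kept active at inference time.

    Each activation is zeroed with probability `rate`; survivors are
    rescaled by 1 / (1 - rate) so the expected value is unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [h / keep if rng.random() >= rate else 0.0 for h in hidden]


# Made-up exemplar for illustration only.
exemplars = [(
    "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "He starts with 3 and adds 2, so 3 + 2 = 5. The answer is 5.",
    "He starts with 3 and adds 2, so 3 * 2 = 6. The answer is 6.",
)]
expert_prompt, amateur_prompt = build_prompts("What is 4 + 7?", exemplars)

# Weaken the amateur pass with a rate from the 0.2 to 0.4 range reported above.
rng = random.Random(0)
noisy = inference_dropout([0.5, -1.2, 0.8, 2.0], rate=0.3, rng=rng)
```

In a real model, dropout would be enabled on the network's own layers for the amateur forward pass rather than applied to a plain list; the list form only shows the arithmetic.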

In conclusion, this work advances both the effectiveness and the efficiency of LLM reasoning, offering a practical answer to the constraints of traditional Contrastive Decoding. By eliminating the need for an external amateur model and reducing resource demands, DCD emerges as a viable strategy for improving LLM performance on logic and reasoning tasks. Further research may explore applying DCD to more complex reasoning scenarios and to larger model architectures.
