- The paper shows that prompt order substantially affects in-context few-shot performance, sometimes reducing accuracy to near-random levels.
- It introduces an entropy-based probing method that automatically identifies high-performing prompt permutations without requiring additional labeled data.
- Empirical results demonstrate robust improvements across various tasks and model sizes, highlighting a critical limitation in current in-context learning.
Overcoming Few-Shot Prompt Order Sensitivity in In-Context Learning
Introduction
The paper "Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity" (2104.08786) investigates a largely overlooked but impactful dimension of prompt engineering for in-context learning: the order sensitivity of few-shot prompts. The authors provide a rigorous analysis demonstrating that the ordering of training samples within a prompt can lead to drastic variations in performance, ranging from near state-of-the-art accuracy to levels indistinguishable from random guessing, even when using the same underlying model and data. They introduce a probing-based methodology leveraging the generative capabilities of LLMs to identify high-performing prompt permutations without requiring additional labeled data, achieving up to a 13% relative improvement across eleven text classification tasks.
Empirical Characterization of Prompt Order Sensitivity
Extensive experiments reveal that prompt order sensitivity is pronounced across different model sizes, task types, and dataset domains. For instance, on SST-2 sentiment classification and the Subj subjectivity dataset, the same model ranges from near-chance accuracy to performance competitive with fully supervised baselines, solely as a result of sample ordering (see Figure 1).
Figure 1: Training sample permutation performance correlation across different models, showing dramatic differences in outcome based on prompt ordering.
Increasing model size only marginally mitigates order sensitivity; even GPT-3 175B shows substantial variance for some tasks. Notably, the phenomenon persists regardless of the number of training samples and is not resolved by simply increasing data coverage. Thus, order sensitivity emerges as a persistent and fundamental challenge for in-context few-shot learning.
The authors further demonstrate that performant permutations for one model do not reliably transfer to another—even between closely related models (e.g., GPT-2 Large vs. GPT-2 XL), as evidenced by low Spearman's rank correlations (see Figure 1). Similarly, the relative ordering of label patterns (such as alternating positive/negative assignments) lacks consistency across models (see Figure 2).
Figure 2: Training label pattern permutation performance correlation across different models, highlighting lack of transferability of good label orderings.
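As a concrete illustration of the transferability check described above, the per-permutation accuracies of two models can be rank-correlated over the same set of orderings; a low coefficient means a good ordering for one model says little about the other. The sketch below uses scipy.stats.spearmanr with purely illustrative accuracy values, not numbers from the paper.

```python
# Rank-correlate per-permutation accuracies of two models over the same
# candidate orderings. The accuracy values below are purely illustrative.
from scipy.stats import spearmanr

acc_gpt2_large = [0.81, 0.52, 0.74, 0.60, 0.88, 0.55]  # one model, per ordering
acc_gpt2_xl    = [0.63, 0.79, 0.58, 0.85, 0.60, 0.77]  # another model, same orderings

rho, p_value = spearmanr(acc_gpt2_large, acc_gpt2_xl)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")  # low rho => poor transfer
```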
Error analysis shows that poor prompt permutations often yield highly skewed predicted label distributions, leading to degenerate output behaviors. Calibration techniques help but do not eliminate high variance (see Figure 3).

Figure 3: Left: Predicted SST-2 label distribution under different prompts. Right: 2-shot calibrated performance, illustrating persistent variance post-calibration.
Entropy-Based Probing for Prompt Selection
Recognizing that brute-force selection of orderings with a held-out development set is either impractical or undermines the true few-shot setting, the paper proposes an automatic methodology for identifying good prompt permutations. The approach leverages the generative properties of LLMs to create an "artificial development set." Specifically, it involves the following steps, sketched in code after the list:
- Enumerating all (or a random subset of) possible prompt orderings from the chosen few-shot samples.
- For each candidate permutation, using the LM to generate synthetic samples (probing set) reflecting the prompt's induced output distribution.
- Scoring each permutation using unsupervised metrics derived from the predicted label distribution over the probing set.
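A minimal sketch of this pipeline, assuming a Hugging Face causal LM and a simple review/sentiment template; the model name, prompt format, and sampling hyperparameters are illustrative choices rather than the paper's exact configuration.

```python
# Sketch of the probing-set construction: enumerate (or subsample) orderings of
# the few-shot examples, then let the LM generate synthetic inputs conditioned
# on each ordering. Template and hyperparameters are illustrative.
import itertools
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
model = AutoModelForCausalLM.from_pretrained("gpt2-large")

few_shot = [  # fixed few-shot training samples: (text, label)
    ("A touching and funny film.", "positive"),
    ("A tedious, joyless mess.", "negative"),
]

def render_prompt(ordering):
    """Concatenate the ordered samples into an in-context prompt."""
    return "".join(f"Review: {t}\nSentiment: {l}\n\n" for t, l in ordering)

def candidate_orderings(samples, max_perms=24):
    """All permutations, or a random subset when the factorial is too large."""
    perms = list(itertools.permutations(samples))
    return random.sample(perms, max_perms) if len(perms) > max_perms else perms

@torch.no_grad()
def generate_probing_set(ordering, n_probes=16):
    """Sample synthetic inputs from the LM conditioned on one ordering."""
    prompt = render_prompt(ordering) + "Review:"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    outputs = model.generate(
        ids,
        do_sample=True,
        top_p=0.95,
        max_new_tokens=32,
        num_return_sequences=n_probes,
        pad_token_id=tokenizer.eos_token_id,
    )
    continuations = outputs[:, ids.shape[1]:]
    # Keep only the generated text up to the next field delimiter.
    return [tokenizer.decode(c, skip_special_tokens=True).split("\n")[0].strip()
            for c in continuations]
```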
Two entropy-based metrics are introduced:
- Global Entropy (GlobalE): Measures the entropy of the predicted category distribution over the probing set for a given prompt ordering; high values indicate balanced, non-degenerate predictions.
- Local Entropy (LocalE): Computes the average entropy of individual sample predictions, penalizing overly confident or poorly calibrated outputs.
This ranking approach enables selection of prompt orders that circumvent degenerate behaviors without reliance on external labeled validation data.
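A minimal computational sketch of the two metrics follows, assuming a matrix of per-probing-sample label probabilities has already been obtained by scoring each verbalizer token under the LM for a candidate ordering; function names are illustrative.

```python
# GlobalE and LocalE over a probing set, given label_probs of shape
# (n_probes, n_labels). Numerical details may differ from the paper's code.
import numpy as np

def global_entropy(label_probs: np.ndarray) -> float:
    """Entropy of the distribution of argmax-predicted labels over the probing set."""
    preds = label_probs.argmax(axis=1)
    freqs = np.bincount(preds, minlength=label_probs.shape[1]) / len(preds)
    nonzero = freqs[freqs > 0]
    return float(-(nonzero * np.log(nonzero)).sum())

def local_entropy(label_probs: np.ndarray) -> float:
    """Average per-sample entropy of the predicted label distribution."""
    p = np.clip(label_probs, 1e-12, 1.0)
    p = p / p.sum(axis=1, keepdims=True)
    return float(-(p * np.log(p)).sum(axis=1).mean())
```

Candidate orderings are then ranked by either score, with higher entropy preferred.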
Experimental Results and Numerical Findings
Across eleven standard text classification benchmarks (including SST-2, MR, Subj, TREC, AGNews, RTE, and CB), the entropy-based probing approach yields robust improvements. GlobalE-based selection achieves, on average, a 13% relative performance gain over random prompt orderings. LocalE provides a 9.6% improvement. Importantly, prompt selection dramatically reduces prediction variance, leading to safer deployment with consistent outcomes.
The method is general: it works across model sizes from 0.1B to 175B parameters, on diverse datasets, and is robust to variations in prompt templates, with entropy-based selection outperforming both naïve random ordering and split-validation baselines regardless of template choice.
Empirical analysis further shows that top-K selection using entropy scores produces a monotonic performance increase, with K=4 yielding stable improvements (see experimental curve data in the main text). On tasks where baseline variance is high, improvements reach up to 30% relative gain. For very small models or inherently difficult tasks (e.g., RTE, CB), gains are limited; these regimes likely lack any genuinely performant few-shot prompt.
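Putting the pieces together, top-K selection can be sketched as follows, reusing the functions from the earlier sketches; score_label_probs is a hypothetical helper that scores each probing sample's verbalizer tokens under the LM and returns an (n_probes, n_labels) probability matrix for one ordering.

```python
# Rank candidate orderings by GlobalE and keep the top K (the paper reports
# K=4 as a stable choice). score_label_probs is a hypothetical helper.
def select_top_k(samples, k=4):
    scored = []
    for ordering in candidate_orderings(samples):
        probes = generate_probing_set(ordering)
        label_probs = score_label_probs(ordering, probes)  # hypothetical helper
        scored.append((global_entropy(label_probs), ordering))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ordering for _, ordering in scored[:k]]
```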
Theoretical and Practical Implications
This work establishes that prompt order sensitivity is not an idiosyncratic artifact but a fundamental limitation of current in-context learning mechanisms in large-scale generative LMs. By formalizing and mechanizing prompt selection through artificial probing sets and entropy statistics, the approach removes the dependence on labeled validation data, thus maintaining fidelity to the strict few-shot paradigm.
The demonstrated improvements are especially significant for practical few-shot NLP applications, where computational and annotation budgets preclude extensive prompt trials. The method generalizes well across models and tasks, facilitating reliable performance when scaling to real-world systems.
Theoretically, the order sensitivity challenge points to deeper issues in how LLMs summarize and utilize context, and indicates that context processing is neither permutation invariant nor made robust by scaling alone. The results advocate for future research into architectural and algorithmic modifications that endow in-context learning procedures with greater stability, or that leverage the proposed probing signals in an online or adaptive manner.
Future Directions
Potential avenues for subsequent research and development include:
- Integration of probing-based prompt selection into automated prompt optimization pipelines, coupled with reinforcement learning or Bayesian optimization for efficient exploration.
- Further decomposition of the source of order sensitivity, including layer-wise and attention-focused analyses, to understand its roots in LLM context encoding.
- Design of architectures or objectives that inherently encode permutation-invariant processing or explicitly penalize high variance across input permutations.
- Extension to multimodal and multi-hop reasoning tasks, where prompt order effects may interact with cross-modal alignment or compositionality challenges.
Conclusion
The paper rigorously demonstrates that the few-shot performance of LLMs is highly sensitive to the order in which training samples are presented within the prompt. Order sensitivity persists across model sizes, tasks, and templates, and is not resolved by adding training samples or applying calibration. By exploiting LM generation and entropy-based probing metrics, the proposed method reliably identifies performant prompt permutations without requiring additional data, achieving substantial, consistent improvements in realistic few-shot classification scenarios. This framework lays essential groundwork for robust, data-efficient, and principled prompt selection in in-context learning applications.