Prompt Design Matters for Computational Social Science Tasks but in Unpredictable Ways (2406.11980v1)

Published 17 Jun 2024 in cs.AI and cs.CY

Abstract: Manually annotating data for computational social science tasks can be costly, time-consuming, and emotionally draining. While recent work suggests that LLMs can perform such annotation tasks in zero-shot settings, little is known about how prompt design impacts LLMs' compliance and accuracy. We conduct a large-scale multi-prompt experiment to test how model selection (ChatGPT, PaLM2, and Falcon7b) and prompt design features (definition inclusion, output type, explanation, and prompt length) impact the compliance and accuracy of LLM-generated annotations on four CSS tasks (toxicity, sentiment, rumor stance, and news frames). Our results show that LLM compliance and accuracy are highly prompt-dependent. For instance, prompting for numerical scores instead of labels reduces all LLMs' compliance and accuracy. The overall best prompting setup is task-dependent, and minor prompt changes can cause large changes in the distribution of generated labels. By showing that prompt design significantly impacts the quality and distribution of LLM-generated annotations, this work serves as both a warning and practical guide for researchers and practitioners.

Summary

  • The paper shows that variations in prompt design significantly affect LLM compliance and accuracy across multiple CSS tasks.
  • The study employs a multifactorial experiment with three LLMs and four CSS tasks, revealing notable discrepancies in model performance.
  • The findings highlight that concise prompts save costs while explanations and definitions can shift label distributions, necessitating careful prompt configuration.

Introduction

The research presented in "Prompt Design Matters for Computational Social Science Tasks but in Unpredictable Ways" examines how prompt design influences LLMs used for data annotation in computational social science (CSS). The paper systematically investigates how four prompt design choices—whether a definition is included, the requested output type, whether an explanation is requested, and prompt length—affect LLM compliance and accuracy across multiple CSS tasks.

Methodology

The experiment evaluated three LLMs—ChatGPT, PaLM2, and Falcon7b—on four diverse CSS tasks: toxicity detection, sentiment analysis, rumor stance identification, and news frame classification. A multifactorial framework was applied, generating 16 prompt variations to cover the permutations of the prompt design dimensions (Figure 1).

Figure 1: Prompt variations used in our experiments.
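To make the factorial setup concrete, the sketch below enumerates the 2×2×2×2 = 16 prompt variants implied by the four design dimensions. The fragment wording and the toxicity example are hypothetical illustrations, not the authors' actual prompts.

```python
from itertools import product

# Four binary prompt-design dimensions from the paper's factorial setup.
# The fragment wording below is illustrative, not the authors' exact prompts.
DIMENSIONS = {
    "include_definition": [False, True],   # prepend a task/label definition
    "numeric_output": [False, True],       # ask for a score instead of a label
    "request_explanation": [False, True],  # ask the model to explain its label
    "concise": [False, True],              # short vs. verbose instructions
}

def build_prompt(text, include_definition, numeric_output, request_explanation, concise):
    parts = []
    if include_definition:
        parts.append("Definition: toxicity is language that attacks or demeans a person or group.")
    task = ("Rate the toxicity of the text from 1 to 5."
            if numeric_output
            else "Label the text as 'toxic' or 'not toxic'.")
    if not concise:
        task += " Consider the full context of the message before answering."
    parts.append(task)
    if request_explanation:
        parts.append("Briefly explain your answer.")
    parts.append(f"Text: {text}")
    return "\n".join(parts)

# Enumerate all 16 combinations (2**4) of the design dimensions.
variants = [dict(zip(DIMENSIONS, combo)) for combo in product(*DIMENSIONS.values())]
assert len(variants) == 16

example = build_prompt("example message to annotate", **variants[0])
```

Each of the 16 configurations is then issued to every model for every item in a task's dataset, which is what makes the design multifactorial rather than a one-off prompt comparison.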

The authors used datasets typical of CSS research, such as SST5 for sentiment analysis and HOT for toxicity, to characterize two performance metrics—compliance and accuracy—under the varied prompt conditions.

Key Findings

Compliance and Accuracy Variability

Significant discrepancies in compliance and accuracy were observed across LLMs and prompt designs. Falcon7b showed compliance variations up to 55% on rumor stance tasks, while ChatGPT's accuracy varied by up to 14% across news framing prompts. Notably, numerical output requests degraded both compliance and accuracy. However, providing definitions improved ChatGPT's accuracy while negatively impacting PaLM2 and Falcon7b compliance (Figure 2).

Figure 2: Percentage compliance for different tasks and LLMs.
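The two metrics can be operationalized roughly as follows. The response-parsing rule here is a hedged simplification (exact-label matching), since the paper's actual parsing logic is not specified in this summary; accuracy is computed over compliant responses only, which is one reasonable convention.

```python
def parse_label(response, valid_labels):
    """Map a raw model response to a valid label, or None if non-compliant."""
    cleaned = response.strip().lower().rstrip(".")
    matches = [lab for lab in valid_labels if lab.lower() == cleaned]
    return matches[0] if matches else None

def compliance_and_accuracy(responses, gold, valid_labels):
    parsed = [parse_label(r, valid_labels) for r in responses]
    compliance = sum(p is not None for p in parsed) / len(responses)
    # Accuracy is measured only on responses that yielded a usable label.
    correct = [p == g for p, g in zip(parsed, gold) if p is not None]
    accuracy = sum(correct) / len(correct) if correct else 0.0
    return compliance, accuracy

responses = ["neutral", "I cannot assist with that.", "negative"]
gold = ["neutral", "positive", "negative"]
print(compliance_and_accuracy(responses, gold, ["positive", "neutral", "negative"]))
# -> (0.666..., 1.0): one refusal lowers compliance but not accuracy
```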

Explanation and Distribution Shifts

While explanations increased compliance, they also altered label distributions significantly, exemplified by ChatGPT annotating 34% more content as neutral when explanations were requested, which can bias research conclusions.

Figure 3: Examples demonstrating LLM noncompliance.
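One simple way to quantify the kind of distribution shift described above (e.g., the jump in "neutral" annotations when explanations are requested) is to compare per-label relative frequencies between two prompt variants. This is an illustrative check, not the paper's analysis code.

```python
from collections import Counter

def label_shift(labels_a, labels_b):
    """Per-label change in relative frequency from variant A to variant B."""
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    labels = set(freq_a) | set(freq_b)
    return {
        lab: freq_b[lab] / len(labels_b) - freq_a[lab] / len(labels_a)
        for lab in labels
    }

# A +0.34 entry for "neutral" would correspond to the 34% shift noted above.
shift = label_shift(
    ["positive", "neutral", "negative", "neutral"],   # e.g., no-explanation prompt
    ["neutral", "neutral", "neutral", "negative"],    # e.g., explanation prompt
)
```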

Cost-Effectiveness and Prompt Length

Concise prompts, while reducing input token costs, affected compliance inconsistently. ChatGPT maintained compliance with concise prompts, offering a cost advantage, whereas other models showed declines in compliance or accuracy.

Figure 4: Falcon7b's response on the same data when prompted with/without explanation.
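The cost argument for concise prompts can be made concrete with a rough token count over an annotation run. The sketch below assumes tiktoken's cl100k_base encoding as the tokenizer and uses a placeholder per-token price; the prompt templates are hypothetical.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: a GPT-style tokenizer

def input_cost(prompt_template, n_items, usd_per_1k_tokens=0.0005):  # placeholder price
    # Token count of the template is a rough proxy for per-item prompt overhead.
    tokens_per_item = len(enc.encode(prompt_template))
    return tokens_per_item * n_items * usd_per_1k_tokens / 1000

concise = "Label the text as 'toxic' or 'not toxic'.\nText: {text}"
verbose = (
    "Definition: toxicity is language that attacks or demeans a person or group.\n"
    "Read the text carefully, consider its full context, and label it as 'toxic' "
    "or 'not toxic'. Briefly explain your answer.\nText: {text}"
)

# Cost difference over a 10,000-item annotation run (illustrative numbers only).
print(input_cost(concise, 10_000), input_cost(verbose, 10_000))
```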

Discussion

This paper illustrates the complex influence of prompt design on LLM performance for CSS tasks, emphasizing the need for careful prompt selection based on task specificity and underlying model characteristics. The findings suggest that researchers should tailor prompts to balance between cost and annotation quality, using empirical evidence as guidance rather than relying solely on conventional best practices.

Understanding these variables is vital, particularly within fields where LLM-driven annotations impact social science research outcomes, potentially affecting public policy or societal interpretations.

Conclusion

The research provides critical insights into configuring LLM prompts for optimal performance in CSS applications. The variability and unpredictability observed necessitate adaptive strategies, where multiple prompt configurations may enhance robustness and reliability in annotated datasets. Future work could explore alternative prompting methods more suited to varying model capabilities and task complexities, advancing the integration of LLMs in CSS research.
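One way to act on the suggestion that multiple prompt configurations can improve robustness is to aggregate the labels produced under several prompt variants, for example by majority vote with abstention on ties. This sketch illustrates that idea; it is not a method proposed in the paper.

```python
from collections import Counter

def majority_label(labels):
    """Majority vote over labels from several prompt variants.
    None values (non-compliant responses) are ignored; ties abstain."""
    votes = Counter(lab for lab in labels if lab is not None)
    if not votes:
        return None
    (top, top_n), *rest = votes.most_common()
    if rest and rest[0][1] == top_n:
        return None  # tie: abstain rather than pick arbitrarily
    return top

# Example: three prompt variants agree on "toxic", one response was non-compliant.
print(majority_label(["toxic", "toxic", None, "toxic"]))  # -> "toxic"
```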
