Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science

Published 23 May 2023 in cs.CL (arXiv:2305.14310v3)

Abstract: Instruction-tuned LLMs have exhibited impressive language understanding and the capacity to generate responses that follow specific prompts. However, due to the computational demands associated with training these models, their applications often adopt a zero-shot setting. In this paper, we evaluate the zero-shot performance of two publicly accessible LLMs, ChatGPT and OpenAssistant, in the context of six Computational Social Science classification tasks, while also investigating the effects of various prompting strategies. Our experiments investigate the impact of prompt complexity, including the effect of incorporating label definitions into the prompt; use of synonyms for label names; and the influence of integrating past memories during foundation model training. The findings indicate that in a zero-shot setting, current LLMs are unable to match the performance of smaller, fine-tuned baseline transformer models (such as BERT-large). Additionally, we find that different prompting strategies can significantly affect classification accuracy, with variations in accuracy and F1 scores exceeding 10\%.


Summary

  • The paper shows that prompt design substantially affects the zero-shot performance of LLMs, with accuracy and F1 scores varying by more than 10%, though fine-tuned models such as BERT-large still perform better overall.
  • LLMs like GPT achieved strong accuracy on tasks such as sarcasm detection and complaint recognition when using techniques like front-loading the task description and replacing label names with synonyms.
  • The study highlights potential data leakage from the models' training data and stresses the importance of robust evaluation in practical zero-shot classification applications.

This paper investigates the performance of LLMs, specifically ChatGPT and OpenAssistant, in the zero-shot classification setting within the field of computational social science. The study focuses on evaluating the efficacy of these models using various prompting strategies on six different classification tasks without task-specific fine-tuning. The analysis also compares LLMs against smaller, fine-tuned models like BERT-large to ascertain their relative performance and capabilities.

Research Questions and Methodology

The paper addresses three primary research questions. Firstly, the authors explore the level of zero-shot performance that LLMs can achieve and compare it to fine-tuned models on similar tasks. Secondly, the study examines which prompting strategies are most effective in improving the performance of LLMs in this context. Thirdly, the authors consider potential data leakage issues related to the training data of the LLMs.

To address these questions, the authors conduct experiments using GPT-3.5-turbo (GPT) and OpenAssistant-LLaMA (LLaMA-OA) models across six tasks. They utilize different prompt strategies, including basic instructions, task and label descriptions (T/L Desc), few-sample examples, and memory recall prompts. Additionally, they test replacing original labels with synonyms to assess the robustness of the models' performances.
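The prompt strategies above can be sketched as simple prompt composition. The following is an illustrative sketch, not the authors' code: the task wording, label sets, and helper names are hypothetical placeholders, and the model call itself is left out.

```python
# Sketch of composing zero-shot classification prompts under the
# strategies described above (basic instruction vs. T/L Desc).
# All wording and names here are illustrative assumptions.

def build_prompt(text, labels, task_desc=None, label_descs=None):
    """Compose a zero-shot classification prompt.

    task_desc   -- optional task description front-loaded before the input
    label_descs -- optional dict mapping each label to a short definition
    """
    parts = []
    if task_desc:
        parts.append(task_desc)                      # front-loaded task description
    if label_descs:
        parts.append("Label definitions:")
        parts.extend(f"- {lab}: {desc}" for lab, desc in label_descs.items())
    parts.append(f"Text: {text}")
    parts.append("Classify the text as one of: " + ", ".join(labels) + ".")
    return "\n".join(parts)

# Basic-instruction prompt
basic = build_prompt("Oh great, another Monday.", ["sarcastic", "not sarcastic"])

# Prompt with a front-loaded task description and label definitions (T/L Desc)
detailed = build_prompt(
    "Oh great, another Monday.",
    ["sarcastic", "not sarcastic"],
    task_desc="You are detecting sarcasm in social media posts.",
    label_descs={
        "sarcastic": "the intended meaning contradicts the literal one",
        "not sarcastic": "the text means what it literally says",
    },
)
```

Each strategy then amounts to calling the model on a different rendering of the same input, so performance differences can be attributed to the prompt alone.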

Key Findings

Zero-Shot Performance

The study shows that task-specific fine-tuned BERT-large models generally outperform the LLMs in a zero-shot setting. Nevertheless, GPT, given carefully crafted prompts, achieved strong accuracy on complex classification tasks such as sarcasm detection and complaint recognition, and it generally outperformed LLaMA-OA when optimal prompt strategies, such as front-loading the prompt with a task description, were applied.

Prompting Strategies

The choice of prompting strategy had a significant effect. Simple prompts often outperformed more complex ones, suggesting that excessive detail can dilute model performance by diverting attention from the key instructions. LLMs also benefited from carefully chosen synonyms for label names, which occasionally improved performance by avoiding overfitting to specific token patterns seen in the model's training corpus.
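The synonym-replacement experiment can be illustrated as a small label-mapping step. This is a hedged sketch: the synonym pairs below are made up for illustration and are not the ones used in the paper.

```python
# Sketch: replacing label names with synonyms before prompting, then
# mapping the model's prediction back to the original label space.
# The synonym map is a hypothetical example.

SYNONYMS = {"sarcastic": "ironic", "not sarcastic": "literal"}

def with_synonyms(labels, synonyms):
    """Return labels with each name replaced by its synonym when available."""
    return [synonyms.get(lab, lab) for lab in labels]

def map_back(prediction, synonyms):
    """Map a synonym prediction back to the original label name."""
    reverse = {v: k for k, v in synonyms.items()}
    return reverse.get(prediction, prediction)

labels = with_synonyms(["sarcastic", "not sarcastic"], SYNONYMS)
# labels is now ["ironic", "literal"]; predictions in this space are
# mapped back before scoring, so evaluation stays in the original labels.
```

Because only the label surface forms change, any resulting accuracy shift can be attributed to the model's sensitivity to the label tokens themselves.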

Data Leakage and Model Robustness

While the training datasets for LLMs such as GPT and LLaMA-OA are not fully transparent, the fact that the models can reproduce exact details of the evaluation datasets when prompted suggests some exposure to these datasets during training. Such data leakage could inflate the observed zero-shot performance, though its extent and nature remain unquantified.
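A data-leakage probe of the kind described above can be sketched as a "memory recall" prompt asking the model to reproduce dataset specifics. The wording and the dataset name below are illustrative assumptions, and the model call is left abstract.

```python
# Sketch of a memory-recall probe for training-set exposure: ask the
# model to reproduce verbatim details of a benchmark dataset. Wording
# is a hypothetical example, not the paper's exact probe.

def leakage_probe(dataset_name):
    """Build a probe prompt asking the model to recall dataset specifics."""
    return (
        f"Do you know the dataset '{dataset_name}'? "
        "If so, list its label set and quote one example instance verbatim."
    )

probe = leakage_probe("SemEval-2017 Task 8 RumourEval")
# A verbatim example or an exact label list in the model's response
# would suggest the dataset appeared in its training data.
```

Responses that reproduce exact instances are the evidence of exposure referred to above; a refusal or a vague paraphrase is weaker but not conclusive evidence of absence.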

Practical and Theoretical Implications

The paper highlights the challenges and potential of using LLMs in practical applications without task-specific training data. It underscores the importance of selecting prompt strategies carefully to maximize performance. This has implications for deploying LLMs in areas requiring scalable solutions with minimal specific annotations, such as automatic data annotation for large-scale analyses.

Moreover, this study indicates a direction for researchers seeking to employ LLMs for classification tasks in computational social science. As LLMs continue to evolve with more sophisticated architectures and training datasets, their capabilities in zero-shot learning and natural language understanding will likely progress, offering more robust out-of-the-box solutions for complex NLP tasks.

Conclusion

The paper "Navigating Prompt Complexity for Zero-Shot Classification" provides valuable insights into leveraging LLMs like GPT and OpenAssistant for zero-shot classification tasks in computational social science. While current performance levels suggest that fine-tuned models still hold an advantage in accuracy, the strategic design of prompt inputs can significantly enhance LLM performance, offering a viable alternative for applications with limited labeled data. Future work could explore integrating LLMs with other AI systems to improve robustness and the scope of zero-shot tasks further.
