Large Language Models for Data Annotation and Synthesis: A Survey (2402.13446v3)
Abstract: Data annotation and synthesis generally refer to the labeling or generation of raw data with relevant information, which can be used to improve the efficacy of machine learning models. The process, however, is labor-intensive and costly. The emergence of advanced LLMs, exemplified by GPT-4, presents an unprecedented opportunity to automate the complicated process of data annotation and synthesis. While existing surveys have extensively covered LLM architecture, training, and general applications, we uniquely focus on their specific utility for data annotation. This survey contributes to three core aspects: LLM-Based Annotation Generation, LLM-Generated Annotations Assessment, and LLM-Generated Annotations Utilization. Furthermore, it includes an in-depth taxonomy of the data types that LLMs can annotate, a comprehensive review of learning strategies for models that utilize LLM-generated annotations, and a detailed discussion of the primary challenges and limitations of using LLMs for data annotation and synthesis. Serving as a key guide, this survey aims to help researchers and practitioners explore the potential of the latest LLMs for data annotation, thereby fostering future advancements in this critical field.