Few-Shot Detection of Machine-Generated Text using Style Representations (2401.06712v3)
Abstract: The advent of instruction-tuned LLMs that convincingly mimic human writing poses a significant risk of abuse. However, such abuse may be counteracted by the ability to detect whether a piece of text was composed by an LLM rather than a human author. Some previous approaches to this problem rely on supervised methods, training on corpora of confirmed human- and machine-written documents. Unfortunately, model under-specification poses an unavoidable challenge for neural network-based detectors, making them brittle in the face of data shifts, such as the release of newer LLMs producing still more fluent text than the models used to train the detectors. Other approaches require access to the models that may have generated a document in question, which is often impractical. In light of these challenges, we pursue a fundamentally different approach that does not rely on samples from the LLMs of concern at training time. Instead, we propose to leverage representations of writing style estimated from human-authored text. Indeed, we find that features effective at distinguishing among human authors are also effective at distinguishing human from machine authors, including state-of-the-art LLMs such as Llama-2, ChatGPT, and GPT-4. Furthermore, given a handful of examples composed by each of several specific LLMs of interest, our approach affords the ability to predict which model generated a given document. The code and data to reproduce our experiments are available at https://github.com/LLNL/LUAR/tree/main/fewshot_iclr2024.
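The approach sketched in the abstract is few-shot: embed documents with a style representation trained to tell human authors apart, then compare a query document against a handful of reference documents from each candidate author, whether human or a specific LLM. The snippet below is a minimal illustration of that setup under stated assumptions, not the paper's implementation: it uses a generic sentence-embedding model as a stand-in for the style encoder released in the repository above, and a simple nearest-prototype rule (mean embedding per candidate, cosine similarity) as the few-shot classifier.

```python
# Minimal few-shot attribution sketch. Assumptions: the encoder below is a
# generic sentence-embedding model standing in for a style representation,
# and the few-shot rule is nearest-prototype under cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder, not the paper's style model

def prototype(texts):
    """Mean of L2-normalized embeddings for a handful of example documents."""
    emb = encoder.encode(texts, normalize_embeddings=True)
    proto = emb.mean(axis=0)
    return proto / np.linalg.norm(proto)

# A few reference documents per candidate source (hypothetical placeholders).
support = {
    "human": ["example human-written document 1", "example human-written document 2"],
    "llm_of_interest": ["example document generated by the LLM", "another generated example"],
}
prototypes = {name: prototype(texts) for name, texts in support.items()}

def predict(document):
    """Attribute the document to the candidate with the most similar prototype."""
    query = encoder.encode([document], normalize_embeddings=True)[0]
    return max(prototypes, key=lambda name: float(query @ prototypes[name]))

print(predict("a query document of unknown origin"))
```

With normalized embeddings the dot product equals cosine similarity, so the rule reduces to picking the candidate whose few examples lie, on average, closest in style space; thresholding that similarity rather than taking the arg-max gives a binary human-vs-machine detector.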
- OpenAI. 2023. ChatGPT API “gpt-3.5-turbo”. Available at: https://api.openai.com/v1/chat/completions.
- Nicholas Andrews and Marcus Bishop. 2019. Learning invariant representations of social media users. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1684–1695.
- Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The Pushshift Reddit Dataset. In Proceedings of the 14th International AAAI Conference on Web and Social Media (ICWSM), volume 14, pages 830–839.
- Tom Brown et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33.
- Hyung Won Chung et al. 2022. Scaling instruction-finetuned language models.
- Mike Conover et al. 2023. Free Dolly: Introducing the world’s first truly open instruction-tuned LLM. Databricks blog.
- Liam Dugan et al. 2020. RoFT: A tool for evaluating human detection of machine-generated text.
- Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks.
- Matthias Gallé et al. 2021. Unsupervised and distributional detection of machine-generated text. arXiv preprint arXiv:2111.02878.
- Sebastian Gehrmann, Hendrik Strobelt, and Alexander M. Rush. 2019. GLTR: Statistical detection and visualization of generated text. arXiv preprint arXiv:1906.04043.
- Tilmann Gneiting and Adrian E. Raftery. 2007. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.
- Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR.
- Julian Hazell. 2023. Large language models can be used to effectively scale spear phishing campaigns. arXiv preprint arXiv:2305.06972.
- Xuanli He et al. 2022. CATER: Intellectual property protection on text generation APIs via conditional watermarks.
- Daphne Ippolito et al. 2020. Automatic detection of generated text is easiest when humans are fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1808–1822.
- Ganesh Jawahar et al. 2020. Automatic detection of machine generated text: A critical survey. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2296–2309, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- John Kirchenbauer et al. 2023. A watermark for large language models. arXiv preprint arXiv:2301.10226.
- Kalpesh Krishna et al. 2023. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. arXiv preprint arXiv:2303.13408.
- Antonis Maronikolakis et al. 2020. Identifying automatically generated headlines using transformers. arXiv preprint arXiv:2009.13375.
- Leland McInnes, John Healy, and James Melville. 2018. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
- Eric Mitchell et al. 2023. DetectGPT: Zero-shot machine-generated text detection using probability curvature.
- Niklas Muennighoff et al. 2023. Crosslingual generalization through multitask finetuning.
- Eric Nalisnick et al. 2018. Do deep generative models know what they don’t know? arXiv preprint arXiv:1810.09136.
- Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197.
- OpenAI. 2023. GPT-4 technical report.
- Ajay Patel et al. 2023. Low-resource authorship style transfer: Can non-famous authors be imitated?
- John Platt et al. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74.
- Alec Radford et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- Emily Reif et al. 2022. A recipe for arbitrary text style transfer with large language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 837–848, Dublin, Ireland. Association for Computational Linguistics.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Rafael A. Rivera-Soto et al. 2021. Learning universal authorship representations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 913–919, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Vinu Sankar Sadasivan et al. 2023. Can AI-generated text be reliably detected?
- Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems, 30.
- Irene Solaiman et al. 2019. Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203. Detector available at: https://huggingface.co/roberta-base-openai-detector.
- Hugo Touvron et al. 2023. Llama 2: Open foundation and fine-tuned chat models.
- Yuxia Wang et al. 2023. M4: Multi-generator, multi-domain, and multi-lingual black-box machine-generated text detection.
- Anna Wegmann et al. 2022. Same author or just same topic? Towards content-independent style representations. In Proceedings of the 7th Workshop on Representation Learning for NLP, pages 249–268. Association for Computational Linguistics.
- Laura Weidinger et al. 2022. Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 214–229.
- Rowan Zellers et al. 2019. Defending against neural fake news. Advances in Neural Information Processing Systems, 32.
- Susan Zhang et al. 2022. OPT: Open pre-trained transformer language models.