Large Language Models Reflect Human Citation Patterns with a Heightened Citation Bias (2405.15739v3)
Abstract: Citation practices are crucial in shaping the structure of scientific knowledge, yet they are often influenced by contemporary norms and biases. The emergence of LLMs introduces a new dynamic to these practices. Yet the characteristics and potential biases of references recommended by LLMs that rely entirely on their parametric knowledge, rather than on search or retrieval-augmented generation, remain unexplored. Here, we analyze these characteristics in an experiment using a dataset of papers from AAAI, NeurIPS, ICML, and ICLR published after GPT-4's knowledge cut-off date. In the experiment, LLMs are tasked with suggesting scholarly references for the anonymized in-text citations within these papers. Our findings reveal a remarkable similarity between human and LLM citation patterns, but with a more pronounced bias toward highly cited works, which persists even after controlling for publication year, title length, number of authors, and venue. The results hold both for GPT-4 and for the more capable models GPT-4o and Claude 3.5, for which the papers are part of the training data. Additionally, we observe strong consistency between the characteristics of the existing and the non-existent references generated by the LLMs, indicating the models' internalization of citation patterns. By analyzing citation graphs, we show that the recommended references are embedded in the relevant citation context, suggesting an even deeper conceptual internalization of the citation networks. While LLMs can aid in citation generation, they may also amplify existing biases, such as the Matthew effect, and introduce new ones, potentially skewing scientific knowledge dissemination.
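To make the setup concrete, below is a minimal sketch of the two steps the abstract describes: prompting a model to fill in an anonymized in-text citation from parametric knowledge alone, and checking how close a suggested reference sits to the citing paper in a citation graph. The prompt wording, the `[CITATION]` masking token, and the toy graph are illustrative assumptions, not the authors' exact protocol; only the OpenAI chat completions API and `networkx` calls are standard.

```python
# Minimal sketch of the experimental setup described in the abstract.
# Assumptions (not the authors' exact protocol): the prompt wording,
# the "[CITATION]" masking token, and the toy citation graph below.
from openai import OpenAI
import networkx as nx

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: ask the model to fill in an anonymized in-text citation,
# relying purely on parametric knowledge (no search, no retrieval).
excerpt = (
    "Chain-of-thought prompting has been shown to improve multi-step "
    "reasoning in large language models [CITATION]."
)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": ("You suggest scholarly references. Given a passage "
                     "with a masked citation, return the single reference "
                     "(authors, title, venue, year) that best fits "
                     "[CITATION].")},
        {"role": "user", "content": excerpt},
    ],
    temperature=0,
)
suggested = response.choices[0].message.content
print(suggested)

# Step 2: check whether a suggested (and existing) reference is embedded
# in the citing paper's citation neighborhood, here via shortest-path
# distance in a toy directed citation graph (paper -> cited paper).
G = nx.DiGraph()
G.add_edges_from([
    ("citing_paper", "ref_A"),
    ("citing_paper", "ref_B"),
    ("ref_A", "suggested_ref"),  # suggested_ref is two hops away
])
dist = nx.shortest_path_length(G.to_undirected(),
                               "citing_paper", "suggested_ref")
print(f"graph distance to suggested reference: {dist}")
```

In a full study, each suggestion would be resolved against a bibliographic database (e.g., Semantic Scholar) to determine whether it exists and to record its citation count, and distances would be computed on the real citation graph rather than a toy one.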