Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models (2401.01301v2)
Abstract: Do LLMs know the law? These models are increasingly being used to augment legal practice, education, and research, yet their revolutionary potential is threatened by the presence of hallucinations -- textual output that is not consistent with legal facts. We present the first systematic evidence of these hallucinations, documenting LLMs' varying performance across jurisdictions, courts, time periods, and cases. Our work makes four key contributions. First, we develop a typology of legal hallucinations, providing a conceptual framework for future research in this area. Second, we find that legal hallucinations are alarmingly prevalent, occurring between 58% of the time with ChatGPT 4 and 88% with Llama 2, when these models are asked specific, verifiable questions about random federal court cases. Third, we illustrate that LLMs often fail to correct a user's incorrect legal assumptions in a contra-factual question setup. Fourth, we provide evidence that LLMs cannot always predict, or do not always know, when they are producing legal hallucinations. Taken together, our findings caution against the rapid and unsupervised integration of popular LLMs into legal tasks. Even experienced lawyers must remain wary of legal hallucinations, and the risks are highest for those who stand to benefit from LLMs the most -- pro se litigants or those without access to traditional legal resources.
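The headline hallucination rates come from a simple kind of check: ask a model a specific, verifiable question about a real case and score its answer against an authoritative record. The sketch below is illustrative only and is not the authors' code; the `query_model` stub, the `CaseRecord` fields, and the containment-based scoring rule are assumptions supplied for the example, not the paper's protocol.

```python
# Minimal sketch (not the paper's code) of scoring a model's answer to a
# verifiable question about a court case against ground truth.

from dataclasses import dataclass


@dataclass
class CaseRecord:
    name: str       # e.g. case caption
    citation: str   # e.g. reporter citation
    author: str     # ground-truth author of the majority opinion


def query_model(prompt: str) -> str:
    """Placeholder: route the prompt to whatever LLM client you use."""
    raise NotImplementedError("plug in your own model client here")


def normalize(text: str) -> str:
    """Lowercase, strip periods, and collapse whitespace for a loose match."""
    return " ".join(text.lower().replace(".", "").split())


def is_hallucination(model_answer: str, ground_truth: str) -> bool:
    """Treat an answer that does not contain the true value as a hallucination."""
    return normalize(ground_truth) not in normalize(model_answer)


def hallucination_rate(cases: list[CaseRecord]) -> float:
    """Fraction of cases for which the model's answer misses the ground truth."""
    errors = 0
    for case in cases:
        prompt = (
            f"Who wrote the majority opinion in {case.name}, {case.citation}? "
            "Answer with the justice's name only."
        )
        if is_hallucination(query_model(prompt), case.author):
            errors += 1
    return errors / len(cases)
```

Run over a random sample of federal cases drawn from an authoritative source, this would yield a crude analogue of the rates quoted above, though the paper's own querying and scoring are more careful than the exact-containment rule used here.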