Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools

(2405.20362)
Published May 30, 2024 in cs.CL and cs.CY

Abstract

Legal practice has witnessed a sharp rise in products incorporating AI. Such tools are designed to assist with a wide range of core legal tasks, from search and summarization of caselaw to document drafting. But the LLMs used in these tools are prone to "hallucinate," or make up false information, making their use risky in high-stakes domains. Recently, certain legal research providers have touted methods such as retrieval-augmented generation (RAG) as "eliminating" (Casetext, 2023) or "avoid[ing]" hallucinations (Thomson Reuters, 2023), or guaranteeing "hallucination-free" legal citations (LexisNexis, 2023). Because of the closed nature of these systems, systematically assessing these claims is challenging. In this article, we design and report on the first preregistered empirical evaluation of AI-driven legal research tools. We demonstrate that the providers' claims are overstated. While hallucinations are reduced relative to general-purpose chatbots (GPT-4), we find that the AI research tools made by LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) each hallucinate between 17% and 33% of the time. We also document substantial differences between systems in responsiveness and accuracy. Our article makes four key contributions. It is the first to assess and report the performance of RAG-based proprietary legal AI tools. Second, it introduces a comprehensive, preregistered dataset for identifying and understanding vulnerabilities in these systems. Third, it proposes a clear typology for differentiating between hallucinations and accurate legal responses. Last, it provides evidence to inform the responsibilities of legal professionals in supervising and verifying AI outputs, which remains a central open question for the responsible integration of AI into law.

Figure: Percentages of accurate, incomplete, and hallucinated responses; hallucination rates for direct responses; 95% confidence intervals.

Overview

  • The paper evaluates the reliability and performance of leading AI-driven legal research tools, including Lexis+ AI and Westlaw's AI-Assisted Research, finding that their accuracy and hallucination rates vary considerably.

  • Key findings include hallucination rates between 17% and 33% across the proprietary tools evaluated, and performance that varies with query type, raising concerns about responsible integration into legal practice.

  • The study urges caution and rigorous verification for lawyers using these tools and highlights the regulatory and ethical challenges for AI developers, emphasizing the need for transparent benchmarking and ongoing evaluation.

The paper "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools" provides a rigorous empirical evaluation of AI-driven legal research tools. The study evaluates proprietary tools from LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) and compares them against GPT-4. Despite vendor claims of being "hallucination-free" or of significantly minimizing hallucinations, the study shows these claims to be overstated: the tools exhibit varying degrees of hallucination and accuracy.

Key Findings

  1. Hallucination Rates: Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law AI each hallucinate between 17% and 33% of the time. Despite their retrieval-augmented generation (RAG) pipelines, these tools still make false or misleading statements (a sketch of how such rates and their confidence intervals can be computed follows this list).
  2. Accuracy and Responsiveness: Lexis+ AI was found to be the most accurate of the tools evaluated, providing correct and grounded answers to 65% of the queries. Westlaw's AI-Assisted Research (AI-AR) demonstrated a lower accuracy of 42% but tended to produce longer, more detailed answers.
  3. Performance Variability: Across all systems, performance varied substantially with query type. General legal research questions, jurisdiction- or time-specific inquiries, false-premise questions, and factual recall questions produced different rates of hallucination and accuracy.
  4. Legal Profession Implications: The tools' hallucination rates and the variability in their outputs pose challenges for responsible integration into legal practice. Lawyers must verify AI-generated responses carefully, adhering to professional ethical standards such as competence and supervision.
  5. Legal AI Developers' Challenges: Providers must balance economic pressures against legal and regulatory obligations. Potential tort liability and deceptive-practice allegations underscore the need for precise marketing and thorough validation of claimed capabilities.
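
As a rough illustration of how per-system rates like these and their 95% confidence intervals can be computed, the sketch below tallies labeled responses and applies a Wilson score interval. The label counts are made-up placeholders (202 queries in total, to match the dataset size), not the paper's data.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score confidence interval for a proportion."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - margin, center + margin)

# Placeholder labels -- NOT the paper's actual data, just an illustrative split.
labels = ["accurate"] * 130 + ["incomplete"] * 30 + ["hallucinated"] * 42

n = len(labels)
for outcome in ("accurate", "incomplete", "hallucinated"):
    k = labels.count(outcome)
    low, high = wilson_interval(k, n)
    print(f"{outcome:>12}: {k / n:.1%}  (95% CI {low:.1%}-{high:.1%})")
```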

Methodology

The researchers developed a benchmark dataset composed of 202 legal queries, organized into categories addressing general legal research, jurisdictional/time-specific questions, false premise queries, and factual recall questions. Using a systematic protocol, they evaluated the correctness and groundedness of AI outputs. Correctness assessed factual accuracy and relevance, while groundedness evaluated the validity and applicability of legal citations provided.
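
To make the evaluation protocol concrete, here is a minimal sketch of how a correctness/groundedness typology might be encoded. The label names and decision rules are a simplified assumption, not the paper's exact definitions.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    responsive: bool  # did the tool actually answer the question?
    correct: bool     # are its factual and legal statements accurate?
    grounded: bool    # do the cited sources actually support those statements?

def classify(j: Judgment) -> str:
    """Map human judgments onto a simplified response typology.

    Simplified rule (an assumption, not the paper's exact scheme): a
    non-answer is 'incomplete'; an answer that is both correct and grounded
    is 'accurate'; everything else is treated as a 'hallucination'.
    """
    if not j.responsive:
        return "incomplete"
    if j.correct and j.grounded:
        return "accurate"
    return "hallucination"

# Example: a fluent, plausible answer whose citation does not support its claim.
print(classify(Judgment(responsive=True, correct=True, grounded=False)))  # hallucination
```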

Implications

For Legal Practitioners

Lawyers integrating AI tools into their practice must thoroughly vet and cross-reference AI-generated output to ensure compliance with ethical standards such as those set out in the ABA's Model Rules of Professional Conduct. The persistent risk of hallucination necessitates a cautious approach, potentially undermining the efficiency gains these tools promise.
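
As a hedged illustration of what such cross-referencing can look like in code, the sketch below checks citations extracted from an AI answer against a local allowlist of verified authorities and flags anything unmatched for manual review. The allowlist, the example citations, and the exact-string matching are all simplifying assumptions; a real workflow would rely on a citator or research database rather than a hard-coded set.

```python
# Hypothetical stand-in for a real citator or research-database lookup.
verified_citations = {
    "Brown v. Board of Education, 347 U.S. 483 (1954)",
    "Marbury v. Madison, 5 U.S. 137 (1803)",
}

def flag_unverified(cited: list[str]) -> list[str]:
    """Return the citations that could not be matched to a verified authority."""
    return [c for c in cited if c not in verified_citations]

# Citations pulled from an AI-generated answer (extraction step not shown).
ai_citations = [
    "Brown v. Board of Education, 347 U.S. 483 (1954)",
    "Smith v. Jones, 999 U.S. 111 (2031)",  # suspicious: future date, unknown case
]

for cite in flag_unverified(ai_citations):
    print("Verify manually before relying on:", cite)
```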

For AI Developers

Developers face dual pressures: competitive commercialization on one side and stringent legal and ethical standards on the other. Misrepresenting a tool's capabilities can lead to substantial legal repercussions, including liability under the Lanham Act and potential tort claims. Transparent benchmarks and empirical evidence of performance are essential to mitigate these risks and build trust in AI applications.

Future Speculation

Future research and development in AI legal tools are likely to focus on minimizing hallucinations further through enhanced RAG systems and more sophisticated retrieval techniques. Ongoing empirical evaluations and public benchmarks will be crucial in tracking progress. The dichotomy between economic pressures and legal integrity will continue to shape the landscape of AI in legal practice.
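
As a rough, non-authoritative sketch of what a retrieval-augmented pipeline looks like, the code below retrieves the passages most similar to a query and asks a model to answer only from them. The embed and generate functions are hypothetical placeholders for a real embedding model and LLM; nothing here reflects how the evaluated commercial products are actually built.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a real text-embedding model (deterministic toy vectors)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def generate(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    return "<model answer constrained to the retrieved passages>"

def rag_answer(query: str, corpus: list[str], k: int = 3) -> str:
    """Retrieve the k passages most similar to the query, then generate from them."""
    q = embed(query)

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)
    context = "\n\n".join(ranked[:k])
    prompt = (
        "Answer the legal research question using ONLY the passages below, "
        "and cite the passage you rely on.\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)

# Toy usage with placeholder caselaw snippets.
corpus = ["Snippet from Case A ...", "Snippet from Case B ...", "Snippet from Case C ..."]
print(rag_answer("What is the standard for summary judgment?", corpus))
```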

Conclusion

Despite advancements in RAG systems, legal AI tools still demonstrate significant hallucination rates. These findings underscore the critical need for rigorous empirical scrutiny and transparent benchmarking in the development and deployment of AI in high-stakes domains like law. As the field progresses, close collaboration between AI developers, legal professionals, and regulatory bodies will be essential in ensuring both innovation and responsibility.
