Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools

(2405.20362)
Published May 30, 2024 in cs.CL and cs.CY

Abstract

Legal practice has witnessed a sharp rise in products incorporating AI. Such tools are designed to assist with a wide range of core legal tasks, from search and summarization of caselaw to document drafting. But the LLMs used in these tools are prone to "hallucinate," or make up false information, making their use risky in high-stakes domains. Recently, certain legal research providers have touted methods such as retrieval-augmented generation (RAG) as "eliminating" (Casetext, 2023) or "avoid[ing]" hallucinations (Thomson Reuters, 2023), or guaranteeing "hallucination-free" legal citations (LexisNexis, 2023). Because of the closed nature of these systems, systematically assessing these claims is challenging. In this article, we design and report on the first preregistered empirical evaluation of AI-driven legal research tools. We demonstrate that the providers' claims are overstated. While hallucinations are reduced relative to general-purpose chatbots (GPT-4), we find that the AI research tools made by LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) each hallucinate between 17% and 33% of the time. We also document substantial differences between systems in responsiveness and accuracy. Our article makes four key contributions. It is the first to assess and report the performance of RAG-based proprietary legal AI tools. Second, it introduces a comprehensive, preregistered dataset for identifying and understanding vulnerabilities in these systems. Third, it proposes a clear typology for differentiating between hallucinations and accurate legal responses. Last, it provides evidence to inform the responsibilities of legal professionals in supervising and verifying AI outputs, which remains a central open question for the responsible integration of AI into law.

Figure: Percentages of accurate, incomplete, and hallucinated responses; hallucination rates for direct responses; 95% confidence intervals.

Overview

  • The paper evaluates the reliability and performance of leading AI-driven legal research tools, including Lexis+ AI and Westlaw's AI-Assisted Research, finding that their accuracy and hallucination rates vary considerably.

  • Key findings include hallucination rates between 17% and 33% across the proprietary tools evaluated, and performance that varies with query type, raising concerns about responsible integration into legal practice.

  • The study urges caution and rigorous verification for lawyers using these tools and highlights the regulatory and ethical challenges for AI developers, emphasizing the need for transparent benchmarking and ongoing evaluation.

The paper "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools" provides a rigorous empirical evaluation of AI-driven legal research tools. The study evaluates proprietary tools from LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) and compares them against GPT-4. Despite vendor claims of being "hallucination-free" or of significantly minimizing hallucinations, the study shows these claims to be overstated: the tools exhibit varying degrees of hallucination and accuracy.

Key Findings

  1. Hallucination Rates: Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law AI each hallucinate between 17% and 33% of the time. Despite their retrieval-augmented generation (RAG) pipelines, these tools still make false or misleading statements (a sketch of how such rates and their confidence intervals can be computed follows this list).
  2. Accuracy and Responsiveness: Lexis+ AI was found to be the most accurate of the tools evaluated, providing correct and grounded answers to 65% of the queries. Westlaw's AI-Assisted Research (AI-AR) demonstrated a lower accuracy of 42% but tended to produce longer, more detailed answers.
  3. Performance Variability: Across all systems, performance varied substantially with query type. General legal research questions, jurisdiction- or time-specific inquiries, false-premise questions, and factual recall questions produced different rates of hallucination and accuracy.
  4. Legal Profession Implications: The tools' hallucination rates and the variability in their outputs pose challenges for responsible integration into legal practice. Lawyers must verify AI-generated responses carefully, adhering to professional ethical standards such as competence and supervision.
  5. Legal AI Developers' Challenges: Providers must balance economic pressures against legal and regulatory obligations. Potential tort liability and deceptive-practice allegations underscore the need for precise marketing and thorough validation of claimed capabilities.
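
As a rough illustration of how per-system rates like these and their 95% confidence intervals can be computed, the sketch below tallies labeled responses and applies a Wilson score interval. The label counts are made-up placeholders (202 queries in total, to match the dataset size), not the paper's data.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score confidence interval for a proportion."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - margin, center + margin)

# Placeholder labels -- NOT the paper's actual data, just an illustrative split.
labels = ["accurate"] * 130 + ["incomplete"] * 30 + ["hallucinated"] * 42

n = len(labels)
for outcome in ("accurate", "incomplete", "hallucinated"):
    k = labels.count(outcome)
    low, high = wilson_interval(k, n)
    print(f"{outcome:>12}: {k / n:.1%}  (95% CI {low:.1%}-{high:.1%})")
```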

Methodology

The researchers developed a benchmark dataset composed of 202 legal queries, organized into categories addressing general legal research, jurisdictional/time-specific questions, false premise queries, and factual recall questions. Using a systematic protocol, they evaluated the correctness and groundedness of AI outputs. Correctness assessed factual accuracy and relevance, while groundedness evaluated the validity and applicability of legal citations provided.
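
To make the evaluation protocol concrete, here is a minimal sketch of how a correctness/groundedness typology might be encoded. The label names and decision rules are a simplified assumption, not the paper's exact definitions.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    responsive: bool  # did the tool actually answer the question?
    correct: bool     # are its factual and legal statements accurate?
    grounded: bool    # do the cited sources actually support those statements?

def classify(j: Judgment) -> str:
    """Map human judgments onto a simplified response typology.

    Simplified rule (an assumption, not the paper's exact scheme): a
    non-answer is 'incomplete'; an answer that is both correct and grounded
    is 'accurate'; everything else is treated as a 'hallucination'.
    """
    if not j.responsive:
        return "incomplete"
    if j.correct and j.grounded:
        return "accurate"
    return "hallucination"

# Example: a fluent, plausible answer whose citation does not support its claim.
print(classify(Judgment(responsive=True, correct=True, grounded=False)))  # hallucination
```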

Implications

For Legal Practitioners

Lawyers integrating AI tools into their practice must thoroughly vet and cross-reference AI-generated output to ensure compliance with ethical standards such as those set out in the ABA's Model Rules of Professional Conduct. The persistent risk of hallucination necessitates a cautious approach, potentially undermining the efficiency gains these tools promise.
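
As a hedged illustration of what such cross-referencing can look like in code, the sketch below checks citations extracted from an AI answer against a local allowlist of verified authorities and flags anything unmatched for manual review. The allowlist, the example citations, and the exact-string matching are all simplifying assumptions; a real workflow would rely on a citator or research database rather than a hard-coded set.

```python
# Hypothetical stand-in for a real citator or research-database lookup.
verified_citations = {
    "Brown v. Board of Education, 347 U.S. 483 (1954)",
    "Marbury v. Madison, 5 U.S. 137 (1803)",
}

def flag_unverified(cited: list[str]) -> list[str]:
    """Return the citations that could not be matched to a verified authority."""
    return [c for c in cited if c not in verified_citations]

# Citations pulled from an AI-generated answer (extraction step not shown).
ai_citations = [
    "Brown v. Board of Education, 347 U.S. 483 (1954)",
    "Smith v. Jones, 999 U.S. 111 (2031)",  # suspicious: future date, unknown case
]

for cite in flag_unverified(ai_citations):
    print("Verify manually before relying on:", cite)
```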

For AI Developers

Developers face dual pressures: competitive commercialization on one side and stringent legal and ethical standards on the other. Misrepresenting a tool's capabilities can lead to substantial legal repercussions, including liability under the Lanham Act and potential tort claims. Transparent benchmarks and empirical evidence of performance are essential to mitigate these risks and build trust in AI applications.

Future Speculation

Future research and development in AI legal tools are likely to focus on minimizing hallucinations further through enhanced RAG systems and more sophisticated retrieval techniques. Ongoing empirical evaluations and public benchmarks will be crucial in tracking progress. The dichotomy between economic pressures and legal integrity will continue to shape the landscape of AI in legal practice.
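
As a rough, non-authoritative sketch of what a retrieval-augmented pipeline looks like, the code below retrieves the passages most similar to a query and asks a model to answer only from them. The embed and generate functions are hypothetical placeholders for a real embedding model and LLM; nothing here reflects how the evaluated commercial products are actually built.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a real text-embedding model (deterministic toy vectors)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def generate(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    return "<model answer constrained to the retrieved passages>"

def rag_answer(query: str, corpus: list[str], k: int = 3) -> str:
    """Retrieve the k passages most similar to the query, then generate from them."""
    q = embed(query)

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)
    context = "\n\n".join(ranked[:k])
    prompt = (
        "Answer the legal research question using ONLY the passages below, "
        "and cite the passage you rely on.\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)

# Toy usage with placeholder caselaw snippets.
corpus = ["Snippet from Case A ...", "Snippet from Case B ...", "Snippet from Case C ..."]
print(rag_answer("What is the standard for summary judgment?", corpus))
```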

Conclusion

Despite advancements in RAG systems, legal AI tools still demonstrate significant hallucination rates. These findings underscore the critical need for rigorous empirical scrutiny and transparent benchmarking in the development and deployment of AI in high-stakes domains like law. As the field progresses, close collaboration between AI developers, legal professionals, and regulatory bodies will be essential in ensuring both innovation and responsibility.
