
LegalLens: Leveraging LLMs for Legal Violation Identification in Unstructured Text (2402.04335v1)

Published 6 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: In this study, we focus on two main tasks, the first for detecting legal violations within unstructured textual data, and the second for associating these violations with potentially affected individuals. We constructed two datasets using LLMs which were subsequently validated by domain expert annotators. Both tasks were designed specifically for the context of class-action cases. The experimental design incorporated fine-tuning models from the BERT family and open-source LLMs, and conducting few-shot experiments using closed-source LLMs. Our results, with an F1-score of 62.69% (violation identification) and 81.02% (associating victims), show that our datasets and setups can be used for both tasks. Finally, we publicly release the datasets and the code used for the experiments in order to advance further research in the area of legal NLP.


Summary

  • The paper introduces LegalLens, a system that leverages GPT-4 generated datasets and expert validation to detect legal violations and associate them with affected individuals.
  • The authors generated specialized NER and NLI datasets using explicit and implicit prompting, then fine-tuned various BERT-based models and LLMs, achieving an F1 score of up to 62.69% for violation detection.
  • The study reveals LLMs’ advantages in low-data NLI tasks while highlighting challenges like misclassification and contextual errors, suggesting avenues for further research.

This paper introduces LegalLens, a system designed to detect legal violations in unstructured text and associate these violations with affected individuals. The authors address the limitations of existing domain-specific models by creating two new datasets using GPT-4, validated by legal experts, for Named Entity Recognition (NER) and Natural Language Inference (NLI) tasks. The NER dataset identifies violations, while the NLI dataset links violations to potential victims by matching them with resolved class-action cases. Experiments involve fine-tuning BERT-based models and open-source LLMs, as well as few-shot learning with closed-source LLMs. The results demonstrate the effectiveness of their datasets and setups for both tasks, achieving an F1-score of 62.69% for violation identification and 81.02% for victim association.

Dataset Curation and Methodology

To address the lack of suitable datasets for identifying legal violations across diverse contexts, the authors employed a multi-stage approach consisting of prompting, labeling, and data validation to generate two datasets for NER and NLI tasks. The NER task classifies tokens into predefined entities (Law, Violation, Violated By, and Violated On) to identify violations, while the NLI task classifies the relationship between a premise and a hypothesis (entailment, contradiction, or neutral) to match violations with known, resolved class-action cases (Figure 1).

Figure 1: A visual representation of the data generation flow, illustrating the step-by-step process from raw input to the final synthesized dataset.
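
To make the two task formats concrete, the sketch below shows what a single record in each dataset might look like. The field names and BIO tagging scheme are illustrative assumptions, not the schema of the released datasets.

```python
# Hypothetical record layouts for the two tasks; field names and the BIO
# tagging convention are illustrative, not the released schema.

# NER: token-level tags over the four entity types the paper defines.
ner_record = {
    "tokens": ["Acme", "Corp", "violated", "the", "TCPA", "by",
               "robocalling", "consumers"],
    "tags":   ["B-VIOLATED_BY", "I-VIOLATED_BY", "O", "O", "B-LAW", "O",
               "B-VIOLATION", "I-VIOLATION"],
}

# NLI: a premise (summary of a resolved class action) paired with a
# hypothesis (a candidate victim's account), labeled with one of three classes.
nli_record = {
    "premise": "A class action alleged that the company sent unsolicited "
               "marketing texts to consumers.",
    "hypothesis": "I keep receiving promotional SMS messages I never "
                  "signed up for.",
    "label": "entailment",  # or "contradiction" / "neutral"
}
```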

The data generation process leverages GPT-4 to produce synthetic data that mimics the syntactic complexity of legal language. For NER, the authors extracted relevant sections from class action complaints, summarized them using GPT-4, and employed explicit and implicit prompting strategies to generate diverse content. Explicit prompting emphasizes the inclusion and order of multiple entities, while implicit prompting focuses on content describing the violation. The prompts used for NLI data generation are shown in Figure 2.

Figure 2: Prompt design for generating the NLI dataset. The prompt contains the task description, specific instructions, and few-shot examples.
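
To illustrate the difference between the two NER prompting strategies, here is a paraphrased sketch; the wording is an assumption, not the paper's exact prompts.

```python
# Paraphrased sketches of the explicit and implicit prompting strategies;
# the wording is assumed, not taken from the paper.

explicit_prompt = (
    "Write a short news-style paragraph about a legal violation. "
    "It must mention, in order: the law violated (LAW), the violating party "
    "(VIOLATED_BY), the affected party (VIOLATED_ON), and the violation "
    "itself (VIOLATION)."
)

implicit_prompt = (
    "Write a short news-style paragraph describing the following violation, "
    "without explicit instructions about which entities to include: "
    "{violation_summary}"  # placeholder for a GPT-4 complaint summary
)
```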

Human Expert Annotations and Data Validation

Given the synthetic nature of the datasets, the authors implemented several validation methods to ensure the data is realistic and challenging. Legal experts examined auto-generated summaries and tasks, ensuring summaries accurately reflected key points of the complaints and that the tasks were correctly aligned with the context provided by these summaries. Multiple annotators examined each record to identify missing entities and hallucinations. Annotators, tasked with distinguishing between machine-generated and human-written records, achieved an average F1-score of 44.86%, with low inter-annotator agreement as measured by Cohen's Kappa scores (0.0821, 0.2149, and 0.0988). This low agreement suggests that the machine-generated content closely resembled human writing, making it difficult even for experts to differentiate.
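
The agreement check reduces to a standard Cohen's kappa computation over paired annotator judgments; a minimal sketch (with made-up labels) is shown below.

```python
# A minimal sketch of the inter-annotator agreement check: comparing two
# annotators' "machine-generated vs. human-written" judgments with Cohen's
# kappa. The label sequences here are made up for illustration.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = judged "machine-generated"
annotator_b = [0, 0, 1, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.4f}")  # values near 0 = chance-level agreement
```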

Experimental Setup and Results

The authors evaluated various LLMs on the generated datasets, including fine-tuned BERT-based models (RoBERTa, DistilBERT, BERT, Legal-BERT, Legal-RoBERTa, Legal-English-RoBERTa, and Longformer-based models), parameter-efficient fine-tuned open-source LLMs (Falcon-7B, Llama-2-7B, and Llama-2-13B using QLoRA), and few-shot experiments using closed-source LLMs (OpenAI's GPT-3.5 and GPT-4).
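
As a rough sketch of the parameter-efficient setup, the following shows how QLoRA fine-tuning is typically configured with the transformers and peft libraries. The model ID is one of those used in the paper, but every hyperparameter below is an assumption, not a value reported by the authors.

```python
# A minimal QLoRA sketch: load the base model in 4-bit and train only
# low-rank adapter weights. Hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize the base model to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed values
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)   # only adapter weights are trained
```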

For NER, the dataset is partitioned by Cause of Action (CoA), and CoAs that appear in the training set are excluded from the test set, so models are evaluated on unseen causes of action and data leakage is mitigated. For NLI, the dataset contains news articles spanning four legal domains, and a leave-one-out approach was employed: each legal domain is tested separately while the model is trained on the remaining domains.
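
A minimal sketch of the leave-one-out protocol for NLI follows; `records` is an assumed list of dicts carrying a "domain" field, not the dataset's actual structure.

```python
# Leave-one-domain-out splits for NLI: hold out one legal domain for testing
# and train on the rest. The record structure is an assumption.
DOMAINS = ["Consumer Protection", "Privacy", "TCPA", "Wage"]

def leave_one_out_splits(records):
    for held_out in DOMAINS:
        train = [r for r in records if r["domain"] != held_out]
        test = [r for r in records if r["domain"] == held_out]
        yield held_out, train, test
```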

The results, shown in Table 1 from the paper, indicate that BERT-based models outperform LLMs on the NER task, with roberta-base achieving the best performance (62.69% F1 and 70.30% recall). The authors attribute this to the cross-entropy objective used by BERT-based models, which provides a stronger gradient signal than the causal language modeling objective used for fine-tuning LLMs. The full results are shown in the table below.

| Model | Size | Method | F1 | Precision | Recall |
|---|---|---|---|---|---|
| nlpaueb/legal-bert-small-uncased | 41.92 | Fine-tune | 48.90±0.39 | 49.71±0.83 | 42.19±0.89 |
| distilbert-base-uncased | 66M | Fine-tune | 58.69±0.52 | 60.50±0.77 | 47.23±1.06 |
| bert-base-cased | 108M | Fine-tune | 54.80±0.64 | 65.28±1.01 | 39.92±0.80 |
| bert-base-uncased | 109M | Fine-tune | 53.22±1.42 | 45.86±1.68 | 63.42±1.11 |
| roberta-base | 125M | Fine-tune | 62.69±0.69 | 56.58±1.12 | 70.30±0.73 |
| nlpaueb/legal-bert-base-uncased | 109M | Fine-tune | 57.50±0.94 | 50.34±1.26 | 67.04±0.71 |
| lexlms/legal-roberta-base | 124M | Fine-tune | 59.73±2.03 | 53.11±2.27 | 68.25±1.86 |
| joelito-legal-english-roberta-base | 124M | Fine-tune | 59.01±1.74 | 52.52±2.52 | 67.40±0.85 |
| lexlms/legal-longformer-base | 148M | Fine-tune | 62.30±1.76 | 56.78±2.14 | 69.04±1.32 |
| lexlms/legal-roberta-large | 355M | Fine-tune | 50.23±28.1 | 46.07±25.8 | 55.22±30.8 |
| lexlms/legal-longformer-large | 434M | Fine-tune | 37.63±34.4 | 34.26±31.3 | 41.76±38.1 |
| joelito-legal-english-roberta-large | 355M | Fine-tune | 58.92±4.28 | 52.88±4.95 | 66.59±3.22 |
| Falcon 7B | 7B | QLoRA | 1.00±0.50 | 39.50±16.8 | 0.50±0.20 |
| Llama-2 7B | 7B | QLoRA | 16.3±4.10 | 34.10±11.1 | 11.20±2.60 |
| OpenAI GPT-3.5 | 175B | Few-shot | 2.77±0.12 | 1.78±0.08 | 6.23±0.29 |
| OpenAI GPT-4 | N/A | Few-shot | 13.55±0.54 | 8.29±0.37 | 37.1±0.99 |
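
For context, here is a minimal sketch of the BERT-family setup the table reflects: a token-classification head over the entity tags, trained with the default cross-entropy objective. The exact label set and hyperparameters are assumptions.

```python
# A minimal sketch of fine-tuning roberta-base for the NER task: a token
# classification head trained with cross-entropy over per-token logits.
# The label set and hyperparameters are assumed, not the paper's values.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

labels = ["O", "B-LAW", "I-LAW", "B-VIOLATION", "I-VIOLATION",
          "B-VIOLATED_BY", "I-VIOLATED_BY", "B-VIOLATED_ON", "I-VIOLATED_ON"]

# add_prefix_space is required when feeding RoBERTa pre-tokenized words
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base", num_labels=len(labels)
)  # cross-entropy over per-token logits is the default training objective

args = TrainingArguments(output_dir="ner-out", learning_rate=2e-5,
                         num_train_epochs=5, per_device_train_batch_size=16)
# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()
```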

In contrast, for NLI, LLMs outperform BERT-based models. Falcon 7B achieves the highest performance in the Consumer Protection, Privacy, and TCPA domains, but not in the Wage domain. The authors attribute this to the fact that, unlike in NER, the NLI fine-tuning requires LLMs to predict only a single label token (entailed, contradict, or neutral). Additionally, LLMs learn relatively well in low-data settings and generalize well to out-of-distribution test data.
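
To illustrate the single-label-token framing, the sketch below reduces a causal LLM's free-form output to one of the three classes; the prompt template and the `generate` callable are assumptions, not the paper's implementation.

```python
# A sketch of how a causal LLM's NLI output reduces to a single label token:
# generate a short continuation and map it onto the three classes.
# `generate` is an assumed text-generation callable.
def classify_nli(generate, premise, hypothesis):
    prompt = (f"Premise: {premise}\nHypothesis: {hypothesis}\n"
              f"Relation (entailed, contradict, or neutral):")
    text = generate(prompt, max_new_tokens=3).strip().lower()
    for label in ("entailed", "contradict", "neutral"):
        if text.startswith(label):
            return label
    return "neutral"  # fall back when the output is off-distribution
```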

Error Analysis and Future Directions

Error analysis of the NER model reveals that the "VIOLATION" entity type exhibits the lowest F1 score due to its length and contextual complexity. Errors fall into three categories: truncation, context misunderstanding, and incorrect entity identification. Error analysis of the NLI model (Falcon 7B) indicates a substantial number of errors in which "Contradict" or "Entailed" instances are misclassified as "Neutral", suggesting the model struggles with nuanced cases.
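
This kind of error pattern is visible in a standard confusion matrix; a minimal sketch (with placeholder gold and predicted labels) follows.

```python
# A sketch of the confusion-matrix view behind the NLI error analysis;
# the gold/pred lists are placeholders, not the paper's data.
from sklearn.metrics import confusion_matrix

LABELS = ["entailed", "contradict", "neutral"]
gold = ["entailed", "contradict", "neutral", "entailed", "contradict"]
pred = ["entailed", "neutral", "neutral", "neutral", "contradict"]

# Rows are gold labels, columns are predictions; off-diagonal mass in the
# "neutral" column corresponds to the misclassifications described above.
print(confusion_matrix(gold, pred, labels=LABELS))
```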

Future work includes expanding the dataset to cover a broader range of legal areas and multiple jurisdictions, as well as integrating fact-matching algorithms to enhance the accuracy of legal violation identification.

Conclusion

The paper presents LegalLens, a system for identifying legal violations in unstructured text and associating them with affected individuals, using LLMs and expert validation. The dual setup approach, employing NER to pinpoint violations and NLI to associate these violations with resolved cases, demonstrates promising results. The paper also highlights the challenges and limitations of using LLMs for legal NLP tasks, providing insights for future research and development in this area.
