Abstract

Pre-trained Transformers currently dominate most NLP tasks. They impose, however, limits on the maximum input length (512 sub-words in BERT), which are too restrictive in the legal domain. Even sparse-attention models, such as Longformer and BigBird, which increase the maximum input length to 4,096 sub-words, severely truncate texts in three of the six datasets of LexGLUE. Simpler linear classifiers with TF-IDF features can handle texts of any length and require far fewer resources to train and deploy, but they are usually outperformed by pre-trained Transformers. We explore two directions to cope with long legal texts: (i) modifying a Longformer warm-started from LegalBERT to handle even longer texts (up to 8,192 sub-words), and (ii) modifying LegalBERT to use TF-IDF representations. The first approach is the best in terms of performance, surpassing a hierarchical version of LegalBERT, which was the previous state of the art in LexGLUE. The second approach leads to computationally more efficient models at the expense of lower performance, but the resulting models still overall outperform a linear SVM with TF-IDF features in long legal document classification.

Overview

  • This study examines methods to adapt transformer models, especially LegalBERT and Longformer, for handling longer legal documents beyond the standard input limitations.

  • Two primary strategies are explored: adapting Longformer with legal pre-training to process texts up to 8,192 sub-words and augmenting LegalBERT with TF-IDF features to prioritize relevant inputs.

  • The adapted Longformer, leveraging legal domain knowledge inherited from LegalBERT, achieved the best results on LexGLUE's long document classification tasks, surpassing the previous state of the art, a hierarchical version of LegalBERT.

  • The research suggests future directions for optimizing transformer models for legal texts, including experimentation with additional sparse attention mechanisms and integrating richer contextual embeddings.

Enhancing Transformers for Long Legal Document Processing: Modding LegalBERT and Longformer

Introduction

The processing of long legal documents presents unique challenges in NLP. Traditional transformer models such as BERT are limited by their input length, typically capped at 512 sub-word tokens, which is insufficient for many legal documents that can be significantly longer. Sparse-attention models like Longformer and BigBird offer some relief by extending the input capacity to 4,096 sub-words, yet they still truncate texts severely in three of the six LexGLUE datasets. Against this backdrop, this study explores methods to adapt and extend transformer models, specifically LegalBERT and Longformer, to handle long legal texts more effectively.

Approaches to Long Document Processing

This paper investigates two main strategies to enhance the processing capabilities for long legal documents:

  • Adapting Longformer with Legal Pre-training: The study experiments with a Longformer warm-started from LegalBERT, extending the maximum input length to 8,192 sub-words while retaining the legal domain knowledge encoded in LegalBERT's weights (a sketch of the warm-starting step follows this list).
  • Modifying LegalBERT with TF-IDF Representations: The second approach augments LegalBERT with Term Frequency-Inverse Document Frequency (TF-IDF) features, letting the model cope with longer texts indirectly by prioritizing the most relevant inputs according to their TF-IDF scores (a sketch of one plausible token-selection scheme also follows the list).
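
To make the first bullet concrete, below is a minimal sketch, not the paper's exact recipe, of the warm-starting step: a BERT-style encoder's learned 512-position embedding table is tiled out to 8,192 positions so training can start from LegalBERT's weights rather than from scratch. The checkpoint name, the helper `extend_position_embeddings`, and the tiling strategy are illustrative assumptions; replacing full self-attention with Longformer's sliding-window attention and continuing pre-training are separate steps not shown here.

```python
import torch
from transformers import AutoModel

# Hypothetical helper: warm-start a longer-input encoder from LegalBERT by
# tiling its learned 512-position embedding table out to 8,192 positions.
# The checkpoint name and this exact procedure are illustrative assumptions.
def extend_position_embeddings(model, new_max_pos=8192):
    emb = model.embeddings.position_embeddings          # nn.Embedding(512, hidden)
    old_max_pos, hidden = emb.weight.shape
    assert new_max_pos % old_max_pos == 0, "new length must be a multiple of the old one"

    new_emb = torch.nn.Embedding(new_max_pos, hidden)
    with torch.no_grad():
        # Repeat the pre-trained position vectors instead of re-initializing,
        # so every new position starts from an already-trained value.
        new_emb.weight.copy_(emb.weight.repeat(new_max_pos // old_max_pos, 1))

    model.embeddings.position_embeddings = new_emb
    # Refresh the cached position-id range used by BERT-style embeddings.
    model.embeddings.position_ids = torch.arange(new_max_pos).unsqueeze(0)
    model.config.max_position_embeddings = new_max_pos
    return model

legal_bert = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")
long_legal_bert = extend_position_embeddings(legal_bert, new_max_pos=8192)
# Note: this only lengthens the position embeddings; swapping full self-attention
# for Longformer's sliding-window attention is a separate conversion step.
```

Tiling the pre-trained positions, rather than randomly initializing the new ones, is a common choice when warm-starting longer-context models, since nearby positions keep sensible values before any further pre-training.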
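
For the second bullet, the sketch below shows one plausible reading of the TF-IDF idea: score a document's sub-words with TF-IDF and keep only the highest-scoring 512 of them, in their original order, so a standard-length LegalBERT sees the most informative parts of a long text. The helper `top_tfidf_subwords` and its details (sub-word-level scoring, order-preserving selection) are assumptions made for illustration, not the paper's specification.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer

# Hypothetical sketch of TF-IDF-based input pruning: keep only the 512
# highest-scoring sub-words of a long document so a standard LegalBERT can
# attend to its most informative parts.
def top_tfidf_subwords(documents, doc_index, max_tokens=512,
                       checkpoint="nlpaueb/legal-bert-base-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # Score sub-word tokens (not whole words) so the selection matches the
    # model's own vocabulary.
    vectorizer = TfidfVectorizer(tokenizer=tokenizer.tokenize, lowercase=False)
    tfidf = vectorizer.fit_transform(documents)          # (n_docs, vocab) sparse matrix
    vocab = vectorizer.vocabulary_

    tokens = tokenizer.tokenize(documents[doc_index])
    scores = [tfidf[doc_index, vocab[t]] if t in vocab else 0.0 for t in tokens]
    # Pick the highest-scoring positions, then restore document order so the
    # surviving sub-words keep as much local context as possible.
    keep = sorted(sorted(range(len(tokens)), key=lambda i: -scores[i])[:max_tokens])
    return [tokens[i] for i in keep]
```

The kept sub-words can then be mapped back to ids with `tokenizer.convert_tokens_to_ids` and fed to LegalBERT as an ordinary 512-token input.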

Key Findings

The modified Longformer, warm-started from LegalBERT and capable of processing up to 8,192 sub-words, achieved the best performance on LexGLUE's long document classification tasks, outperforming the hierarchical version of LegalBERT that was the previous state of the art. The TF-IDF modification of LegalBERT did not match the tailored Longformer and traded some performance for computational efficiency, but the resulting models still overall outperformed the linear SVM baseline with TF-IDF features on long legal document classification.

Implications and Future Directions

These findings open new avenues for processing long legal documents with pre-trained transformers, offering pathways to both enhanced performance and computational efficiency. The success of the adapted Longformer model underscores the importance of domain-specific pre-training and the potential benefits of extending input length capabilities for complex text classification tasks. The TF-IDF augmented approach to LegalBERT presents an interesting compromise between efficiency and performance, leveraging traditional NLP techniques within a modern transformer framework for improved handling of long texts.

Future work may explore further optimizations and pre-training strategies tailored to the unique demands of legal document processing. Experimentation with additional sparse attention mechanisms and the integration of richer contextual embeddings could yield further improvements. Additionally, testing these adapted models on a broader range of legal NLP tasks beyond classification may illuminate their versatility and limitations, guiding the development of more robust solutions for legal text analysis.

Conclusion

This study contributes to the ongoing exploration of adapting and enhancing transformer models for specialized domains such as legal document processing. By extending the capabilities of both LegalBERT and Longformer to accommodate longer texts, the research addresses a significant limitation in current NLP approaches to legal text and opens the door to more sophisticated and effective tools for legal practitioners and researchers alike.
