Emergent Mind


Pre-trained Transformers currently dominate most NLP tasks. They impose, however, limits on the maximum input length (512 sub-words in BERT), which are too restrictive in the legal domain. Even sparse-attention models, such as Longformer and BigBird, which increase the maximum input length to 4,096 sub-words, severely truncate texts in three of the six datasets of LexGLUE. Simpler linear classifiers with TF-IDF features can handle texts of any length and require far fewer resources to train and deploy, but are usually outperformed by pre-trained Transformers. We explore two directions to cope with long legal texts: (i) modifying a Longformer warm-started from LegalBERT to handle even longer texts (up to 8,192 sub-words), and (ii) modifying LegalBERT to use TF-IDF representations. The first approach is the best in terms of performance, surpassing a hierarchical version of LegalBERT, which was the previous state of the art in LexGLUE. The second approach leads to computationally more efficient models at the expense of lower performance, but the resulting models still outperform, overall, a linear SVM with TF-IDF features in long legal document classification.


  • This study examines methods to adapt transformer models, specifically LegalBERT and Longformer, to legal documents that exceed the standard input limits.

  • Two strategies are explored: warm-starting a Longformer from LegalBERT so that it processes texts of up to 8,192 sub-words, and modifying LegalBERT to use TF-IDF representations.

  • The adapted Longformer, which inherits legal domain knowledge from LegalBERT, performed best on long legal document classification in LexGLUE, surpassing the hierarchical version of LegalBERT.

  • The research suggests future directions for optimizing transformer models for legal texts, including experimentation with additional sparse attention mechanisms and integrating richer contextual embeddings.

Enhancing Transformers for Long Legal Document Processing: Modding LegalBERT and Longformer


The processing of long legal documents presents unique challenges in the field of NLP. Transformer models such as BERT cap their input at 512 sub-word tokens, far fewer than many legal documents contain. Sparse-attention models such as Longformer and BigBird raise the limit to 4,096 sub-words, but this still severely truncates texts in three of the six LexGLUE datasets. Against this backdrop, this study explores ways to adapt and extend LegalBERT and Longformer so that they handle long legal texts more effectively.
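To make the effect of these input limits concrete, the following minimal sketch truncates a document at each of the three limits mentioned in the summary (512, 4,096 and 8,192 sub-words) and reports how much is dropped. The 10,000-token document and the `truncate` helper are hypothetical and serve only as an illustration; real legal texts and real tokenizers will give different numbers.

```python
def truncate(sub_words, max_len):
    """Keep the first max_len sub-words; return (kept, fraction_dropped)."""
    kept = sub_words[:max_len]
    dropped = len(sub_words) - len(kept)
    fraction = dropped / len(sub_words) if sub_words else 0.0
    return kept, fraction


# Hypothetical document of 10,000 sub-word tokens (placeholder strings,
# not the output of a real tokenizer).
doc = [f"tok{i}" for i in range(10_000)]

for limit in (512, 4096, 8192):
    kept, lost = truncate(doc, limit)
    print(f"limit={limit:>5}  kept={len(kept):>5}  dropped={lost:.1%}")
```

For this toy document, a 512-sub-word limit discards roughly 95% of the text, while an 8,192-sub-word limit discards roughly 18%.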

Approaches to Long Document Processing

This paper investigates two main strategies to enhance the processing capabilities for long legal documents:

  • Adapting Longformer with Legal Pre-training: The study experiments with a Longformer model that has been warm-started from LegalBERT to handle texts up to 8,192 sub-words, aiming to leverage the legal domain knowledge encapsulated in LegalBERT while extending the input length capacity.
  • Modifying LegalBERT with TF-IDF Representations: The second approach seeks to augment LegalBERT with Term Frequency-Inverse Document Frequency (TF-IDF) features, allowing the model to process longer texts indirectly by prioritizing the most relevant textual inputs based on their TF-IDF scores.
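The idea of prioritizing inputs by TF-IDF score can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the summary does not specify how the scores are computed or fed to LegalBERT, so the smoothed IDF formula, the helper names `tfidf_scores` and `select_top_tokens`, and the use of whitespace-split words in place of sub-words are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_scores(doc_tokens, corpus):
    """Return a TF-IDF score for each distinct token in doc_tokens.

    Assumed formula: tf = count / len(doc), idf = ln((1 + N) / (1 + df)).
    """
    n_docs = len(corpus)
    df = Counter()
    for d in corpus:
        df.update(set(d))  # document frequency: count each token once per doc
    tf = Counter(doc_tokens)
    return {
        tok: (count / len(doc_tokens)) * math.log((1 + n_docs) / (1 + df[tok]))
        for tok, count in tf.items()
    }


def select_top_tokens(doc_tokens, corpus, budget):
    """Return up to `budget` distinct tokens, highest TF-IDF first.

    Ties are broken alphabetically so the result is deterministic.
    """
    scores = tfidf_scores(doc_tokens, corpus)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:budget]


# Toy corpus; a real setting would use sub-words and a budget such as 512.
corpus = [
    ["the", "court", "held", "the", "appeal"],
    ["the", "tenant", "breached", "the", "lease"],
    ["the", "court", "dismissed", "the", "claim"],
]
print(select_top_tokens(corpus[1], corpus, budget=3))
```

In this toy corpus, "the" appears in every document and therefore scores zero, so the three-token budget is spent on the rarer terms of the second document.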

Key Findings

The modified Longformer, warm-started from LegalBERT and capable of processing up to 8,192 sub-words, achieved the best performance on the LexGLUE long document classification tasks, surpassing the hierarchical version of LegalBERT, the previous state of the art on that benchmark. The TF-IDF modifications to LegalBERT yielded computationally more efficient models at the cost of lower performance than the adapted Longformer, yet these models still outperformed, overall, a linear SVM with TF-IDF features on long legal texts.
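Warm-starting a longer-input model from a 512-position encoder requires some way to initialize the extra position embeddings. The summary does not say how this was done for the 8,192-sub-word model, so the sketch below shows one common option only as an assumption: tiling the existing table, as the original Longformer did when it copied RoBERTa's 512 position embeddings repeatedly. The function name and the toy 512 x 4 table are hypothetical.

```python
def extend_position_embeddings(short_table, new_len):
    """Tile a learned position-embedding table until it has new_len rows.

    Row i of the result is a copy of row (i mod old_len) of short_table.
    """
    if not short_table:
        raise ValueError("short_table must contain at least one row")
    old_len = len(short_table)
    return [list(short_table[i % old_len]) for i in range(new_len)]


# Toy 512 x 4 table standing in for a pre-trained encoder's position
# embeddings (real tables are much wider and hold learned values).
short = [[float(pos), 0.0, 0.0, 0.0] for pos in range(512)]
long_table = extend_position_embeddings(short, 8192)

print(len(long_table))               # number of positions after tiling
print(long_table[512] == short[0])   # position 512 reuses the row for position 0
```

Tiling keeps the local position structure the encoder already learned within each 512-position block; the extended model would normally be trained further so that the copied rows can diverge.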

Implications and Future Directions

These findings open new avenues for processing long legal documents with pre-trained transformers, offering pathways to both enhanced performance and computational efficiency. The success of the adapted Longformer model underscores the importance of domain-specific pre-training and the potential benefits of extending input length capabilities for complex text classification tasks. The TF-IDF augmented approach to LegalBERT presents an interesting compromise between efficiency and performance, leveraging traditional NLP techniques within a modern transformer framework for improved handling of long texts.

Future work may explore further optimizations and pre-training strategies tailored to the unique demands of legal document processing. Experimentation with additional sparse attention mechanisms and the integration of richer contextual embeddings could yield further improvements. Additionally, testing these adapted models on a broader range of legal NLP tasks beyond classification may illuminate their versatility and limitations, guiding the development of more robust solutions for legal text analysis.


This study contributes to the ongoing exploration of adapting and enhancing transformer models for specialized domains such as legal document processing. By extending the capabilities of both LegalBERT and Longformer to accommodate longer texts, the research addresses a significant limitation in current NLP approaches to legal text and opens the door to more sophisticated and effective tools for legal practitioners and researchers alike.
