- The paper introduces layout embeddings and relative attention bias in a modified RoBERTa model to capture both semantic content and two-dimensional spatial context.
- The model achieves notable F1-score improvements on datasets like Kleister NDA and SROIE, outperforming traditional language models.
- The approach relies on OCR-derived bounding boxes rather than raw images, enabling efficient, scalable deployment for document processing.
Understanding LAMBERT: Layout-Aware Language Modeling for Information Extraction
The paper introduces LAMBERT, a layout-aware approach to language modeling that incorporates document layout information to improve information extraction. Traditional language models operate on linear text sequences; LAMBERT instead accounts for the non-linear, spatial structure of documents. It does so by integrating layout features, derived from OCR systems, into the RoBERTa architecture, yielding a model that leverages both semantic and positional information without relying on raw image data.
Methodology
LAMBERT modifies the RoBERTa Transformer to encode where each token appears on the document page. Two main enhancements are introduced:
- Layout Embeddings: These embeddings capture the position of each token in two-dimensional space, using its bounding box. The bounding box is passed through a learnable projection so that it matches RoBERTa's input dimensionality.
- Relative Attention Bias: This bias adapts the Transformer attention mechanism to account for both sequential and two-dimensional token positions. It is particularly useful for capturing relationships between tokens in complex layouts such as tables and forms. A minimal sketch of both enhancements follows this list.
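The following PyTorch sketch illustrates how the two enhancements could be wired together. It is a minimal illustration of the ideas summarized above, not the authors' released implementation: the module names (`LayoutEmbedding`, `Relative2DAttentionBias`), the distance-bucketing scheme, and the default bucket count are assumptions made here for concreteness.

```python
# Minimal sketch of LAMBERT-style layout embeddings and relative attention bias.
# Module names, bucket counts, and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayoutEmbedding(nn.Module):
    """Projects normalized token bounding boxes (x0, y0, x1, y1) into the
    model's hidden size so they can be summed with the token embeddings."""

    def __init__(self, hidden_size: int, bbox_dim: int = 4):
        super().__init__()
        self.proj = nn.Linear(bbox_dim, hidden_size)  # learnable projection

    def forward(self, bboxes: torch.Tensor) -> torch.Tensor:
        # bboxes: (batch, seq_len, 4) with coordinates normalized to [0, 1]
        return self.proj(bboxes)


class Relative2DAttentionBias(nn.Module):
    """Adds a learned per-head bias to attention logits based on bucketed
    horizontal and vertical offsets between token bounding-box centres."""

    def __init__(self, num_heads: int, num_buckets: int = 32, max_distance: float = 1.0):
        super().__init__()
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        self.h_bias = nn.Embedding(num_buckets, num_heads)
        self.v_bias = nn.Embedding(num_buckets, num_heads)

    def _bucket(self, rel: torch.Tensor) -> torch.Tensor:
        # Map signed offsets in [-max_distance, max_distance] onto integer buckets.
        scaled = (rel / self.max_distance).clamp(-1.0, 1.0)
        return ((scaled + 1.0) / 2.0 * (self.num_buckets - 1)).round().long()

    def forward(self, bboxes: torch.Tensor) -> torch.Tensor:
        # bboxes: (batch, seq_len, 4) -> bias: (batch, num_heads, seq_len, seq_len)
        cx = (bboxes[..., 0] + bboxes[..., 2]) / 2
        cy = (bboxes[..., 1] + bboxes[..., 3]) / 2
        rel_x = cx.unsqueeze(-1) - cx.unsqueeze(-2)  # pairwise horizontal offsets
        rel_y = cy.unsqueeze(-1) - cy.unsqueeze(-2)  # pairwise vertical offsets
        bias = self.h_bias(self._bucket(rel_x)) + self.v_bias(self._bucket(rel_y))
        return bias.permute(0, 3, 1, 2)  # (batch, num_heads, seq_len, seq_len)


def attention_with_layout_bias(q, k, v, bias):
    # q, k, v: (batch, num_heads, seq_len, head_dim); bias matches the logits shape.
    logits = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5) + bias
    return F.softmax(logits, dim=-1) @ v
```

In a full model, the layout embedding would be summed with RoBERTa's token embeddings before the first Transformer layer, while the relative bias would be added to the attention logits inside each self-attention layer.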
The model is first pretrained on large volumes of unannotated documents with an unsupervised masked language modeling objective, and can then be fine-tuned for specific information extraction tasks on datasets whose documents vary in complexity and layout diversity. A sketch of the pretraining step appears below.
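To make the pretraining objective concrete, the snippet below sketches one masked language modeling step over token ids paired with bounding boxes. The 15% masking rate and the `model(input_ids, bboxes)` interface are standard RoBERTa-style assumptions made for illustration, not details taken from the paper.

```python
# Illustrative masked language modeling step for layout-aware pretraining.
# Masking rate, model interface, and special-token handling are simplified assumptions.
import torch
import torch.nn.functional as F


def masked_lm_step(model, input_ids, bboxes, mask_token_id, vocab_size, mask_prob=0.15):
    labels = input_ids.clone()
    # Randomly pick positions to mask; unmasked positions are ignored by the loss.
    # (Special-token and padding handling omitted for brevity.)
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    labels[~mask] = -100                              # ignore index for cross-entropy
    corrupted = input_ids.masked_fill(mask, mask_token_id)

    # `model` is assumed to take token ids plus bounding boxes and return
    # per-token vocabulary logits of shape (batch, seq_len, vocab_size).
    logits = model(corrupted, bboxes)
    return F.cross_entropy(
        logits.reshape(-1, vocab_size), labels.reshape(-1), ignore_index=-100
    )
```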
Experimental Evaluation
LAMBERT demonstrates strong performance across a range of publicly available datasets, including Kleister NDA, Kleister Charity, SROIE, and CORD, which cover both richly formatted and plainer documents. Noteworthy results include an F1-score improvement from 78.50 to 80.42 on Kleister NDA and a state-of-the-art F1-score of 98.17 on SROIE, showing gains over plain-text models such as RoBERTa and competitive results against more complex models such as LayoutLM.
Theoretical and Practical Implications
Practically, LAMBERT's architecture is advantageous because it relies only on OCR-derived bounding box information, removing the need for computationally expensive image processing. This makes it viable for industrial deployments where large document volumes are processed, since the required preprocessing is lightweight (see the sketch below). Theoretically, the approach broadens the scope of language models to include spatial awareness, a meaningful step towards more comprehensive document understanding systems.
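As a hypothetical example of that preprocessing, the snippet below normalizes OCR word boxes from pixel coordinates to the page size; the record layout (word text plus x0, y0, x1, y1 in pixels) is an assumed format, not a specific OCR engine's schema.

```python
# Sketch of the lightweight preprocessing implied by using OCR output only:
# pixel-space word boxes are normalized by page size; no image tensors are needed.
from typing import List, Tuple


def normalize_boxes(
    words: List[Tuple[str, float, float, float, float]],
    page_width: float,
    page_height: float,
) -> List[Tuple[str, Tuple[float, float, float, float]]]:
    normalized = []
    for text, x0, y0, x1, y1 in words:
        normalized.append(
            (text, (x0 / page_width, y0 / page_height, x1 / page_width, y1 / page_height))
        )
    return normalized


# Example: a single OCR'd word on a 1000 x 1400 px page.
print(normalize_boxes([("Total:", 120, 830, 210, 858)], 1000, 1400))
```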
Future Directions
The research opens promising avenues for further development in AI document parsing. Subsequent work could focus on scaling up both the training data and model parameters to potentially enhance performance further. Additionally, investigating the nuanced effects of different attention biases and layout embedding dimensions could yield insights for optimizing layout-aware models. Future explorations could also include extending the model to handle multi-modal inputs more effectively.
In summary, LAMBERT presents a noteworthy advancement in language modeling for document information extraction: by incorporating layout awareness, it improves content comprehension and enables more robust extraction, with gains in both processing efficiency and output accuracy.