- The paper presents LayoutXLM, a model that integrates text, layout, and image features to advance multilingual document understanding.
- It employs a multimodal Transformer with spatial-aware self-attention and three pre-training objectives to precisely model document semantics.
- Experiments on the XFUND benchmark show significant F1 improvements over text-only baselines in language-specific, zero-shot, and multitask settings.
This paper introduces LayoutXLM, a multimodal pre-trained model designed for understanding visually-rich documents across multiple languages. It addresses the limitation of previous models that were either text-only multilingual or multimodal but monolingual (primarily English). LayoutXLM extends the LayoutLMv2 architecture to handle multilingual documents by incorporating text, layout (2D position), and visual (image) information.
Model Architecture and Pre-training
- LayoutXLM follows the LayoutLMv2 architecture: a multimodal Transformer that fuses text, 2D layout, and image embeddings, with a spatial-aware self-attention mechanism that biases attention by the relative 1D and 2D positions of tokens (sketched below).
- It is pre-trained on large-scale multilingual visually-rich documents with three objectives: multilingual masked visual-language modeling, text-image alignment, and text-image matching.
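A minimal PyTorch sketch of the spatial-aware self-attention idea (an assumed simplification, not the released implementation): ordinary multi-head self-attention whose scores receive learnable biases looked up from the relative 1D token distance and the relative 2D offsets between token bounding boxes. Bucketing is reduced to a simple clamp, and all names and sizes here (`MAX_REL`, `SpatialAwareSelfAttention`, hidden size 768) are illustrative assumptions.

```python
# Sketch of spatial-aware self-attention: attention scores biased by relative
# 1D token positions and relative 2D (x, y) box offsets. Bucketing is simplified.
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_REL = 128  # assumed cap on relative distances (token indices / layout units)


def rel_bucket(rel: torch.Tensor) -> torch.Tensor:
    """Map signed relative distances to non-negative bucket indices by clamping."""
    return (rel.clamp(-MAX_REL, MAX_REL) + MAX_REL).long()


class SpatialAwareSelfAttention(nn.Module):
    def __init__(self, hidden: int = 768, heads: int = 12):
        super().__init__()
        self.heads, self.dim = heads, hidden // heads
        self.qkv = nn.Linear(hidden, 3 * hidden)
        self.out = nn.Linear(hidden, hidden)
        # one learnable bias per head for 1D order and for x/y box offsets
        self.bias_1d = nn.Embedding(2 * MAX_REL + 1, heads)
        self.bias_x = nn.Embedding(2 * MAX_REL + 1, heads)
        self.bias_y = nn.Embedding(2 * MAX_REL + 1, heads)

    def forward(self, x: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden); boxes: (batch, seq, 4) integer (x0, y0, x1, y1)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, self.dim).transpose(1, 2) for t in (q, k, v))
        scores = (q @ k.transpose(-2, -1)) / self.dim ** 0.5           # (b, heads, n, n)

        pos = torch.arange(n, device=x.device)
        rel_1d = pos[None, :] - pos[:, None]                           # (n, n)
        rel_x = boxes[..., 0][:, None, :] - boxes[..., 0][:, :, None]  # (b, n, n)
        rel_y = boxes[..., 1][:, None, :] - boxes[..., 1][:, :, None]

        scores = scores + self.bias_1d(rel_bucket(rel_1d)).permute(2, 0, 1)
        scores = scores + self.bias_x(rel_bucket(rel_x)).permute(0, 3, 1, 2)
        scores = scores + self.bias_y(rel_bucket(rel_y)).permute(0, 3, 1, 2)

        attn = F.softmax(scores, dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, n, -1))
```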
XFUND Benchmark
- To evaluate multilingual performance, the paper introduces the XFUND benchmark, an extension of the English FUNSD dataset.
- Languages: XFUND includes human-annotated forms in 7 languages: Chinese, Japanese, Spanish, French, Italian, German, and Portuguese.
- Task: The benchmark targets key-value extraction, divided into two subtasks:
- Semantic Entity Recognition (SER): Identifying and classifying text segments into predefined categories (e.g., HEADER, QUESTION, ANSWER). This is framed as a sequence labeling task using the BIO format (a short illustration follows this list).
- Relation Extraction (RE): Identifying links between semantic entities, specifically key-value relationships. This is treated as a classification problem over candidate entity pairs, using a biaffine attention classifier (sketched after this list).
- Data: XFUND contains 1,393 forms (199 per language), split into 149 training and 50 test forms per language. Form templates were collected online, filled with synthetic data (typed or handwritten), scanned, OCR'd using the Microsoft Read API, and manually annotated.
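As a small illustration of the BIO scheme used for SER, here is a sketch with an invented form field and a hypothetical `to_bio` helper (tokens, spans, and labels are made up, not taken from XFUND):

```python
# BIO tagging for SER: B-/I-<label> marks the start/continuation of an entity,
# O marks tokens outside any entity. Example tokens and spans are invented.
def to_bio(tokens, entities):
    """entities: list of (start, end, label) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = ["Nom", ":", "Jean", "Dupont"]
entities = [(0, 2, "QUESTION"), (2, 4, "ANSWER")]
print(list(zip(tokens, to_bio(tokens, entities))))
# [('Nom', 'B-QUESTION'), (':', 'I-QUESTION'), ('Jean', 'B-ANSWER'), ('Dupont', 'I-ANSWER')]
```

For RE, here is a minimal sketch of a biaffine classifier over candidate entity pairs, in the spirit of the head described above; the projection size, the way entity representations are built, and the module names are assumptions rather than the paper's released code:

```python
# Biaffine scoring of (head, tail) entity pairs: h^T U t + W [h; t] + b, producing
# one logit per relation label (here 2: no-relation vs. key-value). Sizes assumed.
import torch
import torch.nn as nn


class BiaffineRelationClassifier(nn.Module):
    def __init__(self, hidden: int = 768, proj: int = 128, num_labels: int = 2):
        super().__init__()
        self.head_proj = nn.Sequential(nn.Linear(hidden, proj), nn.ReLU())
        self.tail_proj = nn.Sequential(nn.Linear(hidden, proj), nn.ReLU())
        self.bilinear = nn.Bilinear(proj, proj, num_labels)   # the bilinear (U) term
        self.linear = nn.Linear(2 * proj, num_labels)         # the linear (W) term

    def forward(self, head_repr: torch.Tensor, tail_repr: torch.Tensor) -> torch.Tensor:
        # head_repr, tail_repr: (num_pairs, hidden) representations of candidate entities
        h = self.head_proj(head_repr)
        t = self.tail_proj(tail_repr)
        return self.bilinear(h, t) + self.linear(torch.cat([h, t], dim=-1))
```

In a typical setup, each entity representation would come from the encoder output of the entity's first token (possibly combined with an entity-type embedding), and the classifier is trained with cross-entropy over all candidate key-value pairs.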
Experiments and Results
- LayoutXLM (Base and Large versions) was compared against strong multilingual text-only baselines (XLM-RoBERTa, InfoXLM).
- Evaluation Settings:
1. Language-specific fine-tuning: Training and testing on the same target language.
2. Zero-shot transfer: Training only on the English FUNSD dataset and testing on other XFUND languages.
3. Multitask fine-tuning: Training on all 8 languages (FUNSD + XFUND) simultaneously and testing on each language.
- Results: LayoutXLM significantly outperformed the text-only baselines across all languages and settings on both SER and RE, as measured by F1 (a short metric example follows this list).
- In language-specific fine-tuning, LayoutXLM-Large achieved an average F1 of 82.82% for SER and 72.06% for RE across the 7 XFUND languages, compared to 74.71%/60.02% for InfoXLM-Large.
- Zero-shot results showed strong transfer capabilities, with LayoutXLM-Large achieving 61.15% SER / 54.87% RE average F1, demonstrating its ability to generalize layout understanding across languages.
- Multitask fine-tuning further boosted performance, yielding the best results (e.g., 84.29% SER / 84.58% RE average F1 for LayoutXLM-Large), indicating that the model benefits from shared layout patterns across different languages.
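The SER F1 numbers are typically computed at the entity level over BIO tags; here is a minimal example of such a computation using the seqeval library (the tag sequences are invented, not model output):

```python
# Entity-level F1 over BIO tag sequences (illustrative tags only).
from seqeval.metrics import f1_score

gold = [["B-QUESTION", "I-QUESTION", "B-ANSWER", "O"]]
pred = [["B-QUESTION", "O", "B-ANSWER", "O"]]
# Predicted entities: QUESTION with the wrong span, ANSWER exactly matched
# -> precision 1/2, recall 1/2, F1 = 0.5
print(f1_score(gold, pred))
```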
Conclusion
LayoutXLM effectively combines text, layout, and image information for multilingual document understanding. Its pre-training on diverse, multilingual documents allows it to outperform text-only models and generalize well across languages, as demonstrated on the newly introduced XFUND benchmark. The model and dataset were made publicly available to facilitate further research.