Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition

Published 1 Jul 2022 in cs.CV | (2207.00193v2)

Abstract: Existing text recognition methods usually need large-scale training data. Most of them rely on synthetic training data due to the lack of annotated real images. However, there is a domain gap between the synthetic data and real data, which limits the performance of the text recognition models. Recent self-supervised text recognition methods attempted to utilize unlabeled real images by introducing contrastive learning, which mainly learns the discrimination of the text images. Inspired by the observation that humans learn to recognize the texts through both reading and writing, we propose to learn discrimination and generation by integrating contrastive learning and masked image modeling in our self-supervised method. The contrastive learning branch is adopted to learn the discrimination of text images, which imitates the reading behavior of humans. Meanwhile, masked image modeling is firstly introduced for text recognition to learn the context generation of the text images, which is similar to the writing behavior. The experimental results show that our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets. Moreover, our proposed text recognizer exceeds previous state-of-the-art text recognition methods by averagely 5.3% on 11 benchmarks, with similar model size. We also demonstrate that our pre-trained model can be easily applied to other text-related tasks with obvious performance gain. The code is available at https://github.com/ayumiymk/DiG.


Authors (8)

Citations (48)


Summary

The paper presents the DiG framework, which combines contrastive learning and masked image modeling to pre-train a text recognizer on unlabeled images.
It reports a 10.2%-20.2% gain over previous self-supervised methods on irregular scene text datasets and an average gain of 5.3% over previous state-of-the-art recognizers across 11 benchmarks.
The approach reduces reliance on large annotated datasets and extends its benefits to tasks like text segmentation and image super-resolution.

Discriminative and Generative Modeling for Self-Supervised Text Recognition

The paper "Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition" by Mingkun Yang et al. presents DiG (Discriminative and Generative), a self-supervised framework that integrates discriminative and generative learning for text recognition. It targets a known limitation: recognizers need large-scale annotated training data, which is usually synthetic, and the domain gap between synthetic and real images limits their performance.

The approach is motivated by the observation that humans learn to recognize text through both reading and writing. Accordingly, it combines contrastive learning, which mimics reading by learning to discriminate between text images, with masked image modeling, which mirrors writing by learning to generate the missing context of a text image. The authors argue that this combination yields a more robust feature representation of text images.

Key Results and Claims

The authors report that DiG surpasses previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text datasets. The resulting recognizer also exceeds prior state-of-the-art methods by an average of 5.3% across 11 benchmarks at a similar model size. These results suggest that combining discriminative and generative pre-training improves recognition accuracy.

Furthermore, the pre-trained DiG models transfer to other text-related tasks, such as text segmentation and image super-resolution, with clear performance gains, which points to the versatility of the learned representation.

Methodology and Implications

The method uses a ViT-based encoder with two pre-training branches: contrastive learning and masked image modeling. The contrastive branch contrasts positive and negative image pairs to extract discriminative features, while the masked image modeling branch reconstructs masked parts of an image to learn its context generatively.
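To make the contrastive branch concrete, here is a minimal sketch of an InfoNCE-style loss in plain Python. It is an illustration under assumptions, not the authors' implementation: the function names, the use of cosine similarity, and the temperature value are hypothetical choices, and the feature vectors stand in for encoder outputs of two augmented views of the same text images.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def info_nce(queries, keys, temperature=0.2):
    """Mean InfoNCE loss over a batch.

    queries[i] and keys[i] are features of two views of the same image
    (the positive pair); every other key in the batch acts as a negative.
    The temperature is an assumed value, not taken from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the positive pair as the correct class.
        total += log_denominator - logits[i]
    return total / len(q)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is low when each query matches its own key (`aligned`) and high when the positives are swapped (`shuffled`), which is the behavior that pushes the encoder toward discriminative features.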

DiG adopts a patch-aligned random masking strategy and is optimized with an InfoNCE loss for the contrastive branch and an L2 loss for the reconstruction branch, which lets it pre-train on unlabeled real images and synthetic datasets. The pre-trained model is then fine-tuned on annotated real data, and it yields substantial improvements even when only a fraction of the labeled data is used, which makes it promising for real-world deployment.
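The masking and reconstruction steps can be sketched as follows, again in plain Python and only as an illustration. Several details are assumptions rather than facts from the summary: the mask ratio, the choice to compute the L2 loss over masked patches only (a common convention in masked image modeling), the loss weight, and all function names.

```python
import random


def random_patch_mask(num_patches, mask_ratio, rng):
    """Pick patches to hide; True marks a masked patch.

    The mask is defined on the patch grid, so each patch is either fully
    masked or fully visible (patch-aligned). mask_ratio is an assumed value.
    """
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [index in masked for index in range(num_patches)]


def masked_l2_loss(pred_patches, target_patches, mask):
    """Mean squared error over the pixels of masked patches only (assumption)."""
    total, count = 0.0, 0
    for pred, target, is_masked in zip(pred_patches, target_patches, mask):
        if is_masked:
            total += sum((p - t) ** 2 for p, t in zip(pred, target))
            count += len(target)
    return total / count if count else 0.0


def total_loss(contrastive, reconstruction, weight=1.0):
    """Combine both objectives; the weighting scheme is hypothetical."""
    return contrastive + weight * reconstruction


rng = random.Random(0)
mask = random_patch_mask(num_patches=10, mask_ratio=0.6, rng=rng)
loss = masked_l2_loss([[1.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [True, False])
```

Passing an explicit `random.Random` instance keeps the mask reproducible, which helps when comparing pre-training runs.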

Conceptually, the work shows that discriminative and generative objectives can complement each other within a single self-supervised model. Practically, it offers a way to reduce dependence on large-scale annotated datasets, which could ease the adoption of robust text recognition in varied applications.

Future Prospects

Further work could extend the framework to multilingual text recognition or adapt it to image processing tasks beyond text-specific domains. Integration into multimodal AI systems that combine textual and visual data could also be explored.

In summary, the paper integrates discriminative and generative modeling in self-supervised pre-training and reports substantial improvements in text recognition, with practical benefits for settings where labeled real data is scarce.
