SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition (2203.10209v1)

Published 19 Mar 2022 in cs.CV

Abstract: End-to-end scene text spotting has attracted great attention in recent years due to the success of excavating the intrinsic synergy of the scene text detection and recognition. However, recent state-of-the-art methods usually incorporate detection and recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end scene text spotting framework termed SwinTextSpotter. Using a transformer encoder with dynamic head as the detector, we unify the two tasks with a novel Recognition Conversion mechanism to explicitly guide text localization through recognition loss. The straightforward design results in a concise framework that requires neither additional rectification module nor character-level annotation for the arbitrarily-shaped text. Qualitative and quantitative experiments on multi-oriented datasets RoIC13 and ICDAR 2015, arbitrarily-shaped datasets Total-Text and CTW1500, and multi-lingual datasets ReCTS (Chinese) and VinText (Vietnamese) demonstrate SwinTextSpotter significantly outperforms existing methods. Code is available at https://github.com/mxin262/SwinTextSpotter.

Citations (82)

Summary

  • The paper introduces a unified transformer-based framework that synergizes text detection and recognition without relying on character-level annotations.
  • It employs a dilated Swin-Transformer backbone and a Recognition Conversion module to enhance feature extraction and suppress background noise.
  • Experimental results on datasets like ICDAR 2015 and Total-Text demonstrate enhanced robustness and state-of-the-art performance.

Introduction

SwinTextSpotter addresses the limitations of state-of-the-art scene text spotting methods by enhancing the synergy between text detection and recognition. Traditional approaches often treat the two tasks separately, leading to inefficiency and error accumulation. SwinTextSpotter instead unifies detection and recognition in a transformer-based framework, achieving better performance without requiring a rectification module or character-level annotations.

Framework Overview (Figure 1)

SwinTextSpotter's framework consists of a four-component architecture (a minimal code sketch follows Figure 1):

  1. Backbone: Utilizes Swin-Transformer with a Feature Pyramid Network (FPN) to extract multi-scale features.
  2. Query-based Text Detector: Develops a set-prediction problem using a sequence of proposal features and boxes for efficient text localization.
  3. Recognition Conversion (RC) Module: Bridges text detection and recognition by injecting detection features into the recognition stage.
  4. Attention-based Recognizer: Implements a two-level self-attention mechanism to improve sequence modeling.

    Figure 1: The framework of the proposed SwinTextSpotter. The gray arrows denote the feature extraction from images. The green arrows and orange arrows represent the detection and recognition stages, respectively.
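
To make the data flow concrete, here is a minimal PyTorch-style sketch of the four-component pipeline. The module interfaces and names are illustrative assumptions, not the classes used in the released code.

```python
import torch.nn as nn

class SwinTextSpotterSketch(nn.Module):
    """Hypothetical sketch of the four-component pipeline in Figure 1."""

    def __init__(self, backbone, detector, rc, recognizer):
        super().__init__()
        self.backbone = backbone      # Dilated Swin-Transformer + FPN
        self.detector = detector      # query-based set-prediction detector
        self.rc = rc                  # Recognition Conversion module
        self.recognizer = recognizer  # attention-based recognizer

    def forward(self, images):
        # 1. Extract multi-scale features (gray arrows in Figure 1).
        feats = self.backbone(images)
        # 2. Refine proposal boxes and proposal features (detection stage).
        boxes, det_feats = self.detector(feats)
        # 3. RC injects detection features into the recognition stage,
        #    letting recognition gradients also reach the detector.
        rec_feats = self.rc(feats, det_feats, boxes)
        # 4. Decode character sequences (recognition stage).
        texts = self.recognizer(rec_feats)
        return boxes, texts
```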

Key Components and Methodologies

Dilated Swin-Transformer (Figure 2)

The backbone features a Dilated Swin-Transformer that incorporates dilated convolutions for larger receptive fields, enabling better handling of dense and arbitrarily shaped text instances.

Figure 2: Illustration of the designed Dilated Swin-Transformer. The DC refers to two dilated convolution layers, a vanilla convolution layer, and a residual structure.
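
As a concrete reading of the caption, the following is a minimal PyTorch sketch of one DC unit, assuming 3x3 kernels, a dilation rate of 2, and ReLU activations (details the summary does not specify):

```python
import torch.nn as nn

class DCBlock(nn.Module):
    """Sketch of the DC unit from Figure 2: two dilated convolutions,
    one vanilla convolution, and a residual connection. Kernel sizes,
    dilation rate, and activation are assumptions."""

    def __init__(self, channels, dilation=2):
        super().__init__()
        self.dilated1 = nn.Conv2d(channels, channels, kernel_size=3,
                                  padding=dilation, dilation=dilation)
        self.dilated2 = nn.Conv2d(channels, channels, kernel_size=3,
                                  padding=dilation, dilation=dilation)
        self.vanilla = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.act(self.dilated1(x))     # enlarge receptive field
        out = self.act(self.dilated2(out))   # enlarge it further
        out = self.vanilla(out)              # vanilla convolution
        return self.act(out + x)             # residual structure
```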

Recognition Conversion

The RC module plays a pivotal role by:

  • Generating masks using detection features to suppress background noise.
  • Enabling gradients from the recognition loss to flow back into the detection features, improving both detection precision and recognition accuracy (a sketch follows Figure 3).

    Figure 3: Detailed structure of Recognition Conversion.
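
A single-scale simplification of this idea in PyTorch: a sigmoid mask predicted from detection features gates the recognition input, so background is suppressed and recognition gradients flow into the detection branch. The 1x1 mask head is an illustrative assumption, and the sketch collapses the module to a single feature scale.

```python
import torch
import torch.nn as nn

class RecognitionConversionSketch(nn.Module):
    """Single-scale sketch of Recognition Conversion (Figure 3)."""

    def __init__(self, channels):
        super().__init__()
        # Hypothetical 1x1 mask head over detection features.
        self.to_mask = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, det_feats, rec_feats):
        # Soft text/background mask in [0, 1], predicted from
        # detection features.
        mask = torch.sigmoid(self.to_mask(det_feats))
        # Gating suppresses background; because `mask` depends on
        # det_feats, the recognition loss back-propagates through
        # this product into the detection branch.
        return rec_feats * mask
```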

Query-based Detection

The detector casts text localization as a set-prediction problem: a Transformer encoder with a dynamic head refines a set of proposal features and proposal boxes over multiple stages, as sketched below, improving robustness across text scales and aspect ratios.
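
The refinement loop can be pictured as follows, in the spirit of query-based detectors such as Sparse R-CNN. The stage interface is hypothetical, and torchvision's roi_align stands in for whatever RoI feature extraction the implementation actually uses.

```python
from torchvision.ops import roi_align

def refine_proposals(stages, fpn_feats, boxes, proposal_feats):
    """Multi-stage refinement sketch for a query-based detector.

    stages: list of dynamic-head modules (hypothetical interface);
    fpn_feats: one FPN level, shape (1, C, H, W) (batch size 1 here);
    boxes: (N, 4) proposal boxes in (x1, y1, x2, y2) coordinates;
    proposal_feats: (N, C) learnable per-proposal features.
    """
    for stage in stages:
        # Pool features inside the current boxes (the list wraps the
        # single image in this batch-size-1 sketch).
        roi_feats = roi_align(fpn_feats, [boxes], output_size=7)
        # The dynamic head uses each proposal feature as an
        # instance-specific filter over its pooled RoI features,
        # then emits refined boxes and updated proposal features.
        boxes, proposal_feats = stage(roi_feats, proposal_feats, boxes)
    return boxes, proposal_feats
```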

Experimental Evaluations

SwinTextSpotter demonstrated its effectiveness across various datasets:

  • ICDAR 2015 & RoIC13: Achieved superior F-measure in the strong-lexicon setting and robustness to rotated text.
  • ReCTS & VinText: Showed significant gains in multilingual text spotting without character-level annotations.
  • Total-Text & SCUT-CTW1500: Outperformed existing methods in both detection and spotting tasks, although challenges remain for long, arbitrarily-shaped text.

Limitations and Future Work

Despite its successes, SwinTextSpotter exhibits limitations on long, arbitrarily shaped text instances, most notably on SCUT-CTW1500. Future work could explore higher-resolution feature extraction to address recognition failures on such complex text shapes.

Conclusion

SwinTextSpotter advances the field of scene text spotting by harnessing the power of Transformer networks and Recognition Conversion to synergize text detection and recognition tasks. This unified approach not only simplifies the framework by eliminating the need for rectification modules but also sets new benchmarks across multiple public datasets, showcasing the potential of integrated text spotting systems.
