Video Text Tracking for Dense and Small Text Based on PP-YOLOE-R and SORT Algorithm

Published 31 Mar 2023 in cs.CV | (2304.00018v1)

Abstract: Although end-to-end video text spotting methods based on Transformers can model long-range dependencies and simplify the training process, they incur a large computation cost as the frame size of the input video increases. The resolution of ICDAR 2023 DSText is 1080 × 1920, and slicing each video frame into several regions would destroy the spatial correlation of the text. We therefore divide small and dense text spotting into two tasks: text detection and text tracking. For text detection, we adopt PP-YOLOE-R, which has proven effective in small object detection, as our detection model. For text tracking, we use the SORT algorithm for its high inference speed. Experiments on the DSText dataset demonstrate that our method is competitive on small and dense text spotting.

Summary

The paper introduces a two-stage pipeline combining PP-YOLOE-R for robust small text detection and SORT for efficient tracking.
The PP-YOLOE-R detector has a reported mAP of 78.14 on the DOTA 1.0 aerial-image dataset, which motivates its use for small, dense text.
The research reduces computational overhead by decomposing text spotting into detection and tracking, and uses data augmentation to improve robustness to varied text orientations.

Video Text Tracking for Dense and Small Text Based on PP-YOLOE-R and Sort Algorithm

The paper "Video Text Tracking for Dense and Small Text Based on PP-YOLOE-R and Sort Algorithm" presents a two-stage approach to detecting and tracking dense, small text in high-resolution video. Video text spotting of this kind demands precision because the text is small, densely packed, and often obscured by image artifacts.

A central challenge identified in this work is the computational burden of end-to-end video text spotting models, especially Transformer-based ones. These models capture long-range dependencies well, but their cost becomes prohibitive as the resolution of the input frames increases. DSText frames are 1080 × 1920, and slicing them into smaller regions to cut the cost would destroy the spatial correlation of the text. The authors therefore propose a two-stage pipeline that avoids the heavy computation without slicing the frames.
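To make the resolution argument concrete, the following is a rough back-of-the-envelope sketch, not a figure from the paper. It assumes a hypothetical Transformer that splits each frame into 16 × 16 patches (padding partial patches) and applies full self-attention, whose pairwise score matrix grows with the square of the token count. The patch size and the half-resolution comparison frame are illustrative assumptions.

```python
def attention_cost(height, width, patch=16):
    """Return (tokens, pairwise_scores) for one frame.

    `patch` is a hypothetical patch size chosen for illustration;
    the paper does not specify a tokenization scheme.
    """
    # Ceiling division: partial patches at the border are padded to full size.
    rows = -(-height // patch)
    cols = -(-width // patch)
    tokens = rows * cols
    # Full self-attention scores every token against every other token.
    return tokens, tokens * tokens


half_tokens, half_pairs = attention_cost(540, 960)
full_tokens, full_pairs = attention_cost(1080, 1920)

print("540 x 960  :", half_tokens, "tokens,", half_pairs, "attention scores")
print("1080 x 1920:", full_tokens, "tokens,", full_pairs, "attention scores")
print("token ratio:", full_tokens / half_tokens)
print("score ratio:", full_pairs / half_pairs)
```

Under these assumptions, doubling each side of the frame multiplies the token count by 4 and the number of attention scores by 16, which is the kind of growth that makes full-frame end-to-end models expensive at 1080 × 1920.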

The methodology is structured into two tasks: text detection and text tracking. For text detection, the study employs PP-YOLOE-R, an efficient anchor-free rotated object detector that has proven effective in small object detection. The model has a reported mean Average Precision (mAP) of 78.14 on DOTA 1.0, a widely used benchmark for object detection in aerial images, where objects are often small and arbitrarily oriented.
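A rotated detector describes each text instance by an oriented box, whereas the standard SORT formulation tracks axis-aligned boxes. The paper does not spell out how the two are bridged, so the sketch below is only one plausible conversion. It assumes a `(cx, cy, w, h, angle)` parameterization with the angle in radians; the detector's actual output format may differ, and both function names are hypothetical.

```python
import math


def rotated_box_to_corners(cx, cy, w, h, angle):
    """Return the four corners of a rotated box.

    Assumed format (not taken from the paper): centre (cx, cy), size (w, h),
    and `angle` in radians. With the image y-axis pointing down, a positive
    angle appears as a clockwise rotation on screen.
    """
    c, s = math.cos(angle), math.sin(angle)
    half_offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each corner offset, then translate it to the box centre.
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half_offsets]


def corners_to_aabb(corners):
    """Return the tightest axis-aligned box (x1, y1, x2, y2) around the corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


corners = rotated_box_to_corners(100.0, 50.0, 40.0, 10.0, math.radians(30))
print(corners_to_aabb(corners))
```

The enclosing axis-aligned box is looser than the rotated one, so for densely packed, slanted text the boxes of neighbouring instances overlap more than the original detections do. That is a trade-off of this particular conversion rather than a property of the authors' method.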

For the tracking component, the authors use the SORT algorithm, chosen for its simplicity and fast inference in multiple object tracking. The combined pipeline, PP-YOLOE-R for detection and SORT for tracking, is evaluated on the ICDAR 2023 DSText dataset, where the authors report results that are competitive on small and dense text. The dataset covers a diverse range of scenarios, which makes it a broad benchmark for the proposed method.
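The core idea of tracking by detection can be sketched in a few lines. The example below is a deliberately simplified stand-in, not the authors' implementation and not full SORT: SORT predicts each track's next position with a Kalman filter and solves the track-to-detection assignment with the Hungarian algorithm, while this sketch skips motion prediction and matches greedily by IoU. The class name and the `iou_threshold` and `max_age` values are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIouTracker:
    """Simplified tracker: greedy IoU matching, no Kalman filter (unlike SORT)."""

    def __init__(self, iou_threshold=0.3, max_age=1):
        self.iou_threshold = iou_threshold  # minimum overlap to accept a match
        self.max_age = max_age              # frames a track may go unmatched
        self.tracks = {}                    # track id -> (last box, frames unmatched)
        self.next_id = 1

    def update(self, detections):
        """Assign a track id to every detection of the current frame."""
        # Score every (track, detection) pair and visit the best overlaps first.
        pairs = sorted(
            ((iou(box, det), tid, j)
             for tid, (box, _) in self.tracks.items()
             for j, det in enumerate(detections)),
            reverse=True,
        )
        matched_tracks, det_to_track = set(), {}
        for score, tid, j in pairs:
            if score < self.iou_threshold:
                break  # all remaining pairs overlap even less
            if tid in matched_tracks or j in det_to_track:
                continue  # each track and each detection is used at most once
            matched_tracks.add(tid)
            det_to_track[j] = tid

        # Age unmatched tracks and drop those that have been lost for too long.
        for tid in list(self.tracks):
            if tid not in matched_tracks:
                box, age = self.tracks[tid]
                if age + 1 > self.max_age:
                    del self.tracks[tid]
                else:
                    self.tracks[tid] = (box, age + 1)

        # Update matched tracks and start a new track for each unmatched detection.
        results = []
        for j, det in enumerate(detections):
            tid = det_to_track.get(j)
            if tid is None:
                tid = self.next_id
                self.next_id += 1
            self.tracks[tid] = (det, 0)
            results.append((tid, det))
        return results


tracker = GreedyIouTracker()
frame1 = tracker.update([(0, 0, 10, 10), (100, 100, 110, 110)])
frame2 = tracker.update([(101, 100, 111, 110), (1, 0, 11, 10)])
print([tid for tid, _ in frame1])
print([tid for tid, _ in frame2])
```

In the usage above, the two boxes shift by one pixel and swap order between frames, and each keeps the id it received in the first frame. With dense, small text the boxes are close together and move quickly relative to their size, which is where the motion prediction and optimal assignment of real SORT matter more than this sketch suggests.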

The experiments were run on Tesla V100 GPUs using the PaddlePaddle deep learning platform, on which the PP-YOLOE-R model was trained. Data augmentation included random image flips and random rotations, which improve the model's robustness to the varied text orientations encountered in video frames.
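Flips and rotations must be applied to the annotations as well as to the pixels. The sketch below covers only the annotation geometry for a polygon label and is not the training pipeline used in the paper. The function names, the flip probability and the angle range are assumptions made for illustration, and the same transform would also have to be applied to the image itself.

```python
import math
import random


def hflip_polygon(poly, image_width):
    """Mirror polygon points horizontally (continuous coordinates: x -> W - x)."""
    return [(image_width - x, y) for x, y in poly]


def rotate_polygon(poly, angle_deg, center):
    """Rotate polygon points about `center` by `angle_deg` degrees.

    With the image y-axis pointing down, a positive angle appears clockwise.
    """
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    cx, cy = center
    return [
        (cx + (x - cx) * c - (y - cy) * s, cy + (x - cx) * s + (y - cy) * c)
        for x, y in poly
    ]


def augment(poly, image_size, rng, flip_prob=0.5, max_angle=180.0):
    """Apply a random flip, then a random rotation about the image centre.

    `flip_prob` and `max_angle` are illustrative values, not from the paper.
    Returns the transformed polygon and the sampled angle.
    """
    width, height = image_size
    if rng.random() < flip_prob:
        poly = hflip_polygon(poly, width)
    angle = rng.uniform(-max_angle, max_angle)
    return rotate_polygon(poly, angle, (width / 2, height / 2)), angle


text_box = [(100, 100), (200, 100), (200, 130), (100, 130)]
new_box, angle = augment(text_box, (1920, 1080), random.Random(0))
print(round(angle, 2), [(round(x, 1), round(y, 1)) for x, y in new_box])
```

Rotated points can fall outside the frame, so a real pipeline would also need to clip or discard such annotations. That step is omitted here.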

The empirical results are supplemented by visualizations from several scenarios, including gaming, driving, and street views. These show consistent detection and tracking, with text instances keeping clearly identifiable tracks across consecutive frames.

In conclusion, the research argues for decomposing dense and small text spotting into manageable sub-tasks. This allows focused optimization of small object detection without semantic understanding of the text, which is computationally expensive and less effective in dense text scenarios. The approach streamlines text tracking in high-resolution video and leaves room to optimize each component separately. Future work may integrate more sophisticated tracking algorithms or apply the pipeline to other domains that involve tracking small objects over time.
