Light-Head R-CNN: In Defense of Two-Stage Object Detector (1711.07264v2)

Published 20 Nov 2017 in cs.CV

Abstract: In this paper, we first investigate why typical two-stage methods are not as fast as single-stage, fast detectors like YOLO and SSD. We find that Faster R-CNN and R-FCN perform an intensive computation after or before RoI warping. Faster R-CNN involves two fully connected layers for RoI recognition, while R-FCN produces large score maps. Thus, the speed of these networks is slow due to the heavy-head design in the architecture. Even if we significantly reduce the base model, the computation cost cannot be largely decreased accordingly. We propose a new two-stage detector, Light-Head R-CNN, to address the shortcoming in current two-stage approaches. In our design, we make the head of the network as light as possible, by using a thin feature map and a cheap R-CNN subnet (pooling and a single fully-connected layer). Our ResNet-101 based Light-Head R-CNN outperforms state-of-the-art object detectors on COCO while keeping time efficiency. More importantly, simply replacing the backbone with a tiny network (e.g., Xception), our Light-Head R-CNN gets 30.7 mmAP at 102 FPS on COCO, significantly outperforming the single-stage, fast detectors like YOLO and SSD on both speed and accuracy. Code will be made publicly available.

Summary

The paper presents a novel lightweight head architecture that reduces computational load in two-stage object detectors.
It employs large-kernel separable convolutions to produce a thin feature map and a single fully connected layer in the R-CNN subnet, cutting the computation performed around RoI warping.
Benchmark results on COCO show that Light-Head R-CNN with a ResNet-101 backbone outperforms state-of-the-art detectors, while a tiny Xception-like backbone reaches 30.7 mmAP at 102 FPS.

Light-Head R-CNN: Advancements in Two-Stage Object Detectors

The paper "Light-Head R-CNN: In Defense of Two-Stage Object Detector" presents an architecture that addresses the computational inefficiency of traditional two-stage object detectors without compromising accuracy. Two-stage detectors such as Faster R-CNN and R-FCN have typically been slower than single-stage detectors such as YOLO and SSD because of their heavy heads: Faster R-CNN runs two fully connected layers for every RoI, while R-FCN produces large score maps. The paper notes that shrinking the backbone alone does not remove this cost. Light-Head R-CNN instead slims the head itself, which yields faster inference without sacrificing precision.

Technical Innovations and Architectural Design

Two-stage detectors traditionally suffer from a computationally intensive head, which limits speed. Light-Head R-CNN addresses this with a light-head design: the head of the network consists of a thin feature map and a cheap R-CNN subnet (pooling followed by a single fully connected layer), which reduces the per-RoI cost of typical two-stage methods.
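To make "thin feature map" concrete, the following minimal sketch compares the channel count of an R-FCN-style position-sensitive score map with that of a thin map. The numbers are illustrative assumptions, not figures stated in this summary: 81 classes (80 COCO categories plus background), a 7x7 pooling grid, and 10 channels per pooling bin for the thin map. The function name is hypothetical.

```python
def position_sensitive_channels(channels_per_bin: int, pool_size: int) -> int:
    """Channels of a position-sensitive map: one channel group per pooling bin."""
    return channels_per_bin * pool_size * pool_size


POOL_SIZE = 7        # assumed p x p pooling grid
NUM_CLASSES = 81     # assumed: 80 COCO categories + background
THIN_PER_BIN = 10    # assumed channels per bin for the thin feature map

# Heavy head: the score map needs one channel per class for every bin.
heavy_channels = position_sensitive_channels(NUM_CLASSES, POOL_SIZE)

# Light head: the map is class-agnostic, so its width no longer grows with
# the number of classes.
thin_channels = position_sensitive_channels(THIN_PER_BIN, POOL_SIZE)

print(f"score-map channels (heavy): {heavy_channels}")
print(f"feature-map channels (thin): {thin_channels}")
print(f"reduction factor: {heavy_channels / thin_channels:.1f}x")
```

Under these assumptions the thin map is roughly eight times narrower, and its size stays fixed as the number of classes grows. Classification is then handled by the fully connected layer that follows pooling.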

A key component of this design is a large-kernel separable convolution that generates a thin feature map with a small number of channels. Because the map is thin, the computation performed around Region of Interest (RoI) warping, the main bottleneck of two-stage detectors, stays small. The cost is reduced further by using a single fully connected layer in the R-CNN subnet.
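The saving from a separable large kernel can be checked with simple weight counting. The sketch below assumes a two-branch layout (a k x 1 convolution followed by 1 x k, in parallel with 1 x k followed by k x 1) and compares it with a dense k x k convolution producing the same thin map. All sizes (2048 input channels, 256 middle channels, 490 output channels, k = 15, a 2048-wide fully connected layer) are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def conv_params(c_in: int, c_out: int, k_h: int, k_w: int) -> int:
    """Weight count of a 2D convolution (bias terms ignored)."""
    return c_in * c_out * k_h * k_w


def separable_params(c_in: int, c_mid: int, c_out: int, k: int) -> int:
    """Two parallel branches: (k x 1 -> 1 x k) and (1 x k -> k x 1)."""
    branch_a = conv_params(c_in, c_mid, k, 1) + conv_params(c_mid, c_out, 1, k)
    branch_b = conv_params(c_in, c_mid, 1, k) + conv_params(c_mid, c_out, k, 1)
    return branch_a + branch_b


def fc_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


C_IN, C_MID, C_OUT, K = 2048, 256, 490, 15   # assumed sizes
FC_DIM = 2048                                # assumed width of the single FC layer

dense = conv_params(C_IN, C_OUT, K, K)
separable = separable_params(C_IN, C_MID, C_OUT, K)
single_fc = fc_params(C_OUT, FC_DIM)   # input: pooled thin features, flattened

print(f"dense {K}x{K} conv weights:  {dense:,}")
print(f"separable conv weights:    {separable:,}")
print(f"single FC layer params:    {single_fc:,}")
```

With these assumed sizes, the separable form needs about one eleventh of the weights of the dense kernel while keeping the same 15 x 15 receptive field, and the single fully connected layer holds about one million parameters.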

Performance Benchmarks

Light-Head R-CNN improves on existing methods at both ends of the speed-accuracy trade-off. With a ResNet-101 backbone, it outperforms state-of-the-art detectors on the COCO dataset while remaining time-efficient. When the backbone is replaced with a tiny Xception-like network, Light-Head R-CNN reaches 30.7 mmAP (the COCO-style average precision reported in the paper) at 102 frames per second (FPS), outperforming single-stage detectors such as YOLO and SSD in both speed and accuracy.

Comparative Analysis

The paper evaluates Light-Head R-CNN against other contemporary detectors. Compared with popular single-stage detectors such as YOLO and SSD, Light-Head R-CNN is both faster and more accurate. It is also more computationally efficient than other two-stage methods such as Faster R-CNN and Mask R-CNN while achieving comparable or better precision.

Implications and Future Directions

The findings have several implications for the design of object detection frameworks. By reducing the computational load without losing accuracy, Light-Head R-CNN presents a compelling argument for the potential of streamlined two-stage detectors in real-time applications.

Theoretically, this research showcases the importance of efficiently designed network architectures that prioritize computational simplicity while preserving feature richness. Practically, it paves the way for deploying high-performance object detectors in resource-constrained environments such as mobile devices and edge computing platforms.

Looking forward, the strategies employed in Light-Head R-CNN could inspire further research into optimizing other components of detection networks, potentially incorporating adaptive processing pipelines or exploring new forms of data augmentation to maximize learning efficiency.

In conclusion, Light-Head R-CNN stands as an innovative contribution to the evolution of detection architectures, balancing the competing demands of speed and accuracy in object detection. The architecture's design principles could serve as a foundation for future research endeavors in the ongoing effort to bridge the gap between computational efficiency and predictive performance in computer vision tasks.