- The paper presents a novel two-branch architecture that separates prototype mask generation and mask coefficient prediction for efficient instance segmentation.
- It eliminates the feature repooling used by traditional two-stage methods, and its Fast NMS variant reduces inference time by about 12 ms relative to standard NMS with minimal performance loss.
- The model achieves 29.8 mAP at 33.5 fps on MS COCO, enabling real-time applications in autonomous driving, robotics, and video analysis.
Overview of YOLACT: Real-time Instance Segmentation
The paper presents YOLACT, a real-time instance segmentation framework notable for its balance of accuracy and speed. The authors, Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee from the University of California, Davis, introduce an architecture that departs from conventional two-stage methods, achieving significant acceleration while maintaining competitive accuracy. Evaluated on the MS COCO dataset, YOLACT achieves 29.8 mAP at 33.5 fps on a single Titan Xp GPU, significantly faster than prior methods, which largely trade speed for accuracy.
Key Contributions
- Two-Branch Architecture: YOLACT simplifies the instance segmentation task by dividing it into two parallel subtasks:
- Prototype Generation: Produces a set of image-sized prototype masks.
- Mask Coefficients Prediction: Predicts coefficients for per-instance masks.
These prototypes are then combined linearly, weighted by the per-instance coefficients, and passed through a sigmoid to produce the final instance masks (see the sketch after this list).
- Elimination of Repooling: Unlike traditional approaches that rely heavily on feature repooling for mask generation (e.g., Mask R-CNN), YOLACT avoids an explicit localization step, leading to both high-quality masks and enhanced temporal stability in video sequences.
- Fast Non-Maximum Suppression (Fast NMS): The authors propose Fast NMS, a variant that runs entirely in parallel on the GPU, cutting inference time by approximately 12 ms relative to standard NMS while incurring only a marginal performance loss (sketched in code after this list).
- Real-Time Speed: The architecture's lightweight nature allows it to achieve real-time performance, setting a new benchmark for one-stage instance segmentation frameworks.
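The combination step in the two-branch design is just a matrix multiply followed by a sigmoid, M = sigmoid(P C^T), where P holds the k image-sized prototypes and C holds one k-vector of coefficients per instance. The minimal PyTorch sketch below (tensor names are illustrative; the paper's additional crop-to-box step is omitted) shows the assembly:

```python
import torch

def assemble_masks(prototypes: torch.Tensor, coefficients: torch.Tensor) -> torch.Tensor:
    """Linearly combine prototype masks with per-instance coefficients.

    prototypes:   (h, w, k) tensor of k image-sized prototype masks
    coefficients: (n, k) tensor of mask coefficients, one row per instance
    returns:      (h, w, n) tensor of instance masks in [0, 1]
    """
    # M = sigmoid(P @ C^T): each instance mask is a weighted sum of prototypes.
    return torch.sigmoid(prototypes @ coefficients.t())
```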
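Fast NMS itself admits an equally compact sketch. The version below is a minimal, class-agnostic PyTorch rendering of the idea (the paper applies it per class, batched across classes): sort detections by score, build the pairwise IoU matrix, keep only its upper triangle so each detection is compared only against higher-scoring ones, and drop any detection whose maximum such IoU exceeds the threshold.

```python
import torch

def pairwise_iou(boxes: torch.Tensor) -> torch.Tensor:
    # boxes: (n, 4) in (x1, y1, x2, y2) format; returns an (n, n) IoU matrix.
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    lt = torch.max(boxes[:, None, :2], boxes[None, :, :2])  # intersection top-left
    rb = torch.min(boxes[:, None, 2:], boxes[None, :, 2:])  # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area[:, None] + area[None, :] - inter)

def fast_nms(boxes: torch.Tensor, scores: torch.Tensor,
             iou_threshold: float = 0.5) -> torch.Tensor:
    # Sort detections by confidence, highest first.
    scores, order = scores.sort(descending=True)
    iou = pairwise_iou(boxes[order])
    # Keep only the upper triangle: entry (i, j) with i < j is the IoU of
    # detection j with the higher-scoring detection i.
    iou = iou.triu(diagonal=1)
    # Suppress any detection that overlaps a higher-scoring one too much.
    # Note: already-suppressed detections can still suppress others here;
    # this approximation is the source of Fast NMS's slight accuracy drop.
    keep = iou.max(dim=0).values <= iou_threshold
    return order[keep]  # indices into the original (unsorted) detections
```

Unlike standard sequential NMS, no decision here depends on the outcome of an earlier suppression decision, which is what lets the whole computation reduce to a handful of parallel matrix operations on the GPU.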
Numerical Results and Performance
YOLACT's competitive edge shows directly in its headline numbers:
- Achieving 29.8 mAP on the MS COCO dataset at a frame rate of 33.5 fps when using a ResNet-101 backbone.
- Fast NMS trims roughly 12 ms per image relative to standard NMS, allowing YOLACT to sustain high frame rates with only a small accuracy trade-off.
YOLACT also clearly outpaces contemporary methods on speed:
- Mask R-CNN runs at only 8.6 fps with a ResNet-101 backbone, whereas YOLACT reaches 33.5 fps with the same backbone.
- Swapping backbones (ResNet-50, DarkNet-53) and input resolutions (400, 550, or 700 pixels) trades accuracy against speed, making YOLACT adaptable to a range of application requirements.
Practical and Theoretical Implications
Practical Implications:
- Real-time Applications: YOLACT's real-time performance opens up practical applications in areas such as autonomous driving, robotic vision, and real-time video analysis where speed is critical.
- High-Quality Masks: The elimination of repooling ensures that the masks produced are of higher quality, especially for large objects, with clear boundaries and minimal noise.
Theoretical Implications:
- Instance Segmentation Approaches: The approach of using separate prototype generation and mask coefficients prediction shows that highly efficient and accurate instance segmentation can be achieved without traditional feature repooling.
- Translation Variance in FCNs: The work demonstrates that an FCN can implicitly learn translation variance through zero padding and other internal cues, challenging the belief that explicit translation-variant operations must be reintroduced for accurate instance segmentation (a toy illustration follows below).
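The padding point is easy to verify concretely. In the toy PyTorch example below (not from the paper), a single zero-padded convolution applied to a perfectly uniform input already yields position-dependent activations, so a stack of such layers has access to absolute image location:

```python
import torch
import torch.nn as nn

# A constant image carries no positional signal, yet a zero-padded convolution
# responds differently at corners, edges, and interior, leaking location.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
nn.init.constant_(conv.weight, 1.0)  # all-ones 3x3 kernel for readability

x = torch.ones(1, 1, 5, 5)  # featureless, translation-invariant input
with torch.no_grad():
    y = conv(x)[0, 0]

print(y)
# Interior pixels sum nine ones -> 9.0; edge pixels see a row of zero
# padding -> 6.0; corners see two -> 4.0. The output depends on position
# even though the input is uniform everywhere.
```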
Future Developments in AI
YOLACT opens several avenues for future research and development:
- Extended Prototype Space: Exploring a more extensive or adaptive prototype space could further improve instance segmentation accuracy.
- Integration with Object Detectors: Combining YOLACT with state-of-the-art object detectors like YOLOv3 could yield even more efficient and accurate hybrid models.
- Temporal Consistency: Augmenting YOLACT with techniques for explicit temporal consistency could enhance performance in video-related applications even further.
- Hardware Optimization: Custom hardware accelerations, perhaps through FPGA or ASIC implementations, could take advantage of YOLACT's inherently parallelizable architecture.
In summary, YOLACT sets a new standard for real-time instance segmentation, combining innovative methodological advances with practical, high-speed performance. It serves as a significant step towards faster, more efficient AI systems capable of performing complex visual tasks in real-time scenarios.