DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation (2011.09876v3)

Published 19 Nov 2020 in cs.CV

Abstract: Binary grid mask representation is broadly used in instance segmentation. A representative instantiation is Mask R-CNN which predicts masks on a $28\times 28$ binary grid. Generally, a low-resolution grid is not sufficient to capture the details, while a high-resolution grid dramatically increases the training complexity. In this paper, we propose a new mask representation by applying the discrete cosine transform (DCT) to encode the high-resolution binary grid mask into a compact vector. Our method, termed DCT-Mask, could be easily integrated into most pixel-based instance segmentation methods. Without any bells and whistles, DCT-Mask yields significant gains on different frameworks, backbones, datasets, and training schedules. It does not require any pre-processing or pre-training, and does almost no harm to the running speed. Especially, for higher-quality annotations and more complex backbones, our method has a greater improvement. Moreover, we analyze the performance of our method from the perspective of the quality of mask representation. The main reason why DCT-Mask works well is that it obtains a high-quality mask representation with low complexity. Code is available at https://github.com/aliyun/DCT-Mask.git.

Summary

The paper introduces a novel DCT-based mask representation that transforms high-resolution masks into compact frequency domain vectors, improving boundary detail.
The approach integrates with Mask R-CNN by converting mask prediction into a regression task for DCT vectors, maintaining high IoU scores with reduced complexity.
Experiments on COCO and Cityscapes demonstrate that DCT-Mask delivers higher AP scores and improved mask quality, supporting real-time vision applications.

Discrete Cosine Transform Mask Representation for Instance Segmentation

Introduction

The paper "DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation" introduces a novel approach to instance segmentation that uses the discrete cosine transform (DCT) to encode high-resolution binary grid masks into compact vectors. The method targets a trade-off in existing binary grid mask representations, such as the $28\times 28$ grid used in Mask R-CNN: a low-resolution grid fails to capture fine detail, while a high-resolution grid substantially increases training complexity.

DCT-Mask Representation

In standard instance segmentation methods like Mask R-CNN, masks are predicted on a fixed low-resolution binary grid, which can blur or lose fine detail along object boundaries. The proposed DCT-Mask representation addresses this by transforming a high-resolution mask into the frequency domain via DCT and keeping only the low-frequency components, which form a compact vector.
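The encoding and decoding steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`dct_matrix`, `zigzag_order`, `encode_mask`, `decode_mask`), the JPEG-style zigzag ordering used to rank coefficients from low to high frequency, and the 0.5 decoding threshold are all choices made here for illustration.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis vector."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def zigzag_order(n):
    """(row, col) pairs of an n x n grid, ordered from low to high frequency.

    Assumption: a JPEG-style zigzag is used to rank the coefficients.
    """
    return sorted(
        ((u, v) for u in range(n) for v in range(n)),
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]),
    )

def encode_mask(mask, n_coeffs):
    """Encode a square binary mask as its first n_coeffs DCT coefficients."""
    n = mask.shape[0]
    c = dct_matrix(n)
    coeffs = c @ mask.astype(np.float64) @ c.T  # 2D DCT-II
    rows, cols = zip(*zigzag_order(n)[:n_coeffs])
    return coeffs[list(rows), list(cols)]

def decode_mask(vector, n, threshold=0.5):
    """Zero-fill the missing coefficients, invert the DCT, and binarize."""
    c = dct_matrix(n)
    coeffs = np.zeros((n, n))
    rows, cols = zip(*zigzag_order(n)[:len(vector)])
    coeffs[list(rows), list(cols)] = vector
    return (c.T @ coeffs @ c) >= threshold  # inverse 2D DCT, then threshold

# Toy usage: a 16 x 16 mask containing a rectangle.
mask = np.zeros((16, 16), dtype=bool)
mask[4:12, 3:10] = True
vector = encode_mask(mask, 40)        # compact 40-dimensional vector
approx = decode_mask(vector, 16)      # lossy reconstruction
exact = decode_mask(encode_mask(mask, 256), 16)  # all coefficients kept
```

Because the basis is orthonormal, keeping every coefficient reproduces the mask exactly; truncating the vector trades boundary detail for a smaller output.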

Figure 1: Binary grid mask representation vs. DCT mask representation. The leftmost sub-graph is the ground truth, the left-middle is the mask representation, the middle right is the reconstructed mask, the rightmost is the error between the reconstructed mask and the ground truth.

DCT's "energy compaction" property concentrates most of a mask's information in a small number of low-frequency coefficients, so object boundary detail is preserved in a much shorter vector. The experiments show that a 300-dimensional DCT vector computed from a $128 \times 128$ grid still reconstructs masks with a high Intersection over Union (IoU), while requiring far fewer outputs than the $128 \times 128 = 16{,}384$ values of a direct high-resolution grid.
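To make the energy-compaction argument concrete, the sketch below measures how much of a mask's energy the 300 lowest-frequency coefficients retain and the IoU of the resulting reconstruction. It is an illustration only: the synthetic disk mask, the helper names, the ranking of coefficients by the index sum `u + v`, and the 0.5 threshold are assumptions made here, and the printed numbers describe this toy mask rather than the paper's COCO results.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def iou(a, b):
    """Intersection over Union of two boolean masks."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def low_freq_reconstruction(mask, n_coeffs):
    """Keep the n_coeffs lowest-frequency coefficients and reconstruct.

    Returns the binarized reconstruction and the fraction of energy kept.
    """
    n = mask.shape[0]
    c = dct_matrix(n)
    coeffs = c @ mask.astype(np.float64) @ c.T
    u, v = np.indices((n, n))
    # Assumption: rank coefficients by u + v (low frequencies first).
    order = np.argsort((u + v).ravel(), kind="stable")
    keep = np.zeros(n * n, dtype=bool)
    keep[order[:n_coeffs]] = True
    kept = np.where(keep.reshape(n, n), coeffs, 0.0)
    energy_kept = (kept ** 2).sum() / (coeffs ** 2).sum()
    recon = (c.T @ kept @ c) >= 0.5
    return recon, energy_kept

# Synthetic 128 x 128 test mask: a disk (hypothetical data, not from COCO).
yy, xx = np.mgrid[:128, :128]
disk = (yy - 64) ** 2 + (xx - 60) ** 2 <= 40 ** 2

recon, energy = low_freq_reconstruction(disk, 300)
print(f"outputs: 300 vs {128 * 128}")
print(f"energy kept: {energy:.4f}, IoU: {iou(disk, recon):.4f}")
```

With this ranking, the first 300 coefficients are exactly those with `u + v <= 23`, since $24 \cdot 25 / 2 = 300$, so the kept set is the low-frequency triangle of the coefficient grid.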

Integration with Mask R-CNN

Integrating DCT-Mask requires only minor modifications to the Mask R-CNN architecture. Specifically, the mask branch predicts DCT vectors instead of binary grids: fully connected layers are added after the convolutional feature-extraction layers to output the vector.

Figure 2: The pipeline of DCT mask representation.

Figure 3: The implementation of DCT-Mask on the basis of Mask R-CNN.

With this change, mask prediction becomes a regression problem over the compact DCT vector, which yields a high-quality mask representation with almost no impact on running speed.
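The regression formulation can be sketched as a small fully connected head that maps RoI features to a DCT vector. Everything specific here is an assumption for illustration: the layer count and widths, the feature shape, the He-style initialization, and the L1 loss are hypothetical choices, not details confirmed by this summary. The one structural point worth noting is that the final layer has no activation, since DCT coefficients can be negative.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_fc(in_dim, out_dim):
    """Randomly initialized weights and zero bias for one FC layer."""
    w = rng.normal(0.0, np.sqrt(2.0 / in_dim), size=(in_dim, out_dim))
    return w, np.zeros(out_dim)

def dct_head_forward(roi_features, layers):
    """Flatten RoI features and apply FC layers with ReLU in between.

    The last layer is linear because DCT coefficients are unbounded and signed.
    """
    x = roi_features.reshape(roi_features.shape[0], -1)
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x

def l1_loss(pred, target):
    """Mean absolute error between predicted and target DCT vectors."""
    return np.abs(pred - target).mean()

# Hypothetical sizes, kept small so the sketch runs quickly.
batch, channels, size, hidden, n_coeffs = 4, 16, 14, 256, 300
layers = [
    init_fc(channels * size * size, hidden),
    init_fc(hidden, hidden),
    init_fc(hidden, n_coeffs),
]

roi_features = rng.normal(size=(batch, channels, size, size))  # stand-in conv output
target_vectors = rng.normal(size=(batch, n_coeffs))            # stand-in DCT targets

pred = dct_head_forward(roi_features, layers)
loss = l1_loss(pred, target_vectors)
print(pred.shape, float(loss))
```

At inference time, a predicted vector of this kind would be decoded back into a mask with an inverse DCT, so the head's output size is set by the number of retained coefficients rather than by the mask resolution.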

Experimental Validation

The paper evaluates the method on the COCO and Cityscapes datasets and reports that DCT-Mask improves mask AP across different frameworks, backbones, and training schedules. The gains are larger with higher-quality annotations and more complex backbones.

Figure 4: Mean value and Variance of DCT mask vectors on COCO 2017 val dataset.

DCT-Mask runs at nearly the same speed as the standard low-resolution binary grid approach while producing higher-quality masks.

Implications and Future Directions

The DCT-Mask approach is relevant to computer vision tasks that require high-fidelity object delineation, such as autonomous driving and robotic perception. Because it improves mask quality with almost no loss in running speed, it is attractive for real-time applications.

Future research may explore using DCT-Mask with other instance segmentation frameworks or extending it to other datasets and tasks that require high-resolution spatial representations. Further work could also tune the DCT parameterization, such as the mask resolution and the vector dimension, to reduce computational overhead while maintaining or improving mask quality.

Conclusion

"DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation" presents an effective solution to the challenges associated with mask representations in instance segmentation. By leveraging DCT, the method provides a high-quality, low-complexity alternative that integrates seamlessly with existing frameworks, offering a promising tool for both academic exploration and practical deployment in complex visual environments.