- The paper introduces CINFormer, a novel architecture that injects multi-stage CNN features into a transformer to enhance surface defect detection.
- It leverages a UNet-like structure and a Top-K self-attention module to focus on critical tokens and suppress background noise.
- Extensive experiments show that CINFormer outperforms traditional CNN and transformer models on challenging industrial datasets.
Introduction
The paper "CINFormer: Transformer Network with Multi-Stage CNN Feature Injection for Surface Defect Segmentation" proposes a novel approach for surface defect inspection in industrial processes. Despite advancements in deep learning-based defect detection, challenges persist due to indistinguishable weak defects and defect-like interference from backgrounds. The paper introduces CINFormer, a UNet-like architecture that integrates CNN features into a transformer network to enhance the segmentation of surface defects. This architecture leverages the strengths of CNNs in capturing detailed features and transformers in mitigating background noise, improving the accuracy of defect detection.
Figure 1: Comparison of feature visualization for CNN (a), transformer (b), and the proposed CINFormer (c). It can be observed that CINFormer can better focus on defect areas and suppress redundant background interference.
CINFormer is built upon a UNet-like structure in which the encoder integrates multi-level CNN features into different stages of a transformer network. The injected CNN features preserve the ability to capture fine-grained defect details, while the transformer component suppresses noise interference. Specifically, a CNN stem generates hierarchical features that are injected into successive transformer stages, retaining detailed information that would otherwise be lost for defect identification.
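The injection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each stage flattens a CNN feature map into tokens, applies a learned linear projection to match the transformer width, and adds the result to that stage's transformer tokens. All names, dimensions, and the additive fusion are hypothetical simplifications.

```python
import numpy as np

def inject_cnn_features(trans_tokens, cnn_feat, proj):
    """Flatten a CNN feature map into tokens, project it to the
    transformer width, and add it to the matching stage's tokens.
    trans_tokens: (H*W, C_trans), cnn_feat: (H, W, C_cnn),
    proj: (C_cnn, C_trans) hypothetical learned projection."""
    h, w, c_cnn = cnn_feat.shape
    cnn_tokens = cnn_feat.reshape(h * w, c_cnn)  # (H*W, C_cnn)
    return trans_tokens + cnn_tokens @ proj      # fused tokens, (H*W, C_trans)

# Toy multi-stage setup: two stages with different widths (values are illustrative).
rng = np.random.default_rng(0)
stage_dims = [32, 64]        # transformer channel widths per stage
cnn_channels = [16, 24]      # CNN feature channels per stage
tokens = [rng.normal(size=(64, d)) for d in stage_dims]           # 8x8 token grids
feats = [rng.normal(size=(8, 8, c)) for c in cnn_channels]        # CNN stem outputs
projs = [rng.normal(size=(c, d)) for c, d in zip(cnn_channels, stage_dims)]

fused = [inject_cnn_features(t, f, p) for t, f, p in zip(tokens, feats, projs)]
```

The key design point is that injection happens at every encoder stage rather than only at the input, so detail from the CNN stem reaches deep transformer layers directly.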
The incorporation of the Top-K self-attention module further refines this process by focusing on tokens with more critical information, effectively suppressing redundant background details. This module ranks tokens and channels based on their variance, retaining the most informative aspects for defect highlighting.
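The token-selection idea can be sketched as below. This is a simplified, hypothetical rendering of the Top-K mechanism under the assumptions that informativeness is measured by per-token variance across channels and that plain scaled dot-product attention is applied only among the retained tokens, leaving the rest unchanged; the real module also ranks channels and is trained end-to-end.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_self_attention(tokens, k):
    """Sketch of Top-K self-attention: rank tokens by channel variance,
    attend only among the k most informative ones, keep the rest as-is.
    tokens: (N, C) array; returns (updated tokens, kept indices)."""
    n, c = tokens.shape
    variance = tokens.var(axis=1)            # per-token variance across channels
    keep = np.argsort(variance)[::-1][:k]    # indices of the k highest-variance tokens
    sel = tokens[keep]                       # (k, C) selected tokens
    # Scaled dot-product attention restricted to the selected tokens.
    attn = softmax(sel @ sel.T / np.sqrt(c), axis=-1)
    out = tokens.copy()
    out[keep] = attn @ sel                   # update only the selected tokens
    return out, keep

tokens = np.random.default_rng(1).normal(size=(16, 8))
out, keep = topk_self_attention(tokens, k=4)
```

Restricting attention to the top-k tokens both cuts computation (attention cost drops from O(N²) to O(k²)) and prevents low-variance background tokens from diluting the attention weights.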
Figure 2: The architecture of the CINFormer, depicting its UNet-like encoder-decoder design with CNN feature injection and the Top-K self-attention mechanism.
Experimental Evaluation
Extensive experiments were conducted on datasets such as DAGM 2007, Magnetic Tile, and NEU to demonstrate CINFormer's efficacy. The results show consistent performance improvements across these datasets compared to existing methods. The architecture outperforms both CNN-based and transformer-based models, indicating an effective synergy between local feature capture and global context modeling.
CINFormer showed superior performance on challenging datasets with weak defects and complex backgrounds, underlining its robust capability in practical industrial scenarios.
Figure 3: Visualization of segmentation results obtained by various methods, highlighting CINFormer’s accuracy in defect segmentation across different datasets.
Ablation Studies
The paper’s ablation studies attribute significant performance gains to the multi-stage CNN feature injection, which is contrasted with alternative bidirectional and post-transformer injection schemes. Additionally, integrating the Top-K self-attention mechanism improved both efficiency and detection effectiveness by selectively processing the most critical feature components.
Figure 4: Illustration of the Top-K self-attention mechanism, showcasing various stages of token and channel selection.
Implications and Future Work
CINFormer offers substantial practical implications in automated industrial inspection processes, providing an adaptable framework for environments with varied defect detection challenges. By effectively leveraging CNN and transformer architectures, the approach fosters improved feature representation and noise suppression.
Future research could explore further optimizations in self-attention mechanisms and enhanced integration techniques to expand the applicability of CINFormer across broader defect typologies and industries.
Conclusion
CINFormer presents a sophisticated yet computationally efficient solution for surface defect segmentation, combining the detailed feature representation of CNNs with the global context awareness of transformers. The Top-K self-attention mechanism emphasizes informative features while minimizing background interference, leading to state-of-the-art performance across diverse defect scenarios. The paper marks a significant step forward in applying advanced neural architectures to industrial defect detection tasks.