SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference (2312.01597v4)
Abstract: Recent advances in contrastive language-image pretraining (CLIP) have demonstrated strong zero-shot classification capabilities by aligning visual representations with target text embeddings at the image level. In dense prediction tasks, however, CLIP often struggles to localize visual features within an image and fails to produce accurate pixel-level predictions, which prevents it from serving as a generalized visual foundation model. In this work, we aim to enhance CLIP's potential for semantic segmentation with minimal modifications to its pretrained models. By rethinking self-attention, we surprisingly find that CLIP can adapt to dense prediction tasks by simply introducing a novel Correlative Self-Attention (CSA) mechanism. Specifically, we replace the traditional self-attention block in the last layer of CLIP's vision encoder with our CSA module and reuse its pretrained query, key, and value projection matrices, yielding a training-free adaptation approach for CLIP's zero-shot semantic segmentation. Extensive experiments demonstrate the advantage of CSA: we obtain an average zero-shot mIoU of 38.2% across the eight semantic segmentation benchmarks highlighted in this paper, significantly outperforming the existing state of the art (33.9%) and vanilla CLIP (14.1%).
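The abstract names the CSA module without spelling out its computation, so the following is a minimal, hedged sketch of how such a training-free attention swap could look in PyTorch. It assumes a single attention head and that the correlative scores are formed by correlating each projection with itself (q qᵀ plus k kᵀ) rather than the usual q kᵀ; the function name, tensor shapes, and the exact score formula are illustrative assumptions, not the paper's verbatim implementation.

```python
import torch
import torch.nn.functional as F

def correlative_self_attention(x, w_q, w_k, w_v, w_out, scale):
    """Minimal single-head sketch of a CSA-style attention block.

    x             : (N, D) patch tokens entering the last encoder layer
    w_q, w_k, w_v : (D, D) frozen projection weights reused from
                    CLIP's pretrained attention block
    w_out         : (D, D) frozen output projection
    scale         : softmax temperature, e.g. D ** -0.5

    Assumption: pairwise scores come from each projection correlated
    with itself (q @ q.T and k @ k.T) instead of q @ k.T, so attention
    concentrates on mutually similar tokens, which preserves spatial
    locality for dense prediction.
    """
    q = x @ w_q  # reuse pretrained query projection
    k = x @ w_k  # reuse pretrained key projection
    v = x @ w_v  # reuse pretrained value projection
    # Correlative scores: sum of the two self-correlation maps.
    attn = F.softmax(scale * (q @ q.T), dim=-1) \
         + F.softmax(scale * (k @ k.T), dim=-1)
    return (attn @ v) @ w_out


# Hypothetical usage: 196 patch tokens of width 768 with random
# stand-in weights (in practice these come from the frozen CLIP model).
if __name__ == "__main__":
    N, D = 196, 768
    x = torch.randn(N, D)
    w = [torch.randn(D, D) for _ in range(4)]
    out = correlative_self_attention(x, *w, scale=D ** -0.5)
    print(out.shape)  # torch.Size([196, 768])
```

Because every weight is reused as-is and only the score computation changes, the adaptation requires no gradient updates, matching the training-free claim in the abstract.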