AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection

Published 22 Jul 2024 in cs.CV | (2407.15795v1)

Abstract: Zero-shot anomaly detection (ZSAD) targets the identification of anomalies within images from arbitrary novel categories. This study introduces AdaCLIP for the ZSAD task, leveraging a pre-trained vision-language model (VLM), CLIP. AdaCLIP incorporates learnable prompts into CLIP and optimizes them through training on auxiliary annotated anomaly detection data. Two types of learnable prompts are proposed: static and dynamic. Static prompts are shared across all images, serving to preliminarily adapt CLIP for ZSAD. In contrast, dynamic prompts are generated for each test image, providing CLIP with dynamic adaptation capabilities. The combination of static and dynamic prompts is referred to as hybrid prompts, and yields enhanced ZSAD performance. Extensive experiments conducted across 14 real-world anomaly detection datasets from industrial and medical domains indicate that AdaCLIP outperforms other ZSAD methods and can generalize better to different categories and even domains. Finally, our analysis highlights the importance of diverse auxiliary data and optimized prompts for enhanced generalization capacity. Code is available at https://github.com/caoyunkang/AdaCLIP.

Authors (6)

Summary

The paper introduces a hybrid prompting mechanism integrating static and dynamic learnable tokens to adapt CLIP for zero-shot anomaly detection.
It leverages auxiliary anomaly detection data to improve generalization, achieving superior performance across 14 industrial and medical datasets.
The framework features a projection layer and Hybrid Semantic Fusion module that align image and text embeddings for enhanced region-level scoring.

AdaCLIP: A Novel Framework for Zero-Shot Anomaly Detection Utilizing Hybrid Learnable Prompts

The paper presents AdaCLIP, a zero-shot anomaly detection (ZSAD) method built on the pre-trained CLIP vision-language model (VLM). It adds hybrid learnable prompts, combining static and dynamic variants, so that CLIP can detect anomalies in image categories it has not seen. The prompts are optimized on auxiliary annotated anomaly detection data, so no examples from the target categories are needed during training.

Technical Summary and Methodological Advances

AdaCLIP adapts the pre-trained CLIP model through prompt learning, targeting anomaly detection across industrial and medical domains. Static prompts are learnable tokens shared across all images; they are optimized during training to capture a range of anomaly features and give CLIP a preliminary adaptation to ZSAD. Dynamic prompts are generated separately for each test image, so the model's response can adapt to the features of that image. The combination, referred to as hybrid prompts, yields enhanced ZSAD performance.
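The interplay of static and dynamic prompts can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the class name HybridPrompter, the helper functions, the per-token linear generator, and the choice to combine the two prompt types by element-wise addition are all assumptions made for illustration. The sketch only shows the idea that one prompt table is shared by every image while a second set of prompts is computed from each image's feature.

```python
import random


def linear(weight, vec):
    """Multiply a (rows x cols) matrix, given as a list of rows, by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def init_matrix(rows, cols, rng, scale=0.02):
    """Small random initialisation, standing in for learnable parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class HybridPrompter:
    """Hypothetical container for static prompts and a dynamic prompt generator."""

    def __init__(self, num_prompts, dim, feat_dim, seed=0):
        rng = random.Random(seed)
        # Static prompts: one table of tokens shared by every image.
        self.static = init_matrix(num_prompts, dim, rng)
        # Dynamic generator: one linear map per prompt token (an assumption).
        self.generators = [init_matrix(dim, feat_dim, rng) for _ in range(num_prompts)]

    def dynamic(self, image_feature):
        """Generate image-specific prompt tokens from a global image feature."""
        return [linear(w, image_feature) for w in self.generators]

    def hybrid(self, image_feature):
        """Combine static and dynamic prompts; addition is assumed here."""
        dyn = self.dynamic(image_feature)
        return [
            [s + d for s, d in zip(s_tok, d_tok)]
            for s_tok, d_tok in zip(self.static, dyn)
        ]


def prepend_prompts(prompts, tokens):
    """Place prompt tokens in front of the ordinary token sequence."""
    return prompts + tokens


prompter = HybridPrompter(num_prompts=2, dim=4, feat_dim=3)
tokens = [[0.0] * 4 for _ in range(5)]  # placeholder patch or word tokens
seq_a = prepend_prompts(prompter.hybrid([1.0, 0.0, 0.0]), tokens)
seq_b = prepend_prompts(prompter.hybrid([0.0, 1.0, 0.0]), tokens)
# Same sequence length for both images, but different prompt tokens.
print(len(seq_a), len(seq_b), seq_a[0] != seq_b[0])
```

In this sketch the static table would be learned once on the auxiliary data, while the generator weights determine how strongly each image shifts the prompts away from that shared starting point.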

Key Contributions:

Hybrid Prompting Mechanism: AdaCLIP integrates static and dynamic learnable prompts within the CLIP framework, enhancing anomaly detection by adapting to both the data observed during training and novel test data.
Use of Auxiliary Data: The model leverages diverse auxiliary datasets, demonstrating the importance of varied training data to boost the model's ability to generalize across different application domains.
Projection and Semantic Fusion Enhancements: AdaCLIP includes a projection layer to align patch and text embeddings and proposes a Hybrid Semantic Fusion (HSF) module. This module aggregates region-level anomaly information to enhance image-level anomaly scoring.
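The projection and scoring steps in the list above can be sketched roughly as follows. The function names, the temperature value, and the two-way softmax over a "normal" and an "abnormal" text embedding are assumptions for illustration. In particular, image_score_top_k is a deliberately simple stand-in that averages the most anomalous patch scores; it is not the Hybrid Semantic Fusion module, whose details this summary does not give.

```python
import math


def linear(weight, vec):
    """Multiply a (rows x cols) matrix, given as a list of rows, by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def patch_anomaly_scores(patches, projection, text_normal, text_abnormal,
                         temperature=0.07):
    """Project each patch into the text space and score it against two prompts.

    Returns, per patch, the softmax probability assigned to the abnormal text
    embedding. The temperature value is an assumed hyperparameter.
    """
    t_normal = normalize(text_normal)
    t_abnormal = normalize(text_abnormal)
    scores = []
    for patch in patches:
        z = normalize(linear(projection, patch))
        logit_normal = dot(z, t_normal) / temperature
        logit_abnormal = dot(z, t_abnormal) / temperature
        # Two-way softmax written as a sigmoid of the logit difference.
        scores.append(1.0 / (1.0 + math.exp(logit_normal - logit_abnormal)))
    return scores


def image_score_top_k(patch_scores, k=2):
    """Simplified region-to-image aggregation: mean of the k highest scores."""
    top = sorted(patch_scores, reverse=True)[:k]
    return sum(top) / len(top)


identity = [[1.0, 0.0], [0.0, 1.0]]          # toy projection layer
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = patch_anomaly_scores(patches, identity, [1.0, 0.0], [0.0, 1.0])
print(scores, image_score_top_k(scores))
```

The per-patch scores form a coarse anomaly map for pixel-level evaluation, while the aggregated value serves as the image-level score.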

Results

AdaCLIP is evaluated through extensive experiments on 14 real-world datasets from industrial and medical domains, achieving state-of-the-art (SOTA) results. It consistently outperforms existing ZSAD methods, which the paper attributes to prompts optimized on annotated auxiliary data and the resulting gain in generalization. Improvements are reported on both image-level and pixel-level anomaly detection metrics relative to comparable methods such as WinCLIP and APRIL-GAN.
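This summary does not name the metrics used. As an illustration only, the sketch below assumes AUROC, a common choice in anomaly detection benchmarks, and computes it from pairwise comparisons between anomalous and normal samples. The function name and the toy labels and scores are hypothetical. The same function covers the image level (one score per image) and the pixel level (one score per pixel, after flattening masks and anomaly maps).

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for anomalous, 0 for normal. scores: higher means more anomalous.
    Equals the fraction of (anomalous, normal) pairs ranked correctly,
    counting ties as half. Quadratic in sample count, so suited to small inputs.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one normal and one anomalous sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Image level: one label and one score per image (toy values).
image_labels = [0, 0, 1, 1]
image_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(image_labels, image_scores))

# Pixel level: flatten ground-truth masks and predicted anomaly maps first.
mask = [[0, 0], [0, 1]]
anomaly_map = [[0.1, 0.2], [0.3, 0.9]]
flat_mask = [v for row in mask for v in row]
flat_map = [v for row in anomaly_map for v in row]
print(auroc(flat_mask, flat_map))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of normal and anomalous samples.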

Implications and Future Work

AdaCLIP has both theoretical and practical implications. The gains from hybrid prompts support prompt learning as an effective way to adapt VLMs to specific tasks and motivate further work on multimodal prompt learning. Practically, AdaCLIP detects anomalies without exemplars from the target category, which suits fields like industrial inspection and medical diagnostics, where rapid deployment across varying contexts is crucial.

For future work, the authors hint at improving how dynamic prompts are generated and at making fuller use of auxiliary data. Additionally, refining text prompts to capture the distinction between normal and abnormal semantics in specific domains may further improve AdaCLIP's performance.

AdaCLIP marks a significant advancement in ZSAD, showcasing how learnable prompts can be effectively used to adapt strong VLM backbones, like CLIP, for specialized anomaly detection purposes. This framework's promising results across diverse domains open avenues for further development in scalable and adaptable anomaly detection solutions in varying practical scenarios.
