
CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning

(arXiv:2407.15793)
Published Jul 22, 2024 in cs.CV, cs.AI, and cs.LG

Abstract

With the emergence of Transformers and Vision-Language Models (VLMs) such as CLIP, large pre-trained models have become a common strategy to enhance performance in Continual Learning scenarios. This led to the development of numerous prompting strategies to effectively fine-tune transformer-based models without succumbing to catastrophic forgetting. However, these methods struggle to specialize the model on domains that deviate significantly from the pre-training data while preserving its zero-shot capabilities. In this work, we propose Continual Generative training for Incremental prompt-Learning, a novel approach to mitigate forgetting while adapting a VLM, which exploits generative replay to align prompts to tasks. We also introduce a new metric to evaluate zero-shot capabilities within CL benchmarks. Through extensive experiments on different domains, we demonstrate the effectiveness of our framework in adapting to new tasks while improving zero-shot capabilities. Further analysis reveals that our approach can bridge the gap with joint prompt tuning. The codebase is available at https://github.com/aimagelab/mammoth.

Figure: training generative models per task and aligning prompts across tasks using the stored decoders.

Overview

  • The paper introduces Continual Generative training for Incremental prompt-Learning (CGIL), which combines generative replay with prompt learning to adapt Vision-Language Models, specifically CLIP, in continual learning settings, mitigating catastrophic forgetting while improving task specialization.

  • CGIL uses a generative replay strategy based on Variational Autoencoders (VAEs) that synthesize visual features for rehearsal, removing the need for a buffer of real samples; it then adapts the CLIP text encoder by learning class-specific prompts, combined with handcrafted prompts in a hybrid scheme, so that new tasks are learned while zero-shot capabilities are preserved.

  • Extensive experiments on multiple datasets show that CGIL outperforms state-of-the-art methods in both final average accuracy and zero-shot performance, supporting the value of generative replay in the latent space and its practical relevance for real-world applications that learn sequentially.

CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning

The paper "CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning" addresses a significant challenge in the field of Continual Learning (CL) by introducing a novel approach leveraging generative replay to enhance the adaptability and zero-shot learning capabilities of Vision-Language Models (VLMs), predominantly CLIP. The approach, termed Continual Generative Training for Incremental prompt-Learning (CGIL), integrates the strengths of prompt-learning and generative models to mitigate catastrophic forgetting and improve task specialization, making it a formidable framework in incremental learning settings.

Core Contributions and Methodological Advancements

The primary contributions of CGIL can be summarized as follows:

  1. Generative Replay Strategy: The strategy utilizes Variational Autoencoders (VAEs) to learn the latent distributions of visual features, enabling the generation of synthetic data to support continual learning without requiring a memory buffer.
  2. Prompt Learning for VLMs: CGIL adapts the CLIP text encoder through learned class-specific prompts, fitting new tasks while preserving the model's zero-shot capabilities. It relies on a hybrid scheme that pairs handcrafted prompts for unseen classes with learned prompts for seen classes (a sketch of such a hybrid classifier follows this list).
  3. New Zero-shot Evaluation Metric: The paper introduces Class Incremental Transfer (CI-Transfer), a metric that measures zero-shot performance on future tasks and thus gives a more complete picture of a model's continual learning behavior (one possible formulation is sketched below).
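
To make the hybrid prompting idea of point 2 concrete, here is a minimal PyTorch sketch of a classifier that scores images against learned text features for classes seen so far and handcrafted "a photo of a {c}" prompts for classes not yet seen. The `learned_text_features` dictionary and the OpenCLIP-style `encode_text`/tokenizer interface are assumptions for illustration, not the paper's exact API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def hybrid_classify(image_features, class_names, seen_classes,
                    clip_model, tokenizer, learned_text_features):
    """Score images against all classes: learned prompt embeddings for classes
    seen so far, handcrafted "a photo of a {c}" prompts for the rest.

    `learned_text_features` is a hypothetical {class_name: text_embedding}
    dict produced by the prompt-alignment phase.
    """
    text_feats = []
    for c in class_names:
        if c in seen_classes:
            feat = learned_text_features[c]               # learned prompt
        else:
            tokens = tokenizer([f"a photo of a {c}"])     # handcrafted prompt
            feat = clip_model.encode_text(tokens).squeeze(0)
        text_feats.append(F.normalize(feat, dim=-1))
    text_feats = torch.stack(text_feats)                  # (num_classes, d)

    image_features = F.normalize(image_features, dim=-1)  # (batch, d)
    logits = image_features @ text_feats.t()              # cosine similarities
    return logits.argmax(dim=-1)                          # predicted class ids
```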
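
The summary does not give the exact formula for CI-Transfer, so the following is only one plausible reading: average the accuracy measured on tasks the model has not yet been trained on, taken from the usual task-by-task accuracy matrix. The authors' normalization may differ.

```python
def ci_transfer(acc):
    """One plausible formulation of CI-Transfer.

    `acc[i][j]` is the accuracy on task j's classes measured after training on
    task i (a standard continual-learning accuracy matrix). We average the
    entries with j > i, i.e. zero-shot accuracy on tasks not seen yet.
    """
    T = len(acc)
    future = [acc[i][j] for i in range(T - 1) for j in range(i + 1, T)]
    return sum(future) / len(future)
```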

Detailed Methodology

The CGIL approach is divided into two phases:

  1. Learning the Distributions of Latent Representations: For each task, the visual features of the input images are extracted with the CLIP visual encoder and used to train class-specific VAEs that capture the latent distribution of those features. The VAEs can later generate synthetic features for training. Operating on features rather than raw images is more compact and also sidesteps data-privacy constraints, since no input data is stored directly (a minimal sketch of this phase follows the list).
  2. Prompts Alignment: The alignment phase uses the synthetic features generated by the VAEs to adapt the CLIP text encoder. It learns textual contexts, consisting of both class-specific and generated contexts, so that the model adapts to new domains while retaining performance on previously seen tasks. The prompts are optimized by gradient descent on the synthetic dataset, aligning them with the visual embeddings (also sketched below).
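
A minimal sketch of the first phase under common assumptions: one small VAE is fit on the frozen CLIP visual features of each class, and only its decoder needs to be kept for later replay. The dimensions, KL weight, and training schedule below are illustrative choices, not the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureVAE(nn.Module):
    """A small VAE over frozen CLIP visual features (e.g. 512-d)."""
    def __init__(self, feat_dim=512, latent_dim=64, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, feat_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def fit_class_vae(features, epochs=100, lr=1e-3, kl_weight=1e-3):
    """Fit one VAE on the CLIP features of a single class of the current task."""
    vae = FeatureVAE(feat_dim=features.shape[1])
    opt = torch.optim.Adam(vae.parameters(), lr=lr)
    for _ in range(epochs):
        recon, mu, logvar = vae(features)
        rec = F.mse_loss(recon, features)                               # reconstruction
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL term
        loss = rec + kl_weight * kld
        opt.zero_grad(); loss.backward(); opt.step()
    return vae  # in practice, storing vae.dec is enough for replay
```

Because only low-dimensional feature vectors and light decoders are involved, this is far cheaper than image-space generative replay.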
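
And a CoOp-style sketch of the second phase: synthetic features are sampled from the stored decoders of every class seen so far and used to optimize learnable prompt context vectors with a cross-entropy objective over cosine similarities. The helper `text_encoder_from_embeddings` (which would run CLIP's text transformer on already-embedded tokens, as CoOp does) and all hyperparameters are assumptions; CGIL's actual prompt parameterization, which mixes class-specific and generated contexts, is richer than this.

```python
import torch
import torch.nn.functional as F

def align_prompts(vaes, class_token_embeds, ctx, text_encoder_from_embeddings,
                  steps=500, samples_per_class=256, lr=2e-3, tau=0.01):
    """CoOp-style prompt alignment on synthetic features.

    vaes[c]                      -- stored decoder (FeatureVAE) for class c
    class_token_embeds[c]        -- frozen token embeddings of class c's name
    ctx                          -- learnable context vectors (requires_grad=True)
    text_encoder_from_embeddings -- hypothetical helper running CLIP's text
                                    transformer on already-embedded tokens
    """
    opt = torch.optim.Adam([ctx], lr=lr)
    for _ in range(steps):
        # Generative replay: sample synthetic CLIP features from every decoder.
        feats, labels = [], []
        for c, vae in enumerate(vaes):
            z = torch.randn(samples_per_class, vae.mu.out_features)
            feats.append(vae.dec(z).detach())
            labels.append(torch.full((samples_per_class,), c, dtype=torch.long))
        feats = F.normalize(torch.cat(feats), dim=-1)
        labels = torch.cat(labels)

        # Text features: [learned context | class-name tokens] for each class.
        text_feats = torch.stack([
            text_encoder_from_embeddings(torch.cat([ctx, class_token_embeds[c]]))
            for c in range(len(vaes))])
        text_feats = F.normalize(text_feats, dim=-1)

        logits = feats @ text_feats.t() / tau   # temperature-scaled similarities
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return ctx
```

Since replay draws from every stored decoder, the prompts are re-aligned to all classes after each task, which is what lets the method avoid keeping a buffer of real samples.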

Experimental Validation

The authors conducted extensive experiments across various datasets, including Split ImageNet-R, Split Cars-196, Split CUB-200, Split EuroSAT, and Split ISIC. The results demonstrate CGIL's superior performance compared to state-of-the-art methods, including L2P, DualPrompt, CODA-Prompt, and AttriCLIP, in both final average accuracy and zero-shot performance on future tasks.

Key findings include:

  • Final Average Accuracy (FAA): CGIL consistently outperformed competitors, particularly in more challenging settings where zero-shot CLIP underperformed, such as in medical image classification (ISIC).
  • CI-Transfer Metric: CGIL showcased its ability to retain and transfer knowledge across tasks, outperforming methods like MoE Adapters and AttriCLIP, thus validating its effectiveness in zero-shot scenarios.

Theoretical and Practical Implications

Theoretically, CGIL advances the understanding of generative replay in the latent space, providing insights into more efficient and privacy-compliant ways to handle sequential learning without catastrophic forgetting. The method bridges the gap between prompt learning and incremental task adaptation, highlighting the potential of prompt-learning techniques when combined with advanced generative models.

Practically, CGIL's approach has significant implications for deploying VLMs in production environments where data arrives sequentially. Its ability to preserve zero-shot capabilities while continuously learning new tasks makes it suitable for real-world applications in dynamic settings, such as medical diagnostics, autonomous driving, and remote sensing.

Future Directions

Future research could explore the scalability of CGIL to even larger datasets and more complex task sequences. Additionally, optimizing the memory and computational requirements of generative models could further enhance its applicability. Another interesting avenue could be the integration of different generative techniques, such as Generative Adversarial Networks (GANs) or advanced diffusion models, to improve the quality of synthetic data and, consequently, the performance of the framework.

In conclusion, the paper "CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning" offers a robust contribution to the field of continual learning by effectively combining generative replay and prompt learning. It sets a strong foundation for future research and practical implementations in scenarios requiring continuous adaptation and learning.
