LiT: Zero-Shot Transfer with Locked-image text Tuning

Published 15 Nov 2021 in cs.CV, cs.CL, and cs.LG | (2111.07991v3)

Abstract: This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning "Locked-image Tuning" (LiT), which just teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT model achieves 85.2% zero-shot transfer accuracy on the ImageNet test set, and 82.5% on the challenging out-of-distribution ObjectNet test set.


Authors (7)

Citations (488)


Summary

The paper introduces LiT, a method that locks image model parameters while tuning text models to achieve enhanced zero-shot transfer learning.
The methodology uses contrastive learning to align fixed image embeddings with adaptable text representations, achieving 85.2% accuracy on ImageNet and 82.5% on ObjectNet.
The findings imply that decoupling image and text training improves computational efficiency and broadens accessibility in zero-shot learning research.

Overview of Zero-Shot Transfer with Locked-Image Text Tuning

This paper introduces Locked-image Text Tuning (LiT), a technique that improves zero-shot transfer by pairing a locked (frozen) pre-trained image model with an unlocked (trainable) text model. The approach builds on contrastive learning and teaches the text model to read out, for new tasks, the representations that the pre-trained image model already produces.
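To make the zero-shot step concrete, here is a minimal sketch of how a classifier can be built from two aligned towers: each class name is embedded by the text model, and an image is assigned to the class whose text embedding is most similar to its image embedding. The function names and the toy three-dimensional embeddings are hypothetical and stand in for the outputs of real image and text models; cosine similarity is assumed as the scoring function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the best-matching class name and all similarity scores.

    class_text_embeddings maps a class name to the text-tower embedding
    of that name (or of a prompt containing it).
    """
    scores = {
        name: cosine(image_embedding, text_embedding)
        for name, text_embedding in class_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (assumed values, not produced by any real model).
toy_text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
toy_image_embedding = [0.9, 0.1, 0.0]

predicted_class, similarity_scores = zero_shot_classify(
    toy_image_embedding, toy_text_embeddings
)
```

No image-side training is needed to add a class in this sketch: a new entry in the dictionary of text embeddings is enough, which is what makes the transfer zero-shot.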

Methodology

LiT employs contrastive tuning: an image model and a text model each produce embeddings, and a contrastive objective aligns the embeddings of matching image-text pairs. The key design choice is to keep the image model's parameters locked while the text model adapts. This lets the system reuse a strong pre-trained image model without updating its weights. With a pre-trained ViT-g/14 image model, LiT reaches 85.2% zero-shot transfer accuracy on ImageNet and 82.5% on ObjectNet.
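The objective described above can be sketched as a symmetric contrastive loss over a batch of paired embeddings. This is an illustrative sketch rather than the authors' implementation: the function names, the fixed temperature of 0.07, and the toy embeddings are assumptions. In the locked-image setting, the image embeddings would come from the frozen image model and act as constants, so only the text model would receive updates.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; pair k is (image_emb[k], text_emb[k]).

    The temperature value is an assumption chosen for illustration.
    """
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    n = len(img)
    # logits[i][j]: scaled similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(i_vec, t_vec)) / temperature for t_vec in txt]
        for i_vec in img
    ]
    # Image-to-text: each image should pick out its own text.
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: each text should pick out its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: image embeddings play the role of frozen-tower outputs.
locked_image_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_text_emb = [[2.0, 0.0], [0.0, 3.0]]
swapped_text_emb = [[0.0, 3.0], [2.0, 0.0]]

aligned_loss = contrastive_loss(locked_image_emb, aligned_text_emb)
swapped_loss = contrastive_loss(locked_image_emb, swapped_text_emb)
```

In this toy batch the loss is near zero when each text embedding points in the same direction as its paired image embedding and large when the pairs are swapped, which is the signal that would drive the text model toward the fixed image representation.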

Key Results

The empirical study evaluates LiT against established methods such as CLIP and ALIGN, highlighting improved data and computational efficiency. For example, on the ImageNet zero-shot transfer task, LiT improves on the previous state-of-the-art models by roughly 8.8 to 9 percentage points. Furthermore, LiT achieves high performance on out-of-distribution datasets without training the image model from scratch or extensively fine-tuning it.

The paper also provides insights into the design choices between locked and unlocked models, as well as various pre-trained model architectures and text encoders. A noteworthy observation is that locking the image tower enhances performance, as it keeps the generality and robustness of the image representation intact, while aligning well with the text embeddings.

Implications and Future Research

Practically, LiT turns existing vision backbones into zero-shot learners at significantly lower computational cost. The method could open zero-shot learning research to a wider range of contributors, since it remains effective even with publicly available datasets and models.

Theoretically, LiT highlights the importance of decoupling the learning of image descriptors and vision-language alignment. The study suggests that future advancements in AI could focus on further refining these decoupled processes, perhaps through hybrid models that leverage both large-scale learned representations and task-specific knowledge.

Conclusion

LiT stands as a promising method for zero-shot transfer by efficiently harnessing pre-existing models, thereby reducing computational costs and promoting wider accessibility. The results challenge traditional training paradigms and encourage further exploration into balancing existing knowledge with new task requirements across AI research fields.
