Supervised Fine-tuning in turn Improves Visual Foundation Models (2401.10222v2)

Published 18 Jan 2024 in cs.CV and cs.AI

Abstract: Image-text training such as CLIP has dominated the pretraining of vision foundation models in recent years. Subsequent efforts have sought to introduce region-level visual learning into CLIP's pretraining but face scalability challenges due to the lack of large-scale region-level datasets. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing, such as instruction tuning, we explore the potential of fine-grained SFT in enhancing the generation of vision foundation models after their pretraining. We therefore propose ViSFT (Vision SFT), a two-stage method to unleash the fine-grained knowledge of vision foundation models. In ViSFT, the vision foundation model is enhanced by performing visual joint learning on some in-domain tasks and is then tested on out-of-domain benchmarks. After being updated with ViSFT on 8 V100 GPUs in less than 2 days, a vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks, including vision and vision-linguistic scenarios.

Authors (6)

Xiaohu Jiang
Yixiao Ge
Yuying Ge
Chun Yuan
Ying Shan
Dachuan Shi
