Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models (2405.11301v1)

Published 18 May 2024 in cs.CL and cs.CV

Abstract: Fine-grained image classification, particularly in zero/few-shot scenarios, presents a significant challenge for vision-language models (VLMs) such as CLIP. These models often struggle to distinguish semantically similar classes because their pre-training recipe lacks supervision signals for fine-grained categorization. This paper introduces CascadeVLM, a framework that overcomes the constraints of previous CLIP-based methods by effectively leveraging the granular knowledge encapsulated in large vision-language models (LVLMs). Experiments across several fine-grained image datasets demonstrate that CascadeVLM significantly outperforms existing models, achieving 85.6% zero-shot accuracy on the Stanford Cars dataset. Performance gain analysis shows that LVLMs produce more accurate predictions on challenging images about which CLIP is uncertain, yielding the overall accuracy boost. Our framework sheds light on a holistic integration of VLMs and LVLMs for effective and efficient fine-grained image classification.
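The cascade described in the abstract can be sketched as follows: run the cheap CLIP classifier first, accept its answer when it is confident, and defer only the uncertain cases (with a top-k candidate shortlist) to the more expensive LVLM. This is a minimal illustrative sketch, not the paper's implementation; the `lvlm_choose` callable, the confidence threshold, and `k` are all hypothetical stand-ins for prompting an actual LVLM with the image and candidate class names.

```python
def cascade_classify(clip_probs, lvlm_choose, k=5, threshold=0.9):
    """Predict a class index with a CLIP-then-LVLM cascade (sketch).

    clip_probs:  list of softmax scores over all classes, as produced
                 by CLIP image-text similarity (assumed precomputed).
    lvlm_choose: callable taking a list of candidate class indices and
                 returning one of them; stands in for querying an LVLM.
    """
    # Stage 1: accept CLIP's prediction when it is confident enough.
    top = max(range(len(clip_probs)), key=lambda i: clip_probs[i])
    if clip_probs[top] >= threshold:
        return top
    # Stage 2: CLIP is uncertain, so hand the top-k candidates to the LVLM.
    candidates = sorted(range(len(clip_probs)),
                        key=lambda i: clip_probs[i], reverse=True)[:k]
    return lvlm_choose(candidates)


# Confident case: CLIP's argmax is returned without touching the LVLM.
print(cascade_classify([0.95, 0.03, 0.02], lvlm_choose=lambda c: c[-1]))
# Uncertain case: the LVLM picks among the top-2 CLIP candidates.
print(cascade_classify([0.40, 0.35, 0.25], lvlm_choose=lambda c: c[1], k=2))
```

Because most images are resolved by the confident first stage, only a small fraction of queries incur the LVLM's cost, which is how the cascade stays efficient while improving accuracy on the hard cases.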

Authors (1)