
Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models (2401.00127v1)

Published 30 Dec 2023 in cs.CV and cs.SI

Abstract: The synergy of language and vision models has given rise to Large Language and Vision Assistant models (LLVAs), designed to engage users in rich conversational experiences intertwined with image-based queries. These comprehensive multimodal models seamlessly integrate vision encoders with LLMs, expanding their applications in general-purpose language and visual comprehension. The advent of Large Multimodal Models (LMMs) heralds a new era in AI assistance, extending the horizons of AI utilization. This paper takes a unique perspective on LMMs, exploring their efficacy in performing image classification tasks using tailored prompts designed for specific datasets. We also investigate the LLVAs' zero-shot learning capabilities. Our study includes a benchmarking analysis across four diverse datasets: MNIST, Cats vs. Dogs, Hymenoptera (Ants vs. Bees), and an unconventional dataset comprising Pox vs. Non-Pox skin images. The results of our experiments demonstrate the model's remarkable performance, achieving classification accuracies of 85%, 100%, 77%, and 79% for the respective datasets without any fine-tuning. To bolster our analysis, we assess the model's performance after fine-tuning for specific tasks. In one instance, fine-tuning is conducted over a dataset comprising images of faces of children with and without autism. Prior to fine-tuning, the model demonstrated a test accuracy of 55%, which significantly improved to 83% post fine-tuning. These results, coupled with our prior findings, underscore the transformative potential of LLVAs and their versatile applications in real-world scenarios.
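The zero-shot setup the abstract describes amounts to showing an LMM an image together with a dataset-tailored prompt and parsing the short textual answer into a class label. The paper does not publish its exact prompts or code, so the following is a minimal illustrative sketch, assuming the publicly available llava-hf/llava-1.5-7b-hf checkpoint from Hugging Face Transformers; the prompt wording and the classify() helper are hypothetical:

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load a LLaVA 1.5 checkpoint (the paper builds on LLaVA; this exact
# checkpoint and the prompt below are assumptions for illustration).
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def classify(image_path: str) -> str:
    # Dataset-tailored prompt: constrain the model to a one-word answer
    # so the free-form output maps directly onto a class label.
    prompt = ("USER: <image>\nDoes this photo show a cat or a dog? "
              "Answer with exactly one word.\nASSISTANT:")
    image = Image.open(image_path)
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=5)
    answer = processor.decode(out[0], skip_special_tokens=True)
    answer = answer.split("ASSISTANT:")[-1]
    return "cat" if "cat" in answer.lower() else "dog"

print(classify("example.jpg"))

Under this reading, the reported zero-shot accuracies are simply the fraction of parsed answers matching the ground-truth labels on each test split; no gradient updates are involved until the separate fine-tuning experiments.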
