Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (2404.07973v1)
Abstract: While Ferret seamlessly integrates regional understanding into the LLM to facilitate its referring and grounding capability, it has certain limitations: it is constrained by the pre-trained, fixed visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any-resolution grounding and referring: a flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: by integrating an additional DINOv2 encoder, the model learns better and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
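The first two designs in the abstract (any-resolution handling and multi-granularity encoding) can be pictured with a short sketch. The code below is a minimal illustration, not the authors' implementation: `EncoderStub`, `split_any_resolution`, the 336-pixel tile size, and the `fuse` projection are hypothetical stand-ins chosen to show how a low-resolution global view (CLIP-style) and high-resolution sub-image features (DINOv2-style) could be concatenated into one visual token sequence for the LLM.

```python
# Minimal sketch (not the authors' implementation) of how "any resolution"
# tiling and multi-granularity encoding could be wired together. EncoderStub,
# split_any_resolution, the 336-pixel tile size, and the fuse projection are
# hypothetical stand-ins for the pre-trained CLIP/DINOv2 encoders and the
# paper's actual fusion scheme.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EncoderStub(nn.Module):
    """Stand-in for a ViT-style encoder that returns a sequence of patch features."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=14, stride=14)  # 14x14 "patches"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.proj(x)                     # (B, dim, H/14, W/14)
        return feats.flatten(2).transpose(1, 2)  # (B, num_patches, dim)


def split_any_resolution(image: torch.Tensor, tile: int = 336) -> list:
    """Split an arbitrary-resolution image (C, H, W) into fixed-size sub-images."""
    _, h, w = image.shape
    tiles = []
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            crop = image[:, top:top + tile, left:left + tile]
            # Pad edge tiles so every crop matches the encoder's expected input size.
            crop = F.pad(crop, (0, tile - crop.shape[2], 0, tile - crop.shape[1]))
            tiles.append(crop)
    return tiles


clip_global = EncoderStub(dim=1024)  # CLIP-like encoder for the resized global view
dino_local = EncoderStub(dim=768)    # DINOv2-like encoder for high-res sub-images
fuse = nn.Linear(768, 1024)          # project local features into the global feature space

image = torch.rand(3, 672, 1008)                           # arbitrary-resolution input
global_view = F.interpolate(image[None], size=(336, 336))  # low-res global view
global_tokens = clip_global(global_view)                   # coarse, image-level context

local_tokens = torch.cat(
    [fuse(dino_local(t[None])) for t in split_any_resolution(image)], dim=1
)                                                           # fine-grained, sub-image context
visual_tokens = torch.cat([global_tokens, local_tokens], dim=1)
print(visual_tokens.shape)  # torch.Size([1, 4032, 1024]); fed to the LLM after projection
```

Keeping both token streams gives the language model coarse global context alongside high-resolution local detail, which is the intuition behind the multi-granularity design; the actual model is additionally trained with the three-stage schedule described above.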
- Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
- Vqa: Visual question answering. In CVPR, pp. 2425–2433, 2015.
- Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv:2308.12966, 2023.
- A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 2023.
- Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv:2310.09478, 2023.
- Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv:2306.15195, 2023.
- Sharegpt4v: Improving large multi-modal models with better captions. arXiv:2311.12793, 2023.
- Uniter: Universal image-text representation learning. In ECCV, pp. 104–120. Springer, 2020.
- Palm: Scaling language modeling with pathways. arXiv:2204.02311, 2022.
- Large-scale adversarial training for vision-and-language representation learning. In NeurIPS, volume 33, pp. 6616–6628, 2020.
- Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv:2402.05935, 2024.
- Lvis: A dataset for large vocabulary instance segmentation. In CVPR, pp. 5356–5364, 2019.
- Cogagent: A visual language model for gui agents. arXiv:2312.08914, 2023.
- Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pp. 6700–6709, 2019.
- From clip to dino: Visual encoders shout in multi-modal large language models. arXiv:2310.08825, 2023.
- Mdetr: Modulated detection for end-to-end multi-modal understanding. In ICCV, pp. 1780–1790, 2021.
- Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
- Segment anything. arXiv:2304.02643, 2023.
- Grounding language models to images for multimodal inputs and outputs. In ICML, 2023.
- Mmocr: A comprehensive toolbox for text detection, recognition and understanding. arXiv:2108.06543, 2021.
- Lisa: Reasoning segmentation via large language model. arXiv:2308.00692, 2023.
- Otter: A multi-modal model with in-context instruction tuning. arXiv:2305.03726, 2023.
- Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv:2307.16125, 2023.
- Semantic-sam: Segment and recognize anything at any granularity. arXiv:2307.04767, 2023.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597, 2023.
- Evaluating object hallucination in large vision-language models. arXiv:2305.10355, 2023.
- Microsoft coco: Common objects in context. In ECCV, 2014.
- Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv:2311.07575, 2023.
- Improved baselines with visual instruction tuning. arXiv preprint, 2023.
- Visual instruction tuning. In NeurIPS, 2023.
- Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
- Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv:2303.05499, 2023.
- Internchat: Solving vision-centric tasks by interacting with chatbots beyond language. arXiv:2305.05662, 2023.
- Kosmos-2.5: A multimodal literate model. arXiv:2309.11419, 2023.
- Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
- Mm1: Methods, analysis & insights from multimodal llm pre-training. arXiv:2403.09611, 2024.
- OpenAI. GPT-4 technical report. arXiv:2303.08774, 2023.
- OpenAI. Chatgpt: Optimizing language models for dialogue. OpenAI, 2022.
- Dinov2: Learning robust visual features without supervision. arXiv:2304.07193, 2023.
- Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824, 2023.
- Detgpt: Detect what you need via reasoning. arXiv:2305.14167, 2023.
- Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In CVPR, pp. 2641–2649, 2015.
- Jack of all tasks, master of many: Designing general-purpose coarse-to-fine vision-language model. arXiv:2312.12423, 2023.
- Learning transferable visual models from natural language supervision. In ICML, pp. 8748–8763, 2021.
- Textcaps: A dataset for image captioning with reading comprehension. In ECCV, pp. 742–758. Springer, 2020.
- Towards vqa models that can read. In CVPR, pp. 8317–8326, 2019.
- Eva-clip: Improved training techniques for clip at scale. arXiv:2303.15389, 2023.
- Llama: Open and efficient foundation language models. arXiv:2302.13971, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023.
- Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, pp. 23318–23340, 2022.
- Cogvlm: Visual expert for pretrained language models. arXiv:2311.03079, 2023.
- The all-seeing project: Towards panoptic visual recognition and understanding of the open world. arXiv:2308.01907, 2023.
- Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv:2305.11175, 2023.
- Finetuned language models are zero-shot learners. In ICLR, 2021.
- Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv:2303.04671, 2023.
- See, say, and segment: Teaching lmms to overcome false premises. arXiv:2312.08366, 2023.
- Towards large-scale 3d representation learning with multi-dataset point prompt training. arXiv:2308.09718, 2023.
- Unitab: Unifying text and box outputs for grounded vision-language modeling. In ECCV, pp. 521–539. Springer, 2022.
- Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv:2303.11381, 2023.
- mplug-owl: Modularization empowers large language models with multimodality. arXiv:2304.14178, 2023.
- Ferret: Refer and ground anything anywhere at any granularity. arXiv:2310.07704, 2023.
- Modeling context in referring expressions. In ECCV. Springer, 2016.
- Mattnet: Modular attention network for referring expression comprehension. In CVPR, pp. 1307–1315, 2018.
- Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. arXiv:2312.00849, 2023.
- Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv:2308.02490, 2023.
- Osprey: Pixel understanding with visual instruction tuning. arXiv:2312.10032, 2023.
- Griffon v2: Advancing multimodal perception with high-resolution scaling and visual-language co-referring. arXiv:2403.09333, 2024.
- Llava-grounding: Grounded visual chat with large multimodal models. arXiv:2312.02949, 2023.
- Glipv2: Unifying localization and vision-language understanding. In NeurIPS, volume 35, pp. 36067–36080, 2022.
- Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv:2307.03601, 2023.
- Opt: Open pre-trained transformer language models. arXiv:2205.01068, 2022.
- Bubogpt: Enabling visual grounding in multi-modal llms. arXiv:2307.08581, 2023.
- Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592, 2023.
- Segment everything everywhere all at once. arXiv:2304.06718, 2023.