VisualFactChecker: Enabling High-Fidelity Detailed Caption Generation (2404.19752v1)
Abstract: Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where an LLM uses tools such as object detection and VQA models to fact-check the proposed captions; 3) captioning, where an LLM generates the final caption by summarizing the caption proposals and the fact-checking results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original image and a reconstructed image generated by a text-to-image model from the caption; 3) a human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-source captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.
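The abstract describes the pipeline only at a high level, so the following is a minimal sketch of how the proposal, verification, and captioning steps could be wired together. Every callable here (captioner, detector, vqa, llm) and every prompt string is a hypothetical placeholder standing in for the open-source models and tools the paper refers to, not the exact components or prompts used by VFC.

```python
# Minimal sketch of the three-step VisualFactChecker (VFC) pipeline.
# All callables are hypothetical placeholders, not the paper's concrete APIs.

from typing import Callable, Sequence


def visual_fact_checker(
    image,                                   # image path or array
    captioners: Sequence[Callable],          # image -> proposed caption string
    detector: Callable,                      # (image, object names) -> detection summary
    vqa: Callable,                           # (image, question) -> answer string
    llm: Callable[[str], str],               # prompt -> completion
    instruction: str = "Write one detailed, faithful caption.",
) -> str:
    # 1) Proposal: multiple image-to-text models propose initial captions.
    proposals = [captioner(image) for captioner in captioners]

    # 2) Verification: the LLM drives tools (object detection, VQA) to
    #    fact-check the content claimed in the proposals.
    objects = llm(
        "List, comma-separated, every object mentioned in these captions: "
        + " | ".join(proposals)
    ).split(",")
    detections = detector(image, [o.strip() for o in objects])

    questions = llm(
        "Write one yes/no question per line to verify these captions: "
        + " | ".join(proposals)
    ).splitlines()
    checks = [(q, vqa(image, q)) for q in questions if q.strip()]

    # 3) Captioning: the LLM summarizes the proposals plus verification
    #    results into a final caption, following the style instruction.
    prompt = (
        f"Instruction: {instruction}\n"
        f"Caption proposals: {proposals}\n"
        f"Detected objects: {detections}\n"
        f"VQA checks: {checks}\n"
        "Write the final caption using only content supported by the checks."
    )
    return llm(prompt)
```

The CLIP-Image-Score metric can be sketched in the same spirit: reconstruct an image from the candidate caption with a text-to-image model, then compare CLIP image embeddings of the original and reconstructed images by cosine similarity. The text_to_image and clip_image_embed functions below are again assumptions rather than the paper's implementation.

```python
import numpy as np


def clip_image_score(original_image, caption, text_to_image, clip_image_embed) -> float:
    # Reconstruct an image from the caption, then compare CLIP image embeddings
    # of the original and the reconstruction via cosine similarity.
    reconstructed = text_to_image(caption)
    a = clip_image_embed(original_image)
    b = clip_image_embed(reconstructed)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```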
- Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
- Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
- Improving image generation with better captions. Technical report, OpenAI, 2023.
- Improving image captioning descriptiveness by ranking and LLM-based fusion. arXiv preprint arXiv:2306.11593, 2023.
- Language models are few-shot learners. In NeurIPS, 2020.
- IC3: Image captioning by committee consensus. arXiv preprint arXiv:2302.01328, 2023.
- PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
- Objaverse: A universe of annotated 3D objects. In CVPR, 2023.
- Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
- Detecting and preventing hallucinations in large vision language models. arXiv preprint arXiv:2308.06394, 2023.
- CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
- PromptCap: Prompt-guided task-aware image captioning. arXiv preprint arXiv:2211.09699, 2022.
- Attention on attention for image captioning. In ICCV, 2019.
- Are these the same apple? Comparing images based on object intrinsics. arXiv preprint arXiv:2311.00750, 2023.
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
- Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.
- Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
- Evaluation and mitigation of agnosia in multimodal large language models. arXiv preprint arXiv:2309.04041, 2023.
- Scalable 3D captioning with pretrained models. In NeurIPS, 2023.
- ClipCap: CLIP prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
- GPT-4 technical report. OpenAI, 2023.
- Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
- SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
- MVDream: Multi-view diffusion for 3D generation. arXiv preprint arXiv:2308.16512, 2023.
- LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
- Plug-and-Play VQA: Zero-shot VQA by conjoining large pretrained models with zero training. arXiv preprint arXiv:2210.08773, 2022.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- VIGC: Visual instruction generation and correction. arXiv preprint arXiv:2308.12714, 2023.
- Evaluation and analysis of hallucination in large vision-language models. arXiv preprint arXiv:2308.15126, 2023.
- OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, 2022.
- Visual clues: Bridging vision and language foundations for image paragraph captioning. In NeurIPS, 2022.
- Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022.
- ChatGPT asks, BLIP-2 answers: Automatic questioning towards enriched visual descriptions. arXiv preprint arXiv:2303.06594, 2023.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.