
Abstract

The ability of LLMs to process visual inputs has given rise to general-purpose vision systems, unifying various vision-language (VL) tasks by instruction tuning. However, due to the enormous diversity in input-output formats in the vision domain, existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a single framework. In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework. VistaLLM utilizes an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences, significantly improving over previously used uniform sampling. To bolster the desired capability of VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task, AttCoSeg (Attribute-level Co-Segmentation), which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of vision-only (V) and vision-language (VL) tasks demonstrate the effectiveness of VistaLLM by achieving consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.

Overview

  • The paper introduces VistaLLM, a general-purpose model that unifies coarse- and fine-grained vision-language tasks over single and multiple images, built around an instruction-guided image tokenization technique.

  • VistaLLM leverages a gradient-aware adaptive sampling method to represent segmentation masks as point sequences, enhancing its fine-grained, pixel-level outputs.

  • It is trained on the curated CoinIt dataset of 6.8 million samples and excels in a wide range of vision and vision-language tasks.

  • A new task, AttCoSeg (Attribute-level Co-Segmentation), is introduced to strengthen reasoning and grounding over multiple input images.

  • VistaLLM achieves state-of-the-art results across 15 evaluation benchmarks, outperforming strong general-purpose baselines and, on many of them, specialist systems.

Introduction to Vision-Language Models

Recent progress in AI research has given rise to models that can interpret and generate content that blends visual elements with natural language. These developments enable the creation of systems that can perform a wide array of vision-language tasks by simply following instructions. Nevertheless, the diversity of formats in visual tasks poses challenges in integrating segmentation and multi-image inputs with other tasks within a single framework.

Unified Framework: VistaLLM

To address these challenges, the authors introduce VistaLLM. The model uses instruction-guided image tokenization to filter and compress image features according to the given task description. VistaLLM also represents binary segmentation masks as point sequences produced by a gradient-aware adaptive sampling technique, a marked improvement over the previously used uniform sampling that strengthens the model's fine-grained, pixel-level outputs.
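
The gradient-aware adaptive sampling idea can be illustrated with a short sketch. Since this summary does not spell out the exact procedure, the snippet below is only a minimal illustration under assumed details: it treats the local change in direction along the mask boundary as the "gradient" signal and spends a fixed point budget proportionally to it, so sharply curved regions receive more points than uniform arc-length sampling would give. The function name adaptive_contour_points and the use of OpenCV for contour extraction are illustrative choices, not VistaLLM's actual implementation.

```python
import numpy as np
import cv2  # assumption: OpenCV is used here only to extract the mask boundary


def adaptive_contour_points(mask: np.ndarray, num_points: int = 32) -> np.ndarray:
    """Sample `num_points` (x, y) points from the boundary of a binary mask,
    placing more points where the contour bends sharply and fewer along
    straight segments, instead of spacing them uniformly."""
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE
    )
    contour = max(contours, key=cv2.contourArea).squeeze(1).astype(np.float32)  # (M, 2)

    # Local "gradient": how much the tangent direction changes at each point.
    step = np.diff(contour, axis=0, append=contour[:1])
    angle = np.arctan2(step[:, 1], step[:, 0])
    turn = np.abs(np.diff(angle, append=angle[:1]))
    turn = np.minimum(turn, 2 * np.pi - turn)  # handle angle wrap-around

    # Build a cumulative importance curve; a small floor keeps flat regions
    # from being skipped entirely, then quantiles of the curve pick the points.
    importance = turn + 1e-3
    cdf = np.cumsum(importance)
    cdf /= cdf[-1]
    targets = np.linspace(0.0, 1.0, num_points, endpoint=False)
    idx = np.searchsorted(cdf, targets)
    return contour[idx]  # (num_points, 2), denser where curvature is high
```

In a sequence-to-sequence setup, the sampled (x, y) points would then be normalized to the image size and serialized as discrete tokens in the model's output sequence; the paper's exact tokenization and point budget may differ from this sketch.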

Extensive Evaluation

The effectiveness of VistaLLM has been thoroughly evaluated through extensive experiments across a variety of vision and vision-language tasks. Training is supported by CoinIt, a specially curated coarse-to-fine instruction-tuning dataset of 6.8 million samples. A novel task, AttCoSeg (Attribute-level Co-Segmentation), is also introduced to strengthen multi-image reasoning and grounding. VistaLLM consistently outperforms existing models, showing state-of-the-art performance across these challenging tasks.

Contributions and Benchmarks

VistaLLM's key contributions include the seamless integration of coarse- and fine-grained tasks over both single and multiple input images and a sequence-generation technique for grounding tasks. The newly constructed CoinIt dataset facilitates training, while AttCoSeg addresses the lack of multi-image grounding data in the vision-language domain. The model's effectiveness is evident across 15 evaluation benchmarks, spanning image captioning, visual question answering, segmentation, and multi-image reasoning. Notably, VistaLLM surpasses even specialist systems on many of these benchmarks.

The research signifies an important step towards developing more versatile and capable general-purpose vision-language models that could revolutionize how machines understand and interact with visual content in the context of human language.
