
Abstract

The ability of LLMs to process visual inputs has given rise to general-purpose vision systems, unifying various vision-language (VL) tasks by instruction tuning. However, due to the enormous diversity in input-output formats in the vision domain, existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a single framework. In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework. VistaLLM utilizes an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences, significantly improving over previously used uniform sampling. To bolster the desired capability of VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task, AttCoSeg (Attribute-level Co-Segmentation), which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of vision-only (V) and vision-language (VL) tasks demonstrate the effectiveness of VistaLLM by achieving consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.

Overview

  • The paper introduces VistaLLM, a general-purpose model that unifies coarse- and fine-grained vision-language tasks over single and multiple images, built around an instruction-guided image tokenization technique.

  • VistaLLM leverages a gradient-aware adaptive sampling method to represent segmentation masks as point sequences, enhancing its fine-grained, pixel-level outputs.

  • It is trained on the curated CoinIt dataset of 6.8 million samples and excels in a wide range of vision and vision-language tasks.

  • A new task, AttCoSeg (Attribute-level Co-Segmentation), is introduced to strengthen reasoning and grounding over multiple input images.

  • VistaLLM achieves state-of-the-art results across 15 evaluation benchmarks, outperforming strong general-purpose baselines and, on many of them, specialist systems.

Introduction to Vision-Language Models

Recent progress in AI research has given rise to models that can interpret and generate content that blends visual elements with natural language. These developments enable the creation of systems that can perform a wide array of vision-language tasks by simply following instructions. Nevertheless, the diversity of formats in visual tasks poses challenges in integrating segmentation and multi-image inputs with other tasks within a single framework.

Unified Framework: VistaLLM

To address these challenges, the authors introduce VistaLLM. The model uses instruction-guided image tokenization to filter and compress image features according to the given task description. VistaLLM also represents binary segmentation masks as point sequences produced by a gradient-aware adaptive sampling technique, a marked improvement over the previously used uniform sampling that strengthens the model's fine-grained, pixel-level outputs.
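
The gradient-aware adaptive sampling idea can be illustrated with a short sketch. Since this summary does not spell out the exact procedure, the snippet below is only a minimal illustration under assumed details: it treats the local change in direction along the mask boundary as the "gradient" signal and spends a fixed point budget proportionally to it, so sharply curved regions receive more points than uniform arc-length sampling would give. The function name adaptive_contour_points and the use of OpenCV for contour extraction are illustrative choices, not VistaLLM's actual implementation.

```python
import numpy as np
import cv2  # assumption: OpenCV is used here only to extract the mask boundary


def adaptive_contour_points(mask: np.ndarray, num_points: int = 32) -> np.ndarray:
    """Sample `num_points` (x, y) points from the boundary of a binary mask,
    placing more points where the contour bends sharply and fewer along
    straight segments, instead of spacing them uniformly."""
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE
    )
    contour = max(contours, key=cv2.contourArea).squeeze(1).astype(np.float32)  # (M, 2)

    # Local "gradient": how much the tangent direction changes at each point.
    step = np.diff(contour, axis=0, append=contour[:1])
    angle = np.arctan2(step[:, 1], step[:, 0])
    turn = np.abs(np.diff(angle, append=angle[:1]))
    turn = np.minimum(turn, 2 * np.pi - turn)  # handle angle wrap-around

    # Build a cumulative importance curve; a small floor keeps flat regions
    # from being skipped entirely, then quantiles of the curve pick the points.
    importance = turn + 1e-3
    cdf = np.cumsum(importance)
    cdf /= cdf[-1]
    targets = np.linspace(0.0, 1.0, num_points, endpoint=False)
    idx = np.searchsorted(cdf, targets)
    return contour[idx]  # (num_points, 2), denser where curvature is high
```

In a sequence-to-sequence setup, the sampled (x, y) points would then be normalized to the image size and serialized as discrete tokens in the model's output sequence; the paper's exact tokenization and point budget may differ from this sketch.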

Extensive Evaluation

The effectiveness of VistaLLM has been thoroughly evaluated through extensive experiments across a variety of vision and vision-language tasks. Training is supported by CoinIt, a specially curated coarse-to-fine instruction-tuning dataset of 6.8 million samples. A novel task, AttCoSeg (Attribute-level Co-Segmentation), is also introduced to strengthen multi-image reasoning and grounding. VistaLLM consistently outperforms existing models, showing state-of-the-art performance across these challenging tasks.

Contributions and Benchmarks

VistaLLM's key contributions include the seamless integration of coarse- and fine-grained tasks over both single and multiple input images and a sequence-generation technique for grounding tasks. The newly constructed CoinIt dataset facilitates training, while AttCoSeg addresses the lack of multi-image grounding data in the vision-language domain. The model's effectiveness is evident across 15 evaluation benchmarks, spanning image captioning, visual question answering, segmentation, and multi-image reasoning. Notably, VistaLLM surpasses even specialist systems on many of these benchmarks.

The research signifies an important step towards developing more versatile and capable general-purpose vision-language models that could revolutionize how machines understand and interact with visual content in the context of human language.
