Pixel Aligned Language Models

arXiv:2312.09237 · Published Dec 14, 2023 in cs.CV

Abstract

Large language models have achieved great success in recent years, and so have their counterparts in vision. Existing vision-language models can describe images in natural language, answer visually related questions, or perform complex reasoning about an image. However, it is not yet clear how localization tasks, such as word grounding or referring localization, can be performed using LLMs. In this work, we aim to develop a vision-language model that can take locations, for example a set of points or boxes, as either inputs or outputs. When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region. When generating locations as outputs, our model regresses pixel coordinates for each output word generated by the language model, and thus performs dense word grounding. Our model is pre-trained on the Localized Narrative dataset, which contains pixel-word-aligned captioning from human attention. We show that our model can be applied to various location-aware vision-language tasks, including referring localization, location-conditioned captioning, and dense object captioning, achieving state-of-the-art performance on RefCOCO and Visual Genome. Project page: https://jerryxu.net/PixelLLM.

Overview

  • PixelLLM is a novel language model designed to understand the spatial relationship between image content and corresponding text.

  • The model can generate image captions with pixel-level word localization, offering precision in matching text to visual elements.

  • PixelLLM is trained using the Localized Narrative dataset, allowing it to learn from human-annotated visual and textual data.

  • It showcases flexibility by accepting various input combinations of text and image data and providing conditioned outputs.

  • The model achieves state-of-the-art results in tasks like referring localization and dense object captioning, surpassing existing localization methods.

Introduction to Pixel-Aligned Language Models

The growing proficiency of AI in interpreting and generating natural language has been propelled further by the integration of visual inputs. Advanced vision-language models can describe images in text, answer visually grounded questions, and carry out complex reasoning about image content. Nonetheless, how localization tasks, such as word grounding or referring localization, can be performed with language models has remained unclear.

The Pixel-Aligned Language Model

To address this gap, the authors introduce the Pixel-Aligned Language Model (PixelLLM). The model is trained to recognize the spatial alignment between visual content and textual descriptions: it can both generate captions for a given image and identify the pixel location of each word in the caption.

PixelLLM accepts an image together with an optional prompt: a set of points, boxes, or text. When given location prompts, it produces captions focused on the indicated regions. Conversely, when generating output, PixelLLM regresses a pixel coordinate for each output word, attaining dense word grounding. As a foundation, the model is pre-trained on the Localized Narrative dataset, which provides pixel-word-aligned captions derived from human attention.
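
To make these two modes concrete, the interface below is a minimal sketch rather than the authors' code; the class name, method names, and types are illustrative assumptions used only to show how locations can appear as inputs or outputs.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]              # (x, y) in pixel coordinates
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class GroundedCaption:
    words: List[str]      # tokens produced by the language model
    points: List[Point]   # one regressed pixel location per word


class PixelLLMInterface:
    """Hypothetical wrapper illustrating the two usage modes; not the authors' API."""

    def caption_region(self, image, prompt: Box | Point) -> str:
        """Location-conditioned captioning: describe the prompted region."""
        raise NotImplementedError

    def ground_caption(self, image, text_prompt: str | None = None) -> GroundedCaption:
        """Dense word grounding: caption the image and localize every word."""
        raise NotImplementedError
```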

Features and Flexibility of PixelLLM

PixelLLM showcases substantial versatility:

  • It takes an image and an optional text or location prompt as input and generates a caption together with a pixel location for each word.
  • The architecture allows the model to adapt to different vision-language tasks by accepting any combination of text or location as either the input or output.
  • It has been trained on human-annotated captioning data, which includes image captions and trajectories that localize each word.

Architecturally, the model pairs an image encoder with a prompt encoder: visual features extracted from the image are combined with the encoded prompt to condition the language model's output, and a pixel coordinate is regressed for each word the model generates.
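
One simple way to realize the per-word regression described in the abstract is a small MLP head on top of the language model's word features. The sketch below is an assumption about the mechanism, with illustrative dimensions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class WordLocalizationHead(nn.Module):
    """Regresses one (x, y) location per generated token (illustrative sketch)."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 2),  # (x, y) for each token
        )

    def forward(self, word_features: torch.Tensor) -> torch.Tensor:
        # word_features: (batch, num_tokens, hidden_dim) from the language model.
        # Outputs are squashed to [0, 1]; scale by image width/height for pixels.
        return self.mlp(word_features).sigmoid()


# Example: 12 generated tokens yield 12 regressed locations.
features = torch.randn(1, 12, 768)
coords = WordLocalizationHead()(features)  # shape: (1, 12, 2)
```

Regressing continuous coordinates in this way keeps localization differentiable and avoids enlarging the language model's vocabulary with special location tokens.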

Performance and Applications

PixelLLM's performance is notable in several key areas. It achieves state-of-the-art results on referring localization on the RefCOCO datasets and on dense object captioning on Visual Genome. Its success rests on formulating localization as direct per-word pixel-coordinate regression, which allows it to outperform methods that encode location sparsely as raw strings or extra vocabulary tokens.
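
The contrast below is purely illustrative (neither output is taken from the paper); it shows the difference between serializing a location into the text stream and pairing every generated word with a regressed coordinate.

```python
# Prior approaches often encode a location sparsely as raw text or special tokens
# (the <box> format here is hypothetical):
prior_style = "a dog <box>0.21 0.35 0.78 0.90</box> lying on the grass"

# PixelLLM instead attaches a regressed pixel coordinate to every generated word:
pixelllm_style = [
    ("a",     (118.0, 242.0)),
    ("dog",   (151.0, 230.0)),
    ("lying", (149.0, 238.0)),
    ("on",    (160.0, 251.0)),
    ("the",   (171.0, 263.0)),
    ("grass", (180.0, 270.0)),
]
```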

The model also performs well on location-conditioned captioning, reflecting a fine-grained understanding of visual regions and object-specific descriptions.

Conclusion

The introduction of PixelLLM marks a significant stride in vision-language modeling. By combining the expressive power of language models with fine-grained spatial understanding, this work could pave the way for AI systems that interact more naturally with both the visual and the textual world, and its strong results point to further progress in tightly coupling vision and language.
