HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding (2403.00425v2)
Abstract: While large vision-language models (LVLMs) have demonstrated impressive capabilities in interpreting multi-modal contexts, they still suffer from object hallucinations (OH). We introduce HALC, a novel decoding algorithm designed to mitigate OH in LVLMs. HALC leverages distinct fine-grained optimal visual information in vision-language tasks and operates on both local and global contexts simultaneously. Specifically, HALC integrates a robust auto-focal grounding mechanism (locally) to correct hallucinated tokens on the fly, and a specialized beam search algorithm (globally) to significantly reduce OH while preserving text generation quality. Additionally, HALC can be integrated into any LVLM as a plug-and-play module without extra training. Extensive experimental studies demonstrate the effectiveness of HALC in reducing OH, outperforming state-of-the-art methods across four benchmarks.
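The abstract describes two cooperating mechanisms: a local focal-grounding step that re-scores candidate tokens against focused visual evidence, and a global beam search that retains low-hallucination sequences. The snippet below is a minimal sketch of the local contrastive idea only, assuming two hypothetical inputs: `logits_focal` from a tight crop around a candidate object and `logits_broad` from a wider field of view. All names, parameters, and toy values are illustrative, not HALC's released implementation.

```python
# Sketch of a focal-contrast decoding step (illustrative, not HALC's code).
# Assumption: the LVLM has produced next-token logits under two fields of
# view; tokens whose probability rises under the focused crop are boosted.
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def focal_contrast_step(logits_focal, logits_broad, alpha=1.0, beta=0.1):
    """Contrast token distributions induced by two fields of view.

    Scores tokens by log p_focal - alpha * log p_broad, and masks tokens
    far below the focal distribution's peak (an adaptive plausibility
    cutoff in the spirit of contrastive decoding).
    """
    p_focal = softmax(logits_focal)
    p_broad = softmax(logits_broad)
    scores = np.log(p_focal + 1e-12) - alpha * np.log(p_broad + 1e-12)
    mask = p_focal >= beta * p_focal.max()  # plausibility constraint
    scores[~mask] = -np.inf
    return int(np.argmax(scores))

# Toy vocabulary: under the focused crop, "clock" gains probability
# relative to the broad crop, so the contrastive step selects it.
vocab = ["clock", "surfboard", "beach", "person"]
logits_focal = np.array([2.0, 0.2, 1.0, 0.5])
logits_broad = np.array([1.0, 1.5, 1.0, 0.5])
print(vocab[focal_contrast_step(logits_focal, logits_broad)])  # -> clock
```

In the full method, such locally corrected candidates would then feed a beam search that ranks partial sequences globally; the sketch above deliberately omits that stage.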
Authors: Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, Jiawei Zhou