ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

(2407.04172)
Published Jul 4, 2024 in cs.AI, cs.CL, and cs.CV

Abstract

Given the ubiquity of charts as a data analysis, visualization, and decision-making tool across industries and sciences, there has been a growing interest in developing pre-trained foundation models as well as general purpose instruction-tuned models for chart understanding and reasoning. However, existing methods suffer crucial drawbacks across two critical axes affecting the performance of chart representation models: they are trained on data generated from underlying data tables of the charts, ignoring the visual trends and patterns in chart images, and use weakly aligned vision-language backbone models for domain-specific training, limiting their generalizability when encountering charts in the wild. We address these important drawbacks and introduce ChartGemma, a novel chart understanding and reasoning model developed over PaliGemma. Rather than relying on underlying data tables, ChartGemma is trained on instruction-tuning data generated directly from chart images, thus capturing both high-level trends and low-level visual information from a diverse set of charts. Our simple approach achieves state-of-the-art results across 5 benchmarks spanning chart summarization, question answering, and fact-checking, and our elaborate qualitative studies on real-world charts show that ChartGemma generates more realistic and factually correct summaries compared to its contemporaries. We release the code, model checkpoints, dataset, and demos at https://github.com/vis-nlp/ChartGemma.

Figure: Chart images are input to Gemini Flash-1.5, which generates visual instruction-tuning data for fine-tuning ChartGemma.

Overview

  • ChartGemma introduces a novel approach to chart understanding by leveraging visual instruction-tuning directly from chart images, unlike existing methods that rely on data tables.

  • The model employs a SigLIP vision encoder and the Gemma-2B language model, utilizing a comprehensive dataset of over 122,000 charts for effective instruction-tuning.

  • Experimental evaluations demonstrate ChartGemma's state-of-the-art performance across various benchmarks, highlighting its capabilities in chart reasoning and fact-checking tasks.

An Overview of ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

The paper "ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild" introduces a novel approach to chart understanding and reasoning using visual instruction-tuning. This model, named ChartGemma, distinguishes itself by directly utilizing chart images for instruction-tuning, a departure from existing methodologies that rely on the underlying data tables of charts.

Introduction and Motivation

Chart understanding and reasoning are vital tasks across domains such as business, economics, and scientific research, where charts serve as crucial tools for data analysis and decision-making. Existing vision-language models (VLMs) have shown efficacy on general-purpose multimodal tasks but falter in domain-specific applications such as chart reasoning. Specialist models often depend on data generated from the charts' underlying data tables, overlooking the rich visual information in the chart images themselves, which limits their generalizability and effectiveness in real-world applications. This paper addresses these limitations and proposes ChartGemma, a model that leverages instruction-tuning data generated directly from chart images.

Methodology

ChartGemma Architecture

ChartGemma uses PaliGemma, which comprises a SigLIP vision encoder and the Gemma-2B language model, as its backbone. The ViT-based SigLIP encoder splits each chart image into patches and encodes them; the resulting patch embeddings are projected into the language model's embedding space, where they are processed jointly with text tokens by the Gemma-2B decoder-only transformer LLM.
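
To make the architecture concrete, here is a minimal inference sketch using the Hugging Face transformers PaliGemma interface. The checkpoint identifier is an assumption (the released weights are linked from the project repository), and the prompt is illustrative.

```python
# Minimal inference sketch for a PaliGemma-based chart model.
# NOTE: the model id below is an assumption; substitute the checkpoint
# released in the project repository (https://github.com/vis-nlp/ChartGemma).
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "vis-nlp/ChartGemma"  # assumed id; see the project repo
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("chart.png")  # any chart image
prompt = "Which category shows the largest increase?"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```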

Instruction-tuning Data Generation

A comprehensive and diverse corpus of 122,857 charts from synthetic sources, specialized websites, and general web sources forms the foundation of the instruction-tuning dataset. This dataset covers a broad spectrum of visual styles and elements. Importantly, the visual instruction-tuning data is generated directly from chart images using Gemini Flash-1.5, capturing high-level trends and complex visual features.
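
As a rough illustration of this pipeline, the sketch below prompts Gemini Flash-1.5 with a chart image to produce an (instruction, response) pair. The prompt wording is hypothetical, not the paper's exact prompt, and the google-generativeai client is assumed.

```python
# Sketch: generating a visual instruction-tuning example directly from a
# chart image with Gemini Flash-1.5. The prompt below is illustrative only.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # assumes an API key is available
model = genai.GenerativeModel("gemini-1.5-flash")

prompt = (
    "Look at this chart image. Write one question that requires reasoning "
    "over the chart's visual trends, then a step-by-step answer. "
    "Return JSON with keys 'instruction' and 'response'."
)
chart = Image.open("chart_00001.png")
result = model.generate_content([prompt, chart])
print(result.text)  # one (instruction, response) pair for the training set
```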

Training and Implementation

ChartGemma undergoes a single-stage instruction-tuning process, fine-tuning the language model while keeping the vision encoder frozen. This approach contrasts with the two-stage processes of some existing models that initially align vision-language encoders before instruction-tuning.
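
In the transformers implementation of PaliGemma, such a recipe might look like the sketch below, continuing from the model and processor loaded earlier. The attribute name `vision_tower` and the processor's `suffix` argument follow that implementation; the hyperparameters and training text are illustrative rather than the paper's.

```python
# Sketch: single-stage instruction-tuning with the vision encoder frozen.
import torch
from PIL import Image

for param in model.vision_tower.parameters():  # SigLIP encoder stays frozen
    param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)  # illustrative hyperparameters

# One training step on a single (image, instruction, response) example.
image = Image.open("chart_00001.png")
batch = processor(
    text="Summarize the main trend in this chart.",
    suffix="Revenue rises steadily from 2019 to 2023.",  # target response
    images=image,
    return_tensors="pt",
).to(model.device)
outputs = model(**batch)  # the `suffix` argument yields labels for the loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```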

Experimental Evaluation

Benchmarks and Metrics

ChartGemma was evaluated on various established benchmarks, including ChartQA, ChartFC, ChartCheck, OpenCQA, and Chart2Text. Metrics such as relaxed accuracy for ChartQA, accuracy for the fact-checking tasks, and GPT-4-based evaluation for open-ended tasks were employed to comprehensively assess model performance.
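
For reference, relaxed accuracy (introduced with the ChartQA benchmark) counts a numeric answer as correct if it is within 5% of the gold value and otherwise requires an exact string match; a minimal sketch:

```python
def relaxed_accuracy(pred: str, gold: str, tol: float = 0.05) -> bool:
    # ChartQA-style relaxed match: numeric answers pass within a 5% tolerance,
    # non-numeric answers require an exact (case-insensitive) match.
    def to_float(s: str):
        try:
            return float(s.strip().rstrip("%"))
        except ValueError:
            return None

    p, g = to_float(pred), to_float(gold)
    if p is not None and g is not None:
        if g == 0:
            return p == 0
        return abs(p - g) / abs(g) <= tol
    return pred.strip().lower() == gold.strip().lower()
```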

Results

ChartGemma demonstrated state-of-the-art performance across several benchmarks:

  • ChartQA: Achieved superior average performance on factoid question answering tasks, particularly excelling in the human-generated subset.
  • ChartFC and ChartCheck: Notable improvements in accuracy, showcasing its capability in chart fact-checking.
  • Open-ended Tasks: In both informativeness and factual correctness, ChartGemma outperformed existing models, as validated by both GPT-4-based and human evaluation (a sketch of this judging setup follows this list).
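
A hypothetical sketch of this style of LLM-based judging is shown below; the prompt wording, rating scale, and use of the OpenAI chat API are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Hypothetical LLM-as-judge sketch for scoring open-ended chart summaries.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(summary: str, reference: str) -> str:
    prompt = (
        "Given the reference description of a chart, rate the candidate "
        "summary from 1 to 5 on (a) informativeness and (b) factual "
        "correctness. Answer as 'informativeness: X, correctness: Y'.\n"
        f"Reference: {reference}\nCandidate: {summary}"
    )
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```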

Ablation Studies

Two critical hypotheses were tested:

  1. The effectiveness of visual instruction-tuning data compared to data generated from chart tables.
  2. The impact of a strongly aligned backbone model (PaliGemma vs. LLaVA).

Results strongly favored ChartGemma's approach, validating the importance of both dataset quality and initial model alignment.

Analysis and Future Directions

Error Analysis

The error analysis identified several areas where ChartGemma could improve: handling high-resolution chart images, occasional coding errors, and charts with highly diverse visual styles.

Human Evaluation

Human evaluations on a curated set of web-sourced charts corroborated the findings from automated evaluations, further affirming the robustness and applicability of ChartGemma in real-world scenarios.

Conclusion

The paper establishes ChartGemma as a notable advance in chart understanding models, leveraging direct visual instruction-tuning to overcome limitations inherent in previous approaches. The findings underscore the significance of high-quality visual instruction datasets and strongly aligned backbone architectures. Future work aims to broaden the diversity of instruction datasets and develop generalized benchmarks that better address complex visual elements in charts.

ChartGemma's combination of strong numerical reasoning, factual correctness, and visual understanding makes it a practical model for real-world chart interpretation and reasoning tasks. Through its methodology and comprehensive evaluation, ChartGemma paves the way for more effective and generalizable chart understanding models.
