ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

(2407.04172)
Published Jul 4, 2024 in cs.AI, cs.CL, and cs.CV

Abstract

Given the ubiquity of charts as a data analysis, visualization, and decision-making tool across industries and sciences, there has been a growing interest in developing pre-trained foundation models as well as general purpose instruction-tuned models for chart understanding and reasoning. However, existing methods suffer crucial drawbacks across two critical axes affecting the performance of chart representation models: they are trained on data generated from underlying data tables of the charts, ignoring the visual trends and patterns in chart images, and use weakly aligned vision-language backbone models for domain-specific training, limiting their generalizability when encountering charts in the wild. We address these important drawbacks and introduce ChartGemma, a novel chart understanding and reasoning model developed over PaliGemma. Rather than relying on underlying data tables, ChartGemma is trained on instruction-tuning data generated directly from chart images, thus capturing both high-level trends and low-level visual information from a diverse set of charts. Our simple approach achieves state-of-the-art results across 5 benchmarks spanning chart summarization, question answering, and fact-checking, and our elaborate qualitative studies on real-world charts show that ChartGemma generates more realistic and factually correct summaries compared to its contemporaries. We release the code, model checkpoints, dataset, and demos at https://github.com/vis-nlp/ChartGemma.

Figure: Chart images are input to Gemini Flash-1.5, which generates visual instruction-tuning data for fine-tuning ChartGemma.

Overview

  • ChartGemma introduces a novel approach to chart understanding by leveraging visual instruction-tuning directly from chart images, unlike existing methods that rely on data tables.

  • The model employs a SigLIP vision encoder and the Gemma-2B language model, utilizing a comprehensive dataset of over 122,000 charts for effective instruction-tuning.

  • Experimental evaluations demonstrate ChartGemma's state-of-the-art performance across various benchmarks, highlighting its capabilities in chart reasoning and fact-checking tasks.

An Overview of ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

The paper "ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild" introduces a novel approach to chart understanding and reasoning using visual instruction-tuning. This model, named ChartGemma, distinguishes itself by directly utilizing chart images for instruction-tuning, a departure from existing methodologies that rely on the underlying data tables of charts.

Introduction and Motivation

Chart understanding and reasoning are vital tasks across domains such as business, economics, and scientific research, where charts serve as crucial tools for data analysis and decision-making. Existing vision-language models (VLMs) have shown efficacy on general-purpose multimodal tasks but falter in domain-specific applications such as chart reasoning. Specialist models often depend on data generated from the charts' underlying data tables, overlooking the rich visual information in the chart images themselves, which limits their generalizability and effectiveness in real-world applications. This paper addresses these limitations and proposes ChartGemma, a model that leverages instruction-tuning data generated directly from chart images.

Methodology

ChartGemma Architecture

ChartGemma uses PaliGemma, which comprises a SigLIP vision encoder and the Gemma-2B language model, as its backbone. The ViT-based SigLIP encoder splits each chart image into patches and encodes them; the resulting patch embeddings are projected into the language model's embedding space, where they are processed jointly with text tokens by the Gemma-2B decoder-only transformer LLM.
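
To make the architecture concrete, here is a minimal inference sketch using the Hugging Face transformers PaliGemma interface. The checkpoint identifier is an assumption (the released weights are linked from the project repository), and the prompt is illustrative.

```python
# Minimal inference sketch for a PaliGemma-based chart model.
# NOTE: the model id below is an assumption; substitute the checkpoint
# released in the project repository (https://github.com/vis-nlp/ChartGemma).
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "vis-nlp/ChartGemma"  # assumed id; see the project repo
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("chart.png")  # any chart image
prompt = "Which category shows the largest increase?"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```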

Instruction-tuning Data Generation

A comprehensive and diverse corpus of 122,857 charts from synthetic sources, specialized websites, and general web sources forms the foundation of the instruction-tuning dataset. This dataset covers a broad spectrum of visual styles and elements. Importantly, the visual instruction-tuning data is generated directly from chart images using Gemini Flash-1.5, capturing high-level trends and complex visual features.
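
As a rough illustration of this pipeline, the sketch below prompts Gemini Flash-1.5 with a chart image to produce an (instruction, response) pair. The prompt wording is hypothetical, not the paper's exact prompt, and the google-generativeai client is assumed.

```python
# Sketch: generating a visual instruction-tuning example directly from a
# chart image with Gemini Flash-1.5. The prompt below is illustrative only.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # assumes an API key is available
model = genai.GenerativeModel("gemini-1.5-flash")

prompt = (
    "Look at this chart image. Write one question that requires reasoning "
    "over the chart's visual trends, then a step-by-step answer. "
    "Return JSON with keys 'instruction' and 'response'."
)
chart = Image.open("chart_00001.png")
result = model.generate_content([prompt, chart])
print(result.text)  # one (instruction, response) pair for the training set
```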

Training and Implementation

ChartGemma undergoes a single-stage instruction-tuning process, fine-tuning the language model while keeping the vision encoder frozen. This approach contrasts with the two-stage processes of some existing models that initially align vision-language encoders before instruction-tuning.
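
In the transformers implementation of PaliGemma, such a recipe might look like the sketch below, continuing from the model and processor loaded earlier. The attribute name `vision_tower` and the processor's `suffix` argument follow that implementation; the hyperparameters and training text are illustrative rather than the paper's.

```python
# Sketch: single-stage instruction-tuning with the vision encoder frozen.
import torch
from PIL import Image

for param in model.vision_tower.parameters():  # SigLIP encoder stays frozen
    param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)  # illustrative hyperparameters

# One training step on a single (image, instruction, response) example.
image = Image.open("chart_00001.png")
batch = processor(
    text="Summarize the main trend in this chart.",
    suffix="Revenue rises steadily from 2019 to 2023.",  # target response
    images=image,
    return_tensors="pt",
).to(model.device)
outputs = model(**batch)  # the `suffix` argument yields labels for the loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```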

Experimental Evaluation

Benchmarks and Metrics

ChartGemma was evaluated on various established benchmarks, including ChartQA, ChartFC, ChartCheck, OpenCQA, and Chart2Text. Metrics such as relaxed accuracy for ChartQA, accuracy for the fact-checking tasks, and GPT-4-based evaluation for open-ended tasks were employed to comprehensively assess model performance.
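
For reference, relaxed accuracy (introduced with the ChartQA benchmark) counts a numeric answer as correct if it is within 5% of the gold value and otherwise requires an exact string match; a minimal sketch:

```python
def relaxed_accuracy(pred: str, gold: str, tol: float = 0.05) -> bool:
    # ChartQA-style relaxed match: numeric answers pass within a 5% tolerance,
    # non-numeric answers require an exact (case-insensitive) match.
    def to_float(s: str):
        try:
            return float(s.strip().rstrip("%"))
        except ValueError:
            return None

    p, g = to_float(pred), to_float(gold)
    if p is not None and g is not None:
        if g == 0:
            return p == 0
        return abs(p - g) / abs(g) <= tol
    return pred.strip().lower() == gold.strip().lower()
```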

Results

ChartGemma demonstrated state-of-the-art performance across several benchmarks:

  • ChartQA: Achieved superior average performance on factoid question answering tasks, particularly excelling in the human-generated subset.
  • ChartFC and ChartCheck: Notable improvements in accuracy, showcasing its capability in chart fact-checking.
  • Open-ended Tasks: In both informativeness and factual correctness, ChartGemma outperformed existing models, as validated by both GPT-4-based and human evaluation (a sketch of this judging setup follows this list).
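
A hypothetical sketch of this style of LLM-based judging is shown below; the prompt wording, rating scale, and use of the OpenAI chat API are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Hypothetical LLM-as-judge sketch for scoring open-ended chart summaries.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(summary: str, reference: str) -> str:
    prompt = (
        "Given the reference description of a chart, rate the candidate "
        "summary from 1 to 5 on (a) informativeness and (b) factual "
        "correctness. Answer as 'informativeness: X, correctness: Y'.\n"
        f"Reference: {reference}\nCandidate: {summary}"
    )
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```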

Ablation Studies

Two critical hypotheses were tested:

  1. The effectiveness of visual instruction-tuning data compared to data generated from chart tables.
  2. The impact of a strongly aligned backbone model (PaliGemma vs. LLaVA).

Results strongly favored ChartGemma's approach, validating the importance of both dataset quality and initial model alignment.

Analysis and Future Directions

Error Analysis

The error analysis identified several areas where ChartGemma could improve: handling high-resolution chart images, occasional coding errors, and charts with highly diverse visual styles.

Human Evaluation

Human evaluations on a curated set of web-sourced charts corroborated the findings from automated evaluations, further affirming the robustness and applicability of ChartGemma in real-world scenarios.

Conclusion

The paper establishes ChartGemma as a notable advance in chart understanding models, leveraging direct visual instruction-tuning to overcome limitations inherent in previous approaches. The findings underscore the significance of high-quality visual instruction datasets and strongly aligned backbone architectures. Future work aims to broaden the diversity of instruction datasets and develop generalized benchmarks that better address complex visual elements in charts.

ChartGemma's combination of strong numerical reasoning, factual correctness, and visual understanding makes it a practical model for real-world chart interpretation and reasoning tasks. Through its methodology and comprehensive evaluation, ChartGemma paves the way for more effective and generalizable chart understanding models.
