
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation

(2406.09961)
Published Jun 14, 2024 in cs.SE, cs.CL, and cs.CV

Abstract

We introduce a new benchmark, ChartMimic, aimed at assessing the visually-grounded code generation capabilities of large multimodal models (LMMs). ChartMimic utilizes information-intensive visual charts and textual instructions as inputs, requiring LMMs to generate the corresponding code for chart rendering. ChartMimic includes 1,000 human-curated (figure, instruction, code) triplets, which represent authentic chart use cases found in scientific papers across various domains (e.g., Physics, Computer Science, Economics). These charts span 18 regular types and 4 advanced types, further diversified into 191 subcategories. Furthermore, we propose multi-level evaluation metrics to provide an automatic and thorough assessment of the output code and the rendered charts. Unlike existing code generation benchmarks, ChartMimic places emphasis on evaluating LMMs' capacity to harmonize a blend of cognitive capabilities, encompassing visual understanding, code generation, and cross-modal reasoning. The evaluation of 3 proprietary models and 11 open-weight models highlights the substantial challenges posed by ChartMimic. Even the advanced GPT-4V and Claude-3-opus achieve average scores of only 73.2 and 53.7, respectively, indicating significant room for improvement. We anticipate that ChartMimic will inspire the development of LMMs, advancing the pursuit of artificial general intelligence.

ChartMimic's workflow using 1,000 curated (figure, instruction, code) triplets to assess LMMs' multimodal chart-to-code abilities.

Overview

  • ChartMimic is a new benchmark designed to evaluate the visually-grounded code generation capabilities of large multimodal models (LMMs) using complex visual charts and textual instructions.

  • The benchmark consists of 1,000 human-curated triplets of figures, instructions, and corresponding code from scientific papers across multiple domains, featuring 18 regular and 4 advanced chart types in 191 subcategories.

  • The paper includes a comprehensive performance evaluation of 14 LMMs, revealing a significant performance disparity between proprietary and open-weight models, and provides detailed error analysis and future research directions.

ChartMimic: Evaluating LMMs' Cross-Modal Reasoning Capability via Chart-to-Code Generation

The paper introduces ChartMimic, a new benchmark designed to evaluate the visually-grounded code generation capabilities of large multimodal models (LMMs). Unlike the majority of existing code generation benchmarks, which rely solely on textual inputs, ChartMimic leverages information-intensive visual charts accompanied by textual instructions. The benchmark challenges LMMs to generate accurate code for rendering these charts, demanding a synthesis of visual understanding, code generation, and cross-modal reasoning abilities.

Benchmark Overview

ChartMimic is composed of 1,000 human-curated triplets of figures, instructions, and corresponding code. These data points are extracted from scientific papers encompassing various domains such as Physics, Computer Science, and Economics. The charts span 18 regular types and 4 advanced types, which are further diversified into 191 subcategories. This extensive diversity ensures the benchmark provides a comprehensive evaluation of LMM capabilities in generating code from complex and varied visual inputs.
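The summary does not specify how the benchmark stores its examples; as a minimal sketch, a (figure, instruction, code) triplet might be represented and loaded as below. The field names and file layout are illustrative assumptions, not ChartMimic's actual schema.

```python
# Hypothetical representation of one ChartMimic-style example.
# Field names and file layout are assumptions for illustration.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ChartTriplet:
    figure_path: Path   # rendered ground-truth chart (e.g., a PNG extracted from a paper)
    instruction: str    # textual instruction describing the desired chart
    code: str           # ground-truth plotting code that renders the figure

def load_triplet(example_dir: Path) -> ChartTriplet:
    """Load one example from a directory assumed to contain figure.png, instruction.txt, code.py."""
    return ChartTriplet(
        figure_path=example_dir / "figure.png",
        instruction=(example_dir / "instruction.txt").read_text(),
        code=(example_dir / "code.py").read_text(),
    )
```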

Evaluation Metrics

To thoroughly assess the performance of LMMs on ChartMimic, the authors propose multi-level evaluation metrics. These metrics include both high-level and low-level assessments. The high-level metric (GPT-4V Score) relies on GPT-4V to evaluate the visual similarity between the rendered and ground-truth figures, while low-level metrics encompass text, layout, type, and color scores. These multi-faceted metrics allow for a detailed evaluation of code accuracy and visual fidelity, providing insights into different aspects of the models' cross-modal reasoning abilities.
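The summary does not detail how the low-level scores are computed; one plausible illustration of a text score is to harvest the text elements of the generated and ground-truth matplotlib figures and measure their F1 overlap. The extraction and matching rules below are assumptions, not the benchmark's implementation.

```python
# A minimal sketch of a low-level "text score": F1 overlap between the text
# elements of a generated figure and a ground-truth figure. The matching rule
# (exact string overlap) is a simplifying assumption.
from collections import Counter
import matplotlib.pyplot as plt

def extract_texts(fig: plt.Figure) -> list[str]:
    """Collect visible text (titles, axis labels, tick labels, legends, annotations) from a figure."""
    texts = []
    for ax in fig.get_axes():
        texts.extend(t.get_text() for t in ax.texts)
        texts.append(ax.get_title())
        texts.append(ax.get_xlabel())
        texts.append(ax.get_ylabel())
        texts.extend(lbl.get_text() for lbl in ax.get_xticklabels() + ax.get_yticklabels())
        legend = ax.get_legend()
        if legend is not None:
            texts.extend(t.get_text() for t in legend.get_texts())
    return [t for t in texts if t.strip()]

def text_f1(generated: plt.Figure, reference: plt.Figure) -> float:
    """F1 overlap between the multisets of text elements in two figures."""
    gen, ref = extract_texts(generated), extract_texts(reference)
    if not gen or not ref:
        return 0.0
    overlap = sum((Counter(gen) & Counter(ref)).values())
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```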

Model Performance

The paper benchmarks 14 LMMs: 3 proprietary models (GPT-4V, Claude-3-opus, GeminiProVision) and 11 open-weight models (e.g., LLaVA-Next-Vicuna-7B, Phi-3-Vision). The evaluation reveals a substantial performance disparity between open-weight and proprietary models. Specifically, GPT-4V outperforms all other models, achieving an average overall score of 71.4 on the Direct Mimic task (reproducing a chart from the input figure alone) and 72.33 on the Customized Mimic task (adapting the mimicked chart to newly provided data). In contrast, the best-performing open-weight model, Phi-3-Vision, scores significantly lower (31.9 and 40.18, respectively). This highlights the substantial challenges posed by ChartMimic and indicates significant room for improvement in the open-source LMM community.

Error Analysis

The paper includes a comprehensive error analysis, categorizing errors into code-related, text-related, type-related, and color-related issues. The most prevalent errors stem from dimension issues in code (e.g., incorrect data dimensions), missing text elements, and misinterpreted chart types. These insights emphasize the need for improved model capabilities in understanding and accurately reproducing the nuanced visual elements and data relationships within the charts.
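To make the code-related category concrete, the sketch below shows one way generated chart code could be executed headlessly and its failures bucketed. The keyword-based classification is a heuristic assumption for illustration, not the paper's actual analysis procedure.

```python
# A hedged sketch: run model-generated chart code in a subprocess and bucket
# failures by exception message. The taxonomy (dimension vs. other errors) and
# the keyword matching are illustrative assumptions.
import subprocess
import tempfile
from pathlib import Path

def classify_code_error(generated_code: str, timeout: int = 30) -> str:
    """Run generated matplotlib code headlessly and label the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "chart.py"
        # Force a non-interactive backend so the script can run without a display.
        script.write_text("import matplotlib\nmatplotlib.use('Agg')\n" + generated_code)
        try:
            result = subprocess.run(
                ["python", str(script)], capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            return "code: timeout"
    if result.returncode == 0:
        return "executable"  # ran successfully; text/type/color issues need separate checks
    stderr = result.stderr.lower()
    if "shape" in stderr or "dimension" in stderr or "must have same" in stderr:
        return "code: dimension error"  # e.g., x and y arrays of mismatched length
    return "code: other execution error"
```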

Implications and Future Directions

The introduction of ChartMimic has several implications for the development of LMMs and the pursuit of artificial general intelligence (AGI). By emphasizing the necessity for advanced cross-modal reasoning, ChartMimic pushes the boundaries of current model capabilities, highlighting both strengths and areas for improvement. The benchmark’s comprehensive evaluation framework not only offers a robust tool for researchers to assess and enhance their models but also encourages the exploration of innovative techniques to bridge the performance gap between open-weight and proprietary models.

Future research may focus on various aspects such as refining prompt strategies for multimodal reasoning, enhancing data pre-processing and augmentation techniques, and developing more sophisticated model architectures. Additionally, expanding the benchmark to include more diverse and complex visual inputs could further challenge and advance the field of LMM development.

Conclusion

ChartMimic provides a rigorous and multifaceted benchmark for evaluating the cross-modal reasoning capabilities of LMMs in the context of chart-to-code generation. By incorporating diverse and information-intensive visual inputs, along with a robust evaluation framework, ChartMimic sets a high bar for future advancements in the field. The benchmark’s insights and detailed error analysis present valuable opportunities for researchers to innovate and improve large multimodal models, driving forward the quest for AGI.
