
Abstract

The remarkable progress of Multi-modal LLMs (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their capabilities in turning visual figures into executable code have not been thoroughly evaluated. To address this, we introduce Plot2Code, a comprehensive visual coding benchmark designed for a fair and in-depth assessment of MLLMs. We carefully collect 132 manually selected, high-quality matplotlib plots across six plot types from publicly available matplotlib galleries. For each plot, we provide its source code and a descriptive instruction summarized by GPT-4. This approach enables Plot2Code to extensively evaluate MLLMs' coding capabilities across various input modalities. Furthermore, we propose three automatic evaluation metrics, including code pass rate, text-match ratio, and GPT-4V overall rating, for a fine-grained assessment of the output code and rendered images. Instead of simply judging pass or fail, we employ GPT-4V to make an overall judgement between the generated and reference images, which has been shown to be consistent with human evaluation. The evaluation results, which include analyses of 14 MLLMs such as the proprietary GPT-4V, Gemini-Pro, and the open-sourced Mini-Gemini, highlight the substantial challenges presented by Plot2Code. With Plot2Code, we reveal that most existing MLLMs struggle with visual coding for text-dense plots, heavily relying on textual instruction. We hope that the evaluation results from Plot2Code on visual coding will guide the future development of MLLMs. All data involved with Plot2Code are available at https://huggingface.co/datasets/TencentARC/Plot2Code.
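Of the three metrics, the code pass rate is the most mechanical to reproduce: run each generated snippet in a fresh process and count the fraction that execute without error. The sketch below is a minimal approximation of that idea, not the benchmark's official harness; the function name and wrapping details are illustrative.

```python
import subprocess
import sys
import tempfile


def code_pass_rate(snippets, timeout=60):
    """Approximate Plot2Code's code pass rate: the fraction of generated
    matplotlib snippets that execute without error. Each snippet runs in a
    fresh Python process so one failure cannot affect the others."""
    passed = 0
    for code in snippets:
        # Force a non-interactive backend so plt.show() does not block.
        wrapped = "import matplotlib\nmatplotlib.use('Agg')\n" + code
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(wrapped)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path], capture_output=True, timeout=timeout
            )
            passed += result.returncode == 0
        except subprocess.TimeoutExpired:
            pass  # a hung snippet counts as a failure
    return passed / len(snippets) if snippets else 0.0
```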

Figure: Overview of Plot2Code, showing ground-truth plot samples, generated plots, and the assessment pipeline for code generation.

Overview

  • The paper introduces a new benchmark, 'Plot2Code', designed to evaluate Multi-modal LLMs (MLLMs) in generating executable code from matplotlib plot images.

  • 'Plot2Code' provides a dataset of 132 carefully selected matplotlib plots, each paired with its source code and a GPT-4-written instruction, to test MLLMs across different input modalities and levels of plot complexity.

  • Results show that MLLMs perform better when given textual instructions alongside visual data, indicating areas of focus for future enhancements in AI multi-modal capabilities.

Understanding "Plot2Code": Evaluating MLLMs in Code Generation from Visual Inputs

Introduction to the Study

In recent years, the fusion of visual processing and language models has birthed Multi-modal LLMs (MLLMs). These advanced AI models are capable of understanding and generating responses based on both text and image inputs. However, one challenging aspect remains relatively underexplored: the ability of these models to turn complex visual data, like graphs or plots, into executable code. The paper introduces "Plot2Code," a benchmark designed specifically to evaluate the performance of MLLMs in converting matplotlib plot images into source code.

What is Plot2Code?

"Plot2Code" is not just another dataset. It's a meticulously crafted benchmark containing 132 high-quality matplotlib plots, selected to specifically challenge the MLLMs in diverse visual scenarios. Each plot in the dataset is paired with its source code and a descriptive instruction created by GPT-4, allowing comprehensive testing across various plot types and complexities.

How Does Plot2Code Work?

The authors of the study designed Plot2Code with two main evaluation settings:

  1. Direct Asking: The model receives only the image of the plot and must generate the source code to recreate it.
  2. Conditional Asking: The model is given the plot image along with textual instructions, which detail specifics about the plot that must be reflected in the generated code.

These settings help examine how well models can generate accurate and executable code based purely on visual input, as well as how they handle additional textual descriptions.
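Concretely, the two settings differ only in whether the GPT-4-written instruction accompanies the plot image in the request. Below is a minimal sketch of how a harness might build the two payloads using an OpenAI-style vision chat format; the file name, task wording, and example instruction are placeholders, not the paper's exact prompts.

```python
import base64


def encode_image(path):
    """Base64-encode a plot image for an OpenAI-style vision request."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def build_messages(image_path, instruction=None):
    """Build a chat request for either evaluation setting.

    Direct Asking:      only the plot image is supplied.
    Conditional Asking: the GPT-4-written instruction is appended as text.
    """
    task = "Write self-contained matplotlib code that reproduces this plot."
    if instruction is not None:
        task += "\nAdditional description of the plot:\n" + instruction

    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": task},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}},
        ],
    }]


# Direct Asking: image only.
direct_messages = build_messages("plot_0.png")
# Conditional Asking: image plus a (placeholder) textual instruction.
conditional_messages = build_messages(
    "plot_0.png", instruction="A line plot with two series and a legend."
)
```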

Key Findings from the Study

The evaluation of 14 different MLLMs using Plot2Code revealed several fascinating insights:

  • The top-performing models in the study were GPT-4V and Claude-3, with GPT-4V achieving a score of 7.68 out of 10 on the overall rating in the Conditional Asking setting (a sketch of how such a rating can be collected follows this list).
  • Across the board, MLLMs struggled more with Direct Asking than with Conditional Asking, suggesting that textual instructions play a significant role in guiding the models toward correct code generation.
  • Text-dense plots (plots with a lot of textual information) posed a significant challenge for most models, indicating a potential area for future improvement.
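As described in the abstract, the overall rating comes from GPT-4V itself comparing each generated plot against its reference image. A minimal sketch of such a judging call with the OpenAI Python client is shown below; the judging prompt and model name are placeholders, not the paper's exact configuration.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Illustrative judging prompt; the paper's exact wording is not reproduced here.
JUDGE_PROMPT = (
    "The first image is a reference plot and the second was rendered from "
    "model-generated code. Rate their overall similarity from 1 to 10 and "
    "reply with only the number."
)


def overall_rating(reference_b64, generated_b64, model="gpt-4o"):
    """Ask a GPT-4V-style judge to score a generated plot against its reference.

    Both arguments are base64-encoded PNG images; the model name is a placeholder.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": JUDGE_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{reference_b64}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{generated_b64}"}},
            ],
        }],
    )
    return float(response.choices[0].message.content.strip())
```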

Practical Implications

The results from Plot2Code provide several practical implications for the development of MLLMs:

  • Accuracy in Code Generation: The ability to generate executable code from visual inputs can significantly streamline tasks like automated report generation, data analysis, and more, particularly in data-driven fields like statistics and data science.
  • Model Training and Improvement: Insights from the Plot2Code assessments can help researchers and developers understand current limitations and enhance model training procedures, potentially leading to more robust MLLMs.

Speculations on Future Developments

Looking forward, Plot2Code could drive several advancements in AI:

  • Enhanced Multi-modal Understanding: This benchmark could spur further research into improving the multi-modal capabilities of AI models, ensuring they understand and process combined data forms (textual, visual) more effectively.
  • Development of Specialized Models: We might see the rise of specialized MLLMs that excel in specific domains like scientific visualization or technical diagrams.

Conclusion

Plot2Code represents a significant step in testing and enhancing the capabilities of multi-modal language models in a practical, challenging area of AI: generating code from visual data. While the results indicate room for improvement, particularly in handling plots with dense textual data without supplemental text instructions, they also highlight the considerable potential of current models and set a pathway for future advancements.
