Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots (2405.07990v1)
Abstract: The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their capabilities in turning visual figures into executable code have not been thoroughly evaluated. To address this, we introduce Plot2Code, a comprehensive visual coding benchmark designed for a fair and in-depth assessment of MLLMs. We collect 132 manually curated, high-quality matplotlib plots spanning six plot types from publicly available matplotlib galleries. For each plot, we provide its source code and a descriptive instruction summarized by GPT-4. This design enables Plot2Code to extensively evaluate MLLMs' coding capabilities across various input modalities. Furthermore, we propose three automatic evaluation metrics (code pass rate, text-match ratio, and GPT-4V overall rating) for a fine-grained assessment of the output code and rendered images. Rather than issuing a simple pass/fail judgment, we employ GPT-4V to make an overall judgment between the generated and reference images, which has been shown to be consistent with human evaluation. The evaluation results, covering 14 MLLMs including the proprietary GPT-4V and Gemini-Pro as well as the open-sourced Mini-Gemini, highlight the substantial challenges presented by Plot2Code. With Plot2Code, we reveal that most existing MLLMs struggle with visual coding for text-dense plots and rely heavily on the textual instruction. We hope that the evaluation results from Plot2Code on visual coding will guide the future development of MLLMs. All data involved with Plot2Code are available at https://huggingface.co/datasets/TencentARC/Plot2Code.
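To make the code pass rate metric concrete, a minimal harness could execute each model-generated matplotlib script headlessly and count the fraction that render an image without error. The sketch below is an assumption about how such a check might look (the names `code_pass` and `pass_rate` are hypothetical), not the benchmark's actual evaluation pipeline:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def code_pass(generated_code: str, timeout: int = 60) -> bool:
    """Return True if the generated matplotlib code runs and leaves a figure.

    Hypothetical harness: the paper's real pipeline may differ in details.
    """
    with tempfile.TemporaryDirectory() as tmp:
        out_png = Path(tmp) / "render.png"
        # Force a non-interactive backend so the script runs headlessly,
        # then save the current figure as an artifact we can check for.
        script = (
            "import matplotlib\n"
            "matplotlib.use('Agg')\n"
            + generated_code
            + "\nimport matplotlib.pyplot as plt\n"
            + f"plt.savefig(r'{out_png}')\n"
        )
        src = Path(tmp) / "candidate.py"
        src.write_text(script)
        try:
            proc = subprocess.run(
                [sys.executable, str(src)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0 and out_png.exists()

def pass_rate(samples: list[str]) -> float:
    """Fraction of generated code samples that execute and render."""
    return sum(code_pass(s) for s in samples) / max(len(samples), 1)
```

In practice a sandboxed executor would be preferable to a bare subprocess, and scripts that close their figures before the appended `savefig` would need extra handling; the point here is only the pass/fail execution check that underlies the metric.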