• Author(s): Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo

The advancement of Multi-modal Large Language Models (MLLMs) has garnered significant attention owing to their strong performance in visual contexts. However, their ability to convert visual figures into executable code has not been thoroughly evaluated. To address this gap, the study introduces Plot2Code, a comprehensive visual coding benchmark designed for an in-depth assessment of MLLMs. Plot2Code consists of 132 manually selected, high-quality matplotlib plots spanning six plot types, sourced from publicly available matplotlib galleries. Each plot is accompanied by its source code and a descriptive instruction summarized by GPT-4, enabling extensive evaluation of MLLMs' coding capabilities across various input modalities.
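To make the benchmark setup concrete, the sketch below shows what a single Plot2Code sample and its prompt construction might look like. The field names (`image_path`, `code`, `instruction`, `plot_type`) and the two prompting modes are illustrative assumptions, not the official release format.

```python
# Illustrative sketch only: field names and prompt wording are hypothetical,
# not taken from the official Plot2Code release.
from dataclasses import dataclass

@dataclass
class Plot2CodeSample:
    image_path: str   # reference plot rendered from the ground-truth code
    code: str         # ground-truth matplotlib source code
    instruction: str  # descriptive instruction summarized by GPT-4
    plot_type: str    # one of the six plot categories

def build_prompt(sample: Plot2CodeSample, mode: str = "image_only") -> str:
    """Compose the textual part of the model input for two input modalities."""
    base = "Reproduce the attached plot as runnable matplotlib code."
    if mode == "image_only":
        return base
    # "image + instruction" setting: the GPT-4 summarized description is appended
    return f"{base}\nPlot description: {sample.instruction}"
```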

The study proposes three automatic evaluation metrics: code pass rate, text-match ratio, and GPT-4V overall rating. Together, these metrics assess both the generated code and the images it renders. Rather than issuing a simple pass-or-fail judgment, GPT-4V is employed to make an overall comparison between the generated and reference images, and this rating is shown to be consistent with human evaluation.
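A minimal sketch of how the first two metrics could be computed is given below. It assumes generated outputs are plain matplotlib scripts and that text elements (titles, labels, ticks) have been extracted from both plots; the exact execution harness and matching rules used in the paper may differ, and the GPT-4V overall rating additionally requires querying GPT-4V with both images, which is not shown here.

```python
# Hedged sketch of the code pass and text-match metrics; not the paper's exact harness.
import os
import subprocess
import tempfile

def code_pass(generated_code: str, timeout: int = 30) -> bool:
    """Pass if the script executes without error, i.e., an image could be rendered.
    The benchmark-level pass rate is the mean of this flag over all samples."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        # Force a non-interactive backend so the script can run headless.
        f.write("import matplotlib\nmatplotlib.use('Agg')\n" + generated_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], timeout=timeout, capture_output=True)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(path)

def text_match_ratio(generated_texts: list[str], reference_texts: list[str]) -> float:
    """Fraction of reference text elements that also appear in the generated plot."""
    if not reference_texts:
        return 1.0
    generated = set(generated_texts)
    matched = sum(1 for t in reference_texts if t in generated)
    return matched / len(reference_texts)
```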

Evaluation results from 14 MLLMs, including proprietary models such as GPT-4V and Gemini-Pro as well as the open-source Mini-Gemini, highlight the substantial challenges posed by Plot2Code. The findings reveal that most existing MLLMs struggle with visual coding for text-dense plots and rely heavily on textual instructions. The results from Plot2Code are expected to guide the future development of MLLMs in visual coding.