AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception (2401.08276v1)

Published 16 Jan 2024 in cs.CV and cs.CL

Abstract: With collective endeavors, multimodal LLMs (MLLMs) are undergoing a flourishing development. However, their performances on image aesthetics perception remain indeterminate, which is highly desired in real-world applications. An obvious obstacle lies in the absence of a specific benchmark to evaluate the effectiveness of MLLMs on aesthetic perception. This blind groping may impede the further development of more advanced MLLMs with aesthetic perception capacity. To address this dilemma, we propose AesBench, an expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs through elaborate design across dual facets. (1) We construct an Expert-labeled Aesthetics Perception Database (EAPD), which features diversified image contents and high-quality annotations provided by professional aesthetic experts. (2) We propose a set of integrative criteria to measure the aesthetic perception abilities of MLLMs from four perspectives, including Perception (AesP), Empathy (AesE), Assessment (AesA) and Interpretation (AesI). Extensive experimental results underscore that the current MLLMs only possess rudimentary aesthetic perception ability, and there is still a significant gap between MLLMs and humans. We hope this work can inspire the community to engage in deeper explorations on the aesthetic potentials of MLLMs. Source data will be available at https://github.com/yipoh/AesBench.

Summary

The paper introduces AesBench, a benchmark that uses expert-labeled images to evaluate MLLMs’ aesthetic perception across four dimensions.
The framework assesses models on aesthetic perception, empathy, assessment, and interpretation using a diverse set of natural, artistic, and AI-generated images.
Experimental results highlight significant performance gaps between MLLMs and human-level aesthetic analysis, underscoring challenges in nuanced language generation.

Evaluating Multimodal LLMs on Image Aesthetics Perception: The AesBench Framework

The paper "AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception" introduces a structured approach to evaluating how well multimodal LLMs (MLLMs) perceive image aesthetics. Because the performance of MLLMs such as ChatGPT and LLaVA on aesthetic perception tasks remains unclear, the authors propose AesBench to fill this gap.

Expert-Labeled Aesthetics Perception Database (EAPD)

At the core of AesBench is the Expert-labeled Aesthetics Perception Database (EAPD), a collection of 2,800 images divided into natural images (NIs), artistic images (AIs), and AI-generated images (AGIs). The images are annotated by professionals, including computational aesthetics researchers and art students, which provides high-quality reference data for evaluating MLLMs.

Evaluation Framework

AesBench assesses MLLMs through a four-dimensional framework:

Aesthetic Perception (AesP): This dimension evaluates the ability of models to recognize and comprehend various aesthetic attributes across images. The paper introduces an AesPQA subset with a focus on technical quality, color and light, composition, and content.
Aesthetic Empathy (AesE): The AesE dimension measures an MLLM's ability to resonate with the emotional essence conveyed through images, focusing on emotion, interest, uniqueness, and vibe.
Aesthetic Assessment (AesA): This dimension asks models to rate the aesthetics of each image by classifying it as having high, medium, or low visual appeal.
Aesthetic Interpretation (AesI): This dimension evaluates how effectively MLLMs can articulate reasons for the aesthetic quality of images, requiring nuanced language generation capabilities.
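
As a rough illustration of how a three-way task like AesA could be scored, the sketch below computes plain accuracy over high/medium/low labels. It is not the authors' evaluation code: the function names `normalize_answer` and `aesa_accuracy`, the keyword-matching rule, and the choice to count ambiguous replies as wrong are all assumptions made for this example.

```python
from typing import Optional, Sequence

# The three rating levels named in the AesA description.
LABELS = ("high", "medium", "low")


def normalize_answer(raw: str) -> Optional[str]:
    """Map a free-form model reply to one label (hypothetical helper).

    Returns None when the reply mentions zero labels or more than one,
    so ambiguous replies are never credited as correct.
    """
    text = raw.strip().lower()
    hits = [label for label in LABELS if label in text]
    return hits[0] if len(hits) == 1 else None


def aesa_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of replies whose normalized label equals the expert label."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    correct = sum(
        normalize_answer(pred) == ref for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Toy replies, invented for illustration only.
replies = ["High", "I think medium.", "low or high?", "low"]
expert = ["high", "medium", "low", "low"]
print(aesa_accuracy(replies, expert))
```

Substring matching is deliberately simple; a real harness would more likely constrain the prompt to a fixed answer format so that no parsing heuristics are needed.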

Experimental Insights and Implications

The researchers evaluated 15 MLLMs, including well-known models such as GPT-4V and Gemini Pro Vision. Performance varies considerably: Q-Instruct and GPT-4V are the top performers on several tasks, yet their accuracy still falls short of human-level performance. This underscores the substantial gap between current MLLMs and human aesthetic perception.

Particularly notable is the varied performance of MLLMs on different image sources, with models generally performing better on natural images than on artistic or AI-generated images. Additionally, the precision in aesthetic interpretation emerged as a critical challenge, suggesting prevalent hallucination issues within MLLMs when generating language-based aesthetic analyses.
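
A per-source breakdown of this kind can be sketched as a simple grouped accuracy. The snippet below is an assumption-laden illustration rather than the paper's procedure: `accuracy_by_source` is a hypothetical helper, the source tags reuse the NI/AI/AGI abbreviations from the database description, and the records are toy values, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_source(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group (source, is_correct) pairs by source and return accuracy per group."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for source, is_correct in records:
        totals[source] += 1
        hits[source] += int(bool(is_correct))
    return {source: hits[source] / totals[source] for source in totals}


# Toy records, invented for illustration only.
toy_records = [
    ("NI", True),
    ("NI", True),
    ("AI", True),
    ("AI", False),
    ("AGI", False),
]
print(accuracy_by_source(toy_records))
```

Reporting accuracy separately for each source keeps a strong result on one image type from masking weaker results on the others.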

Future Directions

The insights from AesBench suggest that future MLLMs could greatly benefit from developing more robust mechanisms to understand and assess aesthetic attributes. The implications of such advancements are significant for real-world applications like smart photography, image enhancement, and personalized content curation, all of which require sophisticated aesthetic perception abilities.

Overall, AesBench is a step toward systematically evaluating and improving the aesthetic perception capabilities of MLLMs. As these models evolve, benchmarks like AesBench will need further refinement to guide research and application development in image aesthetics perception.

GitHub

GitHub - yipoh/AesBench: An expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs. (219 stars)