AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (2404.09624v3)
Abstract: The highly abstract nature of image aesthetics perception (IAP) poses a significant challenge for current multimodal large language models (MLLMs). The scarcity of human-annotated multi-modality aesthetic data further exacerbates this dilemma, leaving MLLMs short of aesthetics perception capabilities. To address this challenge, we first introduce a comprehensively annotated Aesthetic Multi-Modality Instruction Tuning (AesMMIT) dataset, which serves as the cornerstone for building multi-modality aesthetics foundation models. Specifically, to align MLLMs with human aesthetic perception, we construct a corpus-rich aesthetic critique database with 21,904 diversely sourced images and 88K natural-language human feedback comments, collected via progressive questions ranging from coarse-grained aesthetic grades to fine-grained aesthetic descriptions. To ensure that MLLMs can handle diverse queries, we further prompt GPT to refine the aesthetic critiques and assemble the large-scale aesthetic instruction-tuning dataset, i.e., AesMMIT, which consists of 409K multi-typed instructions that activate stronger aesthetic capabilities. Based on the AesMMIT dataset, we fine-tune open-source general foundation models to obtain multi-modality Aesthetic Expert models, dubbed AesExpert. Extensive experiments demonstrate that the proposed AesExpert models deliver significantly better aesthetic perception performance than state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision. Project homepage: https://yipoh.github.io/aes-expert/.
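The abstract describes turning raw human critiques (a coarse aesthetic grade plus a fine-grained description per image) into instruction–response pairs for tuning. The exact AesMMIT record format is not given in the abstract, so the sketch below is a minimal, hypothetical illustration of that assembly step; the field names (`image`, `grade`, `description`) and the prompt wording are assumptions, not the authors' actual schema.

```python
import json

# Hypothetical raw critique record: a coarse aesthetic grade plus a
# fine-grained free-form description, as collected via progressive questions.
raw_critiques = [
    {
        "image": "example_001.jpg",  # placeholder file name
        "grade": "good",             # coarse-grained aesthetic grade
        "description": "Balanced composition with pleasing warm tones.",
    },
]

def to_instruction_sample(critique):
    """Convert one critique into an (instruction, response) tuning sample."""
    return {
        "image": critique["image"],
        "instruction": "How would you rate the aesthetics of this image, and why?",
        "response": f"The image looks {critique['grade']}. {critique['description']}",
    }

samples = [to_instruction_sample(c) for c in raw_critiques]
print(json.dumps(samples[0], indent=2))
```

In the actual pipeline, the paper additionally prompts GPT to rephrase and diversify such critiques into 409K multi-typed instructions before fine-tuning.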