AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (2404.09624v3)
Abstract: The highly abstract nature of image aesthetics perception (IAP) poses a significant challenge for current multimodal large language models (MLLMs). The scarcity of human-annotated multi-modality aesthetic data further exacerbates this problem, leaving MLLMs short of aesthetics perception capabilities. To address this challenge, we first introduce a comprehensively annotated Aesthetic Multi-Modality Instruction Tuning (AesMMIT) dataset, which serves as the cornerstone for building multi-modality aesthetics foundation models. Specifically, to align MLLMs with human aesthetics perception, we construct a corpus-rich aesthetic critique database with 21,904 diverse-sourced images and 88K items of human natural-language feedback, collected via progressive questions ranging from coarse-grained aesthetic grades to fine-grained aesthetic descriptions. To ensure that MLLMs can handle diverse queries, we further prompt GPT to refine the aesthetic critiques and assemble the large-scale aesthetic instruction-tuning dataset, i.e., AesMMIT, which consists of 409K multi-typed instructions designed to activate stronger aesthetic capabilities. Based on the AesMMIT dataset, we fine-tune open-source general foundation models, obtaining multi-modality Aesthetic Expert models, dubbed AesExpert. Extensive experiments demonstrate that the proposed AesExpert models deliver significantly better aesthetic perception performance than state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision. Project homepage: https://yipoh.github.io/aes-expert/.
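The abstract describes converting human critiques (coarse grades plus fine-grained descriptions) into multi-typed instruction-tuning samples. A minimal sketch of what one such conversion step could look like is given below; the field names, conversation format, and helper function are illustrative assumptions, not the paper's actual AesMMIT schema.

```python
# Hypothetical sketch: turn one human aesthetic critique into a
# multi-turn instruction-tuning record. The schema (keys such as
# "image" and "conversations") is an assumption for illustration,
# not the format used by the AesMMIT dataset itself.

def build_instruction_sample(image_id: str, grade: str, description: str) -> dict:
    """Combine a coarse-grained grade and a fine-grained description
    into a two-turn conversation paired with an image reference."""
    return {
        "image": f"{image_id}.jpg",
        "conversations": [
            {"from": "human",
             "value": "How would you rate the aesthetics of this image?"},
            {"from": "assistant",
             "value": f"The aesthetic quality of this image is {grade}."},
            {"from": "human",
             "value": "Please describe its aesthetic qualities in detail."},
            {"from": "assistant",
             "value": description},
        ],
    }

sample = build_instruction_sample(
    "000123",
    "good",
    "Balanced composition with warm lighting and a clearly emphasized subject.",
)
```

In practice, such records would then be refined (e.g., rephrased by GPT into diverse question types) before being used to fine-tune an open-source MLLM.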