AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (2404.09624v3)
Abstract: The highly abstract nature of image aesthetics perception (IAP) poses a significant challenge for current multimodal large language models (MLLMs). The scarcity of human-annotated multi-modality aesthetic data further exacerbates this dilemma, leaving MLLMs short of aesthetics perception capabilities. To address this challenge, we first introduce a comprehensively annotated Aesthetic Multi-Modality Instruction Tuning (AesMMIT) dataset, which serves as the cornerstone for building multi-modality aesthetics foundation models. Specifically, to align MLLMs with human aesthetic perception, we construct a corpus-rich aesthetic critique database with 21,904 diversely sourced images and 88K natural-language human feedback comments, collected via progressive questions ranging from coarse-grained aesthetic grades to fine-grained aesthetic descriptions. To ensure that MLLMs can handle diverse queries, we further prompt GPT to refine the aesthetic critiques and assemble the large-scale aesthetic instruction-tuning dataset, i.e., AesMMIT, which consists of 409K multi-typed instructions that activate stronger aesthetic capabilities. Based on the AesMMIT dataset, we fine-tune open-source general foundation models to obtain multi-modality Aesthetic Expert models, dubbed AesExpert. Extensive experiments demonstrate that the proposed AesExpert models deliver significantly better aesthetic perception performance than state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision. Project homepage: https://yipoh.github.io/aes-expert/.
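The abstract describes turning raw human critiques (a coarse aesthetic grade plus a fine-grained description per image) into instruction–response pairs for tuning. The exact AesMMIT record format is not given in the abstract, so the sketch below is a minimal, hypothetical illustration of that assembly step; the field names (`image`, `grade`, `description`) and the prompt wording are assumptions, not the authors' actual schema.

```python
import json

# Hypothetical raw critique record: a coarse aesthetic grade plus a
# fine-grained free-form description, as collected via progressive questions.
raw_critiques = [
    {
        "image": "example_001.jpg",  # placeholder file name
        "grade": "good",             # coarse-grained aesthetic grade
        "description": "Balanced composition with pleasing warm tones.",
    },
]

def to_instruction_sample(critique):
    """Convert one critique into an (instruction, response) tuning sample."""
    return {
        "image": critique["image"],
        "instruction": "How would you rate the aesthetics of this image, and why?",
        "response": f"The image looks {critique['grade']}. {critique['description']}",
    }

samples = [to_instruction_sample(c) for c in raw_critiques]
print(json.dumps(samples[0], indent=2))
```

In the actual pipeline, the paper additionally prompts GPT to rephrase and diversify such critiques into 409K multi-typed instructions before fine-tuning.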