FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture

Published 16 Jun 2024 in cs.CL | (2406.11030v2)

Abstract: Food is a rich and varied dimension of cultural heritage, crucial to both individuals and social groups. To bridge the gap in the literature on the often-overlooked regional diversity in this domain, we introduce FoodieQA, a manually curated, fine-grained image-text dataset capturing the intricate features of food cultures across various regions in China. We evaluate vision-language models (VLMs) and large language models (LLMs) on newly collected, unseen food images and corresponding questions. FoodieQA comprises three multiple-choice question-answering tasks where models need to answer questions based on multiple images, a single image, and text-only descriptions, respectively. While LLMs excel at text-based question answering, surpassing human accuracy, the open-sourced VLMs still fall short by 41% on multi-image and 21% on single-image VQA tasks, although closed-weights models perform closer to human levels (within 10%). Our findings highlight that understanding food and its cultural implications remains a challenging and under-explored direction.

Summary

The paper introduces FoodieQA, a manually curated dataset covering 14 Chinese cuisine types to capture detailed regional food culture.
It evaluates vision-language and large language models, revealing significant performance gaps, especially in visual question-answering tasks.
The study highlights LLMs' strong text-only performance while exposing challenges in integrating cultural visual cues, calling for enhanced multimodal models.

An Analysis of "FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture"

The paper "FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture" by Wenyan Li et al. introduces FoodieQA, a novel dataset aimed at advancing the understanding of Chinese food culture through multimodal question-answering tasks. This dataset fills a crucial gap in current literature, as it emphasizes the intricacies of regional food culture in China, often overlooked in generalized studies. Specifically, the dataset focuses on multiple-choice question-answering tasks across multi-image, single-image, and text-only formats, addressing a breadth of attributes including visual presentation, ingredients, culinary techniques, and regional associations.
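
To make the three task formats concrete, the following is a minimal sketch of how one multiple-choice item could be represented. The class name, field names, task labels, and example values are all hypothetical and chosen for illustration; they are not the dataset's actual schema. The only constraint taken from the description above is the number of images per task: several for multi-image, exactly one for single-image, and none for text-only.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical task labels for illustration; not the dataset's real identifiers.
TASKS = ("multi_image", "single_image", "text")


@dataclass
class MultipleChoiceItem:
    """One multiple-choice question (assumed layout, not the official schema)."""
    task: str
    question: str
    options: List[str]
    answer_index: int
    image_paths: List[str] = field(default_factory=list)

    def validate(self) -> None:
        """Check that the item is consistent with its task format."""
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task!r}")
        if not 0 <= self.answer_index < len(self.options):
            raise ValueError("answer_index does not point at an option")
        n_images = len(self.image_paths)
        if self.task == "text" and n_images != 0:
            raise ValueError("text-only items carry no images")
        if self.task == "single_image" and n_images != 1:
            raise ValueError("single-image items carry exactly one image")
        if self.task == "multi_image" and n_images < 2:
            raise ValueError("multi-image items carry at least two images")


# Made-up example item; the question, options, and file name are placeholders.
item = MultipleChoiceItem(
    task="single_image",
    question="Which region is this dish associated with?",
    options=["option A", "option B", "option C", "option D"],
    answer_index=2,
    image_paths=["dish.jpg"],
)
item.validate()  # raises ValueError if the item is malformed
```

Keeping all three tasks in one record type, distinguished only by the `task` label and the image count, is a design choice of this sketch that lets the same scoring code run over every task.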

Key Contributions and Findings

Dataset Structure and Diversity: FoodieQA is composed of manually curated data sourced from native Chinese speakers, ensuring authenticity and regional relevance. The dataset encompasses 14 distinct Chinese cuisine types, each rich in regional differences, reflecting the nuanced diversity within Chinese culinary traditions.
Evaluation of Vision-Language Models (VLMs) and LLMs: A selection of state-of-the-art VLMs and LLMs was evaluated on the dataset. A notable finding is the substantial gap between model performance and human-level accuracy, particularly in tasks requiring visual input. For instance, open-weights VLMs fell short of human accuracy by 41% on multi-image and 21% on single-image VQA tasks. This highlights the limitations of current models in integrating visual and cultural context and in fine-grained reasoning.
Text-Based Question Answering: Interestingly, LLMs demonstrated superior abilities in text-only tasks, even surpassing human performance by leveraging extensive text-based knowledge. This suggests that while models can encapsulate and process vast text data efficiently, the integration of visual cultural cues remains a significant hurdle.
Analysis by Question Type: Performance analyses reveal that models handle questions about cooking techniques and ingredient identification better than other question types. They struggle most with regional and taste-related questions, which points to limited cultural adaptability in these areas.
Challenges in Visual Understanding and Cultural Context: The multi-image VQA posed the greatest challenge to models, particularly in scenarios that resemble real-world complexities such as browsing menus. This underscores the need to enhance current models' capacities in discerning and utilizing visual contexts in culturally nuanced settings.
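
The comparisons above (accuracy per task, accuracy per question type, and the gap to human accuracy) can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the record fields, the function names, the toy predictions, and the human accuracy values are all made up, and the gap is assumed here to be a simple difference between two accuracies.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} where a record counts as correct when
    its predicted option index equals its gold option index."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        correct[group] += int(record["predicted"] == record["gold"])
    return {group: correct[group] / total[group] for group in total}


def gap_to_human(model_acc, human_acc):
    """Difference between human and model accuracy for shared groups
    (assumed definition of the gap; positive means the model is behind)."""
    return {g: human_acc[g] - model_acc[g] for g in model_acc if g in human_acc}


# Toy records with made-up predictions; these are NOT results from the paper.
records = [
    {"task": "single_image", "question_type": "ingredients", "predicted": 0, "gold": 0},
    {"task": "single_image", "question_type": "region", "predicted": 1, "gold": 3},
    {"task": "multi_image", "question_type": "region", "predicted": 2, "gold": 2},
    {"task": "multi_image", "question_type": "taste", "predicted": 1, "gold": 0},
]

by_task = accuracy_by_group(records, "task")
by_type = accuracy_by_group(records, "question_type")

# Placeholder human accuracies, used only to exercise gap_to_human.
human = {"single_image": 1.0, "multi_image": 1.0}
gaps = gap_to_human(by_task, human)
```

Grouping by an arbitrary key keeps the per-task and per-question-type breakdowns on the same code path, so the same function serves both analyses.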

Implications and Future Directions

The introduction of FoodieQA underscores the need for datasets that capture cultural specificity, beyond the monolithic representations often seen in general datasets. The significant disparity between model performance and human-level understanding, especially in visual tasks, indicates that models' multimodal comprehension needs to improve. Model architectures that better integrate visual inputs with culturally grounded context could help close this gap.

Moreover, the paper suggests potential expansions of the dataset to include dishes from other countries or regions, broadening the study of cultural food understanding across global contexts. Such expansions could not only enhance model robustness but also contribute to a richer understanding of cultural dynamics in AI interpretations.

In conclusion, "FoodieQA" is a step toward addressing the complex challenge of integrating cultural nuances into AI systems. Research building on this work may motivate more culture-specific datasets, improving models' applicability across diverse cultural settings and advancing multimodal cultural understanding.
