VizWiz Grand Challenge: Answering Visual Questions from Blind People

Published 22 Feb 2018 in cs.CV, cs.CL, and cs.HC | (1802.08218v4)

Abstract: The study of algorithms to automatically answer visual questions currently is motivated by visual question answering (VQA) datasets constructed in artificial VQA settings. We propose VizWiz, the first goal-oriented VQA dataset arising from a natural VQA setting. VizWiz consists of over 31,000 visual questions originating from blind people who each took a picture using a mobile phone and recorded a spoken question about it, together with 10 crowdsourced answers per visual question. VizWiz differs from the many existing VQA datasets because (1) images are captured by blind photographers and so are often poor quality, (2) questions are spoken and so are more conversational, and (3) often visual questions cannot be answered. Evaluation of modern algorithms for answering visual questions and deciding if a visual question is answerable reveals that VizWiz is a challenging dataset. We introduce this dataset to encourage a larger community to develop more generalized algorithms that can assist blind people.


Authors (8)

Citations (678)


Summary

The paper introduces a novel dataset with over 31,000 visual questions sourced directly from blind photographers.
It highlights distinct dataset features such as low-quality images, conversational queries, and unanswerable questions that challenge standard VQA models.
The study evaluates nine models, including attention-based methods, and pioneers answerability estimation, laying groundwork for more robust assistive technologies.

Insightful Overview of the VizWiz Grand Challenge Paper

The paper "VizWiz Grand Challenge: Answering Visual Questions from Blind People" addresses a gap in Visual Question Answering (VQA) by constructing the first goal-oriented dataset that arises from a natural setting: real requests made by blind users. The VizWiz dataset contains over 31,000 visual questions. For each one, a blind person took a picture with a mobile phone and recorded a spoken question about it, and ten crowdsourced answers were then collected for that visual question.

Key Dataset Characteristics

The VizWiz dataset distinguishes itself with several defining characteristics:

Blind Photographers: Images are captured by blind photographers and so are often of poor quality, with problems in lighting, focus, and framing. This contrasts with many VQA datasets, whose images were captured by sighted individuals or produced by simulations.
Spoken Questions: The questions are naturally conversational, displaying nuances and variabilities typical of spoken language, often including incomplete or clipped phrases.
Unanswerable Questions: A substantial portion of the visual questions cannot be answered, either because the image quality is too poor or because the image does not contain the content the question asks about. This departs from the usual assumption in VQA datasets that every question has an answer.

Algorithmic Evaluation

The paper benchmarks contemporary VQA algorithms on VizWiz and finds that the dataset is challenging for them. Nine models are evaluated, including state-of-the-art methods with attention mechanisms. Models trained only on standard datasets perform poorly when tested on VizWiz data. Fine-tuning on VizWiz or training from scratch moderately improves performance, yet a notable gap to human-level accuracy remains.
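This summary does not say how accuracy is scored. A common convention for VQA benchmarks with ten human answers per question is to credit a prediction in proportion to how many annotators gave the same answer, with full credit at three or more matches. The following is a minimal sketch assuming that convention and a simple lowercase/whitespace normalization; the function names are hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplified normalization)."""
    return " ".join(answer.lower().split())


def vqa_accuracy(prediction: str, human_answers: Sequence[str]) -> float:
    """Score one prediction against the crowdsourced answers.

    Assumed convention: min(1, matches / 3), where `matches` is the number
    of human answers identical to the prediction after normalization.
    """
    counts = Counter(normalize(a) for a in human_answers)
    matches = counts[normalize(prediction)]
    return min(1.0, matches / 3.0)


def mean_accuracy(predictions: Sequence[str],
                  answer_sets: Sequence[Sequence[str]]) -> float:
    """Average the per-question scores over a set of visual questions."""
    scores = [vqa_accuracy(p, a) for p, a in zip(predictions, answer_sets)]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this convention a prediction that matches only two of the ten answers earns partial credit (2/3), which tolerates some disagreement among annotators without rewarding answers that nobody gave.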

Answerability Challenge

The study also pioneers the task of estimating whether a visual question is answerable, using pre-trained models that gauge how relevant the question is to the image. The results show that models developed for cleaner datasets are inadequate for this task, which leaves substantial room for methodological advances in predicting answerability under real-world constraints.
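To make the answerability task concrete, the sketch below derives a binary label from the ten crowdsourced answers and scores a model's answerability predictions with average precision. Both choices are assumptions for illustration: the marker strings, the majority-vote rule, and the use of average precision are not specified in this summary, and all names are hypothetical.

```python
from typing import Sequence

# Hypothetical strings that annotators might use to flag a question
# as impossible to answer from the image.
UNANSWERABLE_MARKERS = frozenset({"unanswerable", "unsuitable", "unsuitable image"})


def is_answerable(human_answers: Sequence[str],
                  markers: frozenset = UNANSWERABLE_MARKERS) -> bool:
    """Label a visual question answerable if fewer than half of the
    annotators flagged it (assumed rule; ties count as unanswerable)."""
    flagged = sum(1 for a in human_answers if a.lower().strip() in markers)
    return flagged * 2 < len(human_answers)


def average_precision(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Average precision of predicted scores against binary labels.

    Items are ranked by descending score; precision is accumulated at the
    rank of each positive item and divided by the number of positives.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

A ranking metric such as average precision avoids committing to a single decision threshold, which is useful when an assistive application may want to trade off false alarms against missed unanswerable questions.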

Implications and Future Directions

From a practical standpoint, VizWiz underscores the necessity for more robust, generalized algorithms capable of adapting to varied image qualities and conversational question structures typical of interactions involving assistive technology.

On the theoretical side, the dataset motivates VQA models that are designed from the outset to cope with unpredictable, highly variable real-world visual data, which pushes beyond the conditions current models are built for.

Looking forward, the research suggests several potential directions:

Development of novel attention mechanisms better suited to degraded images.
Advanced models that handle conversational nuances in spoken queries.
Enhanced algorithms for determining question answerability that integrate seamlessly with assistive technologies.

Conclusion

In summary, VizWiz enhances the understanding of real-world VQA applications, presenting a benchmark that is not only challenging but also critical in the deployment of technology designed to assist visually impaired individuals. The dataset fuels the broader AI community's agenda towards creating more inclusive and effective automated systems, bridging the gap between theoretical advancements and societal applications.
