Emergent Mind

Abstract

Traditional machine learning models often require powerful hardware, making them unsuitable for deployment on resource-limited devices. Tiny Machine Learning (tinyML) has emerged as a promising approach for running machine learning models on these devices, but integrating multiple data modalities into tinyML models remains a challenge due to increased complexity, latency, and power consumption. This paper proposes TinyVQA, a novel multimodal deep neural network for visual question answering tasks that can be deployed on resource-constrained tinyML hardware. TinyVQA leverages a supervised attention-based model to learn how to answer questions about images using both vision and language modalities. Knowledge distilled from the supervised attention-based VQA model is used to train the memory-aware compact TinyVQA model, and low bit-width quantization is employed to further compress the model for deployment on tinyML devices. The TinyVQA model was evaluated on the FloodNet dataset, which is used for post-disaster damage assessment. The compact model achieved an accuracy of 79.5%, demonstrating the effectiveness of TinyVQA for real-world applications. Additionally, the model was deployed on a Crazyflie 2.0 drone equipped with an AI deck and a GAP8 microprocessor. On the tiny drone, the TinyVQA model achieved a low latency of 56 ms and consumed 693 mW of power, showcasing its suitability for resource-constrained embedded systems.

Figure: Overview of the TinyVQA model and the MFB fusion block for visual question answering.

Overview

  • TinyVQA introduces a compact, efficient framework for visual question answering on devices with limited resources, overcoming traditional deployment challenges.

  • The architecture features a Baseline VQA Model for training and a Memory-Aware Compact VQA Model optimized for tinyML hardware, employing knowledge distillation and low bit-width quantization.

  • On the FloodNet dataset, TinyVQA achieved 79.5% accuracy with significant reductions in memory usage, demonstrating its effectiveness in practical applications.

  • The deployment on the Crazyflie 2.0 drone illustrates TinyVQA's real-world applicability, showcasing its operational efficiency and low power consumption.

TinyVQA: A Novel Approach for Visual Question Answering on Resource-Limited Devices

Introduction to TinyVQA

In the field of tiny Machine Learning (tinyML), TinyVQA marks a significant stride toward deploying multimodal deep neural networks on devices with limited resources. It introduces a compact, efficient framework tailored for visual question answering (VQA), a domain that necessitates the integration of visual and textual data to generate insights. Historically, the deployment of such sophisticated models on constrained hardware has been fraught with challenges, chiefly due to their complexity and the substantial computational resources they demand. TinyVQA, through its design and strategic optimizations, overcomes these barriers, facilitating the deployment of VQA tasks on tinyML hardware with minimal compromise on performance.

TinyVQA Model Architecture

The architecture of TinyVQA is divided into two primary components, echoing the model's objective to balance performance with efficiency:

  • The Baseline VQA Model leverages an attention-based mechanism, integrating visual and textual cues to answer queries about images. This sophisticated model, while highly accurate, is not inherently optimized for deployment on resource-constrained devices. Its role is pivotal in training, providing a high-quality knowledge base for distilling into the more compact TinyVQA model.
  • The Memory-Aware Compact VQA Model signifies the core of TinyVQA's innovation. This model distills the knowledge from the baseline model, employing techniques such as knowledge distillation and low bit-width quantization to drastically reduce its size without significantly compromising its accuracy. Designed with the limitations of tinyML hardware in mind, it exemplifies a significant reduction in model size while maintaining functional integrity.
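The knowledge-distillation step above can be sketched in a few lines. The following is a minimal, framework-agnostic NumPy illustration of temperature-scaled distillation; the temperature, loss weight `alpha`, and toy logits are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Weighted sum of a soft (teacher-matching) loss and a hard (label) loss.

    alpha balances the soft term against ordinary cross-entropy;
    both hyperparameters here are illustrative, not from the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft loss: cross-entropy against the teacher's softened distribution.
    soft = -np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1)
    # Hard loss: cross-entropy against the ground-truth labels.
    p_hard = softmax(student_logits)
    hard = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return np.mean(alpha * (temperature ** 2) * soft + (1 - alpha) * hard)

# Toy batch: 2 samples, 3 answer classes.
teacher = np.array([[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]])
student = np.array([[1.5, 0.3, -0.5], [0.0, 2.5, 0.1]])
labels = np.array([0, 1])
loss = distillation_loss(student, teacher, labels)
```

In training, the compact student would minimize this combined loss so that it mimics the attention-based teacher's output distribution while still fitting the ground-truth answers.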

Evaluation of TinyVQA

The effectiveness of TinyVQA was measured using the FloodNet dataset, chosen for its relevance to real-world application in post-disaster scenarios. The dataset, derived from imagery collected post-Hurricane Harvey, provides a diverse set of visual and textual queries, including damage assessment and environmental condition questions. The results are commendable:

  • The TinyVQA model achieved an accuracy of 79.5%, a mere 1.5% drop in performance compared to the baseline model, while requiring only a small fraction of the baseline's memory.
  • These outcomes underscore the model's potential in executing complex VQA tasks within the stringent limitations of tinyML devices, heralding a new era of efficiency and applicability in edge computing.
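As a back-of-the-envelope check on where such memory savings come from, quantizing 32-bit floating-point weights to 8-bit integers alone shrinks weight storage 4x before any architectural slimming; the parameter count below is a made-up figure purely for illustration, not the actual size of TinyVQA.

```python
# Illustrative only: the parameter count is hypothetical, not TinyVQA's.
params = 1_000_000          # hypothetical number of weights
fp32_bytes = params * 4     # 32-bit floats: 4 bytes per weight
int8_bytes = params * 1     # 8-bit integers: 1 byte per weight

print(fp32_bytes // 1024, "KiB ->", int8_bytes // 1024, "KiB")
# 4x reduction from quantization alone; distillation shrinks the
# architecture itself, compounding the savings.
```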

Deployment on Resource-Constrained Hardware

The deployment of TinyVQA on the Crazyflie 2.0 drone equipped with an AI deck and powered by the GAP8 microprocessor is a testament to its real-world viability. The deployment highlights include:

  • Implementation within the tight memory constraints of the GAP8 architecture, utilizing a mix of model compression techniques to fit within the available resources.
  • The model's operational efficiency, with a low latency of 56 ms and power consumption of just 693 mW (≈0.7 W), paves the way for real-time, autonomous VQA applications in scenarios where rapid, informed decision-making is crucial.
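The low bit-width quantization used to fit the model on the GAP8 can be sketched as simple symmetric per-tensor quantization. The paper's exact scheme and bit-width may differ, so treat this NumPy version as an assumption-laden illustration rather than the deployed implementation.

```python
import numpy as np

def quantize_symmetric(w, n_bits=8):
    """Symmetric per-tensor quantization of a float weight tensor.

    Maps floats in [-max|w|, +max|w|] onto signed n_bits integers.
    Returns the integer tensor and the scale needed to dequantize.
    """
    qmax = 2 ** (n_bits - 1) - 1              # e.g. 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer tensor."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(4, 4)).astype(np.float32)
q, scale = quantize_symmetric(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()  # rounding error is bounded by ~scale/2
```

On-device, only the int8 tensor and a single float scale per tensor are stored, which is what lets the model fit within the GAP8's tight on-chip memory.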

Conclusion and Future Perspectives

TinyVQA represents a significant leap forward in deploying multimodal deep learning models on resource-limited devices. By demonstrating high accuracy in visual question answering tasks with remarkably low resource consumption, TinyVQA paves the way for advanced, intelligent applications in areas previously constrained by hardware limitations. As tinyML continues to evolve, the principles and methodologies underpinning TinyVQA offer a blueprint for future research and development in the field, especially in scenarios demanding rapid, on-site intelligence, such as disaster response and remote sensing.

With a proven capability to operate on the cutting edge of efficiency and performance, the future of tinyML looks bright, promising unprecedented advancements in how computational intelligence is deployed in the real world.
