
ORBIT: A Real-World Few-Shot Dataset for Teachable Object Recognition (2104.03841v5)

Published 8 Apr 2021 in cs.CV

Abstract: Object recognition has made great advances in the last decade, but predominately still relies on many high-quality training examples per object category. In contrast, learning new objects from only a few examples could enable many impactful applications from robotics to user personalization. Most few-shot learning research, however, has been driven by benchmark datasets that lack the high variation that these applications will face when deployed in the real-world. To close this gap, we present the ORBIT dataset and benchmark, grounded in the real-world application of teachable object recognizers for people who are blind/low-vision. The dataset contains 3,822 videos of 486 objects recorded by people who are blind/low-vision on their mobile phones. The benchmark reflects a realistic, highly challenging recognition problem, providing a rich playground to drive research in robustness to few-shot, high-variation conditions. We set the benchmark's first state-of-the-art and show there is massive scope for further innovation, holding the potential to impact a broad range of real-world vision applications including tools for the blind/low-vision community. We release the dataset at https://doi.org/10.25383/city.14294597 and benchmark code at https://github.com/microsoft/ORBIT-Dataset.

Citations (42)

Summary

  • The paper presents the ORBIT dataset as a real-world benchmark that enables few-shot learning for teachable object recognition.
  • It details a user-centric methodology that evaluates system performance using per-user tasks and metrics such as frame and video accuracy.
  • The findings underscore the need for adaptable, low-computation models to advance assistive technologies and real-time object recognition.

An Insightful Exploration of the ORBIT Dataset for Teachable Object Recognition

The paper "ORBIT: A Real-World Few-Shot Dataset for Teachable Object Recognition" offers an in-depth exploration into the complexities of developing robust object recognition systems, particularly for blind or low-vision individuals. The work presented introduces the ORBIT dataset, which plays a pivotal role in advancing few-shot learning by providing a benchmark dataset that captures the inherent variability in real-world applications.

Motivation and Construction of the ORBIT Dataset

Significant advancements have been made in object recognition through datasets that provide numerous high-quality examples per category. However, these systems struggle to adapt to new objects using only a few examples—a capability crucial for assistive technologies, among other applications. This paper addresses that gap by presenting a dataset that reflects the realistic conditions of object recognition for people with visual impairments. The ORBIT dataset was meticulously constructed from 3,822 videos of 486 objects, recorded directly by individuals who are blind or low-vision using their mobile phones. It is an invaluable resource for developing teachable object recognizers (TORs), providing the real-world variation and complexity necessary for building robust and adaptable recognition systems.

Methodology: User-Centric Benchmarking

The authors establish a user-centric benchmark designed to evaluate the performance of teachable object recognizers under few-shot conditions. Tasks are sampled per user rather than per object class, offering a nuanced view of system performance when adapted to an individual user's objects. The benchmark also incorporates metrics that address computational constraints relevant to real-world deployment on mobile devices. The recognition metrics—frame accuracy, frames-to-recognition, and video accuracy—capture the system's efficacy and usability, while computational cost is measured by the number of multiply-accumulate operations (MACs) required to personalize the model and by its parameter count.
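The three recognition metrics can be sketched as simple functions over a video's per-frame predictions. This is a minimal illustration, not the benchmark's reference implementation: the exact definitions (e.g., whether video accuracy uses a majority vote, or whether frames-to-recognition is normalized by video length) are assumptions here.

```python
import numpy as np

def frame_accuracy(preds, labels):
    """Fraction of frames whose predicted label matches the ground truth."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    return float((preds == labels).mean())

def frames_to_recognition(preds, label):
    """0-based index of the first frame predicted as the target object;
    returns len(preds) if the object is never recognized in the video."""
    for i, p in enumerate(preds):
        if p == label:
            return i
    return len(preds)

def video_accuracy(preds, label):
    """1 if the majority vote over the video's frame predictions
    equals the target label, else 0 (assumed aggregation rule)."""
    values, counts = np.unique(np.asarray(preds), return_counts=True)
    return int(values[counts.argmax()] == label)
```

Per-user evaluation would then average these metrics over the tasks sampled for each user before averaging across users, so no single user with many videos dominates the score.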

Baseline Evaluations and Analysis

The paper provides detailed analyses of baseline methodologies across three few-shot learning approaches: metric-based, optimization-based, and amortization-based. Notably, Prototypical Networks and CNAPs show competent performance at relatively low computational cost, indicating potential for mobile deployment. In contrast, the fine-tuning approach, despite achieving competitive accuracy, incurs higher computational costs, suggesting more extensive resources would be needed for real-time applications. The findings reveal a substantial gap between existing few-shot learning techniques and the demands of real-world operation, highlighting the need for innovative solutions that build on current capabilities.
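As a concrete illustration of the metric-based approach, a Prototypical Network classifies each query frame by its distance to per-class mean embeddings computed from the user's few labelled examples. The sketch below assumes frame embeddings have already been extracted by some backbone network; it shows the idea only, not the paper's baseline implementation.

```python
import numpy as np

def prototypes(support_embeddings, support_labels):
    """Compute one prototype per class: the mean embedding of that
    class's support (few-shot training) examples."""
    classes = np.unique(support_labels)
    protos = np.stack([
        support_embeddings[support_labels == c].mean(axis=0)
        for c in classes
    ])
    return classes, protos

def classify(query_embeddings, classes, protos):
    """Assign each query embedding to the nearest prototype
    (squared Euclidean distance)."""
    # dists[i, k] = ||query_i - proto_k||^2
    diffs = query_embeddings[:, None, :] - protos[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]
```

Because personalization reduces to a handful of mean and distance computations over precomputed embeddings, the MACs needed to adapt to a new user's objects stay small—one reason metric-based methods look attractive for on-device deployment.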

Implications and Future Directions

The implications of this research are profound, with potential impacts extending beyond the immediate field of accessibility tools. The ORBIT dataset's introduction not only helps refine TORs but also forces a reconsideration of existing dataset paradigms that primarily focus on many-shot learning. This work suggests promising avenues in enhancing model robustness under variable conditions, quantifying uncertainties, and real-time contextual adaptability.

Furthermore, by involving the blind and low-vision community in the dataset creation, the authors emphasize the broader ethical and practical considerations in AI development. This highlights a critical shift toward user-centered design principles in machine learning, as systems increasingly embody collaborative human-AI partnerships.

In conclusion, this paper presents a compelling case for the ORBIT dataset as a catalyst in the continuing evolution of adaptable and resilient object recognition systems. Such systems are not only foundational for assistive technologies but also enable a spectrum of applications demanding similar resilience and adaptability in complex, unstructured environments.