Emergent Mind

Abstract

We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. More specifically, we share an open-source blueprint for data collection at scale comprising standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open-source blueprint will serve as a comprehensive starter kit for data collection efforts in other multilingual regions of the world. Using INDICVOICES, we build IndicASR, the first ASR model to support all the 22 languages listed in the 8th schedule of the Constitution of India. All the data, tools, guidelines, models and other materials developed as a part of this work will be made publicly available.

Chart showing hours of audio data from various domains and topics, highlighting data diversity.

Overview

  • IndicVoices is a comprehensive dataset for Indian languages, featuring 7,348 hours of audio from 16,237 speakers across 22 languages and 145 districts.

  • It focuses on demographic, linguistic, and cultural diversity, with a significant portion of extempore and conversational speech, to improve Automatic Speech Recognition (ASR) technologies.

  • IndicVoices surpasses existing datasets in scale and diversity, aiding the development of inclusive ASR models such as IndicASR, the first ASR model to support all 22 languages listed in the 8th Schedule of the Constitution of India.

  • The dataset supports further speech and language processing research, with open access encouraging future advancements towards digital inclusivity for India's linguistic variety.

IndicVoices: Towards an Inclusive Multilingual Speech Dataset for Indian Languages

Introduction

The paper introduces IndicVoices, a comprehensive dataset encapsulating the linguistic, cultural, and demographic diversity of India, spanning 22 languages across 145 districts with contributions from 16,237 speakers. This initiative addresses the critical gap in labeled data for Indian languages, which has historically impeded the performance of Automatic Speech Recognition (ASR) technologies in non-English languages. The dataset, with a total of 7,348 hours of audio data, predominantly comprises extempore (74%) and conversational (17%) speech, with read speech making up the remaining 9%, offering a rich resource for developing inclusive language technologies.
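
As a rough worked example of these figures, the percentage shares can be converted into approximate hours per speech type. This is a minimal sketch: the function and variable names are illustrative and do not come from the paper, and because the published percentages are rounded, the resulting hour counts are approximate. The totals (7,348 hours overall, 1,639 hours transcribed) are taken from the abstract.

```python
# Totals reported in the abstract.
TOTAL_HOURS = 7348
TRANSCRIBED_HOURS = 1639

# Rounded shares of each speech type, as reported in the abstract.
SHARES = {"read": 0.09, "extempore": 0.74, "conversational": 0.17}


def hours_by_type(total_hours, shares):
    """Convert fractional shares into approximate hours per speech type."""
    return {name: round(total_hours * share, 1) for name, share in shares.items()}


# Approximate hours of audio for each speech type.
breakdown = hours_by_type(TOTAL_HOURS, SHARES)

# Fraction of the collected audio that has been transcribed so far.
transcribed_fraction = TRANSCRIBED_HOURS / TOTAL_HOURS
```

Under these rounded shares, the split works out to roughly 661 hours of read speech, 5,438 hours of extempore speech and 1,249 hours of conversational speech, with about 22% of the total audio transcribed so far.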

The Dataset's Composition and Collection Process

The paper delineates the meticulous process of dataset creation, emphasizing the commitment to capturing the multifaceted diversity of India. The authors crafted a dataset reflecting varied demographics (age, gender, educational background), types of speech (read, extempore, conversational), and recording conditions (diverse environments, wide/narrow-band recordings). A pivotal component of their methodology was the development of a centralized, open-source blueprint for scalable data collection. This framework facilitated the structured collection of spontaneous speech data reflecting real-world usage scenarios, thereby enhancing the dataset's applicability for practical ASR applications.
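
To make the diversity dimensions above concrete, here is a hypothetical sketch of how a single recording's metadata might be represented and aggregated. The class name, field names, and allowed values are assumptions made for illustration only; they are not the schema used by the IndicVoices tools, which this summary does not describe.

```python
from dataclasses import dataclass

# Speech types and bandwidths named in the summary; the string labels are assumed.
SPEECH_TYPES = {"read", "extempore", "conversational"}
BANDWIDTHS = {"wide", "narrow"}


@dataclass(frozen=True)
class RecordingMetadata:
    """Hypothetical per-recording record; all field names are illustrative."""
    speaker_id: str
    language: str
    district: str
    age_group: str
    gender: str
    education: str
    speech_type: str
    bandwidth: str
    duration_seconds: float

    def __post_init__(self):
        # Reject labels outside the categories listed in the summary.
        if self.speech_type not in SPEECH_TYPES:
            raise ValueError(f"unknown speech type: {self.speech_type!r}")
        if self.bandwidth not in BANDWIDTHS:
            raise ValueError(f"unknown bandwidth: {self.bandwidth!r}")
        if self.duration_seconds < 0:
            raise ValueError("duration must be non-negative")


def hours_per_speech_type(records):
    """Sum recording durations, in hours, for each speech type."""
    totals = {speech_type: 0.0 for speech_type in sorted(SPEECH_TYPES)}
    for record in records:
        totals[record.speech_type] += record.duration_seconds / 3600
    return totals
```

A structure along these lines would let a collection team track how much read, extempore and conversational audio has been gathered per language or district, although the actual tooling may organise this information differently.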

Comparison with Existing Datasets

IndicVoices distinguishes itself by its scale and scope: it covers 22 languages and already provides 1,639 hours of transcribed speech, and the summary describes it as far surpassing existing datasets in linguistic and demographic diversity. This breadth gives a more complete representation of India's linguistic landscape, making it a valuable resource for training robust, inclusive ASR models.

ASR Model Development and Benchmarking

Using IndicVoices, the authors developed IndicASR, the first ASR model to support all 22 languages listed in the 8th Schedule of the Constitution of India. Initial benchmarking reportedly shows IndicASR outperforming existing models, which points to the dataset's effectiveness in improving ASR performance for Indian languages. The result illustrates how well-curated, diverse datasets can advance both the accuracy and the inclusivity of language technologies.

Practical and Theoretical Implications

Beyond ASR, the dataset's structure and comprehensiveness offer vast potential for exploring several other speech and language processing tasks such as speaker diarization, language identification, and query by example. The open availability of IndicVoices and the accompanying tools and guidelines are poised to catalyze further research, making significant strides towards digital inclusivity and the development of speech technologies that cater to India's linguistic diversity.

Future Directions

The authors acknowledge certain limitations, such as the coverage of districts and the representation of conversational speech. Addressing these aspects in future iterations could further enhance the dataset's utility. Moreover, the ongoing collection and transcription efforts aim to expand the dataset, and subsequent work could focus on a more detailed evaluation across varied demographics and use cases. The development of IndicVoices is a stepping stone towards realizing the vision of truly inclusive speech technologies, opening avenues for multilingual research and applications.

Concluding Remarks

IndicVoices represents a significant contribution to the field of speech technology, particularly for the underrepresented languages of India. By facilitating the development of more accurate and inclusive ASR models, this work paves the way for greater digital accessibility and equity. Future research and innovations leveraging this dataset have the potential to transform the landscape of speech technology, making digital services more accessible to the linguistically diverse population of India.
