Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research

Published 3 Mar 2015 in cs.CV and cs.AI | (1503.01070v1)

Abstract: In this work, we introduce a dataset of video annotated with high quality natural language phrases describing the visual content in a given segment of time. Our dataset is based on the Descriptive Video Service (DVS) that is now encoded on many digital media products such as DVDs. DVS is an audio narration describing the visual elements and actions in a movie for the visually impaired. It is temporally aligned with the movie and mixed with the original movie soundtrack. We describe an automatic DVS segmentation and alignment method for movies, that enables us to scale up the collection of a DVS-derived dataset with minimal human intervention. Using this method, we have collected the largest DVS-derived dataset for video description of which we are aware. Our dataset currently includes over 84.6 hours of paired video/sentences from 92 DVDs and is growing.


Authors (4)

Citations (202)


Summary

The paper presents an automated method for extracting DVS narrations and uses it to build M-VAD, which the authors believe to be the largest DVS-derived video description dataset, drawn from 92 DVDs.
It uses vocal isolation and noise cancellation techniques to separate the narrations from the movie soundtrack; the narrations are typically misaligned with the visual content by no more than two seconds.
M-VAD comprises over 84.6 hours of video across 48,986 clips averaging 6.2 seconds each, offering a high-quality resource for deep learning research.

An Analysis of Creating a Large Video Annotation Dataset Using Descriptive Video Services

The paper introduces a novel dataset derived from Descriptive Video Service (DVS) narrations found on DVDs, which aims to serve as a significant resource for video annotation research. The authors developed an automated process for segmenting and aligning DVS audio tracks with corresponding video content, creating what they believe to be the largest DVS-derived dataset available. This collection, named the Montreal Video Annotation Dataset (M-VAD), comprises over 84.6 hours of paired video and descriptive text from 92 DVDs.

Methodological Approach

The core contribution of this work lies in the creation of a scalable method for dataset construction using DVS audio tracks, which serve as narrated descriptions of visual content tailored for the visually impaired. DVS differs from traditional movie scripts as it is tightly aligned with visual scenes, providing descriptions of actions, appearances, and other visual elements with a temporal misalignment typically not exceeding two seconds.

The authors implemented a semi-automated system to isolate and segment these narrations from the mixed movie soundtracks. They combined vocal isolation techniques with Least Mean Squares (LMS) noise cancellation to separate the DVS narration from the original soundtrack. The method exploits the fact that DVS narrations are inserted into natural pauses in the dialogue, which makes the narration easier to extract cleanly.
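The LMS noise cancellation step described above can be illustrated with a minimal sketch. It assumes the original soundtrack (without narration) is available as a reference signal, and it uses synthetic toy signals rather than real audio. The function name, filter length, and step size are hypothetical choices for illustration, not the authors' settings.

```python
import math
import random


def lms_cancel(primary, reference, num_taps=4, step_size=0.05):
    """Subtract an adaptively filtered copy of `reference` from `primary`.

    The residual (error signal) is what remains after the part of `primary`
    that is correlated with `reference` has been removed.
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # recent reference samples, newest first
    residual = []
    for desired, sample in zip(primary, reference):
        history = [sample] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = desired - estimate
        # LMS update: move the weights along the instantaneous gradient.
        weights = [w + step_size * error * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual


def mean_power(signal):
    return sum(s * s for s in signal) / len(signal)


# Toy data (assumption): white noise stands in for the movie soundtrack,
# and a sinusoid that starts halfway through stands in for the narration.
rng = random.Random(0)
n = 4000
soundtrack = [rng.uniform(-1.0, 1.0) for _ in range(n)]
narration = [
    0.3 * math.sin(2 * math.pi * 0.01 * t) if t >= n // 2 else 0.0
    for t in range(n)
]
# The mixed track contains a filtered copy of the soundtrack plus narration.
mixed = [
    0.8 * soundtrack[t] + 0.4 * (soundtrack[t - 1] if t > 0 else 0.0) + narration[t]
    for t in range(n)
]

recovered = lms_cancel(mixed, soundtrack)

# Compare how far each signal is from the true narration in the last quarter.
tail = slice(3 * n // 4, n)
err_before = mean_power([m - s for m, s in zip(mixed[tail], narration[tail])])
err_after = mean_power([r - s for r, s in zip(recovered[tail], narration[tail])])
print(f"error power before: {err_before:.4f}, after: {err_after:.4f}")
```

In this toy setting the residual is much closer to the narration than the mixed track is, because the filter learns how the reference soundtrack appears in the mix. Real audio would require longer filters and careful time alignment of the two tracks.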

Dataset Characteristics and Comparison

The resulting M-VAD dataset contains 48,986 video clips averaging 6.2 seconds each. It is distinguished by its reliance on professionally produced descriptions, as opposed to the crowd-sourced descriptions found in many other datasets.
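As a quick sanity check, the clip count and average clip length can be multiplied out and compared with the reported total duration. The variable names below are illustrative only.

```python
num_clips = 48_986        # clip count reported for M-VAD
avg_clip_seconds = 6.2    # reported average clip length (rounded)

total_hours = num_clips * avg_clip_seconds / 3600
print(f"{total_hours:.1f} hours")  # about 84.4 hours
```

The result is close to the reported figure of over 84.6 hours; the small gap is plausibly explained by the average clip length being rounded to one decimal place.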

A detailed comparison with existing datasets highlights M-VAD's comprehensive scope:

M-VAD covers a wider variety of movies and genres than domain-specific datasets such as the cooking-centric TACoS.
It surpasses previous efforts in size and the quality of its natural language descriptions owing to the professional nature of DVS.

The dataset's corpus has been analyzed using POS tagging to identify its linguistic features. The vocabulary includes a large number of nouns, verbs, and adjectives, indicative of the descriptive richness offered by DVS narrations.
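The kind of tally that a POS-based corpus analysis produces can be sketched as follows. The sentence and its tags are a hand-tagged hypothetical example using Penn Treebank-style tags, not data from M-VAD, and the summary does not say which tagger the authors used.

```python
from collections import Counter

# Hypothetical, hand-tagged description (Penn Treebank-style tags).
tagged_tokens = [
    ("A", "DT"), ("tall", "JJ"), ("man", "NN"), ("opens", "VBZ"),
    ("the", "DT"), ("red", "JJ"), ("door", "NN"),
]


def coarse_category(tag):
    """Map a fine-grained tag to a coarse word class."""
    if tag.startswith("NN"):
        return "noun"
    if tag.startswith("VB"):
        return "verb"
    if tag.startswith("JJ"):
        return "adjective"
    return "other"


# Number of tokens per coarse category.
token_counts = Counter(coarse_category(tag) for _, tag in tagged_tokens)

# Number of distinct (lowercased) words per coarse category.
vocabulary = {}
for word, tag in tagged_tokens:
    vocabulary.setdefault(coarse_category(tag), set()).add(word.lower())
vocab_sizes = {category: len(words) for category, words in vocabulary.items()}

print(token_counts)
print(vocab_sizes)
```

Applied to a full corpus, the same counting gives per-category vocabulary sizes, which is the type of statistic used to characterize the descriptive richness of the narrations.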

Implications and Future Work

Practically, the M-VAD dataset holds significant promise for advancing video annotation models, particularly in deep learning contexts that require extensive paired data. The dataset's high-quality descriptions can facilitate more nuanced understanding and generation of natural language descriptions from visual data.

Theoretically, this research also raises questions about the comparative efficacy of different sources of video descriptions. By contrasting DVS with film scripts, it lays the groundwork for further studies on which annotation sources are best suited to machine learning tasks.

There is potential for extensions of this work, with possible directions including the enrichment of the dataset with additional DVDs as more become available, or improving the automation of the segmentation and transcription processes. Future research might also explore the domain adaptation of machine learning models trained on this dataset to related video understanding tasks.

In sum, this paper contributes a valuable resource and methodological insights to the video annotation and AI research community, fostering improvements in machine understanding of videos through richer datasets.
