
A Survey of Video Datasets for Grounded Event Understanding

(2406.09646)
Published Jun 14, 2024 in cs.CV and cs.AI

Abstract

While existing video benchmarks largely consider specialized downstream tasks like retrieval or question-answering (QA), contemporary multimodal AI systems must be capable of well-rounded common-sense reasoning akin to human visual understanding. A critical component of human temporal-visual perception is our ability to identify and cognitively model "things happening", or events. Historically, video benchmark tasks have implicitly tested for this ability (e.g., video captioning, in which models describe visual events with natural language), but they do not consider video event understanding as a task in itself. Recent work has begun to explore video analogues to textual event extraction but consists of competing task definitions and datasets limited to highly specific event types. Therefore, while there is a rich domain of event-centric video research spanning the past 10+ years, it is unclear how video event understanding should be framed and what resources we have to study it. In this paper, we survey 105 video datasets that require event understanding capability, consider how they contribute to the study of robust event understanding in video, and assess proposed video event extraction tasks in the context of this body of research. We propose suggestions informed by this survey for dataset curation and task framing, with an emphasis on the uniquely temporal nature of video events and ambiguity in visual content.

Figure: Example video dataset paired with topics for robust video event understanding: content, presentation, and structure.

Overview

  • The paper surveys 105 video datasets, providing a structured framework for analyzing event understanding with a focus on content, presentation, and structure.

  • It categorizes datasets into types like action recognition, hierarchical action recognition, and question-answering, and discusses their respective challenges and applications.

  • Future directions advocate for unified tasks combining multiple event extraction methods and the creation of diverse synthetic datasets to address biases.

A Survey of Video Datasets for Grounded Event Understanding

The paper "A Survey of Video Datasets for Grounded Event Understanding" by Kate Sanders and Benjamin Van Durme provides a detailed examination of the current landscape of video datasets that require event understanding capabilities. This examination is contextualized within contemporary multimodal AI systems' goal of attaining human-like common-sense reasoning for visual content. The authors critically survey 105 datasets, categorizing them based on their event-centric requirements, and provide insights into the types of events, their presentation, and structural composition.

Core Contributions

The paper provides a structured framework for analyzing video datasets, focusing on three primary axes: the type of events presented (content), the manner in which they are presented (presentation), and their temporal and semantic interpretation (structure). This framework is essential for understanding a dataset's alignment with the task of robust video event understanding; a minimal schema sketch of the three axes follows the structure list below.

Dataset Content:

  • Categories like action recognition, hierarchical action recognition, and scene parsing are explored in depth.
  • The analysis covers domains such as sports, daily activities, and professional content like news and TV shows, highlighting a range in complexity and naturalness of events.

Dataset Presentation:

  • The presentation of video events is analyzed based on factors such as video quality, modality (video, audio, depth), and whether the content is naturally occurring or staged.
  • This dimension is crucial for assessing a dataset's applicability to real-world scenarios and its potential biases.

Dataset Structure:

  • The temporal and semantic structures of video events are considered, with formal classifications based on propositional, Davidsonian, Neo-Davidsonian, and string-based approaches.
  • Temporality is further dissected into time-agnostic, temporal, compositional, and hierarchical structures to ascertain the depth of temporal relationships presented in video content.
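
To make the three axes concrete, the sketch below (purely illustrative, not taken from the paper) encodes one surveyed dataset as a small Python record; the field names and example values are assumptions chosen to mirror the categories listed above.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class SemanticStructure(Enum):
    # Formal classifications of event semantics discussed in the survey.
    PROPOSITIONAL = auto()
    DAVIDSONIAN = auto()
    NEO_DAVIDSONIAN = auto()
    STRING_BASED = auto()


class TemporalStructure(Enum):
    # Degrees of temporal richness in a dataset's event annotations.
    TIME_AGNOSTIC = auto()
    TEMPORAL = auto()
    COMPOSITIONAL = auto()
    HIERARCHICAL = auto()


@dataclass
class DatasetEntry:
    """One surveyed dataset described along the three axes: content, presentation, structure."""
    name: str
    # Content: what kinds of events appear and what task they support.
    domains: List[str]                  # e.g., ["sports"], ["cooking"], ["news"]
    task: str                           # e.g., "action recognition", "QA", "captioning"
    # Presentation: how the events are shown.
    modalities: List[str]               # e.g., ["video", "audio", "depth"]
    naturally_occurring: bool           # filmed "in the wild" vs. staged/scripted
    # Structure: how events are formally and temporally interpreted.
    semantic_structure: SemanticStructure
    temporal_structure: TemporalStructure


# Hypothetical example entry; the dataset name and values are illustrative only.
example = DatasetEntry(
    name="ExampleVideoDataset",
    domains=["daily activities"],
    task="hierarchical action recognition",
    modalities=["video", "audio"],
    naturally_occurring=False,
    semantic_structure=SemanticStructure.NEO_DAVIDSONIAN,
    temporal_structure=TemporalStructure.HIERARCHICAL,
)
```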

Analysis of Surveyed Datasets

A substantial portion of the paper summarizes the general trends observed across the surveyed datasets:

  • Action Recognition: Foundationally simple datasets focused on identifying singular actions. Some extend to more complex actions such as group activities, though generally limited in domain scope.
  • Hierarchical Action Recognition: These datasets decompose actions into hierarchies, common in domains like cooking, posing challenges due to detailed annotation requirements.
  • Scene Graphs: Provide highly detailed annotations of relationships between entities, essential for tasks requiring in-depth temporal and spatial understanding.
  • Retrieval and Captioning: Ranges from short, simple events to lengthy, complex scenarios, with a diverse range of event types often paired with natural language descriptions.
  • Question-Answering: Leverages both existing datasets and organically collected data, focusing on narratives and socially complex content, particularly from TV shows and movies.
  • Specialized Categories: Includes datasets like news videos, multilingual content, and miscellaneous datasets facilitating a range of applications beyond traditional event recognition.

Implications of Semantic and Temporal Structures

The paper differentiates event understanding through formal semantic structures and temporal dimensions. Semantic structures are categorized along the lines of propositional, Davidsonian, Neo-Davidsonian, and string-based, each with varying complexity and suitability for different dataset types. Temporal structures are considered through their agnosticism to time, temporal sequencing, compositional nature, and hierarchical relationships. This categorization helps in identifying the richness and computational requirements of event modeling in video content.
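
To illustrate the distinctions the survey draws, here is a minimal, hypothetical sketch of how a single video event might be encoded in a Neo-Davidsonian style (a predicate plus separate role assertions), grounded to a time span and allowing sub-events for hierarchical structure. None of this is the paper's notation; the predicates, role names, and timestamps are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class VideoEvent:
    """Illustrative Neo-Davidsonian-style event: a predicate with separate role
    assertions, grounded to a time span in the video; sub_events capture hierarchy."""
    predicate: str                                # e.g., "chop"
    roles: Dict[str, str]                         # role -> filler, e.g., {"Agent": "chef"}
    span: Optional[Tuple[float, float]] = None    # (start_sec, end_sec); None if time-agnostic
    sub_events: List["VideoEvent"] = field(default_factory=list)


# A hypothetical hierarchical, temporally grounded event from a cooking video.
make_salad = VideoEvent(
    predicate="make_salad",
    roles={"Agent": "chef"},
    span=(0.0, 95.0),
    sub_events=[
        VideoEvent("chop", {"Agent": "chef", "Patient": "onion"}, span=(5.0, 30.0)),
        VideoEvent("mix", {"Agent": "chef", "Patient": "ingredients"}, span=(40.0, 90.0)),
    ],
)

# A propositional or string-based label, by contrast, would collapse to a single
# class id or caption such as "making a salad", with no explicit roles and, if
# time-agnostic, no temporal span.
```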

Proposed Video Event Extraction Tasks

The authors review three primary tasks framed for video event understanding:

  1. Multimodal Event Extraction (MMEE): Emphasizes joint video and text input but is limited by its requirement of perfect text-event alignment.
  2. Video Semantic Role Labeling (VidSRL): Covers compositional events with temporal markers, but is limited to clips of fixed length.
  3. Multimodal Event Hierarchy Extraction (MEHE): Focuses on identifying hierarchical relationships using paired text and video, restricted by its reliance on text input and shot-based event definitions.

Future Directions

The paper advocates for a unified task amalgamating the strengths of MMEE, VidSRL, and MEHE, including hierarchical, temporally defined, and text-alignable events. Moreover, it highlights the necessity for models capable of understanding event uncertainty, aligning more closely with human visual reasoning. Future research should focus on creating more diverse synthetic datasets to mitigate biases related to geographic, linguistic, and cultural factors.
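
As a rough, speculative sketch of what such a unified task might require of a model's output (an assumption for illustration, not a specification from the paper), a single extracted event would need to carry hierarchical structure, an explicit temporal extent, an alignment to accompanying text, and some expression of uncertainty:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ExtractedEvent:
    """Hypothetical output record for a unified video event extraction task,
    combining properties the survey attributes to MMEE, VidSRL, and MEHE."""
    label: str                                   # event type or predicate
    time_span: Tuple[float, float]               # temporally defined extent in seconds (VidSRL-style)
    text_mention: Optional[str] = None           # alignment to an accompanying document (MMEE-style)
    children: List["ExtractedEvent"] = field(default_factory=list)  # hierarchy (MEHE-style)
    confidence: float = 1.0                      # event uncertainty, reflecting the paper's call
                                                 # for models that reason about visual ambiguity
```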

Conclusion

Sanders and Van Durme's survey provides a comprehensive analysis of video datasets relevant to grounded event understanding, offering a solid foundation for future research. Their categorization framework and recommendations for video event extraction tasks serve as critical guidelines for developing more robust multimodal AI systems capable of human-like visual and temporal reasoning.

References

The paper concludes with an extensive bibliography, evidencing the breadth of research and datasets consulted for the survey; the individual references are not reproduced in this summary.
