
Enhance Temporal Relations in Audio Captioning with Sound Event Detection (2306.01533v2)

Published 2 Jun 2023 in cs.SD and eess.AS

Abstract: Automated audio captioning aims at generating natural language descriptions for given audio clips, not only detecting and classifying sounds, but also summarizing the relationships between audio events. Recent research advances in audio captioning have introduced additional guidance to improve the accuracy of audio events in generated sentences. However, temporal relations between audio events have received little attention, even though revealing complex relations is a key component in summarizing audio content. Therefore, this paper aims to better capture temporal relationships in caption generation with sound event detection (SED), a task that locates events' timestamps. We investigate the best approach to integrate temporal information in a captioning model and propose a temporal tag system to transform the timestamps into comprehensible relations. Results evaluated by the proposed temporal metrics suggest that a substantial improvement is achieved in temporal relation generation.
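
The abstract does not spell out how the temporal tag system is defined, so the following is only an illustrative sketch of the general idea: turning SED timestamps into a coarse relation label. The function name `temporal_tag`, the tag names (`none`, `single`, `simultaneous`, `sequential`, `mixed`), and the `min_overlap` parameter are all assumptions made for this example, not the paper's actual scheme.

```python
from itertools import combinations
from typing import List, Tuple

# Hypothetical event format: (label, onset_seconds, offset_seconds).
Event = Tuple[str, float, float]


def temporal_tag(events: List[Event], min_overlap: float = 0.0) -> str:
    """Map detected sound events to a coarse temporal relation tag.

    This is an assumed, simplified tagging rule for illustration only.
    """
    if not events:
        return "none"
    if len(events) == 1:
        return "single"

    overlapping = 0
    disjoint = 0
    # Compare every pair of events, not just neighbours, so that a long
    # event overlapping a much later one is still counted.
    for (_, on_a, off_a), (_, on_b, off_b) in combinations(events, 2):
        shared = min(off_a, off_b) - max(on_a, on_b)
        if shared > min_overlap:
            overlapping += 1
        else:
            disjoint += 1

    if overlapping and disjoint:
        return "mixed"
    return "simultaneous" if overlapping else "sequential"


if __name__ == "__main__":
    clip = [("dog_bark", 0.0, 2.0), ("car_horn", 1.0, 3.0)]
    print(temporal_tag(clip))
```

A tag like this could then be supplied to a captioning model as extra conditioning, which is the kind of integration the abstract says the authors investigate; how they actually do so is not stated here.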

Authors (4)
  1. Zeyu Xie (14 papers)
  2. Xuenan Xu (29 papers)
  3. Mengyue Wu (57 papers)
  4. Kai Yu (202 papers)
Citations (10)
