
DASB - Discrete Audio and Speech Benchmark (2406.14294v2)

Published 20 Jun 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal LLMs. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.

Authors (6)
  1. Pooneh Mousavi (9 papers)
  2. Luca Della Libera (14 papers)
  3. Jarod Duret (10 papers)
  4. Artem Ploujnikov (6 papers)
  5. Cem Subakan (35 papers)
  6. Mirco Ravanelli (72 papers)
Citations (7)
