SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

(2405.00233)
Published Apr 30, 2024 in cs.SD, cs.AI, cs.MM, eess.AS, and eess.SP

Abstract

LLMs have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs often operate at high bitrates or within narrow domains such as speech and lack the semantic clues required for efficient language modelling. Addressing these challenges, we introduce SemantiCodec, a novel codec designed to compress audio into fewer than a hundred tokens per second across diverse audio types, including speech, general audio, and music, without compromising quality. SemantiCodec features a dual-encoder architecture: a semantic encoder using a self-supervised AudioMAE, discretized using k-means clustering on extensive audio data, and an acoustic encoder to capture the remaining details. The semantic and acoustic encoder outputs are used to reconstruct audio via a diffusion-model-based decoder. SemantiCodec is presented in three variants with token rates of 25, 50, and 100 per second, supporting a range of ultra-low bit rates between 0.31 kbps and 1.43 kbps. Experimental results demonstrate that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality. Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated audio codecs, even at significantly lower bitrates. Our code and demos are available at https://haoheliu.github.io/SemantiCodec/.

Comparison of SemantiCodec with other codecs on reconstruction quality, semantic information, and bitrate sizes.

Overview

  • SemantiCodec is a neural audio codec for ultra-low-bitrate compression of speech, general audio, and music. It uses a dual-encoder structure: a semantic encoder for high-level content and an acoustic encoder for the remaining detail.

  • The codec reduces the token rate to 25, 50, or 100 tokens per second, which shortens the sequences a downstream language model must process, and it reconstructs audio with a diffusion-model-based decoder.

  • In the reported experiments, SemantiCodec outperforms the Descript codec on reconstruction quality and carries richer semantic information than the other evaluated codecs, at bitrates as low as 0.31 kbps. This suggests applications in telecommunications, broadcasting, and AI-driven audio processing.

Exploring SemantiCodec: An Innovative Approach to Ultra Low Bitrate Audio Compression

Introduction to Semantic Audio Codecs

Audio codecs encode and decode digital audio so that it can be stored and transmitted efficiently, for example in telecommunication and broadcasting. Traditional audio codecs compress mainly by discarding parts of the sound that listeners cannot hear, while more recent codecs use neural networks to improve both quality and compression rate. Most notably, these neural codecs use techniques like vector quantization, where audio is converted into discrete tokens, much like how words are tokenized in NLP.

However, when a single codec has to handle varied audio types (such as speech, music, or ambient sounds), balancing compression (low bitrate) against audio quality becomes harder. SemantiCodec, a novel semantic audio codec, targets this balance: it compresses audio at ultra-low bitrates while aiming to preserve quality.

Core Innovations of SemantiCodec

Dual-Encoder Structure: SemantiCodec uses a unique dual-encoder system comprising a semantic encoder and an acoustic encoder. This architecture allows it to effectively compress audio while retaining crucial sound details.

  1. Semantic Encoder: It builds on AudioMAE, a self-supervised model, meaning it learns from audio data without explicit labels. The encoder extracts features from the audio, and these features are discretized with k-means clustering fitted on extensive audio data. The resulting cluster indices are the semantic tokens.
  2. Acoustic Encoder: This component captures the finer acoustic details that the semantic encoder misses. These details are needed to restore the audio at high quality during decoding.
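
The two encoders can be illustrated with a small sketch. The semantic step below is plain nearest-centroid assignment, which is how a fitted k-means codebook turns a feature vector into a token. The acoustic step is only an assumption made for illustration: it quantizes the residual left after the semantic centroid is subtracted, using a second codebook. The summary does not say how the real acoustic encoder works, and the function names, array shapes, and random codebooks here are all hypothetical.

```python
import numpy as np

def nearest_code(vectors, codebook):
    """Return, for each row of `vectors`, the index of the closest codebook row."""
    # Squared Euclidean distances, shape (num_frames, codebook_size).
    distances = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return distances.argmin(axis=1)

def encode(features, semantic_codebook, acoustic_codebook):
    """Map feature frames to (semantic_tokens, acoustic_tokens)."""
    # Semantic tokens: index of the nearest k-means centroid.
    semantic_tokens = nearest_code(features, semantic_codebook)
    # ASSUMPTION: the acoustic stage is modelled as residual quantization.
    residual = features - semantic_codebook[semantic_tokens]
    acoustic_tokens = nearest_code(residual, acoustic_codebook)
    return semantic_tokens, acoustic_tokens

def decode_features(semantic_tokens, acoustic_tokens,
                    semantic_codebook, acoustic_codebook):
    """Rebuild approximate features by summing the two selected code vectors."""
    return semantic_codebook[semantic_tokens] + acoustic_codebook[acoustic_tokens]

# Hypothetical stand-ins: random "features" and random codebooks.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 8))                      # 50 frames, 8 dimensions
semantic_codebook = rng.normal(size=(16, 8))             # plays the role of k-means centroids
acoustic_codebook = rng.normal(scale=0.5, size=(16, 8))  # plays the role of a learned codebook

sem, aco = encode(features, semantic_codebook, acoustic_codebook)
approx = decode_features(sem, aco, semantic_codebook, acoustic_codebook)
```

In a real system the semantic codebook would come from k-means fitted on AudioMAE features, and the token pairs would condition the decoder rather than being summed directly.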

Token Efficiency: Earlier codecs often require high token rates (hundreds of tokens per second), which produce long sequences and hamper computational efficiency in downstream language models. SemantiCodec lowers the token rate to 100, 50, or as few as 25 tokens per second, which eases the computational load while, according to the authors, preserving audio quality.
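
The link between token rate and bitrate is simple arithmetic: for a fixed-size codebook with no entropy coding, each token costs log2(codebook size) bits, so bitrate equals token rate times bits per token. The sketch below applies this to the figures quoted in the abstract. It assumes that 0.31 kbps belongs to the 25 tokens-per-second variant and 1.43 kbps to the 100 tokens-per-second variant; the summary does not state the codebook sizes, so only the average bits per token are derived.

```python
import math

def bitrate_kbps(token_rate_hz, codebook_size):
    """Bitrate of a token stream drawn from one fixed-size codebook."""
    return token_rate_hz * math.log2(codebook_size) / 1000.0

def bits_per_token(kbps, token_rate_hz):
    """Average number of bits each token carries at a given bitrate."""
    return kbps * 1000.0 / token_rate_hz

# Figures from the abstract; the pairing of rates is an assumption.
low = bits_per_token(0.31, 25)     # about 12.4 bits per token
high = bits_per_token(1.43, 100)   # about 14.3 bits per token

# Generic example, not a configuration from the paper:
# 100 tokens per second from a 1024-entry codebook.
example = bitrate_kbps(100, 1024)  # 100 * 10 bits = 1000 bits/s = 1.0 kbps
```

Roughly 12 to 14 bits per token corresponds to codebooks with thousands to tens of thousands of entries, which is why such low token rates still leave room for detail.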

Diffusion Model-Based Decoder: To reconstruct audio from the encoded tokens, SemantiCodec uses a diffusion model, a class of generative models known for high-quality outputs. The decoder is conditioned on both the semantic and the acoustic tokens, so the reconstruction is guided by high-level content as well as fine acoustic detail.
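
To make "conditioned diffusion decoding" concrete, here is a toy sketch of DDPM-style ancestral sampling. It is not the paper's decoder: the summary does not describe the network, the sampler, or how the tokens enter the model. The noise predictor below is a hypothetical stand-in that simply pretends the clean signal equals the conditioning vector, and the step count and noise schedule are arbitrary choices.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (arbitrary values chosen for this sketch)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_noise_predictor(x_t, t, condition, alpha_bars):
    """Hypothetical stand-in for a trained network.

    It assumes the clean signal is exactly `condition` and returns the
    noise that would explain x_t under that assumption.
    """
    return (x_t - np.sqrt(alpha_bars[t]) * condition) / np.sqrt(1.0 - alpha_bars[t])

def sample(condition, num_steps=50, seed=0):
    """Run the reverse diffusion process, starting from Gaussian noise."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.normal(size=condition.shape)
    for t in range(num_steps - 1, -1, -1):
        eps_hat = toy_noise_predictor(x, t, condition, alpha_bars)
        # Posterior mean of the previous step given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Intermediate steps re-inject a small amount of fresh noise.
            x = mean + np.sqrt(betas[t]) * rng.normal(size=x.shape)
        else:
            # The final step is deterministic.
            x = mean
    return x

# Hypothetical conditioning vector standing in for what the tokens encode.
condition = np.linspace(-1.0, 1.0, 16)
reconstruction = sample(condition)
```

Because the stand-in predictor is an oracle, the loop ends at the conditioning vector itself. A trained decoder would instead predict the noise with a neural network that takes the token embeddings as input.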

Empirical Evaluation and Results

SemantiCodec is thoroughly evaluated against existing codecs like the Descript codec under various metrics:

  • Semantic Richness: Its tokens are reported to carry significantly richer semantic information than those of all other evaluated codecs, even at significantly lower bitrates. This matters for language modelling on audio tokens and for other tasks that depend on audio content.
  • Reconstruction Quality: Despite its low bitrates, SemantiCodec achieves high-quality reconstruction and is reported to significantly outperform the state-of-the-art Descript codec. All three variants operate below 1.5 kbps (0.31 to 1.43 kbps).

The tests indicate that even at ultra-low bitrates (as low as 0.31 kbps), SemantiCodec provides satisfactory audio quality that is competitive with, if not superior to, that of systems operating at much higher bitrates.

The Path Forward

While SemantiCodec introduces promising advancements in audio processing, future developments could explore even deeper integrations of semantic information. Enhancing the efficiency of the encoding and decoding processes, possibly through further AI optimizations, could allow for real-time applications in more bandwidth-sensitive environments.

Moreover, incorporating multi-modal learning, where the system could learn from not only audio but related modalities like video or text, could pave the way for more robust and versatile semantic audio codecs.

Conclusion

SemantiCodec has made significant strides in demonstrating that it's indeed possible to retain high audio quality at remarkably low bitrates with rich semantic understanding. This codec not only stands to benefit the traditional domains of telecommunications and broadcasting but also opens new avenues in smart devices, streaming services, and AI-powered audio applications, where efficiency and quality are paramount.
