Emergent Mind

Look Once to Hear: Target Speech Hearing with Noisy Examples

(2405.06289)
Published May 10, 2024 in cs.SD, cs.AI, and eess.AS

Abstract

In crowded settings, the human brain can focus on speech from a target speaker, given prior knowledge of how they sound. We introduce a novel intelligent hearable system that achieves this capability, enabling target speech hearing to ignore all interfering speech and noise, but the target speaker. A naive approach is to require a clean speech example to enroll the target speaker. This is however not well aligned with the hearable application domain since obtaining a clean example is challenging in real world scenarios, creating a unique user interface problem. We present the first enrollment interface where the wearer looks at the target speaker for a few seconds to capture a single, short, highly noisy, binaural example of the target speaker. This noisy example is used for enrollment and subsequent speech extraction in the presence of interfering speakers and noise. Our system achieves a signal quality improvement of 7.01 dB using less than 5 seconds of noisy enrollment audio and can process 8 ms of audio chunks in 6.24 ms on an embedded CPU. Our user studies demonstrate generalization to real-world static and mobile speakers in previously unseen indoor and outdoor multipath environments. Finally, our enrollment interface for noisy examples does not cause performance degradation compared to clean examples, while being convenient and user-friendly. Taking a step back, this paper takes an important step towards enhancing the human auditory perception with artificial intelligence. We provide code and data at: https://github.com/vb000/LookOnceToHear.

End-to-end target speech hearing system featuring noise cancellation capabilities.

Overview

  • The paper presents AI-powered binaural hearables that let the wearer hear a single chosen speaker in a noisy environment while suppressing interfering speech and background noise.

  • The system works through an enrollment phase, in which the wearer looks at the target speaker for a few seconds so the device can capture a short, noisy binaural example. Machine learning models then use that example to extract the target speaker's voice from the surrounding sound.

  • User studies reported in the paper show that the system generalizes to static and moving speakers in previously unseen indoor and outdoor environments. It holds potential not only for personalized listening but also as an aid for people with hearing impairments.

Enhanced Listening: AI-powered Binaural Hearables for Selective Hearing in Noisy Environments

Introduction to Selective Listening with Hearables

Imagine attending a crowded event and trying to follow one conversation while countless other voices and noises compete for your attention. Traditional noise-canceling devices block out all sounds, which isn't always ideal. The paper proposes selective listening through "binaural hearables": devices that let the wearer hear only the sounds they want, specifically the voice of a chosen speaker.

How Does Selective Listening Work?

The paper introduces hearable devices that use binaural audio input, meaning sound is captured by microphones at both ears. The goal is not just to silence unwanted noise but to filter the scene and keep one chosen sound source. Here’s how it functions:

  1. Enrollment Phase: The user starts by 'enrolling' the target speaker. This means briefly looking at the speaker while the device records a short, noisy audio sample via its binaural microphones.
  2. Noise and Speaker Separation: Using the recorded sample, the device employs machine learning models to learn the unique speech characteristics (or acoustic signature) of the target speaker despite the background noise.
  3. Selective Enhancement: Once the target speaker's characteristics are learned, the system can play back their voice while suppressing other sounds, even in a dynamic environment where both the listener and the speaker might be moving.
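
One intuition for why looking at the speaker helps during enrollment: a source directly in front of the wearer reaches both ears at nearly the same time, while off-axis sources arrive with a small delay between the ears. The toy sketch below illustrates that idea with plain channel averaging (a classical delay-and-sum approach). It is not the paper's method, which relies on learned models; the white-noise signals, the 8-sample delay, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, computed from signal energies."""
    return 10.0 * np.log10(np.sum(signal ** 2) / np.sum(noise ** 2))


def simulate_binaural(num_samples=16000, interferer_delay=8, seed=0):
    """Build toy two-channel (left, right) target and interferer signals.

    Assumption: both sources are white noise. The frontal target is
    identical at both ears; the off-axis interferer reaches the right
    ear `interferer_delay` samples after the left ear.
    """
    rng = np.random.default_rng(seed)
    target = rng.standard_normal(num_samples)
    interferer = rng.standard_normal(num_samples + interferer_delay)
    target_lr = np.stack([target, target])
    interferer_lr = np.stack([
        interferer[interferer_delay:],   # left ear hears it first
        interferer[:num_samples],        # right ear hears it later
    ])
    return target_lr, interferer_lr


def frontal_average(binaural):
    """Average the two ears: reinforces whatever is time-aligned across them."""
    return binaural.mean(axis=0)


target_lr, interferer_lr = simulate_binaural()

# SNR at a single ear versus SNR after averaging both ears.
snr_single_ear = snr_db(target_lr[0], interferer_lr[0])
snr_averaged = snr_db(frontal_average(target_lr), frontal_average(interferer_lr))
snr_gain = snr_averaged - snr_single_ear

print(f"single ear: {snr_single_ear:.2f} dB, averaged: {snr_averaged:.2f} dB, "
      f"gain: {snr_gain:.2f} dB")
```

In this toy setup the time-aligned target adds coherently while the misaligned interferer partly cancels, so the averaged signal has a higher SNR than either ear alone. Real speech and real head acoustics are far less forgiving, which is presumably why the system uses neural networks rather than simple averaging.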

Technical Achievements and Practical Applications

  • Real-time Processing: The system is designed to run in real time on hearable hardware. According to the abstract, it processes each 8 ms chunk of audio in 6.24 ms on an embedded CPU, so computation keeps pace with the incoming audio.
  • Effective in Noisy, Real-world Environments: The reported user studies cover static and mobile speakers in previously unseen indoor and outdoor environments, providing a proof of concept for everyday use. The abstract reports a signal quality improvement of 7.01 dB using less than 5 seconds of noisy enrollment audio.
  • User-friendly Interface for Enrollment: Enrolling a target speaker can be as simple as pressing a button or using a smartphone interface while looking at the speaker. This makes the technology accessible and easy to use in real-world scenarios.
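
The real-time claim can be checked with simple arithmetic on the figures from the abstract (8 ms chunks processed in 6.24 ms). The short sketch below computes the real-time factor and the per-chunk headroom. The 16 kHz sample rate used to convert a chunk into samples is an assumption for illustration, not a figure stated in this summary, and the function names are hypothetical.

```python
CHUNK_MS = 8.0       # audio chunk duration, from the abstract
PROCESS_MS = 6.24    # processing time per chunk on an embedded CPU, from the abstract
ASSUMED_SAMPLE_RATE_HZ = 16_000  # assumption: not stated in this summary


def real_time_factor(process_ms, chunk_ms):
    """Ratio of compute time to audio time; below 1.0 means faster than real time."""
    return process_ms / chunk_ms


def headroom_ms(process_ms, chunk_ms):
    """Spare time per chunk before the next chunk arrives."""
    return chunk_ms - process_ms


def samples_per_chunk(chunk_ms, sample_rate_hz):
    """Number of audio samples in one chunk at the given sample rate."""
    return int(round(chunk_ms * sample_rate_hz / 1000))


rtf = real_time_factor(PROCESS_MS, CHUNK_MS)
spare = headroom_ms(PROCESS_MS, CHUNK_MS)
chunk_samples = samples_per_chunk(CHUNK_MS, ASSUMED_SAMPLE_RATE_HZ)

print(f"real-time factor: {rtf:.2f}")
print(f"headroom per chunk: {spare:.2f} ms")
print(f"samples per chunk at assumed rate: {chunk_samples}")
```

A real-time factor below 1.0 means each chunk finishes processing before the next one arrives. The remaining headroom has to absorb everything else the device does, such as audio input and output buffering.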

Exploring the Implications

The practical implications of this research are vast:

  • Personalized Listening in Public Spaces: Users could tune into specific sources of sound (like a tour guide's narration amidst a noisy crowd) without missing out on the overall ambient experience.
  • Aid for the Hearing Impaired: This technology could evolve into a valuable tool for those with hearing impairments, allowing for clearer conversations in challenging auditory environments.

Future Perspectives and Challenges

While promising, the technology still faces challenges, such as handling several people talking at once from the same direction or picking out speech in highly chaotic noise. Future work might focus on supporting multiple target voices and on integrating with a broader range of personal devices.

Conclusion

Binaural hearables equipped with AI-driven selective listening could significantly improve how we experience sound in noisy environments, making it possible to focus on what we choose to hear without being isolated from the world around us. As research progresses, these technologies point toward personalized auditory experiences in which listening is something the wearer actively controls rather than passively receives.
