How LLMs are learning to differentiate spatial sounds

Key Points:

  • BAT is a spatial, audio-based LLM that can classify sound types and estimate their direction and distance with precision.
  • Integrating spatial audio into LLMs marks a significant step toward truly multimodal AI systems.
  • BAT demonstrates strong spatial reasoning even with mixed, overlapping sound sources.

Summary:

Humans pride themselves on their ability to pick out the nuances of a sound and pinpoint its origin. Large language models (LLMs) have lagged when it comes to deciphering spatial audio. The developers behind BAT, a new spatial audio-based LLM, have finally cracked the 3-D sound code, enabling it to identify various sounds and estimate their direction and distance, all while juggling multiple audio sources.


While other models such as AudioGPT and LTU have been advancing the audio domain, none has mastered the complexity of spatial audio in diverse 3-D environments the way BAT has. Armed with spatial reasoning skills, BAT hit an impressive 77% accuracy rate on tasks involving mixed sounds and sources. Its spatial audio encoder classifies sound types, pinpoints sound directions to within an 18-degree error, and estimates distances to within about a foot and a half of reality. Talk about audio precision!
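The article doesn't detail how BAT's learned encoder works, but the kind of direction estimate it reports can be illustrated with a classic signal-processing baseline: recovering azimuth from the interaural time difference (ITD) between the two channels of a binaural recording. This is a minimal sketch for intuition only, not BAT's method; the ear-spacing constant and the synthetic demo signal are illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air
EAR_SPACING = 0.18      # meters; rough human head width (illustrative assumption)

def estimate_azimuth(left, right, sample_rate):
    """Estimate source azimuth (degrees, positive = toward the left ear)
    from the interaural time difference, found via cross-correlation."""
    corr = np.correlate(left, right, mode="full")
    # Samples by which the right channel lags the left (positive = left leads).
    lag = (len(right) - 1) - np.argmax(corr)
    itd = lag / sample_rate  # interaural time difference in seconds
    # Clamp to the physically possible range before inverting the sine.
    sin_theta = np.clip(itd * SPEED_OF_SOUND / EAR_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))

# Synthetic demo: a noise burst that reaches the left ear 10 samples early.
rng = np.random.default_rng(0)
sr = 44100
src = rng.standard_normal(sr // 10)
delay = 10
left = np.concatenate([src, np.zeros(delay)])
right = np.concatenate([np.zeros(delay), src])
azimuth = estimate_azimuth(left, right, sr)  # source off to the left
```

A 10-sample lead at 44.1 kHz is an ITD of about 0.23 ms, which this geometry maps to an azimuth of roughly 26 degrees. Real systems (and BAT's encoder) must also handle reverberation and level differences, which is where learned approaches earn their keep.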


Fueling this tech are researchers at the University of Texas and Shanghai Jiao Tong University, who crafted a Spatial Audio Spectrogram Transformer along with a set of spatial question-answering tasks that BAT was trained on. With a dataset of labeled binaural audio rendered inside large-scale 3-D building models, BAT proved its mettle by acing tasks like recognizing a baby's laughter or locating music playing behind the listener.


From leveling up virtual reality and gaming experiences to enhancing robotic and autonomous vehicle sensory smarts, the applications are numerous.


©2024 The Horizon