Humans pride themselves on their ability to pick out the nuances of sound and pinpoint its origin. Large language models (LLMs), however, have lagged when it comes to deciphering spatial audio. The developers behind BAT, a new spatial-audio LLM, have finally cracked the 3-D sound code, enabling the model to identify various sounds, their direction, and their distance, all while juggling multiple audio sources.
While other models, such as AudioGPT and LTU, have been advancing the audio domain, none has quite mastered the complexity of spatial audio in diverse 3-D environments the way BAT has. Armed with spatial reasoning skills, BAT hit an impressive 77% accuracy when handling mixed sounds and sources. Its spatial audio encoder classifies sounds, pinpoints their direction with less than 18 degrees of error, and estimates their distance to within about a foot and a half of reality. Talk about audio precision!
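For readers curious what "less than 18 degrees of error" means in practice, here is a minimal sketch (not BAT's actual evaluation code) of how the angle between a predicted and a true sound direction is commonly measured:

```python
import math

def angular_error_deg(pred, true):
    """Angle in degrees between two 3-D direction vectors."""
    dot = sum(p * t for p, t in zip(pred, true))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in true))
    # Clamp to guard against floating-point drift just outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

# A source dead ahead vs. a guess 15 degrees off to the side:
off_axis = (math.cos(math.radians(15)), math.sin(math.radians(15)), 0.0)
print(angular_error_deg((1.0, 0.0, 0.0), off_axis))  # ~15.0
```

By this yardstick, an 18-degree error means the model's guess lands within a fairly narrow cone around the true source.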
Fueling this tech are researchers at the University of Texas and Shanghai Jiao Tong University, who crafted a Spatial Audio Spectrogram Transformer to serve as BAT's audio encoder, along with a suite of spatial question-answering tasks to train it. With a dataset comprising binaural audio rendered inside large-scale building models, BAT proved its mettle by acing tasks like recognizing a baby's laughter or pinpointing music playing behind the listener.
From leveling up virtual reality and gaming experiences to sharpening the senses of robots and autonomous vehicles, the applications are numerous.