
Improving Acoustic and Bone Conduction Speech Enhancement

Audio super-resolution model leverages neural networks to process speech using a head-worn accelerometer

The Problem

Bone conduction acoustic modalities are robust against ambient noise but attenuate higher vocal frequencies.

Our Idea

The research team developed an audio super-resolution model that leverages neural networks to process speech using only a single, head-worn accelerometer.

Why It Matters

The technology could be adapted for applications including improving privacy and accessibility for audio-based smart devices, speech disfluency and training, and creating silent speech interfaces.

Our Team

Professor Stephen Xia, computer engineering PhD students Yueyuan Sui and Junxi Xia

Frustrated by background noise making it difficult to hear and understand calls, Northwestern Engineering’s Stephen Xia aimed to develop a system that transmits speech clearly, even in noisy public spaces.

Unlike traditional, over-the-air microphones commonly found in earbuds or headphones, which convert variations in air pressure into electrical signals, bone conduction microphones (BCMs) pick up vocal cord vibrations through the skin and skull. Used in devices such as osseointegrated hearing aids and sports headsets, BCMs are not sensitive to changes in air pressure, making them naturally robust against ambient noise.

Bone conduction acoustic modalities, however, have a significant drawback: they attenuate vocal frequencies, reducing sound quality and degrading intelligibility, particularly in the higher range.

Xia and his team, which included first-year computer engineering PhD students Yueyuan Sui and Junxi Xia, set out to reconstruct the missing and attenuated frequencies, a process called super resolution or bandwidth expansion, to provide high-quality, noise-free speech for real-time mobile, wearable, and ‘earable’ (ear-worn audio-plus-sensing) systems.
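To make the bandwidth-expansion problem concrete, the sketch below (not the team's pipeline; the sampling rates are illustrative assumptions) band-limits a full-rate recording the way a bone conduction channel does, leaving a high-frequency gap that a super-resolution model would be trained to fill.

```python
# Minimal sketch of the bandwidth-expansion problem (illustrative rates only).
import numpy as np
from scipy.signal import resample_poly

FULL_RATE = 16_000   # typical wideband speech sample rate (Hz)
LOW_RATE = 2_000     # assumed band-limited rate for the vibration channel (Hz)

def simulate_bone_conduction(speech: np.ndarray) -> np.ndarray:
    """Band-limit full-rate speech by resampling down and back up.

    The returned signal has the same length and rate as the input but contains
    no energy above LOW_RATE / 2: this is the gap a bandwidth-expansion
    (super-resolution) model is asked to reconstruct.
    """
    low = resample_poly(speech, up=LOW_RATE, down=FULL_RATE)   # discard highs
    return resample_poly(low, up=FULL_RATE, down=LOW_RATE)     # back to 16 kHz

if __name__ == "__main__":
    t = np.arange(FULL_RATE) / FULL_RATE
    # A toy "speech" signal with a 300 Hz fundamental and a 3 kHz component.
    x = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 3000 * t)
    y = simulate_bone_conduction(x)
    # The 3 kHz component (above the 1 kHz Nyquist limit of the low-rate
    # channel) is almost entirely gone in y.
    print("full-band RMS:   ", np.sqrt(np.mean(x**2)))
    print("band-limited RMS:", np.sqrt(np.mean(y**2)))
```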

“Existing super resolution methods can provide high quality speech but are computationally expensive and not practical for our personal mobile devices, such as smartphones, earbuds, or other wearables, while methods that have a small footprint provide significantly lower fidelity speech,” said Xia, assistant professor of electrical and computer engineering and (by courtesy) computer science at the McCormick School of Engineering. “Our goal was to bridge the gap between speed, efficiency, and performance.”

The team developed TRAMBA, a hybrid transformer and Mamba deep learning architecture for acoustic and bone conduction speech enhancement. Their method achieved higher-quality speech with a memory footprint of only ~5 MB, compared with ~500 MB for other state-of-the-art models. TRAMBA can process half a second of audio on a smartphone in ~20 milliseconds.
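For readers unfamiliar with the pairing, the sketch below is an illustrative hybrid block, not the published TRAMBA architecture: it interleaves a self-attention layer, which models global context but scales poorly with sequence length, with a cheap linear-time sequence layer standing in for the Mamba selective state-space block (here a simple gated recurrence, an assumption made for brevity).

```python
# Illustrative hybrid attention + linear-time-sequence block; not TRAMBA itself.
import torch
import torch.nn as nn

class GatedSequenceBlock(nn.Module):
    """Simplified linear-time sequence mixer used as a stand-in for Mamba."""
    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.gate = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.norm(x))
        return x + h * torch.sigmoid(self.gate(x))   # gated residual update

class HybridBlock(nn.Module):
    """One self-attention layer followed by one linear-time sequence layer."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim,
            batch_first=True, norm_first=True)
        self.ssm_like = GatedSequenceBlock(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ssm_like(self.attn(x))

if __name__ == "__main__":
    frames = torch.randn(1, 200, 64)   # (batch, time frames, features)
    out = HybridBlock()(frames)
    print(out.shape)                   # torch.Size([1, 200, 64])
```

The appeal of such hybrids is that attention layers capture long-range structure while the linear-time layers keep memory and latency small enough for on-device, real-time use.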

The super-resolution model also improves word error rate (the proportion of incorrectly recognized words relative to the total number of words spoken) by up to 75 percent in noisy environments compared with traditional noise-suppression approaches.
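Word error rate is conventionally computed as a word-level edit distance (substitutions, insertions, and deletions) against a reference transcript, divided by the number of reference words; the short function below shows the computation on made-up sentences. A 75 percent relative improvement would mean, for example, a WER of 0.40 falling to 0.10.

```python
# Minimal word-error-rate (WER) computation via word-level edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion ("the") and one substitution ("please" -> "peas") out of 5 words.
print(wer("turn the volume down please", "turn volume down peas"))  # 0.4
```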

Moreover, the team demonstrated that because bone conduction-based sensing modalities lack high-frequency components, both the sensor's sampling rate and the data transmission rate can be reduced significantly, which can improve the battery life of wearables by up to 160 percent.
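The article does not give the exact rates involved, but a back-of-the-envelope comparison with assumed numbers shows why a lower sampling rate means less data to sample, store, and transmit over the radio, which is where the battery savings come from.

```python
# Back-of-the-envelope data-rate comparison (illustrative numbers only; the
# article does not state the sampling rates or bit depths actually used).
BITS_PER_SAMPLE = 16

def raw_bitrate(sample_rate_hz: int) -> int:
    """Raw (uncompressed) bitrate in bits per second for mono audio."""
    return sample_rate_hz * BITS_PER_SAMPLE

wideband = raw_bitrate(16_000)   # conventional over-the-air speech capture
vibration = raw_bitrate(2_000)   # assumed low-rate bone conduction channel

print(f"wideband:  {wideband / 1000:.0f} kbit/s")
print(f"vibration: {vibration / 1000:.0f} kbit/s "
      f"({wideband / vibration:.0f}x less to sample and transmit)")
```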

TRAMBA wearable prototype
The wearable prototype features an accelerometer and bone conduction microphone to collect inputs and a holder for an over-the-air microphone to capture high-resolution ground-truth speech. Photos provided by Stephen Xia
TRAMBA wearable prototype deployed in various settings
The research team incorporated TRAMBA into a vibration-based mobile and glasses wearable system to both collect data and deploy it in various environments. Photos provided by Stephen Xia

Their work is also the first to sense intelligible speech using only a single, head-worn accelerometer. The team incorporated TRAMBA into a vibration-based mobile and wearable system that transmits audio via Bluetooth Low Energy to a smartphone or computer. To fine-tune the pre-trained model, users collected up to 15 minutes of their own voice; the wearable pairs vibration-based speech gathered from the BCM with ground-truth input from the over-the-air microphone.
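The personalization step described above might look roughly like the following sketch (hypothetical function names and training details, not the team's code): a pre-trained enhancement model is fine-tuned on a short paired recording of the user's own voice, with the vibration channel as input and the over-the-air microphone as the ground-truth target.

```python
# Sketch of per-user fine-tuning on paired vibration / microphone recordings.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def fine_tune(model: nn.Module, vib: torch.Tensor, mic: torch.Tensor,
              epochs: int = 5, lr: float = 1e-4) -> nn.Module:
    """vib, mic: time-aligned (num_clips, samples) tensors from ~15 min of speech."""
    loader = DataLoader(TensorDataset(vib, mic), batch_size=8, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.L1Loss()                   # waveform reconstruction loss (assumed)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)     # predict clean wideband speech
            loss.backward()
            opt.step()
    return model
```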

Columbia University researchers Xiaofan Jiang and Minghui Zhao also collaborated on the speech super resolution and enhancement project, which will be presented at the UbiComp/International Symposium on Wearable Computers conference in 2025. The paper was published in the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies on November 21.

“This technology could be adapted for other important applications in the future, such as improving privacy and accessibility for audio-based smart devices, speech disfluency and training, and creating silent speech interfaces that may not require a person to explicitly voice words to interface with devices,” Xia said.

The research team is also investigating the application of their super resolution techniques to mixed reality and augmented reality domains.

Xia, who directs the Intelligent Mobile and Embedded Computing Lab, joined Northwestern last fall. Previously, he was a postdoctoral scholar in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley, advised by TRAMBA collaborator Xiaofan Jiang and Prabal Dutta. Xia earned a PhD in electrical engineering from Columbia University and a bachelor’s degree in electrical engineering from Rice University.