Neural engineering

Thinking aloud: translating thoughts directly into speech

30 Jan 2019 Tami Freeman
Schematic of speech reconstruction

The ability to translate a person’s thoughts directly into speech could enable new ways for computers to communicate directly with the brain. A neuroprosthetic device that can reconstruct speech from neural activity could help people who cannot speak — such as paralyzed patients or those recovering from stroke — regain their ability to communicate with the outside world.

Previous studies have demonstrated the feasibility of reconstructing speech from brain signals. The low quality of the resulting speech, however, remains a major obstacle to developing a speech neuroprosthesis. To address this limitation, a research team based at Columbia University has combined recent advances in deep learning and speech synthesis to create a system that can translate thoughts into intelligible, recognizable speech (Sci. Rep. 10.1038/s41598-018-37359-z).

“Our voices help connect us to our friends, family and the world around us, which is why losing the power of one’s voice due to injury or disease is so devastating,” says senior author Nima Mesgarani, from Columbia University’s Zuckerman Institute, who did the work with Hassan Akbari and colleagues. “With today’s study, we have a potential way to restore that power. We’ve shown that, with the right technology, these people’s thoughts could be decoded and understood by any listener.”

Model comparisons

Speaking — or even imagining speech — generates specific patterns of activity within the brain. Distinct patterns of signals also emerge when listening (or imagining listening) to someone speak. Mesgarani and colleagues compared the ability of various techniques to decode these patterns and translate them into speech.

To reconstruct the acoustic stimulus from recorded neural signals, the researchers employed linear regression (LR) and nonlinear deep neural network (DNN) regression models. They also examined two acoustic representations: auditory spectrograms, as used in previous studies; and the vocoder — a computer algorithm that can synthesize speech after being trained on recordings of people talking.
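
The two model families can be pictured in a few lines of Python. This is a minimal sketch, not the authors' code: the data are random placeholders, and the feature dimensions and layer sizes are illustrative assumptions rather than values from the study.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical training data: each row of X is a flattened window of neural
# features; each row of Y is one frame of the acoustic representation
# (auditory-spectrogram bins or vocoder parameters).
n_frames, n_neural_features, n_acoustic_params = 2000, 128, 32
X = rng.standard_normal((n_frames, n_neural_features))
Y = rng.standard_normal((n_frames, n_acoustic_params))

# Baseline: linear regression from neural features to acoustic parameters.
lr_model = LinearRegression().fit(X, Y)

# Nonlinear alternative: a small feed-forward network, standing in here for
# the deeper architecture used in the study.
dnn_model = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=300).fit(X, Y)

# At test time, neural activity evoked by unseen speech is mapped back to
# acoustic parameters, which are then converted to audio.
reconstruction_lr = lr_model.predict(X[:10])
reconstruction_dnn = dnn_model.predict(X[:10])

In the spectrogram case the predicted frames are inverted back to a waveform; in the vocoder case they drive a speech synthesizer that has been trained on recordings of real voices.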

Teaming up with neurosurgeon Ashesh Dinesh Mehta, the researchers used electrocorticography to measure neural activity patterns in five epilepsy patients who were already undergoing brain surgery, as the patients listened to continuous stories spoken by four actors. The evoked neural activity recorded from each patient’s auditory cortex was then used to train the LR and DNN models.
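
One way to set up such training data, sketched below as an assumption rather than the paper's actual preprocessing, is to pair each frame of the stimulus spectrogram with a short window of the simultaneously recorded neural activity:

import numpy as np

def make_training_pairs(neural, spectrogram, context=5):
    # neural: (n_frames, n_electrodes); spectrogram: (n_frames, n_bins).
    # Each spectrogram frame is paired with a flattened window of
    # +/- `context` neural frames centred on it.
    n_frames = min(len(neural), len(spectrogram))
    X, Y = [], []
    for t in range(context, n_frames - context):
        window = neural[t - context:t + context + 1]
        X.append(window.ravel())
        Y.append(spectrogram[t])
    return np.asarray(X), np.asarray(Y)

# Illustrative shapes only: 6000 frames, 128 electrodes, 32 spectrogram bins.
neural = np.random.default_rng(1).standard_normal((6000, 128))
spec = np.random.default_rng(2).standard_normal((6000, 32))
X, Y = make_training_pairs(neural, spec)   # X: (5990, 1408), Y: (5990, 32)

The context window lets the model exploit the fact that cortical responses lag and smear the acoustic stimulus in time; the window length used here is purely illustrative.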

Next, the patients listened to eight repeated sentences, enabling the team to objectively evaluate the quality of the models. Comparing the reconstructed auditory spectrograms from each combination of regression model and acoustic representation showed that the overall frequency profile of the speech was better preserved by the DNN than the LR model. The frequency profiles of the voiced speech showed that the harmonic structure was only recovered using the DNN–vocoder combination.
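
A simple way to quantify such comparisons, shown here as a stand-in for the objective measures reported in the study, is to correlate each reconstructed spectrogram with the original, frequency bin by frequency bin:

import numpy as np

def spectrogram_correlation(original, reconstructed):
    # Mean Pearson correlation across frequency bins.
    # Both inputs have shape (n_frames, n_bins).
    corrs = []
    for b in range(original.shape[1]):
        o, r = original[:, b], reconstructed[:, b]
        if o.std() > 0 and r.std() > 0:
            corrs.append(np.corrcoef(o, r)[0, 1])
    return float(np.mean(corrs))

Higher scores mean the reconstruction tracks the original more closely across the frequency range; a model that preserves harmonic structure will score well in the bins that carry those harmonics.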

Counting clearly

The patients next listened to ten digits (zero to nine) spoken by two male and two female speakers. The researchers used each model to reconstruct the 40 sounds. Eleven people with normal hearing then listened to the reconstructed digits in a random order and rated the quality and intelligibility of each.
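
Scoring such a listening test is straightforward. The sketch below assumes intelligibility is the fraction of digits that listeners identify correctly, and uses random placeholder responses purely to show the bookkeeping; the model names and numbers are not data from the study.

import numpy as np

rng = np.random.default_rng(0)
truth = np.tile(np.arange(10), 4)   # 40 recordings: digits 0-9 from four speakers

# responses[model] has shape (n_listeners, n_recordings): the digit each of the
# 11 listeners reported for each reconstructed recording (placeholders here).
responses = {
    "dnn_vocoder": rng.integers(0, 10, size=(11, 40)),
    "lr_spectrogram": rng.integers(0, 10, size=(11, 40)),
}

for model, reported in responses.items():
    intelligibility = (reported == truth).mean()
    print(f"{model}: {intelligibility:.0%} of digits identified correctly")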

Speech reconstructed using the DNN–vocoder

The DNN–vocoder combination exhibited the best intelligibility, with 75% accuracy, a 67% relative increase over the baseline method that used LR to reconstruct the auditory spectrogram (which therefore scored roughly 45%). In all cases, the DNN models performed significantly better than the LR models. The listeners also rated the speech quality significantly higher for the DNN–vocoder system than for the other three models, implying that it sounded closest to natural speech.

“We found that people could understand and repeat the sounds about 75% of the time, which is well above and beyond any previous attempts,” says Mesgarani. “The sensitive vocoder and powerful neural networks represented the sounds the patients had originally listened to with surprising accuracy.”

Next, Mesgarani and his team plan to test more complicated words and sentences. They also want to run the same tests on brain signals generated when a person speaks or imagines speaking. Ultimately, they hope their system could be part of an implant, similar to those worn by some epilepsy patients, that translates the wearer’s thoughts directly into words.

“In this scenario, if the wearer thinks ‘I need a glass of water’, our system could take the brain signals generated by that thought and turn them into synthesized, verbal speech,” explains Mesgarani. “This would be a game changer. It would give anyone who has lost their ability to speak, whether through injury or disease, the renewed chance to connect to the world around them.”
