Nature. 2023 Aug;620(7976):1037-1046. doi: 10.1038/s41586-023-06443-4. Epub 2023 Aug 23.

A high-performance neuroprosthesis for speech decoding and avatar control


Sean L Metzger et al. Nature. 2023 Aug.

Abstract

Speech neuroprostheses have the potential to restore communication to people living with paralysis, but naturalistic speed and expressivity are elusive1. Here we use high-density surface recordings of the speech cortex in a clinical-trial participant with severe limb and vocal paralysis to achieve high-performance real-time decoding across three complementary speech-related output modalities: text, speech audio and facial-avatar animation. We trained and evaluated deep-learning models using neural data collected as the participant attempted to silently speak sentences. For text, we demonstrate accurate and rapid large-vocabulary decoding with a median rate of 78 words per minute and median word error rate of 25%. For speech audio, we demonstrate intelligible and rapid speech synthesis and personalization to the participant's pre-injury voice. For facial-avatar animation, we demonstrate the control of virtual orofacial movements for speech and non-speech communicative gestures. The decoders reached high performance with less than two weeks of training. Our findings introduce a multimodal speech-neuroprosthetic approach that has substantial promise to restore full, embodied communication to people living with severe paralysis.

Conflict of interest statement

Competing interests S.L.M., D.A.M., J.R.L. and E.F.C. are inventors on a pending provisional UCSF patent application that is relevant to the neural-decoding approaches used in this work. G.K.A. and E.F.C. are inventors on patent application PCT/US2020/028926, D.A.M. and E.F.C. are inventors on patent application PCT/US2020/043706, and E.F.C. is an inventor on patent US9905239B2, which are broadly relevant to the neural-decoding approaches in this work. M.A.B. is chief technical officer at Speech Graphics. All other authors declare no competing interests.

Figures

Extended Data Fig. 1 |. Relationship between PER and WER.
Relationship between phone error rate and word error rate across n = 549 points. Each point represents the phone and word error rates for all sentences used during model evaluation, across all evaluation sets. The points follow a linear trend, with the linear fit yielding an R² of 0.925. Shading denotes the 99% confidence interval, calculated by bootstrapping over 2,000 iterations.
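The linear fit and its bootstrapped confidence band can be reproduced in outline as follows. This is a minimal sketch, not the authors' analysis code; the `per` and `wer` arrays are hypothetical stand-ins for the n = 549 evaluation points.

```python
# Minimal sketch: linear fit between phone error rate (PER) and word error
# rate (WER) with a bootstrapped 99% confidence band (2,000 resamples).
import numpy as np

rng = np.random.default_rng(0)
per = rng.uniform(0, 0.6, 549)                 # placeholder PER values
wer = 1.2 * per + rng.normal(0, 0.05, 549)     # placeholder WER values

slope, intercept = np.polyfit(per, wer, 1)
r2 = np.corrcoef(per, wer)[0, 1] ** 2          # R^2 of the linear relationship

grid = np.linspace(per.min(), per.max(), 100)
boot_lines = np.empty((2000, grid.size))
for i in range(2000):
    idx = rng.integers(0, per.size, per.size)  # resample points with replacement
    b, a = np.polyfit(per[idx], wer[idx], 1)
    boot_lines[i] = b * grid + a
lo, hi = np.percentile(boot_lines, [0.5, 99.5], axis=0)  # 99% CI band over the grid
print(f"R^2 = {r2:.3f}, slope = {slope:.2f}")
```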
Extended Data Fig. 2 |. Character and phone error rates for simulated text decoding with larger vocabularies.
a,b, We computed character (a) and phone (b) error rates on sentences obtained by simulating text decoding with the 1024-word-General sentence set using log-spaced vocabularies of 1,506, 2,269, 3,419, 5,152, 7,763, 11,696, 17,621, 26,549 and 39,378 words, and we compared performance to the real-time results obtained with our 1,024-word vocabulary. Each point represents the median character or phone error rate across n = 25 real-time evaluation pseudo-blocks, and error bars represent 99% confidence intervals of the median. With our largest vocabulary of 39,378 words, we found a median character error rate of 21.7% (99% CI [16.3%, 28.1%]) and a median phone error rate of 20.6% (99% CI [15.9%, 26.1%]). We compared the WER, CER and PER of the simulation with the largest vocabulary size to the real-time results and found no significant increase in any error rate (P > 0.01 for all comparisons; test statistics = 48.5, 93.0 and 88.0, with P = 0.342, 0.342 and 0.239, respectively; Wilcoxon signed-rank test with 3-way Holm–Bonferroni correction).
Extended Data Fig. 3 |. Simulated text decoding results on the 50-phrase-AAC sentence set.
a–c, We computed phone (a), word (b) and character (c) error rates on simulated text-decoding results with the real-time 50-phrase-AAC blocks used for evaluation of the synthesis models. Across n = 15 pseudo-blocks, we observed a median PER of 5.63% (99% CI [2.10%, 12.0%]), median WER of 4.92% (99% CI [3.18%, 14.0%]) and median CER of 5.91% (99% CI [2.21%, 11.4%]). The PER, WER and CER were also significantly better than chance (P < 0.001 for all metrics, Wilcoxon signed-rank test with 3-way Holm–Bonferroni correction for multiple comparisons). Statistics compare n = 15 total pseudo-blocks. For PER: statistic = 0, P = 1.83e-4. For CER: statistic = 0, P = 1.83e-4. For WER: statistic = 0, P = 1.83e-4. d, Speech was decoded at high rates, with a median WPM of 101 (99% CI [95.6, 103]).
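For reference, PER, WER and CER are all token-level edit distances normalized by the reference length. Below is a minimal sketch of the metric, assuming the standard Levenshtein formulation rather than the authors' exact evaluation code; the example sentences are hypothetical.

```python
# Minimal sketch: word error rate (WER) as word-level edit distance divided by
# the reference length. The same routine yields CER or PER when the token
# sequences are characters or phones instead of words.
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (rolling-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def error_rate(ref_tokens, hyp_tokens):
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

# Hypothetical decoded output with one deleted word -> WER of 0.25.
print(error_rate("i need my glasses".split(), "i need glasses".split()))
```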
Extended Data Fig. 4 |. Simulated text decoding results on the 529-phrase-AAC sentence set.
a–c, We computed phone (a), word (b) and character (c) error rates on simulated text-decoding results with the real-time 529-phrase-AAC blocks used for evaluation of the synthesis models. Across n = 15 pseudo-blocks, we observed a median PER of 17.3% (99% CI [12.6%, 20.1%]), median WER of 17.1% (99% CI [8.89%, 28.9%]) and median CER of 15.2% (99% CI [10.1%, 22.7%]). The PER, WER and CER were also significantly better than chance (P < 0.001 for all metrics, two-sided Wilcoxon signed-rank test with 3-way Holm–Bonferroni correction for multiple comparisons). Statistics compare n = 15 total pseudo-blocks. For PER: statistic = 0, P = 1.83e-4. For CER: statistic = 0, P = 1.83e-4. For WER: statistic = 0, P = 1.83e-4. d, Speech was decoded at high rates, with a median WPM of 89.9 (99% CI [83.6, 93.3]).
Extended Data Fig. 5 |. Mel-cepstral distortions (MCDs) using a personalized voice tailored to the participant.
We calculated the mel-cepstral distortion (MCD) between speech decoded with the participant’s personalized voice and voice-converted reference waveforms for the 50-phrase-AAC, 529-phrase-AAC and 1024-word-General sets. Lower MCD indicates better performance. We achieved mean MCDs of 3.87 dB (99% CI [3.83, 4.45]), 5.12 dB (99% CI [4.41, 5.35]) and 5.57 dB (99% CI [5.17, 5.90]) for the 50-phrase-AAC (n = 15 pseudo-blocks), 529-phrase-AAC (n = 15 pseudo-blocks) and 1024-word-General (n = 20 pseudo-blocks) sets, respectively. Chance MCDs were computed by shuffling electrode indices in the test data, using the same synthesis pipeline, on the 50-phrase-AAC evaluation set. The MCDs of all sets are significantly lower than chance. For the 529-phrase-AAC vs. 1024-word-General comparison, ***P < 0.001; otherwise ****P < 0.0001. Two-sided Wilcoxon rank-sum tests were used for within-dataset comparisons and Mann–Whitney U-tests for between-dataset comparisons, with 9-way Holm–Bonferroni correction.
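MCD compares mel-cepstral coefficient sequences frame by frame. The sketch below uses the conventional dB formulation; it is an assumption about the standard metric rather than the authors' exact pipeline, and the input arrays are hypothetical.

```python
# Minimal sketch: mel-cepstral distortion (MCD, in dB) between two time-aligned
# sequences of mel-cepstral coefficients of shape (frames, n_coeffs).
import numpy as np

def mel_cepstral_distortion(mc_decoded, mc_reference):
    # Exclude the 0th coefficient (overall energy), as is conventional.
    diff = mc_decoded[:, 1:] - mc_reference[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return per_frame.mean()  # average distortion in dB across frames

# Hypothetical example: random 25-dimensional mel cepstra over 200 frames.
rng = np.random.default_rng(0)
ref = rng.normal(size=(200, 25))
print(mel_cepstral_distortion(ref + 0.05 * rng.normal(size=ref.shape), ref))
```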
Extended Data Fig. 6 |. Comparison of perceptual word error rate and mel-cepstral distortion.
Scatter plot illustrating the relationship between perceptual word error rate (WER) and mel-cepstral distortion (MCD) for the 50-phrase-AAC, 529-phrase-AAC and 1024-word-General sentence sets. Each data point represents the mean accuracy from a single pseudo-block. A dashed black line indicates the best linear fit to the pseudo-blocks, providing a visual representation of the overall trend. Consistent with expectation, this plot suggests a positive correlation between WER and MCD for our speech synthesizer.
Extended Data Fig. 7 |. Electrode contributions to decoding performance.
a, MRI reconstruction of the participant’s brain overlaid with the locations of implanted electrodes. Cortical regions and electrodes are colored according to anatomical region (PoCG: postcentral gyrus, PrCG: precentral gyrus, SMC: sensorimotor cortex). b–d, Electrode contributions to text decoding (b), speech synthesis (c), and avatar direct decoding (d). Black lines denote the central sulcus (CS) and sylvian fissure (SF). e–g, Each plot shows each electrode’s contributions to two modalities as well as the Pearson correlation across electrodes and associated p-value.
Extended Data Fig. 8 |. Effect of anatomical regions on decoding performance.
a–c, Effect of excluding each region during training and testing on text-decoding word error rate (a), speech-synthesis mel-cepstral distortion (b), and avatar-direct-decoding correlation (c; average DTW correlation of jaw, lip, and mouth-width landmarks between the avatar and healthy speakers), computed using neural data as the participant attempted to silently say sentences from the 1024-word-General set. Significance markers indicate comparisons against the None condition, which uses all electrodes. *P < 0.01, **P < 0.005, ***P < 0.001, ****P < 0.0001, two-sided Wilcoxon signed-rank test with 15-way Holm-Bonferroni correction (full comparisons are given in Table S5). Distributions are over 25 pseudo-blocks for text decoding, 20 pseudo-blocks for speech synthesis, and 152 pseudo-blocks (19 pseudo-blocks each for 8 healthy speakers) for avatar direct decoding.
Fig. 1 |. Multimodal speech decoding in a participant with vocal-tract paralysis.
a, Overview of the speech-decoding pipeline. A brainstem-stroke survivor with anarthria was implanted with a 253-channel high-density ECoG array 18 years after injury. Neural activity was processed and used to train deep-learning models to predict phone probabilities, speech-sound features and articulatory gestures. These outputs were used to decode text, synthesize audible speech and animate a virtual avatar, respectively. b, A sagittal magnetic resonance imaging scan showing brainstem atrophy (in the bilateral pons; red arrow) resulting from stroke. c, Magnetic resonance imaging reconstruction of the participant’s brain overlaid with the locations of implanted electrodes. The ECoG array was implanted over the participant’s lateral cortex, centred on the central sulcus. d, Top: simple articulatory movements attempted by the participant. Middle: electrode-activation maps demonstrating robust electrode tunings across articulators during attempted movements. Only the electrodes with the strongest responses (top 20%) are shown for each movement type. Colour indicates the magnitude of the average evoked high-gamma activity (HGA) response with each type of movement. Bottom: z-scored trial-averaged evoked HGA responses with each movement type for each of the outlined electrodes in the electrode-activation maps. In each plot, each response trace shows mean ± standard error across trials and is aligned to the peak-activation time (n = 130 trials for jaw open, n = 260 trials each for lips forwards or back and tongue up or down).
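The z-scored, trial-averaged evoked responses in d can be outlined as follows. This is a minimal sketch with synthetic data, assuming a simple baseline z-scoring scheme rather than the study's actual preprocessing.

```python
# Minimal sketch: z-score high-gamma activity (HGA) per trial, average across
# trials, and report mean ± standard error aligned to the peak activation.
import numpy as np

rng = np.random.default_rng(0)
hga = rng.normal(size=(130, 200))          # trials x time samples, one electrode

# z-score each trial against an assumed pre-movement baseline (first 50 samples).
baseline = hga[:, :50]
z = (hga - baseline.mean(axis=1, keepdims=True)) / baseline.std(axis=1, keepdims=True)

mean_trace = z.mean(axis=0)                              # trial-averaged response
sem_trace = z.std(axis=0, ddof=1) / np.sqrt(z.shape[0])  # standard error across trials
peak_idx = np.argmax(mean_trace)                         # peak-activation time
print(peak_idx, mean_trace[peak_idx], sem_trace[peak_idx])
```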
Fig. 2 |. High-performance text decoding from neural activity.
a, During attempts by the participant to silently speak, a bidirectional RNN decodes neural features into a time series of phone and silence (denoted as Ø) probabilities. From these probabilities, a CTC beam search computes the most likely sequence of phones that can be translated into words in the vocabulary. An n-gram language model rescores sentences created from these sequences to yield the most likely sentence. b, Median PERs, calculated using shuffled neural data (Chance), neural decoding without applying vocabulary constraints or language modelling (Neural decoding alone) and the full real-time system (Real-time results) across n = 25 pseudo-blocks. c,d, Word (c) and character (d) error rates for chance and real-time results. In b–d, ****P < 0.0001, two-sided Wilcoxon signed-rank test with five-way Holm–Bonferroni correction for multiple comparisons; P values and statistics in Extended Data Table 1. e, Decoded WPM. Dashed line denotes previous state-of-the-art speech BCI decoding rate in a person with paralysis. f, Offline evaluation of error rates as a function of training-data quantity. g, Offline evaluation of WER as a function of the number of words used to apply vocabulary constraints and train the language model. Error bars in f,g represent 99% CIs of the median, calculated using 1,000 bootstraps across n = 125 pseudo-blocks (f) and n = 25 pseudo-blocks (g) at each point. h, Decoder stability as assessed using real-time classification accuracy during attempts to silently say 26 NATO code words across days and weeks. The vertical line represents when the classifier was no longer retrained before each session. In b–g, results were computed using the real-time evaluation trials with the 1024-word-General sentence set. Box plots in all figures depict median (horizontal line inside box), 25th and 75th percentiles (box) ± 1.5 times the interquartile range (whiskers) and outliers (diamonds).
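As a simplified illustration of the decoding step in a, the sketch below performs greedy CTC decoding of a phone-probability time series; it stands in for, but is much simpler than, the CTC beam search with n-gram language-model rescoring described above. The phone inventory and probabilities are hypothetical.

```python
# Minimal sketch: greedy CTC decoding of a (time, phones+1) probability array,
# where the last column is the CTC blank/silence token (Ø).
import numpy as np

PHONES = ["AA", "B", "K", "T"]           # toy phone inventory (assumption)
BLANK = len(PHONES)                      # index of the blank token

def greedy_ctc_decode(probs):
    best = probs.argmax(axis=1)          # most likely symbol at each time step
    collapsed, prev = [], None
    for idx in best:
        if idx != BLANK and idx != prev: # drop blanks and repeated symbols
            collapsed.append(PHONES[idx])
        prev = idx
    return collapsed

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(len(PHONES) + 1), size=50)  # fake RNN output
print(greedy_ctc_decode(probs))
```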
Fig. 3 |. Intelligible speech synthesis from neural activity.
a, Schematic diagram of the speech-synthesis decoding algorithm. During attempts by the participant to silently speak, a bidirectional RNN decodes neural features into a time series of discrete speech units. The RNN was trained using reference speech units computed by applying a large pretrained acoustic model (HuBERT) on basis waveforms. Predicted speech units are then transformed into the mel spectrogram and vocoded into audible speech. The decoded waveform is played back to the participant in real time after a brief delay. Offline, the decoded speech was transformed to be in the participant’s personalized synthetic voice using a voice-conversion model. b, Top two rows: three example decoded spectrograms and accompanying perceptual transcriptions (top) and waveforms (bottom) from the 529-phrase-AAC sentence set. Bottom two rows: the corresponding reference spectrograms, transcriptions and waveforms representing the decoding targets. c, MCDs for the decoded waveforms during real-time evaluation with the three sentence sets and from chance waveforms computed offline. Lower MCD indicates better performance. Chance waveforms were computed by shuffling electrode indices in the test data for the 50-phrase-AAC set with the same synthesis pipeline. d, Perceptual WERs from untrained human evaluators during a transcription task. e, Perceptual CERs from the same human-evaluation results as d. In c–e, ****P < 0.0001, Mann–Whitney U-test with 19-way Holm–Bonferroni correction for multiple comparisons; all non-adjacent comparisons were also significant (P < 0.0001; not depicted); n = 15 pseudo-blocks for the AAC sets, n = 20 pseudo-blocks for the 1024-word-General set. P values and statistics in Extended Data Table 2. In b–e, all decoded waveforms, spectrograms and quantitative results use the non-personalized voice (see Extended Data Fig. 5 and Supplementary Table 1 for results with the personalized voice). A.u., arbitrary units.
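As a rough illustration of the final vocoding step in a, the sketch below inverts a mel spectrogram to audio with Griffin-Lim via librosa. This is only a stand-in for the vocoder used in the study, and the spectrogram here is a random placeholder.

```python
# Minimal sketch: convert a predicted mel spectrogram into a waveform using
# Griffin-Lim phase reconstruction (librosa) as a simple vocoder substitute.
import numpy as np
import librosa

sr = 16000                                                      # assumed sampling rate
mel = np.abs(np.random.default_rng(0).normal(size=(80, 120)))   # fake (n_mels, frames) output

# Invert the mel filterbank and reconstruct phase with Griffin-Lim.
waveform = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
print(waveform.shape)
```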
Fig. 4 |. Direct decoding of orofacial articulatory gestures from neural activity to drive an avatar.
a, Schematic diagram of the avatar-decoding algorithm. Offline, a bidirectional RNN decodes neural activity recorded during attempts to silently speak into discretized articulatory gestures (quantized by a VQ-VAE). A convolutional neural network dequantizer (VQ-VAE decoder) is then applied to generate the final predicted gestures, which are then passed through a pretrained gesture-animation model to animate the avatar in a virtual environment. b, Binary perceptual accuracies from human evaluators on avatar animations generated from neural activity, n = 2,000 bootstrapped points. c, Correlations after applying dynamic time warping (DTW) for jaw, lip and mouth-width movements between decoded avatar renderings and videos of real human speakers on the 1024-word-General sentence set across all pseudo-blocks for each comparison (n = 152 for avatar–person comparison, n = 532 for person–person comparisons; ****P < 0.0001, Mann–Whitney U-test with nine-way Holm–Bonferroni correction; P values and U-statistics in Supplementary Table 3). A facial-landmark detector (dlib) was used to measure orofacial movements from the videos. d, Top: snapshots of avatar animations of six non-speech articulatory movements in the articulatory-movement task. Bottom: confusion matrix depicting classification accuracy across the movements. The classifier was trained to predict which movement the participant was attempting from her neural activity, and the prediction was used to animate the avatar. e, Top: snapshots of avatar animations of three non-speech emotional expressions in the emotional-expression task. Bottom: confusion matrix depicting classification accuracy across three intensity levels (high, medium and low) of the three expressions, ordered using a hierarchical agglomerative clustering on the confusion values. The classifier was trained to predict which expression the participant was attempting from her neural activity, and the prediction was used to animate the avatar.
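The DTW correlations in c can be outlined as follows. This is a minimal sketch, not the authors' analysis code: the landmark trajectories are synthetic and the DTW implementation is a textbook version.

```python
# Minimal sketch: dynamic time warping (DTW) between two 1-D landmark
# trajectories (e.g. jaw aperture from the avatar and from a human speaker),
# followed by a Pearson correlation along the warp path.
import numpy as np

def dtw_path(x, y):
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Trace back the optimal alignment from the end of both sequences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

rng = np.random.default_rng(0)
avatar = np.sin(np.linspace(0, 6, 80)) + 0.1 * rng.normal(size=80)   # synthetic traces
speaker = np.sin(np.linspace(0, 6, 100))
ia, ib = zip(*dtw_path(avatar, speaker))
r = np.corrcoef(avatar[list(ia)], speaker[list(ib)])[0, 1]           # DTW correlation
print(round(r, 3))
```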
Fig. 5 |. Articulatory encodings driving speech decoding.
a, Mid-sagittal schematic of the vocal tract with phone POA features labelled. b, Phone-encoding vectors for each electrode computed by a temporal receptive-field model on neural activity recorded during attempts to silently say sentences from the 1024-word-General set, organized by unsupervised hierarchical clustering. a.u., arbitrary units. c, z-scored POA encodings for each electrode, computed by averaging across positive phone encodings within each POA category. z values are clipped at 0. d,e, Projection of consonant (d) and vowel (e) phone encodings into a 2D space using multidimensional scaling (MDS). f, Bottom right: visualization of the locations of electrodes with the greatest encoding weights for labial, front-tongue and vocalic phones on the ECoG array. The electrodes that most strongly encoded finger flexion during the NATO-motor task are also included. Only the top 30% of electrodes within each condition are shown, and the strongest tuning was used for categorization if an electrode was in the top 30% for multiple conditions. Black lines denote the central sulcus (CS) and Sylvian fissure (SF). Top and left: the spatial electrode distributions for each condition along the anterior–posterior and ventral–dorsal axes, respectively. g–i, Electrode-tuning comparisons between front-tongue phone encoding and tongue-raising attempts (g; r = 0.84, P < 0.0001, ordinary least-squares regression), labial phone encoding and lip-puckering attempts (h; r = 0.89, P < 0.0001, ordinary least-squares regression) and tongue-raising and lip-rounding attempts (i). Non-phonetic tunings were computed from neural activations during the articulatory-movement task. Each plot depicts the same electrodes encoding front-tongue and labial phones (from f) as blue and orange dots, respectively; all other electrodes are shown as grey dots. Max., maximum; min., minimum.
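The MDS projections in d,e can be outlined as follows. The sketch uses a hypothetical phone-encoding matrix in place of the fitted temporal receptive-field weights, and the matrix dimensions are assumptions.

```python
# Minimal sketch: project phone-encoding vectors into 2-D with multidimensional
# scaling (MDS), analogous to the consonant and vowel layouts in panels d,e.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
phone_encodings = rng.normal(size=(39, 16))   # hypothetical: 39 phones x 16 encoding dims

mds = MDS(n_components=2, dissimilarity="euclidean", random_state=0)
coords = mds.fit_transform(phone_encodings)   # (39, 2) layout preserving pairwise distances
print(coords[:3])
```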

Comment in

  • Restoring speech.
    Whalley K. Nat Rev Neurosci. 2023 Nov;24(11):653. doi: 10.1038/s41583-023-00746-1. PMID: 37740095. No abstract available.

References

    1. Moses DA et al. Neuroprosthesis for decoding speech in a paralyzed person with anarthria. N. Engl. J. Med. 385, 217–227 (2021).
    2. Peters B et al. Brain-computer interface users speak up: The Virtual Users’ Forum at the 2013 International Brain-Computer Interface Meeting. Arch. Phys. Med. Rehabil. 96, S33–S37 (2015).
    3. Metzger SL et al. Generalizable spelling using a speech neuroprosthesis in an individual with severe limb and vocal paralysis. Nat. Commun. 13, 6510 (2022).
    4. Beukelman DR et al. Augmentative and Alternative Communication (Paul H. Brookes, 1998).
    5. Graves A, Fernández S, Gomez F & Schmidhuber J. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. 23rd International Conference on Machine Learning (ICML ’06) (eds Cohen W & Moore A) 369–376 (ACM Press, 2006); doi:10.1145/1143844.1143891.