Visual-only recognition of normal, whispered and silent speech
Silent speech interfaces have recently been proposed as a way to enable communication when the acoustic signal is not available. This introduces the need to build visual speech recognition systems for silent and whispered speech. However, almost all the recently proposed systems have been trained on vocalised data only. This is in contrast with evidence in the literature which suggests that lip movements change depending on the speech mode. In this work, we introduce a new audiovisual database which is publicly available and contains normal, whispered and silent speech. To the best of our knowledge, this is the first study which investigates the differences between the three speech modes using the visual modality only. We show that an absolute decrease in classification rate of up to 3.7% is observed when training and testing on normal and whispered, respectively, and vice versa. An even higher decrease of up to 8.5% is reported when the models are tested on silent speech. This reveals that there are indeed visual differences between the three speech modes, and that the common assumption that vocalised training data can be used directly to train a silent speech recognition system may not be true.
EMG-to-Speech: Direct Generation of Speech from Facial Electromyographic Signals
The general objective of this work is the design, implementation, improvement and evaluation of a system that uses surface electromyographic (EMG) signals and directly synthesizes an audible speech output: EMG-to-speech.
A silent speech system based on permanent magnet articulography and direct synthesis
In this paper we present a silent speech interface (SSI) system aimed at restoring speech communication for individuals who have lost their voice due to laryngectomy or diseases affecting the vocal folds. In the proposed system, articulatory data captured from the lips and tongue using permanent magnet articulography (PMA) are converted into audible speech using a speaker-dependent transformation learned from simultaneous recordings of PMA and audio signals acquired before laryngectomy. The transformation is represented using a mixture of factor analysers, which is a generative model that allows us to efficiently model non-linear behaviour and perform dimensionality reduction at the same time. The learned transformation is then deployed during normal usage of the SSI to restore the acoustic speech signal associated with the captured PMA data. The proposed system is evaluated using objective quality measures and listening tests on two databases containing PMA and audio recordings for normal speakers. Results show that it is possible to reconstruct speech from articulator movements captured by an unobtrusive technique without an intermediate recognition step. The SSI is capable of producing speech of sufficient intelligibility and naturalness that the speaker is clearly identifiable, but problems remain in scaling up the process to function consistently for phonetically rich vocabularies.
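For illustration, the sketch below shows how a conversion of this kind can be expressed as a conditional expectation under a joint generative model of PMA and acoustic frames. It uses scikit-learn's full-covariance GaussianMixture as a stand-in for the paper's mixture of factor analysers (an MFA is equivalent to a GMM with low-rank-plus-diagonal covariances); the feature dimensions, data and function names are illustrative assumptions, not the authors' implementation.

```python
# Sketch: GMM-based regression from PMA features to acoustic features, standing in
# for a mixture-of-factor-analysers mapping. Dimensions and data are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_joint_model(pma, audio, n_components=16, seed=0):
    """Fit a joint Gaussian mixture on stacked [PMA | audio] frames."""
    joint = np.hstack([pma, audio])
    return GaussianMixture(n_components=n_components,
                           covariance_type="full", random_state=seed).fit(joint)

def pma_to_audio(gmm, pma, d_pma):
    """Convert PMA frames to acoustic frames via the conditional mean E[audio | pma]."""
    n, d_audio = len(pma), gmm.means_.shape[1] - d_pma
    log_resp = np.zeros((n, gmm.n_components))
    for k in range(gmm.n_components):
        mu_x = gmm.means_[k, :d_pma]
        s_xx = gmm.covariances_[k, :d_pma, :d_pma]
        diff = pma - mu_x
        sol = np.linalg.solve(s_xx, diff.T).T
        # Log responsibility of component k given the PMA part only
        # (constants cancel when normalising across components).
        log_resp[:, k] = (np.log(gmm.weights_[k])
                          - 0.5 * (np.sum(diff * sol, axis=1)
                                   + np.linalg.slogdet(s_xx)[1]))
    resp = np.exp(log_resp - log_resp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)

    out = np.zeros((n, d_audio))
    for k in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[k, :d_pma], gmm.means_[k, d_pma:]
        s_xx = gmm.covariances_[k, :d_pma, :d_pma]
        s_yx = gmm.covariances_[k, d_pma:, :d_pma]
        cond_mean = mu_y + (s_yx @ np.linalg.solve(s_xx, (pma - mu_x).T)).T
        out += resp[:, k:k + 1] * cond_mean
    return out
```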
Silent versus modal multi-speaker speech recognition from ultrasound and video
We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing. We improve silent speech recognition performance using techniques that address the domain mismatch, such as fMLLR and unsupervised model adaptation. We also analyse the properties of silent and modal speech in terms of utterance duration and the size of the articulatory space. To estimate the articulatory space, we compute the convex hull of tongue splines extracted from ultrasound tongue images. Overall, we observe that the duration of silent speech is longer than that of modal speech, and that silent speech covers a smaller articulatory space than modal speech. Although these two properties are statistically significant across speaking modes, they do not directly correlate with word error rates from speech recognition.
Comment: 5 pages, 5 figures, Submitted to Interspeech 202
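As a rough illustration of the articulatory-space analysis described above, the sketch below computes the area of the 2-D convex hull of pooled tongue-spline points with scipy.spatial.ConvexHull. The spline coordinates, array shapes and function name are placeholders, not the paper's data or code.

```python
# Sketch: estimate the "articulatory space" of an utterance as the area of the
# convex hull of tongue-spline points. Coordinates below are toy placeholders.
import numpy as np
from scipy.spatial import ConvexHull

def articulatory_space_area(splines):
    """splines: array of shape (n_frames, n_points, 2) with (x, y) spline coordinates."""
    points = splines.reshape(-1, 2)   # pool spline points over all frames
    hull = ConvexHull(points)         # 2-D convex hull of the pooled points
    return hull.volume                # for 2-D input, .volume is the enclosed area

# Toy usage with random "splines"; silent vs. modal utterances would be compared this way.
rng = np.random.default_rng(0)
silent = articulatory_space_area(rng.normal(size=(200, 30, 2)) * [1.0, 0.8])
modal = articulatory_space_area(rng.normal(size=(200, 30, 2)))
print(f"silent hull area: {silent:.2f}, modal hull area: {modal:.2f}")
```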
Advancing Electromyographic Continuous Speech Recognition: Signal Preprocessing and Modeling
Speech is the natural medium of human communication, but audible speech can be overheard by bystanders and excludes speech-disabled people. This work presents a speech recognizer based on surface electromyography, where electric potentials of the facial muscles are captured by surface electrodes, allowing speech to be processed nonacoustically. A system which was state-of-the-art at the beginning of this book is substantially improved in terms of accuracy, flexibility, and robustness
Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning
This paper presents a novel metric learning approach to address the performance gap between normal and silent speech in visual speech recognition (VSR). The difference in lip movements between the two poses a challenge for existing VSR models, which exhibit degraded accuracy when applied to silent speech. To solve this issue and tackle the scarcity of training data for silent speech, we propose to leverage the shared literal content between normal and silent speech and present a metric learning approach based on visemes. Specifically, we aim to map the inputs of the two speech types close to each other in a latent space if they have similar viseme representations. By minimizing the Kullback-Leibler divergence of the predicted viseme probability distributions between and within the two speech types, our model effectively learns and predicts viseme identities. Our evaluation demonstrates that our method improves the accuracy of silent VSR, even when limited training data is available.
Comment: Accepted by INTERSPEECH 202
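The sketch below gives one possible form of such a viseme-based objective: a symmetric KL term that pulls the predicted viseme distributions of paired normal and silent inputs together, plus cross-entropy terms that keep both branches predicting the correct viseme. It is written in PyTorch; the pairing, loss weighting and function names are assumptions, not the authors' implementation.

```python
# Sketch: KL-based viseme metric-learning loss for paired normal/silent inputs.
import torch
import torch.nn.functional as F

def viseme_kl_loss(logits_normal, logits_silent, viseme_targets, alpha=1.0):
    """logits_*: (batch, n_visemes); viseme_targets: (batch,) ground-truth viseme ids."""
    log_p_normal = F.log_softmax(logits_normal, dim=-1)
    log_p_silent = F.log_softmax(logits_silent, dim=-1)
    # Cross-mode term: symmetric KL pulls the two predicted distributions together.
    kl_ns = F.kl_div(log_p_silent, log_p_normal.exp(), reduction="batchmean")
    kl_sn = F.kl_div(log_p_normal, log_p_silent.exp(), reduction="batchmean")
    # Within-mode term: both branches should still predict the correct viseme.
    ce = F.cross_entropy(logits_normal, viseme_targets) \
       + F.cross_entropy(logits_silent, viseme_targets)
    return ce + alpha * (kl_ns + kl_sn)

# Toy usage with random logits (13 viseme classes assumed for illustration).
logits_n = torch.randn(8, 13, requires_grad=True)
logits_s = torch.randn(8, 13, requires_grad=True)
targets = torch.randint(0, 13, (8,))
viseme_kl_loss(logits_n, logits_s, targets).backward()
```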