100 research outputs found
Ultrasound-Based Silent Speech Interface Built on a Continuous Vocoder
It was recently shown within the Silent Speech Interface (SSI) field that F0 can be
predicted from Ultrasound Tongue Images (UTI) as the articulatory input, using
Deep Neural Networks for articulatory-to-acoustic
mapping. Moreover, text-to-speech synthesizers were shown to produce higher
quality speech when using a continuous pitch estimate, which takes non-zero
pitch values even when voicing is not present. Therefore, in this paper on
UTI-based SSI, we use a simple continuous F0 tracker which does not apply a
strict voiced/unvoiced decision. Continuous vocoder parameters (ContF0,
Maximum Voiced Frequency and Mel-Generalized Cepstrum) are predicted using a
convolutional neural network, with UTI as input. The results demonstrate that
during the articulatory-to-acoustic mapping experiments, the continuous F0 is
predicted with lower error, and the continuous vocoder produces slightly more
natural synthesized speech than the baseline vocoder using standard
discontinuous F0.
Comment: 5 pages, 3 figures, accepted for publication at Interspeech 201
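As a rough illustration of such an articulatory-to-acoustic mapping, the sketch below shows a small convolutional network regressing one ultrasound tongue image to one frame of continuous vocoder parameters. The image size, layer sizes, and the 27-dimensional target (ContF0, Maximum Voiced Frequency, and a 25-dimensional MGC vector) are assumptions for illustration, not the paper's exact configuration:

```python
# Minimal sketch of per-frame articulatory-to-acoustic regression.
# Shapes and layer sizes are illustrative assumptions, not the paper's setup.
import numpy as np
from tensorflow.keras import layers, models

N_TARGETS = 27  # assumed: ContF0 + Maximum Voiced Frequency + 25-dim MGC

def build_uti_to_vocoder_cnn(img_height=64, img_width=128):
    """CNN mapping one ultrasound tongue image to one frame of vocoder parameters."""
    inp = layers.Input(shape=(img_height, img_width, 1))
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)
    out = layers.Dense(N_TARGETS)(x)  # linear output for regression
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model

# Training would use (ultrasound image, vocoder parameter) pairs per frame;
# random data stands in for the corpus here.
model = build_uti_to_vocoder_cnn()
dummy_x = np.random.rand(8, 64, 128, 1).astype("float32")
dummy_y = np.random.rand(8, N_TARGETS).astype("float32")
model.fit(dummy_x, dummy_y, epochs=1, verbose=0)
```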
EMG-to-Speech: Direct Generation of Speech from Facial Electromyographic Signals
The general objective of this work is the design, implementation, improvement and evaluation of a system that uses surface electromyographic (EMG) signals and directly synthesizes an audible speech output: EMG-to-speech.
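At its core, such a system is a frame-wise regression from EMG feature vectors to acoustic feature vectors that are subsequently vocoded into audio. A minimal sketch of that mapping stage follows; the feature dimensions and the regressor are assumptions for illustration, not the pipeline used in this work:

```python
# Frame-wise EMG-to-acoustic regression sketch; dimensions and the regressor
# are illustrative assumptions, not the thesis's actual pipeline.
import numpy as np
from sklearn.neural_network import MLPRegressor

n_frames, n_emg_feats, n_acoustic = 1000, 32, 25  # assumed dimensions
X = np.random.rand(n_frames, n_emg_feats)   # stacked EMG features per frame
Y = np.random.rand(n_frames, n_acoustic)    # target spectral features per frame

reg = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=200)
reg.fit(X, Y)           # learn the EMG-to-acoustic mapping
Y_hat = reg.predict(X)  # predicted frames would then be vocoded into audio
```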
Reconstructing intelligible audio speech from visual speech features
This work describes an investigation into the feasibility of producing intelligible audio speech from only visual speech features. The proposed method aims to estimate a spectral envelope from visual features, which is then combined with an artificial excitation signal and used within a model of speech production to reconstruct an audio signal. Different combinations of audio and visual features are considered, along with both a statistical method of estimation and a deep neural network. The intelligibility of the reconstructed audio speech is measured by human listeners, and then compared to the intelligibility of the video signal alone and of the video combined with the reconstructed audio.
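The reconstruction follows a classical source-filter idea: an artificial excitation is shaped by the spectral envelope estimated from the visual features. The sketch below resynthesizes audio under the assumption that the envelope is available as per-frame LPC coefficients and that a constant-F0 pulse train serves as excitation; both choices are illustrative, not the paper's exact method:

```python
# Source-filter resynthesis sketch: an artificial excitation is filtered by a
# per-frame spectral envelope, here assumed to be given as LPC coefficients.
import numpy as np
from scipy.signal import lfilter

def synthesize(lpc_frames, fs=16000, frame_len=160, f0=120.0, gain=0.1):
    """lpc_frames: (n_frames, order+1) array, each row [1, a1, ..., ap]."""
    period = int(fs / f0)            # constant-F0 pulse train as voiced excitation
    next_pulse, pos, out = 0, 0, []
    for a in lpc_frames:
        exc = np.zeros(frame_len)
        while next_pulse < pos + frame_len:
            exc[next_pulse - pos] = 1.0
            next_pulse += period
        pos += frame_len
        # All-pole synthesis filter 1/A(z); filter-state continuity across
        # frames is ignored here for brevity.
        out.append(lfilter([gain], a, exc))
    return np.concatenate(out)

# Example with a flat (resonance-free) order-10 envelope over 50 frames.
frames = np.tile(np.r_[1.0, np.zeros(10)], (50, 1))
audio = synthesize(frames)
```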
A Silent-Speech Interface using Electro-Optical Stomatography
Speech technology is a major and growing industry that enriches the lives of technologically-minded people in a number of ways. Many potential users are, however, excluded: namely, all speakers who cannot easily, or at all, produce speech.
Silent-Speech Interfaces offer a way to communicate with a machine through a convenient speech-driven interface without the need for acoustic speech. They can also potentially provide a full replacement voice by synthesizing the intended utterances that are only silently articulated by the user. This dissertation presents a new Silent-Speech Interface based on a newly developed measurement system called Electro-Optical Stomatography and a novel parametric vocal tract model that enables real-time speech synthesis from the measured data. The hardware was used to conduct word recognition studies that reached and surpassed the state of the art in intra- and inter-individual accuracy. Furthermore, a study was completed in which the hardware was used to control the vocal tract model in a direct articulation-to-speech synthesis setting. While the intelligibility of synthesized vowels was rated very high, the intelligibility of consonants and of connected speech was poor. Promising ways to improve the system are discussed in the outlook.
Statement of authorship
Abstract
List of Figures
List of Tables
Acronyms
1. Introduction
1.1. The concept of a Silent-Speech Interface
1.2. Structure of this work
2. Fundamentals of phonetics
2.1. Components of the human speech production system
2.2. Vowel sounds
2.3. Consonantal sounds
2.4. Acoustic properties of speech sounds
2.5. Coarticulation
2.6. Phonotactics
2.7. Summary and implications for the design of a Silent-Speech Interface (SSI)
3. Articulatory data acquisition techniques in Silent-Speech Interfaces
3.1. Introduction
3.2. Scope of the literature review
3.3. Video Recordings
3.4. Ultrasonography
3.5. Electromyography
3.6. Permanent-Magnetic Articulography
3.7. Electromagnetic Articulography
3.8. Radio waves
3.9. Palatography
3.10. Conclusion and Discussion
4. Electro-Optical Stomatography
4.1. Contact sensors
4.2. Optical distance sensors
4.3. Lip sensor
4.4. Sensor Unit
4.5. Control Unit
4.6. Software
5. Articulation-to-Text
5.1. Introduction
5.2. Command word recognition pilot study
5.3. Command word recognition small-scale study
6. Articulation-to-Speech
6.1. Introduction
6.2. Articulatory synthesis
6.3. The six point vocal tract model
6.4. Objective evaluation of the vocal tract model
6.5. Perceptual evaluation of the vocal tract model
6.6. Direct synthesis using EOS to control the vocal tract model
6.7. Pitch and voicing
7. Summary and outlook
7.1. Summary of the contributions
7.2. Outlook
A. Overview of the International Phonetic Alphabet
B. Mathematical proofs and derivations
B.1. Combinatoric calculations illustrating the reduction of possible syllables using phonotactics
B.2. Signal Averaging
B.3. Effect of the contact sensor area on the conductance
B.4. Calculation of the forward current for the OP280V diode
C. Schematics and layouts
C.1. Schematics of the control unit
C.2. Layout of the control unit
C.3. Bill of materials of the control unit
C.4. Schematics of the sensor unit
C.5. Layout of the sensor unit
C.6. Bill of materials of the sensor unit
D. Sensor unit assembly
E. Firmware flow and data protocol
F. Palate file format
G. Supplemental material regarding the vocal tract model
H. Articulation-to-Speech: Optimal hyperparameters
Bibliography
Synthesizing Dysarthric Speech Using Multi-Speaker TTS for Dysarthric Speech Recognition
Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility resulting from slow, uncoordinated control of the speech production muscles. Automatic Speech Recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers.
In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand the differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking are proposed. For dysarthric speech synthesis, a modified neural multi-talker TTS is introduced that adds a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech at varying severity levels. In addition, this work is extended with a label propagation technique that creates more meaningful control variables, such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity labels. This increases the controllability of the system, allowing dysarthric speech to be generated over a broader range of severities.
To evaluate the effectiveness of the synthesized speech as training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a 12.2% WER improvement over the baseline, and that adding the severity level and pause insertion controls decreases WER by a further 6.5%, showing the effectiveness of these parameters. Overall, results on the TORGO database demonstrate that using synthetic speech to increase the amount of dysarthric-patterned training data has a significant impact on dysarthric ASR systems.
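Of the augmentation techniques mentioned above, time-feature masking is the simplest to illustrate. A minimal sketch follows, assuming the input is a (time x feature) matrix such as a log-mel spectrogram; the mask widths, counts, and function name are illustrative choices, not the dissertation's actual settings:

```python
# Minimal time/feature masking sketch for a (time x feature) matrix,
# e.g. a log-mel spectrogram; mask widths and counts are illustrative.
import numpy as np

def mask_time_and_features(spec, max_t=20, max_f=8, n_masks=2, rng=None):
    rng = rng or np.random.default_rng()
    out = spec.copy()
    t, f = out.shape
    for _ in range(n_masks):
        # Zero out a random block of consecutive time frames.
        w = int(rng.integers(1, max_t + 1))
        start = int(rng.integers(0, max(1, t - w)))
        out[start:start + w, :] = 0.0
        # Zero out a random block of consecutive feature bins.
        w = int(rng.integers(1, max_f + 1))
        start = int(rng.integers(0, max(1, f - w)))
        out[:, start:start + w] = 0.0
    return out

augmented = mask_time_and_features(np.random.rand(300, 80))
```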
Ultrasound-Based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis
For articulatory-to-acoustic mapping using deep neural networks, typically
spectral and excitation parameters of vocoders have been used as the training
targets. However, vocoding often results in buzzy and muffled final speech
quality. Therefore, in this paper on ultrasound-based articulatory-to-acoustic
conversion, we use a flow-based neural vocoder (WaveGlow) pre-trained on a
large amount of English and Hungarian speech data. The inputs of the
convolutional neural network are ultrasound tongue images. The training target
is the 80-dimensional mel-spectrogram, which results in a finer detailed
spectral representation than the previously used 25-dimensional Mel-Generalized
Cepstrum. From the output of the ultrasound-to-mel-spectrogram prediction,
WaveGlow inference results in synthesized speech. We compare the proposed
WaveGlow-based system with a continuous vocoder which does not use strict
voiced/unvoiced decision when predicting F0. The results demonstrate that
during the articulatory-to-acoustic mapping experiments, the WaveGlow neural
vocoder produces significantly more natural synthesized speech than the
baseline system. Moreover, an advantage of WaveGlow is that F0 is included in the mel-spectrogram representation, and it is not necessary to predict the excitation separately.
Comment: 5 pages, accepted for publication at Interspeech 2020. arXiv admin note: substantial text overlap with arXiv:1906.0988
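For reference, an 80-dimensional mel-spectrogram target such as the one described above can be computed with standard tooling. A minimal sketch using librosa; the sampling rate, FFT size, and hop length are assumptions for illustration and would have to match the configuration expected by the pre-trained vocoder:

```python
# Extracting an 80-band mel-spectrogram training target; the STFT settings
# below are illustrative and must match the pre-trained vocoder's configuration.
import numpy as np
import librosa

def mel_target(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return np.log(np.clip(mel, 1e-5, None))  # log compression, floored to avoid -inf

# mel = mel_target("utterance.wav")  # shape: (80, n_frames)
```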