150 research outputs found

    A Silent-Speech Interface using Electro-Optical Stomatography

    Get PDF
    Sprachtechnologie ist eine große und wachsende Industrie, die das Leben von technologieinteressierten Nutzern auf zahlreichen Wegen bereichert. Viele potenzielle Nutzer werden jedoch ausgeschlossen: Nämlich alle Sprecher, die nur schwer oder sogar gar nicht Sprache produzieren können. Silent-Speech Interfaces bieten einen Weg, mit Maschinen durch ein bequemes sprachgesteuertes Interface zu kommunizieren ohne dafür akustische Sprache zu benötigen. Sie können außerdem prinzipiell eine Ersatzstimme stellen, indem sie die intendierten Äußerungen, die der Nutzer nur still artikuliert, künstlich synthetisieren. Diese Dissertation stellt ein neues Silent-Speech Interface vor, das auf einem neu entwickelten Messsystem namens Elektro-Optischer Stomatografie und einem neuartigen parametrischen Vokaltraktmodell basiert, das die Echtzeitsynthese von Sprache basierend auf den gemessenen Daten ermöglicht. Mit der Hardware wurden Studien zur Einzelworterkennung durchgeführt, die den Stand der Technik in der intra- und inter-individuellen Genauigkeit erreichten und übertrafen. Darüber hinaus wurde eine Studie abgeschlossen, in der die Hardware zur Steuerung des Vokaltraktmodells in einer direkten Artikulation-zu-Sprache-Synthese verwendet wurde. Während die Verständlichkeit der Synthese von Vokalen sehr hoch eingeschätzt wurde, ist die Verständlichkeit von Konsonanten und kontinuierlicher Sprache sehr schlecht. Vielversprechende Möglichkeiten zur Verbesserung des Systems werden im Ausblick diskutiert.:Statement of authorship iii Abstract v List of Figures vii List of Tables xi Acronyms xiii 1. Introduction 1 1.1. The concept of a Silent-Speech Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2. Structure of this work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Fundamentals of phonetics 7 2.1. Components of the human speech production system . . . . . . . . . . . . . . . . . . . 7 2.2. Vowel sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3. Consonantal sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4. Acoustic properties of speech sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.5. Coarticulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.6. Phonotactics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.7. Summary and implications for the design of a Silent-Speech Interface (SSI) . . . . . . . 21 3. Articulatory data acquisition techniques in Silent-Speech Interfaces 25 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2. Scope of the literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.3. Video Recordings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.4. Ultrasonography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.5. Electromyography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.6. Permanent-Magnetic Articulography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.7. Electromagnetic Articulography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.8. Radio waves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.9. Palatography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.10.Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4. Electro-Optical Stomatography 55 4.1. Contact sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.2. Optical distance sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.3. Lip sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.4. Sensor Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.5. Control Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.6. Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5. Articulation-to-Text 99 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.2. Command word recognition pilot study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.3. Command word recognition small-scale study . . . . . . . . . . . . . . . . . . . . . . . . 102 6. Articulation-to-Speech 109 6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.2. Articulatory synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.3. The six point vocal tract model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.4. Objective evaluation of the vocal tract model . . . . . . . . . . . . . . . . . . . . . . . . 116 6.5. Perceptual evaluation of the vocal tract model . . . . . . . . . . . . . . . . . . . . . . . . 120 6.6. Direct synthesis using EOS to control the vocal tract model . . . . . . . . . . . . . . . . 125 6.7. Pitch and voicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 7. Summary and outlook 145 7.1. Summary of the contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 7.2. Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 A. Overview of the International Phonetic Alphabet 151 B. Mathematical proofs and derivations 153 B.1. Combinatoric calculations illustrating the reduction of possible syllables using phonotactics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 B.2. Signal Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 B.3. Effect of the contact sensor area on the conductance . . . . . . . . . . . . . . . . . . . . 155 B.4. Calculation of the forward current for the OP280V diode . . . . . . . . . . . . . . . . . . 155 C. Schematics and layouts 157 C.1. Schematics of the control unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 C.2. Layout of the control unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 C.3. Bill of materials of the control unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 C.4. Schematics of the sensor unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 C.5. Layout of the sensor unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 C.6. Bill of materials of the sensor unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 D. Sensor unit assembly 169 E. Firmware flow and data protocol 177 F. Palate file format 181 G. Supplemental material regarding the vocal tract model 183 H. Articulation-to-Speech: Optimal hyperparameters 189 Bibliography 191Speech technology is a major and growing industry that enriches the lives of technologically-minded people in a number of ways. Many potential users are, however, excluded: Namely, all speakers who cannot easily or even at all produce speech. Silent-Speech Interfaces offer a way to communicate with a machine by a convenient speech recognition interface without the need for acoustic speech. They also can potentially provide a full replacement voice by synthesizing the intended utterances that are only silently articulated by the user. To that end, the speech movements need to be captured and mapped to either text or acoustic speech. This dissertation proposes a new Silent-Speech Interface based on a newly developed measurement technology called Electro-Optical Stomatography and a novel parametric vocal tract model to facilitate real-time speech synthesis based on the measured data. The hardware was used to conduct command word recognition studies reaching state-of-the-art intra- and inter-individual performance. Furthermore, a study on using the hardware to control the vocal tract model in a direct articulation-to-speech synthesis loop was also completed. While the intelligibility of synthesized vowels was high, the intelligibility of consonants and connected speech was quite poor. Promising ways to improve the system are discussed in the outlook.:Statement of authorship iii Abstract v List of Figures vii List of Tables xi Acronyms xiii 1. Introduction 1 1.1. The concept of a Silent-Speech Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2. Structure of this work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Fundamentals of phonetics 7 2.1. Components of the human speech production system . . . . . . . . . . . . . . . . . . . 7 2.2. Vowel sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3. Consonantal sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4. Acoustic properties of speech sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.5. Coarticulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.6. Phonotactics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.7. Summary and implications for the design of a Silent-Speech Interface (SSI) . . . . . . . 21 3. Articulatory data acquisition techniques in Silent-Speech Interfaces 25 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2. Scope of the literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.3. Video Recordings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.4. Ultrasonography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.5. Electromyography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.6. Permanent-Magnetic Articulography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.7. Electromagnetic Articulography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.8. Radio waves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.9. Palatography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.10.Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4. Electro-Optical Stomatography 55 4.1. Contact sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.2. Optical distance sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.3. Lip sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.4. Sensor Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.5. Control Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.6. Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5. Articulation-to-Text 99 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.2. Command word recognition pilot study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.3. Command word recognition small-scale study . . . . . . . . . . . . . . . . . . . . . . . . 102 6. Articulation-to-Speech 109 6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.2. Articulatory synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.3. The six point vocal tract model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.4. Objective evaluation of the vocal tract model . . . . . . . . . . . . . . . . . . . . . . . . 116 6.5. Perceptual evaluation of the vocal tract model . . . . . . . . . . . . . . . . . . . . . . . . 120 6.6. Direct synthesis using EOS to control the vocal tract model . . . . . . . . . . . . . . . . 125 6.7. Pitch and voicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 7. Summary and outlook 145 7.1. Summary of the contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 7.2. Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 A. Overview of the International Phonetic Alphabet 151 B. Mathematical proofs and derivations 153 B.1. Combinatoric calculations illustrating the reduction of possible syllables using phonotactics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 B.2. Signal Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 B.3. Effect of the contact sensor area on the conductance . . . . . . . . . . . . . . . . . . . . 155 B.4. Calculation of the forward current for the OP280V diode . . . . . . . . . . . . . . . . . . 155 C. Schematics and layouts 157 C.1. Schematics of the control unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 C.2. Layout of the control unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 C.3. Bill of materials of the control unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 C.4. Schematics of the sensor unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 C.5. Layout of the sensor unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 C.6. Bill of materials of the sensor unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 D. Sensor unit assembly 169 E. Firmware flow and data protocol 177 F. Palate file format 181 G. Supplemental material regarding the vocal tract model 183 H. Articulation-to-Speech: Optimal hyperparameters 189 Bibliography 19

    Children\u27s Sensitivity to Pitch Variation in Language

    Get PDF
    Children acquire consonant and vowel categories by 12 months, but take much longer to learn to interpret perceptible variation. This dissertation considers children’s interpretation of pitch variation. Pitch operates, often simultaneously, at different levels of linguistic structure. English-learning children must disregard pitch at the lexical level—since English is not a tone language—while still attending to pitch for its other functions. Chapters 1 and 5 outline the learning problem and suggest ways children might solve it. Chapter 2 demonstrates that 2.5-year-olds know pitch cannot differentiate words in English. Chapter 3 finds that not until age 4–5 do children correctly interpret pitch cues to emotions. Chapter 4 demonstrates some sensitivity between 2.5 and 5 years to the pitch cue to lexical stress, but continuing difficulties at the older ages. These findings suggest a late trajectory for interpretation of prosodic variation; throughout, I propose explanations for this protracted time-course

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    A Sound Approach to Language Matters: In Honor of Ocke-Schwen Bohn

    Get PDF
    The contributions in this Festschrift were written by Ocke’s current and former PhD-students, colleagues and research collaborators. The Festschrift is divided into six sections, moving from the smallest building blocks of language, through gradually expanding objects of linguistic inquiry to the highest levels of description - all of which have formed a part of Ocke’s career, in connection with his teaching and/or his academic productions: “Segments”, “Perception of Accent”, “Between Sounds and Graphemes”, “Prosody”, “Morphology and Syntax” and “Second Language Acquisition”. Each one of these illustrates a sound approach to language matters

    Prosody generation for text-to-speech synthesis

    Get PDF
    The absence of convincing intonation makes current parametric speech synthesis systems sound dull and lifeless, even when trained on expressive speech data. Typically, these systems use regression techniques to predict the fundamental frequency (F0) frame-by-frame. This approach leads to overlysmooth pitch contours and fails to construct an appropriate prosodic structure across the full utterance. In order to capture and reproduce larger-scale pitch patterns, we propose a template-based approach for automatic F0 generation, where per-syllable pitch-contour templates (from a small, automatically learned set) are predicted by a recurrent neural network (RNN). The use of syllable templates mitigates the over-smoothing problem and is able to reproduce pitch patterns observed in the data. The use of an RNN, paired with connectionist temporal classification (CTC), enables the prediction of structure in the pitch contour spanning the entire utterance. This novel F0 prediction system is used alongside separate LSTMs for predicting phone durations and the other acoustic features, to construct a complete text-to-speech system. Later, we investigate the benefits of including long-range dependencies in duration prediction at frame-level using uni-directional recurrent neural networks. Since prosody is a supra-segmental property, we consider an alternate approach to intonation generation which exploits long-term dependencies of F0 by effective modelling of linguistic features using recurrent neural networks. For this purpose, we propose a hierarchical encoder-decoder and multi-resolution parallel encoder where the encoder takes word and higher level linguistic features at the input and upsamples them to phone-level through a series of hidden layers and is integrated into a Hybrid system which is then submitted to Blizzard challenge workshop. We then highlight some of the issues in current approaches and a plan for future directions of investigation is outlined along with on-going work

    The role of auditory perceptual gestalts on the processing of phrase structure

    Get PDF
    Hierarchical centre embeddings (HCEs) in natural language have been taken as evidence that language is not processed as a finite state system (Chomsky, 1957). While phrase structure may be necessary to produce HCEs, finite state, sequential processing may underlie their comprehension (Frank, Bod, & Christiansen, 2012). Under this account, listeners employ surface level cues (e.g. semantic content) to determine the dependencies within an utterance, instead of processing the words in a hierarchy. The acoustic structure of speech reflects the speaker’s syntactic representation during production (Cooper, Paccia & Lapointe, 1978). In comprehension, temporal (Snedeker & Trueswell, 2003) and pitch (Watson, Tanenhaus, & Gunlogson, 2008) cues rapidly influence processing. Therefore, temporal and pitch variation in speech could contain cues to dependencies. We examine whether grouping behaviour may be driven by Gestalt principles. Temporal proximity suggests that individuals group sequential words that occur closer together in time. Pitch similarity states that individuals group sequential words that are similar in pitch. In this thesis, I examine whether these Gestalts support dependency detection in speech, providing a mechanism through which hierarchical structure can be processed non-hierarchically. In Chapter 3, we assessed whether temporal proximity and pitch similarity explicitly relate to the structure of a corpus of spontaneously produced active and passive relative clauses. This was the case for actives; the embedded clause was preceded by a lengthened pause and a large pitch reduction. For passives, a longer pause and pitch reduction occurred after the verb-phrase of the embedded clause, counter to prediction. The results for actives suggest that temporal proximity and pitch similarity cues could be used to group the phrases of the embedded clause, obviating the need to process hierarchically structured speech hierarchically. Two artificial grammar learning studies assessed whether pitch similarity and temporal proximity cues support the acquisition of phrase structure grammar. Chapter 4 emphasised temporal proximity cues, while chapter 5 emphasised pitch similarity cues. In Chapter 5, pitch similarity cues improved classification performance for structures with two levels of embedding. In both, participants did not benefit from temporal proximity cues. However, the results of a cross-species meta-analysis of artificial grammar learning studies (Chapter 2) raised the possibility that reflection-based measures (e.g. grammaticality judgements) are not well suited for assessing processing-based learning, such as online speech processing (Christiansen, 2018). To properly assess the role of Gestalt cues in speech processing therefore requires processing-based measures. To assess the influence of auditory Gestalts on online speech processing, in Chapter 6 we analysed participants’ gaze behaviour in response to pitch similarity and temporal proximity cues using the visual world paradigm. Participants heard speech-synthesised active-object and passive relative clauses, whilst viewing four potential targets. Each sentence had a prosodic structure consistent with either syntactic form (Chapter 3), or two control prosodic structures. Pitch similarity results indicated that these cues facilitated processing. Temporal proximity cues consistent with syntactic structure did not facilitate processing, instead results suggested a general benefit of increased processing time. Overall, these studies suggest that participants can use the pitch similarity Gestalt to group together syntactically dependent phrases in hierarchical speech, offering a mechanism through which individuals could process hierarchical structures non-hierarchically. The results of Chapters 4, 5, and 6 suggest temporal proximity cues did not facilitate performance to the same extent. Thus, we suggest that unfilled pauses in isolation may be insufficient to facilitate groupings on the basis of temporal proximity

    A Phonetic model of English intonation

    Get PDF
    This thesis proposes a phonetic model of English intonation which is a system for linking the phonological and F₀, descriptions of an utterance.It is argued that such a model should take the form of a rigorously defined formal system which does not require any human intuition or expertise to operate. It is also argued that this model should be capable of both analysis (F₀ to phonology) and synthesis (phonology to F₀). Existing phonetic models are reviewed and it is shown that none meet the specification for the type of formal model required.A new phonetic model is presented that has three levels of description: the F₀ level, the intermediate level and the phonological level. The intermediate level uses the three basic elements of rise,fall and connection to model F₀ contours. A mathematical equation is specified for each of these elements so that a continuous lb contour can be created from a sequence of elements. The phonological system uses H and L to describe high and low pitch accents, C to describe connection elements and B to describe the rises that occur at phrase boundaries. A fully specified grammar is described which links the intermediate and F₀ levels. A grammar is specified for linking the phonological and intermediate levels, but this is only partly complete due to problems with the phonological level of description.A computer implementation of the model is described. Most of the implementation work concentrated on the relationship between the intermediate level and the F₀ level. Results are given showing that the computer analysis system labels F₀ contours quite accurately, but is significantly worse than a human labeller. It is shown that the synthesis system produces artificial F₀ contours that are very similar to naturally occurring F₀ contoursThe thesis concludes with some indications of further work and ideas on how the computer implementation of the model could be of practical benefit in speech synthesis and recognition

    The Processing of Emotional Sentences by Young and Older Adults: A Visual World Eye-movement Study

    Get PDF
    Carminati MN, Knoeferle P. The Processing of Emotional Sentences by Young and Older Adults: A Visual World Eye-movement Study. Presented at the Architectures and Mechanisms of Language and Processing (AMLaP), Riva del Garda, Italy
    corecore