45 research outputs found

    Prosody generation for text-to-speech synthesis

    Get PDF
    The absence of convincing intonation makes current parametric speech synthesis systems sound dull and lifeless, even when trained on expressive speech data. Typically, these systems use regression techniques to predict the fundamental frequency (F0) frame-by-frame. This approach leads to overlysmooth pitch contours and fails to construct an appropriate prosodic structure across the full utterance. In order to capture and reproduce larger-scale pitch patterns, we propose a template-based approach for automatic F0 generation, where per-syllable pitch-contour templates (from a small, automatically learned set) are predicted by a recurrent neural network (RNN). The use of syllable templates mitigates the over-smoothing problem and is able to reproduce pitch patterns observed in the data. The use of an RNN, paired with connectionist temporal classification (CTC), enables the prediction of structure in the pitch contour spanning the entire utterance. This novel F0 prediction system is used alongside separate LSTMs for predicting phone durations and the other acoustic features, to construct a complete text-to-speech system. Later, we investigate the benefits of including long-range dependencies in duration prediction at frame-level using uni-directional recurrent neural networks. Since prosody is a supra-segmental property, we consider an alternate approach to intonation generation which exploits long-term dependencies of F0 by effective modelling of linguistic features using recurrent neural networks. For this purpose, we propose a hierarchical encoder-decoder and multi-resolution parallel encoder where the encoder takes word and higher level linguistic features at the input and upsamples them to phone-level through a series of hidden layers and is integrated into a Hybrid system which is then submitted to Blizzard challenge workshop. We then highlight some of the issues in current approaches and a plan for future directions of investigation is outlined along with on-going work

    Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis

    Get PDF
    Statistical parametric speech synthesis (SPSS) has seen improvements over recent years, especially in terms of intelligibility. Synthetic speech is often clear and understandable, but it can also be bland and monotonous. Proper generation of natural speech prosody is still a largely unsolved problem. This is relevant especially in the context of expressive audiobook speech synthesis, where speech is expected to be fluid and captivating. In general, prosody can be seen as a layer that is superimposed on the segmental (phone) sequence. Listeners can perceive the same melody or rhythm in different utterances, and the same segmental sequence can be uttered with a different prosodic layer to convey a different message. For this reason, prosody is commonly accepted to be inherently suprasegmental. It is governed by longer units within the utterance (e.g. syllables, words, phrases) and beyond the utterance (e.g. discourse). However, common techniques for the modeling of speech prosody - and speech in general - operate mainly on very short intervals, either at the state or frame level, in both hidden Markov model (HMM) and deep neural network (DNN) based speech synthesis. This thesis presents contributions supporting the claim that stronger representations of suprasegmental variation are essential for the natural generation of fundamental frequency for statistical parametric speech synthesis. We conceptualize the problem by dividing it into three sub-problems: (1) representations of acoustic signals, (2) representations of linguistic contexts, and (3) the mapping of one representation to another. The contributions of this thesis provide novel methods and insights relating to these three sub-problems. In terms of sub-problem 1, we propose a multi-level representation of f0 using the continuous wavelet transform and the discrete cosine transform, as well as a wavelet-based decomposition strategy that is linguistically and perceptually motivated. In terms of sub-problem 2, we investigate additional linguistic features such as text-derived word embeddings and syllable bag-of-phones and we propose a novel method for learning word vector representations based on acoustic counts. Finally, considering sub-problem 3, insights are given regarding hierarchical models such as parallel and cascaded deep neural networks

    Visual prosody in speech-driven facial animation: elicitation, prediction, and perceptual evaluation

    Get PDF
    Facial animations capable of articulating accurate movements in synchrony with a speech track have become a subject of much research during the past decade. Most of these efforts have focused on articulation of lip and tongue movements, since these are the primary sources of information in speech reading. However, a wealth of paralinguistic information is implicitly conveyed through visual prosody (e.g., head and eyebrow movements). In contrast with lip/tongue movements, however, for which the articulation rules are fairly well known (i.e., viseme-phoneme mappings, coarticulation), little is known about the generation of visual prosody. The objective of this thesis is to explore the perceptual contributions of visual prosody in speech-driven facial avatars. Our main hypothesis is that visual prosody driven by acoustics of the speech signal, as opposed to random or no visual prosody, results in more realistic, coherent and convincing facial animations. To test this hypothesis, we have developed an audio-visual system capable of capturing synchronized speech and facial motion from a speaker using infrared illumination and retro-reflective markers. In order to elicit natural visual prosody, a story-telling experiment was designed in which the actors were shown a short cartoon video, and subsequently asked to narrate the episode. From this audio-visual data, four different facial animations were generated, articulating no visual prosody, Perlin-noise, speech-driven movements, and ground truth movements. Speech-driven movements were driven by acoustic features of the speech signal (e.g., fundamental frequency and energy) using rule-based heuristics and autoregressive models. A pair-wise perceptual evaluation shows that subjects can clearly discriminate among the four visual prosody animations. It also shows that speech-driven movements and Perlin-noise, in that order, approach the performance of veridical motion. The results are quite promising and suggest that speech-driven motion could outperform Perlin-noise if more powerful motion prediction models are used. In addition, our results also show that exaggeration can bias the viewer to perceive a computer generated character to be more realistic motion-wise

    Vocal accommodation in human-computer interaction : modeling and integration into spoken dialogue systems

    Get PDF
    With the rapidly increasing usage of voice-activated devices worldwide, verbal communication with computers is steadily becoming more common. Although speech is the principal natural manner of human communication, it is still challenging for computers, and users had been growing accustomed to adjusting their speaking style for computers. Such adjustments occur naturally, and typically unconsciously, in humans during an exchange to control the social distance between the interlocutors and improve the conversation’s efficiency. This phenomenon is called accommodation and it occurs on various modalities in human communication, like hand gestures, facial expressions, eye gaze, lexical and grammatical choices, and others. Vocal accommodation deals with phonetic-level changes occurring in segmental and suprasegmental features. A decrease in the difference between the speakers’ feature realizations results in convergence, while an increasing distance leads to divergence. The lack of such mutual adjustments made naturally by humans in computers’ speech creates a gap between human-human and human-computer interactions. Moreover, voice-activated systems currently speak in exactly the same manner to all users, regardless of their speech characteristics or realizations of specific features. Detecting phonetic variations and generating adaptive speech output would enhance user personalization, offer more human-like communication, and ultimately should improve the overall interaction experience. Thus, investigating these aspects of accommodation will help to understand and improving human-computer interaction. This thesis provides a comprehensive overview of the required building blocks for a roadmap toward the integration of accommodation capabilities into spoken dialogue systems. These include conducting human-human and human-computer interaction experiments to examine the differences in vocal behaviors, approaches for modeling these empirical findings, methods for introducing phonetic variations in synthesized speech, and a way to combine all these components into an accommodative system. While each component is a wide research field by itself, they depend on each other and hence should be jointly considered. The overarching goal of this thesis is therefore not only to show how each of the aspects can be further developed, but also to demonstrate and motivate the connections between them. A special emphasis is put throughout the thesis on the importance of the temporal aspect of accommodation. Humans constantly change their speech over the course of a conversation. Therefore, accommodation processes should be treated as continuous, dynamic phenomena. Measuring differences in a few discrete points, e.g., beginning and end of an interaction, may leave many accommodation events undiscovered or overly smoothed. To justify the effort of introducing accommodation in computers, it should first be proven that humans even show any phonetic adjustments when talking to a computer as they do with a human being. As there is no definitive metric for measuring accommodation and evaluating its quality, it is important to empirically study humans productions to later use as references for possible behaviors. In this work, this investigation encapsulates different experimental configurations to achieve a better picture of accommodation effects. First, vocal accommodation was inspected where it naturally occurs, namely in spontaneous human-human conversations. For this purpose, a collection of real-world sales conversations, each with a different representative-prospect pair, was collected and analyzed. These conversations offer a glance into accommodation effects in authentic, unscripted interactions with the common goal of negotiating a deal on the one hand, but with the individual facet of each side of trying to get the best terms on the other hand. The conversations were analyzed using cross-correlation and time series techniques to capture the change dynamics over time. It was found that successful conversations are distinguishable from failed ones by multiple measures. Furthermore, the sales representative proved to be better at leading the vocal changes, i.e., making the prospect follow their speech styles rather than the other way around. They also showed a stronger tendency to take that lead at an earlier stage, all the more so in successful conversations. The fact that accommodation occurs more by trained speakers and improves their performances fits anecdotal best practices of sales experts, which are now also proven scientifically. Following these results, the next experiment came closer to the final goal of this work and investigated vocal accommodation effects in human-computer interaction. This was done via a shadowing experiment, which offers a controlled setting for examining phonetic variations. As spoken dialogue systems with such accommodation capabilities (like this work aims to achieve) do not exist yet, a simulated system was used to introduce these changes to the participants, who believed they help with the testing of a language learning tutoring system. After determining their preference concerning three segmental phonetic features, participants were listen-ing to either natural or synthesized voices of male and female speakers, which produced the participants’ dispreferred variation of the aforementioned features. Accommodation occurred in all cases, but the natural voices triggered stronger effects. Nevertheless, it can be concluded that participants were accommodating toward synthetic voices as well, which means that social mechanisms are applied in humans also when speaking with computer-based interlocutors. The shadowing paradigm was utilized also to test whether accommodation is a phenomenon associated only with speech or with other vocal productions as well. To that end, accommodation in the singing of familiar and novel music was examined. Interestingly, accommodation was found in both cases, though in different ways. While participants seemed to use the familiar piece merely as a reference for singing more accurately, the novel piece became the goal for complete replicate. For example, one difference was that mostly pitch corrections were introduced in the former case, while in the latter also key and rhythmic patterns were adopted. Some of those findings were expected and they show that people’s more salient features are also harder to modify using external auditory influence. Lastly, a multiparty experiment with spontaneous human-human-computer interactions was carried out to compare accommodation in human-directed and computer-directed speech. The participants solved tasks for which they needed to talk both with a confederate and with an agent. This allows a direct comparison of their speech based on the addressee within the same conversation, which has not been done so far. Results show that some participants’ vocal behavior changed similarly when talking to the confederate and the agent, while others’ speech varied only with the confederate. Further analysis found that the greatest factor for this difference was the order in which the participants talked with the interlocutors. Apparently, those who first talked to the agent alone saw it more as a social actor in the conversation, while those who interacted with it after talking to the confederate treated it more as a means to achieve a goal, and thus behaved differently with it. In the latter case, the variations in the human-directed speech were much more prominent. Differences were also found between the analyzed features, but the task type did not influence the degree of accommodation effects. The results of these experiments lead to the conclusion that vocal accommodation does occur in human-computer interactions, even if often to lesser degrees. With the question of whether people accommodate to computer-based interlocutors as well answered, the next step would be to describe accommodative behaviors in a computer-processable manner. Two approaches are proposed here: computational and statistical. The computational model aims to capture the presumed cognitive process associated with accommodation in humans. This comprises various steps, such as detecting the variable feature’s sound, adding instances of it to the feature’s mental memory, and determining how much the sound will change while taking into account both its current representation and the external input. Due to its sequential nature, this model was implemented as a pipeline. Each of the pipeline’s five steps corresponds to a specific part of the cognitive process and can have one or more parameters to control its output (e.g., the size of the feature’s memory or the accommodation pace). Using these parameters, precise accommodative behaviors can be crafted while applying expert knowledge to motivate the chosen parameter values. These advantages make this approach suitable for experimentation with pre-defined, deterministic behaviors where each step can be changed individually. Ultimately, this approach makes a system vocally responsive to users’ speech input. The second approach grants more evolved behaviors, by defining different core behaviors and adding non-deterministic variations on top of them. This resembles human behavioral patterns, as each person has a base way of accommodating (or not accommodating), which may arbitrarily change based on the specific circumstances. This approach offers a data-driven statistical way to extract accommodation behaviors from a given collection of interactions. First, the target feature’s values of each speaker in an interaction are converted into continuous interpolated lines by drawing one sample from the posterior distribution of a Gaussian process conditioned on the given values. Then, the gradients of these lines, which represent rates of mutual change, are used to defined discrete levels of change based on their distribution. Finally, each level is assigned a symbol, which ultimately creates a symbol sequence representation for each interaction. The sequences are clustered so that each cluster stands for a type of behavior. The sequences of a cluster can then be used to calculate n-gram probabilities that enable the generation of new sequences of the captured behavior. The specific output value is sampled from the range corresponding to the generated symbol. With this approach, accommodation behaviors are extracted directly from data, as opposed to manually crafting them. However, it is harder to describe what exactly these behaviors represent and motivate the use of one of them over the other. To bridge this gap between these two approaches, it is also discussed how they can be combined to benefit from the advantages of both. Furthermore, to generate more structured behaviors, a hierarchy of accommodation complexity levels is suggested here, from a direct adoption of users’ realizations, via specified responsiveness, and up to independent core behaviors with non-deterministic variational productions. Besides a way to track and represent vocal changes, an accommodative system also needs a text-to-speech component that is able to realize those changes in the system’s speech output. Speech synthesis models are typically trained once on data with certain characteristics and do not change afterward. This prevents such models from introducing any variation in specific sounds and other phonetic features. Two methods for directly modifying such features are explored here. The first is based on signal modifications applied to the output signal after it was generated by the system. The processing is done between the timestamps of the target features and uses pre-defined scripts that modify the signal to achieve the desired values. This method is more suitable for continuous features like vowel quality, especially in the case of subtle changes that do not necessarily lead to a categorical sound change. The second method aims to capture phonetic variations in the training data. To that end, a training corpus with phonemic representations is used, as opposed to the regular graphemic representations. This way, the model can learn more direct relations between phonemes and sound instead of surface forms and sound, which, depending on the language, might be more complex and depend on their surrounding letters. The target variations themselves don’t necessarily need to be explicitly present in the training data, all time the different sounds are naturally distinguishable. In generation time, the current target feature’s state determines the phoneme to use for generating the desired sound. This method is suitable for categorical changes, especially for contrasts that naturally exist in the language. While both methods have certain limitations, they provide a proof of concept for the idea that spoken dialogue systems may phonetically adapt their speech output in real-time and without re-training their text-to-speech models. To combine the behavior definitions and the speech manipulations, a system is required, which can connect these elements to create a complete accommodation capability. The architecture suggested here extends the standard spoken dialogue system with an additional module, which receives the transcribed speech signal from the speech recognition component without influencing the input to the language understanding component. While language the understanding component uses only textual transcription to determine the user’s intention, the added component process the raw signal along with its phonetic transcription. In this extended architecture, the accommodation model is activated in the added module and the information required for speech manipulation is sent to the text-to-speech component. However, the text-to-speech component now has two inputs, viz. the content of the system’s response coming from the language generation component and the states of the defined target features from the added component. An implementation of a web-based system with this architecture is introduced here, and its functionality is showcased by demonstrating how it can be used to conduct a shadowing experiment automatically. This has two main advantage: First, since the system recognizes the participants’ phonetic variations and automatically selects the appropriate variation to use in its response, the experimenter saves time and prevents manual annotation errors. The experimenter also automatically gains additional information, like exact timestamps of utterances, real-time visualization of the interlocutors’ productions, and the possibility to replay and analyze the interaction after the experiment is finished. The second advantage is scalability. Multiple instances of the system can run on a server and be accessed by multiple clients at the same time. This not only saves time and the logistics of bringing participants into a lab, but also allows running the experiment with different configurations (e.g., other parameter values or target features) in a controlled and reproducible way. This completes a full cycle from examining human behaviors to integrating accommodation capabilities. Though each part of it can undoubtedly be further investigated, the emphasis here is on how they depend and connect to each other. Measuring changes features without showing how they can be modeled or achieving flexible speech synthesis without considering the desired final output might not lead to the final goal of introducing accommodation capabilities into computers. Treating accommodation in human-computer interaction as one large process rather than isolated sub-problems lays the ground for more comprehensive and complete solutions in the future.Heutzutage wird die verbale Interaktion mit Computern immer gebrĂ€uchlicher, was der rasant wachsenden Anzahl von sprachaktivierten GerĂ€ten weltweit geschuldet ist. Allerdings stellt die computerseitige Handhabung gesprochener Sprache weiterhin eine große Herausforderung dar, obwohl sie die bevorzugte Art zwischenmenschlicher Kommunikation reprĂ€sentiert. Dieser Umstand führt auch dazu, dass Benutzer ihren Sprachstil an das jeweilige GerĂ€t anpassen, um diese Handhabung zu erleichtern. Solche Anpassungen kommen in menschlicher gesprochener Sprache auch in der zwischenmenschlichen Kommunikation vor. Üblicherweise ereignen sie sich unbewusst und auf natürliche Weise wĂ€hrend eines GesprĂ€chs, etwa um die soziale Distanz zwischen den GesprĂ€chsteilnehmern zu kontrollieren oder um die Effizienz des GesprĂ€chs zu verbessern. Dieses PhĂ€nomen wird als Akkommodation bezeichnet und findet auf verschiedene Weise wĂ€hrend menschlicher Kommunikation statt. Sie Ă€ußert sich zum Beispiel in der Gestik, Mimik, Blickrichtung oder aber auch in der Wortwahl und dem verwendeten Satzbau. Vokal- Akkommodation beschĂ€ftigt sich mit derartigen Anpassungen auf phonetischer Ebene, die sich in segmentalen und suprasegmentalen Merkmalen zeigen. Werden AusprĂ€gungen dieser Merkmale bei den GesprĂ€chsteilnehmern im Laufe des GesprĂ€chs Ă€hnlicher, spricht man von Konvergenz, vergrĂ¶ĂŸern sich allerdings die Unterschiede, so wird dies als Divergenz bezeichnet. Dieser natürliche gegenseitige Anpassungsvorgang fehlt jedoch auf der Seite des Computers, was zu einer Lücke in der Mensch-Maschine-Interaktion führt. Darüber hinaus verwenden sprachaktivierte Systeme immer dieselbe Sprachausgabe und ignorieren folglich etwaige Unterschiede zum Sprachstil des momentanen Benutzers. Die Erkennung dieser phonetischen Abweichungen und die Erstellung von anpassungsfĂ€higer Sprachausgabe würden zur Personalisierung dieser Systeme beitragen und könnten letztendlich die insgesamte Benutzererfahrung verbessern. Aus diesem Grund kann die Erforschung dieser Aspekte von Akkommodation helfen, Mensch-Maschine-Interaktion besser zu verstehen und weiterzuentwickeln. Die vorliegende Dissertation stellt einen umfassenden Überblick zu Bausteinen bereit, die nötig sind, um AkkommodationsfĂ€higkeiten in Sprachdialogsysteme zu integrieren. In diesem Zusammenhang wurden auch interaktive Mensch-Mensch- und Mensch- Maschine-Experimente durchgeführt. In diesen Experimenten wurden Differenzen der vokalen Verhaltensweisen untersucht und Methoden erforscht, wie phonetische Abweichungen in synthetische Sprachausgabe integriert werden können. Um die erhaltenen Ergebnisse empirisch auswerten zu können, wurden hierbei auch verschiedene ModellierungsansĂ€tze erforscht. Fernerhin wurde der Frage nachgegangen, wie sich die betreffenden Komponenten kombinieren lassen, um ein Akkommodationssystem zu konstruieren. Jeder dieser Aspekte stellt für sich genommen bereits einen überaus breiten Forschungsbereich dar. Allerdings sind sie voneinander abhĂ€ngig und sollten zusammen betrachtet werden. Aus diesem Grund liegt ein übergreifender Schwerpunkt dieser Dissertation darauf, nicht nur aufzuzeigen, wie sich diese Aspekte weiterentwickeln lassen, sondern auch zu motivieren, wie sie zusammenhĂ€ngen. Ein weiterer Schwerpunkt dieser Arbeit befasst sich mit der zeitlichen Komponente des Akkommodationsprozesses, was auf der Beobachtung fußt, dass Menschen im Laufe eines GesprĂ€chs stĂ€ndig ihren Sprachstil Ă€ndern. Diese Beobachtung legt nahe, derartige Prozesse als kontinuierliche und dynamische Prozesse anzusehen. Fasst man jedoch diesen Prozess als diskret auf und betrachtet z.B. nur den Beginn und das Ende einer Interaktion, kann dies dazu führen, dass viele Akkommodationsereignisse unentdeckt bleiben oder übermĂ€ĂŸig geglĂ€ttet werden. Um die Entwicklung eines vokalen Akkommodationssystems zu rechtfertigen, muss zuerst bewiesen werden, dass Menschen bei der vokalen Interaktion mit einem Computer ein Ă€hnliches Anpassungsverhalten zeigen wie bei der Interaktion mit einem Menschen. Da es keine eindeutig festgelegte Metrik für das Messen des Akkommodationsgrades und für die Evaluierung der AkkommodationsqualitĂ€t gibt, ist es besonders wichtig, die Sprachproduktion von Menschen empirisch zu untersuchen, um sie als Referenz für mögliche Verhaltensweisen anzuwenden. In dieser Arbeit schließt diese Untersuchung verschiedene experimentelle Anordnungen ein, um einen besseren Überblick über Akkommodationseffekte zu erhalten. In einer ersten Studie wurde die vokale Akkommodation in einer Umgebung untersucht, in der sie natürlich vorkommt: in einem spontanen Mensch-Mensch GesprĂ€ch. Zu diesem Zweck wurde eine Sammlung von echten VerkaufsgesprĂ€chen gesammelt und analysiert, wobei in jedem dieser GesprĂ€che ein anderes Handelsvertreter-Neukunde Paar teilgenommen hatte. Diese GesprĂ€che verschaffen einen Einblick in Akkommodationseffekte wĂ€hrend spontanen authentischen Interaktionen, wobei die GesprĂ€chsteilnehmer zwei Ziele verfolgen: zum einen soll ein GeschĂ€ft verhandelt werden, zum anderen möchte aber jeder Teilnehmer für sich die besten Bedingungen aushandeln. Die Konversationen wurde durch das Kreuzkorrelation-Zeitreihen-Verfahren analysiert, um die dynamischen Änderungen im Zeitverlauf zu erfassen. Hierbei kam zum Vorschein, dass sich erfolgreiche Konversationen von fehlgeschlagenen GesprĂ€chen deutlich unterscheiden lassen. Überdies wurde festgestellt, dass die Handelsvertreter die treibende Kraft von vokalen Änderungen sind, d.h. sie können die Neukunden eher dazu zu bringen, ihren Sprachstil anzupassen, als andersherum. Es wurde auch beobachtet, dass sie diese Akkommodation oft schon zu einem frühen Zeitpunkt auslösen, was besonders bei erfolgreichen GesprĂ€chen beobachtet werden konnte. Dass diese Akkommodation stĂ€rker bei trainierten Sprechern ausgelöst wird, deckt sich mit den meist anekdotischen Empfehlungen von erfahrenen Handelsvertretern, die bisher nie wissenschaftlich nachgewiesen worden sind. Basierend auf diesen Ergebnissen beschĂ€fti

    Training a supra-segmental parametric F0 model without interpolating F0

    No full text
    Combining multiple intonation models at different linguistic levels is an effective way to improve the naturalness of the predicted F0. In many of these approaches, the intonation models for suprasegmental levels are based on a parametrization of the log-F0 contours over the units of that level. However, many of these parametrisations are not stable when applied to discontinuous signals. Therefore, the F0 signal has to be interpolated. These interpolated values introduce a distortion in the coefficients that degrades the quality of the model. This paper proposes two methods that eliminate the need for such interpolation, one based on regularization and the other on factor analysis. Subjective evaluations show that, for a Discrete-cosine-transform (DCT) syllable-level model, both approaches result in a significant improvement w.r.t. a baseline using interpolated F0. The approach based on regularization yields the best results. © 2013 IEEE

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    Models and analysis of vocal emissions for biomedical applications

    Get PDF
    This book of Proceedings collects the papers presented at the 3rd International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, MAVEBA 2003, held 10-12 December 2003, Firenze, Italy. The workshop is organised every two years, and aims to stimulate contacts between specialists active in research and industrial developments, in the area of voice analysis for biomedical applications. The scope of the Workshop includes all aspects of voice modelling and analysis, ranging from fundamental research to all kinds of biomedical applications and related established and advanced technologies

    A Silent-Speech Interface using Electro-Optical Stomatography

    Get PDF
    Sprachtechnologie ist eine große und wachsende Industrie, die das Leben von technologieinteressierten Nutzern auf zahlreichen Wegen bereichert. Viele potenzielle Nutzer werden jedoch ausgeschlossen: NĂ€mlich alle Sprecher, die nur schwer oder sogar gar nicht Sprache produzieren können. Silent-Speech Interfaces bieten einen Weg, mit Maschinen durch ein bequemes sprachgesteuertes Interface zu kommunizieren ohne dafĂŒr akustische Sprache zu benötigen. Sie können außerdem prinzipiell eine Ersatzstimme stellen, indem sie die intendierten Äußerungen, die der Nutzer nur still artikuliert, kĂŒnstlich synthetisieren. Diese Dissertation stellt ein neues Silent-Speech Interface vor, das auf einem neu entwickelten Messsystem namens Elektro-Optischer Stomatografie und einem neuartigen parametrischen Vokaltraktmodell basiert, das die Echtzeitsynthese von Sprache basierend auf den gemessenen Daten ermöglicht. Mit der Hardware wurden Studien zur Einzelworterkennung durchgefĂŒhrt, die den Stand der Technik in der intra- und inter-individuellen Genauigkeit erreichten und ĂŒbertrafen. DarĂŒber hinaus wurde eine Studie abgeschlossen, in der die Hardware zur Steuerung des Vokaltraktmodells in einer direkten Artikulation-zu-Sprache-Synthese verwendet wurde. WĂ€hrend die VerstĂ€ndlichkeit der Synthese von Vokalen sehr hoch eingeschĂ€tzt wurde, ist die VerstĂ€ndlichkeit von Konsonanten und kontinuierlicher Sprache sehr schlecht. Vielversprechende Möglichkeiten zur Verbesserung des Systems werden im Ausblick diskutiert.:Statement of authorship iii Abstract v List of Figures vii List of Tables xi Acronyms xiii 1. Introduction 1 1.1. The concept of a Silent-Speech Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2. Structure of this work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Fundamentals of phonetics 7 2.1. Components of the human speech production system . . . . . . . . . . . . . . . . . . . 7 2.2. Vowel sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3. Consonantal sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4. Acoustic properties of speech sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.5. Coarticulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.6. Phonotactics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.7. Summary and implications for the design of a Silent-Speech Interface (SSI) . . . . . . . 21 3. Articulatory data acquisition techniques in Silent-Speech Interfaces 25 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2. Scope of the literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.3. Video Recordings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.4. Ultrasonography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.5. Electromyography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.6. Permanent-Magnetic Articulography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.7. Electromagnetic Articulography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.8. Radio waves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.9. Palatography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.10.Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4. Electro-Optical Stomatography 55 4.1. Contact sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.2. Optical distance sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.3. Lip sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.4. Sensor Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.5. Control Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.6. Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5. Articulation-to-Text 99 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.2. Command word recognition pilot study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.3. Command word recognition small-scale study . . . . . . . . . . . . . . . . . . . . . . . . 102 6. Articulation-to-Speech 109 6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.2. Articulatory synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.3. The six point vocal tract model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.4. Objective evaluation of the vocal tract model . . . . . . . . . . . . . . . . . . . . . . . . 116 6.5. Perceptual evaluation of the vocal tract model . . . . . . . . . . . . . . . . . . . . . . . . 120 6.6. Direct synthesis using EOS to control the vocal tract model . . . . . . . . . . . . . . . . 125 6.7. Pitch and voicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 7. Summary and outlook 145 7.1. Summary of the contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 7.2. Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 A. Overview of the International Phonetic Alphabet 151 B. Mathematical proofs and derivations 153 B.1. Combinatoric calculations illustrating the reduction of possible syllables using phonotactics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 B.2. Signal Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 B.3. Effect of the contact sensor area on the conductance . . . . . . . . . . . . . . . . . . . . 155 B.4. Calculation of the forward current for the OP280V diode . . . . . . . . . . . . . . . . . . 155 C. Schematics and layouts 157 C.1. Schematics of the control unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 C.2. Layout of the control unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 C.3. Bill of materials of the control unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 C.4. Schematics of the sensor unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 C.5. Layout of the sensor unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 C.6. Bill of materials of the sensor unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 D. Sensor unit assembly 169 E. Firmware flow and data protocol 177 F. Palate file format 181 G. Supplemental material regarding the vocal tract model 183 H. Articulation-to-Speech: Optimal hyperparameters 189 Bibliography 191Speech technology is a major and growing industry that enriches the lives of technologically-minded people in a number of ways. Many potential users are, however, excluded: Namely, all speakers who cannot easily or even at all produce speech. Silent-Speech Interfaces offer a way to communicate with a machine by a convenient speech recognition interface without the need for acoustic speech. They also can potentially provide a full replacement voice by synthesizing the intended utterances that are only silently articulated by the user. To that end, the speech movements need to be captured and mapped to either text or acoustic speech. This dissertation proposes a new Silent-Speech Interface based on a newly developed measurement technology called Electro-Optical Stomatography and a novel parametric vocal tract model to facilitate real-time speech synthesis based on the measured data. The hardware was used to conduct command word recognition studies reaching state-of-the-art intra- and inter-individual performance. Furthermore, a study on using the hardware to control the vocal tract model in a direct articulation-to-speech synthesis loop was also completed. While the intelligibility of synthesized vowels was high, the intelligibility of consonants and connected speech was quite poor. Promising ways to improve the system are discussed in the outlook.:Statement of authorship iii Abstract v List of Figures vii List of Tables xi Acronyms xiii 1. Introduction 1 1.1. The concept of a Silent-Speech Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2. Structure of this work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2. Fundamentals of phonetics 7 2.1. Components of the human speech production system . . . . . . . . . . . . . . . . . . . 7 2.2. Vowel sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3. Consonantal sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4. Acoustic properties of speech sounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.5. Coarticulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.6. Phonotactics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.7. Summary and implications for the design of a Silent-Speech Interface (SSI) . . . . . . . 21 3. Articulatory data acquisition techniques in Silent-Speech Interfaces 25 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2. Scope of the literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.3. Video Recordings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.4. Ultrasonography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.5. Electromyography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.6. Permanent-Magnetic Articulography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.7. Electromagnetic Articulography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.8. Radio waves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.9. Palatography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.10.Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4. Electro-Optical Stomatography 55 4.1. Contact sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.2. Optical distance sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.3. Lip sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.4. Sensor Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.5. Control Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.6. Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5. Articulation-to-Text 99 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.2. Command word recognition pilot study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.3. Command word recognition small-scale study . . . . . . . . . . . . . . . . . . . . . . . . 102 6. Articulation-to-Speech 109 6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.2. Articulatory synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.3. The six point vocal tract model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.4. Objective evaluation of the vocal tract model . . . . . . . . . . . . . . . . . . . . . . . . 116 6.5. Perceptual evaluation of the vocal tract model . . . . . . . . . . . . . . . . . . . . . . . . 120 6.6. Direct synthesis using EOS to control the vocal tract model . . . . . . . . . . . . . . . . 125 6.7. Pitch and voicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 7. Summary and outlook 145 7.1. Summary of the contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 7.2. Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 A. Overview of the International Phonetic Alphabet 151 B. Mathematical proofs and derivations 153 B.1. Combinatoric calculations illustrating the reduction of possible syllables using phonotactics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 B.2. Signal Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 B.3. Effect of the contact sensor area on the conductance . . . . . . . . . . . . . . . . . . . . 155 B.4. Calculation of the forward current for the OP280V diode . . . . . . . . . . . . . . . . . . 155 C. Schematics and layouts 157 C.1. Schematics of the control unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 C.2. Layout of the control unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 C.3. Bill of materials of the control unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 C.4. Schematics of the sensor unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 C.5. Layout of the sensor unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 C.6. Bill of materials of the sensor unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 D. Sensor unit assembly 169 E. Firmware flow and data protocol 177 F. Palate file format 181 G. Supplemental material regarding the vocal tract model 183 H. Articulation-to-Speech: Optimal hyperparameters 189 Bibliography 19
    corecore