    CHULA TTS: A Modularized Text-To-Speech Framework

    Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab

    Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances and involves both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL), addressing or mitigating existing problems of speech inversion such as non-unique mapping, acoustic variation among speakers, and the time-consuming nature of the process. The first method finds appropriate VTL gestural scores for given natural utterances using a genetic algorithm, in two steps: gestural score initialization and optimization. In the first step, gestural scores were initialized from the given acoustic signals using speech recognition, grapheme-to-phoneme (G2P) conversion, and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores were optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that sought to minimize the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters were also regularized during the optimization process to restrict them to reasonable values. The second method was based on long short-term memory (LSTM) and convolutional neural networks, which captured the temporal dependence and the spatial structure of the acoustic features, respectively. Neural network regression models were trained that took acoustic features as inputs and produced articulatory trajectories as outputs. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on a smoothness loss over the articulatory trajectories, and another based on an acoustic loss between the original and predicted acoustic features. The best-performing genetic algorithm and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 for speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% for speaker-independent utterances of German words, respectively. When applied to German sentence utterances, as well as English and Mandarin Chinese word utterances, the neural-network-based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results showed that both methods reproduced not only the articulatory processes but also the acoustic signals of the reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and to estimated articulatory trajectories better matched to VTL's articulatory preferences, thus reproducing more natural and intelligible speech. This study also found that the convolutional layers, when used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Furthermore, the neural-network-based ACS systems trained on German data generalized to utterances of other languages.
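
    The core of the first method is the ABS loop: synthesize from candidate gestural parameters, compare acoustic features with the natural utterance via cosine distance, and let a genetic algorithm drive the parameters toward a better match. The sketch below is a minimal, hypothetical rendering of that loop in Python; `synthesize` stands in for a call into VocalTractLab (whose real API differs), and the parameter encoding, selection scheme, and regularizer are illustrative assumptions rather than the thesis's actual implementation.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two flattened acoustic feature matrices."""
    a, b = a.ravel(), b.ravel()
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def fitness(params, natural_feats, synthesize, reg_weight=0.1):
    """Lower is better: acoustic mismatch plus a penalty that keeps
    (normalized) articulatory parameters inside a plausible range."""
    synthetic_feats = synthesize(params)  # hypothetical VTL synthesis hook
    acoustic_loss = cosine_distance(synthetic_feats, natural_feats)
    out_of_range = np.clip(np.abs(params) - 1.0, 0.0, None).mean()
    return acoustic_loss + reg_weight * out_of_range

def evolve(init_params, natural_feats, synthesize,
           pop_size=48, generations=200, sigma=0.05, seed=0):
    """Minimal truncation-selection genetic loop around the ABS fitness."""
    rng = np.random.default_rng(seed)
    population = [init_params + sigma * rng.standard_normal(init_params.shape)
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda p: fitness(p, natural_feats, synthesize))
        parents = population[:pop_size // 4]        # keep the best quarter
        population = [p + sigma * rng.standard_normal(p.shape)
                      for p in parents for _ in range(4)]  # mutated offspring
    return min(population, key=lambda p: fitness(p, natural_feats, synthesize))
```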

    Robust visual speech recognition using optical flow analysis and rotation invariant features

    The focus of this thesis is to develop computer vision algorithms for a visual speech recognition system that identifies visemes. The majority of existing speech recognition systems are based on audio-visual signals, were developed for speech enhancement, and are prone to acoustic noise. Considering this problem, the aim of this research is to investigate and develop a visual-only speech recognition system suitable for noisy environments. Potential applications of such a system include lip-reading mobile phones, human-computer interfaces (HCI) for mobility-impaired users, robotics, surveillance, improvement of speech-based computer control in noisy environments, and rehabilitation of persons who have undergone laryngectomy surgery. In the literature, several models and algorithms are available for visual feature extraction. These features are extracted from static mouth images and characterized as appearance- and shape-based features. However, these methods rarely incorporate the time-dependent information of mouth dynamics. This dissertation presents two optical flow based approaches to visual feature extraction, which capture the mouth motions in an image sequence. The motivation for using motion features is that human lip-reading perception is concerned with the temporal dynamics of mouth motion. The first approach extracts features from the vertical component of the optical flow. The vertical component is decomposed into multiple non-overlapping fixed-scale blocks, and statistical features of each block are computed for successive video frames of an utterance. To overcome the large variation in speaking rate, each utterance is normalized using a simple linear interpolation method. In the second approach, four directional motion templates based on optical flow are developed, each representing the consolidated motion information of an utterance in one of four directions (i.e., up, down, left, and right). This approach is an evolution of a view-based approach known as the motion history image (MHI). One of the main issues with the MHI method is motion overwriting caused by self-occlusion; the proposed directional motion history images (DMHIs) resolve this overwriting issue. Two types of image descriptors, Zernike moments and Hu moments, are used to represent each image of the DMHIs. A support vector machine (SVM) classifier was used to classify the features obtained from the optical flow vertical component, and from the Zernike and Hu moments, separately. For the identification of visemes, a multiclass SVM approach was employed. A video speech corpus of seven subjects was used to evaluate the efficiency of the proposed methods for lip-reading. The experimental results demonstrate the promising performance of the optical flow based mouth movement representations. A performance comparison between DMHI and MHI based on Zernike moments shows that the DMHI technique outperforms the MHI technique. A video-based ad hoc temporal segmentation method for isolated utterances is also proposed in the thesis. It detects the start and end frames of an utterance in an image sequence, based on a pair-wise pixel comparison method. The efficiency of the proposed technique was tested on the available data set, which contains short pauses between utterances.
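
    As a concrete illustration of the first approach, the sketch below computes dense optical flow between consecutive mouth frames, takes per-block statistics of the vertical component, and linearly interpolates each utterance to a fixed length. It is a minimal sketch under stated assumptions: grayscale frames, a 4x4 block grid, mean/std block statistics, and a target length of 30 frames are illustrative choices, not the thesis's exact configuration.

```python
import cv2
import numpy as np

def vertical_flow_block_features(frames, grid=(4, 4)):
    """Mean and std of the optical-flow vertical component over a grid of
    non-overlapping blocks, for each pair of consecutive grayscale frames."""
    feats = []
    for prev, nxt in zip(frames, frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        v = flow[..., 1]                       # vertical flow component
        bh, bw = v.shape[0] // grid[0], v.shape[1] // grid[1]
        stats = []
        for i in range(grid[0]):
            for j in range(grid[1]):
                block = v[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
                stats.extend([block.mean(), block.std()])
        feats.append(stats)
    return np.asarray(feats)                   # (n_frames - 1, 2 * n_blocks)

def normalize_length(feats, target_len=30):
    """Linear interpolation to a fixed frame count, compensating for
    variation in speaking rate across utterances."""
    src = np.linspace(0.0, 1.0, len(feats))
    dst = np.linspace(0.0, 1.0, target_len)
    return np.stack([np.interp(dst, src, feats[:, k])
                     for k in range(feats.shape[1])], axis=1)
```

    Each normalized utterance can then be flattened into a single feature vector and classified with a multiclass SVM (e.g. sklearn.svm.SVC), mirroring the classification stage described above.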

    Max Planck Institute for Psycholinguistics: Annual report 1996

    Acoustic correlates of encoded prosody in written conversation

    This thesis presents an analysis of certain punctuation devices, such as parentheses, italics, and emphatic spellings, with respect to their acoustic correlates in read speech. The class of punctuation devices under investigation is referred to as prosodic markers. The thesis therefore presents an analysis of features of the spoken language that are represented symbolically in text: a characterization of aspects of the spoken language that have been transcribed or symbolized in the written medium and then translated back into a spoken form by a reader. The thesis focuses in particular on the analysis of parenthesis, the examination of encoded prominence and emphasis, and the use of paralinguistic markers that signal attitude or emotion. In an effort to avoid self-constructed or artificial material containing arbitrary symbolic or prosodic encodings, all material used for empirical analysis was taken from examples of electronic written exchanges on the Internet, such as electronic mail messages and articles posted on electronic newsgroups and news bulletins. This medium of language, referred to here as written conversation, provides a rich source of material containing encoded prosodic markers. These occur in the form of 'smiley faces' expressing attitudes or feelings; words highlighted by a number of means, such as capitalization, italics, underscore characters, or asterisks; and dashes or parentheses, which suggest how the information in a text or sentence may be structured with regard to its informational content. Chapter 2 investigates in detail the genre of written conversation with respect to its place in an emerging continuum between written and spoken language, concentrating on transcriptional devices and their function as indicators of prosody. The implications these symbolic representations carry for the task of reading, by humans as well as machines, are then examined. Chapters 3 and 4 turn to the acoustic analysis of parentheticals and emphasis markers, respectively. The experimental work in this thesis is based on readings of a corpus of selected materials from written conversation, with the acoustic analysis concentrating on the differences between readings of texts with prosodic markers and readings of the same texts from which the prosodic markers have been removed. Finally, the effect of prosodic markers is tested in perception experiments involving both human and resynthesized utterances.

    Relational Reasoning and Socially Shared Regulation of Learning in Collaborative Problem Solving

    The ability to solve complex problems in collaborative settings is considered a critical 21st-century competency. Yet national and international reports have revealed deficiencies in both students' and employees' teamwork and communication skills, which are essential when working collaboratively. These deficits may stem from a limited understanding of how cognitive and social processes operate in concert as team members work together to solve complex problems. The current study investigated how two specific processes, relational reasoning (RR) and socially shared regulation of learning (SSRL), unfold during a collaborative problem-solving (CPS) task. Specifically, the researcher assessed the extent to which different teams exhibited differential proportions of reasoning and regulation, how team activity was distributed across individuals, and whether frequent sequences of reasoning and regulation could be identified. To address these aims, four teams of senior undergraduate students (n = 22) were recruited from a capstone design course in mechanical engineering. Over the course of the semester, the teams conceptualized and prototyped a design to address a current market need. Each team was video-recorded during the conceptualization process, specifically as the teams evaluated and eliminated ideas from their corpus of designs. Team conversations were transcribed, segmented into utterances, and coded for the presence of RR, SSRL, and task-related and other talk. Results from chi-square tests of independence, social network analysis, and sequence mining revealed that teams indeed exhibited differential proportions of RR and SSRL, with antinomous reasoning and the monitoring and control of consensus emerging as key CPS processes. Further, planning and reflection acted as bookends to CPS, while RR and monitoring processes co-occurred in the interim. Finally, CPS alternated between periods of activity that were shared more and less equally among team members. This study contributes to the literature on CPS by exploring the dynamic interplay between RR and SSRL and by demonstrating that CPS can be investigated at the micro, meso, and macro levels. Methodologically, it demonstrates how data mining techniques and compelling visualizations can illustrate the recursive and cyclical character of RR and SSRL. Finally, limitations are noted, and implications for research and practice are put forward.
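
    To make the first analysis step concrete: coding each utterance and cross-tabulating code frequencies by team yields a contingency table to which a chi-square test of independence can be applied. The sketch below shows this with scipy; the counts are invented placeholders for illustration, not the study's data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical utterance counts: rows are the four teams, columns are
# the code categories (RR, SSRL, task-related, other). Placeholder data.
counts = np.array([
    [120,  95, 210, 40],
    [ 80, 130, 180, 55],
    [140,  70, 160, 35],
    [ 90, 110, 200, 60],
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.4f}")
# A small p-value suggests the teams differ in their proportions of codes.
```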

    Audio-Visual Speech Enhancement Based on Deep Learning

    The Relationship Between Parental Literacy and Language Practices and Beliefs and Toddlers’ Emergent Literacy Skills

    This study examined parental beliefs and practices regarding early literacy and language and how these influence the observed literacy behaviors of parents' 18- to 36-month-old toddlers. The observed literacy behaviors of the toddlers comprised phonological awareness (PA) and written language awareness (WLA). The objectives of this study were to (a) characterize the emergent literacy behaviors of toddlers, (b) characterize parental beliefs and practices regarding emergent literacy with respect to toddlers, and (c) determine the relationship between toddlers' emergent literacy behaviors and their parents' beliefs and behaviors. Participants were 15 mother-toddler dyads. The mean age of the mothers was 35.2 years (SD = 4.0), and all qualified as middle to upper socioeconomic status according to Hollingshead (1975). The toddlers were typically developing, with a mean age of 26.74 months (SD = 5.3). Quantitative and qualitative methods were used to collect and analyze the data. The main data collection measures were (a) a home literacy environment observation, (b) shared reading observations, and (c) a parent questionnaire. Data from these measures were coded and organized into the emergent literacy domains of PA and WLA. Based on pilot findings, only one PA category (i.e., rhyming) was used. Four main WLA categories were used: (a) book conventions, (b) print conventions, (c) letter knowledge, and (d) story grammar. Characteristics of child-directed speech during the shared reading interactions were also analyzed: (a) topic-initiating utterances, (b) conversation-eliciting utterances, (c) behavior-directing utterances, (d) mean length of utterance, (e) rate, and (f) length of turn. Results indicated that these mothers engaged in several emergent literacy and language practices with their toddlers, and some of these behaviors were correlated with the observable emergent literacy skills of their children. For example, parents' book convention behaviors were moderately correlated with toddlers' letter knowledge behaviors (r = .549, p = .017); parents' letter knowledge behaviors were moderately correlated with toddlers' letter knowledge behaviors (r = .524, p = .023); and mothers' length of turn was moderately correlated with toddlers' print convention behaviors (r = .618, p = .007). These and other findings are discussed with regard to their implications for early intervention practices.

    Perception and Acquisition of Natural Authentic English Speech for Chinese Learners Using DIT's Speech Technologies

    Given that Chinese language learners are strongly influenced by their mother tongue, which is a tone language rather than an intonation language, learning to cope with authentic English speech seems more difficult for them than for learners from other language backgrounds. The focus of the current research is, on the basis of an analysis of the nature of spoken English and spoken Chinese, to help Chinese learners derive benefit from ICT tools developed at the Technological University Dublin (DIT). The thesis concentrates on investigating the application of speech technologies to bridge the gap between students' internalised, idealised formulations and natural, authentic English speech. Part of the testing carried out by the present author demonstrates the acceptability of a slow-down algorithm in helping Chinese learners of English reproduce formulaic language. This algorithm is useful because it can slow audio files to any desired speed between 100% and 40% without distortion, allowing language learners to attend to the real, rapid flow of 'messy' speech and follow the intonation patterns it contains. The rationale for, and the application of, natural dialogic native-to-native English speech in language learning is also explored. The Chinese language learners involved in this study are exposed to authentic, native speech patterns through access to real, informal dialogue in various contexts. In the course of this analysis, the influence of speed of delivery and pitch range on the categorisation of formulaic language is also investigated. The study investigates the potential of the speech tools available to the present author as an effective EFL learning facility, especially for speakers of tone languages, and their role in helping language learners achieve confluent interaction in an English L1 environment.
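
    The slow-down tool itself is specific to this research, but pitch-preserving time-scale modification of the kind it performs is available off the shelf. As a rough stand-in, the sketch below slows a recording with librosa's phase-vocoder time stretch; the file names and the rate are placeholders, and the thesis's own algorithm is a different implementation.

```python
import librosa
import soundfile as sf

# Load a recording and slow it to 60% of the original speed without
# changing pitch. rate=1.0 keeps the original speed; rate=0.4 matches
# the slowest setting discussed above.
y, sr = librosa.load("utterance.wav", sr=None)
y_slow = librosa.effects.time_stretch(y, rate=0.6)
sf.write("utterance_slow.wav", y_slow, sr)
```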

    Gesture and Speech in Interaction - 4th edition (GESPIN 4)

    The fourth edition of Gesture and Speech in Interaction (GESPIN) was held in Nantes, France. With more than 40 papers, these proceedings show just what a flourishing field of enquiry gesture studies continues to be. The keynote speeches of the conference addressed three different aspects of multimodal interaction: gesture and grammar, gesture acquisition, and gesture and social interaction. In a talk entitled Qualities of event construal in speech and gesture: Aspect and tense, Alan Cienki presented an ongoing research project on narratives in French, German and Russian, a project that focuses especially on the verbal and gestural expression of grammatical tense and aspect in narratives in the three languages. Jean-Marc Colletta's talk, entitled Gesture and Language Development: towards a unified theoretical framework, described the joint acquisition and development of speech and early conventional and representational gestures. In Grammar, deixis, and multimodality between code-manifestation and code-integration or why Kendon's Continuum should be transformed into a gestural circle, Ellen Fricke proposed a revisited grammar of noun phrases that integrates gestures as part of the semiotic and typological codes of individual languages. From a pragmatic and cognitive perspective, Judith Holler explored the use of gaze and hand gestures as means of organizing turns at talk as well as establishing common ground in a presentation entitled On the pragmatics of multi-modal face-to-face communication: Gesture, speech and gaze in the coordination of mental states and social interaction. Among the talks and posters presented at the conference, the vast majority of topics related, quite naturally, to gesture and speech in interaction, understood both in terms of the mapping of units in different semiotic modes and of the use of gesture and speech in social interaction. Several presentations explored the effects of impairments (such as diseases or the natural ageing process) on gesture and speech. The communicative relevance of gesture and speech and audience design in natural interactions, as well as in more controlled settings like television debates and reports, was another topic addressed during the conference. Some participants also presented research on first and second language learning, while others discussed the relationship between gesture and intonation. While most participants presented research on gesture and speech from an observer's perspective, be it in semiotics or pragmatics, some nevertheless focused on another important aspect: the cognitive processes involved in language production and perception. Last but not least, participants also presented talks and posters on the computational analysis of gestures, whether involving external devices (e.g. mocap, Kinect) or concerning the use of specially designed computer software for the post-treatment of gestural data. Importantly, new links were made between semiotics and mocap data.