    ARTICULATORY INFORMATION FOR ROBUST SPEECH RECOGNITION

    Current Automatic Speech Recognition (ASR) systems fail to perform nearly as well as human listeners because they lack robustness against speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, put forth different ways to address them, and finally present an ASR architecture based upon these robustness criteria. Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as `beads-on-a-string', where the beads are the individual phone units. While phone units are distinct in the cognitive domain, they vary in the physical domain, and their variation arises from a combination of factors including speech style and speaking rate; this phenomenon is commonly known as `coarticulation'. Traditional ASR systems address such coarticulatory variations by using contextualized phone units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research, a study was performed using synthetically generated speech to obtain a proof of concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that having vocal tract constriction trajectories (TVs) as an intermediate representation facilitated the gesture recognition task. Presently, no natural speech database contains articulatory gesture annotations; hence, an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs. Two natural speech databases, X-ray microbeam and Aurora-2, were annotated; the former was used to train a TV estimator and the latter was used to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observations: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs estimated from the acoustic speech signal. In this setup the articulatory gestures were modeled as hidden random variables, eliminating the need for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only help to account for coarticulatory variations but also significantly improve the noise robustness of the ASR system.
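
    The dissertation abstract gives no code, but the annotation step can be illustrated with a small, hypothetical sketch in Python: a synthetic utterance whose gestural boundaries are known is time-aligned to a natural utterance by dynamic time warping (DTW) over acoustic feature frames, and the known boundary frames are carried across the warping path. The feature dimensionality, frame indices, and function names below are illustrative assumptions, not the architecture described in the thesis (which iterates this alignment).

        import numpy as np

        def dtw_path(A, B):
            """Plain DTW over two (frames x dims) feature matrices; returns the warping path."""
            n, m = len(A), len(B)
            cost = np.full((n + 1, m + 1), np.inf)
            cost[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    d = np.linalg.norm(A[i - 1] - B[j - 1])
                    cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
            path, i, j = [], n, m            # backtrack from the end to recover the alignment
            while i > 0 and j > 0:
                path.append((i - 1, j - 1))
                step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
                if step == 0:
                    i, j = i - 1, j - 1
                elif step == 1:
                    i -= 1
                else:
                    j -= 1
            return path[::-1]

        def map_gesture_frames(path, synth_frames):
            """Map gesture boundary frames annotated on the synthetic utterance onto natural frames."""
            synth_to_nat = {}
            for si, ni in path:
                synth_to_nat.setdefault(si, ni)   # first natural frame matched to each synthetic frame
            return [synth_to_nat.get(f) for f in synth_frames]

        # Toy usage with random MFCC-like features (13 coefficients per frame assumed).
        rng = np.random.default_rng(0)
        synth_feats = rng.normal(size=(120, 13))
        nat_feats = rng.normal(size=(150, 13))
        path = dtw_path(synth_feats, nat_feats)
        print(map_gesture_frames(path, [10, 45, 90]))   # natural-frame indices of three boundaries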

    Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab

    Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances in terms of both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL) to address or mitigate the existing problems of speech inversion, such as non-unique mapping, acoustic variation among different speakers, and the time-consuming nature of the process. The first method involved finding appropriate VTL gestural scores for given natural utterances using a genetic algorithm. It consisted of two steps: gestural score initialization and optimization. In the first step, gestural scores were initialized from the given acoustic signals using speech recognition, grapheme-to-phoneme (G2P) conversion, and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores were optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that minimized the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters were also regularized during the optimization process to restrict them to reasonable values. The second method was based on long short-term memory (LSTM) and convolutional neural networks, which were responsible for capturing the temporal dependence and the spatial structure of the acoustic features, respectively. Neural network regression models were trained that used acoustic features as inputs and produced articulatory trajectories as outputs. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on the smoothness loss of articulatory trajectories and another based on the acoustic loss between original and predicted acoustic features. The best-performing genetic algorithm and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 for speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% for speaker-independent utterances of German words, respectively. When applied to German sentence utterances, as well as English and Mandarin Chinese word utterances, the neural-network-based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results showed that both methods reproduced not only the articulatory processes but also the acoustic signals of the reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and to estimated articulatory trajectories that better matched the articulation preferred by VTL, thus reproducing more natural and intelligible speech. This study also found that convolutional layers, when used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Furthermore, the neural-network-based ACS systems trained on German data could be generalized to the utterances of other languages.
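
    As a rough illustration of the second method's general shape, the sketch below implements a convolutional-plus-LSTM regression from acoustic features to articulatory trajectories with a smoothness penalty, written in PyTorch. The layer sizes, feature dimensions, and the weight lambda_smooth are assumptions made for the example; the thesis's actual network configuration and loss weighting are not reproduced here.

        import torch
        import torch.nn as nn

        class ConvLSTMRegressor(nn.Module):
            def __init__(self, n_acoustic=40, n_articulatory=20, hidden=128):
                super().__init__()
                # Conv1d + BatchNorm capture local spectral structure across time.
                self.conv = nn.Sequential(
                    nn.Conv1d(n_acoustic, 64, kernel_size=5, padding=2),
                    nn.BatchNorm1d(64),
                    nn.ReLU(),
                )
                # A bidirectional LSTM captures longer-range temporal dependence.
                self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
                self.out = nn.Linear(2 * hidden, n_articulatory)

            def forward(self, x):                    # x: (batch, time, n_acoustic)
                h = self.conv(x.transpose(1, 2))     # -> (batch, 64, time)
                h, _ = self.lstm(h.transpose(1, 2))  # -> (batch, time, 2 * hidden)
                return self.out(h)                   # -> (batch, time, n_articulatory)

        def smoothness_loss(pred):
            """Penalize large frame-to-frame jumps in the predicted articulatory trajectories."""
            return ((pred[:, 1:] - pred[:, :-1]) ** 2).mean()

        # Toy batch: 8 utterances, 200 frames, 40 acoustic and 20 articulatory dimensions.
        model = ConvLSTMRegressor()
        acoustic = torch.randn(8, 200, 40)
        target = torch.randn(8, 200, 20)
        pred = model(acoustic)
        lambda_smooth = 0.1                          # assumed regularization weight
        loss = nn.functional.mse_loss(pred, target) + lambda_smooth * smoothness_loss(pred)
        loss.backward()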

    Generating gestural timing from EMA data using articulatory resynthesis

    As part of ongoing work to integrate an articulatory synthesizer into a modular TTS platform, a method is presented that allows gestural timings to be generated automatically from EMA data. Further work is outlined that will adapt the vocal tract model and phoneset to English using new articulatory data and will use statistical trajectory models.
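
    The abstract does not detail the extraction step, but one common way to derive gestural timing from EMA sensor trajectories is to locate landmarks on the tangential-velocity profile; the hypothetical sketch below marks a gesture's onset and offset where the sensor speed first and last exceeds a fraction of its peak. The 20% threshold, sampling rate, and toy trajectory are assumptions for illustration only, not the paper's method.

        import numpy as np

        def gesture_landmarks(xy, fs=200.0, threshold=0.2):
            """xy: (frames, 2) EMA sensor positions in mm; fs: sampling rate in Hz."""
            vel = np.gradient(xy, axis=0) * fs                      # mm/s per dimension
            speed = np.linalg.norm(vel, axis=1)                     # tangential velocity
            above = speed >= threshold * speed.max()
            onset = int(np.argmax(above))                           # first frame above threshold
            offset = len(above) - 1 - int(np.argmax(above[::-1]))   # last frame above threshold
            return onset / fs, offset / fs                          # times in seconds

        # Toy closing-opening movement of a tongue-tip sensor.
        t = np.linspace(0.0, 0.3, 60)
        xy = np.stack([np.zeros_like(t), 5.0 * np.sin(np.pi * t / 0.3)], axis=1)
        print(gesture_landmarks(xy))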

    Dance-the-Music: an educational platform for the modeling, recognition and audiovisual monitoring of dance steps using spatiotemporal motion templates

    In this article, a computational platform is presented, entitled “Dance-the-Music”, that can be used in a dance educational context to explore and learn the basics of dance steps. By introducing a method based on spatiotemporal motion templates, the platform makes it possible to train basic step models from sequentially repeated dance figures performed by a dance teacher. Movements are captured with an optical motion capture system. The teacher's models can be visualized from a first-person perspective to instruct students how to perform the specific dance steps in the correct manner. Moreover, recognition algorithms based on a template-matching method can determine the quality of a student's performance in real time by means of multimodal monitoring techniques. The results of an evaluation study suggest that Dance-the-Music is effective in helping dance students master the basics of dance figures.
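
    To make the template-matching idea concrete, here is a minimal sketch under assumed data conventions (each recording is a frames x joints x 3 array of motion-capture coordinates); it is not the Dance-the-Music implementation. Teacher repetitions of one step are time-normalized and averaged into a template, and a student's performance is labeled with the nearest template.

        import numpy as np

        def resample(track, n=50):
            """Time-normalize a motion track to n frames by linear interpolation."""
            idx = np.linspace(0, len(track) - 1, n)
            lo, hi = np.floor(idx).astype(int), np.ceil(idx).astype(int)
            w = (idx - lo)[:, None, None]
            return (1 - w) * track[lo] + w * track[hi]

        def build_template(repetitions):
            """Average several time-normalized repetitions of one dance step by the teacher."""
            return np.mean([resample(r) for r in repetitions], axis=0)

        def classify(performance, templates):
            """Return the step label whose template is closest to the student's performance."""
            p = resample(performance)
            scores = {label: np.linalg.norm(p - tpl) for label, tpl in templates.items()}
            return min(scores, key=scores.get)

        # Toy data: three teacher repetitions of one step, 15 joints tracked in 3D.
        rng = np.random.default_rng(1)
        teacher = {"basic_step": [rng.normal(size=(f, 15, 3)) for f in (48, 52, 50)]}
        templates = {label: build_template(reps) for label, reps in teacher.items()}
        student = rng.normal(size=(55, 15, 3))
        print(classify(student, templates))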

    Speech and language therapy for aphasia following stroke

    Background
    Aphasia is an acquired language impairment following brain damage that affects some or all language modalities: expression and understanding of speech, reading, and writing. Approximately one third of people who have a stroke experience aphasia.
    Objectives
    To assess the effects of speech and language therapy (SLT) for aphasia following stroke.
    Search methods
    We searched the Cochrane Stroke Group Trials Register (last searched 9 September 2015), CENTRAL (2015, Issue 5) and other Cochrane Library databases (CDSR, DARE, HTA, to 22 September 2015), MEDLINE (1946 to September 2015), EMBASE (1980 to September 2015), CINAHL (1982 to September 2015), AMED (1985 to September 2015), LLBA (1973 to September 2015), and SpeechBITE (2008 to September 2015). We also searched major trials registers for ongoing trials, including ClinicalTrials.gov (to 21 September 2015), the Stroke Trials Registry (to 21 September 2015), Current Controlled Trials (to 22 September 2015), and WHO ICTRP (to 22 September 2015). In an effort to identify further published, unpublished, and ongoing trials, we also handsearched the International Journal of Language and Communication Disorders (1969 to 2005) and reference lists of relevant articles, and we contacted academic institutions and other researchers. There were no language restrictions.
    Selection criteria
    Randomised controlled trials (RCTs) comparing SLT (a formal intervention that aims to improve language and communication abilities, activity and participation) versus no SLT; social support or stimulation (an intervention that provides social support and communication stimulation but does not include targeted therapeutic interventions); or another SLT intervention (differing in duration, intensity, frequency, intervention methodology or theoretical approach).
    Data collection and analysis
    We independently extracted the data and assessed the quality of included trials. We sought missing data from investigators.
    Main results
    We included 57 RCTs (74 randomised comparisons) involving 3002 participants in this review (some appearing in more than one comparison). Twenty-seven randomised comparisons (1620 participants) assessed SLT versus no SLT; SLT resulted in clinically and statistically significant benefits to patients' functional communication (standardised mean difference (SMD) 0.28, 95% confidence interval (CI) 0.06 to 0.49, P = 0.01), reading, writing, and expressive language, but (based on smaller numbers) benefits were not evident at follow-up. Nine randomised comparisons (447 participants) assessed SLT versus social support and stimulation; meta-analyses found no evidence of a difference in functional communication, but more participants withdrew from social support interventions than from SLT. Thirty-eight randomised comparisons (1242 participants) assessed two approaches to SLT. Functional communication was significantly better in people with aphasia who received therapy at a high intensity, high dose, or over a long duration compared with those who received therapy at a lower intensity, lower dose, or over a shorter period of time. The benefits of a high intensity or a high dose of SLT were confounded by a significantly higher dropout rate in these intervention groups. Generally, trials randomised small numbers of participants across a range of characteristics (age, time since stroke, and severity profiles), interventions, and outcomes.
    Authors' conclusions
    Our review provides evidence of the effectiveness of SLT for people with aphasia following stroke in terms of improved functional communication, reading, writing, and expressive language compared with no therapy. There is some indication that therapy at high intensity, high dose, or over a longer period may be beneficial. However, high-intensity and high-dose interventions may not be acceptable to all.
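
    For readers unfamiliar with the effect measure reported above, the sketch below shows how a standardised mean difference (SMD) and its 95% confidence interval can be computed for a single trial and pooled across trials with fixed-effect inverse-variance weighting. The trial numbers are invented for illustration; they are not data from the review, and the review's own meta-analytic models may differ.

        import math

        def smd(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
            """Standardised mean difference (Cohen's d) with its approximate variance."""
            pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
            d = (mean_t - mean_c) / pooled_sd
            var = (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))
            return d, var

        def pool(effects):
            """Fixed-effect inverse-variance pooled estimate with a 95% confidence interval."""
            weights = [1.0 / var for _, var in effects]
            est = sum(w * d for (d, _), w in zip(effects, weights)) / sum(weights)
            se = math.sqrt(1.0 / sum(weights))
            return est, (est - 1.96 * se, est + 1.96 * se)

        trials = [smd(32.0, 10.0, 40, 29.0, 11.0, 38),   # invented functional-communication scores
                  smd(55.0, 14.0, 60, 51.0, 15.0, 62)]
        print(pool(trials))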

    Synchronization of Speech and Gesture: Evidence for Interaction in Action

    Peer-reviewed postprint.

    A systematic investigation of gesture kinematics in evolving manual languages in the lab

    Silent gestures consist of complex multi-articulatory movements but are now primarily studied through categorical coding of the referential gesture content. The relation of categorical linguistic content to continuous kinematics is therefore poorly understood. Here, we reanalyzed the video data from a gestural evolution experiment (Motamedi, Schouwstra, Smith, Culbertson, & Kirby, 2019), which showed increases in the systematicity of gesture content over time. We applied computer vision techniques to quantify the kinematics of the original data. Our kinematic analyses demonstrated that gestures become more efficient and less complex in their kinematics over generations of learners. We further detect the systematicity of gesture form at the level of the gesture kinematic interrelations, which directly scales with the systematicity obtained from semantic coding of the gestures. Thus, from continuous kinematics alone, we can tap into linguistic aspects that were previously only approachable through categorical coding of meaning. Finally, going beyond issues of systematicity, we show how unique gesture kinematic dialects emerged over generations as isolated chains of participants gradually diverged over iterations from other chains. We thereby conclude that gestures can come to embody the linguistic system at the level of interrelationships between communicative tokens, which should calibrate our theories about form and linguistic content.
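
    The kind of kinematic quantification described can be illustrated with a hypothetical sketch: given 2D wrist keypoints per video frame (e.g., from a pose estimator), compute simple efficiency and complexity measures such as path length, peak speed, and the number of submovements (velocity peaks). The frame rate, peak threshold, and toy trajectory are assumptions; the study's actual feature set and pipeline are not reproduced here.

        import numpy as np

        def kinematic_features(keypoints, fps=25.0):
            """keypoints: (frames, 2) wrist positions in normalized image coordinates."""
            step = np.diff(keypoints, axis=0)
            speed = np.linalg.norm(step * fps, axis=1)
            path_length = float(np.sum(np.linalg.norm(step, axis=1)))
            # Submovements: local maxima of the speed profile above a small threshold.
            peaks = [i for i in range(1, len(speed) - 1)
                     if speed[i] > speed[i - 1] and speed[i] > speed[i + 1] and speed[i] > 0.05]
            return {"path_length": path_length,
                    "peak_speed": float(speed.max()),
                    "n_submovements": len(peaks)}

        # Toy trajectory: a random walk standing in for a tracked wrist.
        rng = np.random.default_rng(2)
        wrist = np.cumsum(rng.normal(scale=0.01, size=(100, 2)), axis=0)
        print(kinematic_features(wrist))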