
    Two and three-dimensional visual articulatory models for pronunciation training and for treatment of speech disorders

    Visual articulatory models can be used to visualize vocal tract articulatory speech movements. This information may be helpful in pronunciation training or in the treatment of speech disorders. To test this hypothesis, speech recognition rates were quantified for mute animations of vocalic and consonantal speech movements generated by a 2D and a 3D visual articulatory model. The visually based speech sound recognition test (mimicry test) was performed by two groups of eight children (five to eight years old) matched in age and sex. The children were asked to mimic the mute speech movement animations for different speech sounds. Recognition rates were significantly above chance but showed no significant difference between the two models. Children older than five years are capable of interpreting vocal tract articulatory speech sound movements in a speech-adequate way without any preparatory training. The complex 3D display of vocal tract articulatory movements provides no significant advantage over the visually simpler 2D midsagittal display.
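    The abstract reports recognition rates significantly above chance without giving the raw counts. As a purely hypothetical illustration of such a test, the sketch below runs a one-sided binomial test of invented per-child counts against an assumed chance level; the trial numbers, correct counts, and number of response alternatives are all assumptions, not the study's data.

```python
# Hypothetical illustration: testing whether a child's recognition rate in a
# mute-animation mimicry test exceeds chance. All counts below are invented;
# the paper reports only that rates were significantly above chance.
from scipy.stats import binomtest

n_trials = 40          # animations shown to one child (assumed)
n_correct = 22         # correctly mimicked speech sounds (assumed)
n_alternatives = 8     # response alternatives per trial (assumed)
chance = 1 / n_alternatives

result = binomtest(n_correct, n_trials, p=chance, alternative="greater")
print(f"recognition rate = {n_correct / n_trials:.2f}, "
      f"chance = {chance:.2f}, p = {result.pvalue:.4f}")
```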

    An interactive speech training system with virtual reality articulation for Mandarin-speaking hearing impaired children

    The present project involved the development of a novel interactive speech training system based on virtual reality articulation, and an examination of the efficacy of the system for hearing impaired (HI) children. Twenty meaningful Mandarin words were presented to the HI children via a 3-D talking head during articulation training. Electromagnetic Articulography (EMA) and graphic transform technology were used to depict the movements of various articulators. In addition, speech corpora were organized into the listening and speaking training modules of the system to help improve the language skills of the HI children. The accuracy of the virtual reality articulatory movement was evaluated through a series of experiments. Finally, a pilot test was performed to train two HI children using the system. Preliminary results showed improvement in speech production by the HI children, and the system was judged acceptable and interesting by the children. It can be concluded that the training system is effective and valid for articulation training of HI children. © 2013 IEEE.
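    The paper's EMA-to-animation pipeline is not described in detail here. The following minimal sketch shows one generic way EMA sensor trajectories could be conditioned before driving a talking-head model: low-pass filtering to suppress sensor noise, then normalizing to a range usable as animation weights. The sampling rate, cutoff frequency, and synthetic trajectory are assumptions.

```python
# Hypothetical sketch: conditioning EMA sensor trajectories before mapping
# them onto a 3-D talking head. The paper's actual pipeline is not public;
# the sensor trajectory, sampling rate, and cutoff are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

ema_rate_hz = 200                      # typical EMA sampling rate (assumed)
t = np.arange(0, 1.0, 1 / ema_rate_hz)
# Fake tongue-tip x/y trajectory standing in for recorded EMA data.
tongue_tip = np.stack([np.sin(2 * np.pi * 3 * t),
                       np.cos(2 * np.pi * 3 * t)], axis=1)
tongue_tip += 0.05 * np.random.default_rng(0).standard_normal(tongue_tip.shape)

# Low-pass filter: articulatory movement energy is mostly below ~15 Hz.
b, a = butter(4, 15, btype="low", fs=ema_rate_hz)
smoothed = filtfilt(b, a, tongue_tip, axis=0)

# Normalize to [0, 1] so the trajectory can drive a blendshape or bone weight.
lo, hi = smoothed.min(axis=0), smoothed.max(axis=0)
animation_weights = (smoothed - lo) / (hi - lo)
```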

    Parallel Reference Speaker Weighting for Kinematic-Independent Acoustic-to-Articulatory Inversion

    Acoustic-to-articulatory inversion, the estimation of articulatory kinematics from an acoustic waveform, is a challenging but important problem. Accurate estimation of articulatory movements has the potential for significant impact on our understanding of speech production, on our capacity to assess and treat pathologies in a clinical setting, and on speech technologies such as computer-aided pronunciation assessment and audio-video synthesis. However, because of the complex and speaker-specific relationship between articulation and acoustics, existing approaches for inversion do not generalize well across speakers. As acquiring speaker-specific kinematic data for training is not feasible in many practical applications, this remains an important and open problem. This paper proposes a novel approach to acoustic-to-articulatory inversion, Parallel Reference Speaker Weighting (PRSW), which requires no kinematic data for the target speaker and only a small amount of acoustic adaptation data. PRSW hypothesizes that acoustic and kinematic similarities are correlated and uses speaker-adapted articulatory models built from acoustically derived weights. The system was assessed using a 20-speaker data set of synchronous acoustic and Electromagnetic Articulography (EMA) kinematic data. Results demonstrate that by restricting the reference group to a subset of speakers with strong individual speaker-dependent inversion performance, the PRSW method attains kinematic-independent acoustic-to-articulatory inversion performance nearly matching that of the speaker-dependent model, with an average correlation of 0.62 versus 0.63. This indicates that, given a sufficiently complete and appropriately selected reference speaker set for adaptation, it is possible to create effective articulatory models without kinematic training data.
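    A minimal sketch of the reference-speaker-weighting idea described above, assuming mean acoustic feature vectors as the speaker representation and per-speaker linear inversion maps: the weights are derived purely from acoustic similarity and used to combine the reference speakers' articulatory models. The feature choice, distance measure, and softmax-style weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of reference speaker weighting: the target speaker's
# articulatory model is a weighted sum of reference speakers' models, with
# weights derived from acoustic similarity alone (no target kinematics).
import numpy as np

rng = np.random.default_rng(0)
n_refs, acoustic_dim, artic_dim = 20, 39, 12

# Per-reference acoustic means and (acoustic -> articulatory) linear maps,
# standing in for trained speaker-dependent inversion models (assumed form).
ref_acoustic_means = rng.standard_normal((n_refs, acoustic_dim))
ref_inversion_maps = rng.standard_normal((n_refs, artic_dim, acoustic_dim))

# Target speaker: a small amount of acoustic adaptation data, no kinematics.
target_adaptation = rng.standard_normal((100, acoustic_dim))
target_mean = target_adaptation.mean(axis=0)

# Acoustic distances -> similarity weights (softmax over negative distances).
dists = np.linalg.norm(ref_acoustic_means - target_mean, axis=1)
weights = np.exp(-dists / dists.std())
weights /= weights.sum()

# Speaker-adapted inversion model for the target.
adapted_map = np.tensordot(weights, ref_inversion_maps, axes=1)
predicted_articulation = target_adaptation @ adapted_map.T
```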

    Augmented Reality Talking Heads as a Support for Speech Perception and Production


    Tongue control and its implication in pronunciation training

    Pronunciation training based on speech production techniques illustrating tongue movements is gaining popularity. However, there is not sufficient evidence that learners can imitate tongue animations. In this paper, we argue that although controlling tongue movements related to speech is not an easy task, training with visual feedback improves this control. We investigated how aware humans are of their tongue body gestures. In a first experiment, participants were asked to perform tongue movements composed of two sets of gestures. This task was evaluated by observing ultrasound imaging of the tongue recorded during the experiment. No feedback was provided. In a second experiment, a short training session was added in which participants could observe real-time ultrasound imaging of their own tongue movements, with the goal of increasing their awareness of their tongue gestures. A pretest and posttest were carried out without any feedback. The results suggest that without a priori knowledge, it is not easy to finely control tongue body gestures. The second experiment showed that performance improved after a short training session, suggesting that providing visual feedback, even briefly, improves tongue gesture awareness.
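    As a hypothetical illustration of the pretest/posttest comparison, the sketch below applies a one-sided paired Wilcoxon test to invented scores; the paper does not specify its statistical analysis, and the sample values, group size, and test choice are all assumptions.

```python
# Hypothetical illustration of a pretest vs. posttest comparison after
# ultrasound-feedback training. Scores are invented; the paper's actual
# analysis is not specified here, so a paired Wilcoxon test is assumed.
import numpy as np
from scipy.stats import wilcoxon

pretest = np.array([0.42, 0.35, 0.50, 0.38, 0.45, 0.40, 0.33, 0.48])   # assumed
posttest = np.array([0.55, 0.49, 0.61, 0.47, 0.58, 0.52, 0.44, 0.60])  # assumed

stat, p = wilcoxon(posttest, pretest, alternative="greater")
print(f"median gain = {np.median(posttest - pretest):.2f}, p = {p:.4f}")
```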

    Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab

    Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances in terms of both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL) to address or mitigate existing problems of speech inversion, such as the non-unique mapping, acoustic variation among speakers, and the time-consuming nature of the process.

    The first method finds appropriate VTL gestural scores for given natural utterances using a genetic algorithm. It consists of two steps: gestural score initialization and optimization. In the first step, gestural scores are initialized from the given acoustic signals using speech recognition, grapheme-to-phoneme (G2P) conversion, and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores are optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that minimizes the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters are also regularized during optimization to restrict them to reasonable values.

    The second method is based on long short-term memory (LSTM) and convolutional neural networks, which capture the temporal dependence and the spatial structure of the acoustic features, respectively. Neural network regression models were trained that take acoustic features as inputs and produce articulatory trajectories as outputs. To cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on the smoothness loss of the articulatory trajectories and another based on the acoustic loss between the original and predicted acoustic features.

    The best-performing genetic algorithm and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 for speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% for speaker-independent utterances of German words, respectively. When applied to German sentence utterances as well as English and Mandarin Chinese word utterances, the neural network based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results show that both methods reproduced not only the articulatory processes but also the acoustic signals of the reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and to estimated articulatory trajectories better matched to VTL's articulatory preferences, thus reproducing more natural and intelligible speech. The study also found that the convolutional layers, when used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Furthermore, the neural network based ACS systems trained on German data generalized to utterances of other languages.
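    A minimal sketch of the analysis-by-synthesis loop used by the first method: a genetic algorithm searches for parameters whose synthetic acoustic features minimize the cosine distance to the target features, with a penalty term standing in for the articulatory regularization. Here `synthesize` is a toy stand-in for VocalTractLab synthesis plus feature extraction, and the flat parameter vectors are a simplification of gestural scores; everything except the cosine-distance objective and the regularization idea is an assumption.

```python
# Toy analysis-by-synthesis sketch: a genetic algorithm minimizes the cosine
# distance between target and synthetic acoustic features, with a penalty
# keeping parameters in a plausible range. `synthesize` is a placeholder for
# VocalTractLab synthesis + feature extraction (assumption, not the real API).
import numpy as np

rng = np.random.default_rng(0)
n_params, pop_size, n_generations = 10, 50, 200

# Fixed random projection standing in for the synthesis + feature pipeline.
projection = np.random.default_rng(1).standard_normal((n_params, 24))

def synthesize(params):
    return np.tanh(params @ projection)

target_features = synthesize(rng.uniform(-1, 1, n_params))

def cost(params):
    synth = synthesize(params)
    cos_dist = 1 - synth @ target_features / (
        np.linalg.norm(synth) * np.linalg.norm(target_features) + 1e-12)
    # Regularizer restricting articulatory parameters to reasonable values.
    penalty = 0.01 * np.mean(np.clip(np.abs(params) - 1, 0, None) ** 2)
    return cos_dist + penalty

population = rng.uniform(-1, 1, (pop_size, n_params))
for _ in range(n_generations):
    scores = np.array([cost(p) for p in population])
    parents = population[np.argsort(scores)[: pop_size // 2]]  # selection
    children = parents + 0.05 * rng.standard_normal(parents.shape)  # mutation
    population = np.vstack([parents, children])

best = min(population, key=cost)
```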

    Phonic Faces as a Method for Improving Decoding for Children with Persistent Decoding Deficits

    Background: Decoding is a foundational skill for reading, contributing to both reading fluency and comprehension (Lyon et al., 2003). Visual enhancements of alphabetic letters, such as shaping letters to resemble words beginning with that sound (e.g., “f” drawn as a flower) (Handler & Fierson, 2011) or associating photographs of lips producing the sounds (Lindamood & Lindamood, 1998), have been shown to improve decoding skills. This study investigated whether a more direct pictured association, using faces with alphabet letters placed in the mouth to cue speech sounds, termed Phonic Faces (Norris, 2001), would enable students with persistent decoding impairment to acquire orthographic patterns in pseudowords, real words, and reading passages. Methods: A multiple baseline single-subject design assessed the effects of Phonic Faces on learning to decode two orthographic patterns. Three participants were taught the short vowel CVC pattern for five weeks using words and pseudowords displayed with Phonic Faces, while two long-vowel patterns (CVCe and CVVC) remained in an untrained baseline condition. In week six, a five-week intervention was introduced for the long vowel pattern with the lowest scores on daily pseudoword probes. Results: The results of the study were suggestive but not conclusive. The graphs of daily probe scores for all three subjects showed significant gains for all three patterns using the two-standard-deviation method of analysis. However, in all three cases, one or more of the control variables changed prior to the introduction of treatment. Additionally, pre-to-posttest gains in measures of decoding and contextualized reading exceeded the standard error of measurement (SEM), indicating true gains. Discussion: Analysis of patterns of change showed generalization of learning across patterns. Once the long vowel Phonic Faces were introduced, improvements appeared for both long vowel patterns. Likewise, the long and short vowels were embedded in similar patterns of two- to three-letter consonant blends and digraphs, all of which scored at low levels at pretest. However, once the consonant patterns were learned in the CVC words, they generalized quickly to long vowel words, especially for participants who scored higher on vowel knowledge at pretest. Replication with decoders exhibiting greater impairment is recommended.
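    A hypothetical illustration of the two-standard-deviation band method mentioned in the Results: a band is computed from the baseline probe scores, and treatment-phase scores falling beyond it (conventionally two or more consecutive points) indicate significant change. The probe scores below are invented, not the study's data.

```python
# Hypothetical illustration of the two-standard-deviation band method used in
# single-subject designs: compute mean + 2 SD from baseline probes, then count
# treatment-phase probes exceeding the band. All scores below are invented.
import numpy as np

baseline = np.array([3, 4, 2, 4, 3, 3, 4], dtype=float)       # assumed probes
treatment = np.array([4, 6, 7, 8, 8, 9, 9, 10], dtype=float)  # assumed probes

upper_band = baseline.mean() + 2 * baseline.std(ddof=1)
above_band = treatment > upper_band
print(f"band ceiling = {upper_band:.2f}; "
      f"{above_band.sum()} of {above_band.size} treatment probes exceed it")
```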

    Investigating the Effects of Speaker Variability on Arabic Children’s Acquisition of English Vowels

    This study investigated whether speaker variability in phonetic training benefits vowel learning by Arabic learners of English. Perception training with High-Variability stimuli in laboratory studies has been shown to improve both the perception and production of Second Language sounds in adults and children, and it has become the dominant methodology for investigating issues in Second Language acquisition. Less consideration has been given to production training, in which Second Language learners focus on the role of the articulators in producing Second Language sounds. This study assessed the role of speaker variability by comparing the effects of High-Variability and Low-Variability stimuli for production training in a classroom setting. Forty-six Arabic children aged 9-12 years were trained on 18 Standard Southern British English vowels in five training sessions over two weeks and were tested before and after training on their vowel production and category discrimination. The results indicate that Low-Variability stimuli may be more beneficial for children; however, High-Variability stimuli may alter some phonetic cues. Furthermore, the results suggest that production training may be used not only to improve the perception and production of Second Language sounds but also to inform the design of Second Language pronunciation learning programmes and theories of Second Language acquisition.
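    As a hypothetical sketch of comparing gains between the High-Variability and Low-Variability groups, the code below applies a one-sided Mann-Whitney U test to invented per-child pre-to-post gain scores; the group sizes, score distributions, and test choice are assumptions, and the study's actual analysis may differ.

```python
# Hypothetical sketch comparing production-accuracy gains between the
# low-variability (LV) and high-variability (HV) training groups. The gain
# scores are simulated; nothing here reproduces the study's actual data.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
lv_gain = rng.normal(0.12, 0.05, 23)   # assumed per-child pre-to-post gains
hv_gain = rng.normal(0.08, 0.05, 23)   # assumed group split of 46 children

stat, p = mannwhitneyu(lv_gain, hv_gain, alternative="greater")
print(f"LV median gain = {np.median(lv_gain):.3f}, "
      f"HV median gain = {np.median(hv_gain):.3f}, p = {p:.4f}")
```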