23 research outputs found

    MISPRONUNCIATION DETECTION AND DIAGNOSIS IN MANDARIN ACCENTED ENGLISH SPEECH

    This work presents the development, implementation, and evaluation of a Mispronunciation Detection and Diagnosis (MDD) system, with application to pronunciation evaluation of Mandarin-accented English speech. A comprehensive detection and diagnosis of errors in the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) was performed using the expert phonetic transcripts and an Automatic Speech Recognition (ASR) system. Articulatory features derived from the parallel kinematic data available in the EMA-MAE corpus were used to identify the most significant articulatory error patterns seen in L2 speakers during common mispronunciations. Using both acoustic and articulatory information, an ASR-based MDD system was built and evaluated across different feature combinations and Deep Neural Network (DNN) architectures. The MDD system captured mispronunciation errors with a detection accuracy of 82.4%, a diagnostic accuracy of 75.8%, and a false rejection rate of 17.2%. The results demonstrate the advantage of using articulatory features both in revealing the significant contributors to mispronunciation and in improving the performance of MDD systems.
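    As a rough illustration of how these figures are typically computed for MDD systems, the sketch below derives detection accuracy, false rejection rate, and diagnostic accuracy from phone-level decision counts. The counting convention and the example numbers are assumptions for illustration; the thesis may define the metrics slightly differently.

        # Illustrative MDD evaluation metrics from phone-level decision counts
        # (assumed counting convention, hypothetical numbers).
        def mdd_metrics(true_accept, false_reject, false_accept, true_reject,
                        correct_diagnosis):
            """true_accept: correct phones accepted; false_reject: correct phones
            wrongly flagged; false_accept: mispronunciations missed; true_reject:
            mispronunciations correctly flagged; correct_diagnosis: flagged errors
            that also received the right error label."""
            total = true_accept + false_reject + false_accept + true_reject
            detection_accuracy = (true_accept + true_reject) / total
            false_rejection_rate = false_reject / (true_accept + false_reject)
            diagnostic_accuracy = correct_diagnosis / true_reject
            return detection_accuracy, false_rejection_rate, diagnostic_accuracy

        # Hypothetical counts, not taken from the EMA-MAE experiments.
        print(mdd_metrics(true_accept=4200, false_reject=870, false_accept=320,
                          true_reject=1610, correct_diagnosis=1220))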

    Dealing with linguistic mismatches for automatic speech recognition

    Recent breakthroughs in automatic speech recognition (ASR) have resulted in a word error rate (WER) on par with human transcribers on the English Switchboard benchmark. However, dealing with linguistic mismatches between the training and testing data remains a significant unsolved challenge. In a monolingual setting, it is well known that the performance of ASR systems degrades significantly when presented with speech from speakers with accents, dialects, and speaking styles different from those encountered during system training. In a multilingual setting, ASR systems trained on a source language perform even worse when tested on another target language because of mismatches in the number of phonemes, lexical ambiguity, and the power of the phonotactic constraints provided by phone-level n-grams. To address these linguistic mismatches in current ASR systems, my dissertation investigates both knowledge-gnostic and knowledge-agnostic solutions. In the first part, classic theories from acoustics and articulatory phonetics that can transfer across a dialect continuum, from local dialects to another standardized language, are revisited. Experiments demonstrate the potential of acoustic correlates in the vicinity of landmarks to build a bridge for dealing with mismatches across different local or global varieties in a dialect continuum. In the second part, we design an end-to-end acoustic modeling approach based on the connectionist temporal classification loss and propose to link the training of acoustics and accent together, in a manner similar to the learning process in human speech perception. This joint model not only performed well on ASR with multiple accents but also boosted the accuracy of the accent identification task in comparison to separately trained models.
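    A minimal sketch of the kind of joint acoustic/accent model described in the second part, assuming a PyTorch-style implementation: a shared recurrent encoder feeds both a CTC phone head and an utterance-level accent-classification head, and the two losses are combined with a weighting term. Layer sizes, the pooling choice, and the weight alpha are illustrative assumptions rather than the dissertation's actual configuration.

        import torch
        import torch.nn as nn

        class JointCTCAccentModel(nn.Module):
            """Shared encoder with a CTC phone head and an accent-ID head (sketch)."""

            def __init__(self, n_feats=80, n_phones=40, n_accents=8, hidden=256):
                super().__init__()
                self.encoder = nn.LSTM(n_feats, hidden, num_layers=3,
                                       batch_first=True, bidirectional=True)
                self.phone_head = nn.Linear(2 * hidden, n_phones + 1)  # +1 = CTC blank
                self.accent_head = nn.Linear(2 * hidden, n_accents)

            def forward(self, feats):                      # feats: (B, T, n_feats)
                enc, _ = self.encoder(feats)               # (B, T, 2H)
                phone_logits = self.phone_head(enc)        # per-frame phone scores
                accent_logits = self.accent_head(enc.mean(dim=1))  # utterance pooling
                return phone_logits, accent_logits

        def joint_loss(phone_logits, accent_logits, phone_targets, phone_lens,
                       feat_lens, accent_targets, alpha=0.3):
            """Weighted sum of CTC loss (phones) and cross-entropy (accent ID).
            phone_targets must use labels 0..n_phones-1; the last class is blank."""
            log_probs = phone_logits.log_softmax(-1).transpose(0, 1)  # (T, B, C)
            ctc = nn.CTCLoss(blank=phone_logits.size(-1) - 1, zero_infinity=True)(
                log_probs, phone_targets, feat_lens, phone_lens)
            ce = nn.CrossEntropyLoss()(accent_logits, accent_targets)
            return ctc + alpha * ce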

    Towards Automatic Speech-Language Assessment for Aphasia Rehabilitation

    Speech-based technology has the potential to reinforce traditional aphasia therapy through the development of automatic speech-language assessment systems. Such systems can provide clinicians with supplementary information to assist with progress monitoring and treatment planning, and can provide support for on-demand auxiliary treatment. However, current technology cannot support this type of application due to the difficulties associated with aphasic speech processing. The focus of this dissertation is on the development of computational methods that can accurately assess aphasic speech across a range of clinically-relevant dimensions. The first part of the dissertation focuses on novel techniques for assessing aphasic speech intelligibility in constrained contexts. The second part investigates acoustic modeling methods that lead to significant improvement in aphasic speech recognition and allow the system to work with unconstrained speech samples. The final part demonstrates the efficacy of speech recognition-based analysis in automatic paraphasia detection, extraction of clinically-motivated quantitative measures, and estimation of aphasia severity. The methods and results presented in this work will enable robust technologies for accurately recognizing and assessing aphasic speech, and will provide insights into the link between computational methods and clinical understanding of aphasia.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/140840/1/ducle_1.pd
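    As a hedged illustration of what speech recognition-based extraction of quantitative measures can look like, the sketch below computes a few simple statistics from a time-aligned ASR transcript. The specific measures and their names are assumptions for illustration, not the dissertation's clinically-motivated feature set.

        # Illustrative sketch: simple quantitative measures from a recognizer's
        # word-level output. Measure names are assumptions, not the thesis's set.
        def transcript_measures(words):
            """words: list of (token, start_sec, end_sec) tuples from an ASR system."""
            if not words:
                return {}
            tokens = [w for w, _, _ in words]
            duration_min = (words[-1][2] - words[0][1]) / 60.0
            fillers = {"uh", "um", "er"}
            return {
                "words_per_minute": len(tokens) / duration_min if duration_min else 0.0,
                "type_token_ratio": len(set(tokens)) / len(tokens),
                "filler_rate": sum(t in fillers for t in tokens) / len(tokens),
                "mean_word_duration": sum(e - s for _, s, e in words) / len(words),
            }

        # Hypothetical four-word fragment, for illustration only.
        print(transcript_measures([("the", 0.0, 0.3), ("uh", 0.5, 0.8),
                                   ("boy", 1.2, 1.6), ("runs", 1.7, 2.2)]))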

    The Role of Phonics in Teaching English Pronunciation to English as a Foreign Language Students

    This thesis proposes the use of a modified phonics program to teach students basic rules that will help them translate graphemes to phonemes, both in words they are familiar with and in words they are not. It is a common misconception that English has a highly irregular or irrational orthography. Quite to the contrary, English, as a morphophonemic language, has a highly regular orthography governed by systematic rules and spelling patterns that correspond to phonemes in speech. We argue that knowledge of these rules gives students the necessary tools to move from grapheme to phoneme. This also increases their confidence, develops their metacognitive awareness, and produces autonomous learners whose pronunciation and communication improve because they understand how English works and how its writing relates to speech.
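    The sketch below illustrates, in a deliberately tiny way, the kind of systematic grapheme-to-phoneme rules such a phonics program teaches: longest-match spelling patterns applied left to right to produce a phoneme sequence. The rule set and phoneme symbols are illustrative assumptions, not the program proposed in the thesis.

        # Toy rule-based grapheme-to-phoneme conversion (longest match first).
        # The rule set is deliberately small and only for illustration.
        RULES = [
            ("tch", "CH"), ("sh", "SH"), ("ch", "CH"), ("th", "TH"),
            ("ee", "IY"), ("oa", "OW"), ("ai", "EY"),
            ("a", "AE"), ("e", "EH"), ("i", "IH"), ("o", "AA"), ("u", "AH"),
            ("b", "B"), ("c", "K"), ("d", "D"), ("f", "F"), ("g", "G"),
            ("h", "HH"), ("k", "K"), ("l", "L"), ("m", "M"), ("n", "N"),
            ("p", "P"), ("r", "R"), ("s", "S"), ("t", "T"), ("v", "V"), ("w", "W"),
        ]

        def to_phonemes(word):
            """Apply the rules greedily, left to right, longest grapheme first."""
            word, phones, i = word.lower(), [], 0
            while i < len(word):
                for grapheme, phoneme in RULES:
                    if word.startswith(grapheme, i):
                        phones.append(phoneme)
                        i += len(grapheme)
                        break
                else:
                    i += 1  # skip letters the toy rule set does not cover
            return phones

        print(to_phonemes("sheep"), to_phonemes("coat"))  # ['SH','IY','P'] ['K','OW','T']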

    A Neurocomputational Model of Grounded Language Comprehension and Production at the Sentence Level

    While symbolic and statistical approaches to natural language processing have become undeniably impressive in recent years, such systems still display a tendency to make errors that are inscrutable to human onlookers. This disconnect with human processing may stem from the vast differences in the substrates that underlie natural language processing in artificial systems versus biological systems. To create a more relatable system, this dissertation turns to the more biologically inspired substrate of neural networks, describing the design and implementation of a model that learns to comprehend and produce language at the sentence level. The model's task is to ground simulated speech streams, representing a simple subset of English, in terms of a virtual environment. The model learns to understand and answer full-sentence questions about the environment by mimicking the speech stream of another speaker, much as a human language learner would. It is the only known neural model to date that can learn to map natural language questions to full-sentence natural language answers, where both question and answer are represented sublexically as phoneme sequences. The model addresses important points for which most other models, neural and otherwise, fail to account. First, the model learns to ground its linguistic knowledge using human-like sensory representations, gaining language understanding at a deeper level than that of syntactic structure. Second, analysis provides evidence that the model learns combinatorial internal representations, thus gaining the compositionality of symbolic approaches to cognition, which is vital for computationally efficient encoding and decoding of meaning. The model does this while retaining the fully distributed representations characteristic of neural networks, providing the resistance to damage and graceful degradation that are generally lacking in symbolic and statistical approaches. Finally, the model learns via direct imitation of another speaker, allowing it to emulate human processing with greater fidelity, thus increasing the relatability of its behavior. Along the way, this dissertation develops a novel training algorithm that, for the first time, requires only local computations to train arbitrary second-order recurrent neural networks. This algorithm is evaluated on its overall efficacy, biological feasibility, and ability to reproduce peculiarities of human learning such as age-correlated effects in second language acquisition.
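    For concreteness, the sketch below shows one step of a second-order (multiplicative) recurrent network, in which the next hidden state depends on products of the previous hidden state and the current input through a third-order weight tensor. The sizes and nonlinearity are assumptions, and the dissertation's local training algorithm is not reproduced here.

        import numpy as np

        # One step of a second-order recurrent network:
        #   h_i = tanh( sum_jk W[i, j, k] * h_prev[j] * x[k] + b[i] )
        # Sizes and the tanh nonlinearity are illustrative assumptions.
        rng = np.random.default_rng(0)
        n_hidden, n_input = 16, 8
        W = rng.normal(scale=0.1, size=(n_hidden, n_hidden, n_input))  # 2nd-order weights
        b = np.zeros(n_hidden)

        def step(h_prev, x):
            return np.tanh(np.einsum("ijk,j,k->i", W, h_prev, x) + b)

        h = np.zeros(n_hidden)
        for t in range(5):                       # run a short random input sequence
            h = step(h, rng.normal(size=n_input))
        print(h.shape, float(h.mean()))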

    Audiovisual processing in aphasic and non-brain-damaged listeners: the whole is more than the sum of its parts

    Speech processing is a task that is (usually) performed without much effort. Only when processing is disturbed, for example as a result of brain damage, do we notice its complexity. Dörte Hessler investigated this phenomenon, covering not only auditory but also audiovisual processing of speech sounds. The research first showed that people with aphasia (a language disorder that occurs as a result of brain damage) have more difficulty recognising small than large differences between sounds. Sounds can differ, for example, in the manner in which they are produced, the place where this happens, and whether the vocal folds vibrate. Sounds that differed on all three of these features proved easier to recognise than sounds that differed on only one. The most difficult distinction was between sounds that differed only in whether or not the vocal folds vibrate (for example, the difference between p and b). In line with this, brain responses of listeners without language problems showed that brain waves react more strongly when the differences between sounds are small, which probably reflects the extra attention needed to process these smaller differences. The research further showed that visual support (lip-reading), which has a positive influence on speech processing, is not limited to clearly visible sound features such as place of articulation, but also extends to manner of articulation and voicing. People without brain damage also show an effect of lip-reading: their reaction times decrease when they have to choose a target sound. Their brain responses were affected as well: auditory and audiovisual input led to clear differences in response patterns, and processing was easier when a sound was presented audiovisually.

    Computer analysis of children's non-native English speech for language learning and assessment

    ASR for children's speech is more challenging than for adults', and it is even more difficult for non-native children's speech. This research investigates different techniques to compensate for the effects of non-native and child speech on the performance of ASR systems. The study mainly utilises hybrid DNN-HMM systems with conventional DNNs, LSTMs, and more advanced TDNN models. This work uses the CALL-ST corpus and the TLT-school corpus to study children's non-native English speech. Initially, data augmentation was explored on the CALL-ST corpus to address the lack of data, using the AMI corpus and the PF-STAR German corpus. Feature selection, acoustic model adaptation, and model selection were also investigated on CALL-ST. More aspects of the ASR system, including pronunciation modelling, acoustic modelling, language modelling, and system fusion, were explored on the TLT-school corpus, as this corpus has a larger amount of data. The relationships between the CALL-ST and TLT-school corpora were then studied and exploited to improve ASR performance. The other part of the present work is text processing for non-native children's English speech. We focus on providing accept/reject feedback to learners based on the text generated by the ASR system from learners' spoken responses. A rule-based and a machine learning-based system were proposed for making this judgement, and several aspects of the systems were evaluated. The influence of the ASR system on the text processing system was also explored.
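    A hedged sketch of what a rule-based accept/reject judgement on ASR output can look like: the recognised text is matched against expected keywords for a prompt and accepted if enough of them are covered. The rules, threshold, and example prompt are illustrative assumptions, not the systems built on the CALL-ST or TLT-school corpora.

        # Illustrative rule-based accept/reject decision on ASR output.
        # Keyword rules and the coverage threshold are assumptions for illustration.
        def judge_response(asr_text, required_keywords, min_coverage=0.6):
            """Accept the response if enough expected keyword slots are recognised."""
            tokens = set(asr_text.lower().split())
            hits = sum(any(k in tokens for k in alternatives)
                       for alternatives in required_keywords)
            coverage = hits / len(required_keywords)
            return ("accept" if coverage >= min_coverage else "reject", coverage)

        # Hypothetical prompt: "Ask for two tickets to the station."
        rules = [{"two", "2"}, {"tickets", "ticket"}, {"station"}]
        print(judge_response("i would like two ticket to the station please", rules))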

    Vowel acquisition in a multidialectal environment: A five-year longitudinal case study

    What happens when a child is exposed to multiple phonological systems while they are acquiring language? How do they resolve contradictory patterns in the accents around them in their own developing speech production? Do they acquire the accent of the local community, their parents' accent, or something in between? This thesis examines the acquisition of a subset of vowels in a child growing up in a multidialectal environment. The child's realisations of vowels in the lexical sets STRUT, FOOT, START, PALM and BATH are analysed between the ages of 2;01 and 6;11. Previous research has shown that while a child's accent is usually heavily influenced by their peers, having parents from outside the local area can prevent complete acquisition of an accent. Local cultural values, whether or not a parent's accent has more prestigious elements than the local one, a child's personality, and the complexity of the relationship between the home and local phonological systems have all been implicated in whether or not a child fully acquires a local accent. In the child studied here, a shift from the vowels used at home to local variants always happened, in the first instance, at the level of the articulatory feature rather than at the phonemic level, and vowels belonging to different lexical sets were acquired at different rates. This thesis demonstrates that acquisition of these vowels takes many years, as combinations of articulatory features stabilise. Moreover, even once a local variant has apparently been acquired, the variety of language spoken at home can leave a phonetic legacy in a child's accent. Naturalistic data collection combined with impressionistic and acoustic analysis, together with a long and sustained data collection period, reveals patterns in this child's phonological acquisition that have not been seen in this detail in any previous research.