
    Non-native children speech recognition through transfer learning

    This work deals with non-native children's speech and investigates both multi-task and transfer learning approaches to adapt a multi-language Deep Neural Network (DNN) to speakers, specifically children, learning a foreign language. The application scenario is characterized by young students learning English and German and reading sentences in these second languages, as well as in their mother tongue. The paper analyzes and discusses techniques for training effective DNN-based acoustic models starting from children's native speech and performing adaptation with limited non-native audio material. A multilingual model is adopted as the baseline, in which a common phonetic lexicon, defined in terms of the units of the International Phonetic Alphabet (IPA), is shared across the three languages at hand (Italian, German and English); DNN adaptation methods based on transfer learning are evaluated on significant non-native evaluation sets. Results show that the resulting non-native models yield a significant improvement over a mono-lingual system adapted to speakers of the target language.
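    The adaptation recipe the abstract describes — start from a model trained on native speech, then fine-tune it on a small amount of non-native audio while keeping part of the network fixed — can be sketched as follows. This is a minimal illustrative sketch: the tiny two-layer classifier, the synthetic "acoustic" features, and the choice to freeze only the lower layer are assumptions for demonstration, not the paper's actual architecture or data.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    class TinyAcousticModel:
        """Toy 2-layer classifier standing in for a DNN acoustic model."""
        def __init__(self, n_in, n_hid, n_out):
            self.W1 = rng.normal(0, 0.1, (n_in, n_hid))
            self.W2 = rng.normal(0, 0.1, (n_hid, n_out))

        def forward(self, X):
            self.H = np.tanh(X @ self.W1)
            return softmax(self.H @ self.W2)

        def step(self, X, y, lr=0.1, freeze_lower=False):
            P = self.forward(X)
            P[np.arange(len(y)), y] -= 1.0            # dL/dlogits for cross-entropy
            dH = (P @ self.W2.T) * (1 - self.H**2)    # backprop through tanh
            self.W2 -= lr * self.H.T @ P / len(y)
            if not freeze_lower:                      # transfer learning: keep
                self.W1 -= lr * X.T @ dH / len(y)     # the lower layer fixed

    # Synthetic "native" data: 3 phone-like classes in a 5-d feature space.
    Xn = rng.normal(0, 1, (300, 5))
    yn = rng.integers(0, 3, 300)
    Xn += np.eye(3)[yn] @ rng.normal(0, 2, (3, 5))    # separable class means

    model = TinyAcousticModel(5, 16, 3)
    for _ in range(200):                              # pre-train on native speech
        model.step(Xn, yn)

    # Small "non-native" set: same classes, shifted acoustics (the accent).
    Xa, ya = Xn[:30] + 0.5, yn[:30]
    for _ in range(50):                               # adapt with frozen lower layer
        model.step(Xa, ya, lr=0.05, freeze_lower=True)

    acc = (model.forward(Xa).argmax(1) == ya).mean()
    print(f"adapted accuracy on non-native set: {acc:.2f}")
    ```

    The design point mirrors the abstract: the lower layer plays the role of the multilingually pre-trained representation, and only the output mapping is re-estimated from the limited non-native material.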

    Recognizing Speech in a Novel Accent: The Motor Theory of Speech Perception Reframed

    The motor theory of speech perception holds that we perceive the speech of another in terms of a motor representation of that speech. However, when we have learned to recognize a foreign accent, it seems plausible that recognition of a word rarely involves reconstructing the speech gestures of the speaker rather than those of the listener. To better assess the motor theory and this observation, we proceed in three stages. Part 1 places the motor theory of speech perception in a larger framework based on our earlier models of the adaptive formation of mirror neurons for grasping, and of extensions of that mirror system as part of a larger system for neuro-linguistic processing, augmented by the present consideration of recognizing speech in a novel accent. Part 2 then offers a novel computational model of how a listener comes to understand the speech of someone speaking the listener's native language with a foreign accent. The core tenet of the model is that the listener uses hypotheses about the word the speaker is currently uttering to update probabilities linking the sounds produced by the speaker to phonemes in the native language repertoire of the listener. This, on average, improves the recognition of later words. The model is neutral regarding the nature of the representations it uses (motor vs. auditory). It serves as a reference point for the discussion in Part 3, which proposes a dual-stream neuro-linguistic architecture to revisit claims for and against the motor theory of speech perception and the relevance of mirror neurons, and extracts some implications for the reframing of the motor theory.
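    The core tenet of the model — use the hypothesized word to strengthen the links between the speaker's accented sounds and native phonemes, so later words are recognized better — can be sketched as a count-and-normalize update. The tiny lexicon, the phoneme inventory, and the "a:" accent symbol below are invented for illustration; the paper's actual model is neutral about the representation, which this sketch does not attempt to capture.

    ```python
    from collections import defaultdict

    # Toy lexicon: word -> native phoneme sequence (illustrative entries).
    LEXICON = {"bat": ["b", "ae", "t"], "bet": ["b", "eh", "t"]}
    PHONEMES = {"b", "ae", "eh", "t"}   # the listener's native inventory

    # counts[sound][phoneme]: evidence that an accented sound maps to a
    # native phoneme; the default of 1.0 acts as add-one smoothing.
    counts = defaultdict(lambda: defaultdict(lambda: 1.0))

    def p_phoneme_given_sound(sound, phoneme):
        row = counts[sound]
        total = sum(row[ph] for ph in PHONEMES)
        return row[phoneme] / total

    def update_from_hypothesis(sounds, word):
        """The listener hypothesizes `word`: align its native phonemes to the
        sounds actually heard and strengthen each sound->phoneme link."""
        for s, ph in zip(sounds, LEXICON[word]):
            counts[s][ph] += 1.0

    # A speaker whose accent renders native /ae/ as the sound "a:".
    before = p_phoneme_given_sound("a:", "ae")
    for _ in range(5):                  # five confident recognitions of "bat"
        update_from_hypothesis(["b", "a:", "t"], "bat")
    after = p_phoneme_given_sound("a:", "ae")
    print(before, "->", after)          # the link to /ae/ has strengthened
    ```

    After a few word hypotheses, the probability that "a:" maps to /ae/ rises from its smoothed prior, which is exactly the mechanism by which recognition of later accented words improves on average.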

    Leveraging native language information for improved accented speech recognition

    Recognition of accented speech is a long-standing challenge for automatic speech recognition (ASR) systems, given the increasing worldwide population of bilingual speakers with English as their second language. If we consider foreign-accented speech as an interpolation of the native language (L1) and English (L2), a model that can simultaneously address both languages should perform better at the acoustic level for accented speech. In this study, we explore how an end-to-end recurrent neural network (RNN) system trained with English and native languages (Spanish and Indian languages) can leverage native-language data to improve performance on accented English speech. To this end, we examine pre-training with native languages, as well as multi-task learning (MTL) in which the main task is trained with native English and the secondary task is trained with Spanish or Indian languages. We show that the proposed MTL model performs better than the pre-training approach and outperforms a baseline model trained simply with English data. We suggest a new setting for MTL in which the secondary task is trained with both English and the native language, using the same output set. This proposed scenario yields better performance, with +11.95% and +17.55% character error rate gains over the baseline for Hispanic and Indian accents, respectively. Comment: Accepted at Interspeech 201
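    The proposed MTL setting — a shared model trained on interleaved batches from both languages over the same output set — can be sketched as follows. This is a toy sketch with a small feed-forward network and synthetic data standing in for the paper's end-to-end RNN and speech corpora; the network sizes, the 0.3 feature shift modelling the second language, and the alternating-batch schedule are all illustrative assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    # Fully shared parameters: one encoder and one head over a shared label
    # set, trained on batches from both "languages" (the proposed MTL setting).
    n_in, n_hid, n_out = 8, 12, 4
    W1 = rng.normal(0, 0.1, (n_in, n_hid))
    W2 = rng.normal(0, 0.1, (n_hid, n_out))

    def make_task(shift):
        """Synthetic classification task; `shift` mimics acoustic differences
        between the languages while the label set stays shared."""
        X = rng.normal(0, 1, (200, n_in))
        y = rng.integers(0, n_out, 200)
        X += np.eye(n_out)[y] @ (rng.normal(0, 2, (n_out, n_in)) + shift)
        return X, y

    english = make_task(0.0)   # "main task": English
    native = make_task(0.3)    # "secondary task": the native language

    def step(X, y, lr=0.1):
        global W1, W2
        H = np.tanh(X @ W1)
        P = softmax(H @ W2)
        P[np.arange(len(y)), y] -= 1.0          # cross-entropy gradient
        dH = (P @ W2.T) * (1 - H**2)
        W2 -= lr * H.T @ P / len(y)
        W1 -= lr * X.T @ dH / len(y)

    for _ in range(300):        # alternate main and secondary batches
        step(*english)
        step(*native)

    X, y = english
    acc = (softmax(np.tanh(X @ W1) @ W2).argmax(1) == y).mean()
    print(f"main-task accuracy: {acc:.2f}")
    ```

    The sketch shows only the training scheme, not the paper's reported gains: because both tasks share every parameter and the output set, gradients from the secondary language directly shape the representation used for the main English task.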

    Pronunciation variation modelling using accent features


    Agreeing to disagree: constant non-alignment of speech gestures in dialogue

    Numerous studies suggest that interlocutors in a dialogue align with each other in terms of their articulatory gestures. It is often suggested, first, that this is the consequence of an automatic tendency for imitation and, second, that it fosters mutual understanding. Using online archives of media, we tested whether alignment is indeed inevitable. The focus was on the pronunciation of a German word whose standard and Swabian realizations differ in their fricative, a contrast acoustically reflected in the fricative spectra. We measured the fricative spectra of interviewers while they interviewed either a prominent German politician using the Swabian variant or an interviewee using the standard variant. Results showed neither an overall influence of the interviewees' pronunciation on the interviewers' fricative realization nor a tendency to align over time in interviewer-interviewee pairs with different pronunciations. This shows that phonetic alignment in conversation is a more complex process than most current theories suggest. Moreover, failure to align may not impede mutual understanding.

    Non-native contrasts in Tongan loans

    We present three case studies of marginal contrasts in Tongan loans from English, working with data from three speakers. Although Tongan lacks contrasts in stress or in CC vs. CVC sequences, secondary stress in loans is contrastive and is sensitive to whether a vowel has a correspondent in the English source word; vowel deletion is also sensitive to whether a vowel is epenthetic relative to the English source; and final vowel length is sensitive to whether the penultimate vowel is epenthetic and, if not, whether it corresponds to a stressed or unstressed vowel in the English source. We provide an analysis in the multilevel model of Boersma (1998) and Boersma & Hamann (2009), and show that the loan patterns can be captured using only constraints that are plausibly needed for native-word phonology, including constraints that reflect perceptual strategies.