
    Parallel Reference Speaker Weighting for Kinematic-Independent Acoustic-to-Articulatory Inversion

    Acoustic-to-articulatory inversion, the estimation of articulatory kinematics from an acoustic waveform, is a challenging but important problem. Accurate estimation of articulatory movements has the potential for significant impact on our understanding of speech production, on our capacity to assess and treat pathologies in a clinical setting, and on speech technologies such as computer-aided pronunciation assessment and audio-video synthesis. However, because of the complex and speaker-specific relationship between articulation and acoustics, existing approaches for inversion do not generalize well across speakers. As acquiring speaker-specific kinematic data for training is not feasible in many practical applications, this remains an important and open problem. This paper proposes a novel approach to acoustic-to-articulatory inversion, Parallel Reference Speaker Weighting (PRSW), which requires no kinematic data for the target speaker and only a small amount of acoustic adaptation data. PRSW hypothesizes that acoustic and kinematic similarities are correlated and builds speaker-adapted articulatory models using acoustically derived weights. The system was assessed using a 20-speaker data set of synchronous acoustic and Electromagnetic Articulography (EMA) kinematic data. Results demonstrate that by restricting the reference group to a subset consisting of speakers with strong individual speaker-dependent inversion performance, the PRSW method is able to attain kinematic-independent acoustic-to-articulatory inversion performance nearly matching that of the speaker-dependent model, with an average correlation of 0.62 versus 0.63. This indicates that given a sufficiently complete and appropriately selected reference speaker set for adaptation, it is possible to create effective articulatory models without kinematic training data.
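    The weighting idea in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how the weights are computed, so the softmax over negative squared distance used here is an assumption, and the function names and the flat parameter vectors are hypothetical. The sketch only shows weights being estimated from acoustic similarity and then reused, in parallel, on the reference speakers' articulatory models.

```python
import math

def acoustic_weights(target_acoustic, reference_acoustics):
    """Weight each reference speaker by acoustic closeness to the target.

    ASSUMPTION: softmax over negative squared Euclidean distance. The
    abstract does not specify the weighting rule; this is a stand-in.
    """
    scores = [
        -sum((t - r) ** 2 for t, r in zip(target_acoustic, ref))
        for ref in reference_acoustics
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_articulatory_model(weights, reference_articulatory):
    """Apply the acoustically derived weights to the articulatory parameters.

    No kinematic data from the target speaker is used here; only the
    reference speakers' articulatory parameter vectors are combined.
    """
    dim = len(reference_articulatory[0])
    return [
        sum(w * ref[d] for w, ref in zip(weights, reference_articulatory))
        for d in range(dim)
    ]

# Hypothetical toy data: two reference speakers with 2-D parameter vectors.
weights = acoustic_weights([0.0, 0.0], [[0.0, 0.0], [10.0, 10.0]])
adapted = weighted_articulatory_model(weights, [[1.0, 2.0], [3.0, 4.0]])
```

    In this toy case the target is acoustically identical to the first reference speaker, so the adapted articulatory parameters fall almost exactly on that speaker's values.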

    Learning second language speech perception in natural settings


    Learning and adaptation in speech production without a vocal tract

    How is the complex audiomotor skill of speaking learned? To what extent does it depend on the specific characteristics of the vocal tract? Here, we developed a touchscreen-based speech synthesizer to examine learning of speech production independent of the vocal tract. Participants were trained to reproduce heard vowel targets by reaching to locations on the screen without visual feedback and receiving endpoint vowel sound auditory feedback that depended continuously on touch location. Participants demonstrated learning as evidenced by rapid increases in accuracy and consistency in the production of trained targets. This learning generalized to productions of novel vowel targets. Subsequent to learning, sensorimotor adaptation was observed in response to changes in the location-sound mapping. These findings suggest that participants learned adaptable sensorimotor maps allowing them to produce desired vowel sounds. These results have broad implications for understanding the acquisition of speech motor control.

    Articulating: the neural mechanisms of speech production

    Speech production is a highly complex sensorimotor task involving tightly coordinated processing across large expanses of the cerebral cortex. Historically, the study of the neural underpinnings of speech suffered from the lack of an animal model. The development of non-invasive structural and functional neuroimaging techniques in the late 20th century has dramatically improved our understanding of the speech network. Techniques for measuring regional cerebral blood flow have illuminated the neural regions involved in various aspects of speech, including feedforward and feedback control mechanisms. In parallel, we have designed, experimentally tested, and refined a neural network model detailing the neural computations performed by specific neuroanatomical regions during speech. Computer simulations of the model account for a wide range of experimental findings, including data on articulatory kinematics and brain activity during normal and perturbed speech. Furthermore, the model is being used to investigate a wide range of communication disorders. Funding: NIDCD NIH HHS grants R01 DC002852, R01 DC007683, and R01 DC016270.

    DropClass and DropAdapt: Dropping classes for deep speaker representation learning

    Many recent works on deep speaker embeddings train their feature extraction networks on large classification tasks, distinguishing between all speakers in a training set. Empirically, this has been shown to produce speaker-discriminative embeddings, even for unseen speakers. However, it is not clear that this is the optimal means of training embeddings that generalize well. This work proposes two approaches to learning embeddings, based on the notion of dropping classes during training. We demonstrate that both approaches can yield performance gains in speaker verification tasks. The first proposed method, DropClass, works by periodically dropping a random subset of classes from the training data and the output layer throughout training, resulting in a feature extractor trained on many different classification tasks. Combined with an additive angular margin loss, this method can yield a 7.9% relative improvement in equal error rate (EER) over a strong baseline on VoxCeleb. The second proposed method, DropAdapt, is a means of adapting a trained model to a set of enrolment speakers in an unsupervised manner. This is performed by fine-tuning a model on only those classes which produce high-probability predictions when the enrolment speakers are used as input, again dropping the relevant rows from the output layer. This method yields a large 13.2% relative improvement in EER on VoxCeleb. The code for this paper has been made publicly available. Comment: Submitted to Speaker Odyssey 202
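    The class-dropping step described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' released code: the function names, the drop fraction, and the plain-list representation of the output layer are all hypothetical. It shows only the bookkeeping for one drop: sample a random subset of classes, filter the training examples that belong to them, remap the remaining labels, and keep the matching rows of the output layer.

```python
import random

def drop_classes(labels, num_classes, drop_fraction, rng):
    """Drop a random subset of classes for one training period.

    Returns the kept class ids, the indices of the training examples
    that survive, and their labels remapped to 0..len(kept)-1.
    """
    num_drop = int(drop_fraction * num_classes)
    dropped = set(rng.sample(range(num_classes), num_drop))
    kept = [c for c in range(num_classes) if c not in dropped]
    remap = {c: i for i, c in enumerate(kept)}  # old id -> reduced id
    kept_indices = [i for i, y in enumerate(labels) if y in remap]
    new_labels = [remap[labels[i]] for i in kept_indices]
    return kept, kept_indices, new_labels

def reduce_output_layer(weight_rows, kept):
    """Keep only the output-layer rows of the classes that were not dropped."""
    return [weight_rows[c] for c in kept]

# Hypothetical toy setup: 6 speaker classes, 8 training examples.
rng = random.Random(0)
labels = [0, 1, 2, 3, 4, 5, 0, 1]
output_rows = [[float(c), float(c) + 0.5] for c in range(6)]
kept, kept_indices, new_labels = drop_classes(labels, 6, 0.5, rng)
reduced_rows = reduce_output_layer(output_rows, kept)
```

    In a training loop this step would be repeated periodically with a fresh random subset, so the shared feature extractor sees many different classification tasks while the full output layer is retained between drops.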

    Speech Disruption During Delayed Auditory Feedback with Simultaneous Visual Feedback

    Delayed auditory feedback (DAF) regarding speech can cause dysfluency. The purpose of this study was to explore whether providing visual feedback in addition to DAF would ameliorate speech disruption. Speakers repeated sentences and heard their auditory feedback delayed with and without simultaneous visual feedback. DAF led to increased sentence durations and an increased number of speech disruptions. Although visual feedback did not reduce DAF effects on duration, a promising but nonsignificant trend was observed for fewer speech disruptions when visual feedback was provided. This trend was significant in speakers who were overall less affected by DAF. The results suggest the possibility that speakers strategically use alternative sources of feedback.