
    Modeling speech imitation and ecological learning of auditory-motor maps.

    Classical models of speech consider an antero-posterior distinction between perceptive and productive functions. However, the selective alteration of neural activity in speech motor centers via transcranial magnetic stimulation has been shown to affect speech discrimination. On the automatic speech recognition (ASR) side, recognition systems have classically relied solely on acoustic data, achieving rather good performance in optimal listening conditions. The main limitations of current ASR become evident in the realistic use of such systems. These limitations can be partly reduced by normalization strategies that minimize inter-speaker variability, either by explicitly removing speakers' peculiarities or by adapting different speakers to a reference model. In this paper we aim at modeling a motor-based imitation learning mechanism in ASR. We tested the utility of a speaker normalization strategy that uses motor representations of speech and compared it with strategies that ignore the motor domain. Specifically, we first trained a regressor, using state-of-the-art machine learning techniques, to build an auditory-motor mapping, in a sense mimicking a human learner who tries to reproduce utterances produced by other speakers. This auditory-motor mapping maps the speech acoustics of a speaker into the motor plans of a reference speaker. Since only speech acoustics are available during recognition, the mapping is necessary to "recover" motor information. Subsequently, in a phone classification task, we tested the system on either one of the speakers used during training or a new one. Results show that in both cases the motor-based speaker normalization strategy slightly but significantly outperforms all other strategies in which only acoustics is taken into account.
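    As a rough illustration of the kind of auditory-motor mapping described above, the sketch below trains a multi-output regressor from per-frame acoustic features to reference-speaker motor parameters and then "recovers" motor features at test time, when only acoustics are available. The feature dimensions, the scikit-learn MLP regressor, and the toy data are assumptions for illustration, not the paper's actual setup.

```python
# Minimal sketch of an auditory-to-motor mapping: a regressor that maps
# acoustic features of an arbitrary speaker to motor (articulatory) plans
# of a reference speaker. Dimensions and data are illustrative placeholders.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder data: per-frame acoustic features (e.g., MFCC-like, 39-dim)
# and reference-speaker motor parameters (e.g., 12 articulatory dims).
X_acoustic = rng.normal(size=(5000, 39))      # frames x acoustic dims
Y_motor_ref = rng.normal(size=(5000, 12))     # frames x motor dims

mapping = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(128,), max_iter=100, random_state=0),
)
mapping.fit(X_acoustic, Y_motor_ref)

# At recognition time only acoustics are available: "recover" motor features
# and concatenate them with the acoustics before phone classification.
X_test = rng.normal(size=(100, 39))
motor_recovered = mapping.predict(X_test)
X_augmented = np.hstack([X_test, motor_recovered])
```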

    A speaker adaptive DNN training approach for speaker-independent acoustic inversion

    We address the speaker-independent acoustic inversion (AI) problem, also referred to as acoustic-to-articulatory mapping. The scarce availability of multi-speaker articulatory data makes it difficult to learn a mapping that generalizes from a limited number of training speakers and reliably reconstructs the articulatory movements of unseen speakers. In this paper, we propose a Multi-task Learning (MTL)-based approach that explicitly separates the modeling of each training speaker's AI peculiarities from the modeling of AI characteristics shared by all speakers. Our approach stems from the well-known Regularized MTL approach and extends it to feed-forward deep neural networks (DNNs). Given multiple training speakers, we learn for each an acoustic-to-articulatory mapping represented by a DNN. Then, through an iterative procedure, we search for a canonical speaker-independent DNN that is "similar" to all speaker-dependent DNNs. The degree of similarity is controlled by a regularization parameter. We report experiments on the University of Wisconsin X-ray Microbeam Database under different training/testing experimental settings. The results indicate that our MTL-trained canonical DNN largely outperforms a conventionally trained (i.e., single-task learning-based) speaker-independent DNN.
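    A minimal sketch of the regularized multi-task idea described above, assuming a PyTorch implementation: each speaker-dependent DNN is trained on its own data while a weight-space penalty pulls it toward a shared canonical network, and the canonical network is then updated from the speaker-dependent ones. Layer sizes, the alternating update schedule, and the toy data are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of regularized multi-task training: speaker-dependent DNNs are fit on
# their own data with a penalty toward a shared "canonical" network, which is
# in turn updated from the speaker-dependent weights. Sizes are illustrative.
import torch
import torch.nn as nn

def make_dnn(in_dim=39, out_dim=12, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

n_speakers, lam = 4, 1e-3            # lam controls speaker/canonical similarity
canonical = make_dnn()
speaker_nets = [make_dnn() for _ in range(n_speakers)]
optims = [torch.optim.Adam(net.parameters(), lr=1e-3) for net in speaker_nets]
mse = nn.MSELoss()

def weight_distance(net, ref):
    # Squared L2 distance between parameter sets; ref is treated as fixed.
    return sum(((p - q.detach()) ** 2).sum()
               for p, q in zip(net.parameters(), ref.parameters()))

# One illustrative iteration over toy per-speaker batches.
for net, opt in zip(speaker_nets, optims):
    x = torch.randn(64, 39)          # acoustic frames for this speaker
    y = torch.randn(64, 12)          # articulatory targets for this speaker
    loss = mse(net(x), y) + lam * weight_distance(net, canonical)
    opt.zero_grad(); loss.backward(); opt.step()

# Canonical update: one simple choice is to move it to the average of the
# speaker-dependent weights (not necessarily the paper's exact update rule).
with torch.no_grad():
    for p_c, *p_s in zip(canonical.parameters(),
                         *[n.parameters() for n in speaker_nets]):
        p_c.copy_(torch.stack(p_s).mean(dim=0))
```

    In the paper's formulation the degree of speaker/canonical similarity is controlled by a regularization parameter; here that role is played by the hypothetical `lam`.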

    Phonatory and articulatory representations of speech production in cortical and subcortical fMRI responses

    Speaking involves coordination of multiple neuromotor systems, including respiration, phonation and articulation. Developing non-invasive imaging methods to study how the brain controls these systems is critical for understanding the neurobiology of speech production. Recent models and animal research suggest that regions beyond the primary motor cortex (M1), including cortical and sub-cortical regions, help orchestrate the neuromotor control needed for speaking. Using contrasts between speech conditions with controlled respiratory behavior, this fMRI study investigates articulatory gestures involving the tongue, lips and velum (i.e., alveolars versus bilabials, and nasals versus orals), and phonatory gestures (i.e., voiced versus whispered speech). Multivariate pattern analysis (MVPA) was used to decode articulatory gestures in M1, the cerebellum and the basal ganglia. Furthermore, apart from confirming the role of a mid-M1 region for phonation, we found that a dorsal M1 region linked to respiratory control showed significant differences for voiced compared to whispered speech despite matched lung volume observations. This region was also functionally connected to tongue and lip M1 seed regions, underscoring its importance in the coordination of speech. Our study confirms and extends current knowledge regarding the neural mechanisms underlying neuromotor speech control, which holds promise for the non-invasive study of neural dysfunctions involved in motor-speech disorders. This work was supported by the Spanish Ministry of Economy and Competitiveness through the Juan de la Cierva Fellowship (FJCI-2015-26814) and the Ramon y Cajal Fellowship (RYC-2017-21845), the Spanish State Research Agency through the BCBL “Severo Ochoa” excellence accreditation (SEV-2015-490), the Basque Government (BERC 2018-2021), and the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant (No 799554).
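    For readers unfamiliar with MVPA, the sketch below shows the generic form of such a decoding analysis: a cross-validated linear classifier applied to voxel patterns from a region of interest to distinguish two gesture classes. The ROI size, trial counts, and random data are placeholders, not the study's data or pipeline.

```python
# Generic MVPA sketch: decode a binary articulatory gesture class (e.g.,
# alveolar vs bilabial) from ROI voxel patterns with a cross-validated linear
# classifier. Data shapes and labels are toy placeholders.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
n_trials, n_voxels = 120, 300                 # trials x voxels in one ROI
roi_patterns = rng.normal(size=(n_trials, n_voxels))
labels = rng.integers(0, 2, size=n_trials)    # 0 = alveolar, 1 = bilabial

clf = make_pipeline(StandardScaler(), LinearSVC())
scores = cross_val_score(
    clf, roi_patterns, labels,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
print(f"ROI decoding accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```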

    Time-domain speech enhancement using generative adversarial networks

    Speech enhancement improves recorded voice utterances to eliminate noise that might impede their intelligibility or compromise their quality. Typical speech enhancement systems are based on regression approaches that subtract noise or predict clean signals, and most of them do not operate directly on waveforms. In this work, we propose a generative approach that regenerates corrupted signals into a clean version by applying generative adversarial networks to the raw signal. We also explore several variations of the proposed system, obtaining insights into proper architectural choices for an adversarially trained, convolutional autoencoder applied to speech. We conduct both objective and subjective evaluations to assess the performance of the proposed method. The former helps us choose among variations and better tune hyperparameters, while the latter is used in a listening experiment with 42 subjects, confirming the effectiveness of the approach in the real world. We also demonstrate the applicability of the approach for more generalized speech enhancement, where we have to regenerate voices from whispered signals.
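    A hedged sketch of this kind of time-domain adversarial setup, assuming PyTorch: a 1-D convolutional encoder-decoder generator maps noisy waveform chunks to enhanced ones, while a discriminator judges (noisy, clean-or-enhanced) pairs; the generator loss mixes the adversarial term with an L1 term toward the clean waveform. Layer sizes, chunk length, and loss weights are illustrative, not the paper's exact configuration.

```python
# Sketch of adversarial speech enhancement on raw waveforms: a convolutional
# autoencoder generator plus a pair-conditioned discriminator. Illustrative only.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv1d(1, 16, 31, stride=2, padding=15), nn.PReLU(),
            nn.Conv1d(16, 32, 31, stride=2, padding=15), nn.PReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(32, 16, 32, stride=2, padding=15), nn.PReLU(),
            nn.ConvTranspose1d(16, 1, 32, stride=2, padding=15), nn.Tanh(),
        )

    def forward(self, noisy):
        return self.dec(self.enc(noisy))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, 16, 31, stride=2, padding=15), nn.LeakyReLU(0.2),
            nn.Conv1d(16, 32, 31, stride=2, padding=15), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1),
        )

    def forward(self, noisy, candidate):
        return self.net(torch.cat([noisy, candidate], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

# One illustrative training step on toy waveform chunks (batch, 1, samples).
noisy = torch.randn(8, 1, 16384)
clean = torch.randn(8, 1, 16384)

# Discriminator: real (noisy, clean) pairs -> 1, enhanced pairs -> 0.
d_loss = bce(D(noisy, clean), torch.ones(8, 1)) + \
         bce(D(noisy, G(noisy).detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator: fool D while staying close to the clean waveform (L1 term).
enhanced = G(noisy)
g_loss = bce(D(noisy, enhanced), torch.ones(8, 1)) + \
         100 * nn.functional.l1_loss(enhanced, clean)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```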

    Articulatory representations to address acoustic variability in speech

    The past decade has seen phenomenal improvement in the performance of Automatic Speech Recognition (ASR) systems. In spite of this vast improvement, the state of the art still lags significantly behind human speech recognition. Even though certain systems claim super-human performance, this performance is often sub-par across domains and across datasets. This gap is predominantly due to a lack of robustness against speech variability. Even clean speech is extremely variable due to a large number of factors such as voice characteristics, speaking style, speaking rate, accents, casualness, emotions and more. The goal of this thesis is to investigate the variability of speech from the perspective of speech production, put forth robust articulatory features to address this variability, and incorporate these features into state-of-the-art ASR systems in the best way possible. ASR systems model speech as a sequence of distinctive phone units, like beads on a string. Although phonemes are distinctive units in the cognitive domain, their physical realizations are extremely varied due to coarticulation and lenition, which are commonly observed in conversational speech. Traditional approaches deal with this issue through di-, tri- or quin-phone based acoustic modeling, but these are insufficient to model longer contextual dependencies. Articulatory phonology analyzes speech as a constellation of coordinated articulatory gestures performed by the articulators in the vocal tract (lips, tongue tip, tongue body, jaw, glottis and velum). In this framework, acoustic variability is explained by the temporal overlap of gestures and their reduction in space. In order to analyze speech in terms of articulatory gestures, the gestures need to be estimated from the speech signal.
    The first part of the thesis focuses on a speaker-independent acoustic-to-articulatory inversion system developed to estimate vocal tract constriction variables (TVs) from speech. The mapping from acoustics to TVs was learned from the multi-speaker X-ray Microbeam (XRMB) articulatory dataset. Constriction regions derived from TV trajectories were defined as articulatory gestures using articulatory kinematics. The speech inversion system combined with the TV kinematics based gesture annotation provides a system to estimate articulatory gestures from speech.
    The second part of this thesis deals with the analysis of articulatory trajectories under different types of variability, such as multiple speakers, speaking rate, and accents. It was observed that speaker variation degraded the performance of the speech inversion system, so a Vocal Tract Length Normalization (VTLN) based speaker normalization technique was developed to address speaker variability in the acoustic and articulatory domains. The performance of the speech inversion system was analyzed on an articulatory dataset containing speaking rate variations to assess whether the model could reliably predict the TVs in challenging coarticulatory scenarios, and in cross-accent and cross-language scenarios through experiments on a Dutch and British English articulatory dataset. These experiments provide a quantitative measure of the robustness of speech inversion systems to different types of speech variability.
    The final part of this thesis deals with the incorporation of articulatory features in state-of-the-art medium-vocabulary ASR systems. A hybrid convolutional neural network (CNN) architecture was developed to fuse the acoustic and articulatory feature streams in an ASR system. ASR experiments were performed on the Wall Street Journal (WSJ) corpus, and several articulatory feature combinations were explored to determine the best combination. Cross-corpus evaluations were carried out to evaluate the WSJ-trained ASR system on TIMIT and on another dataset containing speaking rate variability. Results showed that combining articulatory features with acoustic features through the hybrid CNN improved the performance of the ASR system in both matched and mismatched evaluation conditions. The findings of this dissertation indicate that articulatory representations extracted from acoustics can be used to address the acoustic variability in speech observed due to speakers, accents, and speaking rates, and can further be used to improve the performance of Automatic Speech Recognition systems.
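    The stream-fusion idea from the final part could look roughly like the following PyTorch sketch: one convolutional branch over acoustic features, one over estimated tract variables, concatenated per frame before a phone classifier. All dimensions, layer sizes, and the per-frame classification head are illustrative assumptions rather than the architecture used in the dissertation.

```python
# Sketch of a hybrid CNN that fuses an acoustic stream and an articulatory
# (TV) stream before phone classification. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class HybridCNN(nn.Module):
    def __init__(self, n_acoustic=40, n_tv=6, n_phones=42):
        super().__init__()
        self.acoustic_branch = nn.Sequential(
            nn.Conv1d(n_acoustic, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.tv_branch = nn.Sequential(
            nn.Conv1d(n_tv, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 + 32, 128), nn.ReLU(), nn.Linear(128, n_phones),
        )

    def forward(self, acoustic, tv):
        # Inputs: (batch, features, frames); fuse the per-frame branch outputs.
        a = self.acoustic_branch(acoustic)
        t = self.tv_branch(tv)
        fused = torch.cat([a, t], dim=1).transpose(1, 2)  # (batch, frames, 96)
        return self.classifier(fused)                     # per-frame phone logits

model = HybridCNN()
acoustic = torch.randn(4, 40, 100)   # e.g., filterbank frames
tvs = torch.randn(4, 6, 100)         # TVs estimated by a speech inversion system
logits = model(acoustic, tvs)        # shape (4, 100, 42)
```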

    The Effect of Speaking Rate on Vowel Variability Based on the Uncontrolled Manifold Approach and Flow-Based Invertible Neural Network Modeling

    Variability is intrinsic to human speech production. One approach to understanding variability in speech is to decompose it into task-irrelevant (“good”) and task-relevant (“bad”) parts with respect to speech tasks. Based on the uncontrolled manifold (UCM) approach, this dissertation investigates how vowel token-to-token variability in articulation and acoustics can be decomposed into “good” and “bad” parts, and how speaking rate changes the pattern of these two parts, using the Haskins IEEE rate comparison database. Furthermore, it examines whether the “good” part of variability, or flexibility, can be modeled directly from speech data using the flow-based invertible neural network (FlowINN) framework. The application of the UCM analysis and the FlowINN modeling method is discussed, particularly focusing on how the “good” part of variability in speech can be useful rather than being disregarded as noise.
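    A toy numerical sketch of the UCM decomposition referred to above: token-to-token deviations of articulatory states are projected onto the null space of a local articulatory-to-task Jacobian (the task-irrelevant, “good” part) and onto its orthogonal complement (the task-relevant, “bad” part). The Jacobian and the data here are random stand-ins; the dissertation's actual variables and estimation procedure are not reproduced.

```python
# UCM-style decomposition of token-to-token variability: split deviations into
# a null-space ("good") component and its orthogonal ("bad") component.
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
n_tokens, n_artic, n_task = 200, 6, 2     # 6 articulatory dims, 2 task dims

tokens = rng.normal(size=(n_tokens, n_artic))   # token-to-token articulations
J = rng.normal(size=(n_task, n_artic))          # stand-in articulatory-to-task Jacobian

deviations = tokens - tokens.mean(axis=0)
N = null_space(J)                               # orthonormal basis of the UCM

ucm_part = deviations @ N @ N.T                 # task-irrelevant ("good") part
orth_part = deviations - ucm_part               # task-relevant ("bad") part

# Variance per dimension within vs orthogonal to the UCM.
var_ucm = (ucm_part ** 2).sum() / (n_tokens * N.shape[1])
var_orth = (orth_part ** 2).sum() / (n_tokens * (n_artic - N.shape[1]))
print(f"variance within UCM: {var_ucm:.3f}, orthogonal to UCM: {var_orth:.3f}")
```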

    Information theoretic perspectives on en- and decoding in audition and vision

    In cognitive neuroscience, encoding and decoding models mathematically relate stimuli in the outside world to neuronal or behavioural responses. While both stimuli and responses can be multidimensional variables, these models are, on their own, limited to bivariate descriptions of correspondences. In order to assess the cognitive or neuroscientific significance of such correspondences, a key challenge is to set them in relation to other variables. This thesis uses information theory to contextualise encoding and decoding models in example cases from audition and vision. In the first example, encoding models based on a certain operationalisation of the stimulus are relativised by models based on other, conceptually simpler operationalisations of the same stimulus material that are shown to predict the same neuronal response variance. This highlights the ambiguity inherent in an individual model. In the second example, a methodological contribution is made to the problem of relating the bivariate dependency of stimuli and responses to the history of response components with high degrees of predictability. This perspective demonstrates that only a subset of all stimulus-correlated response variance can be expected to be genuinely caused by the stimulus, while another subset is the consequence of the response’s own dynamics. In the third and final example, complex models are used to predict behavioural responses. Their predictions are grounded in experimentally controlled stimulus variance, making it easier to interpret which stimulus features the models use to predict responses. Together, these three perspectives underscore the need to go beyond bivariate descriptions of correspondences in order to understand the process of perception.
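    As a generic illustration of an encoding model in the sense used above (not the thesis's own analyses), the sketch below fits a cross-validated ridge regression from one operationalisation of the stimulus to a response; comparing its predictive performance against a model built on a conceptually simpler operationalisation of the same stimuli is what puts an individual model in context.

```python
# Generic encoding-model sketch: a regularized linear map from a stimulus
# feature matrix (one "operationalisation") to a response, scored by
# cross-validated prediction. Features and responses are toy data.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_features = 500, 20
stimulus_features = rng.normal(size=(n_samples, n_features))
response = stimulus_features @ rng.normal(size=n_features) + rng.normal(size=n_samples)

encoder = RidgeCV(alphas=np.logspace(-2, 3, 10))
r2 = cross_val_score(encoder, stimulus_features, response, cv=5, scoring="r2")
print(f"cross-validated R^2: {r2.mean():.2f}")
```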