
    Estimating underlying articulatory targets of Thai vowels by using deep learning based on generating synthetic samples from a 3D vocal tract model and data augmentation

    Representation learning is one of the fundamental issues in modeling articulatory-based speech synthesis with target-driven models. This paper proposes a computational strategy for learning underlying articulatory targets from a 3D articulatory speech synthesis model using a bi-directional long short-term memory recurrent neural network, based on a small set of representative seed samples. From this seed set, a larger training set was generated that provided richer contextual variations for the model to learn. The deep learning model for acoustic-to-target mapping was then trained to model the inverse relation of the articulation process. This allows the trained model to map given acoustic data onto articulatory target parameters, which can then be used to identify target distributions based on linguistic contexts. The model was evaluated on its effectiveness in mapping acoustics to articulation and on the perceptual accuracy of speech reproduced from the estimated articulation. The results indicate that the model can accurately imitate speech with a high degree of phonemic precision.
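    The acoustic-to-target mapping described here can be pictured as a bidirectional LSTM regression model. The following is a minimal sketch in Keras, not the paper's actual architecture; the feature dimensions, layer sizes, and variable names are illustrative assumptions.

```python
# Minimal sketch (assumed configuration, not the paper's model): a BiLSTM that
# maps a sequence of acoustic frames to a vector of articulatory target parameters.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

N_ACOUSTIC = 13   # assumed: MFCC-like features per frame
N_TARGETS = 8     # assumed: number of articulatory target parameters
MAX_FRAMES = 200  # assumed: padded utterance length

model = models.Sequential([
    layers.Input(shape=(MAX_FRAMES, N_ACOUSTIC)),
    layers.Bidirectional(layers.LSTM(128, return_sequences=False)),
    layers.Dense(64, activation="tanh"),
    layers.Dense(N_TARGETS),  # one static target vector per utterance
])
model.compile(optimizer="adam", loss="mse")

# Random stand-in for the augmented training set generated from seed samples.
X = np.random.randn(32, MAX_FRAMES, N_ACOUSTIC).astype("float32")
y = np.random.randn(32, N_TARGETS).astype("float32")
model.fit(X, y, epochs=1, batch_size=8, verbose=0)
```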

    ARTICULATORY INFORMATION FOR ROBUST SPEECH RECOGNITION

    Current Automatic Speech Recognition (ASR) systems fall well short of human speech recognition performance because they lack robustness against speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, put forth different ways to address them, and finally present an ASR architecture based upon these robustness criteria. Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as 'beads-on-a-string', where the beads are the individual phone units. While phone units are distinctive in the cognitive domain, they vary in the physical domain, and this variation arises from a combination of factors including speech style and speaking rate; a phenomenon commonly known as 'coarticulation'. Traditional ASR systems address such coarticulatory variations by using contextualized phone units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research, a study was performed using synthetically generated speech to obtain a proof of concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that having vocal tract constriction trajectories (TVs) as an intermediate representation facilitated the gesture recognition task from the speech signal. Presently, no natural speech database contains articulatory gesture annotation; hence, an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs. Two natural speech databases, X-ray Microbeam and Aurora-2, were annotated; the former was used to train a TV estimator and the latter to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observations: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs estimated from the acoustic speech signal. In this setup, the articulatory gestures were modeled as hidden random variables, eliminating the need for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only help to account for coarticulatory variations but also significantly improve the noise robustness of the ASR system.
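    The TV-estimation step that feeds the DBN can be illustrated with a simple frame-wise regressor from MFCCs to tract-variable values. This is a hedged sketch only; the network shape, feature dimensions, and context window are assumptions, not the dissertation's actual architecture.

```python
# Illustrative sketch: a regressor from context-stacked MFCC frames to vocal
# tract constriction trajectories (TVs), which a DBN could then use as one
# observation stream alongside the MFCCs themselves.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

N_MFCC = 13    # assumed MFCC dimension per frame
N_TVS = 8      # assumed number of tract variables
CONTEXT = 11   # assumed number of frames stacked around the current frame

tv_estimator = models.Sequential([
    layers.Input(shape=(CONTEXT * N_MFCC,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(N_TVS),  # continuous TV values for the centre frame
])
tv_estimator.compile(optimizer="adam", loss="mse")

# Random stand-ins for (context-stacked MFCCs, TV annotations from time-warping).
X = np.random.randn(64, CONTEXT * N_MFCC).astype("float32")
y = np.random.randn(64, N_TVS).astype("float32")
tv_estimator.fit(X, y, epochs=1, verbose=0)
```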

    Artificial Vocal Learning guided by Phoneme Recognition and Visual Information

    This paper introduces a paradigm shift regarding vocal learning simulations, in which the communicative function of speech acquisition determines the learning process and intelligibility is considered the primary measure of learning success. To this end, a novel approach for artificial vocal learning is presented that utilizes deep neural network-based phoneme recognition to calculate the speech acquisition objective function. This function guides a learning framework that uses the state-of-the-art articulatory speech synthesizer VocalTractLab as the motor-to-acoustic forward model. In this way, an extensive set of German phonemes, including most of the consonants and all stressed vowels, was produced successfully. The synthetic phonemes were rated as highly intelligible by human listeners. Furthermore, it is shown that visual speech information, such as lip and jaw movements, can be extracted from video recordings and incorporated into the learning framework as an additional loss component during the optimization process. It was observed that this visual loss did not increase the overall intelligibility of the phonemes. Instead, it acted as a regularization mechanism that facilitated the finding of more biologically plausible solutions in the articulatory domain.
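    The kind of combined objective described above can be sketched as an intelligibility term from a phoneme recognizer plus a weighted visual term. The function name, the weighting scheme, and the toy values below are assumptions for illustration, not the paper's actual loss formulation.

```python
# Hedged sketch of a combined vocal-learning objective: phoneme-recognition
# intelligibility plus a visual (lip/jaw trajectory) regularization term.
import numpy as np

def combined_loss(phoneme_posteriors, target_phoneme, lip_jaw_pred,
                  lip_jaw_video, visual_weight=0.1):
    """Lower is better. phoneme_posteriors: recognizer output for the
    synthesized audio; lip_jaw_*: predicted vs. video-derived trajectories."""
    # Intelligibility term: negative log-probability of the intended phoneme.
    intelligibility = -np.log(phoneme_posteriors[target_phoneme] + 1e-9)
    # Visual term: mean squared distance between predicted and observed
    # lip/jaw trajectories, acting as a regularizer rather than a main driver.
    visual = np.mean((lip_jaw_pred - lip_jaw_video) ** 2)
    return intelligibility + visual_weight * visual

# Example call with toy values (3 candidate phonemes, 50-frame trajectories).
post = np.array([0.1, 0.7, 0.2])
loss = combined_loss(post, target_phoneme=1,
                     lip_jaw_pred=np.zeros((50, 2)),
                     lip_jaw_video=np.full((50, 2), 0.05))
```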

    On Invariance and Selectivity in Representation Learning

    We discuss data representations that can be learned automatically from data, are invariant to transformations, and are at the same time selective, in the sense that two points have the same representation only if one is a transformation of the other. The mathematical results presented here sharpen some of the key claims of i-theory, a recent theory of feedforward processing in sensory cortex.
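    In the sense used above, invariance and selectivity of a representation under a group of transformations can be stated compactly as follows; the notation is ours for illustration, not necessarily the paper's.

```latex
% Invariance and selectivity of a representation \mu : X \to F under a group G
% of transformations acting on the data space X.
\begin{align*}
\text{Invariance:}  &\quad \mu(gx) = \mu(x) \quad \text{for all } g \in G,\ x \in X,\\
\text{Selectivity:} &\quad \mu(x) = \mu(x') \;\Longrightarrow\; x' = gx \ \text{for some } g \in G.
\end{align*}
```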

    Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab

    Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances in terms of both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL), in order to address or mitigate existing problems of speech inversion such as the non-unique acoustic-to-articulatory mapping, acoustic variation among different speakers, and the time-consuming nature of the process. The first method finds appropriate VTL gestural scores for given natural utterances using a genetic algorithm. It consists of two steps: gestural score initialization and optimization. In the first step, gestural scores were initialized from the given acoustic signals using speech recognition, grapheme-to-phoneme (G2P) conversion, and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores were optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that sought to minimize the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters were also regularized during the optimization process to restrict them to reasonable values. The second method was based on long short-term memory (LSTM) and convolutional neural networks, which captured the temporal dependence and the spatial structure of the acoustic features, respectively. Neural network regression models were trained that used acoustic features as inputs and produced articulatory trajectories as outputs. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on the smoothness loss of the articulatory trajectories and another based on the acoustic loss between the original and predicted acoustic features. The best-performing genetic algorithm and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 for speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% for speaker-independent utterances of German words, respectively. When applied to German sentence utterances, as well as English and Mandarin Chinese word utterances, the neural network-based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results showed that both methods reproduced not only the articulatory processes but also the acoustic signals of the reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and to estimated articulatory trajectories that were better suited to articulation with VTL, thus reproducing more natural and intelligible speech. The study also found that the convolutional layers, when used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Furthermore, the neural network-based ACS systems trained on German data generalized to utterances of other languages.
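    The analysis-by-synthesis fitness used by the genetic algorithm can be illustrated with a cosine-distance comparison between acoustic feature sequences plus a simple parameter regularizer. This is a minimal sketch; the feature extraction, the synthesizer call, and the regularization weight are placeholders, not the thesis implementation.

```python
# Sketch of a GA fitness for articulatory copy synthesis: cosine distance
# between the natural utterance's features and those synthesized from a
# candidate gestural score, plus a stand-in parameter regularizer.
import numpy as np

def cosine_distance(a, b, eps=1e-9):
    """Frame-averaged cosine distance between two feature matrices of equal
    shape (frames x feature_dim); 0 means identical direction per frame."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return float(np.mean(1.0 - num / den))

def fitness(candidate_features, natural_features, candidate_params,
            reg_weight=0.01):
    # Acoustic term plus a simple regularizer keeping articulatory parameters
    # small (a stand-in for restricting them to "reasonable values").
    acoustic = cosine_distance(candidate_features, natural_features)
    regularization = reg_weight * np.mean(np.asarray(candidate_params) ** 2)
    return acoustic + regularization  # the genetic algorithm minimizes this
```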

    Learning How to Speak: Imitation-Based Refinement of Syllable Production in an Articulatory-Acoustic Model

    Philippsen A, Reinhart F, Wrede B. Learning How to Speak: Imitation-Based Refinement of Syllable Production in an Articulatory-Acoustic Model. Presented at the Fourth Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics (ICDL-EpiRob), Genoa, Italy. This paper proposes an efficient neural network model for learning the articulatory-acoustic forward and inverse mapping of consonant-vowel sequences, including coarticulation effects. It is shown that the learned models can generalize vowels as well as consonants to other contexts and that the need for supervised training examples can be reduced by refining initial forward and inverse models using acoustic examples only. The models are initially trained on smaller sets of examples and then improved by presenting auditory goals that are imitated. The acoustic outcomes of the imitations, together with the executed actions, provide new training pairs. It is shown that this unsupervised, imitation-based refinement significantly decreases the error of both the forward and the inverse model. Using a state-of-the-art articulatory speech synthesizer, our approach allows us to reproduce the acoustics from the learned articulatory trajectories, i.e., we can listen to the results and rate their quality both by error measures and by perception.
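    The imitation-based refinement loop can be sketched as follows. The callables `inverse_model`, `forward_model`, `synthesize`, and `fit` are hypothetical placeholders standing in for the learned mappings, the articulatory synthesizer, and a retraining routine; this is not the paper's API.

```python
# Hedged sketch of one imitation-based refinement pass: imitate auditory
# goals, then retrain both models on the (action, acoustic outcome) pairs.
import numpy as np

def refine(inverse_model, forward_model, synthesize, acoustic_goals, fit):
    new_actions, new_acoustics = [], []
    for goal in acoustic_goals:
        action = inverse_model(goal)   # propose an articulation for the goal
        outcome = synthesize(action)   # acoustic result of executing that action
        new_actions.append(action)
        new_acoustics.append(outcome)
    # The executed actions and their acoustic outcomes form new training pairs,
    # without requiring any supervised articulatory annotation.
    fit(forward_model, np.array(new_actions), np.array(new_acoustics))
    fit(inverse_model, np.array(new_acoustics), np.array(new_actions))
```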

    Multi-Attribute Seismic Analysis Using Unsupervised Machine Learning Method: Self-Organizing Maps

    Seismic attributes are a fundamental part of seismic interpretation and are routinely used by geoscientists to extract key information and visualize geological features. By combining the findings from each attribute, they can provide good insight into an area and help overcome many geological challenges. However, individually analyzing multiple attributes to find relevant information can be time-consuming and inefficient, especially when working with large datasets, and can lead to miscalculations, errors in judgement, and human bias. This is where Machine Learning (ML) methods can be implemented to improve existing interpretations or find additional information. ML can help by handling large volumes of multi-dimensional data and interrelating them. Methods such as Self-Organizing Maps (SOM) allow multi-attribute analysis and help extract more information than quantitative interpretation alone. SOM is an unsupervised neural network that can find meaningful and reliable patterns corresponding to a specific geological feature (Roden and Chen, 2017). The purpose of this thesis was to understand how SOM can ease the interpretation of direct hydrocarbon indicators (DHIs) in the Statfjord Field area. Several AVO attributes were generated to detect DHIs and were then used as input for multi-attribute SOM analysis. The SOMPY package in Python was used to train the model and generate SOM classification results. Data samples were classified based on best matching unit (BMU) hits and clusters in the data. The classification was then applied to the whole dataset and converted to seismic sections for comparison and interpretation. The SOM-classified seismic lines were compared with the results of the AVO attributes. Since DHIs are anomalous data, they were expected to be represented by small data clusters and BMUs with low hit counts. While the SOM reproduced the seismic reflectors well, it did not define the DHI features clearly enough for them to be easily interpreted. The use of fewer seismic attributes and the computational limitations of the machine could be some of the reasons the desired results were not achieved. However, the study has room for improvement and the potential to produce meaningful results. Improvements in model design and training, as well as the selection of input attributes, are areas that need to be addressed. Furthermore, testing other Python libraries and better handling of large datasets could allow better performance and more accurate results.
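    The multi-attribute SOM workflow can be illustrated with a small example. MiniSom is used below as a lightweight stand-in for the SOMPY package mentioned in the thesis; the grid size, training parameters, attribute count, and toy data are assumptions for illustration only.

```python
# Illustrative SOM sketch: train a map on multi-attribute samples and classify
# each sample by its best matching unit (BMU); anomalies such as DHIs would be
# expected to fall on BMUs with few hits.
import numpy as np
from minisom import MiniSom

# Toy stand-in for AVO attribute values at each seismic sample point.
n_samples, n_attributes = 5000, 4
data = np.random.randn(n_samples, n_attributes)

som = MiniSom(x=8, y=8, input_len=n_attributes, sigma=1.5, learning_rate=0.5)
som.random_weights_init(data)
som.train_random(data, num_iteration=10000)

bmu_per_sample = np.array([som.winner(v) for v in data])  # BMU grid coordinates
hits = som.activation_response(data)                      # hit count per BMU
```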

    A recurrent 16p12.1 microdeletion supports a two-hit model for severe developmental delay.

    We report the identification of a recurrent, 520-kb 16p12.1 microdeletion associated with childhood developmental delay. The microdeletion was detected in 20 of 11,873 cases compared with 2 of 8,540 controls (P = 0.0009, OR = 7.2) and replicated in a second series of 22 of 9,254 cases compared with 6 of 6,299 controls (P = 0.028, OR = 2.5). Most deletions were inherited, with carrier parents more likely to manifest neuropsychiatric phenotypes than non-carrier parents (P = 0.037, OR = 6). Probands were more likely to carry an additional large copy-number variant than matched controls (10 of 42 cases, P = 5.7 × 10^-5, OR = 6.6). The clinical features of individuals with two mutations were distinct from and/or more severe than those of individuals carrying only the co-occurring mutation. Our data support a two-hit model in which the 16p12.1 microdeletion both predisposes to neuropsychiatric phenotypes as a single event and exacerbates neurodevelopmental phenotypes in association with other large deletions or duplications. Analysis of other microdeletions with variable expressivity indicates that this two-hit model might be more generally applicable to neuropsychiatric disease.
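    As an arithmetic check on the discovery-series figures quoted above, the odds ratio can be recomputed from the 2x2 case-control counts; scipy's Fisher's exact test is used here purely for illustration, and the exact p-value may differ slightly from the published value depending on the test the authors used.

```python
# Recomputing the discovery-series odds ratio from the quoted counts:
# 20 of 11,873 cases vs 2 of 8,540 controls carrying the 16p12.1 microdeletion.
from scipy.stats import fisher_exact

table = [[20, 11873 - 20],   # cases: carriers, non-carriers
         [2,  8540 - 2]]     # controls: carriers, non-carriers
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(round(odds_ratio, 1), p_value)   # odds ratio ~7.2, matching the abstract
```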