
    Bayesian Speaker Adaptation Based on a New Hierarchical Probabilistic Model

    In this paper, a new hierarchical Bayesian speaker adaptation method called HMAP is proposed that combines the advantages of three conventional algorithms: maximum a posteriori (MAP), maximum-likelihood linear regression (MLLR), and eigenvoice. The result is excellent performance across a wide range of adaptation conditions. The new method efficiently utilizes intra-speaker and inter-speaker correlation information by modeling phone and speaker subspaces in a consistent hierarchical Bayesian way. The phone variations for a specific speaker are assumed to lie in a low-dimensional subspace. The phone coordinates, which are shared among different speakers, implicitly contain the intra-speaker correlation information. For a specific speaker, the phone variations, represented by speaker-dependent eigenphones, are concatenated into a supervector. The eigenphone supervector space is also a low-dimensional speaker subspace, which contains inter-speaker correlation information. Using principal component analysis (PCA), a new hierarchical probabilistic model for the generation of the speech observations is obtained. Speaker adaptation based on the new hierarchical model is derived using the maximum a posteriori criterion in a top-down manner. Both batch and online adaptation schemes are proposed. With tuned parameters, the new method can handle varying amounts of adaptation data automatically and efficiently. Experimental results on a Mandarin Chinese continuous speech recognition task show good performance under all testing conditions.
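    The MAP criterion referenced above reduces, for a Gaussian mean with a conjugate prior, to an interpolation between the prior (speaker-independent) mean and the adaptation data. A minimal sketch of that classical update (the function name and the prior-weight parameter `tau` are illustrative, not taken from the paper):

```python
import numpy as np

def map_adapt_mean(prior_mean, tau, frames):
    """MAP estimate of a Gaussian mean under a conjugate prior.

    prior_mean : speaker-independent mean (the prior mode)
    tau        : prior strength (pseudo-count of prior observations)
    frames     : (N, D) array of adaptation frames assigned to this Gaussian
    """
    n = len(frames)
    if n == 0:
        return prior_mean  # no adaptation data: fall back to the prior
    # weighted average of the prior mean and the sample statistics
    return (tau * prior_mean + frames.sum(axis=0)) / (tau + n)
```

    With few adaptation frames the estimate stays close to the prior mean; as the frame count grows it approaches the sample mean, which is how MAP-style adaptation handles varying amounts of data.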

    Automatic syllabification using segmental conditional random fields

    In this paper we present a statistical approach for the automatic syllabification of phonetic word transcriptions. A syllable bigram language model forms the core of the system. Given the large number of syllables in non-syllabic languages, sparsity is the main issue, especially since the available syllabified corpora tend to be small. Traditional back-off mechanisms offer only a partial solution to the sparsity problem. In this work we use a set of features for back-off purposes: on the one hand, probabilities such as consonant cluster probabilities, and on the other hand, a set of rules based on generic syllabification principles such as legality, sonority and maximal onset. To combine these highly correlated features with the baseline bigram feature, we employ segmental conditional random fields (SCRFs) as the statistical framework. The resulting method is very versatile and can be applied to any amount of data in any language. The method was tested on various datasets in English and Dutch with dictionary sizes varying between 1,000 and 60,000 words. We obtained 97.96% word accuracy for supervised syllabification and 91.22% word accuracy for unsupervised syllabification in English. When the top-2 generated syllabifications are included for a small fraction of the words, virtually perfect syllabification is obtained in supervised mode.
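    The syllable bigram language model at the core of the system can be sketched in a few lines; the simple unigram back-off below stands in for the richer SCRF feature combination the paper actually uses (the names and the back-off weight `alpha` are illustrative):

```python
from collections import Counter

def bigram_lm(syllabified_words, alpha=0.4):
    """Train a syllable bigram model with a simple unigram back-off.

    syllabified_words : list of syllable lists, e.g. [["won", "der"], ...]
    alpha             : back-off weight applied when a bigram is unseen
    Returns a function prob(prev, cur) giving the conditional probability.
    """
    uni, bi = Counter(), Counter()
    total = 0
    for word in syllabified_words:
        syls = ["<s>"] + word + ["</s>"]
        for a, b in zip(syls, syls[1:]):
            bi[(a, b)] += 1
        for s in syls:
            uni[s] += 1
            total += 1

    def prob(prev, cur):
        if bi[(prev, cur)] > 0:
            return bi[(prev, cur)] / uni[prev]  # relative bigram frequency
        return alpha * uni[cur] / total         # back off to the unigram
    return prob
```

    Sparsity shows up immediately: most syllable pairs are unseen even in a sizeable lexicon, which is why the back-off path dominates and motivates the feature-based back-off the paper proposes.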

    Nonparametric Bayesian Double Articulation Analyzer for Direct Language Acquisition from Continuous Speech Signals

    Human infants can discover words directly from unsegmented speech signals without any explicitly labeled data. In this paper, we develop a novel machine learning method called the nonparametric Bayesian double articulation analyzer (NPB-DAA) that can directly acquire language and acoustic models from observed continuous speech signals. For this purpose, we propose an integrative generative model that combines a language model and an acoustic model into a single generative model called the "hierarchical Dirichlet process hidden language model" (HDP-HLM). The HDP-HLM is obtained by extending the hierarchical Dirichlet process hidden semi-Markov model (HDP-HSMM) proposed by Johnson et al. An inference procedure for the HDP-HLM is derived using the blocked Gibbs sampler originally proposed for the HDP-HSMM. This procedure enables the simultaneous and direct inference of language and acoustic models from continuous speech signals. Based on the HDP-HLM and its inference procedure, we developed a novel double articulation analyzer. By assuming the HDP-HLM as a generative model of the observed time-series data, and by inferring the latent variables of the model, the method can analyze the latent double articulation structure of the data, i.e., hierarchically organized latent words and phonemes, in an unsupervised manner. This novel unsupervised double articulation analyzer is called the NPB-DAA. The NPB-DAA can automatically estimate the double articulation structure embedded in speech signals. We also carried out two evaluation experiments using synthetic data and actual human continuous speech signals representing Japanese vowel sequences. In the word acquisition and phoneme categorization tasks, the NPB-DAA outperformed a conventional double articulation analyzer (DAA) and a baseline automatic speech recognition system whose acoustic model was trained in a supervised manner. Comment: 15 pages, 7 figures. Draft submitted to IEEE Transactions on Autonomous Mental Development (TAMD).

    Producing power-law distributions and damping word frequencies with two-stage language models

    Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes, the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process, that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology.
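    The two-parameter (Pitman-Yor) Chinese restaurant process mentioned above is easy to simulate: each new customer joins an existing table k with probability proportional to (n_k - d), or opens a new table with probability proportional to (theta + d*K). A minimal sketch, not the paper's code:

```python
import random

def pitman_yor_crp(n_customers, discount, concentration, seed=0):
    """Seat customers by the Pitman-Yor Chinese restaurant process.

    Returns the list of table sizes; for 0 < discount < 1 the size
    distribution is heavy-tailed, mirroring word-token frequencies.
    """
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers at table k
    for n in range(n_customers):
        k = len(tables)
        # new table with probability (concentration + discount*k) / (concentration + n)
        if not tables or rng.random() * (concentration + n) < concentration + discount * k:
            tables.append(1)
        else:
            # existing table j with probability proportional to (n_j - discount)
            r = rng.random() * (n - discount * k)
            for j, size in enumerate(tables):
                r -= size - discount
                if r <= 0:
                    tables[j] += 1
                    break
            else:
                tables[-1] += 1  # guard against floating-point rounding
    return tables
```

    Treating tables as word types and table sizes as token counts gives the adaptor's effect: a few types become very frequent while most stay rare, the power-law behaviour the framework is designed to produce.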

    Morphological Segmentation for Keyword Spotting

    We explore the impact of morphological segmentation on keyword spotting (KWS). Despite potential benefits, state-of-the-art KWS systems do not use morphological information. In this paper, we augment a state-of-the-art KWS system with sub-word units derived from supervised and unsupervised morphological segmentations, and compare these with phonetic and syllabic segmentations. Our experiments demonstrate that morphemes improve the overall performance of KWS systems. Syllabic units, however, rival the performance of morphological units when used in KWS. By combining morphological, phonetic and syllabic segmentations, we demonstrate substantial performance gains. United States Intelligence Advanced Research Projects Activity (United States Army Research Laboratory Contract W911NF-12-C-0013).

    Computational prediction of Greek Nominal Allomorphy

    In Theoretical Morphology, interest in the topic of stem allomorphy has been renewed over the last two decades. The computational treatment of allomorphy, however, has remained a major challenge since the first systematic attempts to predict allomorphy with machine learning techniques. The goals of this paper are to predict allomorphic changes and to show the essential contribution of various morphological, phonological and semantic characteristics. To this end, we use a MaxEnt model to identify the weights of the characteristics on which allomorphy directly depends, and use these weights to design a predictive model. Our model is based on AMIS, and its overall accuracy was 86.49%. To improve the system, we adopted a more principled approach, which raised the correct prediction rate to 91.43%.
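    A MaxEnt model of the kind described reduces, in the binary case, to logistic regression trained by gradient ascent on the log-likelihood, with the learned weights playing the role of the feature weights the paper estimates with AMIS. A self-contained sketch (the trainer and the feature encoding are illustrative, not the paper's setup):

```python
import math

def train_maxent(examples, labels, lr=0.1, epochs=200):
    """Binary MaxEnt (logistic regression) trained by gradient ascent.

    examples : list of feature vectors (lists of floats), e.g. indicator
               features for morphological or phonological properties
    labels   : list of 0/1 outcomes (e.g. whether a stem change applies)
    Returns the learned feature weights.
    """
    dim = len(examples[0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))        # model probability
            for i in range(dim):
                w[i] += lr * (y - p) * x[i]       # log-likelihood gradient
    return w
```

    After training, the magnitude and sign of each weight indicate how strongly the corresponding characteristic pushes toward or against an allomorphic change, which is the diagnostic use the paper makes of its weights.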

    Advances in Probabilistic Deep Learning

    This thesis is concerned with methodological advances in probabilistic inference and their application to core challenges in machine perception and AI. Inferring a posterior distribution over the parameters of a model given some data is a central challenge that occurs in many fields, ranging from finance and artificial intelligence to physics. Exact calculation is impossible in all but the simplest cases, and a rich field of approximate inference has been developed to tackle this challenge. This thesis develops both an advance in approximate inference and an application of these methods to the problem of speech synthesis. In the first section of the thesis, we develop a novel framework for constructing Markov chain Monte Carlo (MCMC) kernels that can efficiently sample from high-dimensional distributions such as the posteriors that frequently occur in machine perception. We provide a specific instance of this framework and demonstrate that it can match or exceed the performance of Hamiltonian Monte Carlo without requiring gradients of the target distribution. In the second section of the thesis, we focus on the application of approximate inference techniques to the task of synthesising human speech from text. By using advances in neural variational inference, we are able to construct a state-of-the-art speech synthesis system in which it is possible to control aspects of prosody, such as emotional expression, from significantly less supervised data than previously existing state-of-the-art methods require.
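    The simplest gradient-free MCMC kernel, against which such frameworks are typically compared, is random-walk Metropolis: propose a perturbed state and accept it with probability min(1, target ratio). A minimal sketch of that generic baseline (not the kernel developed in the thesis):

```python
import math
import random

def random_walk_metropolis(log_target, x0, step, n_samples, seed=0):
    """Random-walk Metropolis sampling: a gradient-free MCMC baseline.

    log_target : unnormalised log-density of the target distribution
    x0         : initial state (float)
    step       : standard deviation of the Gaussian proposal
    """
    rng = random.Random(seed)
    x, lp = x0, log_target(x0)
    samples = []
    for _ in range(n_samples):
        prop = x + rng.gauss(0.0, step)
        lp_prop = log_target(prop)
        # accept with probability min(1, target(prop) / target(x))
        if math.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
        samples.append(x)
    return samples
```

    Like the kernels described above, this sampler only ever evaluates the target density, never its gradient; the cost is slow exploration in high dimensions, which is the gap advanced gradient-free kernels aim to close.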