Automatic Speech Recognition: the New Millennium
We present a new approach to automatic speech recognition (ASR) based on the formalism of Bayesian networks. We lay the foundations of new ASR systems whose robustness relies on the fidelity of the speech model and on the information contained in the training data.
Unsupervised Stream-Weights Computation in Classification and Recognition Tasks
In this paper, we provide theoretical results on the problem of optimal stream-weight selection for multi-stream classification. It is shown that, in the presence of estimation or modeling errors, using stream weights can decrease the total classification error. Stream-weight estimates are computed for various conditions. We then turn our attention to the problem of unsupervised stream-weight computation. Based on the theoretical results, we propose to use models and "anti-models" (class-specific background models) to estimate stream weights. A non-linear function of the ratio of the inter- to intra-class distance is used for stream-weight estimation. The proposed unsupervised stream-weight estimation algorithm is evaluated both on artificial data and on the problem of audio-visual speech classification. Finally, the proposed algorithm is extended to the problem of audio-visual speech recognition. It is shown that the proposed algorithms achieve results comparable to the supervised minimum-error training approach under most testing conditions.
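The weight-estimation idea above can be sketched as follows. Note that the abstract does not specify the exact non-linear function of the inter/intra-class distance ratio; the logistic squashing and its `gamma` slope below are illustrative assumptions, not the authors' choice.

```python
import numpy as np

def stream_weights(inter_dist, intra_dist, gamma=1.0):
    """Estimate per-stream weights from the ratio of inter- to
    intra-class distances (one entry per stream).

    The logistic mapping and gamma are illustrative assumptions;
    the paper uses an unspecified non-linear function of the ratio.
    """
    ratio = np.asarray(inter_dist) / np.asarray(intra_dist)
    # streams that separate classes well (ratio >> 1) get weight near 1
    raw = 1.0 / (1.0 + np.exp(-gamma * (ratio - 1.0)))
    return raw / raw.sum()  # normalize so weights sum to 1

# Example: the audio stream separates classes well, the visual stream poorly
w = stream_weights(inter_dist=[4.0, 1.2], intra_dist=[1.0, 1.0])
```

Streams whose classes are far apart relative to their within-class spread receive larger weights, which matches the intuition of down-weighting a noisy modality.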
On classification between normal and pathological voices using the MEEI-KayPENTAX database: Issues and consequences
A large amount of research in pathological voice classification considers the task of feature extraction for discrimination between normal and dysphonic sustained vowels. The most widely used dataset for this purpose is the Massachusetts Eye & Ear Infirmary (MEEI) Voice Disorders Database commercialized by KayPENTAX Corp. During the last two decades, dozens of methods have been proposed to extract discriminative features from these signals in order to design accurate classifiers between the two classes of this database. The main contribution of this paper is to show that the normal and dysphonic sustained vowels of the KayPENTAX database are actually perfectly separable. This implies that this dataset is not suited for the normal-vs-dysphonic classification task, as long as the only concern is to achieve high classification accuracy. Indeed, we show that a single scalar parameter extracted from a matching pursuit decomposition of these signals (with a Gabor dictionary) yields perfect classification accuracy (100% with a large margin). We then discuss the implications of this finding for the precautions that should be taken with this database and for research in pathological voice detection in general. Index Terms: pathological voice classification, speech perturbation measure, dysphonia, matching pursuit, MEEI-KayPENTAX Voice Disorders Database
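The core operation behind such a feature — one matching-pursuit step over a Gabor dictionary — can be sketched as follows. The toy dictionary, the atom parameters, and the choice of the first-iteration coefficient as the scalar statistic are illustrative assumptions; the paper does not disclose which scalar it extracts.

```python
import numpy as np

def gabor_atom(n, center, freq, scale):
    """Real Gabor atom: a Gaussian-windowed cosine, normalized to unit L2 norm."""
    t = np.arange(n)
    g = np.exp(-0.5 * ((t - center) / scale) ** 2) * np.cos(2 * np.pi * freq * t)
    return g / np.linalg.norm(g)

def first_mp_coefficient(signal, atoms):
    """One matching-pursuit step: the largest absolute inner product
    between the signal and any dictionary atom. Using this value as
    a classification feature is an illustrative simplification."""
    return max(abs(np.dot(signal, a)) for a in atoms)

n = 256
atoms = [gabor_atom(n, c, f, 16.0) for c in (64, 128, 192) for f in (0.05, 0.1)]
x = gabor_atom(n, 128, 0.1, 16.0)   # toy signal equal to one dictionary atom
feat = first_mp_coefficient(x, atoms)  # close to 1.0, since x matches an atom exactly
```

Because all atoms have unit norm, the coefficient is bounded by the signal's norm, and a signal that coincides with a dictionary atom attains the maximum.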
Large Margin GMM for discriminative speaker verification
Gaussian mixture models (GMM), trained using the generative criterion of maximum likelihood estimation, have been the most popular approach in speaker recognition during the last decades. This approach is also widely used in many other classification tasks and applications. Generative learning is not, however, the optimal way to address classification problems. In this paper, we first present a new algorithm for discriminative learning of diagonal GMM under a large margin criterion. This algorithm has the major advantage of being highly efficient, which allows fast discriminative GMM training on large-scale databases. We then evaluate its performance on a full NIST speaker verification task using NIST-SRE'2006 data. In particular, we use the popular Symmetrical Factor Analysis (SFA) for session variability compensation. The results show that our system outperforms the state-of-the-art approaches of GMM-SFA and the SVM-based GSL-NAP. Relative reductions of the Equal Error Rate of about 9.33% and 14.88%, respectively, are achieved over these systems.
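The large margin criterion can be illustrated with a minimal sketch: for each training example, the log-likelihood under the target speaker's GMM should exceed every competitor's by at least a fixed margin, and violations incur a hinge penalty. The paper's exact objective, regularizer, and efficient optimizer are not reproduced here; this shows the margin criterion only.

```python
import numpy as np

def diag_gmm_loglik(x, weights, means, variances):
    """Log-likelihood of frame x under a diagonal-covariance GMM."""
    comp = (np.log(weights)
            - 0.5 * np.sum(np.log(2 * np.pi * variances)
                           + (x - means) ** 2 / variances, axis=1))
    m = comp.max()
    return m + np.log(np.sum(np.exp(comp - m)))  # log-sum-exp for stability

def large_margin_loss(x, y, gmms, margin=1.0):
    """Hinge penalty whenever the target class y fails to beat a
    competitor by `margin` in log-likelihood (a sketch of the
    criterion, not the paper's full training objective)."""
    scores = [diag_gmm_loglik(x, *g) for g in gmms]
    target = scores[y]
    return sum(max(0.0, margin - (target - s))
               for c, s in enumerate(scores) if c != y)

# Two single-component 1-D GMMs: class 0 centered at 0, class 1 at 5
g0 = (np.array([1.0]), np.array([[0.0]]), np.array([[1.0]]))
g1 = (np.array([1.0]), np.array([[5.0]]), np.array([[1.0]]))
x = np.array([0.0])
loss_correct = large_margin_loss(x, 0, [g0, g1])  # margin satisfied: 0.0
loss_wrong = large_margin_loss(x, 1, [g0, g1])    # large hinge penalty
```

A well-separated example contributes zero loss, so training effort concentrates on examples near or across the decision boundary, which is what distinguishes this criterion from maximum likelihood.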
A new sequence kernel for speaker verification
Using the theory of Reproducing Kernel Hilbert Spaces, we design a new sequence kernel that measures the similarity between two sequences of observations. We apply this kernel to a speaker verification task (NIST 2004 evaluation campaign). The results show that incorporating our new sequence kernel into an SVM architecture not only yields far better results than a baseline UBM-GMM classifier, but also outperforms a classifier using a GLDS (Generalized Linear Discriminant Sequence) kernel. Moreover, our kernel operates in a lower-dimensional space while allowing a wide choice of base kernels.
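The general idea of a sequence kernel — a similarity between two variable-length sequences of frames, built from a base kernel between individual frames — can be sketched with a mean-map construction. This is a generic RKHS recipe used for illustration, not the specific kernel proposed in the paper.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """Base kernel between two individual frames (RBF, an assumed choice)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sequence_kernel(X, Y, gamma=0.5):
    """Mean-map sequence kernel: average base-kernel value over all
    frame pairs of the two observation sequences. Symmetric and
    positive semi-definite, since it is the inner product of the
    sequences' mean embeddings in the base kernel's RKHS."""
    return float(np.mean([[rbf(x, y, gamma) for y in Y] for x in X]))

# Sequences of different lengths are compared directly
X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 1.0]])
k_xy = sequence_kernel(X, Y)
```

Such a kernel can be plugged into any SVM implementation that accepts a precomputed Gram matrix, which is how sequence kernels are typically used in speaker verification.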
Efficient multipulse approximation of speech excitation using the most singular manifold
INTERSPEECH 2012. We propose a novel approach to finding the locations of the multipulse sequence that approximates the speech source excitation. This approach is based on the notion of the Most Singular Manifold (MSM), which is associated with the set of least predictable events. The MSM is formed by identifying (directly from the speech waveform) multiscale singularities that may correspond to significant impulsive excitations of the vocal tract. This identification is done through a multiscale measure of local predictability and the estimation of its associated singularity exponents. Once the pulse locations are found using the MSM, their amplitudes are computed using the second stage of the classical MultiPulse Excitation (MPE) coder. The multipulse sequence is then fed to the classical LPC synthesizer to reconstruct speech. The resulting MSM-based algorithm is shown to be significantly more efficient than MPE. We evaluate our algorithm using 1 hour of speech from the TIMIT database and compare its performance to MPE and a recent approach based on compressed sensing (CS). The results show that our algorithm yields perceptual quality similar to MPE and outperforms the CS method when the number of pulses is low.
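The second stage mentioned above — computing pulse amplitudes once locations are fixed — can be sketched as a least-squares fit of the LPC synthesis filter's responses to the target signal. The MSM location search itself (singularity exponents) is not reproduced; positions are assumed given, and the filter order and example values are illustrative.

```python
import numpy as np

def pulse_amplitudes(target, lpc, locations):
    """Given pulse positions, solve least-squares amplitudes so that the
    LPC synthesis filter 1/A(z), driven by the pulses, matches `target`.
    `lpc` holds the coefficients a_1..a_p of A(z) = 1 + sum a_k z^{-k}."""
    n = len(target)
    # impulse response of the all-pole synthesis filter 1/A(z)
    h = np.zeros(n)
    h[0] = 1.0
    for i in range(1, n):
        h[i] = -sum(lpc[k] * h[i - 1 - k] for k in range(min(len(lpc), i)))
    # each column: filter response to a unit pulse at one candidate location
    basis = np.column_stack([np.concatenate([np.zeros(p), h[:n - p]])
                             for p in locations])
    amps, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return amps

# Toy check: recover a single pulse of amplitude 2 at position 3
n = 32
lpc = np.array([-0.5])                 # A(z) = 1 - 0.5 z^{-1}
h = 0.5 ** np.arange(n)                # impulse response of 1/A(z)
target = 2.0 * np.concatenate([np.zeros(3), h[:n - 3]])
amps = pulse_amplitudes(target, lpc, [3])
```

Because the problem is linear once locations are fixed, this stage is cheap; the expensive part of classical MPE is the location search that the MSM approach replaces.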
Fast training of Large Margin diagonal Gaussian mixture models for speaker identification
Gaussian mixture models (GMM) have been widely and successfully used in speaker recognition during the last decades. They are generally trained using the generative criterion of maximum likelihood estimation. In an earlier work, we proposed an algorithm for discriminative training of GMM with diagonal covariances under a large margin criterion. In this paper, we present a new version of this algorithm which has the major advantage of being computationally highly efficient. The resulting algorithm is thus well suited to handling large-scale databases. We carry out experiments on a speaker identification task using NIST-SRE'2006 data and compare our new algorithm to the baseline generative GMM using different GMM sizes. The results show that our system significantly outperforms the baseline GMM in all configurations, with high computational efficiency.
Combination of SVM and Large Margin GMM modeling for speaker identification
Most state-of-the-art speaker recognition systems are partially or completely based on Gaussian mixture models (GMM). GMM have been widely and successfully used in speaker recognition during the last decades. They are traditionally estimated from a world model using the generative criterion of Maximum A Posteriori. In an earlier work, we proposed an efficient algorithm for discriminative learning of GMM with diagonal covariances under a large margin criterion. In this paper, we evaluate the combination of the large margin GMM modeling approach with SVM in the setting of speaker identification. We carry out a full NIST speaker identification task using NIST-SRE'2006 data, in a Symmetrical Factor Analysis compensation scheme. The results show that the two modeling approaches are complementary and that their combination outperforms their single use.
Speaker verification using Large Margin GMM discriminative training
Gaussian mixture models (GMM) have been widely and successfully used in speaker recognition during the last decades. They are generally trained using the generative criterion of maximum likelihood estimation. In an earlier work, we proposed an algorithm for discriminative training of GMM with diagonal covariances under a large margin criterion. In this paper, we present a new version of this algorithm which has the major advantage of being computationally highly efficient. The resulting algorithm is thus well suited to handling large-scale databases. To show the effectiveness of the new algorithm, we carry out a full NIST speaker verification task using NIST-SRE'2006 data. The results show that our system outperforms the baseline GMM, with high computational efficiency.