9 research outputs found

    An Interactive and Efficient Voice Processing For Home Automation System

    Home networking has evolved from linked personal computers to a more complex system that encompasses advanced security and automation applications. Once reserved for high-end luxury homes, home networks are now a regular feature in residences. These networks allow users to consolidate heating, air conditioning, lighting, appliances, entertainment, intercom, telecommunication, surveillance and security systems into an easy-to-operate unified network. Interactive applications operated by voice recognition, such as integrated door security systems and control of home appliances, are key features of home automation networks. This interactive capability depends on high-quality voice processing technology, including acoustic echo cancellation, low signal distortion and noise reduction techniques. A home automation system must also be scalable to allow future evolution, flexible to support field upgrades, interactive, easy to use, cost-efficient and reliable. This article introduces some of the voice quality performance issues and design challenges unique to home automation systems. It discusses home automation network applications that rely on voice processing, and examines some of the critical features and functionality that can help reduce design complexity and cost while delivering enhanced performance

    Statistical Audit via Gaussian Mixture Models in Business Intelligence Systems

    A Business Intelligence (BI) System employs tools from several areas of knowledge for the collection, integration and analysis of data to improve business decision making. The Brazilian Ministry of Planning, Budget and Management (MP) uses a BI System designed with the University of Brasília to detect irregularities in the payroll of the Brazilian federal government, performing audit trails on selected items and fields of the payroll database. The current auditing approach is entirely deterministic, since the audit trails look for previously known signatures of irregularities, which are composed by means of an ontological method used to represent auditors' concept maps. In this work, we propose to incorporate a statistical filter into this existing BI system in order to increase its performance in terms of processing speed and overall system responsiveness. The proposed statistical filter is based on a generative Gaussian Mixture Model (GMM) whose goal is to provide a complete stochastic model of the process, especially the latent probability density function of the generative mixture, and to use that model to filter the most probable payrolls. Inserting this statistical filter as a pre-processing stage preceding the deterministic auditing proved effective in reducing the amount of data to be analyzed by the audit trails, despite the penalty intrinsically associated with stochastic models due to false negative outcomes that are not further processed. In our approach, the gains obtained with the proposed pre-processing stage outweigh the impact of false negative outcomes
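As a rough illustration of the pre-filtering idea in this abstract, the sketch below scores scalar records under a one-dimensional Gaussian mixture and forwards only low-density records to a later stage. The function names, the scalar feature, the fixed mixture parameters and the choice to forward the less probable records are all assumptions made for the example; the abstract does not specify them.

```python
import math


def gmm_log_density(x, weights, means, stds):
    """Log density of scalar x under a 1-D Gaussian mixture (weights must be > 0)."""
    terms = [
        math.log(w) - math.log(s) - 0.5 * math.log(2 * math.pi)
        - 0.5 * ((x - m) / s) ** 2
        for w, m, s in zip(weights, means, stds)
    ]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def prefilter(values, weights, means, stds, threshold):
    """Keep values whose log density is below threshold (assumed filter direction)."""
    return [
        v for v in values
        if gmm_log_density(v, weights, means, stds) < threshold
    ]


# Hypothetical mixture: two pay bands centred on 1000 and 3000.
suspicious = prefilter(
    [1000.0, 2950.0, 9000.0],
    weights=[0.5, 0.5], means=[1000.0, 3000.0], stds=[100.0, 100.0],
    threshold=-12.0,
)
```

In this toy run only the value far from both mixture components is forwarded, which is the data-reduction effect the abstract describes; any record wrongly dropped at this stage would be a false negative.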

    An Investigation of Spectral Subband Centroids for Speaker Authentication

    Most conventional features used in speaker authentication are based on estimation of spectral envelopes in one way or another, in the form of cepstra, e.g., Mel-scale Filterbank Cepstrum Coefficients (MFCCs), Linear-scale Filterbank Cepstrum Coefficients (LFCCs) and Relative Spectral Perceptual Linear Prediction (RASTA-PLP). In this study, Spectral Subband Centroids (SSCs) are examined. Each of these features is the centroid frequency of one subband. They have properties similar to formant frequencies but are limited to a given subband. Preliminary empirical findings on a subset of the XM2VTS database, using Analysis of Variance and Linear Discriminant Analysis, suggest that, firstly, a certain number of centroids (up to about 16) are necessary to cover enough information about the speaker's identity; and secondly, that SSCs could provide complementary information to the conventional MFCCs. Theoretical findings suggest that mean-subtracted SSCs are more robust to additive noise. Further empirical experiments carried out on the more realistic NIST2001 database using SSCs, MFCCs (respectively LFCCs) and their combinations by concatenation suggest that SSCs are indeed robust and complementary to the conventional MFCC (respectively LFCC) features often used in speaker authentication
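A minimal sketch of how a subband centroid can be computed from one frame's power spectrum follows. Rectangular, non-overlapping subbands and plain power weighting are assumptions made here for brevity, and the function and argument names are hypothetical; the study itself may use other filter shapes or a compressed spectrum.

```python
def subband_centroids(power, freqs, edges):
    """Return one centroid frequency per subband [edges[i], edges[i + 1])."""
    centroids = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        num = 0.0  # sum of frequency * power inside the band
        den = 0.0  # sum of power inside the band
        for f, p in zip(freqs, power):
            if lo <= f < hi:
                num += f * p
                den += p
        # With no energy in the band, fall back to the band midpoint.
        centroids.append(num / den if den > 0 else 0.5 * (lo + hi))
    return centroids


# Toy spectrum with four bins split into two subbands.
example = subband_centroids(
    power=[1.0, 1.0, 0.0, 2.0],
    freqs=[0.0, 100.0, 200.0, 300.0],
    edges=[0.0, 200.0, 400.0],
)
```

Each centroid stays inside its own band, which is the sense in which the abstract calls these features formant-like but limited to a given subband.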

    A Lognormal Tied Mixture Model Of Pitch For Prosody-Based Speaker Recognition

    Statistics of pitch have recently been used in speaker recognition systems with good results. The success of such systems depends on robust and accurate computation of pitch statistics in the presence of pitch tracking errors. In this work, we develop a statistical model of pitch that allows unbiased estimation of pitch statistics from pitch tracks which are subject to doubling and/or halving. We first argue by a simple correlation model and empirically demonstrate by QQ plots that "clean" pitch is distributed with a lognormal distribution rather than the often assumed normal distribution. Second, we present a probabilistic model for estimated pitch via a pitch tracker in the presence of doubling/halving, which leads to a mixture of three lognormal distributions with tied means and variances for a total of four free parameters. We use the obtained pitch statistics as features in speaker verification on the March 1996 NIST Speaker Recognition Evaluation data (subset of Switchboard) and ..
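One way to read the "three lognormal distributions with tied means and variances" is sketched below: halving or doubling a pitch value shifts its logarithm by exactly ln 2, so the three components can share one log-mean `mu` (shifted by -ln 2, 0 and +ln 2) and one log-standard-deviation `sigma`, leaving `mu`, `sigma` and two mixture weights as the four free parameters. This parameterization and all names in the code are assumptions for illustration, not a transcription of the paper's model.

```python
import math

LOG2 = math.log(2.0)


def tied_lognormal_mixture_pdf(f0, mu, sigma, w_half, w_double):
    """Density of a tracked pitch value f0 (Hz) under three tied lognormals."""
    if f0 <= 0:
        return 0.0
    w_clean = 1.0 - w_half - w_double  # third weight is determined by the other two
    x = math.log(f0)
    norm = f0 * sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for weight, shift in ((w_half, -LOG2), (w_clean, 0.0), (w_double, LOG2)):
        z = (x - (mu + shift)) / sigma
        total += weight * math.exp(-0.5 * z * z) / norm
    return total


# Hypothetical speaker with median pitch 120 Hz and 10% halving errors.
density_at_half = tied_lognormal_mixture_pdf(
    60.0, mu=math.log(120.0), sigma=0.1, w_half=0.1, w_double=0.0
)
```

Fitting `mu` and `sigma` through such a mixture, rather than from the raw track, is what would let the clean-pitch statistics be estimated without bias from doubled or halved frames.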

    Speech/music discrimination and study of new parameters and models for a speaker identification system in the context of telephone conferences

    Building automatic speech understanding systems that work in real-world conditions requires reproducing certain human abilities. Besides understanding speech even when it is corrupted by noise, we are able to hold a conversation involving several speakers. This latter ability is tied to the fact that we implicitly identify the speakers. This speaker characterization allows us, for example, to hold telephone conversations in conference mode. In addition to recognizing vocabulary or identifying the speaker, we are also able to pick out music sequences (alternating with speech, in the background, etc.) that may appear when one of the parties puts the call on hold. Starting from this context, we set out to develop a system able, on the one hand, to discriminate between speech and music sequences and, on the other, to identify the speaker under conference-mode telephone conditions with handset variability. In other words, this thesis addresses two topics in speech processing. The first concerns the search for new parameters to improve the performance of algorithms that identify speakers over the telephone. The second is devoted to proposing new approaches for discriminating among speech, music and sung music. For speaker discrimination, we present a first study that aims to characterize the speaker by glottal-synchronous AM-FM parameters extracted at the output of a cochlear filter bank. The goal is to find new parameters that are more robust to noise and to telephone handset variability. As a result, the proposed system and the reference system obtained nearly similar scores. The best performance was recorded when the system used a parallel architecture made up of two recognizers based on MFCC and AM-FM parameters, respectively. In the same framework, we proposed a new modeling technique that takes into account the temporal dependence between the excitation source and the vocal tract. With short-duration tests, we obtained better performance than with the classical approach. However, when the test duration is increased, nearly the same performance is obtained for all the proposed systems. For speech/music discrimination, we proposed two systems. The first uses three parametric models trained for speech, music and sung music, respectively, without any normalization of the parameter vectors. Over a test duration of 100 ms, we obtained an average recognition rate of 93.77%. The second system requires no training and relies simply on a threshold to perform the classification

    Discriminative and generative approaches for long- and short-term speaker characteristics modeling : application to speaker verification

    The speaker verification problem can be stated as follows: given two speech recordings, determine whether or not they have been uttered by the same speaker. Most current speaker verification systems are based on Gaussian mixture models. This probabilistic representation makes it possible to model adequately the complex distribution of the underlying speech feature parameters. However, it is an inadequate basis for discriminating between speakers, which is the key issue in speaker verification. In the first part of this thesis, we attempt to overcome these difficulties by combining support vector machines, a well-established discriminative modeling technique, with two generative approaches based on Gaussian mixture models. In the first generative approach, a target speaker is represented by a Gaussian mixture model obtained by Maximum A Posteriori adaptation of a large Gaussian mixture model, called the universal background model, to the target speaker data. The second generative approach is Joint Factor Analysis, which has become the state of the art in speaker verification during the last three years. The advantage of this technique is that it provides a framework of powerful tools for modeling the inter-speaker and channel variabilities. We propose and test several kernel functions that are integrated in the design of both combinations. The best results are obtained when the support vector machines are applied within a new space called the "total variability space", defined using factor analysis. In this novel modeling approach, the channel effect is treated through a combination of linear discriminant analysis and kernel normalization based on the inverse of the within covariance matrix of the speaker. In the second part of this thesis, we present a new approach to modeling the speaker's long-term prosodic and spectral characteristics. This novel approach is based on continuous approximations of the prosodic and cepstral contours contained in a pseudo-syllabic segment of speech. Each of these contours is fitted to a Legendre polynomial, whose coefficients are modeled by a Gaussian mixture model. Joint factor analysis is used to treat the speaker and channel variabilities. Finally, we perform score fusion between the systems based on long-term speaker characteristics and those described above that use short-term speaker features
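The contour-fitting step mentioned in this abstract can be sketched as an ordinary least-squares fit on a Legendre basis. The code below assumes the samples of one segment are spread evenly over [-1, 1] and that the segment has at least two samples; the function names, the polynomial degree and the plain least-squares criterion are choices made for this example, not details given in the abstract.

```python
def legendre_basis(t, degree):
    """Values P_0(t)..P_degree(t), built with Bonnet's recurrence."""
    values = [1.0, t][: degree + 1]
    for n in range(1, degree):
        values.append(((2 * n + 1) * t * values[n] - n * values[n - 1]) / (n + 1))
    return values


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_legendre(contour, degree):
    """Least-squares Legendre coefficients for evenly spaced contour samples."""
    count = len(contour)  # assumed to be at least 2
    ts = [-1.0 + 2.0 * i / (count - 1) for i in range(count)]
    rows = [legendre_basis(t, degree) for t in ts]
    size = degree + 1
    # Normal equations: (B^T B) c = B^T y
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    rhs = [sum(r[i] * y for r, y in zip(rows, contour)) for i in range(size)]
    return solve(gram, rhs)
```

The returned coefficient vector has a fixed length regardless of how many frames the segment contains, which is what makes it usable as a feature vector for a downstream mixture model.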