37 research outputs found

    Compensation of Nuisance Factors for Speaker and Language Recognition

    The variability of the channel and environment is one of the most important factors affecting the performance of text-independent speaker verification systems. The best techniques for channel compensation are model based. Most of them have been proposed for Gaussian mixture models, while in the feature domain blind channel compensation is usually performed. The aim of this work is to explore techniques that allow more accurate intersession compensation in the feature domain. Compensating the features rather than the models has the advantage that the transformed parameters can be used with models of a different nature and complexity and for different tasks. In this paper, we evaluate the effects of the compensation of the intersession variability obtained by means of the channel factors approach. In particular, we compare channel variability modeling in the usual Gaussian mixture model domain with our proposed feature domain compensation technique. We show that the two approaches lead to similar results on the NIST 2005 Speaker Recognition Evaluation data with a reduced computation cost. We also report the results of a system, based on the intersession compensation technique in the feature space, that was among the best participants in the NIST 2006 Speaker Recognition Evaluation. Moreover, we show how we obtained significant performance improvement in language recognition by estimating and compensating, in the feature domain, the distortions due to interspeaker variability within the same language. Index Terms: Factor anal

    Typicality extraction in a Speaker Binary Keys model

    In the field of speaker recognition, the recently proposed notion of "Speaker Binary Key" provides a representation of each acoustic frame in a discriminant binary space. This approach relies on a unique acoustic model composed of a large set of speaker-specific local likelihood peaks (called specificities). The model proposes a spatial coverage where each frame is characterized in terms of neighborhood. The most frequent specificities, picked to represent the whole utterance, generate a binary key vector. The flexibility of this modeling makes it possible to capture non-parametric behaviors. In this paper, we introduce a concept of "typicality" between binary keys, with a discriminant goal. We describe an algorithm able to extract such typicalities, which involves a singular value decomposition in a binary space. The theoretical aspects of this decomposition as well as its potential in terms of future developments are presented. All the propositions are also experimentally validated using the NIST SRE 2008 framework.

    Scalable learning for geostatistics and speaker recognition

    With improved data acquisition methods, the amount of data being collected has increased severalfold. One of the objectives of data collection is to learn useful underlying patterns. To work with data at this scale, methods must not only be effective on the underlying data but also scale to larger data collections. This thesis focuses on developing scalable and effective methods targeted at different domains, geostatistics and speaker recognition in particular. Initially we focus on kernel-based learning methods and develop a GPU-based parallel framework for this class of problems. An improved numerical algorithm that utilizes GPU parallelization to further enhance the computational performance of kernel regression is proposed. These methods are then demonstrated on problems arising in geostatistics and speaker recognition. In geostatistics, data is often collected at scattered locations, and factors like instrument malfunction lead to missing observations. Applications often require the ability to interpolate this scattered spatiotemporal data onto a regular grid continuously over time. This problem can be formulated as a regression problem, and one of the most popular geostatistical interpolation techniques, kriging, is analogous to a standard kernel method: Gaussian process regression. Kriging is computationally expensive and needs major modifications and accelerations in order to be used practically. The GPU framework developed for kernel methods is extended to kriging, and the GPU's texture memory is better utilized for enhanced computational performance. Speaker recognition deals with the task of verifying a person's identity based on samples of his/her speech, called "utterances". This thesis focuses on the text-independent framework, and three new recognition frameworks were developed for this problem. We proposed a kernelized Renyi distance based similarity scoring for speaker recognition. While its performance is promising, it does not generalize well for limited training data and therefore does not compare well to state-of-the-art recognition systems. These systems compensate for the variability in the speech data due to the message, channel variability, noise and reverberation. State-of-the-art systems model each speaker as a mixture of Gaussians (GMM) and compensate for the variability (termed "nuisance"). We propose a novel discriminative framework using a latent variable technique, partial least squares (PLS), for improved recognition. The kernelized version of this algorithm is used to achieve a state-of-the-art speaker ID system that shows results competitive with the best systems reported on in NIST's 2010 Speaker Recognition Evaluation.
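    Illustration: the analogy between kriging and Gaussian process regression noted in the abstract above can be sketched in a few lines of NumPy. This is only a CPU sketch under stated assumptions, not the GPU framework of the thesis: the squared-exponential kernel, the length scale, the noise term, and the names `rbf_kernel` and `gp_predict` are all hypothetical choices made for the example.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between the rows of a and b (assumed kernel)."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gp_predict(x_train, y_train, x_test, length_scale=1.0, noise=1e-6):
    """Posterior mean of a zero-mean Gaussian process (simple kriging) at x_test."""
    k_train = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    k_cross = rbf_kernel(x_test, x_train, length_scale)
    weights = np.linalg.solve(k_train, y_train)  # interpolation weights
    return k_cross @ weights

# Scattered observations interpolated onto a regular grid.
x_obs = np.array([[0.0], [0.7], [1.9], [3.1], [4.0]])
y_obs = np.sin(x_obs[:, 0])
x_grid = np.linspace(0.0, 4.0, 9)[:, None]
y_grid = gp_predict(x_obs, y_obs, x_grid)
```

    The linear solve is the costly step (cubic in the number of observations), which is one reason kriging needs acceleration before it is practical at scale.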

    Discriminative and generative approaches for long- and short-term speaker characteristics modeling : application to speaker verification

    The speaker verification problem can be stated as follows: given two speech recordings, determine whether or not they have been uttered by the same speaker. Most current speaker verification systems are based on Gaussian mixture models. This probabilistic representation makes it possible to model adequately the complex distribution of the underlying speech feature parameters. However, it represents an inadequate basis for discriminating between speakers, which is the key issue in the area of speaker verification. In the first part of this thesis, we attempt to overcome these difficulties by proposing to combine support vector machines, a well-established discriminative modeling technique, with two generative approaches based on Gaussian mixture models. In the first generative approach, a target speaker is represented by a Gaussian mixture model corresponding to a Maximum A Posteriori adaptation of a large Gaussian mixture model, coined universal background model, to the target speaker data. The second generative approach is the Joint Factor Analysis that has become the state of the art in the field of speaker verification during the last three years. The advantage of this technique is that it provides a framework of powerful tools for modeling the inter-speaker and channel variabilities. We propose and test several kernel functions that are integrated in the design of both previous combinations. The best results are obtained when the support vector machines are applied within a new space called the "total variability space", defined using the factor analysis. In this novel modeling approach, the channel effect is treated through a combination of linear discriminant analysis and kernel normalization based on the inverse of the within covariance matrix of the speaker. In the second part of this thesis, we present a new approach to modeling the speaker's long-term prosodic and spectral characteristics. This novel approach is based on continuous approximations of the prosodic and cepstral contours contained in a pseudo-syllabic segment of speech. Each of these contours is fitted to a Legendre polynomial, whose coefficients are modeled by a Gaussian mixture model. The joint factor analysis is used to treat the speaker and channel variabilities. Finally, we perform score fusion between systems based on long-term speaker characteristics and those described above that use short-term speaker features.
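    Illustration: the contour-fitting step described in the abstract above (each prosodic or cepstral contour fitted to a Legendre polynomial) might look like the following sketch. The polynomial order, the synthetic pitch contour, and the function name `contour_to_legendre` are assumptions for the example; the thesis's actual segmentation and settings are not given here.

```python
import numpy as np
from numpy.polynomial import legendre

def contour_to_legendre(contour, order=5):
    """Least-squares Legendre coefficients of a contour sampled over one segment."""
    contour = np.asarray(contour, dtype=float)
    # Map the segment onto [-1, 1], the interval where Legendre polynomials are orthogonal.
    t = np.linspace(-1.0, 1.0, len(contour))
    return legendre.legfit(t, contour, order)

# Synthetic pitch contour built from P0 = 1, P1 = t and P2 = 1.5 t^2 - 0.5.
t = np.linspace(-1.0, 1.0, 50)
pitch = 120.0 + 15.0 * t - 10.0 * (1.5 * t**2 - 0.5)
coeffs = contour_to_legendre(pitch, order=5)
```

    The resulting fixed-length coefficient vector (one per segment) is the kind of feature that a Gaussian mixture model can then be trained on, regardless of the segment's duration.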

    Speaker characterization using adult and children’s speech

    Speech signals contain important information about a speaker, such as age, gender, language, accent, and emotional/psychological state. Automatic recognition of these types of characteristics has a wide range of commercial, medical and forensic applications such as interactive voice response systems, service customization, natural human-machine interaction, recognizing the type of pathology of speakers, and directing the forensic investigation process. Many such applications depend on reliable systems using short speech segments without regard to the spoken text (text-independent). All these applications also apply to children’s speech. This research aims to develop accurate methods and tools to identify different characteristics of the speakers. Our experiments cover speaker recognition, gender recognition, age-group classification, and accent identification. However, similar approaches and techniques can be applied to identify other characteristics such as emotional/psychological state. The main focus of this research is on detecting these characteristics from children’s speech, which has previously been reported as more challenging than adult speech. Furthermore, the impact of different frequency bands on the performance of several recognition systems is studied, and the performance obtained using children’s speech is compared with the corresponding results from experiments using adults’ speech. Speaker characterization is performed by fitting a probability density function to acoustic features extracted from the speech signals. Since the distribution of acoustic features is complex, Gaussian mixture models (GMM) are applied. Due to lack of data, parametric model adaptation methods have been applied to adapt the universal background model (UBM) to the characteristics of utterances. An effective approach involves adapting the UBM to speech signals using the Maximum-A-Posteriori (MAP) scheme. Then, the Gaussian means of the adapted GMM are concatenated to form a Gaussian mean super-vector for a given utterance. Finally, a classification or regression algorithm is used to identify the speaker characteristics. While effective, Gaussian mean super-vectors are of high dimensionality, resulting in high computational cost and difficulty in obtaining a robust model in the context of limited data. In the field of speaker recognition, recent advances using the i-vector framework have increased the classification accuracy. This framework, which provides a compact representation of an utterance in the form of a low-dimensional feature vector, applies a simple factor analysis on GMM means.
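    Illustration: the pipeline described in the abstract above (MAP adaptation of the UBM, then concatenation of the adapted means into a super-vector) can be sketched as follows. This assumes mean-only adaptation of a diagonal-covariance UBM with a relevance factor, which is a common convention but is not stated in the abstract; the function name, the relevance value, and the toy data are hypothetical.

```python
import numpy as np

def map_adapted_supervector(frames, ubm_weights, ubm_means, ubm_vars, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM, then stack the means."""
    # Log-density of every frame under every component: shape (T, C).
    diff = frames[:, None, :] - ubm_means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff**2 / ubm_vars[None], axis=2)
                        + np.sum(np.log(2.0 * np.pi * ubm_vars), axis=1)[None, :])
    log_post = np.log(ubm_weights)[None, :] + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)          # component posteriors per frame

    counts = post.sum(axis=0)                        # soft counts, shape (C,)
    data_means = (post.T @ frames) / np.maximum(counts, 1e-10)[:, None]
    alpha = counts / (counts + relevance)            # data-dependent adaptation weight
    adapted = alpha[:, None] * data_means + (1.0 - alpha)[:, None] * ubm_means
    return adapted.reshape(-1)                       # concatenated means, length C * D

# Toy two-component UBM in two dimensions and one synthetic "utterance".
rng = np.random.default_rng(0)
ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[0.0, 0.0], [5.0, 5.0]])
ubm_vars = np.ones((2, 2))
frames = rng.normal(loc=5.0, scale=1.0, size=(200, 2))
supervector = map_adapted_supervector(frames, ubm_weights, ubm_means, ubm_vars)
```

    The super-vector length is the number of components times the feature dimension, so a UBM with many components yields the high dimensionality the abstract mentions.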

    Machine listening techniques as a complement to video image analysis in forensics

    Video is now one of the major sources of information for forensics. However, video documents can originate from various recording devices (CCTV, mobile devices, ...) with inconsistent quality and can sometimes be recorded in challenging light or motion conditions. Therefore, the amount of information that can be extracted relying solely on the video image can vary to a great extent. Most videos, however, also include an audio recording. Machine listening can then become a valuable complement to video image analysis in challenging scenarios. In this paper, the authors present a brief overview of some machine listening techniques and their application to the analysis of video documents for forensics. The applicability of these techniques to forensics problems is then discussed in the light of machine listening system performance.