DISCRIMINANT BINARY DATA REPRESENTATION FOR SPEAKER RECOGNITION
ABSTRACT In the supervector UBM/GMM paradigm, each acoustic file is represented by the mean parameters of a GMM model. This supervector space is used as a data representation space, but it has a high dimensionality, it is not intrinsically discriminant, and a complete speech segment is represented by a single vector, which largely removes the possibility of taking temporal or sequential information into account. This work proposes a new approach in which each acoustic frame is represented in a discriminant binary space. The proposed approach relies on a UBM to structure the acoustic space into regions. Each region is then populated with a set of Gaussian models, denoted "specificities", able to emphasize speaker-specific information. Each acoustic frame is mapped into the discriminant binary space by turning "on" or "off" all the specificities, creating a large binary vector. All the following steps, such as speaker reference extraction, likelihood estimation, and decision, take place in this binary space. Although this work is only a first step in this direction, experiments based on the NIST SRE 2008 framework demonstrate the potential of the proposed approach. Moreover, this approach opens the opportunity to rethink all the classical processes using a discrete, binary view.
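As an illustration, here is a minimal Python sketch of the frame-to-binary mapping described above. The number of regions, the number of specificities per region, the diagonal-covariance likelihood, and the thresholded on/off rule are all assumptions made for illustration; the paper's actual estimation of the specificities is not reproduced here.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    N_REGIONS = 64          # UBM components structuring the acoustic space (assumed)
    SPECS_PER_REGION = 32   # "specificities" attached to each region (assumed)

    def train_ubm(frames):
        # Fit a diagonal-covariance GMM as the UBM over pooled acoustic frames.
        ubm = GaussianMixture(n_components=N_REGIONS, covariance_type="diag")
        ubm.fit(frames)
        return ubm

    def binarize(frame, ubm, specificities, threshold=0.0):
        # Map one acoustic frame to a large binary vector. specificities[r] is
        # a list of (mean, var) Gaussian parameters attached to region r; a bit
        # is turned "on" when the frame's log-likelihood under that specificity
        # exceeds `threshold` (a hypothetical decision rule).
        region = int(ubm.predict(frame[None, :])[0])   # hard region assignment
        bits = np.zeros(N_REGIONS * SPECS_PER_REGION, dtype=np.uint8)
        for s, (mu, var) in enumerate(specificities[region]):
            ll = -0.5 * np.sum((frame - mu) ** 2 / var + np.log(2 * np.pi * var))
            if ll > threshold:
                bits[region * SPECS_PER_REGION + s] = 1
        return bits

Under this view a speech segment becomes a sequence of binary vectors rather than a single supervector, so frame order is preserved and downstream steps such as reference extraction and scoring can be reduced to operations on bits.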
The I4U Mega Fusion and Collaboration for NIST Speaker Recognition Evaluation 2016
The 2016 speaker recognition evaluation (SRE'16) is the latest edition in the series of benchmarking events conducted by the National Institute of Standards and Technology (NIST). I4U is a joint entry to SRE'16 resulting from the collaboration and active exchange of information among researchers from sixteen institutes and universities across four continents. The joint submission and several of its 32 sub-systems were among the top-performing systems. Much effort was devoted to two major challenges, namely unlabeled training data and the dataset shift from Switchboard-Mixer to the new Call My Net dataset. This paper summarizes the lessons learned and presents the shared view of the sixteen research groups on recent advances, the major paradigm shift, and the common tool chain used in speaker recognition as witnessed in SRE'16. More importantly, we look into the intriguing question of fusing a large ensemble of sub-systems and the potential benefit of large-scale collaboration.
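The common recipe for fusing many sub-systems in NIST SRE campaigns is a linear, logistic-regression score fusion trained on a development set; the following Python sketch assumes that recipe (the actual I4U fusion weights and calibration details are not reproduced here).

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fuse(dev_scores, dev_labels, eval_scores):
        # dev_scores, eval_scores: (n_trials, n_subsystems) matrices of
        # per-trial scores; dev_labels: 1 for target, 0 for non-target trials.
        fuser = LogisticRegression()
        fuser.fit(dev_scores, dev_labels)
        # The fused score is a weighted sum of sub-system scores plus an
        # offset, interpretable after calibration as a log-likelihood ratio.
        w, b = fuser.coef_[0], fuser.intercept_[0]
        return eval_scores @ w + b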
Language Identification Incorporating Lexical Information
In this paper we explore the use of lexical information for language identification (LID). Our reference LID system uses language-dependent acoustic phone models and phone-based bigram language models. For each language, lexical information is introduced by augmenting the phone vocabulary with the N most frequent words in the training data. Combined phone and word bigram models are used to provide linguistic constraints during acoustic decoding. Experiments were carried out on a 4-language telephone speech corpus. Using lexical information achieves a relative error reduction of about 20% on spontaneous and read speech compared to the reference phone-based system. Identification rates of 92%, 96% and 99% are achieved for spontaneous, read and task-specific speech segments respectively, with prior speech detection. 1. INTRODUCTION Many state-of-the-art language identification (LID) systems exploit phone-based acoustic and/or phonotactic scores [7]. Training generally consists of desi..
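A minimal sketch of the scoring rule implied above: each language has a bigram model over a vocabulary of phones augmented with its N most frequent words, and the hypothesized language is the one whose decoder output scores best under its own bigram. The add-one smoothing and the separation of decoding from bigram rescoring are simplifying assumptions for illustration.

    import math
    from collections import Counter

    def train_bigram(token_sequences, vocab_size):
        # Maximum-likelihood bigram with add-one smoothing over a vocabulary
        # of phones plus the N most frequent words of this language.
        uni, bi = Counter(), Counter()
        for seq in token_sequences:
            for a, b in zip(seq, seq[1:]):
                uni[a] += 1
                bi[(a, b)] += 1
        return lambda a, b: math.log((bi[(a, b)] + 1) / (uni[a] + vocab_size))

    def identify(decoded, models):
        # decoded[lang]: token sequence from that language's decoder;
        # models[lang]: its bigram scorer. Returns the best-scoring language.
        def score(lang):
            seq = decoded[lang]
            return sum(models[lang](a, b) for a, b in zip(seq, seq[1:]))
        return max(models, key=score)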
The LIMSI 1995 Hub3 System
In this paper we report on the LIMSI recognizer evaluated in the ARPA 1995 North American Business (NAB) News Hub 3 benchmark test. The LIMSI recognizer is an HMM-based system with Gaussian mixture observation densities. Decoding is carried out in multiple forward acoustic passes, where more refined acoustic and language models are used in successive passes and information is transmitted via word graphs. To deal with the varied acoustic conditions, channel compensation is performed iteratively, refining the noise estimates before each of the first three decoding passes. The final decoding pass is carried out with speaker-adapted models obtained via unsupervised adaptation using the MLLR method. In contrast to previous evaluations, the new Hub 3 test aimed at improving basic speaker-independent (SI) CSR performance on unlimited-vocabulary read speech recorded under more varied acoustical conditions (background environmental noise and unknown microphones). On the Sennheiser microphone (average SNR 29 dB) a word error of 9.1% was ..
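For the MLLR step, the speaker-adapted means are an affine transform of the speaker-independent ones, mu' = A mu + b, estimated on the first-pass hypothesis. The Python sketch below assumes a single global transform and identity covariances, which reduces the estimate to weighted least squares; the full per-regression-class, covariance-weighted MLLR solution is more involved.

    import numpy as np

    def estimate_mllr(frames, means, gammas):
        # frames: (T, d) adaptation data aligned to the first-pass hypothesis;
        # means: (S, d) Gaussian means; gammas: (T, S) occupation posteriors.
        # Returns W of shape (d, d+1) such that the adapted mean is W @ [1, mu].
        S, d = means.shape
        xi = np.hstack([np.ones((S, 1)), means])   # extended means [1; mu]
        G = (xi.T * gammas.sum(axis=0)) @ xi       # sum_s gamma_s xi_s xi_s^T
        K = frames.T @ gammas @ xi                 # sum_{t,s} gamma_ts o_t xi_s^T
        return K @ np.linalg.inv(G)

    def adapt_means(means, W):
        # Apply mu' = A mu + b to every Gaussian mean at once.
        xi = np.hstack([np.ones((means.shape[0], 1)), means])
        return xi @ W.T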
Developments in Continuous Speech Dictation using the 1995 ARPA NAB News Task
In this paper we report on the LIMSI recognizer evaluated in the ARPA 1995 North American Business (NAB) News benchmark test. In contrast to previous evaluations, the new Hub 3 test aims at improving basic speaker-independent (SI) CSR performance on unlimited-vocabulary read speech recorded under more varied acoustical conditions (background environmental noise and unknown microphones). The LIMSI recognizer is an HMM-based system with Gaussian mixture observation densities. Decoding is carried out in multiple forward acoustic passes, where more refined acoustic and language models are used in successive passes and information is transmitted via word graphs. To deal with the varied acoustic conditions, channel compensation is performed iteratively, refining the noise estimates before each of the first three decoding passes. The final decoding pass is carried out with speaker-adapted models obtained via unsupervised adaptation using the MLLR method. On the Sennheiser microphone (average SNR 29 dB) a word error of 9.1% was obtained, whi..