
17 research outputs found

Discriminant Binary Data Representation for Speaker Recognition

Author: D Matrouf
J F Bonastre
P M Bousquet
Publication date: 02/04/2020

In the supervector UBM/GMM paradigm, each acoustic file is represented by the mean parameters of a GMM. This supervector space, used as the data representation space, is high-dimensional. Moreover, it is not intrinsically discriminant, and a complete speech segment is represented by a single vector, which largely rules out taking temporal or sequential information into account. This work proposes a new approach in which each acoustic frame is represented in a discriminant binary space. The proposed approach relies on a UBM to structure the acoustic space into regions. Each region is then populated with a set of Gaussian models, denoted as "specificities", which emphasize speaker-specific information. Each acoustic frame is mapped into the discriminant binary space by turning each specificity "on" or "off", yielding a large binary vector. All subsequent steps (speaker reference extraction, likelihood estimation, and decision) take place in this binary space. Although this work is only a first step in this direction, experiments based on the NIST SRE 2008 framework demonstrate the potential of the proposed approach. Moreover, the approach opens the opportunity to rethink all the classical processes from a discrete, binary point of view.

CiteSeerX
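
The frame-to-binary mapping described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how specificities are switched on, so the activation rule (in the `n_regions` best-scoring UBM regions, turn on the `n_active` most likely specificities), the speaker reference (the `n_keep` most frequently activated bits) and the similarity score (cosine between two binary vectors) are all assumptions, and every function name is hypothetical. Features are 1-D and the model parameters are toy values.

```python
import math
from collections import Counter


def diag_gauss_loglik(x, mean, var):
    """Log-density of vector x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def frame_to_binary(frame, ubm, specificities, n_regions=1, n_active=1):
    """Map one acoustic frame to a binary vector with one bit per specificity.

    ubm: list of (mean, var) pairs, one per acoustic region (hypothetical layout).
    specificities: specificities[r] lists the (mean, var) pairs attached to region r.
    Assumed rule: in the n_regions best UBM regions, switch on the n_active most
    likely specificities; every other bit stays off.
    """
    offsets, total = [], 0  # global index of each region's first bit
    for specs in specificities:
        offsets.append(total)
        total += len(specs)
    bits = [0] * total
    region_ll = [diag_gauss_loglik(frame, m, v) for m, v in ubm]
    top_regions = sorted(range(len(ubm)), key=lambda r: region_ll[r], reverse=True)[:n_regions]
    for r in top_regions:
        spec_ll = [diag_gauss_loglik(frame, m, v) for m, v in specificities[r]]
        best = sorted(range(len(spec_ll)), key=lambda s: spec_ll[s], reverse=True)[:n_active]
        for s in best:
            bits[offsets[r] + s] = 1
    return bits


def speaker_reference(frames, ubm, specificities, n_keep=2):
    """Hypothetical reference: indices of the n_keep most frequently activated bits."""
    counts = Counter()
    for frame in frames:
        bits = frame_to_binary(frame, ubm, specificities)
        counts.update(i for i, b in enumerate(bits) if b)
    return {i for i, _ in counts.most_common(n_keep)}


def binary_similarity(ref_a, ref_b):
    """Hypothetical score: cosine between two binary vectors given as sets of active bits."""
    if not ref_a or not ref_b:
        return 0.0
    return len(ref_a & ref_b) / math.sqrt(len(ref_a) * len(ref_b))


# Toy 1-D example: two UBM regions, two specificities per region (bits 0-3).
ubm = [([0.0], [1.0]), ([5.0], [1.0])]
specificities = [
    [([-1.0], [0.5]), ([1.0], [0.5])],
    [([4.0], [0.5]), ([6.0], [0.5])],
]
print(frame_to_binary([0.9], ubm, specificities))  # [0, 1, 0, 0]
ref = speaker_reference([[0.9], [1.2], [5.8]], ubm, specificities)
print(sorted(ref))  # [1, 3]
print(binary_similarity(ref, {1}))
```

In this sketch a speaker reference is just a set of active bit indices, so comparing two references reduces to a set intersection rather than a likelihood computation over continuous parameters.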

The I4U Mega Fusion and Collaboration for NIST Speaker Recognition Evaluation 2016

Author: Ajili M.
Alegre F.
Ambikairajah E.
Aronowitz H.
Bahmaninezhad F.
Bonastre J. F.
Bousquet P. M.
Busch C.
Chng E. S.
Delgado H.
Evans N.
Fauve B.
Halonen M.
Hansen J. H.L.
Hautamäki V.
Isadskiy S.
Jin R.
Kanervisto A.
Kheder W. B.
Kinnunen T.
Larcher A.
Le Lan G.
Lee K. A.
Li H.
Li Haizhou
Lim Z. H.
Lin W. W.
Liu Gang
Ma B.
Ma J.
Mak M. W.
Matrouf D.
Nautsch A.
Nguyen T. H.
Qian Q.
Rao W.
Rathgeb C.
Rouvier M.
Saeidi R.
Sahidullah M.
Sarkar A. K.
Sethu V.
Sizov A.
Sriskandaraja K.
Stafylakis T.
Sun H.
Tan Z. H.
Thomsen D. A.L.
Todisco M.
Tzimiropoulos G.
Vestman V.
Wang G.
Wang Tianzhou
Wang Z.
Xiao X.
Xu C.
Xu H.
Xue J.
Zhang C.
Zhao Q.
Zhao T.
Zhu S.
Publication venue: 'International Speech Communication Association'
Publication date: 01/01/2017

The 2016 Speaker Recognition Evaluation (SRE'16) is the latest edition in the series of benchmarking events conducted by the National Institute of Standards and Technology (NIST). I4U is a joint entry to SRE'16, resulting from the collaboration and active exchange of information among researchers from sixteen institutes and universities across four continents. The joint submission and several of its 32 sub-systems were among the top-performing systems. Much effort was devoted to two major challenges, namely unlabeled training data and the dataset shift from Switchboard-Mixer to the new Call My Net dataset. This paper summarizes the lessons learned and presents the shared view of the sixteen research groups on recent advances, major paradigm shifts, and the common tool chain used in speaker recognition, as witnessed in SRE'16. More importantly, we look into the intriguing question of fusing a large ensemble of sub-systems and the potential benefit of large-scale collaboration.

Aaltodoc Publication Archive

VBN
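
The I4U abstract does not say how its 32 sub-systems were fused. A common choice in NIST SRE submissions is linear score fusion: the fused score is a weighted sum of sub-system scores plus a bias, with the weights trained by logistic regression on a labeled development set. The sketch below assumes that scheme and fits it with plain batch gradient descent; the function names, the hyperparameters (`lr`, `epochs`, `l2`) and the toy scores are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fuse(score_vec, w, b):
    """Fused score: weighted sum of sub-system scores plus a bias (log-odds scale)."""
    return sum(wi * si for wi, si in zip(w, score_vec)) + b


def train_linear_fusion(scores, labels, lr=0.1, epochs=2000, l2=1e-3):
    """Fit w, b so that sigmoid(fuse(s, w, b)) approximates P(target | s).

    scores: one score vector per trial (one entry per sub-system).
    labels: 1 for target trials, 0 for non-target trials.
    Batch gradient descent on the mean cross-entropy plus an (l2 / 2) * ||w||^2 penalty.
    """
    n, n_sys = len(scores), len(scores[0])
    w, b = [0.0] * n_sys, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_sys, 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(fuse(s, w, b)) - y  # d(cross-entropy) / d(fused score)
            for k in range(n_sys):
                grad_w[k] += err * s[k]
            grad_b += err
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy development set: sub-system 0 separates the classes, sub-system 1 is noise.
dev_scores = [[2.0, 0.1], [1.5, -0.3], [1.8, 0.2], [-1.0, 0.0], [-1.5, 0.3], [-2.0, -0.1]]
dev_labels = [1, 1, 1, 0, 0, 0]
w, b = train_linear_fusion(dev_scores, dev_labels)
fused = [fuse(s, w, b) for s in dev_scores]
print(w, b)
```

With many sub-systems, the learned weights also indicate which sub-systems contribute complementary information; a nearly uninformative sub-system ends up with a small weight, as sub-system 1 does here.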

Application of Hidden Markov Models for denoising

Author: GAUVAIN (J. L.)
MATROUF (D.)
Publication venue: GRETSI, Saint Martin d'Hères, France
Publication date: 01/01/2001

I-Revues

Language Identification Incorporating Lexical Information

Author: D. Matrouf
J. L. Gauvain
L. F. Lamel
M. Adda-Decker

In this paper we explore the use of lexical information for language identification (LID). Our reference LID system uses language-dependent acoustic phone models and phone-based bigram language models. For each language, lexical information is introduced by augmenting the phone vocabulary with the N most frequent words in the training data. Combined phone and word bigram models are used to provide linguistic constraints during acoustic decoding. Experiments were carried out on a 4-language telephone speech corpus. Using lexical information achieves a relative error reduction of about 20% on spontaneous and read speech compared to the reference phone-based system. Identification rates of 92%, 96% and 99% are achieved for spontaneous, read and task-specific speech segments respectively, with prior speech detection.

1. INTRODUCTION
Many state-of-the-art language identification (LID) systems exploit phone-based acoustic and/or phonotactic scores [7]. Training generally consists of desi..

CiteSeerX
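
A toy sketch of the unit inventory and phonotactic scoring described in this abstract, under these assumptions: each language has a pronunciation lexicon, the N most frequent training words are added to the phone inventory as whole-word units, and an add-one-smoothed bigram over the mixed units scores a test phone string. Unlike the paper, which applies the combined phone/word bigram as a constraint during acoustic decoding, this sketch only rescores a fixed phone string and recovers word units with a greedy longest match, a hypothetical stand-in. The languages, lexicons and function names are hypothetical.

```python
import math
from collections import Counter


def to_units(words, lexicon, vocab_words):
    """Frequent words stay whole-word units; all other words are spelled out as phones."""
    units = []
    for w in words:
        if w in vocab_words:
            units.append(w)
        else:
            units.extend(lexicon[w])
    return units


class BigramLM:
    """Add-one smoothed bigram model over a mixed phone/word unit inventory."""

    def __init__(self, unit_sequences):
        self.history, self.pairs, vocab = Counter(), Counter(), set()
        for seq in unit_sequences:
            padded = ["<s>"] + seq + ["</s>"]
            vocab.update(padded)
            self.history.update(padded[:-1])
            self.pairs.update(zip(padded[:-1], padded[1:]))
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen units

    def log_prob(self, seq):
        padded = ["<s>"] + seq + ["</s>"]
        return sum(
            math.log((self.pairs[(prev, cur)] + 1) / (self.history[prev] + self.vocab_size))
            for prev, cur in zip(padded[:-1], padded[1:])
        )


def train_language(transcripts, lexicon, n_words):
    """Add the n_words most frequent words to the unit inventory, then train the bigram."""
    freq = Counter(w for t in transcripts for w in t)
    vocab_words = {w for w, _ in freq.most_common(n_words)}
    lm = BigramLM([to_units(t, lexicon, vocab_words) for t in transcripts])
    return lm, lexicon, vocab_words


def segment(phones, lexicon, vocab_words):
    """Greedy longest match: a phone run spelling a frequent word becomes that word unit."""
    prons = sorted(((tuple(lexicon[w]), w) for w in vocab_words), key=lambda p: -len(p[0]))
    units, i = [], 0
    while i < len(phones):
        for pron, word in prons:
            if tuple(phones[i:i + len(pron)]) == pron:
                units.append(word)
                i += len(pron)
                break
        else:
            units.append(phones[i])
            i += 1
    return units


def identify(phones, models):
    """Pick the language whose combined phone/word bigram scores the phone string highest."""
    scores = {
        lang: lm.log_prob(segment(phones, lexicon, vocab_words))
        for lang, (lm, lexicon, vocab_words) in models.items()
    }
    return max(scores, key=scores.get), scores


# Two hypothetical toy languages sharing some phone symbols.
lex_a = {"la": ["l", "a"], "casa": ["k", "a", "s", "a"], "sol": ["s", "o", "l"]}
lex_b = {"the": ["dh", "e"], "house": ["h", "a", "u", "s"], "sun": ["s", "a", "n"]}
models = {
    "A": train_language([["la", "casa"], ["la", "sol"], ["la", "casa", "la", "sol"]], lex_a, n_words=2),
    "B": train_language([["the", "house"], ["the", "sun"], ["the", "sun", "the", "house"]], lex_b, n_words=2),
}
best, _ = identify(["l", "a", "k", "a", "s", "a"], models)
print(best)  # A
```

The word units let a single bigram span a whole word, so the frequent-word sequence "la casa" is scored with two well-estimated transitions instead of several phone-level ones.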

The LIMSI 1995 Hub3 System

Author: D. Matrouf
G. Adda
J.L. Gauvain
L. Lamel

In this paper we report on the LIMSI recognizer evaluated in the ARPA 1995 North American Business (NAB) News Hub 3 benchmark test. The LIMSI recognizer is an HMM-based system with Gaussian mixtures. Decoding is carried out in multiple forward acoustic passes, where more refined acoustic and language models are used in successive passes and information is transmitted via word graphs. To deal with the varied acoustic conditions, channel compensation is performed iteratively, refining the noise estimates before the first three decoding passes. The final decoding pass is carried out with speaker-adapted models obtained via unsupervised adaptation using the MLLR method. In contrast to previous evaluations, the new Hub 3 test aimed at improving basic speaker-independent (SI) continuous speech recognition (CSR) performance on unlimited-vocabulary read speech recorded under more varied acoustic conditions (background environmental noise and unknown microphones). On the Sennheiser microphone (average SNR 29dB) a word error rate of 9.1% was ..

CiteSeerX
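
The 9.1% figure in this abstract is a word error rate, conventionally computed as (S + D + I) / N, where S, D and I are the substitutions, deletions and insertions in a minimum-edit-distance alignment of the recognizer output to an N-word reference transcript. A minimal sketch follows; official benchmark scoring also applies text normalization rules that it omits, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N from a Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[i][j]: minimum number of edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)


# Two substitutions (closed/close, today/to) plus one insertion (day) over 5 words.
print(word_error_rate("the market closed higher today", "the market close higher to day"))  # 0.6
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.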

Developments in Continuous Speech Dictation using the 1995 ARPA NAB News Task

Author: D. Matrouf
G. Adda
J.L. Gauvain
L. Lamel

In this paper we report on the LIMSI recognizer evaluated in the ARPA 1995 North American Business (NAB) News benchmark test. In contrast to previous evaluations, the new Hub 3 test aims at improving basic speaker-independent (SI) continuous speech recognition (CSR) performance on unlimited-vocabulary read speech recorded under more varied acoustic conditions (background environmental noise and unknown microphones). The LIMSI recognizer is an HMM-based system with Gaussian mixtures. Decoding is carried out in multiple forward acoustic passes, where more refined acoustic and language models are used in successive passes and information is transmitted via word graphs. To deal with the varied acoustic conditions, channel compensation is performed iteratively, refining the noise estimates before the first three decoding passes. The final decoding pass is carried out with speaker-adapted models obtained via unsupervised adaptation using the MLLR method. On the Sennheiser microphone (average SNR 29dB) a word error rate of 9.1% was obtained, whi..

CiteSeerX
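
The final pass in this abstract relies on MLLR, which adapts each Gaussian mean as mu'_m = W xi_m, where xi_m = [1, mu_m] is the extended mean vector. For diagonal covariances, row i of W solves G_i w_i = k_i, with G_i = sum_m (1 / sigma2_{m,i}) sum_t gamma_m(t) xi_m xi_m^T and k_i = sum_m (1 / sigma2_{m,i}) sum_t gamma_m(t) o_{t,i} xi_m. The sketch below estimates a single global transform from given occupation probabilities gamma_m(t); the abstract does not say how many transforms (regression classes) LIMSI used, so the single-transform setup, the function names and the toy data are assumptions.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (a is square)."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-12:
            raise ValueError("singular system: too little adaptation data")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def estimate_mllr(means, variances, frames, posteriors):
    """Estimate one global MLLR mean transform W of shape d x (d + 1).

    means, variances: per-Gaussian lists of length d (diagonal covariances).
    frames: observation vectors o_t; posteriors[t][m] is gamma_m(t).
    """
    d = len(means[0])
    xis = [[1.0] + list(mu) for mu in means]  # extended mean vectors
    W = []
    for i in range(d):  # each row of W is estimated independently
        G = [[0.0] * (d + 1) for _ in range(d + 1)]
        k = [0.0] * (d + 1)
        for m, xi in enumerate(xis):
            occ = sum(p[m] for p in posteriors)  # sum_t gamma_m(t)
            obs = sum(p[m] * o[i] for p, o in zip(posteriors, frames))  # sum_t gamma_m(t) o_{t,i}
            inv_var = 1.0 / variances[m][i]
            for r in range(d + 1):
                k[r] += inv_var * obs * xi[r]
                for c in range(d + 1):
                    G[r][c] += inv_var * occ * xi[r] * xi[c]
        W.append(solve(G, k))
    return W


def adapt_mean(W, mu):
    """Apply the transform: mu' = W [1, mu]."""
    xi = [1.0] + list(mu)
    return [sum(w * x for w, x in zip(row, xi)) for row in W]


# Toy check: frames generated exactly from a known transform, hard frame-to-Gaussian assignment.
W_true = [[1.0, 2.0, 0.0], [-1.0, 0.5, 1.0]]
means = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
variances = [[1.0, 1.0] for _ in means]
frames = [adapt_mean(W_true, mu) for mu in means]
posteriors = [[1.0 if m == t else 0.0 for m in range(len(means))] for t in range(len(frames))]
W = estimate_mllr(means, variances, frames, posteriors)
print(W)
```

Since a single transform is shared by all Gaussians, even components that received no adaptation frames get their means moved, which is what makes MLLR usable for unsupervised adaptation on limited data.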

Model and score adaptation for biometric systems: Coping with device interoperability and changing acquisition conditions

Author: Bonastre J-F
Kittler J
Marcel S
Matrouf D
Poh N
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 24/01/2020

University of Surrey

Anti-spoofing: Voice Conversion

Author: D. Matrouf
F. Alegre
N. Evans
Y. Stylianou
Z. Wu
Publication venue: 'Springer Science and Business Media LLC'

Crossref