Factorization of Discriminatively Trained i-vector Extractor for Speaker Recognition
In this work, we continue our research on the i-vector extractor for speaker verification (SV) and optimize its architecture for fast and effective discriminative training. We were motivated by the computational and memory requirements caused by the large number of parameters of the original generative i-vector model. Our aim is to preserve the power of the original generative model while focusing it on the extraction of speaker-related information. We show that a standard generative i-vector extractor can be represented by a model with significantly fewer parameters that obtains similar performance on SV tasks. We can further refine this compact model by discriminative training and obtain i-vectors that lead to better performance on various SV benchmarks representing different acoustic domains.
Comment: Submitted to Interspeech 2019, Graz, Austria. arXiv admin note: substantial text overlap with arXiv:1810.1318
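The abstract does not specify how the extractor is factorized, but the parameter-saving argument can be illustrated with a simple low-rank decomposition. The sketch below is an assumption for illustration only, not the paper's method: it compares the parameter count of a full total-variability matrix T against a hypothetical two-factor approximation T ≈ A·B (all dimensions are assumed, typical i-vector sizes).

```python
# Illustrative parameter counts for a full vs. factorized i-vector extractor.
# All sizes and the low-rank form are assumptions, not taken from the paper.
C, F, D = 2048, 60, 400        # GMM components, feature dim, i-vector dim
full_params = C * F * D        # full total-variability matrix T: (C*F) x D

r = 100                        # hypothetical inner rank of T ~ A @ B
factored_params = (C * F) * r + r * D

print(full_params)             # 49,152,000 parameters in the full model
print(factored_params)         # 12,328,000 in the factorized sketch
print(factored_params / full_params)
```

Even this naive factorization cuts the parameter count to roughly a quarter, which is the kind of saving that makes discriminative fine-tuning of the extractor tractable.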
Attentive Statistics Pooling for Deep Speaker Embedding
This paper proposes attentive statistics pooling for deep speaker embedding
in text-independent speaker verification. In conventional speaker embedding,
frame-level features are averaged over all the frames of a single utterance to
form an utterance-level feature. Our method utilizes an attention mechanism to
give different weights to different frames and generates not only weighted
means but also weighted standard deviations. In this way, it can capture
long-term variations in speaker characteristics more effectively. An evaluation
on the NIST SRE 2012 and the VoxCeleb data sets shows that it reduces equal
error rates (EERs) relative to the conventional method by 7.5% and 8.1%, respectively.
Comment: Proc. Interspeech 2018, pp. 2252--2256. arXiv admin note: text overlap with arXiv:1809.0931
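The pooling operation described above can be sketched directly: an attention network scores each frame, the scores are softmax-normalized into weights, and the utterance-level embedding concatenates the weighted mean with the weighted standard deviation. This is a minimal NumPy sketch; the one-layer tanh attention network and all dimensions are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def attentive_stats_pool(H, W, b, v):
    """Attentive statistics pooling over frame-level features.

    H: (T, d) frame-level features; W, b, v parametrize a one-layer
    tanh attention network (an assumed form, for illustration).
    Returns the (2d,) utterance-level vector [weighted mean; weighted std].
    """
    scores = np.tanh(H @ W + b) @ v                 # (T,) per-frame attention scores
    scores -= scores.max()                          # numerical stability for softmax
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax weights over frames
    mu = alpha @ H                                  # weighted mean, (d,)
    var = alpha @ (H ** 2) - mu ** 2                # weighted variance
    sigma = np.sqrt(np.clip(var, 1e-12, None))     # weighted standard deviation
    return np.concatenate([mu, sigma])

T, d, a = 200, 64, 128                              # frames, feature dim, attention dim
H = rng.standard_normal((T, d))
W = rng.standard_normal((d, a)) * 0.1
b = np.zeros(a)
v = rng.standard_normal(a) * 0.1
emb = attentive_stats_pool(H, W, b, v)
print(emb.shape)                                    # (128,): mean and std concatenated
```

With zero attention parameters the softmax weights become uniform and the weighted mean reduces to the plain frame average, which is exactly the conventional pooling the paper improves upon.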