2,275 research outputs found
GENDER INDEPENDENT DISCRIMINATIVE SPEAKER RECOGNITION IN I-VECTOR SPACE
Speaker recognition systems attain their best accuracy when trained with gender-dependent features and tested with known-gender trials. In real applications, however, gender labels are often not given. In this work we illustrate the design of a system that uses no gender labels in either training or test, i.e., a completely Gender Independent (GI) system. It relies on discriminative training, where the trials are i-vector pairs and the discrimination is between the hypothesis that the pair of feature vectors in a trial belongs to the same speaker and the hypothesis that it belongs to different speakers. We demonstrate that this pairwise discriminative training can be interpreted as a procedure that estimates the parameters of the best (second-order) approximation of the log-likelihood ratio score function, and that a pairwise SVM can be used to train a gender-independent system. Our results show that a pairwise GI SVM, while saving memory and execution time, achieves state-of-the-art performance on the latest NIST evaluations, comparable to that of a Gender Dependent (GD) system.
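For reference, the "best (second-order) approximation of the log-likelihood ratio" mentioned above is conventionally written as a symmetric quadratic function of the two i-vectors in a trial. The sketch below states that form; the symbols Λ, Γ, c and k stand for the parameters estimated by the pairwise discriminative training, and the notation is an assumption of ours rather than a quotation from the abstract.

```latex
% Symmetric second-order approximation of the LLR score for an
% i-vector trial (\phi_1, \phi_2); \Lambda, \Gamma, c and k are the
% parameters that pairwise discriminative training estimates.
s(\phi_1, \phi_2) =
      \phi_1^{\top} \Lambda \phi_2 + \phi_2^{\top} \Lambda \phi_1
    + \phi_1^{\top} \Gamma  \phi_1 + \phi_2^{\top} \Gamma  \phi_2
    + c^{\top} (\phi_1 + \phi_2) + k
```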
Pairwise Discriminative Speaker Verification in the I-Vector Space
This work presents a new and efficient approach to discriminative speaker verification in the i-vector space. We illustrate the development of a linear discriminative classifier that is trained to discriminate between the hypothesis that a pair of feature vectors in a trial belongs to the same speaker and the hypothesis that it belongs to different speakers. This approach is an alternative to the usual discriminative setup that discriminates between a speaker and all other speakers. We use a discriminative classifier based on a Support Vector Machine (SVM) that is trained to estimate the parameters of a symmetric quadratic function approximating a log-likelihood ratio score, without explicit modeling of the i-vector distributions as in generative Probabilistic Linear Discriminant Analysis (PLDA) models. Training these models is feasible because it is not necessary to expand the i-vector pairs, which would be expensive or even impossible for medium-sized training sets. The results of experiments performed on the tel-tel extended core condition of the NIST 2010 Speaker Recognition Evaluation are competitive with those obtained by generative models in terms of normalized Detection Cost Function and Equal Error Rate. Moreover, we show that it is possible to train a gender-independent discriminative model that achieves state-of-the-art accuracy, comparable to that of a gender-dependent system, saving memory and execution time both in training and in testing.
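To make concrete why the i-vector pairs never need to be expanded, note that a score of the symmetric quadratic form shown above can be evaluated directly from the two i-vectors of a trial. The following minimal NumPy sketch covers only the scoring step, assuming the parameters have already been trained; the dimensionality and all variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

D = 400  # i-vector dimensionality (illustrative)

# Parameters of the symmetric quadratic score function, e.g. as
# estimated by a pairwise SVM (random placeholders here).
rng = np.random.default_rng(0)
Lambda = rng.standard_normal((D, D))
Lambda = 0.5 * (Lambda + Lambda.T)  # symmetric cross-speaker term
Gamma = rng.standard_normal((D, D))
Gamma = 0.5 * (Gamma + Gamma.T)     # symmetric within-vector term
c = rng.standard_normal(D)
k = 0.0

def pairwise_score(phi1: np.ndarray, phi2: np.ndarray) -> float:
    """Symmetric second-order approximation of the LLR score.

    Evaluated directly on the i-vector pair, so no expanded pair
    representation is ever materialized.
    """
    # Since Lambda is symmetric, 2*phi1'Lambda phi2 equals
    # phi1'Lambda phi2 + phi2'Lambda phi1.
    cross = 2.0 * phi1 @ Lambda @ phi2
    within = phi1 @ Gamma @ phi1 + phi2 @ Gamma @ phi2
    return float(cross + within + c @ (phi1 + phi2) + k)

# A same/different-speaker decision is then a threshold on the score.
phi_enroll, phi_test = rng.standard_normal(D), rng.standard_normal(D)
print(pairwise_score(phi_enroll, phi_test))
```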
NPLDA: A Deep Neural PLDA Model for Speaker Verification
The state-of-the-art approach for speaker verification consists of a neural network based embedding extractor along with a backend generative model such as Probabilistic Linear Discriminant Analysis (PLDA). In this work, we propose a neural network approach for backend modeling in speaker recognition. The likelihood ratio score of the generative PLDA model is posed as a discriminative similarity function, and the learnable parameters of the score function are optimized using a verification cost. The proposed model, termed neural PLDA (NPLDA), is initialized using the generative PLDA model parameters. The loss function for the NPLDA model is an approximation of the minimum detection cost function (DCF). The speaker recognition experiments using the NPLDA model are performed on the speaker verification task in the VOiCES datasets as well as the SITW challenge dataset. In these experiments, the NPLDA model optimized using the proposed loss function improves significantly over the state-of-the-art PLDA based speaker verification system. Comment: Published in Odyssey 2020, the Speaker and Language Recognition Workshop (VOiCES Special Session). Link to GitHub Implementation: https://github.com/iiscleap/NeuralPlda. arXiv admin note: substantial text overlap with arXiv:2001.0703
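As a rough illustration of the two ingredients this abstract describes, a PLDA-style score with learnable parameters and a differentiable approximation of the detection cost, here is a PyTorch sketch. The quadratic form of the scorer, the sigmoid smoothing of the miss and false-alarm rates, and all names and dimensions are our assumptions; the authors' exact formulation is in the linked GitHub implementation.

```python
import torch
import torch.nn as nn

class NeuralPLDA(nn.Module):
    """PLDA-style pairwise scorer with learnable parameters (sketch).

    The quadratic form mirrors a generative PLDA LLR, so its weights
    can be initialized from a trained PLDA model and then fine-tuned
    discriminatively.
    """
    def __init__(self, dim: int = 512):
        super().__init__()
        self.P = nn.Parameter(torch.eye(dim))         # enroll-test cross term
        self.Q = nn.Parameter(torch.zeros(dim, dim))  # within-vector terms

    def forward(self, xe, xt):
        cross = torch.einsum("bi,ij,bj->b", xe, self.P, xt)
        within = (torch.einsum("bi,ij,bj->b", xe, self.Q, xe)
                  + torch.einsum("bi,ij,bj->b", xt, self.Q, xt))
        return cross + within

def soft_detection_cost(scores, labels, threshold=0.0, alpha=10.0,
                        p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Differentiable approximation of the detection cost function.

    Hard miss/false-alarm counts are replaced by sigmoids around the
    threshold so the cost can be minimized by gradient descent.
    """
    s = torch.sigmoid(alpha * (scores - threshold))   # soft "accept"
    p_miss = ((1 - s) * labels).sum() / labels.sum().clamp(min=1)
    p_fa = (s * (1 - labels)).sum() / (1 - labels).sum().clamp(min=1)
    return c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa

# Usage on a toy batch of trial embedding pairs with 0/1 target labels.
model = NeuralPLDA(dim=512)
xe, xt = torch.randn(8, 512), torch.randn(8, 512)
labels = torch.randint(0, 2, (8,)).float()
loss = soft_detection_cost(model(xe, xt), labels)
loss.backward()
```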
Factorization of Discriminatively Trained i-vector Extractor for Speaker Recognition
In this work, we continue our research on the i-vector extractor for speaker verification (SV) and optimize its architecture for fast and effective discriminative training. We were motivated by the computational and memory requirements caused by the large number of parameters of the original generative i-vector model. Our aim is to preserve the power of the original generative model while focusing the model on the extraction of speaker-related information. We show that it is possible to represent a standard generative i-vector extractor by a model with significantly fewer parameters and obtain similar performance on SV tasks. We can further refine this compact model by discriminative training and obtain i-vectors that lead to better performance on various SV benchmarks representing different acoustic domains. Comment: Submitted to Interspeech 2019, Graz, Austria. arXiv admin note: substantial text overlap with arXiv:1810.1318
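To make the memory argument concrete: in a standard generative i-vector extractor every GMM component c carries its own total-variability block T_c, whereas a compact model can share most of that structure across components. The sketch below counts parameters for one plausible low-rank sharing scheme (T_c = A_c @ B with a shared basis B); the specific factorization, sizes and names are illustrative assumptions, not the paper's exact architecture.

```python
# Parameter counts for a standard i-vector extractor vs. a factorized
# variant that shares a common basis across GMM components.
# The factorization T_c = A_c @ B is an illustrative assumption.

C = 2048   # GMM components
F = 60     # feature dimensionality per component
D = 400    # i-vector dimensionality
R = 100    # rank of the shared basis (assumed)

full_params = C * F * D                  # one F-by-D block T_c per component
factored_params = C * F * R + R * D      # per-component A_c plus shared B

print(f"full:      {full_params:,}")       # 49,152,000
print(f"factored:  {factored_params:,}")   # 12,328,000
print(f"reduction: {full_params / factored_params:.1f}x")
```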
Emotion Recognition from Acted and Spontaneous Speech
This doctoral thesis deals with the recognition of a speaker's emotional state from the speech signal. The thesis is divided into two main parts: the first describes the proposed methods for recognizing emotional states from databases of acted speech, presenting recognition results on two databases in different languages. The main contributions of this part are a detailed analysis of a large set of acoustic features extracted from the speech signal, new classification architectures such as "emotion coupling", and a new method for mapping discrete emotional states into a two-dimensional space. The second part deals with the recognition of emotional states from a database of spontaneous speech obtained from recordings of conversations in real call centers. The knowledge gained from the analysis and design of methods for acted speech was exploited to design a new system for recognizing seven spontaneous emotional states. The core of the proposed approach is a complex classification architecture based on the fusion of different systems. The thesis further examines the influence of the speaker's emotional state on gender recognition performance and proposes a system for the automatic detection of successful calls in call centers based on an analysis of the dialogue parameters between the participants in the telephone conversations.
Deep Speaker Feature Learning for Text-independent Speaker Verification
Recently, deep neural networks (DNNs) have been used to learn speaker features. However, the quality of the learned features is not sufficiently good, so a complex back-end model, either neural or probabilistic, has to be used to address the residual uncertainty when they are applied to speaker verification, just as with raw features. This paper presents a convolutional time-delay deep neural network structure (CT-DNN) for speaker feature learning. Our experimental results on the Fisher database demonstrated that this CT-DNN can produce high-quality speaker features: even with a single feature (0.3 seconds including the context), the EER can be as low as 7.68%. This effectively confirmed that the speaker trait is largely a deterministic short-time property rather than a long-time distributional pattern, and therefore can be extracted from just dozens of frames. Comment: deep neural networks, speaker verification, speaker feature learning
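To give a sense of the kind of structure described, a convolutional front end followed by time-delay layers producing a per-frame speaker feature, here is a minimal PyTorch sketch. The layer sizes, the use of dilated 1-D convolutions as the time-delay layers, and the speaker-classification training head are assumptions, not the paper's exact topology.

```python
import torch
import torch.nn as nn

class CTDNNSketch(nn.Module):
    """Convolutional + time-delay feature extractor (illustrative).

    Maps a (batch, feat_dim, frames) filterbank sequence to frame-level
    speaker features; trained with a speaker classifier on top, the
    penultimate activations serve as the learned speaker features.
    """
    def __init__(self, feat_dim=40, emb_dim=400, n_speakers=5000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Dilated convolutions stand in for time-delay (TDNN) layers.
        self.tdnn = nn.Sequential(
            nn.Conv1d(128, 256, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
            nn.Conv1d(256, emb_dim, kernel_size=3, dilation=3, padding=3), nn.ReLU(),
        )
        self.classifier = nn.Linear(emb_dim, n_speakers)  # training head only

    def forward(self, x):
        return self.tdnn(self.conv(x))  # (batch, emb_dim, frames)

feats = torch.randn(2, 40, 30)          # ~0.3 s of 10 ms frames
print(CTDNNSketch()(feats).shape)       # torch.Size([2, 400, 30])
```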
Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification
There are a number of studies on the extraction of bottleneck (BN) features from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases and triphone states for improving the performance of text-dependent speaker verification (TD-SV). However, only moderate success has been achieved. A recent study [1] presented a time-contrastive learning (TCL) concept to explore the non-stationarity of brain signals for the classification of brain states. Speech signals have a similar non-stationarity property, and TCL has the further advantage of requiring no labeled data. We therefore present a TCL-based BN feature extraction method. The method uniformly partitions each speech utterance in a training dataset into a predefined number of multi-frame segments. Each segment in an utterance corresponds to one class, and class labels are shared across utterances. DNNs are then trained to discriminate all speech frames among the classes to exploit the temporal structure of speech. In addition, we propose a segment-based unsupervised clustering algorithm to re-assign class labels to the segments. TD-SV experiments were conducted on the RedDots challenge database. The TCL-DNNs were trained using speech data of fixed pass-phrases that were excluded from the TD-SV evaluation set, so the learned features can be considered phrase-independent. We compare the performance of the proposed TCL bottleneck (BN) feature with those of short-time cepstral features and BN features extracted from DNNs discriminating speakers, pass-phrases, speaker+pass-phrase, as well as monophones whose labels and boundaries are generated by three different automatic speech recognition (ASR) systems. Experimental results show that the proposed TCL-BN outperforms cepstral features and speaker+pass-phrase discriminant BN features, and its
performance is on par with those of the ASR-derived BN features. Moreover, .... Comment: Copyright (c) 2019 IEEE.
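The label-generation step at the heart of TCL, as described above, is simple enough to state in a few lines. This NumPy sketch uniformly partitions an utterance into a fixed number of multi-frame segments and uses the segment index as the class label shared across utterances; the function name and the default segment count are assumptions.

```python
import numpy as np

def tcl_labels(n_frames: int, n_classes: int = 10) -> np.ndarray:
    """Time-contrastive learning labels for one utterance.

    The utterance is uniformly partitioned into `n_classes` multi-frame
    segments; every frame inherits its segment index as its class label.
    No human labels are needed, and the same label set is shared across
    all utterances.
    """
    seg_len = max(1, n_frames // n_classes)
    labels = np.arange(n_frames) // seg_len
    return np.minimum(labels, n_classes - 1)  # fold remainder into last class

# Frames from every utterance are pooled and a DNN is trained to classify
# frames into the n_classes segment labels; the bottleneck activations of
# that DNN are then taken as the TCL-BN features.
print(tcl_labels(23, n_classes=5))
```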