Identify Speakers in Cocktail Parties with End-to-End Attention
In scenarios where multiple speakers talk at the same time, it is important
to be able to identify the talkers accurately. This paper presents an
end-to-end system that integrates speech source extraction and speaker
identification, and proposes a new way to jointly optimize these two parts by
max-pooling the speaker predictions along the channel dimension. Residual
attention permits us to learn spectrogram masks that are optimized for the
purpose of speaker identification, while residual forward connections permit
dilated convolution with a sufficiently large context window to guarantee
correct streaming across syllable boundaries. End-to-end training results in a
system that recognizes one speaker in a two-speaker broadcast speech mixture
with 99.9% accuracy and both speakers with 93.9% accuracy, and that recognizes
all speakers in three-speaker scenarios with 81.2% accuracy.
Comment: Accepted by Interspeech 2020 for presentation;
https://github.com/JunzheJosephZhu/Identify-Speakers-in-Cocktail-Parties-with-E2E-Attentio
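As a rough illustration of max-pooling speaker predictions along the channel dimension, the sketch below assumes each extracted channel yields one score per enrolled speaker and that the mixture-level score for a speaker is the maximum over channels. The function names, the toy scores, and the top-k readout are hypothetical and are not taken from the paper or its repository.

```python
def max_pool_over_channels(channel_scores):
    """Pool per-channel speaker scores into one score per speaker.

    channel_scores[c][s] is the (assumed) score for speaker s
    predicted from extracted channel c.
    """
    n_speakers = len(channel_scores[0])
    return [max(ch[s] for ch in channel_scores) for s in range(n_speakers)]


def top_k_speakers(pooled, k):
    """Return the indices of the k highest pooled scores, in ascending order."""
    ranked = sorted(range(len(pooled)), key=lambda s: pooled[s], reverse=True)
    return sorted(ranked[:k])


# Toy example: 2 extracted channels, 4 enrolled speakers (made-up numbers).
channel_scores = [
    [0.05, 0.90, 0.10, 0.02],  # channel 0 mostly matches speaker 1
    [0.70, 0.08, 0.15, 0.03],  # channel 1 mostly matches speaker 0
]
pooled = max_pool_over_channels(channel_scores)
predicted = top_k_speakers(pooled, k=2)
print(pooled, predicted)
```

Because the maximum is taken per speaker, the pooled vector does not depend on which channel a given speaker was assigned to, which is one plausible reason to pool this way when training the two stages jointly.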
Latent space representation for multi-target speaker detection and identification with a sparse dataset using Triplet neural networks
We present an approach to tackle the speaker recognition problem using
Triplet Neural Networks. Currently, the i-vector representation with
probabilistic linear discriminant analysis (PLDA) is the most commonly used
technique to solve this problem, due to high classification accuracy with a
relatively short computation time. In this paper, we explore a neural network
approach, namely Triplet Neural Networks (TNNs), to build a latent space for
different classifiers to solve the Multi-Target Speaker Detection and
Identification Challenge Evaluation 2018 (MCE 2018) dataset. This training set
contains i-vectors from 3,631 speakers, with only 3 samples for each speaker,
thus making speaker recognition a challenging task. When using the train and
development set for training both the TNN and baseline model (i.e., similarity
evaluation directly on the i-vector representation), our proposed model
outperforms the baseline by 23%. When the training data is reduced to the
train set alone, our method results in 309 confusions for the Multi-target
speaker identification task, which is 46% better than the baseline model. These
results show that the representational power of TNNs is especially evident when
training on small datasets with few instances available per class.
Comment: Accepted for ASRU 201
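For readers unfamiliar with triplet training, the following is a minimal sketch of the standard triplet margin loss on embedding vectors. The abstract does not state the distance metric or margin used, so the Euclidean distance, the margin of 1.0, and the toy vectors are assumptions made purely for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (margin value is an assumption).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D embeddings standing in for latent-space points.
anchor = [0.0, 0.0]
near = [0.0, 1.0]   # distance 1 from the anchor
far = [3.0, 4.0]    # distance 5 from the anchor

violating = triplet_loss(anchor, positive=far, negative=near)   # same-speaker sample is too far
satisfied = triplet_loss(anchor, positive=near, negative=far)   # same-speaker sample is close
print(violating, satisfied)
```

Minimizing this quantity over many (anchor, positive, negative) triplets pulls samples of the same speaker together and pushes different speakers apart, which is the general idea behind building a latent space that downstream classifiers can use.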