2 research outputs found
Generative x-vectors for text-independent speaker verification
Speaker verification (SV) systems that use deep neural network embeddings,
the so-called x-vector systems, are becoming popular because they outperform
i-vector systems. Fusing the two systems improves performance, since the
discriminatively trained x-vectors and the generative i-vectors capture
distinct speaker characteristics. In this paper, we propose a novel method,
called the generative x-vector, that incorporates the complementary
information of i-vectors and x-vectors. The
generative x-vector utilizes a transformation model learned from the i-vector
and x-vector representations of the background data. Canonical correlation
analysis is applied to derive this transformation model, which is later used to
transform the standard x-vectors of the enrollment and test segments to the
corresponding generative x-vectors. The SV experiments performed on the NIST
SRE 2010 dataset demonstrate that the system using generative x-vectors
provides considerably better performance than the baseline i-vector and
x-vector systems. Furthermore, the generative x-vectors outperform the fusion
of i-vector and x-vector systems for long-duration utterances, while yielding
comparable results for short-duration utterances.
Comment: Accepted for publication at SLT 201
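The transformation step described in this abstract can be illustrated with a minimal sketch. It assumes paired x-vectors and i-vectors from background data are available as NumPy arrays, fits canonical correlation analysis by whitening and SVD, and applies the x-vector-side projection to new x-vectors. The function names `fit_cca` and `to_generative`, the regularization value, and the output dimension are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def _inv_sqrt(S):
    # Inverse matrix square root of a symmetric positive-definite matrix.
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def fit_cca(X, Y, dim, reg=1e-6):
    """Fit CCA between X (n, dx) and Y (n, dy).

    Returns the mean of X, the (dx, dim) projection for X, and the
    top `dim` canonical correlations. `reg` is a small ridge term
    (an assumption here) that keeps the covariances invertible.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = _inv_sqrt(Sxx), _inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations, sorted in descending order.
    U, s, _ = np.linalg.svd(Kx @ Sxy @ Ky)
    return mx, Kx @ U[:, :dim], s[:dim]

def to_generative(x, mean_x, proj):
    # Project standard x-vectors (rows of x) into the CCA space.
    return (x - mean_x) @ proj
```

In this sketch, `X` would hold the background x-vectors and `Y` the corresponding i-vectors; enrollment and test x-vectors are then passed through `to_generative` before scoring. Whether the paper uses exactly this projection side or dimensionality is not stated in the abstract.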
DNN Speaker Tracking with Embeddings
In multi-speaker applications, it is common to have pre-computed models from
enrolled speakers. Using these models to identify the instances in which these
speakers intervene in a recording is the task of speaker tracking. In this
paper, we propose a novel embedding-based speaker tracking method.
Specifically, our design is based on a convolutional neural network that mimics
a typical speaker verification PLDA (probabilistic linear discriminant
analysis) classifier and finds the regions uttered by the target speakers in an
online fashion. The system was studied from two different perspectives:
diarization and tracking; results on both show a significant improvement over
the PLDA baseline under the same experimental conditions. Two standard public
datasets, CALLHOME and DIHARD II single channel, were modified to create
two-speaker subsets with overlapping and non-overlapping regions. We evaluate
the robustness of our supervised approach with models generated from different
segment lengths. A relative improvement of 17% in diarization error rate (DER) for DIHARD II single
channel shows promising performance. Furthermore, to make the baseline system
similar to speaker tracking, non-target speakers were added to the recordings.
Even in these adverse conditions, our approach is robust enough to outperform
the PLDA baseline.
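To make the tracking task concrete, the following is a minimal sketch of online speaker tracking against enrolled models. It is not the method proposed in the abstract: plain cosine scoring with a fixed threshold stands in for the convolutional network and the PLDA classifier, and the name `track_speakers`, the threshold value, and the toy embeddings are hypothetical.

```python
import numpy as np

def track_speakers(segment_embeddings, enrolled, threshold=0.5):
    """Yield, per incoming segment, the enrolled speaker name or None.

    segment_embeddings: iterable of 1-D arrays, processed one at a time
    enrolled: dict mapping speaker name -> 1-D model embedding
    threshold: minimum cosine score to accept a target (assumed value)
    """
    names = list(enrolled)
    models = np.stack([np.asarray(enrolled[n], dtype=float) for n in names])
    models /= np.linalg.norm(models, axis=1, keepdims=True)
    for emb in segment_embeddings:
        emb = np.asarray(emb, dtype=float)
        emb = emb / np.linalg.norm(emb)
        scores = models @ emb  # cosine similarity to each enrolled model
        best = int(np.argmax(scores))
        # Segments below the threshold are treated as non-target speech.
        yield names[best] if scores[best] >= threshold else None

enrolled = {"spk_a": [1.0, 0.0, 0.0], "spk_b": [0.0, 1.0, 0.0]}
segments = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
labels = list(track_speakers(segments, enrolled))
```

Because the function is a generator, each segment is labeled as soon as it arrives, which mirrors the online setting described above. The third toy segment matches neither enrolled model and is labeled as a non-target.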