112 research outputs found
NPLDA: A Deep Neural PLDA Model for Speaker Verification
The state-of-the-art approach for speaker verification consists of a neural
network based embedding extractor along with a backend generative model such as
the Probabilistic Linear Discriminant Analysis (PLDA). In this work, we propose
a neural network approach for backend modeling in speaker recognition. The
likelihood ratio score of the generative PLDA model is posed as a
discriminative similarity function and the learnable parameters of the score
function are optimized using a verification cost. The proposed model, termed as
neural PLDA (NPLDA), is initialized using the generative PLDA model parameters.
The loss function for the NPLDA model is an approximation of the minimum
detection cost function (DCF). The speaker recognition experiments using the
NPLDA model are performed on the speaker verification task in the VOiCES
datasets as well as the SITW challenge dataset. In these experiments, the NPLDA
model optimized using the proposed loss function improves significantly over
the state-of-the-art PLDA-based speaker verification system.
Comment: Published in Odyssey 2020, the Speaker and Language Recognition
Workshop (VOiCES Special Session). Link to GitHub Implementation:
https://github.com/iiscleap/NeuralPlda. arXiv admin note: substantial text
overlap with arXiv:2001.0703
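
The idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation (see the GitHub link for that). It assumes the PLDA log-likelihood ratio takes the usual quadratic form in the enrollment and test embeddings, with matrices `P` and `Q` as the learnable parameters, and it assumes the detection cost is smoothed by replacing hard threshold counts with a sigmoid. The names `plda_score`, `soft_detection_cost`, `alpha`, and `beta` are hypothetical.

```python
import numpy as np

def plda_score(enroll, test, P, Q):
    """Pairwise quadratic score e'Qe + t'Qt + 2 e'Pt for each trial (constant term dropped).

    enroll, test: arrays of shape (num_trials, dim); P, Q: arrays of shape (dim, dim).
    """
    quad = (np.einsum("nd,de,ne->n", enroll, Q, enroll)
            + np.einsum("nd,de,ne->n", test, Q, test))
    cross = 2.0 * np.einsum("nd,de,ne->n", enroll, P, test)
    return quad + cross

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_detection_cost(scores, labels, threshold, alpha=10.0, beta=1.0):
    """Smooth stand-in for the detection cost (assumed form, not taken from the paper).

    labels: 1.0 for target trials, 0.0 for non-target trials.
    alpha controls how closely the sigmoid approximates a hard decision;
    beta weights false alarms relative to misses.
    """
    accept = sigmoid(alpha * (scores - threshold))
    p_miss = np.sum(labels * (1.0 - accept)) / np.sum(labels)
    p_fa = np.sum((1.0 - labels) * accept) / np.sum(1.0 - labels)
    return p_miss + beta * p_fa
```

Because both functions are differentiable in `P`, `Q`, and `threshold`, the same expressions could be written in an autodiff framework and minimized by gradient descent, starting from `P` and `Q` derived from a trained generative PLDA model.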
Deep Self-Supervised Hierarchical Clustering for Speaker Diarization
The state-of-the-art speaker diarization systems use agglomerative
hierarchical clustering (AHC) which performs the clustering of previously
learned neural embeddings. While the clustering approach attempts to identify
speaker clusters, the AHC algorithm does not involve any further learning. In
this paper, we propose a novel algorithm for hierarchical clustering which
combines the speaker clustering along with a representation learning framework.
The proposed approach is based on principles of self-supervised learning where
the self-supervision is derived from the clustering algorithm. The
representation learning network is trained with a regularized triplet loss
using the clustering solution at the current step while the clustering
algorithm uses the deep embeddings from the representation learning step. By
combining the self-supervision based representation learning along with the
clustering algorithm, we show that the proposed algorithm improves
significantly (29% relative improvement) over the AHC algorithm with cosine
similarity for a speaker diarization task on the CALLHOME dataset. In addition, the
proposed approach also improves over the state-of-the-art system with PLDA
affinity matrix, with a 10% relative improvement in DER.
Comment: 5 pages, Accepted in Interspeech 202
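
For orientation, the two building blocks named in the abstract above can be sketched as follows. This is a hedged illustration only: it shows plain average-linkage AHC on cosine similarity (the baseline the abstract compares against) and a standard triplet loss on cosine distance. The regularization term and the alternation between clustering and representation learning described in the abstract are not reproduced here. The function names and the `margin` value are hypothetical.

```python
import numpy as np

def cosine_ahc(embeddings, num_clusters):
    """Average-linkage agglomerative clustering on cosine similarity.

    embeddings: array of shape (num_segments, dim). Returns one integer label per segment.
    """
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    clusters = [[i] for i in range(len(x))]
    while len(clusters) > num_clusters:
        best, pair = -np.inf, None
        # Find the pair of clusters with the highest mean cross-similarity.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = sim[np.ix_(clusters[a], clusters[b])].mean()
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(x), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

def cosine_distance(u, v):
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: pull same-cluster pairs together, push others apart."""
    return max(0.0, cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin)
```

In a self-supervised setup of the kind the abstract describes, the cluster labels from one clustering step would supply the anchor, positive, and negative assignments for the next round of embedding training.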
Robust Raw Waveform Speech Recognition Using Relevance Weighted Representations
Speech recognition in noisy and channel distorted scenarios is often
challenging as the current acoustic modeling schemes are not adaptive to the
changes in the signal distribution in the presence of noise. In this work, we
develop a novel acoustic modeling framework for noise robust speech recognition
based on a relevance weighting mechanism. The relevance weighting is achieved
using a sub-network approach that performs feature selection. A relevance
sub-network is applied on the output of the first layer of a convolutional network
model operating on raw speech signals while a second relevance sub-network is
applied on the second convolutional layer output. The relevance weights for the
first layer correspond to an acoustic filterbank selection while the relevance
weights in the second layer perform modulation filter selection. The model is
trained for a speech recognition task on noisy and reverberant speech. The
speech recognition experiments on multiple datasets (Aurora-4, CHiME-3, VOiCES)
reveal that the incorporation of relevance weighting in the neural network
architecture improves the speech recognition word error rates significantly
(average relative improvements of 10% over the baseline systems).
Comment: arXiv admin note: text overlap with arXiv:2001.0706
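
A relevance weighting step of the kind described in the abstract above might look like the sketch below. It is an assumption-laden illustration, not the paper's architecture: a small shared network maps each channel's temporal trajectory to a scalar, a softmax across channels turns those scalars into weights, and the weights rescale the layer output. The two-layer `tanh` network, the softmax normalization, and all names and shapes are hypothetical.

```python
import numpy as np

def relevance_weights(features, w1, b1, w2, b2):
    """Compute one relevance weight per channel.

    features: (num_frames, num_channels) output of a convolutional layer.
    w1: (num_frames, hidden), b1: (hidden,), w2: (hidden,), b2: scalar.
    The same small network is applied to every channel's trajectory.
    """
    hidden = np.tanh(features.T @ w1 + b1)   # (num_channels, hidden)
    logits = hidden @ w2 + b2                # (num_channels,)
    logits = logits - logits.max()           # stabilize the softmax
    exp = np.exp(logits)
    return exp / exp.sum()

def apply_relevance(features, weights):
    """Scale each channel by its relevance weight (soft feature selection)."""
    return features * weights[np.newaxis, :]
```

Applied to the first layer, such weights would act as a soft selection over acoustic filterbank channels; a second sub-network of the same shape on the next layer would play the corresponding role for modulation filters.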
Speaker diarization assisted ASR for multi-speaker conversations
In this paper, we propose a novel approach for the transcription of speech
conversations with natural speaker overlap, from single channel recordings. We
propose a combination of a speaker diarization system and a hybrid automatic
speech recognition (ASR) system with speaker activity assisted acoustic model
(AM). An end-to-end neural network system is used for speaker diarization. Two
architectures, (i) input conditioned AM, and (ii) gated features AM, are
explored to incorporate the speaker activity information. The models output
speaker specific senones. The experiments on Switchboard telephone
conversations show the advantage of incorporating speaker activity information
in the ASR system for recordings with overlapped speech. In particular, an
absolute improvement of in word error rate (WER) is seen for the
proposed approach on natural conversation speech with automatic diarization.
Comment: Manuscript submitted to INTERSPEECH 202