Fully Supervised Speaker Diarization
In this paper, we propose a fully supervised speaker diarization approach,
named unbounded interleaved-state recurrent neural networks (UIS-RNN). Given
extracted speaker-discriminative embeddings (a.k.a. d-vectors) from input
utterances, each individual speaker is modeled by a parameter-sharing RNN,
while the RNN states for different speakers interleave in the time domain. This
RNN is naturally integrated with a distance-dependent Chinese restaurant
process (ddCRP) to accommodate an unknown number of speakers. Our system is
fully supervised and is able to learn from examples where time-stamped speaker
labels are annotated. We achieved a 7.6% diarization error rate on NIST SRE
2000 CALLHOME, which is better than the state-of-the-art method using spectral
clustering. Moreover, our method decodes in an online fashion while most
state-of-the-art systems rely on offline clustering.
Comment: Accepted by ICASSP 201
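The speaker-assignment prior behind this interleaving can be sketched with a plain Chinese restaurant process; this is a simplification of the distance-dependent variant (ddCRP) used in UIS-RNN, and the function name and `alpha` concentration parameter below are illustrative, not the authors' code.

```python
import random

def crp_assignments(num_segments, alpha, seed=0):
    """Sample speaker labels from a (plain) Chinese restaurant process.

    Simplified stand-in for the ddCRP in UIS-RNN: each new segment joins
    an existing speaker with probability proportional to that speaker's
    segment count, or opens a new speaker with probability ~ alpha.
    """
    rng = random.Random(seed)
    counts = []   # counts[k] = number of segments assigned to speaker k
    labels = []
    for _ in range(num_segments):
        total = sum(counts) + alpha
        r = rng.uniform(0, total)
        acc = 0.0
        new_speaker = True
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[k] += 1
                labels.append(k)
                new_speaker = False
                break
        if new_speaker:          # r landed in the alpha-sized slice
            counts.append(1)
            labels.append(len(counts) - 1)
    return labels
```

Because new speakers can always appear with probability proportional to `alpha`, the prior accommodates an unknown number of speakers, which is the property the paper exploits.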
Speaker Diarization using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings
In this paper we propose a new method of speaker diarization that employs a
deep learning architecture to learn speaker embeddings. In contrast to the
traditional approaches that build their speaker embeddings using manually
hand-crafted spectral features, we propose to train for this purpose a
recurrent convolutional neural network applied directly on magnitude
spectrograms. To compare our approach with the state of the art, we collect and
publicly release an additional dataset of over 6 hours of fully annotated
broadcast material. The results of our evaluation on the new dataset and three
other benchmark datasets show that our proposed method significantly
outperforms the competitors, reducing the diarization error rate by a large
margin of over 30% relative to the baseline.
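As a minimal sketch of the network's input representation only (not of the recurrent convolutional network itself), a magnitude spectrogram can be computed from a raw waveform with a windowed DFT; the frame size, hop, and naive DFT here are illustrative, and a real pipeline would use an FFT library.

```python
import cmath
import math

def magnitude_spectrogram(signal, frame_len=64, hop=32):
    """Hann-windowed magnitude spectrogram via a naive DFT.

    Sketches only the input representation described in the paper.
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Hann window to reduce spectral leakage
        windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1)))
                    for n, x in enumerate(frame)]
        mags = []
        for k in range(frame_len // 2 + 1):   # non-negative frequency bins
            s = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n, x in enumerate(windowed))
            mags.append(abs(s))
        frames.append(mags)
    return frames
```

Each output row is one frame of `frame_len // 2 + 1` magnitude bins; stacked over time, these rows form the time-frequency image fed to the network.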
Neural Predictive Coding using Convolutional Neural Networks towards Unsupervised Learning of Speaker Characteristics
Learning speaker-specific features is vital in many applications like speaker
recognition, diarization, and speech recognition. This paper presents a novel
approach, which we term Neural Predictive Coding (NPC), to learn speaker-specific
characteristics in a completely unsupervised manner from large amounts of
unlabeled training data that even contain many non-speech events and
multi-speaker audio streams. The NPC framework exploits the proposed short-term
active-speaker stationarity hypothesis, which assumes that two temporally close
short speech segments belong to the same speaker; a common representation that
encodes the commonalities of both segments should therefore capture the vocal
characteristics of that speaker. We train a convolutional deep siamese
network to produce "speaker embeddings" by learning to separate `same' vs.
`different' speaker pairs generated from unlabeled audio streams. Two sets of
experiments are conducted in different scenarios to evaluate
the strength of NPC embeddings and compare with state-of-the-art in-domain
supervised methods. First, two speaker identification experiments with
different context lengths are performed in a scenario with comparatively
limited within-speaker channel variability. NPC embeddings are found to perform
best in the short-duration experiment and to provide complementary information
to i-vectors in the full-utterance experiments. Second, a large-scale
speaker verification task having a wide range of within-speaker channel
variability is adopted as an upper-bound experiment where comparisons are drawn
with in-domain supervised methods.
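The pair-generation step implied by the stationarity hypothesis can be sketched as follows; the function and its `near` parameter are hypothetical names, and note that cross-stream `different' labels are only approximately correct in practice (the same speaker could appear in two streams).

```python
import random

def make_pairs(streams, near=1, seed=0):
    """Generate pseudo-labeled pairs under the short-term active-speaker
    stationarity hypothesis: two segments at most `near` positions apart
    in the same stream are labeled same-speaker (1); segments drawn from
    two different streams are labeled different-speaker (0)."""
    rng = random.Random(seed)
    # positives: temporally close segments within one stream
    pairs = [(s[i], s[i + near], 1)
             for s in streams for i in range(len(s) - near)]
    # negatives: balance with as many cross-stream pairs
    for _ in range(len(pairs)):
        a, b = rng.sample(range(len(streams)), 2)
        pairs.append((rng.choice(streams[a]), rng.choice(streams[b]), 0))
    return pairs
```

These (segment, segment, label) triples are exactly the kind of supervision signal a siamese network can be trained on without any human labels.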
Towards an Unsupervised Entrainment Distance in Conversational Speech using Deep Neural Networks
Entrainment is a known adaptation mechanism by which interaction participants
adapt or synchronize their acoustic characteristics.
Understanding how interlocutors tend to adapt to each other's speaking style
through entrainment involves measuring a range of acoustic features and
comparing those via multiple signal comparison methods. In this work, we
present a turn-level distance measure obtained in an unsupervised manner using
a Deep Neural Network (DNN) model, which we call Neural Entrainment Distance
(NED). This metric establishes a framework that learns an embedding from the
population-wide entrainment in an unlabeled training corpus. We use the
framework for a set of acoustic features and validate the measure
experimentally by showing its efficacy in distinguishing real conversations
from fake ones created by randomly shuffling speaker turns. Moreover, we show
real-world evidence of the validity of the proposed measure. We find that a
high value of NED is associated with high ratings of emotional bond in suicide
assessment interviews, which is consistent with prior studies.
Comment: submitted to Interspeech 201
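The shuffling-based validation protocol can be sketched as below; the absolute-difference `dist` is only a stand-in for the learned NED embedding distance, and the function names are illustrative.

```python
import random

def mean_turn_distance(turn_feats, dist=lambda a, b: abs(a - b)):
    """Average distance between each turn and the reply that follows it;
    `dist` is a stand-in for the learned NED embedding distance."""
    return (sum(dist(a, b) for a, b in zip(turn_feats, turn_feats[1:]))
            / (len(turn_feats) - 1))

def shuffled_baseline(turn_feats, trials=100, seed=0):
    """The paper's sanity check: fake conversations built by randomly
    shuffling speaker turns should score worse (larger average distance)
    than the real conversation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        fake = turn_feats[:]
        rng.shuffle(fake)
        total += mean_turn_distance(fake)
    return total / trials
```

A real (entrained) conversation should yield a smaller turn-level distance than its shuffled counterparts, which is the discriminative property the paper verifies.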
Speaker Diarization with LSTM
For many years, i-vector based audio embedding techniques were the dominant
approach for speaker verification and speaker diarization applications.
However, mirroring the rise of deep learning in various domains, neural network
based audio embeddings, also known as d-vectors, have consistently demonstrated
superior speaker verification performance. In this paper, we build on the
success of d-vector based speaker verification systems to develop a new
d-vector based approach to speaker diarization. Specifically, we combine
LSTM-based d-vector audio embeddings with recent work in non-parametric
clustering to obtain a state-of-the-art speaker diarization system. Our system
is evaluated on three standard public datasets, suggesting that d-vector based
diarization systems offer significant advantages over traditional i-vector
based systems. We achieved a 12.0% diarization error rate on NIST SRE 2000
CALLHOME, while our model is trained with out-of-domain data from voice search
logs.
Comment: Published at ICASSP 201
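The clustering stage of such a pipeline can be sketched with a greedy nearest-centroid scheme; this is only a stand-in for the non-parametric spectral clustering the paper actually uses, and the `threshold` parameter is illustrative.

```python
def cluster_dvectors(dvectors, threshold=0.5):
    """Greedy sequential clustering of windowed d-vectors: assign each
    embedding to the nearest existing centroid if it is close enough,
    otherwise open a new cluster (speaker). Stand-in for the spectral
    clustering used in the paper."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    centroids, counts, labels = [], [], []
    for v in dvectors:
        if centroids:
            k = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
            if dist(v, centroids[k]) < threshold:
                counts[k] += 1
                # running-mean centroid update
                centroids[k] = [c + (x - c) / counts[k]
                                for c, x in zip(centroids[k], v)]
                labels.append(k)
                continue
        centroids.append(list(v))
        counts.append(1)
        labels.append(len(centroids) - 1)
    return labels
```

Mapping the resulting cluster labels back to the time windows that produced each d-vector yields the diarization output.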
Cross-modal Supervision for Learning Active Speaker Detection in Video
In this paper, we show how to use audio to supervise the learning of active
speaker detection in video. Voice Activity Detection (VAD) guides the learning
of the vision-based classifier in a weakly supervised manner. The classifier
uses spatio-temporal features to encode upper-body motion: facial expressions
and gesticulations associated with speaking. We further improve a generic model
for active speaker detection by learning person-specific models. Finally, we
demonstrate the online adaptation of generic models learnt on one dataset to
previously unseen people in a new dataset, again using audio (VAD) for weak
supervision. The use of temporal continuity overcomes the lack of clean
training data. We are the first to present an active speaker detection system
that learns on one audio-visual dataset and automatically adapts to speakers in
a new dataset. This work can be seen as an example of how the availability of
multi-modal data allows us to learn a model without the need for supervision,
by transferring knowledge from one modality to another.
Comment: 16 page
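The core cross-modal labeling idea can be illustrated with a toy frame-level rule; this ignores the spatio-temporal features and temporal-continuity handling of the actual system, and assumes a single visible candidate speaker per frame.

```python
def weak_labels(vad, face_visible):
    """Transfer audio VAD decisions to the visual domain: a frame where
    the tracked face is visible gets a positive 'speaking' label when the
    VAD fires and a negative one when it does not; frames without the
    face yield no label (None)."""
    return [None if not visible else (1 if speech else 0)
            for speech, visible in zip(vad, face_visible)]
```

These weak labels then supervise the vision-based classifier without any manual annotation of the video.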
Semi-Supervised Training with Pseudo-Labeling for End-to-End Neural Diarization
In this paper, we present a semi-supervised training technique using
pseudo-labeling for end-to-end neural diarization (EEND). The EEND system has
shown promising performance compared with traditional clustering-based methods,
especially in the case of overlapping speech. However, to get a well-tuned
model, EEND requires labeled data for all the joint speech activities of every
speaker at each time frame in a recording. In this paper, we explore a
pseudo-labeling approach that employs unlabeled data. First, we propose an
iterative pseudo-label method for EEND, which trains the model using unlabeled
data of a target condition. Then, we also propose a committee-based training
method to improve the performance of EEND. To evaluate our proposed method, we
conduct the experiments of model adaptation using labeled and unlabeled data.
Experimental results on the CALLHOME dataset show that our proposed
pseudo-labeling method achieved a 37.4% relative diarization error rate reduction
compared to a seed model. Moreover, we analyzed the results of semi-supervised
adaptation with pseudo-labeling. We also show the effectiveness of our approach
on the third DIHARD dataset.
Comment: Accepted for Interspeech 202
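The iterative pseudo-labeling loop can be sketched generically; `predict` and `train` are placeholders for the EEND inference and training routines, not the paper's actual API.

```python
def iterative_pseudo_labeling(predict, train, seed_model,
                              labeled, unlabeled, iterations=3):
    """Iterative pseudo-labeling: the current model labels the unlabeled
    target-condition data, then a new model is trained on the union of
    labeled and pseudo-labeled examples; repeat."""
    model = seed_model
    for _ in range(iterations):
        # pseudo-label the unlabeled target-condition data
        pseudo = [(x, predict(model, x)) for x in unlabeled]
        # retrain on labeled + pseudo-labeled data
        model = train(labeled + pseudo)
    return model
```

The committee-based variant in the paper additionally combines the outputs of several models when producing the pseudo-labels, rather than trusting a single seed model.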
Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers
In this paper, we present a system that associates faces with voices in a
video by fusing information from the audio and visual signals. The thesis
underlying our work is that an extremely simple approach to generating (weak)
speech clusters can be combined with visual signals to effectively associate
faces and voices by aggregating statistics across a video. This approach does
not need any training data specific to this task and leverages the natural
coherence of information in the audio and visual streams. It is particularly
applicable to tracking speakers in videos on the web where a priori information
about the environment (e.g., number of speakers, spatial signals for
beamforming) is not available. We performed experiments on a real-world dataset
using this analysis framework to determine the speaker in a video. Given a
ground truth labeling determined by human rater consensus, our approach had
~71% accuracy.
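The statistics-aggregation idea can be sketched as a co-occurrence count between weak speech clusters and visible face tracks; the data layout and function name below are illustrative, not the paper's implementation.

```python
from collections import defaultdict

def associate_faces_voices(frames):
    """Aggregate co-occurrence statistics across a whole video: each
    frame contributes (visible_face_ids, active_voice_cluster); a voice
    cluster is associated with the face it co-occurs with most often.
    Needs no task-specific training data."""
    counts = defaultdict(lambda: defaultdict(int))
    for faces, voice in frames:
        if voice is None:          # no speech cluster active in this frame
            continue
        for f in faces:
            counts[voice][f] += 1
    return {voice: max(faces, key=faces.get) for voice, faces in counts.items()}
```

Because the association emerges from counts accumulated over the whole video, occasional errors in the weak speech clusters or face tracks are averaged out.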
Robust End-to-end Speaker Diarization with Generic Neural Clustering
End-to-end speaker diarization approaches have shown exceptional performance
over traditional modular approaches. To further improve the performance of
end-to-end speaker diarization on real speech recordings, recent works have
proposed integrating unsupervised clustering algorithms with end-to-end neural
diarization models. However, these methods have a number of
drawbacks: 1) The unsupervised clustering algorithms cannot leverage the
supervision from the available datasets; 2) The K-means-based unsupervised
algorithms that are explored often suffer from the constraint violation
problem; 3) there is an unavoidable mismatch between supervised training and
unsupervised inference. In this paper, a robust generic neural clustering
approach is proposed that can be integrated with any chunk-level predictor to
accomplish a fully supervised end-to-end speaker diarization model. Also, by
leveraging the sequence modelling ability of a recurrent neural network, the
proposed neural clustering approach can dynamically estimate the number of
speakers during inference. Experiments show that, when integrated with an
attractor-based chunk-level predictor, the proposed neural clustering approach
yields a better diarization error rate (DER) than constrained K-means-based
clustering approaches under mismatched conditions.
Comment: submitted to INTERSPEECH 202
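One concrete reason chunk-level predictions need a clustering back-end at all is that each chunk labels its speakers only up to an arbitrary permutation; a brute-force sketch of that inter-chunk alignment problem is shown below (illustrative of the task, not of the proposed neural clustering).

```python
from itertools import permutations

def align_chunk_labels(prev_labels, cur_labels):
    """Given the labels two adjacent chunks assign to their overlapping
    frames, pick the permutation of the current chunk's labels that
    agrees most with the previous chunk (brute force over speakers)."""
    speakers = sorted(set(cur_labels))
    best, best_match = list(cur_labels), -1
    for perm in permutations(speakers):
        mapping = dict(zip(speakers, perm))
        relabeled = [mapping[l] for l in cur_labels]
        match = sum(a == b for a, b in zip(prev_labels, relabeled))
        if match > best_match:
            best_match, best = match, relabeled
    return best
```

The paper's point is to replace such hand-crafted, constraint-prone matching with a trained sequential model that also estimates the number of speakers.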