Neural Predictive Coding using Convolutional Neural Networks towards Unsupervised Learning of Speaker Characteristics
Learning speaker-specific features is vital in many applications like speaker
recognition, diarization and speech recognition. This paper presents a novel
approach, which we term Neural Predictive Coding (NPC), to learn speaker-specific
characteristics in a completely unsupervised manner from large amounts of
unlabeled training data that even contain many non-speech events and
multi-speaker audio streams. The NPC framework exploits the proposed short-term
active-speaker stationarity hypothesis, which assumes that two temporally close short
speech segments belong to the same speaker, and thus that a common representation
encoding the commonalities of both segments should capture the
vocal characteristics of that speaker. We train a convolutional deep siamese
network to produce "speaker embeddings" by learning to separate "same" vs.
"different" speaker pairs, which are generated from unlabeled audio
streams. Two sets of experiments are conducted in different scenarios to evaluate
the strength of NPC embeddings and compare with state-of-the-art in-domain
supervised methods. First, two speaker identification experiments with
different context lengths are performed in a scenario with comparatively
limited within-speaker channel variability. NPC embeddings are found to perform
best in the short-duration experiment, and they provide complementary
information to i-vectors in the full-utterance experiments. Second, a large-scale
speaker verification task with a wide range of within-speaker channel
variability is adopted as an upper-bound experiment where comparisons are drawn
with in-domain supervised methods.
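As an illustration of the short-term active-speaker stationarity hypothesis, the pair generation could be sketched roughly as follows. This is a minimal sketch and not the authors' implementation: the function name `sample_pairs`, the parameters `seg_len` and `max_gap`, and the choice to draw the "different" segment from another audio stream are all assumptions made for the example.

```python
import random

def sample_pairs(streams, seg_len, max_gap, rng):
    """Return one "same" and one "different" training pair.

    streams: list of frame sequences from unlabeled audio; needs at least two
    streams, each at least 2 * seg_len + max_gap frames long (assumption).
    Each pair is (segment_1, segment_2, label) with label 1 = same, 0 = different.
    """
    a_idx, b_idx = rng.sample(range(len(streams)), 2)
    a, b = streams[a_idx], streams[b_idx]

    # "Same" pair: two short segments from one stream, at most max_gap frames
    # apart, assumed to come from the same active speaker.
    gap = rng.randint(0, max_gap)
    start = rng.randint(0, len(a) - (2 * seg_len + gap))
    anchor = a[start:start + seg_len]
    near = a[start + seg_len + gap:start + 2 * seg_len + gap]

    # "Different" pair (assumption): the anchor and a segment from another stream.
    other_start = rng.randint(0, len(b) - seg_len)
    far = b[other_start:other_start + seg_len]

    return (anchor, near, 1), (anchor, far, 0)
```

Pairs produced this way would then be fed to a siamese network trained to classify the label; no speaker annotations are needed at any point, only the temporal structure of the streams.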
Towards an Unsupervised Entrainment Distance in Conversational Speech using Deep Neural Networks
Entrainment is a known adaptation mechanism that causes interaction
participants to adapt or synchronize their acoustic characteristics.
Understanding how interlocutors tend to adapt to each other's speaking style
through entrainment involves measuring a range of acoustic features and
comparing those via multiple signal comparison methods. In this work, we
present a turn-level distance measure obtained in an unsupervised manner using
a Deep Neural Network (DNN) model, which we call Neural Entrainment Distance
(NED). This metric establishes a framework that learns an embedding from the
population-wide entrainment in an unlabeled training corpus. We use the
framework for a set of acoustic features and validate the measure
experimentally by showing its efficacy in distinguishing real conversations
from fake ones created by randomly shuffling speaker turns. Moreover, we show
real-world evidence of the validity of the proposed measure. We find that a high
value of NED is associated with high ratings of emotional bond in suicide
assessment interviews, which is consistent with prior studies.
Comment: submitted to Interspeech 201
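The real-versus-fake validation described in the second abstract could be sketched along the following lines. This is only a hedged illustration: `make_fake_conversation` and `mean_adjacent_distance` are hypothetical helpers, and the `distance` argument is a stand-in for the learned NED measure, which is not reproduced here.

```python
import random

def make_fake_conversation(turns, rng):
    """Return a copy of the turn list in random order (the original is untouched)."""
    fake = list(turns)
    rng.shuffle(fake)
    return fake

def mean_adjacent_distance(turns, distance):
    """Average distance between each turn and the turn that follows it.

    distance: any callable taking two turns; a placeholder for NED (assumption).
    """
    pairs = list(zip(turns, turns[1:]))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)
```

Under these assumptions, one would compute `mean_adjacent_distance` for a real conversation and for its shuffled counterpart and check whether the measure separates the two.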