2 research outputs found

    Neural Predictive Coding using Convolutional Neural Networks towards Unsupervised Learning of Speaker Characteristics

    Full text link
    Learning speaker-specific features is vital in many applications like speaker recognition, diarization and speech recognition. This paper provides a novel approach, we term Neural Predictive Coding (NPC), to learn speaker-specific characteristics in a completely unsupervised manner from large amounts of unlabeled training data that even contain many non-speech events and multi-speaker audio streams. The NPC framework exploits the proposed short-term active-speaker stationarity hypothesis which assumes two temporally-close short speech segments belong to the same speaker, and thus a common representation that can encode the commonalities of both the segments, should capture the vocal characteristics of that speaker. We train a convolutional deep siamese network to produce "speaker embeddings" by learning to separate `same' vs `different' speaker pairs which are generated from an unlabeled data of audio streams. Two sets of experiments are done in different scenarios to evaluate the strength of NPC embeddings and compare with state-of-the-art in-domain supervised methods. First, two speaker identification experiments with different context lengths are performed in a scenario with comparatively limited within-speaker channel variability. NPC embeddings are found to perform the best at short duration experiment, and they provide complementary information to i-vectors for full utterance experiments. Second, a large scale speaker verification task having a wide range of within-speaker channel variability is adopted as an upper-bound experiment where comparisons are drawn with in-domain supervised methods

    Towards an Unsupervised Entrainment Distance in Conversational Speech using Deep Neural Networks

    Full text link
    Entrainment is a known adaptation mechanism that causes interaction participants to adapt or synchronize their acoustic characteristics. Understanding how interlocutors tend to adapt to each other's speaking style through entrainment involves measuring a range of acoustic features and comparing those via multiple signal comparison methods. In this work, we present a turn-level distance measure obtained in an unsupervised manner using a Deep Neural Network (DNN) model, which we call Neural Entrainment Distance (NED). This metric establishes a framework that learns an embedding from the population-wide entrainment in an unlabeled training corpus. We use the framework for a set of acoustic features and validate the measure experimentally by showing its efficacy in distinguishing real conversations from fake ones created by randomly shuffling speaker turns. Moreover, we show real world evidence of the validity of the proposed measure. We find that high value of NED is associated with high ratings of emotional bond in suicide assessment interviews, which is consistent with prior studies.Comment: submitted to Interspeech 201
    corecore