153 research outputs found

    Semi-supervised triplet loss based learning of ambient audio embeddings

    Deep neural networks are particularly useful to learn relevant representations from data. Recent studies have demonstrated the potential of unsupervised representation learning for ambient sound analysis using various flavors of the triplet loss. They have compared this approach to supervised learning. However, in real situations, it is common to have a small labeled dataset and a large unlabeled one. In this paper, we combine unsupervised and supervised triplet loss based learning into a semi-supervised representation learning approach. We propose two flavors of this approach, whereby the positive samples for those triplets whose anchors are unlabeled are obtained either by applying a transformation to the anchor, or by selecting the nearest sample in the training set. We compare our approach to supervised and unsupervised representation learning as well as the ratio between the amount of labeled and unlabeled data. We evaluate all the above approaches on an audio tagging task using the DCASE 2018 Task 4 dataset, and we show the impact of this ratio on the tagging performance.
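The two ways of obtaining a positive for an unlabeled anchor can be illustrated with a minimal sketch. This is not the authors' implementation: the hinge form of the triplet loss on squared Euclidean distances, the margin value, the Gaussian-noise transformation, and the function names (`triplet_loss`, `nearest_positive`, `transformed_positive`) are all assumptions chosen for illustration, and embeddings are plain Python lists rather than network outputs.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: pull the positive closer than the negative by `margin`.

    The margin of 1.0 is an arbitrary illustrative default.
    """
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def nearest_positive(anchor_idx, embeddings):
    """Flavor 1 (assumed form): use the nearest other sample as the positive."""
    candidates = [i for i in range(len(embeddings)) if i != anchor_idx]
    return min(candidates, key=lambda i: sq_dist(embeddings[anchor_idx], embeddings[i]))


def transformed_positive(x, rng, noise_scale=0.01):
    """Flavor 2 (assumed form): build a positive by perturbing the anchor.

    Additive Gaussian noise is a stand-in for whatever transformation is used.
    """
    return [a + rng.gauss(0.0, noise_scale) for a in x]


# Toy embeddings: samples 0 and 1 are close together, sample 2 is far away.
embeddings = [[0.0, 0.0], [0.1, 0.0], [3.0, 4.0]]
pos_idx = nearest_positive(0, embeddings)
loss = triplet_loss(embeddings[0], embeddings[pos_idx], embeddings[2])
augmented = transformed_positive(embeddings[0], random.Random(0))
```

In the toy data the far sample already satisfies the margin, so the loss for that triplet is zero; a triplet whose negative is as close as its positive incurs a loss equal to the margin.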

    Learnable PINs: Cross-Modal Embeddings for Person Identity

    We propose and investigate an identity sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice. We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a curriculum learning schedule for hard negative mining targeted to this task, that is essential for learning to proceed successfully; third, we demonstrate and evaluate cross-modal retrieval for identities unseen and unheard during training over a number of scenarios and establish a benchmark for this novel task; finally, we show an application of using the joint embedding for automatically retrieving and labelling characters in TV dramas. Comment: To appear in ECCV 201
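A curriculum for hard negative mining can be sketched as follows. This is only one possible reading of the idea, not the schedule from the paper: the linear ramp, the rank-based selection, and the names `pick_negative` and `difficulty_schedule` are hypothetical. The sketch ranks candidate negatives from easiest (farthest from the anchor) to hardest (closest) and picks harder ones as training progresses.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def difficulty_schedule(epoch, total_epochs):
    """Hypothetical linear ramp from 0.0 (easy) to 1.0 (hard) over training."""
    return min(1.0, epoch / max(1, total_epochs - 1))


def pick_negative(anchor, candidates, difficulty):
    """Return the index of a negative at the requested difficulty level.

    difficulty = 0.0 selects the farthest (easiest) candidate,
    difficulty = 1.0 selects the closest (hardest) candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate negative")
    difficulty = min(1.0, max(0.0, difficulty))
    # Sort candidate indices from farthest to closest.
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: sq_dist(anchor, candidates[i]),
        reverse=True,
    )
    return ranked[round(difficulty * (len(ranked) - 1))]


# Toy example: three candidate negatives at distances 1, 5 and 3 from the anchor.
anchor = [0.0, 0.0]
candidates = [[1.0, 0.0], [5.0, 0.0], [3.0, 0.0]]
early = pick_negative(anchor, candidates, difficulty_schedule(0, 5))
late = pick_negative(anchor, candidates, difficulty_schedule(4, 5))
```

Starting with easy negatives and only later moving to hard ones is the general motivation for such a schedule; the exact ramp and selection rule here are placeholders.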

    Learning Sensory Representations with Minimal Supervision
