Deep Unsupervised Similarity Learning using Partially Ordered Sets
Unsupervised learning of visual similarities is of paramount importance to
computer vision, particularly because training data for fine-grained
similarities is lacking. Deep learning of similarities is often based on relationships
between pairs or triplets of samples. Many of these relations are unreliable
and mutually contradictory, which leads to inconsistencies when a network is trained without
supervision information that relates different tuples or triplets to each
other. To overcome this problem, we use local estimates of reliable
(dis-)similarities to initially group samples into compact surrogate classes
and use local partial orders of samples to classes to link classes to each
other. Similarity learning is then formulated as a partial ordering task with
soft correspondences of all samples to classes. Adopting a strategy of
self-supervision, a CNN is trained to optimally represent samples in a mutually
consistent manner while updating the classes. The similarity learning and
grouping procedures are integrated in a single model and optimized jointly. The
proposed unsupervised approach shows competitive performance on detailed pose
estimation and object classification.
Comment: Accepted for publication at IEEE Computer Vision and Pattern
Recognition 201
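As a point of reference for the pair- and triplet-based formulations the abstract above mentions, here is a minimal sketch of a standard triplet margin loss in Python. It is not the paper's partial-ordering objective; the function name `triplet_loss`, the use of Euclidean distance, and the default `margin` are illustrative assumptions.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances (illustrative only).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = math.dist(anchor, positive)  # anchor-to-positive distance
    d_neg = math.dist(anchor, negative)  # anchor-to-negative distance
    return max(0.0, d_pos - d_neg + margin)


# A well-ordered triplet incurs no loss; swapping positive and negative does.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))
```

Each triplet is scored in isolation here, so nothing stops two triplets from imposing contradictory constraints. That is the kind of inconsistency the abstract describes.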
Labelling unlabelled videos from scratch with multi-modal self-supervision
A large part of the current success of deep learning lies in the
effectiveness of data -- more precisely: labelled data. Yet, labelling a
dataset with human annotation continues to carry high costs, especially for
videos. While recent methods in the image domain have made it possible to generate
meaningful (pseudo-) labels for unlabelled datasets without supervision, this
development is missing for the video domain, where learning feature
representations is the current focus. In this work, we a) show that
unsupervised labelling of a video dataset does not come for free from strong
feature encoders and b) propose a novel clustering method that allows
pseudo-labelling of a video dataset without any human annotations, by
leveraging the natural correspondence between the audio and visual modalities.
An extensive analysis shows that the resulting clusters have high semantic
overlap with ground-truth human labels. We further introduce the first
benchmarking results on unsupervised labelling of common video datasets
Kinetics, Kinetics-Sound, VGG-Sound and AVE.
Comment: Accepted to NeurIPS 2020. Project page:
https://www.robots.ox.ac.uk/~vgg/research/selavi, code:
https://github.com/facebookresearch/selav
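One simple way to quantify the overlap between unsupervised clusters and ground-truth labels is cluster purity. The sketch below is an illustration only: the abstract does not state which metrics the authors report, and the helper name `cluster_purity` is hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, labels):
    """Fraction of samples whose label matches their cluster's majority label.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if len(cluster_ids) != len(labels) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")

    # Collect the ground-truth labels that fall into each cluster.
    groups = defaultdict(list)
    for cluster, label in zip(cluster_ids, labels):
        groups[cluster].append(label)

    # Count, per cluster, how many samples carry the most frequent label.
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in groups.values()
    )
    return majority_total / len(labels)


# Toy example: six clips, two clusters, two ground-truth classes.
print(cluster_purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"]))
```

Purity rises trivially as the number of clusters grows, so it is usually read together with a measure such as normalized mutual information.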