Unsupervised Learning of Visual Representations using Videos
Is strong supervision necessary for learning a good visual representation? Do
we really need millions of semantically-labeled images to train a Convolutional
Neural Network (CNN)? In this paper, we present a simple yet surprisingly
powerful approach for the unsupervised learning of CNNs. Specifically, we use
hundreds of thousands of unlabeled videos from the web to learn visual
representations. Our key idea is that visual tracking provides the supervision.
That is, two patches connected by a track should have similar visual
representations in deep feature space, since they probably belong to the same
object or object part. We design a Siamese-triplet network with a ranking loss
function to train this CNN representation. Using only 100K unlabeled videos
and the VOC 2012 dataset, without a single image from ImageNet, we train an
ensemble of unsupervised networks that achieves 52% mAP (no bounding-box
regression). This comes tantalizingly close to its ImageNet-supervised
counterpart, an ensemble that achieves 54.4% mAP. We also show that our
unsupervised network performs competitively on other tasks such as
surface-normal estimation.
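For concreteness, here is a minimal PyTorch sketch of this kind of triplet ranking objective. The function name, the margin value, and the use of cosine distance are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
# Minimal sketch of a Siamese-triplet ranking loss on tracked patches.
# Names, the margin, and the cosine-distance choice are assumptions for
# illustration, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def track_ranking_loss(anchor, tracked, negative, margin=0.5):
    """Patches connected by a track should lie closer in feature space
    than a patch and a random negative drawn from another video."""
    d_pos = 1.0 - F.cosine_similarity(anchor, tracked)    # same track
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)   # random patch
    # Hinge: penalize only when the tracked patch is not closer by `margin`.
    return F.relu(d_pos - d_neg + margin).mean()

# Usage with embeddings from three weight-shared CNN branches.
a, p, n = (torch.randn(8, 128) for _ in range(3))
loss = track_ranking_loss(a, p, n)
```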
Self-Supervised Learning for Spinal MRIs
A significant proportion of patients scanned in a clinical setting have
follow-up scans. We show in this work that such longitudinal scans alone can be
used as a form of 'free' self-supervision for training a deep network. We
demonstrate this self-supervised learning for the case of T2-weighted sagittal
lumbar Magnetic Resonance Images (MRIs). A Siamese convolutional neural network
(CNN) is trained using two losses: (i) a contrastive loss on whether the scan
is of the same person (i.e. longitudinal) or not, together with (ii) a
classification loss on predicting the level of vertebral bodies. The
performance of this pre-trained network is then assessed on a grading
classification task. We experiment on a dataset of 1016 subjects, 423 of whom
have follow-up scans, with the end goal of learning the disc-degeneration
radiological gradings attached to the intervertebral discs. We show that on
the supervised classification task the pre-trained CNN (i) outperforms a
network trained from scratch and (ii) reaches equivalent performance with far
fewer annotated training samples.
Comment: 3rd Workshop on Deep Learning in Medical Image Analysis
Unsupervised Learning of Semantic Audio Representations
Even in the absence of any explicit semantic annotation, vast collections of
audio recordings provide valuable information for learning the categorical
structure of sounds. We consider several class-agnostic semantic constraints
that apply to unlabeled nonspeech audio: (i) noise and translations in time do
not change the underlying sound category, (ii) a mixture of two sound events
inherits the categories of the constituents, and (iii) the categories of events
in close temporal proximity are likely to be the same or related. Without
labels to ground them, these constraints are incompatible with classification
loss functions. However, they may still be leveraged to identify geometric
inequalities needed for triplet loss-based training of convolutional neural
networks. The result is low-dimensional embeddings of the input spectrograms
that recover 41% and 84% of the performance of their fully-supervised
counterparts when applied to downstream query-by-example sound retrieval and
sound event classification tasks, respectively. Moreover, in
limited-supervision settings, our unsupervised embeddings double the
state-of-the-art classification performance.
Comment: Submitted to ICASSP 2018
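The three constraints map naturally onto triplet sampling; a hedged sketch follows. The augmentation choices (Gaussian noise standing in for noise/time translation, equal-weight mixing for mixtures) and all names are assumptions, and the paper's actual sampling scheme may differ.

```python
# Sketch of turning the three class-agnostic constraints into triplets
# for a standard TripletMarginLoss. Helpers and constants are illustrative.
import torch

triplet = torch.nn.TripletMarginLoss(margin=0.1)

def constraint_triplet_loss(embed, spec, spec_nearby, spec_mixin, spec_neg):
    a = embed(spec)
    n = embed(spec_neg)                                   # unrelated clip
    p_noise = embed(spec + 0.1 * torch.randn_like(spec))  # (i) perturbation
    p_mix = embed(0.5 * (spec + spec_mixin))              # (ii) mixture
    p_near = embed(spec_nearby)                           # (iii) proximity
    return (triplet(a, p_noise, n)
            + triplet(a, p_mix, n)
            + triplet(a, p_near, n)) / 3.0

# Usage with a toy embedding network over flattened spectrograms.
embed = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(96 * 64, 128))
specs = [torch.randn(4, 96, 64) for _ in range(4)]
loss = constraint_triplet_loss(embed, *specs)
```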