6,534 research outputs found
Conditional Similarity Networks
What makes images similar? To measure the similarity between images, they are
typically embedded in a feature-vector space, in which their distance preserve
the relative dissimilarity. However, when learning such similarity embeddings
the simplifying assumption is commonly made that images are only compared to
one unique measure of similarity. A main reason for this is that contradicting
notions of similarities cannot be captured in a single space. To address this
shortcoming, we propose Conditional Similarity Networks (CSNs) that learn
embeddings differentiated into semantically distinct subspaces that capture the
different notions of similarities. CSNs jointly learn a disentangled embedding
where features for different similarities are encoded in separate dimensions as
well as masks that select and reweight relevant dimensions to induce a subspace
that encodes a specific similarity notion. We show that our approach learns
interpretable image representations with visually relevant semantic subspaces.
Further, when evaluating on triplet questions from multiple similarity notions
our model even outperforms the accuracy obtained by training individual
specialized networks for each notion separately.Comment: CVPR 201
Unsupervised Learning of Semantic Audio Representations
Even in the absence of any explicit semantic annotation, vast collections of
audio recordings provide valuable information for learning the categorical
structure of sounds. We consider several class-agnostic semantic constraints
that apply to unlabeled nonspeech audio: (i) noise and translations in time do
not change the underlying sound category, (ii) a mixture of two sound events
inherits the categories of the constituents, and (iii) the categories of events
in close temporal proximity are likely to be the same or related. Without
labels to ground them, these constraints are incompatible with classification
loss functions. However, they may still be leveraged to identify geometric
inequalities needed for triplet loss-based training of convolutional neural
networks. The result is low-dimensional embeddings of the input spectrograms
that recover 41% and 84% of the performance of their fully-supervised
counterparts when applied to downstream query-by-example sound retrieval and
sound event classification tasks, respectively. Moreover, in
limited-supervision settings, our unsupervised embeddings double the
state-of-the-art classification performance.Comment: Submitted to ICASSP 201
- …