Cover Song Identification with Timbral Shape Sequences
We introduce a novel low-level feature for identifying cover songs, which
quantifies the relative changes in the smoothed frequency spectrum of a song.
Our key insight is that a sliding window representation of a chunk of audio can
be viewed as a time-ordered point cloud in high dimensions. For corresponding
chunks of audio between different versions of the same song, these point clouds
are approximately rotated, translated, and scaled copies of each other. If we
treat MFCC embeddings as point clouds and cast the problem as a relative shape
sequence, we are able to correctly identify 42/80 cover songs in the "Covers
80" dataset. By contrast, all other work to date on cover songs exclusively
relies on matching note sequences from Chroma-derived features.
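To make the geometric idea concrete, the sketch below treats two audio chunks as MFCC point clouds and compares their shapes with Procrustes analysis, which factors out exactly the translation, scaling, and rotation mentioned above. The file names, MFCC settings, and the truncation to a common length are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch, not the paper's code: MFCC frames as a point cloud, compared with
# Procrustes analysis (removes translation, scale, and rotation).
import librosa
from scipy.spatial import procrustes

def mfcc_point_cloud(y, sr, n_mfcc=20):
    """Each MFCC frame becomes one point in an n_mfcc-dimensional cloud."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)

def shape_distance(cloud_a, cloud_b):
    """Procrustes disparity between two equal-length point clouds."""
    n = min(len(cloud_a), len(cloud_b))  # simplification: truncate, don't slide
    _, _, disparity = procrustes(cloud_a[:n], cloud_b[:n])
    return disparity

y1, sr1 = librosa.load("original.wav")  # hypothetical file names
y2, sr2 = librosa.load("cover.wav")
d = shape_distance(mfcc_point_cloud(y1, sr1), mfcc_point_cloud(y2, sr2))
print(f"shape disparity: {d:.4f}")  # smaller = more similar timbral shape
```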
Audio Cover Song Identification using Convolutional Neural Network
In this paper, we propose a new approach to cover song identification using a
CNN (convolutional neural network). Most previous studies extract feature
vectors that characterize the cover song relation from a pair of songs and use
them to compute the (dis)similarity between the two songs. Based on the
observation that there is a meaningful pattern between cover songs and that
this can be learned, we have reformulated the cover song identification problem
in a machine learning framework. To do this, we first build the CNN using as an
input a cross-similarity matrix generated from a pair of songs. We then
construct a data set composed of cover song pairs and non-cover song pairs,
which are used as positive and negative training samples, respectively. The
trained CNN outputs the probability of being in the cover song relation given a
cross-similarity matrix generated from any two pieces of music and identifies
the cover song by ranking on the probability. Experimental results show that
the proposed algorithm achieves performance better than or comparable to the
state-of-the-art.
Comment: Workshop on ML4Audio: Machine Learning for Audio Signal Processing at
NIPS 2017
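A minimal sketch of the pipeline described above, assuming cosine cross-similarity between per-frame feature sequences and a small illustrative CNN; the paper's exact architecture and features are not reproduced here.

```python
# Sketch: cross-similarity matrix between two songs, classified by a CNN
# that outputs the probability of a cover relation. Layer sizes are
# illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

def cross_similarity(feat_a, feat_b):
    """Pairwise cosine similarity between two (frames, dim) sequences."""
    a = feat_a / feat_a.norm(dim=1, keepdim=True)
    b = feat_b / feat_b.norm(dim=1, keepdim=True)
    return a @ b.T  # (frames_a, frames_b)

class CoverCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(32 * 8 * 8, 1),
        )

    def forward(self, sim_matrix):
        # sim_matrix: (batch, 1, frames_a, frames_b) -> cover probability
        return torch.sigmoid(self.net(sim_matrix))

model = CoverCNN()
sim = cross_similarity(torch.randn(400, 12), torch.randn(380, 12))
prob = model(sim[None, None])  # add batch and channel dimensions
```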
Learning View-Specific Deep Networks for Person Re-Identification
In recent years, a growing body of research has focused on the problem of
person re-identification (re-id). The re-id techniques attempt to match the
images of pedestrians from disjoint non-overlapping camera views. A major
challenge of re-id is the serious intra-class variation caused by changing
viewpoints. To overcome this challenge, we propose a deep neural network-based
framework which utilizes the view information in the feature extraction stage.
The proposed framework learns a view-specific network for each camera view with
a cross-view Euclidean constraint (CV-EC) and a cross-view center loss (CV-CL).
We utilize CV-EC to decrease the margin of the features between diverse views
and extend the center loss metric to a view-specific version to better adapt to
the re-id problem. Moreover, we propose an iterative algorithm to optimize the
parameters of the view-specific networks from coarse to fine. The experiments
demonstrate that our approach significantly improves the performance of the
existing deep networks and outperforms the state-of-the-art methods on the
VIPeR, CUHK01, CUHK03, SYSU-mReId, and Market-1501 benchmarks.
Comment: 12 pages, 8 figures, accepted by IEEE Transactions on Image Processing
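A hedged sketch of how the two constraints could look as loss terms: CV-EC penalizes the distance between features of the same identity produced by different view-specific networks, and CV-CL is a per-view center loss. All shapes and weights are illustrative assumptions.

```python
# Sketch of the two constraints as described in the abstract, not the
# paper's exact formulation.
import torch

def cv_ec_loss(feats_view_a, feats_view_b):
    """Cross-view Euclidean constraint: same identities, two camera views."""
    return ((feats_view_a - feats_view_b) ** 2).sum(dim=1).mean()

def cv_cl_loss(feats, labels, centers):
    """View-specific center loss: centers is a (num_ids, dim) table per view."""
    return ((feats - centers[labels]) ** 2).sum(dim=1).mean()

# toy usage: 8 identities, 128-d features, two camera views
centers_v1 = torch.zeros(8, 128, requires_grad=True)
f_v1, f_v2 = torch.randn(4, 128), torch.randn(4, 128)
labels = torch.tensor([0, 1, 2, 3])
loss = cv_ec_loss(f_v1, f_v2) + 0.1 * cv_cl_loss(f_v1, labels, centers_v1)
```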
Large-Scale Cover Song Detection in Digital Music Libraries Using Metadata, Lyrics and Audio Features
Cover song detection is a very relevant task in Music Information Retrieval
(MIR) studies and has been mainly addressed using audio-based systems. Despite
its potential impact in industrial contexts, low performance and a lack of
scalability have prevented such systems from being adopted in practice for
large applications. In this work, we investigate whether textual music
information (such as metadata and lyrics) can be used along with audio for
the large-scale cover identification problem in a wide digital music library. We
benchmark this problem using standard text and state-of-the-art audio
similarity measures. Our study shows that these methods can significantly
increase the accuracy and scalability of cover detection systems on the Million
Song Dataset (MSD) and SecondHandSongs (SHS) datasets. By only leveraging
standard tf-idf based text similarity measures on song titles and lyrics, we
achieved a 35.5% absolute increase in mean average precision compared to the
current scalable audio content-based state of the art methods on MSD. These
experimental results suggest that researchers should be encouraged to explore
more sophisticated NLP-based techniques to improve current cover song
identification systems in digital music libraries with metadata.
Comment: Music Information Retrieval, Cover Song Identification, Million Song
Dataset, Natural Language Processing
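The text side of such a system reduces to standard tf-idf similarity; below is a minimal sketch over song titles (the same idea applies to lyrics), with hypothetical title strings and no claim about the paper's exact preprocessing.

```python
# Sketch: tf-idf character n-grams plus cosine similarity over song titles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = [
    "Knockin' on Heaven's Door - Bob Dylan",
    "Knockin' on Heaven's Door (Live) - Guns N' Roses",
    "Like a Rolling Stone - Bob Dylan",
]
tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
vectors = tfidf.fit_transform(titles)
sims = cosine_similarity(vectors)  # (3, 3) pairwise title similarity
print(sims.round(2))  # covers of the same work should score highest
```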
Ensemble-based cover song detection
Audio-based cover song detection has received much attention in the MIR
community in recent years. To date, the most popular formulation of the
problem has been to compare the audio signals of two tracks and to make a
binary decision based on this information only. However, leveraging additional
signals might be key if one wants to solve the problem at an industrial scale.
In this paper, we introduce an ensemble-based method that approaches the
problem from a many-to-many perspective. Instead of considering pairs of tracks
in isolation, we consider larger sets of potential versions for a given
composition, and create and exploit the graph of relationships between these
tracks. We show that this can result in a significant improvement in
performance, in particular when the number of existing versions of a given
composition is large.
Comment: 7 pages, 4 figures, 7 tables
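A minimal sketch of the many-to-many idea, assuming hypothetical pairwise scores and a simple transitivity rule via connected components; the paper's actual ensemble and graph features are richer than this.

```python
# Sketch: pairwise cover scores become edges of a graph over candidate
# tracks, and each pairwise decision is refined using graph structure.
import networkx as nx

pairwise = {("a", "b"): 0.9, ("b", "c"): 0.85, ("a", "c"): 0.4, ("c", "d"): 0.1}
g = nx.Graph()
g.add_edges_from(pair for pair, score in pairwise.items() if score > 0.7)

components = list(nx.connected_components(g))  # candidate version groups

def same_composition(t1, t2):
    """Two tracks are versions if a path of strong edges connects them."""
    return any(t1 in comp and t2 in comp for comp in components)

print(same_composition("a", "c"))  # True via b, despite the weak direct score
```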
A Survey on Object Detection in Optical Remote Sensing Images
Object detection in optical remote sensing images, being a fundamental but
challenging problem in the field of aerial and satellite image analysis, plays
an important role for a wide range of applications and is receiving significant
attention in recent years. While numerous methods exist, a deep review of the
literature concerning generic object detection is still lacking. This paper
aims to provide a review of the recent progress in this field. Different from
several previously published surveys that focus on a specific object class such
as building and road, we concentrate on more generic object categories
including, but not limited to, road, building, tree, vehicle, ship, airport,
and urban area. Covering about 270 publications, we survey 1) template
matching-based object detection methods, 2) knowledge-based object detection
methods, 3) object-based image analysis (OBIA)-based object detection methods,
4) machine learning-based object detection methods, and 5) five publicly
available datasets and three standard evaluation metrics. We also discuss the
challenges of current studies and propose two promising research directions,
namely deep learning-based feature representation and weakly supervised
learning-based geospatial object detection. It is our hope that this survey
will help researchers gain a better understanding of this research field.
Comment: This manuscript is the accepted version for ISPRS Journal of
Photogrammetry and Remote Sensing
Cover Detection using Dominant Melody Embeddings
Automatic cover detection -- the task of finding in an audio database all the
covers of one or several query tracks -- has long been seen as a challenging
theoretical problem in the MIR community and as an acute practical problem for
authors' and composers' societies. Original algorithms proposed for this task
have proven their accuracy on small datasets, but are unable to scale up to
modern real-life audio corpora. On the other hand, faster approaches designed
to process thousands of pairwise comparisons resulted in lower accuracy, making
them unsuitable for practical use.
In this work, we propose a neural network architecture that is trained to
represent each track as a single embedding vector. The computational burden is
therefore left to the embedding extraction -- that can be conducted offline and
stored, while the pairwise comparison task reduces to a simple Euclidean
distance computation. We further propose to extract each track's embedding out
of its dominant melody representation, obtained by another neural network
trained for this task. We then show that this architecture improves
state-of-the-art accuracy both on small and large datasets, and is able to
scale to query databases of thousands of tracks in a few seconds.
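A sketch of why this design scales: once embeddings are stored, a query reduces to a single vectorized Euclidean nearest-neighbour search. The random arrays below are stand-ins for the network's precomputed embeddings.

```python
# Sketch: offline-computed embeddings make querying a cheap distance search.
import numpy as np

rng = np.random.default_rng(0)
db_embeddings = rng.standard_normal((10_000, 128))  # stored track embeddings
query = rng.standard_normal(128)                    # a new track's embedding

dists = np.linalg.norm(db_embeddings - query, axis=1)  # one pass over the db
top_k = np.argsort(dists)[:10]  # indices of the 10 closest candidate covers
print(top_k, dists[top_k].round(3))
```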
A Prototypical Triplet Loss for Cover Detection
Automatic cover detection -- the task of finding in an audio dataset all
covers of a query track -- has long been a challenging theoretical problem in
the MIR community. It has also become a practical need for music composers'
societies, which must detect automatically whether an audio excerpt embeds
musical content belonging to their catalog.
In a recent work, we addressed this problem with a convolutional neural
network mapping each track's dominant melody to an embedding vector, trained to
minimize the distance between cover pairs in the embedding space while
maximizing it for non-covers. We showed in particular that training this model
with enough works having five or more covers yields state-of-the-art results.
This however does not reflect the realistic use case, where music catalogs
typically contain works with zero or at most one or two covers. We thus
introduce here a new test set incorporating these constraints, and propose two
contributions to improve our model's accuracy under these stricter conditions:
we replace dominant melody with multi-pitch representation as input data, and
describe a novel prototypical triplet loss designed to improve covers
clustering. We show that these changes improve results significantly for two
concrete use cases, large dataset lookup and live song identification.
Comment: Corrections after reviewers' comments. Corrected erroneous Figure 5 in
original version
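One plausible reading of a prototypical triplet loss, sketched below: each work's prototype is the mean of its cover embeddings, and a triplet margin is enforced between a track, its own prototype, and the nearest other prototype. This is an interpretation of the abstract, not the paper's exact formulation.

```python
# Sketch of a prototypical triplet loss; margin and shapes are illustrative.
import torch
import torch.nn.functional as F

def prototypical_triplet_loss(embeddings, labels, margin=1.0):
    loss = 0.0
    works = labels.unique()
    protos = torch.stack([embeddings[labels == w].mean(dim=0) for w in works])
    for i, w in enumerate(works):
        pos = (embeddings[labels == w] - protos[i]).norm(dim=1)  # own prototype
        neg_protos = protos[torch.arange(len(works)) != i]
        neg = torch.cdist(embeddings[labels == w], neg_protos).min(dim=1).values
        loss = loss + F.relu(pos - neg + margin).mean()
    return loss / len(works)

emb = torch.randn(12, 64)                        # 12 tracks, 64-d embeddings
lab = torch.tensor([0] * 4 + [1] * 4 + [2] * 4)  # three works, four covers each
print(prototypical_triplet_loss(emb, lab))
```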
Unsupervised Person Re-identification by Deep Asymmetric Metric Embedding
Person re-identification (Re-ID) aims to match identities across
non-overlapping camera views. Researchers have proposed many supervised Re-ID
models which require quantities of cross-view pairwise labelled data. This
limits their scalability in many applications where a large amount of data
from multiple disjoint camera views is available but unlabelled. Although some
unsupervised Re-ID models have been proposed to address the scalability
problem, they often suffer from the view-specific bias problem which is caused
by dramatic variances across different camera views, e.g., different
illumination, viewpoints and occlusion. The dramatic variances induce specific
feature distortions in different camera views, which can be very disturbing in
finding cross-view discriminative information for Re-ID in the unsupervised
scenarios, since no label information is available to help alleviate the bias.
We propose to explicitly address this problem by learning an unsupervised
asymmetric distance metric based on cross-view clustering. The asymmetric
distance metric allows specific feature transformations for each camera view to
tackle the specific feature distortions. We then design a novel unsupervised
loss function to embed the asymmetric metric into a deep neural network, and
therefore develop a novel unsupervised deep framework named the DEep
Clustering-based Asymmetric MEtric Learning (DECAMEL). In such a way, DECAMEL
jointly learns the feature representation and the unsupervised asymmetric
metric. DECAMEL learns a compact cross-view cluster structure of Re-ID data,
and thus helps alleviate the view-specific bias and facilitates mining the
potential cross-view discriminative information for unsupervised Re-ID.
Extensive experiments on seven benchmark datasets whose sizes span several
orders of magnitude show the effectiveness of our framework.
Comment: To appear in TPAMI
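A toy sketch of the asymmetric metric idea: each camera view gets its own linear transform into a shared space, so view-specific distortions can be absorbed before distances are compared. In DECAMEL these transforms are learned jointly with the deep features and clustering; here they are random placeholders.

```python
# Sketch: view-specific transforms U_p, U_q define an asymmetric distance
# ||U_p x - U_q y|| across camera views. Dimensions are illustrative.
import torch

dim, shared = 256, 64
U = {v: torch.randn(dim, shared, requires_grad=True) for v in ("cam_a", "cam_b")}

def asymmetric_distance(x, view_x, y, view_y):
    """Map each feature with its own view's transform, then compare."""
    return (x @ U[view_x] - y @ U[view_y]).norm()

xa, yb = torch.randn(dim), torch.randn(dim)
print(asymmetric_distance(xa, "cam_a", yb, "cam_b"))
```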
Siamese Generative Adversarial Privatizer for Biometric Data
State-of-the-art machine learning algorithms can be fooled by carefully
crafted adversarial examples. As such, adversarial examples present a concrete
problem in AI safety. In this work we turn the tables and ask the following
question: can we harness the power of adversarial examples to prevent malicious
adversaries from learning identifying information from data while allowing
non-malicious entities to benefit from the utility of the same data? For
instance, can we use adversarial examples to anonymize a biometric dataset of
faces while retaining usefulness of this data for other purposes, such as
emotion recognition? To address this question, we propose a simple yet
effective method, called Siamese Generative Adversarial Privatizer (SGAP), that
exploits the properties of a Siamese neural network to find discriminative
features that convey identifying information. When coupled with a generative
model, our approach is able to correctly locate and disguise identifying
information, while minimally reducing the utility of the privatized dataset.
Extensive evaluation on a biometric dataset of fingerprints and cartoon faces
confirms the usefulness of our simple yet effective method.
Comment: Paper accepted to ACCV 2018 (Asian Conference on Computer Vision)
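A hedged sketch of the adversarial game the abstract describes: a Siamese branch scores identity similarity, and a generator is trained to defeat it while staying close to the input as a crude stand-in for task utility. Architectures and loss weights are illustrative assumptions, not the SGAP implementation.

```python
# Sketch: privatizer vs. Siamese identity scorer on toy 64x64 "images".
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128))  # Siamese branch
gen = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 64 * 64),
                    nn.Unflatten(1, (64, 64)))                # privatizer

def siamese_score(img_a, img_b):
    """Distance between branch embeddings; small = same identity."""
    return (embed(img_a) - embed(img_b)).norm(dim=1)

x = torch.randn(8, 64, 64)                          # batch of biometric images
private = gen(x)
identity_leak = -siamese_score(private, x).mean()   # push identities apart
utility = (private - x).abs().mean()                # stay usable for other tasks
loss = identity_leak + 10.0 * utility               # generator's objective
```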