Zero-Shot Audio Classification via Semantic Embeddings
In this paper, we study zero-shot learning in audio classification via
semantic embeddings extracted from textual labels and sentence descriptions of
sound classes. Our goal is to obtain a classifier that is capable of
recognizing audio instances of sound classes that have no available training
samples, but only semantic side information. We employ a bilinear compatibility
framework to learn an acoustic-semantic projection between intermediate-level
representations of audio instances and sound classes, i.e., acoustic embeddings
and semantic embeddings. We use VGGish to extract deep acoustic embeddings from
audio clips, and pre-trained language models (Word2Vec, GloVe, BERT) to
generate either label embeddings from textual labels or sentence embeddings
from sentence descriptions of sound classes. Audio classification is performed
by a linear compatibility function that measures how compatible an acoustic
embedding and a semantic embedding are. We evaluate the proposed method on the
small, balanced ESC-50 dataset and on a large-scale, unbalanced audio subset of
AudioSet. The experimental results show that classification performance is
significantly improved by including sound classes in training that are
semantically close to the test classes. Meanwhile, we demonstrate that both label
embeddings and sentence embeddings are useful for zero-shot learning.
Classification performance is improved by concatenating label/sentence
embeddings generated with different language models, and hybrid concatenations
of label and sentence embeddings improve the results further.
Comment: Submitted to Transactions on Audio, Speech and Language Processing
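The bilinear compatibility framework lends itself to a minimal sketch. The 128-dimensional acoustic and 300-dimensional semantic embeddings below mirror typical VGGish and Word2Vec output sizes, but all names, dimensions, and initialization details are illustrative assumptions rather than the authors' implementation:

```python
# Minimal sketch of a bilinear acoustic-semantic compatibility model.
# Dimensions and initialization are illustrative assumptions.
import torch
import torch.nn as nn

class BilinearCompatibility(nn.Module):
    """Scores how compatible an acoustic embedding x and a semantic
    class embedding y are: F(x, y) = x^T W y."""
    def __init__(self, acoustic_dim: int, semantic_dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(acoustic_dim, semantic_dim) * 0.01)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: (batch, acoustic_dim), y: (num_classes, semantic_dim)
        # returns (batch, num_classes) compatibility scores
        return x @ self.W @ y.T

# Zero-shot prediction: assign each clip to the unseen class whose
# semantic embedding is most compatible with its acoustic embedding.
model = BilinearCompatibility(acoustic_dim=128, semantic_dim=300)
x = torch.randn(4, 128)          # acoustic embeddings of 4 clips (VGGish-sized)
y_unseen = torch.randn(10, 300)  # semantic embeddings of 10 unseen classes
pred = model(x, y_unseen).argmax(dim=1)
```

At test time only the semantic embeddings of the unseen classes change; the learned projection W is reused as-is, which is what makes the classifier zero-shot.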
Multimodal One-Shot Learning of Speech and Images
Imagine a robot is shown new concepts visually together with spoken tags,
e.g. "milk", "eggs", "butter". After seeing one paired audio-visual example per
class, it is shown a new set of unseen instances of these objects, and asked to
pick the "milk". Without receiving any hard labels, could it learn to match the
new continuous speech input to the correct visual instance? Although unimodal
one-shot learning has been studied, where one labelled example in a single
modality is given per class, this example motivates multimodal one-shot
learning. Our main contribution is to formally define this task, and to propose
several baseline and advanced models. We use a dataset of paired spoken and
visual digits to specifically investigate recent advances in Siamese
convolutional neural networks. Our best Siamese model achieves twice the
accuracy of a nearest neighbour model using pixel-distance over images and
dynamic time warping over speech in 11-way cross-modal matching.
Comment: 5 pages, 1 figure, 3 tables; accepted to ICASSP 201
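As a rough sketch of that nearest-neighbour baseline (the function names, feature shapes, and protocol details below are assumptions, not the authors' code), cross-modal matching chains a DTW lookup over speech with a pixel-distance lookup over images:

```python
# Illustrative sketch of the nearest-neighbour baseline: match a spoken
# query to its closest support-set speech example via DTW, then pick the
# test image closest (in pixel distance) to that example's paired image.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic time warping cost between feature sequences a: (T1, D), b: (T2, D)."""
    T1, T2 = len(a), len(b)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2]

def one_shot_match(query_speech, support_speech, support_images, test_images):
    """Cross-modal matching: spoken query -> nearest support speech (DTW)
    -> its paired support image -> nearest test image (pixel distance)."""
    k = int(np.argmin([dtw_distance(query_speech, s) for s in support_speech]))
    ref_image = support_images[k]
    dists = [np.linalg.norm(ref_image - img) for img in test_images]
    return int(np.argmin(dists))  # index of the predicted matching test image
```

A Siamese model replaces both hand-crafted distances with learned embedding distances, which is where the reported doubling in matching accuracy comes from.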
A Multimodal Prototypical Approach for Unsupervised Sound Classification
In the context of environmental sound classification, the adaptability of
systems is key: which sound classes are interesting depends on the context and
the user's needs. Recent advances in text-to-audio retrieval allow for
zero-shot audio classification, but performance compared to supervised models
remains limited. This work proposes a multimodal prototypical approach that
exploits local audio-text embeddings to provide more relevant answers to audio
queries, augmenting the adaptability of sound detection in the wild. We do this
by first using text to query a nearby community of audio embeddings that best
characterizes each query sound, and then selecting each community's centroid as
that sound's prototype. Second, we compare unseen audio to these prototypes for
classification. We perform multiple ablation studies to understand the impact
of the embedding models and prompts. Our unsupervised approach improves upon
the zero-shot state-of-the-art in three sound recognition benchmarks by an
average of 12%.
Comment: Accepted to INTERSPEECH 202
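A minimal sketch of this prototype construction, assuming embeddings precomputed in a joint audio-text space (e.g. from a CLAP-style model); the unlabelled audio pool, the choice of k, and all names below are illustrative assumptions:

```python
# Sketch of the multimodal prototypical idea: for each class, use its text
# embedding to retrieve the k most similar unlabelled audio embeddings, and
# take their centroid as the class prototype. All details are assumptions.
import numpy as np

def build_prototypes(text_embs: np.ndarray, audio_pool: np.ndarray, k: int = 16):
    """text_embs: (num_classes, dim), audio_pool: (pool_size, dim)."""
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    a = audio_pool / np.linalg.norm(audio_pool, axis=1, keepdims=True)
    sims = t @ a.T                      # cosine similarity (num_classes, pool_size)
    protos = []
    for row in sims:
        nearest = np.argsort(row)[-k:]  # indices of the k most similar clips
        protos.append(audio_pool[nearest].mean(axis=0))
    return np.stack(protos)             # (num_classes, dim)

def classify(audio_emb: np.ndarray, prototypes: np.ndarray) -> int:
    """Assign an unseen clip to the class with the most similar prototype."""
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    q = audio_emb / np.linalg.norm(audio_emb)
    return int(np.argmax(p @ q))
```

The approach stays unsupervised because the retrieved audio clips carry no labels; the text query alone decides which community of embeddings forms each prototype.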
Zero-Shot Audio Classification with Factored Linear and Nonlinear Acoustic-Semantic Projections
In this paper, we study zero-shot learning in audio classification through
factored linear and nonlinear acoustic-semantic projections between audio
instances and sound classes. Zero-shot learning in audio classification refers
to classification problems that aim at recognizing audio instances of sound
classes that have no available training data, only semantic side information.
We develop factored linear
projections by applying rank decomposition to a bilinear model, and use
nonlinear activation functions, such as tanh, to model the non-linearity
between acoustic embeddings and semantic embeddings. Experimental results show
that, compared with the prior bilinear model, the proposed projection methods
improve the classification performance of zero-shot learning in audio
classification.
Comment: Accepted by ICASSP 202
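A minimal sketch of the factored projection, assuming the bilinear matrix is decomposed as W ~ U V^T of rank r, with an optional tanh applied to both factors; the exact placement of the nonlinearity in the authors' model is an assumption here:

```python
# Sketch of a rank-r factored compatibility model with an optional tanh
# nonlinearity. The linear case reduces to the bilinear map x (U V^T) y^T.
import torch
import torch.nn as nn

class FactoredCompatibility(nn.Module):
    def __init__(self, acoustic_dim: int, semantic_dim: int, rank: int,
                 nonlinear: bool = False):
        super().__init__()
        self.U = nn.Parameter(torch.randn(acoustic_dim, rank) * 0.01)
        self.V = nn.Parameter(torch.randn(semantic_dim, rank) * 0.01)
        self.act = torch.tanh if nonlinear else (lambda t: t)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: (batch, acoustic_dim), y: (num_classes, semantic_dim)
        xp = self.act(x @ self.U)   # project acoustic embeddings to rank-r space
        yp = self.act(y @ self.V)   # project semantic embeddings to rank-r space
        return xp @ yp.T            # (batch, num_classes) compatibility scores
```

The rank decomposition cuts the parameter count from acoustic_dim x semantic_dim to r x (acoustic_dim + semantic_dim), and the tanh lets the shared rank-r space capture non-linear structure that a purely bilinear model cannot.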