ESC: Dataset for Environmental Sound Classification
The ESC dataset is a collection of short environmental recordings available in a unified format (5-second-long clips, 44.1 kHz, single channel, Ogg Vorbis compressed @ 192 kbit/s). All clips have been extracted from public field recordings available through the Freesound.org project. Please see the README files for a detailed attribution list. The dataset is available under the terms of the Creative Commons Attribution-NonCommercial license.
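A downloaded clip can be sanity-checked against this unified format with a few lines of Python, for instance using librosa (a minimal sketch; the filename below is illustrative, not an actual file from the dataset):

```python
import librosa

# Load at the native sample rate (sr=None) instead of librosa's default resampling.
audio, sr = librosa.load("dog_bark.ogg", sr=None, mono=True)

assert sr == 44100, "clips are distributed at 44.1 kHz"
assert len(audio) == 5 * sr, "every clip is exactly 5 seconds long"
```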
The dataset consists of three parts:
ESC-50: a labeled set of 2 000 environmental recordings (50 classes, 40 clips per class),
ESC-10: a labeled set of 400 environmental recordings (10 classes, 40 clips per class) (a subset of ESC-50, created initially as a proof-of-concept/standardized selection of easy recordings),
ESC-US: an unlabeled dataset of 250 000 environmental recordings (5-second-long clips), suitable for unsupervised pre-training.
The ESC-US dataset, although not hand-annotated, includes the labels (tags) submitted by the original uploading users, which could potentially be used for weakly-supervised learning (noisy and/or missing labels). The ESC-10 and ESC-50 datasets have been prearranged into 5 uniformly sized folds so that clips extracted from the same original source recording are always contained in a single fold.
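A fold-respecting evaluation loop might look as follows (a sketch assuming the metadata layout of the ESC-50 GitHub project, i.e. a meta/esc50.csv file with "filename", "fold", and "target" columns; adjust the path to your local copy):

```python
import pandas as pd

meta = pd.read_csv("ESC-50-master/meta/esc50.csv")

for held_out in sorted(meta["fold"].unique()):  # folds are numbered 1..5
    train = meta[meta["fold"] != held_out]
    test = meta[meta["fold"] == held_out]
    # Clips from the same source recording share a fold, so splitting by
    # fold avoids leakage between related clips in train and test.
    print(f"fold {held_out}: {len(train)} train / {len(test)} test clips")
```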
The labeled datasets are also available as GitHub projects: ESC-50 | ESC-10.
For a more thorough description and analysis, please see the original paper and the supplementary IPython notebook.
The goal of this project is to facilitate open research initiatives in the field of environmental sound classification as publicly available datasets in this domain are still quite scarce.
Acknowledgments
I would like to thank Frederic Font Corbera for his help in using the Freesound API.
Hypernetworks build Implicit Neural Representations of Sounds
Implicit Neural Representations (INRs) are nowadays used to represent multimedia signals across various real-life applications, including image super-resolution, image compression, and 3D rendering. Existing methods that leverage INRs are predominantly focused on visual data, as their application to other modalities, such as audio, is nontrivial due to the inductive biases present in the architectural attributes of image-based INR models. To address this limitation, we introduce HyperSound, the first meta-learning approach to produce INRs for audio samples that leverages hypernetworks to generalize beyond the samples observed in training. Our approach reconstructs audio samples with quality comparable to other state-of-the-art models and provides a viable alternative to contemporary sound representations used in deep neural networks for audio processing, such as spectrograms.

Comment: ECML202