3 research outputs found

    ESC: Dataset for Environmental Sound Classification

    No full text
    The ESC dataset is a collection of short environmental recordings available in a unified format (5-second-long clips, 44.1 kHz, single channel, Ogg Vorbis compressed @ 192 kbit/s). All clips have been extracted from public field recordings available through the Freesound.org project; please see the README files for a detailed attribution list. The dataset is available under the terms of the Creative Commons Attribution-NonCommercial license.

    The dataset consists of three parts:

    ESC-50: a labeled set of 2 000 environmental recordings (50 classes, 40 clips per class).
    ESC-10: a labeled set of 400 environmental recordings (10 classes, 40 clips per class); a subset of ESC-50, created initially as a proof-of-concept/standardized selection of easy recordings.
    ESC-US: an unlabeled set of 250 000 environmental recordings (5-second-long clips), suitable for unsupervised pre-training.

    The ESC-US dataset, although not hand-annotated, includes the labels (tags) submitted by the original uploading users, which could potentially be used for weakly-supervised learning (noisy and/or missing labels). The ESC-10 and ESC-50 datasets have been prearranged into 5 uniformly sized folds so that clips extracted from the same original source recording are always contained in a single fold. The labeled datasets are also available as GitHub projects: ESC-50 | ESC-10. For a more thorough description and analysis, please see the original paper and the supplementary IPython notebook. The goal of this project is to facilitate open research initiatives in the field of environmental sound classification, as publicly available datasets in this domain are still quite scarce.

    Acknowledgments: I would like to thank Frederic Font Corbera for his help in using the Freesound API.
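    A minimal sketch of how one might iterate the prearranged folds, assuming the layout of the ESC-50 GitHub project (a meta/esc50.csv table with filename, fold, target, and category columns, and an audio/ directory holding the clips); pandas and librosa are illustrative library choices, not part of the dataset release.

```python
# Minimal sketch of iterating the prearranged ESC-50 folds.
# Assumptions: the ESC-50 GitHub layout (meta/esc50.csv, audio/ directory);
# pandas and librosa are illustrative choices, not requirements.
import pandas as pd
import librosa

meta = pd.read_csv("ESC-50-master/meta/esc50.csv")

for test_fold in range(1, 6):  # the 5 uniformly sized folds
    train = meta[meta.fold != test_fold]
    test = meta[meta.fold == test_fold]

    # Load one held-out clip to show the unified format (mono, 44.1 kHz, 5 s).
    path = "ESC-50-master/audio/" + test.iloc[0].filename
    audio, sr = librosa.load(path, sr=44100, mono=True)
    print(f"fold {test_fold}: train={len(train)} test={len(test)} "
          f"samples={audio.shape[0]} sr={sr}")
```

    Because clips extracted from the same source recording never cross folds, holding out one fold at a time yields a leakage-free 5-fold cross-validation split.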

    Hypernetworks build Implicit Neural Representations of Sounds

    Full text link
    Implicit Neural Representations (INRs) are nowadays used to represent multimedia signals across various real-life applications, including image super-resolution, image compression, or 3D rendering. Existing methods that leverage INRs are predominantly focused on visual data, as their application to other modalities, such as audio, is nontrivial due to the inductive biases present in architectural attributes of image-based INR models. To address this limitation, we introduce HyperSound, the first meta-learning approach to produce INRs for audio samples that leverages hypernetworks to generalize beyond samples observed in training. Our approach reconstructs audio samples with quality comparable to other state-of-the-art models and provides a viable alternative to contemporary sound representations used in deep neural networks for audio processing, such as spectrograms.
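    The abstract above describes a hypernetwork that, given an audio sample, emits the weights of an implicit neural representation mapping time coordinates to amplitude. The PyTorch sketch below only illustrates that general idea; the encoder, layer sizes, sine activation, and all names are assumptions and do not reproduce the HyperSound architecture from the paper.

```python
# Illustrative sketch of the hypernetwork -> INR idea (not the paper's model).
# All sizes, names, and the placeholder encoder are assumptions for clarity.
import torch
import torch.nn as nn

INR_HIDDEN = 64  # width of the tiny generated INR (assumed)

class HyperNet(nn.Module):
    def __init__(self, audio_len=44100 * 5, hidden=256):
        super().__init__()
        # Encoder: maps a raw waveform to a latent code (placeholder MLP).
        self.encoder = nn.Sequential(
            nn.Linear(audio_len, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        # Heads emitting the weights/biases of a 2-layer INR: t -> amplitude.
        self.w1 = nn.Linear(hidden, INR_HIDDEN)  # first-layer weights (1 -> INR_HIDDEN)
        self.b1 = nn.Linear(hidden, INR_HIDDEN)
        self.w2 = nn.Linear(hidden, INR_HIDDEN)  # second-layer weights (INR_HIDDEN -> 1)
        self.b2 = nn.Linear(hidden, 1)

    def forward(self, waveform, t):
        # waveform: (batch, audio_len); t: (batch, n_points, 1) time coordinates
        z = self.encoder(waveform)
        w1 = self.w1(z).unsqueeze(-1)              # (batch, INR_HIDDEN, 1)
        b1 = self.b1(z).unsqueeze(1)               # (batch, 1, INR_HIDDEN)
        w2 = self.w2(z).unsqueeze(-1)              # (batch, INR_HIDDEN, 1)
        b2 = self.b2(z).unsqueeze(1)               # (batch, 1, 1)
        # Evaluate the generated INR; sine activation is a common choice for audio INRs.
        h = torch.sin(t @ w1.transpose(1, 2) + b1)  # (batch, n_points, INR_HIDDEN)
        return h @ w2 + b2                          # (batch, n_points, 1)

# Usage: reconstruct amplitudes at arbitrary time coordinates.
net = HyperNet()
wav = torch.randn(2, 44100 * 5)
t = torch.linspace(0, 1, 1000).view(1, -1, 1).expand(2, -1, -1)
pred = net(wav, t)  # (2, 1000, 1)
```

    Because the generated INR can be queried at arbitrary time coordinates, such a model would not be tied to a fixed sampling grid, which is part of what distinguishes INR-based representations from spectrogram-style features.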