3 research outputs found
FSD50K: an Open Dataset of Human-Labeled Sound Events
Most existing datasets for sound event recognition (SER) are relatively small
and/or domain-specific, with the exception of AudioSet, which is based on a
massive number of audio tracks from YouTube videos and encompasses over 500 classes of
everyday sounds. However, AudioSet is not an open dataset---its release
consists of pre-computed audio features (instead of waveforms), which limits
the adoption of some SER methods. Downloading the original audio tracks is also
problematic, as the constituent YouTube videos gradually disappear and raise
usage-rights issues, which casts doubt on the suitability of this resource for
benchmarking systems. To provide an alternative benchmark dataset and thus
foster SER research, we introduce FSD50K, an open dataset containing over 51k
audio clips totalling over 100h of audio manually labeled using 200 classes
drawn from the AudioSet Ontology. The audio clips are licensed under Creative
Commons licenses, making the dataset freely distributable (including
waveforms). We provide a detailed description of the FSD50K creation process,
tailored to the particularities of Freesound data, including challenges
encountered and solutions adopted. We include a comprehensive dataset
characterization along with discussion of limitations and key factors to allow
its audio-informed usage. Finally, we conduct sound event classification
experiments to provide baseline systems as well as insight into the main factors
to consider when splitting Freesound audio data for SER. Our goal is to develop
a dataset to be widely adopted by the community as a new open benchmark for SER
research.
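Since the dataset is labeled with multiple sound classes per clip, training on it reduces to a multi-label classification setup. The sketch below shows one common way to turn per-clip label lists into multi-hot target vectors; the column layout and comma-separated label format here are illustrative assumptions, not the dataset's documented schema.

```python
# Illustrative sketch: encoding per-clip label strings as multi-hot
# target vectors for multi-label sound event classification. The
# (filename, labels) row format is a hypothetical example layout.

def build_vocab(label_strings):
    """Collect the sorted set of classes seen across all clips."""
    classes = set()
    for s in label_strings:
        classes.update(s.split(","))
    return {c: i for i, c in enumerate(sorted(classes))}

def to_multi_hot(label_string, vocab):
    """Encode one clip's comma-separated labels as a multi-hot vector."""
    vec = [0] * len(vocab)
    for c in label_string.split(","):
        vec[vocab[c]] = 1
    return vec

rows = [
    ("clip_0001.wav", "Bark,Dog,Animal"),  # hypothetical clips/labels
    ("clip_0002.wav", "Rain,Water"),
]
vocab = build_vocab(lbl for _, lbl in rows)
targets = [to_multi_hot(lbl, vocab) for _, lbl in rows]
```

A sigmoid output layer with a per-class binary loss would then consume these targets directly.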
Unsupervised structure induction and multimodal grounding
Structured representations build upon symbolic abstraction (e.g., words in natural language and visual concepts in natural images), offer a principled way of encoding our perceptions of the physical world, and enable the human-like generalization of machine learning systems. The predominant paradigm for learning structured representations of observed data has been supervised learning, but it is limited in several respects. First, supervised learning is challenging given the scarcity of labeled data. Second, conventional approaches to structured prediction have relied on a single modality (e.g., either images or text), ignoring the learning cues that can be readily obtained from other modalities of data. In this thesis, we investigate unsupervised approaches to structure induction in a multimodal setting.
Unsupervised learning is inherently difficult in general, let alone inducing complex and discrete structures from data without direct supervision. By considering the multimodal setting, we leverage the alignments between different data modalities (e.g., text, audio, and images) to facilitate the learning of structure-induction models: for example, knowing that the individual words in "a white pigeon" always appear with the same visual object, a language parser is likely to treat them as a whole (i.e., a phrase). The multimodal learning setting is practically viable because multimodal alignments are generally abundant. For example, they can be found in online posts such as news and tweets that usually contain images and associated text, and in (YouTube) videos, where audio, scripts, and scenes are synchronized and grounded in each other.
We develop structure-induction models, which are capable of exploiting bimodal image-text alignments, for two modalities: (1) for natural language, we consider unsupervised syntactic parsing with phrase-structure grammars and regularize the parser by using visual image groundings; and (2) for visual images, we induce scene graph representations by mapping arguments and predicates in the text to their visual counterparts (i.e., visual objects and relations among them) in an unsupervised manner. While useful, crossmodal alignments are not always abundantly available on the web, e.g., the alignments between non-speech audio and text. We tackle this challenge by sharing the visual modality between image-text alignment and image-audio alignment; images function as a pivot and connect audio and text. The contributions of this thesis span from model development to data collection. We demonstrate the feasibility of applying multimodal learning techniques to unsupervised structure induction and multimodal alignment collection. Our work opens up new avenues for multimodal and unsupervised structured representation learning.
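The image-as-pivot idea described above can be illustrated with a toy scoring function: an (audio, text) pair is scored through the image that best bridges both, so no direct audio-text supervision is needed. The embeddings below are hand-crafted toy vectors standing in for trained encoder outputs; this is a sketch of the general idea, not the thesis's actual model.

```python
import numpy as np

# Toy sketch of "images as pivot": combine text-image and audio-image
# similarities to score audio-text pairs without direct supervision.

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pivot_score(text_vec, audio_vec, image_vecs):
    """Score an (audio, text) pair via the image that best bridges them."""
    return max(cosine(text_vec, img) * cosine(audio_vec, img)
               for img in image_vecs)

# Hypothetical shared embedding space: image 0 depicts a pigeon, image 1 rain.
images = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
text_pigeon = np.array([0.9, 0.1])   # caption "a white pigeon"
audio_coo = np.array([0.8, 0.2])     # cooing sound, grounded in image 0
audio_rain = np.array([0.1, 0.9])    # rain sound, grounded in image 1

score_match = pivot_score(text_pigeon, audio_coo, images)
score_mismatch = pivot_score(text_pigeon, audio_rain, images)
```

The matched pair scores higher because both its modalities project onto the same pivot image.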
Advanced informatics for event detection and temporal localization
PhD Thesis
The primary objective of a Sound Event Detection (SED) system is to detect the presence
of an acoustic event (i.e., audio tagging) and to return the onset and offset of the identified acoustic event within an audio clip (i.e., temporal localization). Such a system
can be promising in wildlife and biodiversity monitoring, surveillance, and smart-home
applications.
However, developing a system that is adept at both subtasks is not trivial. It can
be hindered by the need for a large amount of strongly labeled data, where the event tags
and the corresponding onsets and offsets are known with certainty. This is a limiting factor
as strongly labeled data is challenging to collect and is prone to annotation errors due to
the ambiguity in the perception of onsets and offsets.
In this thesis, we propose to address the lack of strongly labeled data by using pseudo
strongly labeled data, where the event tags are known with certainty while the
corresponding onsets and offsets are estimated. Although Nonnegative Matrix
Factorization (NMF) can be used directly for SED, its accuracy is limited; we
show that it is nonetheless a useful tool for pseudo labeling. We further show
that pseudo strongly labeled data estimated using
our proposed methods can improve the accuracy of a SED system developed using deep
learning approaches.
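The pseudo-labeling idea can be sketched as follows: factorize a magnitude spectrogram V (frequency x time) as W @ H with NMF, then threshold each component's activation row in H to mark the frames where an event is active, yielding estimated onsets and offsets. This is a generic illustration using plain multiplicative updates, not the thesis's specific method.

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(V, rank, iters=200, eps=1e-9):
    """Plain multiplicative-update NMF minimizing squared error."""
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def pseudo_segments(h, threshold):
    """Return (onset, offset) frame pairs where activation exceeds threshold."""
    active = h > threshold
    segments, start = [], None
    for t, on in enumerate(active):
        if on and start is None:
            start = t
        elif not on and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments

# Toy spectrogram: one "event" occupying frames 10-19 of a 40-frame clip.
V = np.zeros((16, 40))
V[4:8, 10:20] = 1.0
W, H = nmf(V, rank=1)
segs = pseudo_segments(H[0], threshold=0.5 * H[0].max())
```

Pairing these estimated onsets/offsets with clip-level event tags gives pseudo strongly labeled training data.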
Subsequent work focused on improving a SED system as a whole rather than a
single subtask. This led to the proposal of a novel student-teacher training framework
that incorporates a noise-robust loss function, a new cyclic training scheme, an improved
depthwise separable convolution, a triple instance-level temporal pooling approach, and an
improved Transformer encoding layer. Together with synthetic strongly labeled data and a
large corpus of unlabeled data, we show that a SED system developed using our proposed
method is capable of producing state-of-the-art performance.
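Student-teacher frameworks of this kind typically maintain the teacher as an exponential moving average (EMA) of the student's weights, so the teacher provides stable targets on unlabeled data. The sketch below shows only that generic EMA step over a toy parameter dictionary; the thesis's actual framework additionally involves a noise-robust loss, a cyclic training scheme, and architectural changes not reproduced here.

```python
# Generic EMA teacher update used by typical mean-teacher style
# frameworks for semi-supervised SED (illustrative, not the thesis's
# exact scheme). Parameters are held in a plain dict for simplicity.

def ema_update(teacher, student, alpha=0.99):
    """Move each teacher parameter a small step toward the student's."""
    return {name: alpha * t + (1.0 - alpha) * student[name]
            for name, t in teacher.items()}

student = {"w": 1.0, "b": 0.5}   # toy "trained" student parameters
teacher = {"w": 0.0, "b": 0.0}   # teacher starts from scratch
for _ in range(100):
    teacher = ema_update(teacher, student)
```

After n updates toward a fixed student, each teacher parameter equals (1 - alpha^n) times the student's value, so the teacher tracks the student smoothly rather than copying it outright.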