A Global-local Attention Framework for Weakly Labelled Audio Tagging
Weakly labelled audio tagging aims to predict the classes of sound events
within an audio clip, where the onset and offset times of the sound events are
not provided. Previous works have used the multiple instance learning (MIL)
framework, and exploited the information of the whole audio clip by MIL pooling
functions. However, the detailed information of sound events such as their
durations may not be considered under this framework. To address this issue, we
propose a novel two-stream framework for audio tagging by exploiting the global
and local information of sound events. The global stream analyzes the
whole audio clip and uses a class-wise selection module to identify the local
clips that need attention. These clips are then fed to the local
stream to exploit the detailed information for a better decision. Experimental
results on AudioSet show that our proposed method can significantly improve
the performance of audio tagging under different baseline network
architectures.
Comment: Accepted to ICASSP202
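The MIL pooling step and the idea of picking local clips per class can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mil_pool` and `select_local_windows`, the pooling variants, and the "highest summed probability" selection rule are all assumptions made for illustration, since the abstract does not specify how the class-wise selection module works.

```python
import numpy as np

def mil_pool(frame_probs, mode="max"):
    """Aggregate frame-level probabilities of shape (T, C) into clip-level (C,).

    Hypothetical helper: these are common MIL pooling functions, not
    necessarily the ones used in the paper.
    """
    frame_probs = np.asarray(frame_probs, dtype=float)
    if mode == "max":
        return frame_probs.max(axis=0)
    if mode == "mean":
        return frame_probs.mean(axis=0)
    if mode == "linear_softmax":
        # Each frame is weighted by its own probability.
        denom = np.maximum(frame_probs.sum(axis=0), 1e-8)
        return (frame_probs ** 2).sum(axis=0) / denom
    raise ValueError(f"unknown pooling mode: {mode}")

def select_local_windows(frame_probs, win):
    """For each class, return the start frame of the length-`win` window
    with the highest summed probability (an assumed selection rule)."""
    frame_probs = np.asarray(frame_probs, dtype=float)
    n_frames, n_classes = frame_probs.shape
    if not 1 <= win <= n_frames:
        raise ValueError("win must be between 1 and the number of frames")
    # Prefix sums let us compute every window sum with one subtraction.
    csum = np.vstack([np.zeros((1, n_classes)), np.cumsum(frame_probs, axis=0)])
    window_sums = csum[win:] - csum[:-win]  # shape (T - win + 1, C)
    return window_sums.argmax(axis=0)
```

In a two-stream setup along these lines, the windows returned by `select_local_windows` would be cropped from the input and passed to a second (local) model, whose predictions are combined with the clip-level output of `mil_pool`.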
A Two-student Learning Framework for Mixed Supervised Target Sound Detection
Target sound detection (TSD) aims to detect the target sound from mixture
audio given the reference information. Previous work shows that a good
detection performance relies on fully-annotated data. However, collecting
fully-annotated data is labor-intensive. Therefore, we consider TSD with mixed
supervision, which learns novel categories (target domain) using weak
annotations with the help of full annotations of existing base categories
(source domain). We propose a novel two-student learning framework, which
contains two mutually helping student models that learn from fully- and
weakly-annotated datasets, respectively. Specifically, we first propose a
frame-level knowledge distillation strategy to transfer class-agnostic
knowledge from one student to the other. After that, pseudo supervised
(PS) training is designed to transfer knowledge in the opposite
direction. Lastly, an adversarial training strategy is proposed,
which aims to align the data distribution between source and target domains. To
evaluate our method, we build three TSD datasets based on UrbanSound and
AudioSet. Experimental results show that our methods offer about 8%
improvement in event-based F score.
Comment: submitted to Interspeech202
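The two knowledge-transfer steps named in this abstract, frame-level knowledge distillation and pseudo supervised (PS) training, can be sketched roughly as follows. This is an illustrative guess rather than the authors' method: the abstract does not give the loss or the pseudo-labelling rule, so the binary cross-entropy form, the 0.5 threshold, and the names `frame_distillation_loss` and `pseudo_frame_labels` are all assumptions.

```python
import numpy as np

def frame_distillation_loss(learner_probs, provider_probs, eps=1e-7):
    """Mean binary cross-entropy between the frame-level probabilities of the
    student receiving knowledge (learner) and the student providing it
    (provider). The BCE form is an assumption, not taken from the paper."""
    q = np.clip(np.asarray(learner_probs, dtype=float), eps, 1.0 - eps)
    p = np.asarray(provider_probs, dtype=float)
    return float(-np.mean(p * np.log(q) + (1.0 - p) * np.log(1.0 - q)))

def pseudo_frame_labels(probs, threshold=0.5):
    """Binarise frame-level probabilities into hard pseudo labels that the
    other student could be trained on (the threshold is hypothetical)."""
    return (np.asarray(probs, dtype=float) >= threshold).astype(int)
```

With soft targets, the loss is lowest when the learner reproduces the provider's frame-level probabilities exactly; with hard pseudo labels, the same loss reduces to ordinary frame-level supervised training.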