A Global-local Attention Framework for Weakly Labelled Audio Tagging

Authors: Wang Helin, Wang Wenwu, Zou Yuexian
Publication date: 03/02/2021

Weakly labelled audio tagging aims to predict the classes of sound events within an audio clip, where the onset and offset times of the sound events are not provided. Previous works have used the multiple instance learning (MIL) framework, and exploited the information of the whole audio clip by MIL pooling functions. However, the detailed information of sound events such as their durations may not be considered under this framework. To address this issue, we propose a novel two-stream framework for audio tagging by exploiting the global and local information of sound events. The global stream aims to analyze the whole audio clip in order to capture the local clips that need to be attended using a class-wise selection module. These clips are then fed to the local stream to exploit the detailed information for a better decision. Experimental results on AudioSet show that our proposed method can significantly improve the performance of audio tagging under different baseline network architectures.

Comment: Accepted to ICASSP202

arXiv.org e-Print Archive

University of Surrey
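
The abstract above refers to MIL pooling functions, which reduce frame-level predictions to one clip-level prediction per class. The abstract does not say which pooling functions earlier works used, so the following is only a minimal sketch of three commonly used choices (max, mean, and linear softmax pooling). The function names and the probability values are hypothetical and chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]  # frames x classes, each value in [0, 1]


def max_pool(frame_probs: Matrix) -> List[float]:
    """Clip-level score per class = highest frame-level probability."""
    return [max(col) for col in zip(*frame_probs)]


def mean_pool(frame_probs: Matrix) -> List[float]:
    """Clip-level score per class = average frame-level probability."""
    n_frames = len(frame_probs)
    return [sum(col) / n_frames for col in zip(*frame_probs)]


def linear_softmax_pool(frame_probs: Matrix, eps: float = 1e-8) -> List[float]:
    """Each frame is weighted by its own probability: sum(p^2) / sum(p)."""
    return [sum(p * p for p in col) / (sum(col) + eps)
            for col in zip(*frame_probs)]


# Hypothetical frame-level probabilities: 4 frames x 2 classes.
# Class 0 is active only in the first two frames; class 1 is never active.
frame_probs = [
    [0.9, 0.1],
    [0.8, 0.1],
    [0.1, 0.1],
    [0.2, 0.1],
]

clip_max = max_pool(frame_probs)
clip_mean = mean_pool(frame_probs)
clip_lin = linear_softmax_pool(frame_probs)
```

All three functions return one score per class and discard the frame axis, so information such as when an event occurs or how long it lasts is not visible in the clip-level output. This is the limitation the abstract attributes to the MIL framework.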

A Two-student Learning Framework for Mixed Supervised Target Sound Detection

Authors: Wang Helin, Wang Wenwu, Yang Dongchao, Zou Yuexian
Publication date: 05/04/2022

Target sound detection (TSD) aims to detect the target sound from mixture audio given the reference information. Previous work shows that good detection performance relies on fully annotated data. However, collecting fully annotated data is labor-intensive. Therefore, we consider TSD with mixed supervision, which learns novel categories (target domain) using weak annotations with the help of full annotations of existing base categories (source domain). We propose a novel two-student learning framework, which contains two mutually helping student models (s_student and w_student) that learn from fully and weakly annotated datasets, respectively. Specifically, we first propose a frame-level knowledge distillation strategy to transfer the class-agnostic knowledge from s_student to w_student. After that, a pseudo supervised (PS) training is designed to transfer the knowledge from w_student to s_student. Lastly, an adversarial training strategy is proposed, which aims to align the data distribution between source and target domains. To evaluate our method, we build three TSD datasets based on UrbanSound and AudioSet. Experimental results show that our methods offer about 8% improvement in event-based F score.

Comment: submitted to interspeech202

arXiv.org e-Print Archive
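
The abstract above mentions a frame-level knowledge distillation step between the two student models but does not give the loss. As a rough illustration only, and not the authors' formulation, the sketch below uses a frame-wise binary cross-entropy in which one model's per-frame sound-presence probabilities serve as soft targets for the other. The function name `frame_distillation_loss` and all probability values are hypothetical.

```python
import math
from typing import List


def frame_distillation_loss(teacher: List[float],
                            student: List[float],
                            eps: float = 1e-7) -> float:
    """Mean binary cross-entropy of student frame probabilities
    against the teacher's frame probabilities (soft targets)."""
    if len(teacher) != len(student):
        raise ValueError("teacher and student must cover the same frames")
    total = 0.0
    for t, s in zip(teacher, student):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(s) + (1.0 - t) * math.log(1.0 - s))
    return total / len(teacher)


# Hypothetical per-frame probabilities that some target sound is present.
teacher_frames = [0.9, 0.8, 0.1, 0.1]    # model providing the soft targets
student_close = [0.85, 0.7, 0.2, 0.15]   # student that roughly agrees
student_far = [0.1, 0.2, 0.9, 0.8]       # student that disagrees

loss_close = frame_distillation_loss(teacher_frames, student_close)
loss_far = frame_distillation_loss(teacher_frames, student_far)
```

In this sketch the loss is smaller when the student's frame-level predictions follow the teacher's, so minimizing it pushes the student toward the teacher's temporal localization without using any class label.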