5 research outputs found

    A Global-local Attention Framework for Weakly Labelled Audio Tagging

    Full text link
    Weakly labelled audio tagging aims to predict the classes of sound events within an audio clip, where the onset and offset times of the sound events are not provided. Previous works have used the multiple instance learning (MIL) framework, and exploited the information of the whole audio clip by MIL pooling functions. However, the detailed information of sound events such as their durations may not be considered under this framework. To address this issue, we propose a novel two-stream framework for audio tagging by exploiting the global and local information of sound events. The global stream aims to analyze the whole audio clip in order to capture the local clips that need to be attended using a class-wise selection module. These clips are then fed to the local stream to exploit the detailed information for a better decision. Experimental results on the AudioSet show that our proposed method can significantly improve the performance of audio tagging under different baseline network architectures.Comment: Accepted to ICASSP202

    A Two-student Learning Framework for Mixed Supervised Target Sound Detection

    Full text link
    Target sound detection (TSD) aims to detect the target sound from mixture audio given the reference information. Previous work shows that a good detection performance relies on fully-annotated data. However, collecting fully-annotated data is labor-extensive. Therefore, we consider TSD with mixed supervision, which learns novel categories (target domain) using weak annotations with the help of full annotations of existing base categories (source domain). We propose a novel two-student learning framework, which contains two mutual helping student models (s_student\mathit{s\_student} and w_student\mathit{w\_student}) that learn from fully- and weakly-annotated datasets, respectively. Specifically, we first propose a frame-level knowledge distillation strategy to transfer the class-agnostic knowledge from s_student\mathit{s\_student} to w_student\mathit{w\_student}. After that, a pseudo supervised (PS) training is designed to transfer the knowledge from w_student\mathit{w\_student} to s_student\mathit{s\_student}. Lastly, an adversarial training strategy is proposed, which aims to align the data distribution between source and target domains. To evaluate our method, we build three TSD datasets based on UrbanSound and Audioset. Experimental results show that our methods offer about 8\% improvement in event-based F score.Comment: submitted to interspeech202
    corecore