A Two-student Learning Framework for Mixed Supervised Target Sound Detection
Target sound detection (TSD) aims to detect the target sound from mixture
audio given the reference information. Previous work shows that a good
detection performance relies on fully-annotated data. However, collecting
fully-annotated data is labor-extensive. Therefore, we consider TSD with mixed
supervision, which learns novel categories (target domain) using weak
annotations with the help of full annotations of existing base categories
(source domain). We propose a novel two-student learning framework that
contains two mutually helping student models: one learns from the
fully-annotated source-domain data and the other from the weakly-annotated
target-domain data. Specifically, we first propose a frame-level knowledge
distillation strategy to transfer class-agnostic knowledge from the
fully-supervised student to the weakly-supervised student. After that, a
pseudo-supervised (PS) training stage is designed to transfer knowledge back
from the weakly-supervised student to the fully-supervised one. Lastly, an
adversarial training strategy is proposed,
which aims to align the data distribution between source and target domains. To
evaluate our method, we build three TSD datasets based on UrbanSound and
Audioset. Experimental results show that our method offers about an 8%
improvement in the event-based F-score.
Comment: submitted to interspeech202
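As a rough illustration of the frame-level knowledge distillation step described above, the sketch below matches the frame-wise posteriors of a weakly-supervised student to those of a fully-supervised student with a KL-divergence loss; the module and parameter names (FrameStudent, frame_kd_loss, temperature) are illustrative assumptions and not taken from the paper.

```python
# Minimal sketch of frame-level knowledge distillation between two students.
# Assumes both students map (batch, time, feat) features to frame-level class
# posteriors of the same shape; names here are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameStudent(nn.Module):
    """A toy frame-level classifier standing in for either student model."""
    def __init__(self, feat_dim=64, num_classes=10):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 128, batch_first=True, bidirectional=True)
        self.head = nn.Linear(256, num_classes)

    def forward(self, x):                 # x: (batch, time, feat_dim)
        h, _ = self.rnn(x)
        return self.head(h)               # frame-level logits (batch, time, classes)

def frame_kd_loss(logits_weak, logits_full, temperature=2.0):
    """KL distillation from the fully-supervised student (acting as teacher)
    to the weakly-supervised student, computed per frame."""
    log_p_weak = F.log_softmax(logits_weak / temperature, dim=-1)
    p_full = F.softmax(logits_full.detach() / temperature, dim=-1)
    return F.kl_div(log_p_weak, p_full, reduction="batchmean") * temperature ** 2

if __name__ == "__main__":
    student_fully, student_weakly = FrameStudent(), FrameStudent()
    feats = torch.randn(4, 100, 64)       # dummy mixture features
    loss = frame_kd_loss(student_weakly(feats), student_fully(feats))
    print(loss.item())
```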
Improving Weakly Supervised Sound Event Detection with Causal Intervention
Existing weakly supervised sound event detection (WSSED) work has not explored
both types of co-occurrence simultaneously: some sound events often co-occur
with one another, and their occurrences are usually accompanied by specific
background sounds. With only clip-level supervision, these factors inevitably
become entangled, causing misclassification and biased localization results.
To tackle this issue, we first establish a structural causal model
(SCM) to reveal that the context is the main cause of co-occurrence confounders
that mislead the model to learn spurious correlations between frames and
clip-level labels. Based on the causal analysis, we propose a causal
intervention (CI) method for WSSED to remove the negative impact of
co-occurrence confounders by iteratively accumulating every possible context of
each class and then re-projecting the contexts to the frame-level features for
making the event boundary clearer. Experiments show that our method effectively
improves the performance on multiple datasets and can generalize to various
baseline models.
Comment: Accepted by ICASSP202
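A highly simplified sketch of the class-wise context accumulation and re-projection described above might look as follows; the buffer update rule, the projection, and all names (context_bank, reproject) are assumptions made for illustration, not the authors' implementation.

```python
# Toy sketch: accumulate a running context embedding per class and re-project
# it onto frame-level features to sharpen event boundaries. All shapes,
# update rules, and names are illustrative assumptions.
import torch
import torch.nn.functional as F

num_classes, feat_dim = 10, 128
context_bank = torch.zeros(num_classes, feat_dim)   # one context vector per class
momentum = 0.9

def update_context(frame_feats, frame_probs):
    """frame_feats: (batch, time, feat); frame_probs: (batch, time, classes).
    Accumulate a probability-weighted average feature for every class."""
    global context_bank
    weights = frame_probs / (frame_probs.sum(dim=(0, 1), keepdim=True) + 1e-8)
    class_feats = torch.einsum("btc,btf->cf", weights, frame_feats)
    context_bank = momentum * context_bank + (1 - momentum) * class_feats

def reproject(frame_feats, frame_probs):
    """Subtract the probability-weighted accumulated context from each frame,
    a crude stand-in for removing co-occurrence confounders."""
    context = torch.einsum("btc,cf->btf", frame_probs, context_bank)
    return frame_feats - context

if __name__ == "__main__":
    feats = torch.randn(2, 50, feat_dim)
    probs = F.softmax(torch.randn(2, 50, num_classes), dim=-1)
    update_context(feats, probs)
    print(reproject(feats, probs).shape)   # (2, 50, 128)
```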
NADiffuSE: Noise-aware Diffusion-based Model for Speech Enhancement
The goal of speech enhancement (SE) is to eliminate the background
interference from the noisy speech signal. Generative models such as diffusion
models (DM) have been applied to the task of SE because of better
generalization in unseen noisy scenes. Technical routes for the DM-based SE
methods can be summarized into three types: task-adapted diffusion process
formulations, generator-plus-conditioner (GPC) structures, and multi-stage
frameworks. We focus on the first two approaches, which are constructed under
the GPC architecture and use the task-adapted diffusion process to better deal
with the real noise. However, the performance of these SE models is limited by
the following issues: (a) Non-Gaussian noise estimation in the task-adapted
diffusion process. (b) Conditional domain bias caused by the weak conditioner
design in the GPC structure. (c) Large amount of residual noise caused by
unreasonable interpolation operations during inference. To solve the above
problems, we propose a noise-aware diffusion-based SE model (NADiffuSE) to
boost SE performance: a noise representation is extracted from the noisy
speech signal and introduced as global conditional information for estimating
the non-Gaussian components. Furthermore, an anchor-based inference algorithm
is employed to achieve a compromise between speech distortion and residual
noise. To mitigate the performance degradation caused by the
conditional domain bias in the GPC framework, we investigate three model
variants, all of which can be viewed as multi-stage SE based on the
preprocessing networks for Mel spectrograms. Experimental results show that
NADiffuSE outperforms other DM-based SE models under the GPC infrastructure.
Audio samples are available at: https://square-of-w.github.io/NADiffuSE-demo/
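To make the noise-aware conditioning idea concrete, here is a minimal sketch in which a small encoder compresses the noisy input into a global noise embedding that conditions every denoising step of a diffusion-style network; the architecture, the NoiseEncoder module, and the FiLM-style conditioning are assumptions, not the NADiffuSE design.

```python
# Sketch: a global noise embedding extracted from the noisy signal is fed,
# together with the diffusion timestep, into the denoising network (FiLM-style).
# Module names and the conditioning scheme are illustrative assumptions.
import torch
import torch.nn as nn

class NoiseEncoder(nn.Module):
    """Compress the noisy spectrogram into a single global noise embedding."""
    def __init__(self, n_mels=80, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))

    def forward(self, noisy):                    # noisy: (batch, time, n_mels)
        return self.net(noisy).mean(dim=1)       # (batch, emb_dim)

class ConditionedDenoiser(nn.Module):
    """Toy denoiser predicting the noise component given x_t, the timestep,
    and the global noise embedding used as a FiLM-style condition."""
    def __init__(self, n_mels=80, emb_dim=128):
        super().__init__()
        self.time_emb = nn.Embedding(1000, emb_dim)
        self.film = nn.Linear(emb_dim, 2 * n_mels)
        self.backbone = nn.GRU(n_mels, n_mels, batch_first=True)

    def forward(self, x_t, t, noise_emb):
        cond = self.time_emb(t) + noise_emb                  # (batch, emb_dim)
        scale, shift = self.film(cond).chunk(2, dim=-1)      # (batch, n_mels) each
        h = x_t * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        out, _ = self.backbone(h)
        return out                                           # predicted noise

if __name__ == "__main__":
    noisy = torch.randn(2, 120, 80)
    x_t = torch.randn(2, 120, 80)
    t = torch.randint(0, 1000, (2,))
    emb = NoiseEncoder()(noisy)
    print(ConditionedDenoiser()(x_t, t, emb).shape)          # (2, 120, 80)
```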
NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS
Expressive text-to-speech (TTS) can synthesize a new speaking style by
imitating the prosody and timbre of a reference audio. This task faces the following
challenges: (1) The highly dynamic prosody information in the reference audio
is difficult to extract, especially, when the reference audio contains
background noise. (2) The TTS systems should have good generalization for
unseen speaking styles. In this paper, we present a noise-robust expressive
TTS model (NoreSpeech), which can robustly transfer the speaking style of a
noisy reference utterance to
synthesized speech. Specifically, our NoreSpeech includes several components:
(1) a novel DiffStyle module, which leverages powerful probabilistic denoising
diffusion models to learn noise-agnostic speaking style features from a teacher
model by knowledge distillation; (2) a VQ-VAE block, which maps the style
features into a controllable quantized latent space for improving the
generalization of style transfer; and (3) a straight-forward but effective
parameter-free text-style alignment module, which enables NoreSpeech to
transfer style to a textual input from a length-mismatched reference utterance.
Experiments demonstrate that NoreSpeech is more effective than previous
expressive TTS models in noisy environments. Audio samples and code are
available at: http://dongchaoyang.top/NoreSpeech_demo/
Comment: Submitted to ICASSP202
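A very rough sketch of the distillation-plus-quantization idea: a student encoder fed noisy audio is trained to reproduce the style features a frozen teacher extracts from clean audio, and the resulting features are snapped to a learned codebook (VQ). The encoders, codebook size, and loss are assumptions for illustration; in the paper the student is a denoising diffusion module rather than the plain regressor shown here.

```python
# Sketch: distill noise-agnostic style features from a clean-audio teacher into
# a student that only sees noisy audio, then vector-quantize the features.
# Encoders, codebook size, and losses are illustrative assumptions; the paper's
# DiffStyle student is a denoising diffusion model, not this plain regressor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleEncoder(nn.Module):
    def __init__(self, n_mels=80, style_dim=64):
        super().__init__()
        self.net = nn.GRU(n_mels, style_dim, batch_first=True)

    def forward(self, mel):                  # mel: (batch, time, n_mels)
        _, h = self.net(mel)
        return h.squeeze(0)                  # global style vector (batch, style_dim)

def vector_quantize(z, codebook):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    dists = torch.cdist(z, codebook)         # (batch, codebook_size)
    codes = codebook[dists.argmin(dim=-1)]   # quantized vectors
    return z + (codes - z).detach()          # straight-through estimator

if __name__ == "__main__":
    teacher, student = StyleEncoder(), StyleEncoder()
    codebook = nn.Parameter(torch.randn(256, 64))     # learnable VQ codebook
    clean, noisy = torch.randn(2, 100, 80), torch.randn(2, 100, 80)
    with torch.no_grad():
        target = teacher(clean)                       # noise-agnostic target style
    pred = vector_quantize(student(noisy), codebook)
    distill_loss = F.mse_loss(pred, target)           # knowledge distillation loss
    print(distill_loss.item())
```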
DPM-TSE: A Diffusion Probabilistic Model for Target Sound Extraction
Common target sound extraction (TSE) methods have primarily relied on
discriminative approaches that separate the target sound while minimizing
interference from unwanted sources, with varying success in separating the
target from the background. This study introduces DPM-TSE, the first
generative method based on diffusion probabilistic modeling (DPM) for
target sound extraction, to achieve both cleaner target renderings as well as
improved separability from unwanted sounds. The technique also tackles common
background noise issues with DPM by introducing a correction method for noise
schedules and sample steps. This approach is evaluated using both objective and
subjective quality metrics on the FSD Kaggle 2018 dataset. The results show
that DPM-TSE has a significant improvement in perceived quality in terms of
target extraction and purity.
Comment: Submitted to ICASSP 202
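For orientation, the block below shows a generic conditional DDPM ancestral sampling loop of the sort a diffusion-based extractor could use, with the mixture and a target-sound embedding as conditions; the denoiser, schedule values, and conditioning are placeholders and do not reproduce DPM-TSE's corrected noise schedule or sampling steps.

```python
# Generic conditional DDPM ancestral sampling loop: the denoiser is conditioned
# on the mixture and a target-sound embedding. Schedule, denoiser, and
# conditioning are placeholders; this is not DPM-TSE's corrected schedule.
import torch

T = 200
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def toy_denoiser(x_t, t, mixture, target_emb):
    """Stand-in for a trained network that predicts the added Gaussian noise."""
    return torch.zeros_like(x_t)              # a real model would be used here

@torch.no_grad()
def sample(mixture, target_emb):
    x = torch.randn_like(mixture)             # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = toy_denoiser(x, t, mixture, target_emb)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise   # simple sigma_t = sqrt(beta_t)
    return x                                   # estimate of the target spectrogram

if __name__ == "__main__":
    mixture = torch.randn(1, 100, 80)          # dummy mixture spectrogram
    target_emb = torch.randn(1, 128)           # dummy target-sound embedding
    print(sample(mixture, target_emb).shape)
```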
Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation
Large diffusion models have been successful in text-to-audio (T2A) synthesis
tasks, but they often suffer from common issues such as semantic misalignment
and poor temporal consistency due to limited natural language understanding and
data scarcity. Additionally, 2D spatial structures widely used in T2A works
lead to unsatisfactory audio quality when generating variable-length audio
samples since they do not adequately prioritize temporal information. To
address these challenges, we propose Make-an-Audio 2, a latent diffusion-based
T2A method that builds on the success of Make-an-Audio. Our approach includes
several techniques to improve semantic alignment and temporal consistency:
Firstly, we use pre-trained large language models (LLMs) to parse the text into
structured <event & order> pairs for better temporal information capture. We
also introduce another structured-text encoder to aid in learning semantic
alignment during the diffusion denoising process. To improve the performance of
variable length generation and enhance the temporal information extraction, we
design a feed-forward Transformer-based diffusion denoiser. Finally, we use
LLMs to augment and transform a large amount of audio-label data into
audio-text datasets to alleviate the problem of scarcity of temporal data.
Extensive experiments show that our method outperforms baseline models in both
objective and subjective metrics, and achieves significant gains in temporal
information understanding, semantic consistency, and sound quality.
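As a concrete illustration of the structured <event & order> representation, the snippet below shows how a free-form caption might be mapped to (event, order) pairs and flattened into a string for a structured-text encoder; the exact pair format is an assumption, and parse_with_llm is a placeholder for whatever LLM interface is actually used.

```python
# Illustrative only: represent a caption as structured (event, order) pairs and
# flatten them for a structured-text encoder. The pair format and the LLM call
# are assumptions; parse_with_llm is a placeholder for a real LLM interface.
from dataclasses import dataclass
from typing import List

@dataclass
class EventSpan:
    event: str    # e.g. "dog barking"
    order: str    # coarse temporal tag, e.g. "start", "mid", "end", "all"

def parse_with_llm(caption: str) -> List[EventSpan]:
    """Placeholder: in practice an LLM prompt would extract these pairs."""
    return [EventSpan("dog barking", "start"),
            EventSpan("car passing by", "mid"),
            EventSpan("birds chirping", "all")]

def to_structured_text(pairs: List[EventSpan]) -> str:
    """Flatten pairs into the string fed to the structured-text encoder."""
    return " | ".join(f"<{p.event} & {p.order}>" for p in pairs)

if __name__ == "__main__":
    caption = "A dog barks, then a car passes by while birds chirp throughout."
    print(to_structured_text(parse_with_llm(caption)))
    # <dog barking & start> | <car passing by & mid> | <birds chirping & all>
```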
Make-A-Voice: Unified Voice Synthesis With Discrete Representation
Various applications of voice synthesis have been developed independently,
despite the fact that they all produce "voice" as output. In addition,
the majority of voice synthesis models currently rely on annotated audio data,
but it is crucial to scale them to self-supervised datasets in order to
effectively capture the wide range of acoustic variations present in human
voice, including speaker identity, emotion, and prosody. In this work, we
propose Make-A-Voice, a unified framework for synthesizing and manipulating
voice signals from discrete representations. Make-A-Voice leverages a
"coarse-to-fine" approach to model the human voice, which involves three
stages: 1) semantic stage: model high-level transformation between linguistic
content and self-supervised semantic tokens, 2) acoustic stage: introduce
varying control signals as acoustic conditions for semantic-to-acoustic
modeling, and 3) generation stage: synthesize high-fidelity waveforms from
acoustic tokens. Make-A-Voice offers notable benefits as a unified voice
synthesis framework: 1) Data scalability: the major backbone (i.e., acoustic
and generation stage) does not require any annotations, and thus the training
data could be scaled up. 2) Controllability and conditioning flexibility: we
investigate different conditioning mechanisms and effectively handle three
voice synthesis applications, including text-to-speech (TTS), voice conversion
(VC), and singing voice synthesis (SVS) by re-synthesizing the discrete voice
representations with prompt guidance. Experimental results demonstrate that
Make-A-Voice exhibits superior audio quality and style similarity compared with
competitive baseline models. Audio samples are available at
https://Make-A-Voice.github.io
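The coarse-to-fine staging can be pictured as the small pipeline below, where each stage is a placeholder module mapping one representation to the next (text, then semantic tokens, then acoustic tokens, then waveform); the interfaces, vocabulary sizes, and token rates are assumptions, not the actual Make-A-Voice components.

```python
# Pipeline sketch of a coarse-to-fine discrete-token voice synthesizer:
# text -> semantic tokens -> acoustic tokens -> waveform. Every stage below is
# a random placeholder; interfaces and vocab sizes are illustrative assumptions.
import torch

SEMANTIC_VOCAB, ACOUSTIC_VOCAB, SAMPLE_RATE = 512, 1024, 16_000

def semantic_stage(text: str) -> torch.Tensor:
    """Map linguistic content to self-supervised semantic tokens (placeholder)."""
    n_tokens = max(1, len(text) // 2)
    return torch.randint(0, SEMANTIC_VOCAB, (n_tokens,))

def acoustic_stage(semantic_tokens: torch.Tensor, condition: dict) -> torch.Tensor:
    """Semantic-to-acoustic modeling; `condition` could carry a speaker prompt
    (TTS/VC) or a pitch contour (SVS) in the real system (placeholder)."""
    return torch.randint(0, ACOUSTIC_VOCAB, (semantic_tokens.numel() * 2,))

def generation_stage(acoustic_tokens: torch.Tensor) -> torch.Tensor:
    """Synthesize a waveform from acoustic tokens (placeholder vocoder)."""
    return torch.randn(acoustic_tokens.numel() * 320)   # ~320 samples per token

if __name__ == "__main__":
    sem = semantic_stage("hello world")
    ac = acoustic_stage(sem, condition={"speaker_prompt": "ref.wav"})
    wav = generation_stage(ac)
    print(sem.shape, ac.shape, wav.shape)
```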