1,353 research outputs found
Polyphonic sound event and sound activity detection: a multi-task approach
Polyphonic Sound Event Detection (SED) in real-world recordings is a challenging task because of the dynamic polyphony level, intensity, and duration of sound events. Current polyphonic SED systems fail to model the temporal structure of sound events explicitly and instead attempt to look at which sound events are present at each audio frame. Consequently, the event-wise detection performance is much lower than the segment-wise detection performance. In this work, we propose a joint model approach to improve the temporal localization of sound events using a multi-task learning setup. The first task predicts which sound events are present at each time frame; we call this branch 'Sound Event Detection (SED) model', while the second task predicts if a sound event is present or not at each frame; we call this branch 'Sound Activity Detection (SAD) model'. We verify the proposed joint model by comparing it with a separate implementation of both tasks aggregated together from individual task predictions. Our experiments on the URBAN-SED dataset show that the proposed joint model can alleviate False Positive (FP) and False Negative (FN) errors and improve both the segment-wise and the event-wise metrics
General Purpose Audio Effect Removal
Although the design and application of audio effects is well understood, the inverse problem of removing these effects is significantly more challenging and far less studied. Recently, deep learning has been applied to audio effect removal; however, existing approaches have focused on narrow formulations considering only one effect or source type at a time. In realistic scenarios, multiple effects are applied with varying source content. This motivates a more general task, which we refer to as general purpose audio effect removal. We developed a dataset for this task using five audio effects across four different sources and used it to train and evaluate a set of existing architectures. We found that no single model performed optimally on all effect types and sources. To address this, we introduced RemFX, an approach designed to mirror the compositionality of applied effects. We first trained a set of the best-performing effect-specific removal models and then leveraged an audio effect classification model to dynamically construct a graph of our models at inference. We found our approach to outperform single model baselines, although examples with many effects present remain challenging
Perceptual musical similarity metric learning with graph neural networks
Sound retrieval for assisted music composition depends on evaluating similarity between musical instrument sounds, which is partly influenced by playing techniques. Previous methods utilizing Euclidean nearest neighbours over acoustic features show some limitations in retrieving sounds sharing equivalent timbral properties, but potentially generated using a different instrument, playing technique, pitch or dynamic. In this paper, we present a metric learning system designed to approximate human similarity judgments between extended musical playing techniques using graph neural networks. Such structure is a natural candidate for solving similarity retrieval tasks, yet have seen little application in modelling perceptual music similarity. We optimize a Graph Convolutional Network (GCN) over acoustic features via a proxy metric learning loss to learn embeddings that reflect perceptual similarities. Specifically, we construct the graph's adjacency matrix from the acoustic data manifold with an example-wise adaptive k-nearest neighbourhood graph: Adaptive Neighbourhood Graph Neural Network (AN-GNN). Our approach achieves 96.4% retrieval accuracy compared to 38.5% with a Euclidean metric and 86.0% with a multilayer perceptron (MLP), while effectively considering retrievals from distinct playing techniques to the query example
Broadband DOA estimation using Convolutional neural networks trained with noise signals
A convolution neural network (CNN) based classification method for broadband
DOA estimation is proposed, where the phase component of the short-time Fourier
transform coefficients of the received microphone signals are directly fed into
the CNN and the features required for DOA estimation are learnt during
training. Since only the phase component of the input is used, the CNN can be
trained with synthesized noise signals, thereby making the preparation of the
training data set easier compared to using speech signals. Through experimental
evaluation, the ability of the proposed noise trained CNN framework to
generalize to speech sources is demonstrated. In addition, the robustness of
the system to noise, small perturbations in microphone positions, as well as
its ability to adapt to different acoustic conditions is investigated using
experiments with simulated and real data.Comment: Published in Proceedings of IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics (WASPAA) 201
Automated Audio Captioning with Recurrent Neural Networks
We present the first approach to automated audio captioning. We employ an
encoder-decoder scheme with an alignment model in between. The input to the
encoder is a sequence of log mel-band energies calculated from an audio file,
while the output is a sequence of words, i.e. a caption. The encoder is a
multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a
multi-layered GRU with a classification layer connected to the last GRU of the
decoder. The classification layer and the alignment model are fully connected
layers with shared weights between timesteps. The proposed method is evaluated
using data drawn from a commercial sound effects library, ProSound Effects. The
resulting captions were rated through metrics utilized in machine translation
and image captioning fields. Results from metrics show that the proposed method
can predict words appearing in the original caption, but not always correctly
ordered.Comment: Presented at the 11th IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics (WASPAA), 201
Multi-scale Multi-band DenseNets for Audio Source Separation
This paper deals with the problem of audio source separation. To handle the
complex and ill-posed nature of the problems of audio source separation, the
current state-of-the-art approaches employ deep neural networks to obtain
instrumental spectra from a mixture. In this study, we propose a novel network
architecture that extends the recently developed densely connected
convolutional network (DenseNet), which has shown excellent results on image
classification tasks. To deal with the specific problem of audio source
separation, an up-sampling layer, block skip connection and band-dedicated
dense blocks are incorporated on top of DenseNet. The proposed approach takes
advantage of long contextual information and outperforms state-of-the-art
results on SiSEC 2016 competition by a large margin in terms of
signal-to-distortion ratio. Moreover, the proposed architecture requires
significantly fewer parameters and considerably less training time compared
with other methods.Comment: to appear at WASPAA 201
- …