Approximate Message Passing for Underdetermined Audio Source Separation
Approximate message passing (AMP) algorithms have shown great promise in
sparse signal reconstruction due to their low computational requirements and
fast convergence to an exact solution. Moreover, they provide a probabilistic
framework that is often more intuitive than alternatives such as convex
optimisation. In this paper, AMP is used for audio source separation from
underdetermined instantaneous mixtures. In the time-frequency domain, it is
typical to assume a priori that the sources are sparse, so we solve the
corresponding sparse linear inverse problem using AMP. We present a block-based
approach that uses AMP to process multiple time-frequency points
simultaneously. In particular, two algorithms, AMP and vector AMP (VAMP), are
evaluated. Results show that they are promising in terms of artefact
suppression.
Comment: Paper accepted for 3rd International Conference on Intelligent Signal Processing (ISP 2017).
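As an illustration of the kind of iteration involved, the following is a minimal sketch of a generic AMP recursion with a soft-thresholding denoiser for a sparse linear model y = Ax + n; it is not the block-based implementation described above, and the threshold schedule is an illustrative assumption.

```python
import numpy as np

def soft_threshold(v, theta):
    """Element-wise soft-thresholding denoiser eta(v; theta)."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def amp_sparse_recovery(A, y, n_iter=50, alpha=1.5):
    """Generic AMP iteration for the sparse linear model y = A x + noise.

    A: (m, n) mixing matrix, y: (m,) observation vector.
    Returns an estimate of the sparse source vector x.
    """
    m, n = A.shape
    delta = m / n                       # undersampling ratio
    x = np.zeros(n)
    z = y.copy()                        # residual carrying the Onsager correction
    for _ in range(n_iter):
        theta = alpha * np.sqrt(np.mean(z ** 2))   # threshold from residual energy
        r = x + A.T @ z                 # pseudo-data passed to the denoiser
        x_new = soft_threshold(r, theta)
        # Onsager term: average derivative of the denoiser (fraction of active entries)
        onsager = np.mean(np.abs(x_new) > 0) / delta
        z = y - A @ x_new + onsager * z
        x = x_new
    return x
```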
DCASE 2018 Challenge Surrey Cross-Task convolutional neural network baseline
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2018
challenge consists of five audio classification and sound event detection tasks: 1)
Acoustic scene classification, 2) General-purpose audio tagging of Freesound,
3) Bird audio detection, 4) Weakly-labeled semi-supervised sound event
detection and 5) Multi-channel audio classification. In this paper, we create a
cross-task baseline system for all five tasks based on a convolutional neural
network (CNN): a "CNN Baseline" system. We implemented CNNs with 4 layers and 8
layers originating from AlexNet and VGG from computer vision. We investigated
how the performance varies from task to task with the same configuration of
neural networks. Experiments show that the deeper 8-layer CNN performs better
than the 4-layer CNN on all tasks except Task 1. Using the 8-layer CNN, we
achieve an accuracy of 0.680 on Task 1, an accuracy of 0.895 and a mean average
precision (MAP) of 0.928 on Task 2, an accuracy of 0.751 and an area under the
curve (AUC) of 0.854 on Task 3, a sound event detection F1 score of 20.8% on
Task 4, and an F1 score of 87.75% on Task 5. We released the Python source code
of the baseline systems under the MIT license for further research.
Comment: Accepted by DCASE 2018 Workshop. 4 pages. Source code available.
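For orientation, a minimal sketch of a VGG-style 8-layer CNN of the kind described above is given below (PyTorch); the layer sizes, pooling scheme, and clip-level aggregation are illustrative assumptions, not the released baseline code.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 conv layers with batch norm and ReLU, followed by 2x2 pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.MaxPool2d(2),
        )
    def forward(self, x):
        return self.net(x)

class Cnn8(nn.Module):
    """VGG-like 8-layer CNN over log-mel spectrograms of shape (batch, 1, time, mel)."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            ConvBlock(1, 64), ConvBlock(64, 128), ConvBlock(128, 256), ConvBlock(256, 512),
        )
        self.fc = nn.Linear(512, n_classes)
    def forward(self, x):
        h = self.features(x)              # (batch, 512, time', mel')
        h = torch.mean(h, dim=3)          # average over mel bins
        h, _ = torch.max(h, dim=2)        # max over time frames -> clip embedding
        return self.fc(h)                 # clip-level logits
```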
Weakly Labelled AudioSet Tagging with Attention Neural Networks
Audio tagging is the task of predicting the presence or absence of sound
classes within an audio clip. Previous work in audio tagging focused on
relatively small datasets limited to recognising a small number of sound
classes. We investigate audio tagging on AudioSet, which is a dataset
consisting of over 2 million audio clips and 527 classes. AudioSet is weakly
labelled, in that only the presence or absence of sound classes is known for
each clip, while the onset and offset times are unknown. To address the
weakly-labelled audio tagging problem, we propose attention neural networks as
a way to attend to the most salient parts of an audio clip. We draw a
connection between attention neural networks and multiple instance learning
(MIL) methods, and propose decision-level and feature-level attention neural
networks for audio tagging. We investigate attention neural networks modeled by
different functions, depths and widths. Experiments on AudioSet show that the
feature-level attention neural network achieves a state-of-the-art mean average
precision (mAP) of 0.369, outperforming the best multiple instance learning
(MIL) method of 0.317 and Google's deep neural network baseline of 0.314. In
addition, we discover that the audio tagging performance on AudioSet embedding
features has a weak correlation with the number of training samples and the
quality of labels of each sound class.
Comment: 13 pages.
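A minimal sketch of decision-level attention pooling in the multiple instance learning setting is given below; the module names and dimensions are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DecisionLevelAttention(nn.Module):
    """Attention pooling over segment-level predictions for weakly labelled tagging.

    Input: segment embeddings of shape (batch, n_segments, emb_dim),
    e.g. AudioSet bottleneck features. Output: clip-level class probabilities.
    """
    def __init__(self, emb_dim, n_classes):
        super().__init__()
        self.cla = nn.Linear(emb_dim, n_classes)   # segment-wise classifier
        self.att = nn.Linear(emb_dim, n_classes)   # segment-wise attention logits

    def forward(self, x):
        p = torch.sigmoid(self.cla(x))             # (batch, T, C) segment probabilities
        w = torch.softmax(self.att(x), dim=1)      # (batch, T, C) attention over segments
        clip_prob = torch.sum(w * p, dim=1)        # attention-weighted average -> clip level
        return torch.clamp(clip_prob, 1e-7, 1.0 - 1e-7)
```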
Learning with Out-of-Distribution Data for Audio Classification
In supervised machine learning, the assumption that training data is labelled
correctly is not always satisfied. In this paper, we investigate an instance of
labelling error for classification tasks in which the dataset is corrupted with
out-of-distribution (OOD) instances: data that does not belong to any of the
target classes, but is labelled as such. We show that detecting and relabelling
certain OOD instances, rather than discarding them, can have a positive effect
on learning. The proposed method uses an auxiliary classifier, trained on data
that is known to be in-distribution, for detection and relabelling. The amount
of data required for this is shown to be small. Experiments are carried out on
the FSDnoisy18k audio dataset, where OOD instances are very prevalent. The
proposed method is shown to improve the performance of convolutional neural
networks by a significant margin. Comparisons with other noise-robust
techniques are similarly encouraging.
Comment: Paper accepted for 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020).
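The detect-and-relabel idea can be sketched as follows, assuming a scikit-learn-style auxiliary classifier and a hypothetical confidence threshold; this is an illustration of the procedure, not the authors' released code.

```python
import numpy as np

def relabel_ood(aux_clf, X_ood: np.ndarray, threshold: float = 0.9):
    """Relabel out-of-distribution training examples using an auxiliary classifier.

    aux_clf: classifier trained only on data known to be in-distribution
             (any estimator exposing predict_proba).
    X_ood:   feature matrix of examples suspected to be OOD.
    Returns (kept_X, new_labels): examples confidently mapped to a target class
    together with their new labels; the remainder can be discarded or down-weighted.
    """
    proba = aux_clf.predict_proba(X_ood)      # (n_examples, n_classes)
    conf = proba.max(axis=1)
    pred = proba.argmax(axis=1)
    keep = conf >= threshold                  # only relabel confident cases
    return X_ood[keep], pred[keep]
```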
Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy
Sound event detection (SED) and localization refer to recognizing sound events and estimating their spatial and temporal locations. Using neural networks has become the prevailing method for SED. In the area of sound localization, which is usually performed by estimating the direction of arrival (DOA), learning-based methods have recently been developed. In this paper, it is experimentally shown that the trained SED model is able to contribute to the direction of arrival estimation (DOAE). However, joint training of SED and DOAE degrades the performance of both. Based on these results, a two-stage polyphonic sound event detection and localization method is proposed. The method learns SED first, after which the learned feature layers are transferred for DOAE. It then uses the SED ground truth as a mask to train DOAE. The proposed method is evaluated on the DCASE 2019 Task 3 dataset, which contains different overlapping sound events in different environments. Experimental results show that the proposed method is able to improve the performance of both SED and DOAE, and also performs significantly better than the baseline method.
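The masking step of the second stage can be sketched as follows: the DOA regression loss is accumulated only where the SED ground truth indicates an active event. The tensor shapes and the use of an L1 loss are illustrative assumptions.

```python
import torch

def masked_doa_loss(doa_pred, doa_true, sed_true):
    """DOA regression loss masked by the SED ground truth.

    doa_pred, doa_true: (batch, time, n_classes, 2) azimuth/elevation predictions and targets.
    sed_true:           (batch, time, n_classes) binary event-activity mask.
    Only frames/classes where an event is actually present contribute to the loss.
    """
    mask = sed_true.unsqueeze(-1)                    # broadcast over the angle dimension
    err = torch.abs(doa_pred - doa_true) * mask      # L1 error on active entries only
    return err.sum() / mask.sum().clamp(min=1.0)
```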
An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection
Polyphonic sound event localization and detection (SELD), which jointly
performs sound event detection (SED) and direction-of-arrival (DoA) estimation,
detects the type and occurrence time of sound events as well as their
corresponding DoA angles simultaneously. We study the SELD task from a
multi-task learning perspective. Two open problems are addressed in this paper.
Firstly, to detect overlapping sound events of the same type but with different
DoAs, we propose to use a trackwise output format and solve the accompanying
track permutation problem with permutation-invariant training. Multi-head
self-attention is further used to separate tracks. Secondly, a previous finding
is that, by using hard parameter-sharing, SELD suffers from a performance loss
compared with learning the subtasks separately. This is solved by a soft
parameter-sharing scheme. We refer to the proposed method as Event-Independent
Network V2 (EINV2), which is an improved version of our previously-proposed
method and an end-to-end network for SELD. We show that our proposed EINV2 for
joint SED and DoA estimation outperforms previous methods by a large margin,
and has comparable performance to state-of-the-art ensemble models.
Comment: 5 pages, 2021 IEEE International Conference on Acoustics, Speech and Signal Processing.
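The frame-level permutation-invariant training used for a two-track output format can be sketched as follows; the two-track tensor layout and the generic per-frame loss function are illustrative assumptions.

```python
import torch

def frame_level_pit_loss(pred, target, loss_fn):
    """Frame-level permutation-invariant training loss for a two-track output format.

    pred, target: (batch, time, 2, dim) - two tracks of per-frame predictions/targets.
    loss_fn:      per-frame loss reduced over the last dim (e.g. wrapped BCE or L1),
                  returning a tensor of shape (batch, time).
    For every frame, the loss is evaluated under both possible track assignments and
    the cheaper one is kept, so the network may output an event on either track.
    """
    loss_keep = loss_fn(pred[:, :, 0], target[:, :, 0]) + loss_fn(pred[:, :, 1], target[:, :, 1])
    loss_swap = loss_fn(pred[:, :, 0], target[:, :, 1]) + loss_fn(pred[:, :, 1], target[:, :, 0])
    return torch.minimum(loss_keep, loss_swap).mean()
```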
Event-Independent Network for Polyphonic Sound Event Localization and Detection
Polyphonic sound event localization and detection involves not only detecting
what sound events are happening but also localizing the corresponding sound
sources. This series of tasks was first introduced in DCASE 2019 Task 3. In 2020,
the sound event localization and detection task introduced additional challenges in
moving sound sources and overlapping-event cases, which include two events of
the same type with two different direction-of-arrival (DoA) angles. In this
paper, a novel event-independent network for polyphonic sound event
localization and detection is proposed. Unlike the two-stage method we proposed
in DCASE 2019 Task 3, this new network is fully end-to-end. Inputs to the
network are first-order Ambisonics (FOA) time-domain signals, which are then
fed into a 1-D convolutional layer to extract acoustic features. The network is
then split into two parallel branches. The first branch is for sound event
detection (SED), and the second branch is for DoA estimation. There are three
types of predictions from the network: SED predictions, DoA predictions, and
event activity detection (EAD) predictions, which are used to combine the SED
and DoA features for onset and offset estimation. All of these predictions use
a two-track format, indicating that there are at most two overlapping
events. Within each track, there could be at most one event happening. This
architecture introduces a problem of track permutation. To address this
problem, a frame-level permutation invariant training method is used.
Experimental results show that the proposed method can detect polyphonic sound
events and their corresponding DoAs. Its performance on the Task 3 dataset is
greatly improved compared with that of the baseline method.
Comment: conference.
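A minimal sketch of a two-branch, two-track SELD network of the kind described above is given below (PyTorch); the shared front end, layer sizes, and the Cartesian DoA output are illustrative assumptions, not the exact proposed architecture.

```python
import torch
import torch.nn as nn

class TwoBranchSELD(nn.Module):
    """Skeleton of a two-branch SELD network with a two-track output format.

    Input: features of shape (batch, channels, time, freq) extracted from FOA signals.
    Outputs per frame and per track: SED class probabilities and a DoA direction vector.
    """
    def __init__(self, in_ch, n_classes, hidden=128, n_tracks=2):
        super().__init__()
        self.n_tracks, self.n_classes = n_tracks, n_classes
        self.shared = nn.Sequential(nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
                                    nn.AvgPool2d((1, 4)))
        self.sed_branch = nn.GRU(hidden, hidden, batch_first=True)   # SED branch
        self.doa_branch = nn.GRU(hidden, hidden, batch_first=True)   # DoA branch
        self.sed_head = nn.Linear(hidden, n_tracks * n_classes)
        self.doa_head = nn.Linear(hidden, n_tracks * 3)              # (x, y, z) per track

    def forward(self, x):
        h = self.shared(x).mean(dim=3).transpose(1, 2)               # (batch, time, hidden)
        sed, _ = self.sed_branch(h)
        doa, _ = self.doa_branch(h)
        b, t, _ = sed.shape
        sed_out = torch.sigmoid(self.sed_head(sed)).view(b, t, self.n_tracks, self.n_classes)
        doa_out = torch.tanh(self.doa_head(doa)).view(b, t, self.n_tracks, 3)
        return sed_out, doa_out
```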
Experimental Investigation of Vacuum Membrane Distillation (VMD) Performance Based on Operational Parameters for Clean Water Production
Freshwater shortage is an ongoing concern across the world, due to increasing populations and climate change. Vacuum membrane distillation (VMD) is a viable approach for producing fresh water to meet the needs of society. In the current study, an experimental investigation has been conducted on a laboratory-scale single-stage module to explore the impact of operational parameters such as feed temperature, vacuum pressure, and feed salinity on the performance of VMD, including permeate flux, gained output ratio, and specific thermal energy consumption. Results show that increasing the feed temperature and feed flow rate, and reducing the salinity, increases the permeate flux. As the feed temperature rises from 60 to 70°C, the permeate flux increases from 1.90 to 4.36 kg/m²h at a permeate pressure of 12 kPa and salinity of 30 g/L. Conversely, increasing the vacuum pressure from 12 to 18 kPa reduces the permeate flux. As a result, the specific thermal energy consumption increases from 728 to 803 kWh/m³. From experimental findings, it was observed that the rejected brine from VMD retains sufficient energy that could be utilized in another desalination system.
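For reference, gained output ratio (GOR) and specific thermal energy consumption (STEC) can be computed from measured quantities using their usual definitions, as sketched below; the function, its default latent-heat value, and the assumed distillate density are illustrative assumptions, not the experimental procedure of this study.

```python
def vmd_performance(permeate_flux_kg_m2h, membrane_area_m2, heating_power_kW,
                    latent_heat_kJ_kg=2326.0):
    """Gained output ratio (GOR) and specific thermal energy consumption (STEC).

    Usual definitions:
        GOR  = (mass of distillate * latent heat of vaporisation) / thermal energy input
        STEC = thermal energy input / volume of distillate   [kWh/m^3]
    Assumes a distillate density of ~1000 kg/m^3 and steady-state operation over one hour.
    """
    m_dot = permeate_flux_kg_m2h * membrane_area_m2            # kg/h of distillate
    heat_per_hour_kWh = heating_power_kW                        # kWh of heat supplied per hour
    gor = (m_dot * latent_heat_kJ_kg / 3600.0) / heat_per_hour_kWh
    stec = heat_per_hour_kWh / (m_dot / 1000.0)                 # kWh per m^3 of distillate
    return gor, stec
```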
Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events
We tackle the task of environmental event classification by drawing
inspiration from the transformer neural network architecture used in machine
translation. We modify this attention-based feedforward structure so that the
resulting model can use audio as well as video to compute sound
event predictions. We perform extensive experiments with these adapted
transformers on an audiovisual data set, obtained by appending relevant visual
information to an existing large-scale weakly labeled audio collection. The
employed multi-label data contains clip-level annotation indicating the
presence or absence of 17 classes of environmental sounds, and does not include
temporal information. We show that the proposed modified transformers strongly
improve upon previously introduced models and in fact achieve state-of-the-art
results. We also make a compelling case for devoting more attention to research
in multimodal audiovisual classification by proving the usefulness of visual
information for the task at hand, namely audio event recognition. In addition,
we visualize internal attention patterns of the audiovisual transformers and in
doing so demonstrate their potential for performing multimodal synchronization.
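A minimal sketch of one audiovisual attention block, in which audio frames attend to video frames, is given below (PyTorch); the layer composition and dimensions are illustrative assumptions rather than the exact transformer variants evaluated in the paper.

```python
import torch
import torch.nn as nn

class CrossModalAttentionLayer(nn.Module):
    """One illustrative audiovisual attention block: audio frames attend to video frames.

    audio: (batch, T_audio, d_model), video: (batch, T_video, d_model).
    Returns audio features enriched with visual context.
    """
    def __init__(self, d_model=256, n_heads=4, d_ff=512):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, audio, video):
        a, _ = self.self_attn(audio, audio, audio)      # audio self-attention
        audio = self.norm1(audio + a)
        a, _ = self.cross_attn(audio, video, video)     # attend to visual frames
        audio = self.norm2(audio + a)
        return self.norm3(audio + self.ff(audio))       # position-wise feedforward
```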