Machine learning and audio processing: a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, Auckland, New Zealand
In this thesis, we address two important theoretical issues, in deep neural
networks and in clustering respectively, and develop a new approach for
polyphonic sound event detection, one of the most important applications in
audio processing.
The three novel approaches developed are:
(i) The Large Margin Recurrent Neural Network (LMRNN), which improves
the discriminative ability of standard Recurrent Neural Networks by
introducing a large margin term into the widely used cross-entropy loss
function. The large margin term applies the large-margin discriminative
principle as a heuristic that guides convergence during training, fully
exploiting the information in the data labels by considering both the
target category and the competing categories.
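As a rough illustration of the large-margin idea (not the thesis's exact LMRNN loss), the cross-entropy can be augmented with a hinge penalty that pushes the target logit above every competing logit by a chosen margin; the margin value and the hinge form here are illustrative assumptions:

```python
import numpy as np

def large_margin_cross_entropy(logits, target, margin=1.0):
    """Cross-entropy plus an additive hinge margin term: the target logit is
    encouraged to exceed every competing logit by at least `margin`.
    A generic sketch of the large-margin principle, not the LMRNN loss."""
    logits = np.asarray(logits, dtype=float)
    # standard cross-entropy term (stable log-softmax)
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    ce = -log_probs[target]
    # hinge-style penalty over all competing categories
    competing = np.delete(logits, target)
    margin_term = np.maximum(0.0, margin - (logits[target] - competing)).sum()
    return ce + margin_term
```

With a well-separated target logit the margin term vanishes and the loss reduces to plain cross-entropy; confusable logits are penalised extra.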
(ii) The Robust Multi-View Continuous Subspace Clustering (RMVCSC)
approach, which performs clustering on a common view-invariant
subspace learned from all views. The clustering result and the common
representation subspace are optimised simultaneously by a single
continuous objective function, in which a robust estimator
automatically clips spurious inter-cluster connections while
maintaining convincing intra-cluster correspondences. As a result,
RMVCSC can untangle heavily mixed clusters without pre-setting the
number of clusters.
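The role of such a robust estimator can be sketched with a standard M-estimator weight function; the Geman-McClure estimator below is a stand-in chosen for illustration, not necessarily the one used in RMVCSC:

```python
import numpy as np

def robust_connection_weights(distances, scale=1.0):
    """Down-weight connections with large residuals using the Geman-McClure
    influence weight w(r) = 1 / (1 + r^2)^2: distant (likely inter-cluster)
    links get weights near zero, while close intra-cluster links keep
    weights near one.  Illustrative stand-in for RMVCSC's estimator."""
    r = np.asarray(distances, dtype=float) / scale
    return 1.0 / (1.0 + r ** 2) ** 2
```

The weight decays smoothly with distance, so spurious long-range links are effectively clipped without a hard threshold or a preset cluster count.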
(iii) The novel polyphonic sound event detection approach based on the
Relational Recurrent Neural Network (RRNN), which uses the relational
reasoning ability of RRNNs to untangle overlapping sound events across
audio recordings. Unlike previous works, which mix and pack all
historical information into a single shared hidden memory vector, the
developed approach allows pieces of historical information to interact
with each other across an audio recording, which is effective and
efficient for untangling overlapping sound events.
All three approaches are tested on widely used datasets and compared with
recently published works. The experimental results demonstrate the
effectiveness and efficiency of the developed approaches.
Audio Event-Relational Graph Representation Learning for Acoustic Scene Classification
Most deep learning-based acoustic scene classification (ASC) approaches identify scenes based on acoustic features converted from audio clips containing mixed information entangled by polyphonic audio events (AEs). However, these approaches have difficulty explaining what cues they use to identify scenes. This letter conducts the first study on disclosing the relationship between real-life acoustic scenes and semantic embeddings from the most relevant AEs. Specifically, we propose an event-relational graph representation learning (ERGL) framework for ASC that classifies scenes and simultaneously indicates, clearly and directly, which cues are used in classification. In the event-relational graph, the embedding of each event is treated as a node, while relationship cues derived from each pair of nodes are described by multi-dimensional edge features. Experiments on a real-life ASC dataset show that the proposed ERGL achieves competitive performance on ASC by learning embeddings of only a limited number of AEs. The results show the feasibility of recognizing diverse acoustic scenes based on the audio event-relational graph.
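A minimal sketch of such a graph construction, assuming element-wise difference and product as the multi-dimensional edge features (the ERGL paper's actual edge features may differ):

```python
import numpy as np

def build_event_graph(event_embeddings):
    """Build a fully connected event-relational graph: each audio-event
    embedding is a node; each directed edge carries a multi-dimensional
    feature derived from its endpoint pair.  Here the edge feature is the
    concatenated element-wise difference and product, an illustrative
    choice only."""
    X = np.asarray(event_embeddings, dtype=float)
    n, d = X.shape
    edges = np.zeros((n, n, 2 * d))
    for i in range(n):
        for j in range(n):
            edges[i, j] = np.concatenate([X[i] - X[j], X[i] * X[j]])
    return X, edges
```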
Cooperative Scene-Event Modelling for Acoustic Scene Classification
Acoustic scene classification (ASC) can be helpful for creating context awareness for intelligent robots. Humans naturally use the relations between acoustic scenes (AS) and audio events (AE) to understand and recognize their surrounding environments. However, in most previous works, ASC and audio event classification (AEC) are treated as independent tasks, with a focus primarily on audio features shared between scenes and events, but not their implicit relations. To address this limitation, we propose a cooperative scene-event modelling (cSEM) framework that automatically models the intricate scene-event relation by an adaptive coupling matrix to improve ASC. Compared with other scene-event modelling frameworks, the proposed cSEM offers the following advantages. First, it reduces the confusion between similar scenes by aligning the information of coarse-grained AS and fine-grained AE in the latent space, and by reducing the redundant information between the AS and AE embeddings. Second, it exploits the relation information between AS and AE to improve ASC, which is shown to be beneficial even if the AE information is derived from unverified pseudo-labels. Third, it uses a regression-based loss function for cooperative modelling of scene-event relations, which is shown to be more effective than classification-based loss functions. Instantiated as four models based on either Transformer or convolutional neural networks, cSEM is evaluated on real-life and synthetic datasets. Experiments show that cSEM-based models work well in real-life scene-event analysis, offering competitive results on ASC compared with other multi-feature or multi-model ensemble methods. The ASC accuracy achieved on the TUT2018, TAU2019, and JSSED datasets is 81.0%, 88.9%, and 97.2%, respectively.
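The idea of a coupling matrix linking scene and event embeddings can be illustrated with a plain least-squares fit: each scene embedding is approximated as a linear mixture of event embeddings. cSEM learns its adaptive coupling jointly with the networks, so this is only a structural sketch:

```python
import numpy as np

def scene_event_coupling(scene_emb, event_emb):
    """Fit a coupling matrix C by least squares so that S ~ C @ E, where
    S holds scene embeddings (num_scenes, dim) and E holds event
    embeddings (num_events, dim).  Row C[s] then reads as the mixture of
    events associated with scene s.  A minimal sketch of the coupling
    idea, not the learned adaptive matrix of cSEM."""
    S = np.asarray(scene_emb, dtype=float)
    E = np.asarray(event_emb, dtype=float)
    # solve E.T @ C.T ~ S.T in the least-squares sense
    C_T, *_ = np.linalg.lstsq(E.T, S.T, rcond=None)
    return C_T.T  # shape (num_scenes, num_events)
```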
SwG-former: Sliding-window Graph Convolutional Network Integrated with Conformer for Sound Event Localization and Detection
Sound event localization and detection (SELD) is a joint task of sound event
detection (SED) and direction of arrival (DoA) estimation. SED mainly relies on
temporal dependencies to distinguish different sound classes, while DoA
estimation depends on spatial correlations to estimate source directions. To
jointly optimize two subtasks, the SELD system should extract spatial
correlations and model temporal dependencies simultaneously. However, numerous
models mainly extract spatial correlations and model temporal dependencies
separately. In this paper, the interdependence of spatial-temporal information
in audio signals is exploited for simultaneous extraction to enhance the model
performance. In response, a novel graph representation leveraging graph
convolutional network (GCN) in non-Euclidean space is developed to extract
spatial-temporal information concurrently. A sliding-window graph (SwG) module
is designed based on the graph representation. It exploits sliding-windows with
different sizes to learn temporal context information and dynamically
constructs graph vertices in the frequency-channel (F-C) domain to capture
spatial correlations. Furthermore, as the cornerstone of message passing, a
robust Conv2dAgg function is proposed and embedded into the SwG module to
aggregate the features of neighbor vertices. To improve the performance of SELD
in a natural spatial acoustic environment, a general and efficient SwG-former
model is proposed by integrating the SwG module with the Conformer. It exhibits
superior performance in comparison to recent advanced SELD models. To further
validate the generality and efficiency of the SwG-former, it is seamlessly
integrated into the event-independent network version 2 (EINV2) called
SwG-EINV2. The SwG-EINV2 surpasses state-of-the-art (SOTA) methods under
the same acoustic environment.
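The sliding-window vertex construction can be sketched structurally as follows; the window sizes and the flattening of frequency-channel bins into vertices are illustrative, and the Conv2dAgg aggregation itself is omitted:

```python
import numpy as np

def sliding_window_graphs(features, window_sizes=(3, 5)):
    """Cut a (time, freq, channel) feature tensor into overlapping
    temporal windows of several sizes (stride 1); within each window,
    every frequency-channel bin of a frame becomes a graph vertex.
    A structural sketch of the SwG idea only; the paper's vertex
    construction and message passing are more elaborate."""
    T, F, C = features.shape
    graphs = []
    for w in window_sizes:
        for t0 in range(0, T - w + 1):
            window = features[t0:t0 + w]          # (w, F, C)
            vertices = window.reshape(w, F * C)   # F-C bins as vertices
            graphs.append(vertices)
    return graphs
```

Using several window sizes gives each graph a different temporal context, which is how the module captures both short- and long-range dependencies.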
Intermodal Interaction in Deep Neural Networks for Audio-Visual Event Classification and Localization
The automatic understanding of the surrounding world has a wide range of applications, including surveillance, human-computer interaction, robotics, health care, etc. This understanding can be expressed in several ways, such as event classification and localization in space. Living beings exploit as much of the available information as possible to understand their surroundings; artificial neural networks should build on this behaviour and jointly use several modalities, such as vision and hearing.
First, audio-visual models for classification and localization must be evaluated objectively. We therefore recorded a new audio-visual dataset to fill a gap in the currently available datasets. Since no audio-visual model for classification and localization exists, only the audio part of the dataset is evaluated, with a model from the literature.
Second, we focus on the core of the thesis: how to jointly use visual and audio information to solve a specific task, event recognition. The brain does not perform a "simple" fusion but involves multiple interactions between the two modalities, creating a strong coupling between them. Neural networks offer the possibility of creating interactions between the modalities in addition to fusion. In this thesis, we explore several strategies to fuse the visual and audio modalities and to create interactions between them. These techniques achieved the best performance compared with state-of-the-art architectures at the time of publication, showing the usefulness of audio-visual fusion and, above all, the importance of interactions between the modalities.
To conclude the thesis, we propose a reference network for audio-visual event classification and localization, tested on the new dataset. Previous classification models are modified to address localization in space in addition to classification.
Joint Prediction of Audio Event and Annoyance Rating in an Urban Soundscape by Hierarchical Graph Representation Learning
Sound events in daily life carry rich information about the objective world, and the composition of these sounds affects the mood of people in a soundscape. Most previous approaches focus only on classifying and detecting audio events and scenes, but ignore their perceptual quality, which may affect humans' listening experience of the environment, e.g. annoyance. To this end, this paper proposes a novel hierarchical graph representation learning (HGRL) approach which links objective audio events (AE) with the subjective annoyance ratings (AR) of the soundscape perceived by humans. The hierarchical graph consists of fine-grained event (fAE) embeddings with single-class event semantics, coarse-grained event (cAE) embeddings with multi-class event semantics, and AR embeddings. Experiments show that the proposed HGRL successfully integrates AE with AR for audio event classification (AEC) and annoyance rating prediction (ARP) tasks, while coordinating the relations between cAE and fAE and further aligning the two grains of AE information with the AR.
Robust sound event detection in bioacoustic sensor networks
Bioacoustic sensors, sometimes known as autonomous recording units (ARUs),
can record sounds of wildlife over long periods of time in scalable and
minimally invasive ways. Deriving per-species abundance estimates from these
sensors requires detection, classification, and quantification of animal
vocalizations as individual acoustic events. Yet, variability in ambient noise,
both over time and across sensors, hinders the reliability of current automated
systems for sound event detection (SED), such as convolutional neural networks
(CNN) in the time-frequency domain. In this article, we develop, benchmark, and
combine several machine listening techniques to improve the generalizability of
SED models across heterogeneous acoustic environments. As a case study, we
consider the problem of detecting avian flight calls from a ten-hour recording
of nocturnal bird migration, recorded by a network of six ARUs in the presence
of heterogeneous background noise. Starting from a CNN yielding
state-of-the-art accuracy on this task, we introduce two noise adaptation
techniques, respectively integrating short-term (60 milliseconds) and long-term
(30 minutes) context. First, we apply per-channel energy normalization (PCEN)
in the time-frequency domain, which applies short-term automatic gain control
to every subband in the mel-frequency spectrogram. Second, we replace the
last dense layer in the network by a context-adaptive neural network (CA-NN)
layer. Combining them yields state-of-the-art results that are unmatched by
artificial data augmentation alone. We release a pre-trained version of our
best performing system under the name of BirdVoxDetect, a ready-to-use detector
of avian flight calls in field recordings.Comment: 32 pages, in English. Submitted to PLOS ONE journal in February 2019;
revised August 2019; published October 201
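The PCEN front end described above is compact enough to write out directly. The recursion below follows the standard formulation (a first-order smoother M feeding an automatic-gain-control stage with exponent alpha, then root compression r), with common default parameters that are not necessarily those used for BirdVoxDetect:

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization over a (bands, frames) mel
    spectrogram E.  A one-pole IIR filter tracks the per-band energy M,
    which gates E via short-term automatic gain control, followed by a
    stabilized root compression."""
    E = np.asarray(E, dtype=float)
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):
        # smoothed energy: exponential moving average per band
        M[:, t] = (1.0 - s) * M[:, t - 1] + s * E[:, t]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```

Because M lags E, sudden onsets (such as flight calls) stand out against the slowly varying background, which is the noise-adaptation effect exploited in the article.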
Signal processing techniques for robust sound event recognition
The computational analysis of acoustic scenes is today a topic of major interest, with a growing community focused on designing machines capable of identifying and understanding the sounds produced in our environment, similar to how humans perform this task. Although these domains have not reached the industrial popularity of other related audio domains, such as speech recognition or music analysis, applications designed to identify the occurrence of sounds in a given scenario are rapidly increasing. These applications are usually limited to a set of sound classes, which must be defined beforehand. In order to train sound classification models, representative sets of sound events are recorded and used as training data. However, the acoustic conditions present during the collection of training examples may not coincide with the conditions during application testing. Background noise, overlapping sound events or weakly segmented data, among others, may substantially affect audio data, lowering the actual performance of the learned models. To avoid such situations, machine learning systems have to be designed with the ability to generalize to data collected under conditions different from the ones seen during training.
Traditionally, the techniques used to carry out tasks related to the computational understanding of sound events have been inspired by similar domains such as music or speech, so the features selected to represent acoustic events come from those specific domains. Most of the contributions of this thesis concern how such features can be suitably applied to sound event recognition, proposing specific methods to adapt the extracted features both within classical recognition approaches and within modern end-to-end convolutional neural networks. The objective of this thesis is therefore to develop novel signal processing techniques that increase the robustness of the features representing acoustic events to adverse conditions, reducing the mismatch between training and test conditions in model learning. To achieve this objective, we first analyze the importance of classical feature sets such as Mel-frequency cepstral coefficients (MFCCs) or the energies extracted from log-mel filterbanks, as well as the impact of noise, reverberation, and segmentation errors in diverse scenarios. We show that the performance of both classical and deep learning-based approaches is severely affected by these factors, and we propose novel signal processing techniques designed to improve their robustness by means of a non-linear transformation of feature vectors along the temporal axis. This transformation is based on the so-called event trace, which can be interpreted as an indicator of the temporal activity of the event within the feature space. Finally, we propose the use of the energy envelope as a target for event detection, which implies a change from a classification-based approach to a regression-oriented one.
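The proposed switch from binary event labels to an energy-based regression target can be illustrated by computing a smoothed frame-wise energy envelope; the frame handling and the smoothing constant here are illustrative choices, not the thesis's exact definition:

```python
import numpy as np

def energy_envelope(frames, smooth=0.8):
    """Frame-wise energy envelope of a signal already cut into frames
    (shape: num_frames x frame_length), smoothed with a one-pole filter.
    This continuous curve is the kind of regression target proposed in
    place of binary per-frame event-presence labels."""
    energies = (np.asarray(frames, dtype=float) ** 2).mean(axis=1)
    env = np.empty_like(energies)
    env[0] = energies[0]
    for t in range(1, len(energies)):
        env[t] = smooth * env[t - 1] + (1.0 - smooth) * energies[t]
    return env
```

A detector trained to regress this envelope predicts how strongly an event is present, not merely whether it is present, which is more tolerant of weak segmentation.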
Weakly Supervised and Semi-Supervised Deep Learning for Sound Event Detection
The amount of information produced by media such as Youtube, Facebook, or Instagram is a gold mine for machine and deep learning algorithms, but one that cannot be tapped until the information has been refined. For supervised algorithms, each available piece of information must be associated with a label that allows it to be identified and used. This is a tedious, slow, and costly task, performed by human annotators on a voluntary or professional basis. However, the amount of information generated each day far exceeds our human annotation capabilities, so we must turn to learning methods capable of using information in its raw or lightly processed form. This problem is at the heart of this thesis: the first part exploits weak human annotations, and the second part exploits partially annotated data. The detection of sound events in a polyphonic environment is a difficult problem: sound events overlap, repeat, and vary in the frequency domain, even within a single category. All these difficulties make the annotation task even more challenging, not only for a human annotator but also for systems trained for classification. Semi-supervised audio classification, i.e. when a significant part of the dataset has not been annotated, is another proposed solution to the problem of the huge amount of data generated every day. Semi-supervised deep learning methods are numerous and use different mechanisms to implicitly extract information from unannotated data, making it useful and directly usable.
The objectives of this thesis are twofold. First, to study and propose weakly supervised approaches for the sound event detection task through our participation in task four of the DCASE international challenge, which provides realistic, weakly labelled audio recordings of domestic scenes. To solve this task, we propose two solutions based on convolutional recurrent neural networks and on statistical assumptions constraining the training. Second, we focus on semi-supervised deep learning, where most of the information is not annotated. We compare approaches originally developed for image classification before proposing their application to audio classification. We show that the most recent approaches can achieve results as good as fully supervised training with access to all annotations.
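As one concrete member of the semi-supervised family compared in such work (not the thesis's specific algorithm), confidence-based pseudo-labelling keeps only those unlabelled examples whose prediction is sufficiently confident:

```python
import numpy as np

def pseudo_label(probs, threshold=0.95):
    """Assign hard pseudo-labels to unlabelled examples whose maximum
    predicted class probability exceeds a confidence threshold; the rest
    are excluded from the supervised loss.  A generic FixMatch-style
    recipe, shown to illustrate how unannotated data is made usable."""
    probs = np.asarray(probs, dtype=float)
    conf = probs.max(axis=1)          # confidence of the top prediction
    labels = probs.argmax(axis=1)     # candidate hard labels
    mask = conf >= threshold
    return labels[mask], np.flatnonzero(mask)
```

The retained (example, pseudo-label) pairs are then fed back into an ordinary supervised loss, letting unannotated recordings contribute to training.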