Infant Cry Signal Processing, Analysis, and Classification with Artificial Neural Networks
Infant cry, a special type of speech and environmental sound, has been a growing research area over the past two decades, covering cry-reason classification, pathological cry identification, and cry detection. In this dissertation, we build a new dataset, explore new feature extraction methods, and propose novel classification approaches to improve infant cry classification accuracy and to identify diseases by learning from infant cry signals.
We propose a method that generates weighted prosodic features combined with acoustic features for a deep learning model to improve asphyxiated infant cry identification. The combined feature matrix captures the diversity of variations within infant cries, and the result outperforms all other related studies on asphyxiated baby cry classification.

We propose a fast, non-invasive method that uses infant cry signals with convolutional neural network (CNN) based age classification to diagnose abnormal infant vocal tract development as early as 4 months of age. Experiments uncover the pattern and tendency of vocal tract changes and predict vocal tract abnormality by classifying cry signals into a younger age category.

We propose an approach that generates a hybrid feature set and uses prior knowledge in a multi-stage CNN model for robust infant sound classification. The dominant and auxiliary features within the set enlarge the coverage while keeping a good resolution for modeling the diversity of variations within infant sound, and the experimental results give encouraging improvements on two related databases.

We propose a graph convolutional network (GCN) with transfer learning for robust infant cry reason classification. Non-fully-connected graphs, built from the similarities among relevant nodes, capture the short-term and long-term effects of infant cry signals related to intra-class and inter-class messages. With as little as 20% labeled training data, our model outperforms a CNN model trained with 80% labeled data in both supervised and semi-supervised settings.

Lastly, we apply mel-spectrogram decomposition to infant cry classification and propose a fusion method to further improve classification performance.
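As an illustrative sketch (not the dissertation's actual pipeline), combining frame-level prosodic-style and acoustic-style features into a single matrix might look like the following; the frame length, hop size, and the specific features (log energy, zero-crossing rate, spectral centroid) are assumptions for illustration:

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Slice a 1-D signal into overlapping frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def combined_features(x, frame_len=512, hop=256):
    """Stack a prosodic-style feature (log energy) with acoustic-style
    features (zero-crossing rate, spectral centroid) per frame."""
    frames = frame_signal(x, frame_len, hop)
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)            # prosodic proxy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame_len)
    centroid = (spec * freqs).sum(axis=1) / (spec.sum(axis=1) + 1e-10)
    return np.stack([energy, zcr, centroid], axis=1)                # (n_frames, 3)

# Example: a synthetic 1-second "cry-like" tone at 16 kHz
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000) + 0.1 * rng.standard_normal(16000)
F = combined_features(x)
print(F.shape)  # (61, 3)
```

In practice, each row of such a matrix would feed the deep model as one time step.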
Neural Networks for Analysing Music and Environmental Audio
In this thesis, we consider the analysis of music and environmental audio
recordings with neural networks. Recently, neural networks have been
shown to be an effective family of models for speech recognition, computer
vision, natural language processing and a number of other statistical modelling
problems. The composite layer-wise structure of neural networks
allows for flexible model design, where prior knowledge about the domain
of application can be used to inform the design and architecture of the
neural network models. Additionally, it has been shown that when trained
on sufficient quantities of data, neural networks can be directly applied to
low-level features to learn mappings to high level concepts like phonemes
in speech and object classes in computer vision. In this thesis we investigate
whether neural network models can be usefully applied to processing
music and environmental audio.
With regard to music signal analysis, we investigate two different problems.
The first problem, automatic music transcription, aims to identify the
score, or sequence of musical notes, that comprises an audio recording.
We also consider the problem of automatic chord transcription, where the
aim is to identify the sequence of chords in a given audio recording. For
both problems, we design neural network acoustic models which are applied
to low-level time-frequency features in order to detect the presence of
notes or chords. Our results demonstrate that the neural network acoustic
models perform similarly to state-of-the-art acoustic models, without the
need for any feature engineering. The networks are able to learn complex
transformations from time-frequency features to the desired outputs, given
sufficient amounts of training data. Additionally, we use recurrent neural
networks to model the temporal structure of sequences of notes or chords,
similar to language modelling in speech. Our results demonstrate that
the combination of the acoustic and language model predictions yields
improved performance over the acoustic models alone. We also observe
that convolutional neural networks yield better performance compared to
other neural network architectures for acoustic modelling.
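The combination of acoustic and language-model predictions described above can be illustrated with a minimal Viterbi decoder, where a transition model smooths noisy frame-wise acoustic posteriors. The two-state note-on/note-off setup and all probabilities below are invented for illustration, not taken from the thesis:

```python
import numpy as np

def viterbi_decode(acoustic_logp, trans_logp, init_logp):
    """Combine per-frame acoustic log-posteriors with a simple
    transition ("language") model via Viterbi decoding."""
    T, S = acoustic_logp.shape
    delta = init_logp + acoustic_logp[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + trans_logp            # (prev_state, next_state)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + acoustic_logp[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two states ("note off"/"note on"): sticky transitions override a
# noisy acoustic estimate in the middle frame of a steady segment.
ac = np.log(np.array([[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]]))
tr = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
init = np.log(np.array([0.5, 0.5]))
print(viterbi_decode(ac, tr, init))  # → [0, 0, 0]
```

Decoding the acoustic posteriors alone would flip the middle frame; the transition model's preference for staying in the same state corrects it, which is the intuition behind the improved combined performance.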
For the analysis of environmental audio recordings, we consider the problem
of acoustic event detection. Acoustic event detection has a similar
structure to automatic music and chord transcription, where the system
is required to output the correct sequence of semantic labels along with
onset and offset times. We compare the performance of neural network
architectures against Gaussian mixture models and support vector machines.
In order to account for the fact that such systems are typically
deployed on embedded devices, we compare performance as a function of
the computational cost of each model. We evaluate the models on two large
datasets of real-world recordings of baby cries and smoke alarms. Our results
demonstrate that the neural networks clearly outperform the other
models, and they are able to do so without incurring a heavy computational
cost.
Spectro-ViT: A Vision Transformer Model for GABA-edited MRS Reconstruction Using Spectrograms
Purpose: To investigate the use of a Vision Transformer (ViT) to
reconstruct/denoise GABA-edited magnetic resonance spectroscopy (MRS) from a
quarter of the typically acquired number of transients using spectrograms.
Theory and Methods: A quarter of the typically acquired number of transients
collected in GABA-edited MRS scans are pre-processed and converted to a
spectrogram image representation using the Short-Time Fourier Transform (STFT).
The image representation of the data allows the adaptation of a pre-trained ViT
for reconstructing GABA-edited MRS spectra (Spectro-ViT). The Spectro-ViT is
fine-tuned and then tested using in vivo GABA-edited MRS data. The
Spectro-ViT performance is compared against other models in the literature
using spectral quality metrics and estimated metabolite concentration values.
Results: The Spectro-ViT model significantly outperformed all other models in
four out of five quantitative metrics (mean squared error, shape score,
GABA+/water fit error, and full width at half maximum). The metabolite
concentrations estimated (GABA+/water, GABA+/Cr, and Glx/water) were consistent
with the metabolite concentrations estimated using typical GABA-edited MRS
scans reconstructed with the full amount of typically collected transients.
Conclusion: The proposed Spectro-ViT model achieved state-of-the-art results
in reconstructing GABA-edited MRS, and the results indicate these scans could
be up to four times faster.
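The spectrogram image representation underlying Spectro-ViT can be sketched with a plain numpy STFT; the window choice and sizes below are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def spectrogram(x, n_fft=256, hop=128):
    """Short-Time Fourier Transform magnitude: turns a 1-D signal
    into a 2-D, image-like (frequency x time) representation."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T    # (freq_bins, time_frames)

fs = 2000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 300 * t)                     # a pure 300 Hz tone
S = spectrogram(x)
print(S.shape)                                      # (129, 14)
peak_bin = int(np.argmax(S.mean(axis=1)))
print(peak_bin * fs / 256)                          # ≈ 300 Hz
```

Once the transients are in this 2-D form, a pre-trained vision model such as a ViT can be applied to them like an ordinary image.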
Classification of Sound Scenes and Events in Real-World Scenarios with Deep Learning Techniques
The classification of sound events is a field of machine listening that is becoming increasingly interesting due to the large number of applications that could benefit from this technology.
Unlike other fields of machine listening related to music information retrieval or speech recognition, sound event classification has a number of intrinsic problems. These problems are the polyphonic nature of most environmental sound recordings, the difference in the nature of each sound, the lack of temporal structure, and the addition of background noise and reverberation in the recording process. These problems remain fields of study for the scientific community today. However, it should be noted that when a machine listening solution is deployed in real environments, a number of extra problems may arise. These problems are Open-Set Recognition (OSR), Few-Shot Learning (FSL), and consideration of system runtime (low complexity). OSR is defined as the problem that appears when an artificial intelligence system has to face an unknown situation in which classes unseen during the training stage are present at inference time. FSL corresponds to the problem that occurs when there are very few samples available for each considered class. Finally, since these systems are normally deployed on edge devices, execution time must be taken into account: the less time the system takes to give a response, the better the experience perceived by the users. Solutions based on deep learning techniques for similar problems in the image domain have shown promising results. The most widespread solutions are those that implement Convolutional Neural Networks (CNNs). Therefore, many state-of-the-art audio systems propose converting audio signals into a two-dimensional representation that can be treated as an image. The generation of internal feature maps is often done by the convolutional layers of the CNNs. However, these layers have a series of limitations that must be studied in order to propose techniques for improving the resulting feature maps.
To this end, novel networks have been proposed that merge two different methods, residual learning and squeeze-and-excitation techniques. The results show an improvement in the accuracy of the system with the addition of only a small number of extra parameters. On the other hand, these solutions based on two-dimensional inputs can show a certain bias, since the choice of audio representation can be specific to a particular task. Therefore, a comparative study of different residual networks fed directly by the raw audio signal has been carried out. These solutions are known as end-to-end. While similar studies have been carried out in the image-domain literature, the results suggest that the best-performing residual blocks for computer vision tasks may not be the same as those for audio classification. Regarding the FSL and OSR problems, an autoencoder-based framework capable of mitigating both problems together is proposed. This solution is capable of creating robust representations of these audio patterns from just a few samples while being able to reject unwanted audio classes.
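The squeeze-and-excitation recalibration mentioned above can be sketched in a few lines of numpy; the reduction ratio and random weights are placeholders for illustration, not the thesis's trained parameters:

```python
import numpy as np

def squeeze_excitation(fmap, w1, w2):
    """Squeeze-and-excitation recalibration of CNN feature maps.
    fmap: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    z = fmap.mean(axis=(1, 2))                  # squeeze: global average pool
    s = np.maximum(w1 @ z, 0.0)                 # excitation: FC + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))         # FC + sigmoid -> channel weights
    return fmap * s[:, None, None]              # rescale each channel

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                          # r is the reduction ratio
fmap = rng.standard_normal((C, H, W))
out = squeeze_excitation(fmap,
                         rng.standard_normal((C // r, C)),
                         rng.standard_normal((C, C // r)))
print(out.shape)  # (8, 4, 4)
```

The block adds only two small fully-connected weight matrices per stage, which is why such recalibration improves accuracy at the cost of few extra parameters.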
Contributions to Image Sonification and Sound Classification
The objective of this thesis is to study on the one hand the problem of image sonification
and to solve it through new models of mapping between the visual and sound domains,
and on the other hand to study the problem of sound classification and to solve it with
methods that have a proven track record in the field of image recognition.
Image sonification is the translation of image data (shape, color, texture, objects)
into sounds. It is used in vision assistance and image accessibility for visually
impaired people. Due to its complexity, an image sonification system that intuitively
and faithfully translates image data into sound is not easy to design.
Our first contribution is a new low-level image sonification system that
uses a hierarchical, visual-feature-based approach to translate, using musical notes,
most of the properties of an image (color, gradient, edge, texture, region) to the audio
domain, in a very predictable way that is then easily decodable by a human listener.
Our second contribution is a high-level sonification Android application that
complements our first contribution by implementing the translation of the objects
and semantic content of an image to the audio domain. It also provides a dataset
for image sonification.
Finally, in the audio domain, our third contribution generalizes the Local Binary
Pattern (LBP) to 1D and combines it with audio features for an environmental sound
classification task. The proposed method outperforms methods that use
handcrafted features with classical machine learning algorithms and is faster than any
convolutional neural network method. It represents a better choice when there is data
scarcity or minimal computing power.
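A minimal sketch of a 1-D Local Binary Pattern descriptor of the kind this last contribution generalizes; the neighbourhood radius and the histogram form are assumptions for illustration, not the thesis's exact formulation:

```python
import numpy as np

def lbp_1d(x, radius=1):
    """1-D Local Binary Pattern: compare each sample with its 2*radius
    neighbours and encode the comparisons as a binary code."""
    codes = []
    for i in range(radius, len(x) - radius):
        neigh = np.concatenate([x[i - radius:i], x[i + 1:i + 1 + radius]])
        bits = (neigh >= x[i]).astype(int)
        codes.append(int("".join(map(str, bits)), 2))
    return np.array(codes)

def lbp_histogram(x, radius=1):
    """Histogram of LBP codes: a fixed-size descriptor for a signal frame."""
    codes = lbp_1d(x, radius)
    return np.bincount(codes, minlength=2 ** (2 * radius))

x = np.array([0.1, 0.5, 0.2, 0.8, 0.3])
print(lbp_1d(x))         # → [0 3 0]
print(lbp_histogram(x))  # → [2 0 0 1]
```

Because only comparisons and a histogram are computed, such descriptors are cheap relative to convolutional networks, which matches the claim about minimal computing power.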
Ubiquitous Technologies for Emotion Recognition
Emotions play a very important role in how we think and behave. The emotions we feel every day can compel us to act and influence the decisions and plans we make about our lives. Being able to measure, analyze, and better comprehend how or why our emotions change is thus highly relevant to understanding human behavior and its consequences. Despite the great efforts made in the past in the study of human emotions, it is only now, with the advent of wearable, mobile, and ubiquitous technologies, that we can aim to sense and recognize emotions continuously and in real time. This book brings together the latest experiences, findings, and developments regarding ubiquitous sensing, modeling, and recognition of human emotions.
Automatic Recognition of Non-Verbal Acoustic Communication Events With Neural Networks
Non-verbal acoustic communication is of high importance to humans and animals: Infants use the voice as a primary communication tool. Animals of all kinds employ acoustic communication, such as chimpanzees, which use pant-hoot vocalizations for long-distance communication.
Many applications require the assessment of such communication for a variety of analysis goals. Computational systems can support these areas through automatization of the assessment process. This is of particular importance in monitoring scenarios over large spatial and time scales, which are infeasible to perform manually.
Algorithms for sound recognition have traditionally been based on conventional machine learning approaches. In recent years, so-called representation learning approaches have gained increasing popularity. This particularly includes deep learning approaches that feed raw data to deep neural networks. However, there remain open challenges in applying these approaches to automatic recognition of non-verbal acoustic communication events, such as compensating for small data set sizes.
The leading question of this thesis is: How can we apply deep learning more effectively to automatic recognition of non-verbal acoustic communication events? The target communication types were specifically (1) infant vocalizations and (2) chimpanzee long-distance calls.
This thesis comprises four studies that investigated aspects of this question:
Study (A) investigated the assessment of infant vocalizations by laypersons. The central goal was to derive an infant vocalization classification scheme based on the laypersons' perception. The study method was based on the Nijmegen Protocol, where participants rated vocalization recordings through various items, such as affective ratings and class labels. Results showed a strong association between valence ratings and class labels, which was used to derive a classification scheme.
Study (B) was a comparative study on various neural network types for the automatic classification of infant vocalizations. The goal was to determine the best performing network type among the currently most prevailing ones, while considering the influence of their architectural configuration. Results showed that convolutional neural networks outperformed recurrent neural networks and that the choice of the frequency and time aggregation layer inside the network is the most important architectural choice.
Study (C) was a detailed investigation on computer vision-like convolutional neural networks for infant vocalization classification. The goal was to determine the most important architectural properties for increasing classification performance. Results confirmed the importance of the aggregation layer and additionally identified the input size of the fully-connected layers and the accumulated receptive field to be of major importance.
Study (D) was an investigation of compensating class imbalance for chimpanzee call detection in naturalistic long-term recordings. The goal was to determine which compensation method, among a selected group, improved performance the most for a deep learning system. Results showed that spectrogram denoising was most effective, while methods for compensating relative imbalance either retained or decreased performance.
1. Introduction
2. Foundations in Automatic Recognition of Acoustic Communication
3. State of Research
4. Study (A): Investigation of the Assessment of Infant Vocalizations by Laypersons
5. Study (B): Comparison of Neural Network Types for Automatic Classification of Infant Vocalizations
6. Study (C): Detailed Investigation of CNNs for Automatic Classification of Infant Vocalizations
7. Study (D): Compensating Class Imbalance for Acoustic Chimpanzee Detection With Convolutional Recurrent Neural Networks
8. Conclusion and Collected Discussion
9. Appendix
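Study (D)'s finding that spectrogram denoising helps most can be illustrated with a simple spectral-subtraction sketch; this is an assumed baseline for illustration, not the thesis's exact method:

```python
import numpy as np

def denoise_spectrogram(S, noise_frames=5):
    """Simple spectral subtraction: estimate a per-frequency noise
    profile from the first frames and subtract it, flooring at zero."""
    noise = S[:, :noise_frames].mean(axis=1, keepdims=True)
    return np.maximum(S - noise, 0.0)

rng = np.random.default_rng(1)
S = np.abs(rng.standard_normal((64, 100))) * 0.1   # broadband background noise
S[20, 50:60] += 5.0                                # a call-like burst in one band
D = denoise_spectrogram(S)
# After subtraction, the burst band dominates over a pure-noise band:
print(D[20, 50:60].mean() > D[10, 50:60].mean())   # True
```

Suppressing the stationary background this way makes rare call events stand out, which is one plausible reason denoising outperformed the relative-imbalance compensation methods in that study.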
Models and analysis of vocal emissions for biomedical applications: 5th International Workshop: December 13-15, 2007, Firenze, Italy
The MAVEBA Workshop, held on a biennial basis, collects in its proceedings the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images in support of clinical diagnosis and the classification of vocal pathologies. The Workshop is sponsored by Ente Cassa Risparmio di Firenze, COST Action 2103, the Biomedical Signal Processing and Control journal (Elsevier), and the IEEE Biomedical Engineering Soc. Special issues of international journals have been, and will be, published collecting selected papers from the conference.