Multimodal Speech Emotion Recognition Using Audio and Text
Speech emotion recognition is a challenging task, and extensive reliance has
been placed on models that use audio features in building well-performing
classifiers. In this paper, we propose a novel deep dual recurrent encoder
model that utilizes text data and audio signals simultaneously to obtain a
better understanding of speech data. As emotional dialogue is composed of sound
and spoken content, our model encodes the information from audio and text
sequences using dual recurrent neural networks (RNNs) and then combines the
information from these sources to predict the emotion class. This architecture
analyzes speech data from the signal level to the language level, and it thus
utilizes the information within the data more comprehensively than models that
focus on audio features. Extensive experiments are conducted to investigate the
efficacy and properties of the proposed model. Our proposed model outperforms
previous state-of-the-art methods in assigning data to one of four emotion
categories (i.e., angry, happy, sad and neutral) when the model is applied to
the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.
Comment: 7 pages, Accepted as a conference paper at IEEE SLT 201
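The dual-encoder idea described in this abstract can be sketched as follows. This is a toy illustration only, not the authors' model: the plain tanh recurrence, the layer sizes, the `params` dictionary and the helper names `rnn_encode` and `predict_emotion` are all assumptions made for the example. It shows one recurrent encoder per modality, concatenation of the two final states, and a softmax over the four emotion classes.

```python
import math

EMOTIONS = ["angry", "happy", "sad", "neutral"]


def rnn_encode(sequence, w_in, w_rec):
    """Run a minimal tanh RNN over a sequence and return the last hidden state."""
    hidden = [0.0] * len(w_rec)
    for x in sequence:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(u * hi for u, hi in zip(w_rec[j], hidden))
            )
            for j in range(len(hidden))
        ]
    return hidden


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_emotion(audio_frames, token_vectors, params):
    """Encode each modality separately, concatenate, and classify."""
    audio_vec = rnn_encode(audio_frames, params["audio_in"], params["audio_rec"])
    text_vec = rnn_encode(token_vectors, params["text_in"], params["text_rec"])
    fused = audio_vec + text_vec  # list concatenation of the two encodings
    logits = [sum(w * f for w, f in zip(row, fused)) for row in params["out"]]
    return dict(zip(EMOTIONS, softmax(logits)))


# Hypothetical toy weights: 2-dim inputs, 2 hidden units per encoder.
toy_params = {
    "audio_in": [[0.5, -0.2], [0.1, 0.3]],
    "audio_rec": [[0.1, 0.0], [0.0, 0.1]],
    "text_in": [[0.4, 0.1], [-0.3, 0.2]],
    "text_rec": [[0.1, 0.0], [0.0, 0.1]],
    "out": [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ],
}

probs = predict_emotion([[0.1, 0.2], [0.3, 0.4]], [[1.0, 0.0], [0.0, 1.0]], toy_params)
```

In a real system the weights would be learned and the inputs would be acoustic features and word embeddings; the sketch only fixes the data flow.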
Improvement of DOA Estimation by using Quaternion Output in Sound Event Localization and Detection
This paper describes an improvement in Direction of Arrival (DOA) estimation performance obtained by using a quaternion output in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Task 3. DCASE 2019 Task 3 focuses on sound event localization and detection (SELD), a task that estimates the sound source direction in addition to performing conventional sound event detection (SED). In the baseline method, the sound source direction angle is regressed directly. However, the angle is a periodic function and has discontinuities, which may make learning unstable. Specifically, even though -180 deg and 180 deg are the same direction, a large loss is calculated between them. Estimating DOA angles with a classification approach instead of regression can avoid this instability, but it limits the resolution. In this paper, we propose to introduce the quaternion, which is a continuous function, into the output layer of the neural network instead of directly estimating the sound source direction angle. This method can be implemented simply by changing the output of an existing neural network, and thus does not significantly increase the number of parameters in the middle layers. Experimental results show that the proposed method improves DOA estimation without significantly increasing the number of parameters.
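The wrap-around problem mentioned in this abstract can be illustrated with a small sketch. The abstract does not give the exact quaternion formulation, so everything here is an assumption for illustration: only the azimuth is encoded, as a rotation about the z-axis, and the distance is made sign-invariant because q and -q describe the same rotation. The function names are hypothetical.

```python
import math


def azimuth_to_quaternion(azimuth_deg):
    """Unit quaternion (w, x, y, z) for a rotation of azimuth_deg about the z-axis."""
    half = math.radians(azimuth_deg) / 2.0
    return (math.cos(half), 0.0, 0.0, math.sin(half))


def quaternion_to_azimuth(q):
    """Recover the azimuth in degrees, wrapped to [-180, 180)."""
    w, _, _, z = q
    deg = math.degrees(2.0 * math.atan2(z, w))
    return (deg + 180.0) % 360.0 - 180.0


def angle_squared_error(pred_deg, target_deg):
    """Naive regression loss on the raw angle (ignores periodicity)."""
    return (pred_deg - target_deg) ** 2


def quaternion_distance(q1, q2):
    """Sign-invariant distance: 0 when q1 and q2 encode the same rotation."""
    dot = sum(a * b for a, b in zip(q1, q2))
    return 1.0 - abs(dot)


# -180 deg and 180 deg are the same direction, but the raw-angle loss is large.
raw_loss = angle_squared_error(-180, 180)
quat_loss = quaternion_distance(azimuth_to_quaternion(-180), azimuth_to_quaternion(180))
```

Here `raw_loss` is the square of a 360 degree gap, while `quat_loss` is zero up to floating-point error, which is the behaviour the regression baseline lacks.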
Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection
Sound events often occur in unstructured environments where they exhibit wide
variations in their frequency content and temporal structure. Convolutional
neural networks (CNN) are able to extract higher level features that are
invariant to local spectral and temporal variations. Recurrent neural networks
(RNNs) are powerful in learning the longer term temporal context in the audio
signals. CNNs and RNNs as classifiers have recently shown improved performances
over established methods in various sound recognition tasks. We combine these
two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it
to a polyphonic sound event detection task. We compare the performance of the
proposed CRNN method with CNN, RNN, and other established methods, and observe
a considerable improvement for four different datasets consisting of everyday
sound events.
Comment: Accepted for IEEE Transactions on Audio, Speech and Language
Processing, Special Issue on Sound Scene and Event Analysi
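To make "polyphonic" concrete: a detector of this kind typically emits one probability per class for every frame, and several classes may be active in the same frame. The sketch below shows one common way to turn such an output into labelled segments. It is not taken from the paper; the threshold, the hop size and the function name `decode_events` are assumptions for illustration.

```python
def decode_events(frame_probs, class_names, threshold=0.5, hop_seconds=0.02):
    """Turn per-frame, per-class probabilities into (label, onset, offset) tuples."""
    events = []
    for class_index, label in enumerate(class_names):
        onset = None  # frame index where the current active run started
        for frame_index, frame in enumerate(frame_probs):
            active = frame[class_index] >= threshold
            if active and onset is None:
                onset = frame_index
            elif not active and onset is not None:
                events.append((label, onset * hop_seconds, frame_index * hop_seconds))
                onset = None
        if onset is not None:  # event still active at the end of the clip
            events.append((label, onset * hop_seconds, len(frame_probs) * hop_seconds))
    return events


# Four frames, two classes; both classes are active in frame 1 (overlap).
toy_probs = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.9], [0.1, 0.6]]
toy_events = decode_events(toy_probs, ["speech", "car"], hop_seconds=1.0)
```

Because each class is thresholded independently, overlapping events fall out naturally, which is what distinguishes this from single-label classification.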
Continuous Health Interface Event Retrieval
Knowing the state of our health at every moment in time is critical for
advances in health science. Using data obtained outside an episodic clinical
setting is the first step towards building a continuous health estimation
system. In this paper, we explore a system that allows users to combine events
and data streams from different sources to retrieve complex biological events,
such as cardiovascular volume overload. These complex events, which have been
explored in biomedical literature and which we call interface events, have a
direct causal impact on relevant biological systems. They are the interface
through which the lifestyle events influence our health. We retrieve the
interface events from existing events and data streams by encoding domain
knowledge using an event operator language.
Comment: ACM International Conference on Multimedia Retrieval 2020 (ICMR
2020), held in Dublin, Ireland from June 8-11, 202
A Four-Stage Data Augmentation Approach to ResNet-Conformer Based Acoustic Modeling for Sound Event Localization and Detection
In this paper, we propose a novel four-stage data augmentation approach to
ResNet-Conformer based acoustic modeling for sound event localization and
detection (SELD). First, we explore two spatial augmentation techniques, namely
audio channel swapping (ACS) and multi-channel simulation (MCS), to deal with
data sparsity in SELD. ACS and MCS focus on augmenting the limited training
data by expanding direction of arrival (DOA) representations such that the
acoustic models trained with the augmented data are robust to localization
variations of acoustic sources. Next, time-domain mixing (TDM) and
time-frequency masking (TFM) are also investigated to deal with overlapping
sound events and data diversity. Finally, ACS, MCS, TDM and TFM are combined in
a step-by-step manner to form an effective four-stage data augmentation scheme.
Tested on the Detection and Classification of Acoustic Scenes and Events
(DCASE) 2020 data sets, our proposed augmentation approach greatly improves the
system performance, ranking our submitted system in the first place in the SELD
task of DCASE 2020 Challenge. Furthermore, we employ a ResNet-Conformer
architecture to model both global and local context dependencies of an audio
sequence to yield further gains over those architectures used in the DCASE 2020
SELD evaluations.
Comment: 12 pages, 8 figure
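Two of the four augmentation stages named in this abstract, time-domain mixing (TDM) and time-frequency masking (TFM), can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' procedure: labels are plain lists without DOA information, mask positions are passed in explicitly rather than sampled, and the function names are hypothetical.

```python
def time_domain_mix(wave_a, labels_a, wave_b, labels_b):
    """Sum two equal-length waveforms sample by sample and merge their event labels."""
    if len(wave_a) != len(wave_b):
        raise ValueError("waveforms must have the same length")
    mixed = [a + b for a, b in zip(wave_a, wave_b)]
    return mixed, labels_a + labels_b


def time_frequency_mask(spec, t0, t_width, f0, f_width, fill=0.0):
    """Return a copy of spec[time][freq] with one time band and one frequency band masked."""
    masked = [row[:] for row in spec]  # copy so the input stays untouched
    for t, row in enumerate(masked):
        for f in range(len(row)):
            if t0 <= t < t0 + t_width or f0 <= f < f0 + f_width:
                row[f] = fill
    return masked


# Toy inputs: two short waveforms and a 3x3 "spectrogram" of ones.
mixed_wave, mixed_labels = time_domain_mix([0.1, 0.2], ["dog"], [0.3, -0.2], ["car"])
toy_spec = [[1.0, 1.0, 1.0] for _ in range(3)]
masked_spec = time_frequency_mask(toy_spec, t0=1, t_width=1, f0=0, f_width=1)
```

Mixing creates overlapping events from single-event clips, while masking hides part of the input so the model cannot rely on any one time or frequency region. In practice the mask positions and widths would usually be drawn at random for each training example.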
FSD50K: an Open Dataset of Human-Labeled Sound Events
Most existing datasets for sound event recognition (SER) are relatively small
and/or domain-specific, with the exception of AudioSet, based on a massive
amount of audio tracks from YouTube videos and encompassing over 500 classes of
everyday sounds. However, AudioSet is not an open dataset---its release
consists of pre-computed audio features (instead of waveforms), which limits
the adoption of some SER methods. Downloading the original audio tracks is also
problematic due to constituent YouTube videos gradually disappearing and usage
rights issues, which casts doubts over the suitability of this resource for
systems' benchmarking. To provide an alternative benchmark dataset and thus
foster SER research, we introduce FSD50K, an open dataset containing over 51k
audio clips totalling over 100h of audio manually labeled using 200 classes
drawn from the AudioSet Ontology. The audio clips are licensed under Creative
Commons licenses, making the dataset freely distributable (including
waveforms). We provide a detailed description of the FSD50K creation process,
tailored to the particularities of Freesound data, including challenges
encountered and solutions adopted. We include a comprehensive dataset
characterization along with discussion of limitations and key factors to allow
its audio-informed usage. Finally, we conduct sound event classification
experiments to provide baseline systems as well as insight on the main factors
to consider when splitting Freesound audio data for SER. Our goal is to develop
a dataset to be widely adopted by the community as a new open benchmark for SER
research.
Deep Neural Networks for Sound Event Detection
The objective of this thesis is to develop novel classification and feature learning techniques for the task of sound event detection (SED) in real-world environments. Throughout their lives, humans experience a consistent learning process on how to assign meanings to sounds. Thanks to this, most humans can easily recognize the sound of thunder, a dog bark, a door bell, bird singing, etc. In this work, we aim to develop systems that can automatically detect the sound events commonly present in our daily lives. Such systems can be utilized in, e.g., context-aware devices, acoustic surveillance, bio-acoustical and healthcare monitoring, and smart-home cities.
In this thesis, we propose to apply the modern machine learning methods called deep learning to SED. The relationship between the commonly used time-frequency representations for SED (such as the mel spectrogram and magnitude spectrogram) and the target sound event labels is highly complex. Deep learning methods such as deep neural networks (DNN) utilize a layered structure of units to extract features from the given sound representation input with increased abstraction at each layer. This increases the network's capacity to efficiently learn the highly complex relationship between the sound representation and the target sound event labels. We found that the proposed DNN approach performs significantly better than established classifier techniques for SED such as Gaussian mixture models.
In a time-frequency representation of an audio recording, a sound event can often be recognized as a distinct pattern that may exhibit shifts in both dimensions. The intra-class variability of the sound events may cause small shifts in the frequency domain content, and the time domain shift results from the fact that a sound event can occur at any time in a given audio recording. We found that convolutional neural networks (CNN) are useful for learning shift-invariant filters that are essential for robust modeling of sound events. In addition, we show that recurrent neural networks (RNN) are effective in modeling the long-term temporal characteristics of the sound events. Finally, we combine the convolutional and recurrent layers in a single classifier called a convolutional recurrent neural network (CRNN), which emphasizes the benefits of both and provides state-of-the-art results on multiple SED benchmark datasets.
Aside from learning the mappings between the time-frequency representations and the sound event labels, we show that deep learning methods can also be utilized to learn a direct mapping between the target labels and a lower-level representation such as the magnitude spectrogram or even the raw audio signals. In this thesis, we propose to integrate the feature learning capabilities of deep learning methods with empirical knowledge of human auditory perception by initializing layer weights with filterbank coefficients. This results in an optimal, ad-hoc filterbank that is obtained through gradient-based optimization of the original coefficients to improve the SED performance.