Speech Emotion Recognition using Supervised Deep Recurrent System for Mental Health Monitoring
Understanding human behavior and monitoring mental health are essential to
maintaining the safety of communities and society. As mental health problems
have increased during the COVID-19 pandemic, early detection of mental health
issues is crucial. Nowadays, the usage
of Intelligent Virtual Personal Assistants (IVA) has increased worldwide.
Individuals use their voices to control these devices to fulfill requests and
acquire different services. This paper proposes a novel deep learning model
based on a gated recurrent neural network and a convolutional neural network
that recognizes human emotion from speech, in order to improve IVA services
and monitor users' mental health.
Comment: 6 pages, 5 figures, 3 tables, accepted in the IEEE WFIoT202
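As an illustration (not taken from the paper), the gated recurrence at the core of such a model can be sketched as a minimal GRU cell in plain Python; all weights and dimensions below are illustrative placeholders, and the 2-dim "frames" stand in for pooled CNN features:

```python
# Minimal GRU cell: the gated recurrence underlying models of this kind.
# Weights here are illustrative constants, not trained parameters.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update: gates decide how much of the past state to keep."""
    z = [sigmoid(v) for v in vadd(matvec(Wz, x), matvec(Uz, h))]  # update gate
    r = [sigmoid(v) for v in vadd(matvec(Wr, x), matvec(Ur, h))]  # reset gate
    rh = [ri * hi for ri, hi in zip(r, h)]
    h_tilde = [math.tanh(v) for v in vadd(matvec(Wh, x), matvec(Uh, rh))]
    return [(1 - zi) * hi + zi * hti for zi, hi, hti in zip(z, h, h_tilde)]

# Toy run: 2-dim input frames (e.g. pooled CNN features), 2-dim hidden state.
W = [[0.1, 0.2], [0.3, 0.1]]
h = [0.0, 0.0]
for frame in ([0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]):
    h = gru_step(frame, h, W, W, W, W, W, W)
```

In a full model, the final hidden state `h` would feed a softmax layer over emotion classes.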
Dual Quaternion Ambisonics Array for Six-Degree-of-Freedom Acoustic Representation
Spatial audio methods are gaining a growing interest due to the spread of
immersive audio experiences and applications, such as virtual and augmented
reality. For these purposes, 3D audio signals are often acquired through arrays
of Ambisonics microphones, each comprising four capsules that decompose the
sound field into spherical harmonics. In this paper, we propose a dual quaternion
representation of the spatial sound field acquired through an array of two
First Order Ambisonics (FOA) microphones. The audio signals are encapsulated in
a dual quaternion that leverages quaternion algebra properties to exploit
correlations among them. This augmented representation with six degrees of
freedom (6DOF) provides more accurate coverage of the sound field, resulting
in more precise sound localization and a more immersive audio experience. We
evaluate our approach on a sound event localization and detection (SELD)
benchmark. We show that our dual quaternion SELD model with temporal
convolution blocks (DualQSELD-TCN) achieves better results than real- and
quaternion-valued baselines thanks to our augmented representation of the
sound field. Full code is available at:
https://github.com/ispamm/DualQSELD-TCN.
Comment: Paper under consideration at Elsevier Pattern Recognition Letters
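As a hedged sketch of the representation described above (the values and framing are illustrative, not from the paper's implementation): each FOA microphone's four channels (W, X, Y, Z) can be packed into one quaternion, and the two-microphone array into a dual quaternion q_r + εq_d, whose algebra couples the channels:

```python
# Sketch: packing two FOA microphone frames into a dual quaternion.
# Values are illustrative; a real pipeline would process whole signals.

def hamilton(p, q):
    """Hamilton product of quaternions given as (w, x, y, z) tuples."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw*qw - px*qx - py*qy - pz*qz,
            pw*qx + px*qw + py*qz - pz*qy,
            pw*qy - px*qz + py*qw + pz*qx,
            pw*qz + px*qy - py*qx + pz*qw)

def dq_mul(a, b):
    """Dual-quaternion product: (ar + eps*ad)(br + eps*bd), with eps^2 = 0."""
    ar, ad = a
    br, bd = b
    real = hamilton(ar, br)
    dual = tuple(x + y for x, y in zip(hamilton(ar, bd), hamilton(ad, br)))
    return real, dual

mic1 = (0.9, 0.1, -0.2, 0.3)   # one time frame of FOA channels W, X, Y, Z
mic2 = (0.8, -0.1, 0.2, 0.1)
signal = (mic1, mic2)          # dual quaternion: real part + dual part
identity = ((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0))
```

Because multiplication mixes all four channels of each microphone at once, a quaternion-valued layer can exploit inter-channel correlations that a real-valued layer treats independently.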
Delayed Memory Unit: Modelling Temporal Dependency Through Delay Gate
Recurrent Neural Networks (RNNs) are renowned for their adeptness in modeling
temporal dependencies, a trait that has driven their widespread adoption for
sequential data processing. Nevertheless, vanilla RNNs suffer from the
well-known problem of vanishing and exploding gradients, posing a significant
challenge for learning long-range dependencies. Additionally,
gated RNNs tend to be over-parameterized, resulting in poor network
generalization. To address these challenges, we propose a novel Delayed Memory
Unit (DMU) in this paper, wherein a delay line structure, coupled with delay
gates, is introduced to facilitate temporal interaction and temporal credit
assignment, so as to enhance the temporal modeling capabilities of vanilla
RNNs. Particularly, the DMU is designed to directly distribute the input
information to the optimal time instant in the future, rather than aggregating
and redistributing it over time through intricate network dynamics. Our
proposed DMU demonstrates superior temporal modeling capabilities across a
broad range of sequential modeling tasks, utilizing considerably fewer
parameters than other state-of-the-art gated RNN models in applications such as
speech recognition, radar gesture recognition, ECG waveform segmentation, and
permuted sequential image classification.
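The delay-gate idea can be sketched as follows; this is a toy illustration under our own simplifying assumptions (scalar inputs, fixed gate logits), whereas in the DMU the gates would be learned and input-dependent:

```python
# Sketch of a gated delay line: instead of recycling information through
# recurrent dynamics, each input is dispatched to future time steps.
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def dmu_run(inputs, gate_logits, max_delay=3):
    """Route each scalar input across delays 0..max_delay-1, emit slot 0."""
    buf = [0.0] * max_delay           # delay line
    outputs = []
    for x in inputs:
        gates = softmax(gate_logits)  # how much of x goes to each delay
        for d in range(max_delay):
            buf[d] += gates[d] * x    # credit assigned to a future instant
        outputs.append(buf[0])
        buf = buf[1:] + [0.0]         # shift the line by one step
    return outputs

out = dmu_run([1.0, 0.0, 0.0, 0.0], gate_logits=[0.0, 0.0, 0.0])
# With uniform gates, the single impulse is spread evenly over three steps.
```

The appeal is that temporal credit assignment becomes explicit routing rather than something the recurrent weights must learn implicitly.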
Compact recurrent neural networks for acoustic event detection on low-energy low-complexity platforms
Outdoor acoustic event detection is an exciting research field, but it is
challenged by the need for complex algorithms and deep learning techniques
that typically require substantial computational, memory, and energy
resources. This
challenge discourages IoT implementation, where an efficient use of resources
is required. However, current embedded technologies and microcontrollers have
increased their capabilities without penalizing energy efficiency. This paper
addresses the application of sound event detection at the edge, by optimizing
deep learning techniques on resource-constrained embedded platforms for the
IoT. The contribution is two-fold: firstly, a two-stage student-teacher
approach is presented to make state-of-the-art neural networks for sound event
detection fit on current microcontrollers; secondly, we test our approach on an
ARM Cortex M4, particularly focusing on issues related to 8-bit quantization.
Our embedded implementation can achieve 68% accuracy in recognition on
Urbansound8k, not far from state-of-the-art performance, with an inference time
of 125 ms for each second of the audio stream, and power consumption of 5.5 mW
in just 34.3 kB of RAM.
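To illustrate the kind of step such a microcontroller port must get right (this is a generic symmetric per-tensor scheme with toy values, not the paper's exact procedure): weights are mapped to int8 with a shared scale, then dequantized to check the reconstruction error:

```python
# Illustrative symmetric 8-bit quantization with a per-tensor scale.
# The weight values are toy data.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
```

The reconstruction error is bounded by half the quantization step, which is why 8-bit inference can stay close to the floating-point accuracy reported above.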
Speaker verification using attentive multi-scale convolutional recurrent network
In this paper, we propose a speaker verification method based on an Attentive
Multi-scale Convolutional Recurrent Network (AMCRN). The proposed AMCRN can
acquire both local spatial information and global sequential information from
the input speech recordings. In the proposed method, the log-Mel spectrum is
extracted from each speech recording and then fed to the proposed AMCRN to
learn a speaker embedding. Afterwards, the learned speaker embedding is fed
to a back-end classifier (such as a cosine similarity metric) for scoring in the
testing stage. The proposed method is compared with state-of-the-art methods
for speaker verification. Experimental data are three public datasets that are
selected from two large-scale speech corpora (VoxCeleb1 and VoxCeleb2).
Experimental results show that our method outperforms baseline methods in
terms of equal error rate and minimum detection cost function, and has
advantages over most of the baseline methods in terms of computational
complexity and memory
requirement. In addition, our method generalizes well across truncated speech
segments with different durations, and the speaker embedding learned by the
proposed AMCRN has stronger generalization ability across two back-end
classifiers.
Comment: 21 pages, 6 figures, 8 tables. Accepted for publication in Applied
Soft Computing
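The cosine-similarity back end mentioned above reduces to a few lines; the embedding vectors here are toy illustrations, not outputs of the AMCRN:

```python
# Minimal cosine-similarity back end for speaker verification scoring.
import math

def cosine_score(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

enroll = [0.6, 0.8, 0.0]   # embedding from enrollment utterance (toy values)
trial  = [0.6, 0.8, 0.1]   # embedding from test utterance (toy values)
score = cosine_score(enroll, trial)
```

In a verification system, `score` would be compared against a threshold tuned on a development set to trade off false acceptances against false rejections.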