247 research outputs found
Learning Speech Emotion Representations in the Quaternion Domain
The modeling of human emotion expression in speech signals is an important,
yet challenging task. The high resource demand of speech emotion recognition
models, combined with the the general scarcity of emotion-labelled data are
obstacles to the development and application of effective solutions in this
field. In this paper, we present an approach to jointly circumvent these
difficulties. Our method, named RH-emo, is a novel semi-supervised architecture
aimed at extracting quaternion embeddings from real-valued monoaural
spectrograms, enabling the use of quaternion-valued networks for speech emotion
recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that
consists of a real-valued encoder in parallel to a real-valued emotion
classifier and a quaternion-valued decoder. On the one hand, the classifier
permits to optimize each latent axis of the embeddings for the classification
of a specific emotion-related characteristic: valence, arousal, dominance and
overall emotion. On the other hand, the quaternion reconstruction enables the
latent dimension to develop intra-channel correlations that are required for an
effective representation as a quaternion entity. We test our approach on speech
emotion recognition tasks using four popular datasets: Iemocap, Ravdess, EmoDb
and Tess, comparing the performance of three well-established real-valued CNN
architectures (AlexNet, ResNet-50, VGG) and their quaternion-valued equivalent
fed with the embeddings created with RH-emo. We obtain a consistent improvement
in the test accuracy for all datasets, while drastically reducing the
resources' demand of models. Moreover, we performed additional experiments and
ablation studies that confirm the effectiveness of our approach. The RH-emo
repository is available at: https://github.com/ispamm/rhemo.Comment: Paper Submitted to IEEE/ACM Transactions on Audio, Speech and
Language Processin
A Quaternion Gated Recurrent Unit Neural Network for Sensor Fusion
Recurrent Neural Networks (RNNs) are known for their ability to learn relationships within temporal sequences. Gated Recurrent Unit (GRU) networks have found use in challenging time-dependent applications such as Natural Language Processing (NLP), financial analysis and sensor fusion due to their capability to cope with the vanishing gradient problem. GRUs are also known to be more computationally efficient than their variant, the Long Short-Term Memory neural network (LSTM), due to their less complex structure and as such, are more suitable for applications requiring more efficient management of computational resources. Many of such applications require a stronger mapping of their features to further enhance the prediction accuracy. A novel Quaternion Gated Recurrent Unit (QGRU) is proposed in this paper, which leverages the internal and external dependencies within the quaternion algebra to map correlations within and across multidimensional features. The QGRU can be used to efficiently capture the inter- and intra-dependencies within multidimensional features unlike the GRU, which only captures the dependencies within the sequence. Furthermore, the performance of the proposed method is evaluated on a sensor fusion problem involving navigation in Global Navigation Satellite System (GNSS) deprived environments as well as a human activity recognition problem. The results obtained show that the QGRU produces competitive results with almost 3.7 times fewer parameters compared to the GRU
Recommended from our members
Learning Speech Emotion Representations in the Quaternion Domain
The modeling of human emotion expression in speech signals is an important, yet challenging task. The high resource demand of speech emotion recognition models, combined with the general scarcity of emotion-labelled data are obstacles to the development and application of effective solutions in this field. In this paper, we present an approach to jointly circumvent these difficulties. Our method, named RH-emo, is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monoaural spectrograms, enabling the use of quaternion-valued networks for speech emotion recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel to a real-valued emotion classifier and a quaternion-valued decoder. On the one hand, the classifier permits to optimization of each latent axis of the embeddings for the classification of a specific emotion-related characteristic: valence, arousal, dominance, and overall emotion. On the other hand, quaternion reconstruction enables the latent dimension to develop intra-channel correlations that are required for an effective representation as a quaternion entity. We test our approach on speech emotion recognition tasks using four popular datasets: IEMOCAP, RAVDESS, EmoDB, and TESS, comparing the performance of three well-established real-valued CNN architectures (AlexNet, ResNet-50, VGG) and their quaternion-valued equivalent fed with the embeddings created with RH-emo. We obtain a consistent improvement in the test accuracy for all datasets, while drastically reducing the resources' demand of models. Moreover, we performed additional experiments and ablation studies that confirm the effectiveness of our approach
Exploiting Single-Channel Speech for Multi-Channel End-to-End Speech Recognition: A Comparative Study
Recently, the end-to-end training approach for multi-channel ASR has shown
its effectiveness, which usually consists of a beamforming front-end and a
recognition back-end. However, the end-to-end training becomes more difficult
due to the integration of multiple modules, particularly considering that
multi-channel speech data recorded in real environments are limited in size.
This raises the demand to exploit the single-channel data for multi-channel
end-to-end ASR. In this paper, we systematically compare the performance of
three schemes to exploit external single-channel data for multi-channel
end-to-end ASR, namely back-end pre-training, data scheduling, and data
simulation, under different settings such as the sizes of the single-channel
data and the choices of the front-end. Extensive experiments on CHiME-4 and
AISHELL-4 datasets demonstrate that while all three methods improve the
multi-channel end-to-end speech recognition performance, data simulation
outperforms the other two, at the cost of longer training time. Data scheduling
outperforms back-end pre-training marginally but nearly consistently,
presumably because that in the pre-training stage, the back-end tends to
overfit on the single-channel data, especially when the single-channel data
size is small.Comment: submitted to INTERSPEECH 2022. arXiv admin note: substantial text
overlap with arXiv:2107.0267
Recommended from our members
Enhancing the Generalization of Convolutional Neural Networks for Speech Emotion Recognition
Human-machine interaction is rapidly gaining significance in our daily lives. While speech recognition has achieved near-human performance in recent years, the intricate details embedded in speech extend beyond the mere arrangement of words. Speech Emotion Recognition (SER) is therefore acquiring a growing role in this field by decoding not only the linguistic content but also the emotional nuances of human spoken communication and enabling therefore a more exhaustive comprehension of the information conveyed by speech signals.
Despite the success that neural networks have already achieved in this task, SER is still challenging due to the variability of emotional expression, especially in real-world scenarios where generalization to unseen speakers and contexts is required. In addition, the high resource demand of SER models, combined with the scarcity of emotion-labelled data, hinder the development and application of effective solutions in this field. In this thesis, we present multiple approaches to overcome the aforementioned difficulties. We first introduce a multiple-time-scale (MTS) convolutional neural network architecture to create flexibility towards temporal variations when analyzing time-frequency representations of audio data. We show that resilience to speed fluctuations is relevant in SER tasks, since emotion is expressed through complex spectral patterns that can exhibit significant local dilation and compression on the time axis depending on speaker and context. The results indicate that the use of MTS consistently improves the generalization of networks of different capacity and depth, compared to standard convolution.
In a second stage, we propose a more general approach to discourage unwanted sensitivity towards specific target properties in CNNs, introducing the novel concept of anti-transfer learning. While transfer learning assumes that the learning process for a target task will benefit from re-using representations learned for another task, anti-transfer avoids the learning of representations that have been learned for an orthogonal task, i.e., one that is not relevant and potentially confounding for the target task, such as speaker identity and speech content for emotion recognition. In anti-transfer learning we penalize similarity between activations of a network being trained and another network previously trained on an orthogonal task. This leads to better generalization and provides a degree of control over correlations that are spurious or undesirable. We show that anti-transfer actually leads to the intended invariance to the orthogonal task and to more appropriate feature maps for the target task at hand. Anti-transfer creates a computation and memory cost at training time, but it enables enables the reuse of pre-trained models.
In order to avoid the high resource demand of SER models in general and anti-transfer learning specifically, we propose RH-emo, a novel semisupervised architecture aimed at extracting quaternion embeddings from realvalued monoaural spectrograms, enabling the use of quaternion-valued networks for SER tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel to a real-valued emotion classifier and a quaternion-valued decoder. We show that the use of RHemo, combined with quaternion convolutional neural networks provides a consistent improvement in SER tasks, while requiring far fewer trainable parameters and therefore substantially reducing the resource demand of SER models.
Finally, we apply anti-transfer learning to quaternion-valued neural networks fed with RH-emo embeddings. We demonstrate that the combination of the two approaches maintains the disentanglement properties of antitransfer, while using a reduced amount of memory, computation, and training time, making this a suitable approach for SER scenarios with limited resources and where context and speaker independence are needed
Task-oriented and Semantics-aware Communication Framework for Augmented Reality
Upon the advent of the emerging metaverse and its related applications in
Augmented Reality (AR), the current bit-oriented network struggles to support
real-time changes for the vast amount of associated information, hindering its
development. Thus, a critical revolution in the Sixth Generation (6G) networks
is envisioned through the joint exploitation of information context and its
importance to the task, leading to a communication paradigm shift towards
semantic and effectiveness levels. However, current research has not yet
proposed any explicit and systematic communication framework for AR
applications that incorporate these two levels. To fill this research gap, this
paper presents a task-oriented and semantics-aware communication framework for
augmented reality (TSAR) to enhance communication efficiency and effectiveness
in 6G. Specifically, we first analyse the traditional wireless AR point cloud
communication framework and then summarize our proposed semantic information
along with the end-to-end wireless communication. We then detail the design
blocks of the TSAR framework, covering both semantic and effectiveness levels.
Finally, numerous experiments have been conducted to demonstrate that, compared
to the traditional point cloud communication framework, our proposed TSAR
significantly reduces wireless AR application transmission latency by 95.6%,
while improving communication effectiveness in geometry and color aspects by up
to 82.4% and 20.4%, respectively
Robust Adversarial Attacks Detection for Deep Learning based Relative Pose Estimation for Space Rendezvous
Research on developing deep learning techniques for autonomous spacecraft
relative navigation challenges is continuously growing in recent years.
Adopting those techniques offers enhanced performance. However, such approaches
also introduce heightened apprehensions regarding the trustability and security
of such deep learning methods through their susceptibility to adversarial
attacks. In this work, we propose a novel approach for adversarial attack
detection for deep neural network-based relative pose estimation schemes based
on the explainability concept. We develop for an orbital rendezvous scenario an
innovative relative pose estimation technique adopting our proposed
Convolutional Neural Network (CNN), which takes an image from the chaser's
onboard camera and outputs accurately the target's relative position and
rotation. We perturb seamlessly the input images using adversarial attacks that
are generated by the Fast Gradient Sign Method (FGSM). The adversarial attack
detector is then built based on a Long Short Term Memory (LSTM) network which
takes the explainability measure namely SHapley Value from the CNN-based pose
estimator and flags the detection of adversarial attacks when acting.
Simulation results show that the proposed adversarial attack detector achieves
a detection accuracy of 99.21%. Both the deep relative pose estimator and
adversarial attack detector are then tested on real data captured from our
laboratory-designed setup. The experimental results from our
laboratory-designed setup demonstrate that the proposed adversarial attack
detector achieves an average detection accuracy of 96.29%
- …