Learning Speech Emotion Representations in the Quaternion Domain

Author: Comminiello Danilo
Guizzo Eric
Scardapane Simone
Weyde Tillman
Publication date: 05/04/2022

The modeling of human emotion expression in speech signals is an important, yet challenging task. The high resource demand of speech emotion recognition models, combined with the general scarcity of emotion-labelled data are obstacles to the development and application of effective solutions in this field. In this paper, we present an approach to jointly circumvent these difficulties. Our method, named RH-emo, is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monoaural spectrograms, enabling the use of quaternion-valued networks for speech emotion recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel to a real-valued emotion classifier and a quaternion-valued decoder. On the one hand, the classifier makes it possible to optimize each latent axis of the embeddings for the classification of a specific emotion-related characteristic: valence, arousal, dominance and overall emotion. On the other hand, the quaternion reconstruction enables the latent dimension to develop intra-channel correlations that are required for an effective representation as a quaternion entity. We test our approach on speech emotion recognition tasks using four popular datasets: Iemocap, Ravdess, EmoDb and Tess, comparing the performance of three well-established real-valued CNN architectures (AlexNet, ResNet-50, VGG) and their quaternion-valued equivalents fed with the embeddings created with RH-emo. We obtain a consistent improvement in the test accuracy for all datasets, while drastically reducing the resource demand of the models. Moreover, we performed additional experiments and ablation studies that confirm the effectiveness of our approach. The RH-emo repository is available at: https://github.com/ispamm/rhemo. Comment: Paper Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processin
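As an illustrative aside, the quaternion-valued layers this abstract refers to are commonly built on the Hamilton product, which mixes the four components of each quaternion with a shared set of weights. The sketch below is not taken from the paper or its repository; the function names `hamilton_product` and `quaternion_linear`, the layer sizes, and the random inputs are assumptions chosen only to show the operation.

```python
import numpy as np

def hamilton_product(p, q):
    """Hamilton product of two quaternions given as (r, i, j, k)."""
    r1, i1, j1, k1 = p
    r2, i2, j2, k2 = q
    return np.array([
        r1 * r2 - i1 * i2 - j1 * j2 - k1 * k2,
        r1 * i2 + i1 * r2 + j1 * k2 - k1 * j2,
        r1 * j2 - i1 * k2 + j1 * r2 + k1 * i2,
        r1 * k2 + i1 * j2 - j1 * i2 + k1 * r2,
    ])

def quaternion_linear(weights, x):
    """Quaternion-valued linear map (hypothetical helper, no bias).

    weights: four real matrices (Wr, Wi, Wj, Wk), each of shape (n_out, n_in)
    x:       four real vectors  (xr, xi, xj, xk), each of shape (n_in,)
    Every output component reuses all four weight matrices, following the
    same sign pattern as the Hamilton product above.
    """
    wr, wi, wj, wk = weights
    xr, xi, xj, xk = x
    return np.stack([
        wr @ xr - wi @ xi - wj @ xj - wk @ xk,
        wr @ xi + wi @ xr + wj @ xk - wk @ xj,
        wr @ xj - wi @ xk + wj @ xr + wk @ xi,
        wr @ xk + wi @ xj - wj @ xi + wk @ xr,
    ])

# Assumed toy sizes: 8 input quaternions mapped to 3 output quaternions.
rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 3, 8))
x = rng.standard_normal((4, 8))
y = quaternion_linear(weights, x)  # shape (4, 3): four components per output
```

Because each of the four weight matrices contributes to every output component, the components of an input quaternion are always processed jointly, which is one way to read the "intra-channel correlations" mentioned in the abstract.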

arXiv.org e-Print Archive

City Research Online

Archivio della ricerca- Università di Roma La Sapienza

A Quaternion Gated Recurrent Unit Neural Network for Sensor Fusion

Author: Christopoulos Stavros
Kanarachos Stratis
Onyekpe Uche Abiola
Palade Vasile
Publication venue: 'MDPI AG'
Publication date: 09/03/2021

Recurrent Neural Networks (RNNs) are known for their ability to learn relationships within temporal sequences. Gated Recurrent Unit (GRU) networks have found use in challenging time-dependent applications such as Natural Language Processing (NLP), financial analysis and sensor fusion due to their capability to cope with the vanishing gradient problem. GRUs are also known to be more computationally efficient than their variant, the Long Short-Term Memory neural network (LSTM), due to their less complex structure and, as such, are more suitable for applications requiring more efficient management of computational resources. Many such applications require a stronger mapping of their features to further enhance the prediction accuracy. A novel Quaternion Gated Recurrent Unit (QGRU) is proposed in this paper, which leverages the internal and external dependencies within the quaternion algebra to map correlations within and across multidimensional features. Unlike the GRU, which only captures the dependencies within the sequence, the QGRU can efficiently capture the inter- and intra-dependencies within multidimensional features. Furthermore, the performance of the proposed method is evaluated on a sensor fusion problem involving navigation in Global Navigation Satellite System (GNSS) deprived environments as well as a human activity recognition problem. The results obtained show that the QGRU produces competitive results with almost 3.7 times fewer parameters compared to the GRU
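A rough way to see where a parameter saving of this order can come from is to count weights. The sketch below is an assumption-laden illustration, not the paper's model: it assumes a single-layer GRU with one bias vector per gate, assumes that each quaternion weight matrix stores a quarter of the entries of its real counterpart, and uses made-up layer sizes. The function names are hypothetical, and the figure reported in the abstract depends on the authors' actual architecture.

```python
def gru_param_count(input_size, hidden_size):
    """Real-valued GRU: 3 gates, each with input weights, recurrent weights, and a bias."""
    per_gate = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return 3 * per_gate

def qgru_param_count(input_size, hidden_size):
    """Quaternion GRU with the same real-valued widths (assumed divisible by 4).

    Each weight matrix is assumed to hold 1/4 of the real entries because the
    Hamilton product reuses four sub-matrices; biases are left unchanged.
    """
    assert input_size % 4 == 0 and hidden_size % 4 == 0
    per_gate = (hidden_size * input_size) // 4 + (hidden_size * hidden_size) // 4 + hidden_size
    return 3 * per_gate

# Assumed toy sizes: 16 input features, 64 hidden units.
real_params = gru_param_count(16, 64)
quaternion_params = qgru_param_count(16, 64)
ratio = real_params / quaternion_params  # somewhat below 4 because biases are not reduced
```

Under these assumptions the ratio stays a little under 4, since only the weight matrices shrink while the biases keep their size.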

Multidisciplinary Digital Publishing Institute

Coventry University Pure Portal

Learning Speech Emotion Representations in the Quaternion Domain

Author: Comminiello D.
Guizzo E.
Scardapane S.
Weyde T.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/03/2023

The modeling of human emotion expression in speech signals is an important, yet challenging task. The high resource demand of speech emotion recognition models, combined with the general scarcity of emotion-labelled data are obstacles to the development and application of effective solutions in this field. In this paper, we present an approach to jointly circumvent these difficulties. Our method, named RH-emo, is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monoaural spectrograms, enabling the use of quaternion-valued networks for speech emotion recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel to a real-valued emotion classifier and a quaternion-valued decoder. On the one hand, the classifier permits the optimization of each latent axis of the embeddings for the classification of a specific emotion-related characteristic: valence, arousal, dominance, and overall emotion. On the other hand, quaternion reconstruction enables the latent dimension to develop intra-channel correlations that are required for an effective representation as a quaternion entity. We test our approach on speech emotion recognition tasks using four popular datasets: IEMOCAP, RAVDESS, EmoDB, and TESS, comparing the performance of three well-established real-valued CNN architectures (AlexNet, ResNet-50, VGG) and their quaternion-valued equivalents fed with the embeddings created with RH-emo. We obtain a consistent improvement in the test accuracy for all datasets, while drastically reducing the resource demand of the models. Moreover, we performed additional experiments and ablation studies that confirm the effectiveness of our approach

City Research Online

Exploiting Single-Channel Speech for Multi-Channel End-to-End Speech Recognition: A Comparative Study

Author: An Keyu
Ou Zhijian
Publication date: 30/03/2022

Recently, end-to-end training of multi-channel ASR systems, which usually consist of a beamforming front-end and a recognition back-end, has shown its effectiveness. However, the end-to-end training becomes more difficult due to the integration of multiple modules, particularly considering that multi-channel speech data recorded in real environments are limited in size. This raises the demand to exploit the single-channel data for multi-channel end-to-end ASR. In this paper, we systematically compare the performance of three schemes to exploit external single-channel data for multi-channel end-to-end ASR, namely back-end pre-training, data scheduling, and data simulation, under different settings such as the sizes of the single-channel data and the choices of the front-end. Extensive experiments on CHiME-4 and AISHELL-4 datasets demonstrate that while all three methods improve the multi-channel end-to-end speech recognition performance, data simulation outperforms the other two, at the cost of longer training time. Data scheduling outperforms back-end pre-training marginally but nearly consistently, presumably because, in the pre-training stage, the back-end tends to overfit on the single-channel data, especially when the single-channel data size is small. Comment: submitted to INTERSPEECH 2022. arXiv admin note: substantial text overlap with arXiv:2107.0267

arXiv.org e-Print Archive

Enhancing the Generalization of Convolutional Neural Networks for Speech Emotion Recognition

Author: Guizzo E.

Human-machine interaction is rapidly gaining significance in our daily lives. While speech recognition has achieved near-human performance in recent years, the intricate details embedded in speech extend beyond the mere arrangement of words. Speech Emotion Recognition (SER) is therefore acquiring a growing role in this field by decoding not only the linguistic content but also the emotional nuances of human spoken communication, thereby enabling a more exhaustive comprehension of the information conveyed by speech signals. Despite the success that neural networks have already achieved in this task, SER is still challenging due to the variability of emotional expression, especially in real-world scenarios where generalization to unseen speakers and contexts is required. In addition, the high resource demand of SER models, combined with the scarcity of emotion-labelled data, hinders the development and application of effective solutions in this field. In this thesis, we present multiple approaches to overcome the aforementioned difficulties. We first introduce a multiple-time-scale (MTS) convolutional neural network architecture to create flexibility towards temporal variations when analyzing time-frequency representations of audio data. We show that resilience to speed fluctuations is relevant in SER tasks, since emotion is expressed through complex spectral patterns that can exhibit significant local dilation and compression on the time axis depending on speaker and context. The results indicate that the use of MTS consistently improves the generalization of networks of different capacity and depth, compared to standard convolution. In a second stage, we propose a more general approach to discourage unwanted sensitivity towards specific target properties in CNNs, introducing the novel concept of anti-transfer learning.
While transfer learning assumes that the learning process for a target task will benefit from re-using representations learned for another task, anti-transfer avoids the learning of representations that have been learned for an orthogonal task, i.e., one that is not relevant and potentially confounding for the target task, such as speaker identity and speech content for emotion recognition. In anti-transfer learning we penalize similarity between activations of a network being trained and another network previously trained on an orthogonal task. This leads to better generalization and provides a degree of control over correlations that are spurious or undesirable. We show that anti-transfer actually leads to the intended invariance to the orthogonal task and to more appropriate feature maps for the target task at hand. Anti-transfer creates a computation and memory cost at training time, but it enables the reuse of pre-trained models. In order to avoid the high resource demand of SER models in general and anti-transfer learning specifically, we propose RH-emo, a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monoaural spectrograms, enabling the use of quaternion-valued networks for SER tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel to a real-valued emotion classifier and a quaternion-valued decoder. We show that the use of RH-emo, combined with quaternion convolutional neural networks, provides a consistent improvement in SER tasks, while requiring far fewer trainable parameters and therefore substantially reducing the resource demand of SER models. Finally, we apply anti-transfer learning to quaternion-valued neural networks fed with RH-emo embeddings.
We demonstrate that the combination of the two approaches maintains the disentanglement properties of anti-transfer, while using a reduced amount of memory, computation, and training time, making this a suitable approach for SER scenarios with limited resources and where context and speaker independence are needed
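The anti-transfer idea described in this abstract, penalizing similarity between the activations of the network being trained and those of a network pre-trained on an orthogonal task, can be sketched as an extra loss term. The sketch below is a minimal illustration under stated assumptions: the similarity measure (squared cosine similarity of flattened activations), the weighting factor `beta`, and the function names are hypothetical choices and may differ from what the thesis actually uses.

```python
import numpy as np

def squared_cosine_similarity(a, b, eps=1e-12):
    """Squared cosine similarity between two activation tensors, flattened."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return float(cos ** 2)

def anti_transfer_loss(task_loss, trained_feats, frozen_feats, beta=1.0):
    """Task loss plus a penalty that grows as the two feature maps align."""
    return task_loss + beta * squared_cosine_similarity(trained_feats, frozen_feats)

# Toy feature maps (assumed values, for illustration only).
frozen = np.array([[1.0, 0.0], [0.0, 1.0]])         # from the frozen, pre-trained network
aligned = 2.0 * frozen                               # same direction: maximal penalty
decorrelated = np.array([[0.0, 1.0], [-1.0, 0.0]])  # orthogonal: no penalty

loss_aligned = anti_transfer_loss(0.5, aligned, frozen)
loss_decorrelated = anti_transfer_loss(0.5, decorrelated, frozen)
```

In this toy setting the aligned features are penalized while the decorrelated ones leave the task loss untouched, which mirrors the intended pressure away from representations learned for the orthogonal task.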

City Research Online

Task-oriented and Semantics-aware Communication Framework for Augmented Reality

Author: Aghvami A. Hamid
Deng Yansha
Wang Zhe
Publication date: 27/06/2023

Upon the advent of the emerging metaverse and its related applications in Augmented Reality (AR), the current bit-oriented network struggles to support real-time changes for the vast amount of associated information, hindering its development. Thus, a critical revolution in the Sixth Generation (6G) networks is envisioned through the joint exploitation of information context and its importance to the task, leading to a communication paradigm shift towards semantic and effectiveness levels. However, current research has not yet proposed any explicit and systematic communication framework for AR applications that incorporate these two levels. To fill this research gap, this paper presents a task-oriented and semantics-aware communication framework for augmented reality (TSAR) to enhance communication efficiency and effectiveness in 6G. Specifically, we first analyse the traditional wireless AR point cloud communication framework and then summarize our proposed semantic information along with the end-to-end wireless communication. We then detail the design blocks of the TSAR framework, covering both semantic and effectiveness levels. Finally, numerous experiments have been conducted to demonstrate that, compared to the traditional point cloud communication framework, our proposed TSAR significantly reduces wireless AR application transmission latency by 95.6%, while improving communication effectiveness in geometry and color aspects by up to 82.4% and 20.4%, respectively

arXiv.org e-Print Archive

Robust Adversarial Attacks Detection for Deep Learning based Relative Pose Estimation for Space Rendezvous

Author: Aouf Nabil
Honvault Christophe
Pizarro Jose
Wang Ziwei
Publication date: 10/11/2023

Research on deep learning techniques for autonomous spacecraft relative navigation has grown continuously in recent years. Adopting those techniques offers enhanced performance. However, such approaches also raise concerns about the trustworthiness and security of deep learning methods, given their susceptibility to adversarial attacks. In this work, we propose a novel approach for adversarial attack detection for deep neural network-based relative pose estimation schemes based on the explainability concept. For an orbital rendezvous scenario, we develop an innovative relative pose estimation technique adopting our proposed Convolutional Neural Network (CNN), which takes an image from the chaser's onboard camera and accurately outputs the target's relative position and rotation. We seamlessly perturb the input images using adversarial attacks that are generated by the Fast Gradient Sign Method (FGSM). The adversarial attack detector is then built based on a Long Short-Term Memory (LSTM) network, which takes the explainability measure, namely the SHapley Value, from the CNN-based pose estimator and flags adversarial attacks when they occur. Simulation results show that the proposed adversarial attack detector achieves a detection accuracy of 99.21%. Both the deep relative pose estimator and adversarial attack detector are then tested on real data captured from our laboratory-designed setup. The experimental results from this setup demonstrate that the proposed adversarial attack detector achieves an average detection accuracy of 96.29%
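For readers unfamiliar with the Fast Gradient Sign Method named in this abstract, it perturbs an input by a small step `epsilon` in the direction of the sign of the loss gradient with respect to that input. The sketch below uses a tiny logistic-regression model as a stand-in, since the paper's CNN pose estimator is not available here; the weights, input, label, and `epsilon` are made-up values, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, x, y):
    """Binary cross-entropy of a logistic-regression model on one sample."""
    p = sigmoid(w @ x + b)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fgsm(w, b, x, y, epsilon):
    """FGSM step: x_adv = x + epsilon * sign(dL/dx).

    For logistic regression with cross-entropy loss, dL/dx = (p - y) * w.
    """
    grad_x = (sigmoid(w @ x + b) - y) * w
    return x + epsilon * np.sign(grad_x)

# Assumed toy model and input (illustration only).
w = np.array([1.0, -2.0, 0.5])
b = 0.1
x = np.array([0.2, 0.4, 0.6])
y = 1.0

x_adv = fgsm(w, b, x, y, epsilon=0.1)
clean_loss = bce_loss(w, b, x, y)
adv_loss = bce_loss(w, b, x_adv, y)  # larger than clean_loss for this model
```

Each input coordinate moves by exactly `epsilon`, so the perturbation is bounded in the max-norm while still increasing the loss. With image inputs, the result would typically also be clipped to the valid pixel range.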

arXiv.org e-Print Archive