Multi-Time-Scale Convolution for Emotion Recognition from Speech Audio Signals
Robustness against temporal variations is important for emotion recognition from speech audio, since emotion is expressed through complex spectral patterns that can exhibit significant local dilation and compression on the time axis depending on speaker and context. To address this and potentially other tasks, we introduce the multi-time-scale (MTS) method to create flexibility towards temporal variations when analyzing time-frequency representations of audio data. MTS extends convolutional neural networks with convolution kernels that are scaled and re-sampled along the time axis, increasing temporal flexibility without increasing the number of trainable parameters compared to standard convolutional layers. We evaluate MTS and standard convolutional layers in different architectures for emotion recognition from speech audio, using four datasets of different sizes. The results show that the use of MTS layers consistently improves the generalization of networks of different capacity and depth compared to standard convolution, especially on smaller datasets.
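The kernel re-sampling idea lends itself to a compact sketch. The following PyTorch snippet is only illustrative: the class name MTSConv2d, the scale set, and the element-wise max across scales are assumptions rather than the paper's exact implementation; the key property it demonstrates is that all time scales share one trainable kernel.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MTSConv2d(nn.Module):
        # Hypothetical multi-time-scale convolution: a single trainable
        # kernel is re-sampled along the time axis at fixed scale factors,
        # so temporal flexibility comes at no extra parameter cost.
        def __init__(self, in_ch, out_ch, kernel_size=(3, 9),
                     scales=(0.5, 1.0, 2.0)):
            super().__init__()
            self.scales = scales
            # One shared weight tensor: (out_ch, in_ch, freq, time).
            self.weight = nn.Parameter(
                torch.randn(out_ch, in_ch, *kernel_size) * 0.01)

        def forward(self, x):
            outs = []
            for s in self.scales:
                # Stretch/compress the kernel along the time axis only.
                k = F.interpolate(self.weight, scale_factor=(1.0, s),
                                  mode='bilinear', align_corners=False)
                outs.append(F.conv2d(x, k, padding='same'))
            # One plausible way to merge the scales; the paper may differ.
            return torch.stack(outs, dim=0).max(dim=0).values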
Towards a Multimodal Time-Based Empathy Prediction System
We describe our system for empathic emotion recognition, based on deep learning over multiple modalities in a late-fusion architecture. We present the modules of our system, discuss the evaluation results, and make our code available to the research community.
Enhancing the Generalization of Convolutional Neural Networks for Speech Emotion Recognition
Human-machine interaction is rapidly gaining significance in our daily lives. While speech recognition has achieved near-human performance in recent years, the intricate details embedded in speech extend beyond the mere arrangement of words. Speech Emotion Recognition (SER) is therefore acquiring a growing role in this field by decoding not only the linguistic content but also the emotional nuances of human spoken communication, thereby enabling a more complete understanding of the information conveyed by speech signals.
Despite the success that neural networks have already achieved in this task, SER remains challenging due to the variability of emotional expression, especially in real-world scenarios where generalization to unseen speakers and contexts is required. In addition, the high resource demand of SER models, combined with the scarcity of emotion-labelled data, hinders the development and application of effective solutions in this field. In this thesis, we present multiple approaches to overcome these difficulties. We first introduce a multi-time-scale (MTS) convolutional neural network architecture to create flexibility towards temporal variations when analyzing time-frequency representations of audio data. We show that resilience to speed fluctuations is relevant in SER tasks, since emotion is expressed through complex spectral patterns that can exhibit significant local dilation and compression on the time axis depending on speaker and context. The results indicate that the use of MTS consistently improves the generalization of networks of different capacity and depth compared to standard convolution.
In a second stage, we propose a more general approach to discourage unwanted sensitivity towards specific target properties in CNNs, introducing the novel concept of anti-transfer learning. While transfer learning assumes that the learning process for a target task will benefit from re-using representations learned for another task, anti-transfer avoids the learning of representations that have been learned for an orthogonal task, i.e., one that is not relevant and potentially confounding for the target task, such as speaker identity or speech content for emotion recognition. In anti-transfer learning, we penalize similarity between the activations of a network being trained and those of another network previously trained on an orthogonal task. This leads to better generalization and provides a degree of control over correlations that are spurious or undesirable. We show that anti-transfer actually leads to the intended invariance to the orthogonal task and to more appropriate feature maps for the target task at hand. Anti-transfer incurs a computation and memory cost at training time, but it enables the reuse of pre-trained models.
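The penalty described above can be sketched concisely. The snippet below is a minimal, assumed formulation in PyTorch: it measures similarity as cosine similarity between flattened activation maps, which is one plausible choice; the thesis' exact deep feature loss may aggregate activations differently.

    import torch
    import torch.nn.functional as F

    def anti_transfer_loss(feat_current, feat_orthogonal):
        # Penalize similarity between an activation of the network being
        # trained and the corresponding activation of a frozen network
        # pre-trained on an orthogonal task (e.g., speaker identity).
        a = feat_current.flatten(start_dim=1)
        b = feat_orthogonal.detach().flatten(start_dim=1)  # frozen net: no gradients
        return F.cosine_similarity(a, b, dim=1).abs().mean()

    # Assumed usage at training time, with beta weighting the penalty:
    # loss = task_loss + beta * anti_transfer_loss(h_emotion, h_orthogonal)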
In order to avoid the high resource demand of SER models in general and of anti-transfer learning specifically, we propose RH-emo, a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monoaural spectrograms, enabling the use of quaternion-valued networks for SER tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel to a real-valued emotion classifier and a quaternion-valued decoder. We show that the use of RH-emo, combined with quaternion convolutional neural networks, provides a consistent improvement in SER tasks, while requiring far fewer trainable parameters and therefore substantially reducing the resource demand of SER models.
Finally, we apply anti-transfer learning to quaternion-valued neural networks fed with RH-emo embeddings. We demonstrate that the combination of the two approaches maintains the disentanglement properties of anti-transfer, while using a reduced amount of memory, computation, and training time, making this a suitable approach for SER scenarios with limited resources and where context and speaker independence are needed.
GAP WORK project report: training for youth practitioners on tackling gender-related violence
This project sought to challenge gender-related violence against (and by) children and young people by developing training for practitioners who have everyday contact with general populations of children and young people (‘youth practitioners’). Through improved knowledge and understanding, practitioners can better identify and challenge sexist, sexualising, homophobic or controlling language and behaviour, and know when and how to refer children and young people to the most appropriate support services. This summary outlines the project and our initial findings about the success of the four training programmes developed and piloted. Co-funded by the DAPHNE III programme of the EU.
Quaternion Anti-Transfer Learning for Speech Emotion Recognition
This study explores the benefits of anti-transfer learning with quaternion neural networks for robust, effective, and efficient speech emotion recognition. Anti-transfer learning selectively promotes task invariance through the introduction of a deep feature loss at training time. It has been shown to improve the performance of speech emotion recognition models by encouraging the independence of emotion predictions from specific uttered words and characteristics of the speaker’s voice. However, the improved accuracy comes at the cost of increased computation time and memory requirements. In order to reduce the resource demand of anti-transfer, we propose to exploit quaternion-valued processing. We design, implement, and evaluate the use of quaternion anti-transfer learning on the basis of the VGG16 architecture and quaternion embeddings on multiple datasets for different speech emotion recognition task setups. The effectiveness of this approach depends on the layer where it is applied, with early layers offering a good compromise between performance gain and resource requirements. Our results show that anti-transfer in the quaternion domain can enhance generalisation while reducing the model’s demand for computation and memory.
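Since the abstract reports that the layer at which the penalty is attached matters, the sketch below shows one way to wire such a loss into an early block of a VGG16-style network using forward hooks. All names here (model, frozen, the .features[2] index, beta) are illustrative assumptions, and it reuses the anti_transfer_loss function sketched earlier in this listing.

    import torch
    import torch.nn.functional as F

    feats = {}
    def grab(name):
        def hook(module, inputs, output):
            feats[name] = output
        return hook

    # Hook an early convolutional block of both networks; early layers
    # were reported as a good compromise between gain and cost.
    model.features[2].register_forward_hook(grab('current'))
    frozen.features[2].register_forward_hook(grab('orthogonal'))

    def training_step(x, y, beta=0.1):
        logits = model(x)
        with torch.no_grad():
            frozen(x)  # populates feats['orthogonal']
        penalty = anti_transfer_loss(feats['current'], feats['orthogonal'])
        return F.cross_entropy(logits, y) + beta * penalty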
L3DAS22 Challenge: Learning 3D Audio Sources in a Real Office Environment
The L3DAS22 Challenge is aimed at encouraging the development of machine learning strategies for 3D speech enhancement and 3D sound localization and detection in office-like environments. This challenge improves and extends the tasks of the L3DAS21 edition. We generated a new dataset that maintains the same general characteristics of the L3DAS21 datasets, but with an extended number of data points and added constraints that improve the baseline model's efficiency and overcome the major difficulties encountered by the participants of the previous challenge. We updated the baseline model of Task 1, using the architecture that ranked first in the previous challenge edition. We wrote a new supporting API, improving its clarity and ease of use. Finally, we present and discuss the results submitted by all participants. L3DAS22 Challenge website: www.l3das.com/icassp2022
Learning Speech Emotion Representations in the Quaternion Domain
The modeling of human emotion expression in speech signals is an important, yet challenging task. The high resource demand of speech emotion recognition models, combined with the general scarcity of emotion-labelled data, is an obstacle to the development and application of effective solutions in this field. In this paper, we present an approach to jointly circumvent these difficulties. Our method, named RH-emo, is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monoaural spectrograms, enabling the use of quaternion-valued networks for speech emotion recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel to a real-valued emotion classifier and a quaternion-valued decoder. On the one hand, the classifier permits the optimization of each latent axis of the embeddings for the classification of a specific emotion-related characteristic: valence, arousal, dominance, and overall emotion. On the other hand, quaternion reconstruction enables the latent dimensions to develop the intra-channel correlations that are required for an effective representation as a quaternion entity. We test our approach on speech emotion recognition tasks using four popular datasets: IEMOCAP, RAVDESS, EmoDB, and TESS, comparing the performance of three well-established real-valued CNN architectures (AlexNet, ResNet-50, VGG) and their quaternion-valued equivalents fed with the embeddings created with RH-emo. We obtain a consistent improvement in test accuracy for all datasets, while drastically reducing the models' resource demand. Moreover, additional experiments and ablation studies confirm the effectiveness of our approach.
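To make the parameter saving concrete, here is a minimal sketch of a quaternion convolution layer of the kind such networks build on. It is an assumed, simplified formulation (class name, initialization, and channel layout are illustrative): the four input channel groups are mixed through the Hamilton product, so the four weight blocks are shared across components and the layer needs roughly a quarter of the parameters of a real-valued convolution of the same nominal width.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QuaternionConv2d(nn.Module):
        # Sketch of a quaternion convolution: channels are split into the
        # four quaternion components (r, i, j, k) and recombined via the
        # Hamilton product of the quaternion kernel with the input.
        def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
            super().__init__()
            assert in_ch % 4 == 0 and out_ch % 4 == 0
            ic, oc = in_ch // 4, out_ch // 4
            shape = (oc, ic, kernel_size, kernel_size)
            self.r = nn.Parameter(torch.randn(shape) * 0.01)
            self.i = nn.Parameter(torch.randn(shape) * 0.01)
            self.j = nn.Parameter(torch.randn(shape) * 0.01)
            self.k = nn.Parameter(torch.randn(shape) * 0.01)
            self.padding = padding

        def forward(self, x):
            xr, xi, xj, xk = torch.chunk(x, 4, dim=1)
            c = lambda t, w: F.conv2d(t, w, padding=self.padding)
            # Hamilton product, component by component.
            yr = c(xr, self.r) - c(xi, self.i) - c(xj, self.j) - c(xk, self.k)
            yi = c(xi, self.r) + c(xr, self.i) + c(xk, self.j) - c(xj, self.k)
            yj = c(xj, self.r) - c(xk, self.i) + c(xr, self.j) + c(xi, self.k)
            yk = c(xk, self.r) + c(xj, self.i) - c(xi, self.j) + c(xr, self.k)
            return torch.cat([yr, yi, yj, yk], dim=1)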
Coxiella endosymbiont of Rhipicephalus microplus modulates tick physiology with a major impact in blood feeding capacity
In the past decade, metagenomics studies exploring tick microbiota have revealed widespread interactions between bacteria and arthropods, including symbiotic interactions. Functional studies showed that obligate endosymbionts contribute to tick biology, affecting reproductive fitness and molting. Understanding the molecular basis of the interaction between ticks and their mutualist endosymbionts may help to develop control methods based on microbiome manipulation. Previously, we showed that Rhipicephalus microplus larvae with reduced levels of the Coxiella endosymbiont of R. microplus (CERM) were arrested at the metanymph life stage (partially engorged nymph) and did not molt into adults. In this study, we performed a transcriptomic differential analysis of the R. microplus metanymph in the presence and absence of its mutualist endosymbiont. The lack of CERM resulted in an altered expression profile of transcripts from several functional categories. Gene products such as DA-P36, protease inhibitors, metalloproteases, and evasins, which are involved in blood feeding capacity, were underexpressed in CERM-free metanymphs. Dysregulation of genes related to extracellular matrix remodeling was also observed in the absence of the symbiont. Taken together, the observed alterations in gene expression may explain the blockage of development at the metanymph stage and reveal a novel physiological aspect of the symbiont-tick-vertebrate host interaction.
Implementation and Characterization of Vibrotactile Interfaces
While a standard approach is more or less established for rendering basic vibratory cues in consumer electronics, the implementation of advanced vibrotactile feedback still requires designers and engineers to solve a number of technical issues. Several off-the-shelf vibration actuators are currently available, with different characteristics and limitations that should be considered in the design process. We suggest an iterative approach to design in which vibrotactile interfaces are validated by testing their accuracy in rendering vibratory cues and in measuring input gestures. Several examples of prototype interfaces yielding audio-haptic feedback are described, ranging from open-ended devices to musical interfaces, addressing their design and the characterization of their vibratory output.