874 research outputs found

    Improving the Generalizability of Speech Emotion Recognition: Methods for Handling Data and Label Variability

    Full text link
    Emotion is an essential component in our interaction with others. It transmits information that helps us interpret the content of what others say. Therefore, detecting emotion from speech is an important step towards enabling machine understanding of human behaviors and intentions. Researchers have demonstrated the potential of emotion recognition in areas such as interactive systems in smart homes and mobile devices, computer games, and computational medical assistants. However, emotion communication is variable: individuals may express emotion in a manner that is uniquely their own; different speech content and environments may shape how emotion is expressed and recorded; individuals may perceive emotional messages differently. Practically, this variability is reflected in both the audio-visual data and the labels used to create speech emotion recognition (SER) systems. SER systems must be robust and generalizable to handle the variability effectively. The focus of this dissertation is on the development of speech emotion recognition systems that handle variability in emotion communications. We break the dissertation into three parts, according to the type of variability we address: (I) in the data, (II) in the labels, and (III) in both the data and the labels. Part I: The first part of this dissertation focuses on handling variability present in data. We approximate variations in environmental properties and expression styles by corpus and gender of the speakers. We find that training on multiple corpora and controlling for the variability in gender and corpus using multi-task learning result in more generalizable models, compared to the traditional single-task models that do not take corpus and gender variability into account. Another source of variability present in the recordings used in SER is the phonetic modulation of acoustics. On the other hand, phonemes also provide information about the emotion expressed in speech content. We discover that we can make more accurate predictions of emotion by explicitly considering both roles of phonemes. Part II: The second part of this dissertation addresses variability present in emotion labels, including the differences between emotion expression and perception, and the variations in emotion perception. We discover that it is beneficial to jointly model both the perception of others and how one perceives one’s own expression, compared to focusing on either one. Further, we show that the variability in emotion perception is a modelable signal and can be captured using probability distributions that describe how groups of evaluators perceive emotional messages. Part III: The last part of this dissertation presents methods that handle variability in both data and labels. We reduce the data variability due to non-emotional factors using deep metric learning and model the variability in emotion perception using soft labels. We propose a family of loss functions and show that by pairing examples that potentially vary in expression styles and lexical content and preserving the real-valued emotional similarity between them, we develop systems that generalize better across datasets and are more robust to over-training. These works demonstrate the importance of considering data and label variability in the creation of robust and generalizable emotion recognition systems. We conclude this dissertation with the following future directions: (1) the development of real-time SER systems; (2) the personalization of general SER systems.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/147639/1/didizbq_1.pd

    Leveraging Multi-Modal Sensing for Mobile Health: A Case Review in Chronic Pain

    Get PDF
    Active and passive mobile sensing has garnered much attention in recent years. In this paper, we focus on chronic pain measurement and management as a case application to exemplify the state of the art. We present a consolidated discussion on the leveraging of various sensing modalities along with modular server-side and on-device architectures required for this task. Modalities included are: activity monitoring from accelerometry and location sensing, audio analysis of speech, image processing for facial expressions as well as modern methods for effective patient self-reporting. We review examples that deliver actionable information to clinicians and patients while addressing privacy, usability, and computational constraints. We also discuss open challenges in the higher level inferencing of patient state and effective feedback with potential directions to address them. The methods and challenges presented here are also generalizable and relevant to a broad range of other applications in mobile sensing

    Interaction intermodale dans les réseaux neuronaux profonds pour la classification et la localisation d'évènements audiovisuels

    Get PDF
    La compréhension automatique du monde environnant a de nombreuses applications telles que la surveillance et sécurité, l'interaction Homme-Machine, la robotique, les soins de santé, etc. Plus précisément, la compréhension peut s'exprimer par le biais de différentes taches telles que la classification et localisation dans l'espace d'évènements. Les êtres vivants exploitent un maximum de l'information disponible pour comprendre ce qui les entoure. En s'inspirant du comportement des êtres vivants, les réseaux de neurones artificiels devraient également utiliser conjointement plusieurs modalités, par exemple, la vision et l'audition. Premièrement, les modèles de classification et localisation, basés sur l'information audio-visuelle, doivent être évalués de façon objective. Nous avons donc enregistré une nouvelle base de données pour compléter les bases actuellement disponibles. Comme aucun modèle audio-visuel de classification et localisation n'existe, seule la partie sonore de la base est évaluée avec un modèle de la littérature. Deuxièmement, nous nous concentrons sur le cœur de la thèse: comment utiliser conjointement de l'information visuelle et sonore pour résoudre une tâche spécifique, la reconnaissance d'évènements. Le cerveau n'est pas constitué d'une "simple" fusion mais comprend de multiples interactions entre les deux modalités. Il y a un couplage important entre le traitement de l'information visuelle et sonore. Les réseaux de neurones offrent la possibilité de créer des interactions entre les modalités en plus de la fusion. Dans cette thèse, nous explorons plusieurs stratégies pour fusionner les modalités visuelles et sonores et pour créer des interactions entre les modalités. Ces techniques ont les meilleures performances en comparaison aux architectures de l'état de l'art au moment de la publication. Ces techniques montrent l'utilité de la fusion audio-visuelle mais surtout l'importance des interactions entre les modalités. Pour conclure la thèse, nous proposons un réseau de référence pour la classification et localisation d'évènements audio-visuels. Ce réseau a été testé avec la nouvelle base de données. Les modèles précédents de classification sont modifiés pour prendre en compte la localisation dans l'espace en plus de la classification.Abstract: The automatic understanding of the surrounding world has a wide range of applications, including surveillance, human-computer interaction, robotics, health care, etc. The understanding can be expressed in several ways such as event classification and its localization in space. Living beings exploit a maximum of the available information to understand the surrounding world. Artificial neural networks should build on this behavior and jointly use several modalities such as vision and hearing. First, audio-visual networks for classification and localization must be evaluated objectively. We recorded a new audio-visual dataset to fill a gap in the current available datasets. We were not able to find audio-visual models for classification and localization. Only the dataset audio part is evaluated with a state-of-the-art model. Secondly, we focus on the main challenge of the thesis: How to jointly use visual and audio information to solve a specific task, event recognition. The brain does not comprise a simple fusion but has multiple interactions between the two modalities to create a strong coupling between them. The neural networks offer the possibility to create interactions between the two modalities in addition to the fusion. We explore several strategies to fuse the audio and visual modalities and to create interactions between modalities. These techniques have the best performance compared to the state-of-the-art architectures at the time of publishing. They show the usefulness of audio-visual fusion but above all the contribution of the interaction between modalities. To conclude, we propose a benchmark for audio-visual classification and localization on the new dataset. Previous models for the audio-visual classification are modified to address the localization in addition to the classification

    A Review of Audio-Visual Speech Recognition

    Get PDF
    Speech is the most important tool of interaction among human beings. This has inspired researchers to study further on speech recognition and develop a computer system that is able to integrate and understand human speech. But acoustic noisy environment can highly contaminate audio speech and affect the overall recognition performance. Thus, Audio-Visual Speech Recognition (AVSR) is designed to overcome the problems by utilising visual images which are unaffected by noise. The aim of this paper is to discuss the AVSR structures, which includes the front end processes, audio-visual data corpus used, recent works and accuracy estimation methods

    Learning Sensory Representations with Minimal Supervision

    Get PDF

    Multimodal Emotion Recognition among Couples from Lab Settings to Daily Life using Smartwatches

    Full text link
    Couples generally manage chronic diseases together and the management takes an emotional toll on both patients and their romantic partners. Consequently, recognizing the emotions of each partner in daily life could provide an insight into their emotional well-being in chronic disease management. The emotions of partners are currently inferred in the lab and daily life using self-reports which are not practical for continuous emotion assessment or observer reports which are manual, time-intensive, and costly. Currently, there exists no comprehensive overview of works on emotion recognition among couples. Furthermore, approaches for emotion recognition among couples have (1) focused on English-speaking couples in the U.S., (2) used data collected from the lab, and (3) performed recognition using observer ratings rather than partner's self-reported / subjective emotions. In this body of work contained in this thesis (8 papers - 5 published and 3 currently under review in various journals), we fill the current literature gap on couples' emotion recognition, develop emotion recognition systems using 161 hours of data from a total of 1,051 individuals, and make contributions towards taking couples' emotion recognition from the lab which is the status quo, to daily life. This thesis contributes toward building automated emotion recognition systems that would eventually enable partners to monitor their emotions in daily life and enable the delivery of interventions to improve their emotional well-being.Comment: PhD Thesis, 2022 - ETH Zuric

    A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks

    Full text link
    Transformer is a deep neural network that employs a self-attention mechanism to comprehend the contextual relationships within sequential data. Unlike conventional neural networks or updated versions of Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM), transformer models excel in handling long dependencies between input sequence elements and enable parallel processing. As a result, transformer-based models have attracted substantial interest among researchers in the field of artificial intelligence. This can be attributed to their immense potential and remarkable achievements, not only in Natural Language Processing (NLP) tasks but also in a wide range of domains, including computer vision, audio and speech processing, healthcare, and the Internet of Things (IoT). Although several survey papers have been published highlighting the transformer's contributions in specific fields, architectural differences, or performance evaluations, there is still a significant absence of a comprehensive survey paper encompassing its major applications across various domains. Therefore, we undertook the task of filling this gap by conducting an extensive survey of proposed transformer models from 2017 to 2022. Our survey encompasses the identification of the top five application domains for transformer-based models, namely: NLP, Computer Vision, Multi-Modality, Audio and Speech Processing, and Signal Processing. We analyze the impact of highly influential transformer-based models in these domains and subsequently classify them based on their respective tasks using a proposed taxonomy. Our aim is to shed light on the existing potential and future possibilities of transformers for enthusiastic researchers, thus contributing to the broader understanding of this groundbreaking technology
    • …
    corecore