    TOWARDS BUILDING GENERALIZABLE SPEECH EMOTION RECOGNITION MODELS

    Abstract: Detecting the mental state of a person has implications in psychiatry, medicine, psychology, and human-computer interaction, among others. It includes (but is not limited to) a wide variety of problems such as emotion detection, valence-affect-dominance prediction, mood detection, and detection of clinical depression. In this thesis we focus primarily on emotion recognition. Like any recognition system, building an emotion recognition model consists of two steps: (1) extraction of meaningful features that help in classification, and (2) development of an appropriate classifier. Because speech is non-invasive and easy to collect, it is a popular source of such features. However, an ideal system should be agnostic to speaker and channel effects. While feature normalization schemes can counter these problems to some extent, we still see a drastic drop in performance when the training and test datasets are mismatched. In this dissertation we explore novel ways of building models that are more robust to speaker and domain differences.

    Training discriminative classifiers involves learning a conditional distribution p(y_i|x_i), given a set of feature vectors x_i and the corresponding labels y_i, i = 1, ..., N. For a classifier to generalize rather than overfit to the training data, the resulting conditional distribution p(y_i|x_i) should vary smoothly over the inputs x_i. Adversarial training procedures enforce this smoothness using manifold regularization techniques, which make the model's output distribution more robust to local perturbations added to a data point x_i. In the first part of the dissertation, we investigate two training procedures: (i) adversarial training, where we determine the perturbation direction based on the given labels for the training data, and (ii) virtual adversarial training, where we determine the perturbation direction based only on the output distribution of the training data. We demonstrate the efficacy of adversarial training procedures by performing a k-fold cross-validation experiment on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus and a cross-corpus performance analysis on three separate corpora. We compare their performance to that of models using other regularization schemes, such as L1/L2 and a graph-based manifold regularization scheme. Results show improvement over a purely supervised approach, as well as better generalization to cross-corpus settings.
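    As a concrete illustration of the virtual adversarial variant, the following is a minimal PyTorch sketch (not the dissertation's code; names and hyperparameters are illustrative) that estimates the perturbation direction from the model's output distribution alone and penalizes output change along it:

        import torch
        import torch.nn.functional as F

        def virtual_adversarial_perturbation(model, x, xi=1e-6, eps=2.5, n_iters=1):
            """Estimate the direction around x that most changes the model's
            output distribution (power-iteration scheme of Miyato et al.)."""
            with torch.no_grad():
                p = F.softmax(model(x), dim=1)               # reference outputs
            d = torch.randn_like(x)                          # random start direction
            for _ in range(n_iters):
                d = xi * F.normalize(d.flatten(1), dim=1).view_as(x)
                d.requires_grad_(True)
                log_p_hat = F.log_softmax(model(x + d), dim=1)
                dist = F.kl_div(log_p_hat, p, reduction='batchmean')
                d = torch.autograd.grad(dist, d)[0].detach()
            return eps * F.normalize(d.flatten(1), dim=1).view_as(x)

        def vat_loss(model, x):
            """Smoothness penalty: KL between outputs at x and at x + r_adv,
            added to the supervised loss during training."""
            r_adv = virtual_adversarial_perturbation(model, x)
            with torch.no_grad():
                p = F.softmax(model(x), dim=1)
            log_p_hat = F.log_softmax(model(x + r_adv), dim=1)
            return F.kl_div(log_p_hat, p, reduction='batchmean')

    The labeled (non-virtual) variant instead chooses the perturbation direction from the gradient of the supervised loss with respect to x.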
    Our second approach to better discriminating between emotions leverages multi-modal learning and automatic speech recognition (ASR) systems to improve the generalizability of an emotion recognition model that requires only speech as input. Previous studies have shown that emotion recognition models using only acoustic features do not perform satisfactorily in detecting valence, whereas text analysis has been shown to be helpful for sentiment classification. We compared the classification accuracies of an audio-only model, a text-only model, and a multi-modal system leveraging both by performing a cross-validation analysis on the IEMOCAP dataset. Confusion matrices show that it is valence detection that improves when textual information is incorporated. In the second stage of experiments, we used three ASR application programming interfaces (APIs) to obtain transcriptions, and compared the performance of the multi-modal systems using the ASR transcriptions with each other and with that of a system using ground-truth transcriptions. This is followed by a cross-corpus study.

    In the third part of the study we investigate the generalizability of models based on generative adversarial networks (GANs). GANs have gained a lot of attention from the machine learning community due to their ability to learn and mimic an input data distribution. A GAN consists of a generator and a discriminator working in tandem, playing a min-max game to learn an underlying target data distribution when fed data points sampled from a simpler distribution (such as a uniform or Gaussian distribution). Once trained, it allows synthetic generation of examples sampled from the target distribution. We investigate the applicability of GANs for obtaining lower-dimensional representations of the higher-dimensional feature vectors pertinent to emotion recognition, as well as their ability to generate synthetic higher-dimensional feature vectors from points sampled from a lower-dimensional prior. Specifically, we investigate two setups: (i) the lower-dimensional prior from which synthetic feature vectors are generated is pre-defined, and (ii) the distribution of the lower-dimensional prior is learned from training data. We define the metrics used to measure and analyze the performance of these generative models under different train/test conditions, and perform a cross-validation analysis followed by a cross-corpus study.

    Finally, we make an attempt towards understanding the relation between two sub-problems encompassed under mental state detection, namely depression detection and emotion recognition, and propose approaches that can be investigated to build better depression detection models by leveraging our ability to recognize emotions accurately.
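    For reference, a minimal sketch of the GAN min-max game described above, assuming PyTorch; the network sizes, feature dimensionality, and Gaussian prior are illustrative, not the dissertation's actual configuration:

        import torch
        import torch.nn as nn

        Z_DIM, FEAT_DIM = 16, 384    # illustrative: low-dim prior -> high-dim acoustic features

        generator = nn.Sequential(
            nn.Linear(Z_DIM, 128), nn.ReLU(),
            nn.Linear(128, FEAT_DIM),
        )
        discriminator = nn.Sequential(
            nn.Linear(FEAT_DIM, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1),       # real/fake logit
        )

        bce = nn.BCEWithLogitsLoss()
        opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
        opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

        def train_step(real_feats):
            """One round of the min-max game on a batch of real feature vectors."""
            b = real_feats.size(0)
            z = torch.randn(b, Z_DIM)            # pre-defined Gaussian prior (setup i)
            fake_feats = generator(z)

            # discriminator step: push real logits up, fake logits down
            d_loss = bce(discriminator(real_feats), torch.ones(b, 1)) + \
                     bce(discriminator(fake_feats.detach()), torch.zeros(b, 1))
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()

            # generator step: fool the discriminator
            g_loss = bce(discriminator(fake_feats), torch.ones(b, 1))
            opt_g.zero_grad(); g_loss.backward(); opt_g.step()
            return d_loss.item(), g_loss.item()

    Setup (ii) would replace the fixed torch.randn prior with a distribution learned from the training data.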

    Graph neural network for audio representation learning

    Learning audio representations is an important task with many potential applications. Whether it takes the shape of speech, music, or ambient sounds, audio is a common form of data that can communicate rich information, and audio representation learning is a fundamental ingredient of deep learning that can enable more accurate downstream tasks in both audio and video, such as emotion recognition. Learning a good representation, however, is challenging: the representation should contain the information needed to understand the input sound and expose discriminative patterns, which typically necessitates a sizable volume of carefully annotated data and a considerable amount of labour. In this thesis, we propose a set of models for audio representation learning. We address discriminative pattern learning by proposing a graph structure for audio and graph neural networks to process it; our work is the first to consider a graph structure for audio data. In contrast to existing methods that use approximation, our first model uses a manually defined graph structure and a graph convolution layer with an exact graph convolution operation. In the second model, we integrate a graph inception network to expand the manually created graph structure and learn it jointly with the primary objective. In the third model, we address the dearth of annotated data with a semi-supervised graph technique that represents audio corpora as nodes in a graph and connects them, based on label information, in smaller subgraphs. Going beyond earlier works, we also raise the issue of leveraging multimodal data to improve audio representation learning: to accommodate multimodal input, our fourth model incorporates heterogeneous graph data, and we create a new graph architecture to handle it.
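    To make the graph-convolution idea concrete, here is a minimal PyTorch sketch of a single graph convolution layer over audio-frame nodes; the chain-graph structure and dimensions are illustrative assumptions, not the thesis's actual architecture:

        import torch
        import torch.nn as nn

        class GraphConv(nn.Module):
            """One layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
            def __init__(self, in_dim, out_dim):
                super().__init__()
                self.linear = nn.Linear(in_dim, out_dim)

            def forward(self, h, adj):
                # h:   (num_nodes, in_dim) node features, e.g. per-frame spectra
                # adj: (num_nodes, num_nodes) binary adjacency between frames
                a_hat = adj + torch.eye(adj.size(0))       # add self-loops
                d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
                a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt   # symmetric normalization
                return torch.relu(self.linear(a_norm @ h))

        # illustrative: chain graph linking each of 100 audio frames to its neighbours
        h = torch.randn(100, 40)
        adj = torch.diag(torch.ones(99), 1) + torch.diag(torch.ones(99), -1)
        out = GraphConv(40, 64)(h, adj)                    # (100, 64) node embeddings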

    Personality Traits of GitHub Maintainers and Their Effects on Project Success

    Online collaborative environments have become important virtual workplaces for developers to work on a common problem. GitHub is an example of such an environment, hosting a wealth of open source software projects. Questions such as "Who contributes to successful projects?" and "What are the characteristics of lead developers?" require further investigation. We qualitatively identify 211 maintainers in 25 maintained and 23 unmaintained repositories on GitHub. We measure their Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) as weighted sums of their Linguistic Inquiry and Word Count (LIWC) dimensions. Our results indicate that maintainers and non-maintainers differ significantly in virtually all personality traits except Neuroticism. Maintainers in maintained repositories tend to be more open, but less extraverted and less agreeable, than maintainers in unmaintained repositories. In addition to Agreeableness being a significant predictor, our analysis suggests that the success of a repository may be explained by the absolute differences in personality traits between maintainers and non-maintainers. In sum, our work aims to understand the role of a maintainer and the effects of personality traits on project success. Our findings have direct implications: developers can become more cognizant of their own behaviour, as well as that of their colleagues, which can result in better collaboration. By highlighting personality differences, we show that studying social and psychological constructs can be invaluable for understanding group dynamics during collaborative processes.
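    To illustrate the trait-scoring step, here is a small Python sketch of a weighted sum over LIWC dimension scores; the dimensions and weights below are placeholders, not the coefficients used in the study:

        # Hypothetical LIWC dimension scores for one developer (percent of words)
        liwc = {"i": 4.2, "we": 1.1, "posemo": 2.7, "negemo": 0.9, "social": 8.3}

        # Placeholder trait weights; the study relies on a published weighting scheme
        weights = {
            "Extraversion": {"we": 0.3, "posemo": 0.25, "social": 0.3, "i": -0.1},
            "Neuroticism":  {"negemo": 0.4, "i": 0.2},
        }

        def trait_score(trait):
            """Weighted sum of LIWC dimension scores for one Big Five trait."""
            return sum(w * liwc.get(dim, 0.0) for dim, w in weights[trait].items())

        for trait in weights:
            print(trait, round(trait_score(trait), 2))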

    Exploiting physiological changes during the flow experience for assessing virtual-reality game design.

    Immersive experiences are considered the principal attraction of video games. Achieving a healthy balance between the game's demands and the user's skills is a particularly challenging goal, yet a coveted outcome, as it gives rise to the flow experience, a mental state of deep concentration and game engagement. When this balance fractures, the player may experience considerable disinclination to continue playing, which may be a product of anxiety or boredom. Being able to predict manifestations of these psychological states in video game players is therefore essential for understanding player motivation and designing better games. To this end, we build on earlier work to evaluate flow dynamics from a physiological perspective using a custom video game. Although advancements in this area are growing, little consideration has been given to the interpersonal characteristics that may influence the expression of the flow experience. In this thesis, two angles are introduced that remain poorly understood. First, the investigation is contextualized in the virtual reality domain, a technology that putatively amplifies affective experiences yet is still insufficiently addressed in the flow literature. Second, a novel analysis setup is proposed, whereby recorded physiological responses and psychometric self-ratings are combined to assess the effectiveness of our game's design in a series of experiments. The analysis workflow employed heart rate, eye-blink variability, and electroencephalography (EEG) as objective measures of the game's impact, and self-reports as subjective measures. These inputs were submitted to a clustering method, and the cluster membership of each observation was cross-referenced with the self-report ratings of the player it originated from. This information was then used to inform specialized decoders of the flow state from the physiological responses, an approach that enabled classifiers to operate at high accuracy in all our studies. Furthermore, we addressed compressing a medium-resolution EEG sensor set down to the minimal subset required to decode flow. Overall, our findings suggest that the approaches employed in this thesis have wide applicability and potential for improving game design practices.
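    A minimal sketch of this cluster-then-decode pipeline, assuming scikit-learn and synthetic stand-in data; the actual feature set, cluster count, and decoder in the thesis may differ:

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.svm import SVC
        from sklearn.model_selection import cross_val_score

        # Stand-in data: per-trial physiological features (e.g. heart rate,
        # blink variability, EEG band power) and self-reported flow ratings
        rng = np.random.default_rng(0)
        X = rng.normal(size=(120, 10))
        ratings = rng.uniform(1, 7, size=120)

        # 1) cluster the physiological responses
        clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

        # 2) cross-reference cluster membership with self-reports: treat the
        #    cluster with the higher mean flow rating as the "flow" cluster
        flow_id = int(ratings[clusters == 1].mean() > ratings[clusters == 0].mean())
        y = (clusters == flow_id).astype(int)

        # 3) train a specialized decoder of the flow state from physiology
        acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
        print(f"cross-validated decoding accuracy: {acc:.2f}")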

    Human-centric explanation facilities
