
    Robust Methods for the Automatic Quantification and Prediction of Affect in Spoken Interactions

    Emotional expression plays a key role in interactions as it communicates the necessary context needed for understanding the behaviors and intentions of individuals. Therefore, a speech-based Artificial Intelligence (AI) system that can recognize and interpret emotional expression has many potential applications with measurable impact on a variety of areas, including human-computer interaction (HCI) and healthcare. However, there are several factors that make speech emotion recognition (SER) a difficult task; these factors include variability in speech data, variability in emotion annotations, and data sparsity. This dissertation explores methodologies for improving the robustness of the automatic recognition of emotional expression from speech by addressing the impacts of these factors on various aspects of the SER system pipeline. For addressing speech data variability in SER, we propose modeling techniques that improve SER performance by leveraging short-term dynamical properties of speech. Furthermore, we demonstrate how data augmentation improves SER robustness to speaker variations. Lastly, we discover that we can make more accurate predictions of emotion by considering the fine-grained interactions between the acoustic and lexical components of speech. For addressing the variability in emotion annotations, we propose SER modeling techniques that account for the behaviors of annotators (i.e., annotators' reaction delay) to improve time-continuous SER robustness. For addressing data sparsity, we investigate two methods that enable us to learn robust embeddings, which highlight the differences that exist between neutral speech and emotionally expressive speech, without requiring emotion annotations. In the first method, we demonstrate how emotionally charged vocal expressions change speaker characteristics as captured by embeddings extracted from a speaker identification model, and we propose the use of these embeddings in SER applications. In the second method, we propose a framework for learning emotion embeddings using audio-textual data that is not annotated for emotion. The unification of the methods and results presented in this thesis helps enable the development of more robust SER systems, making key advancements toward an interactive speech-based AI system that is capable of recognizing and interpreting human behaviors.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/166106/1/aldeneh_1.pd
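    One of the annotation-variability ideas above, compensating for annotators' reaction delay in time-continuous SER, lends itself to a small worked example. The sketch below is an illustration only, not the dissertation's method: it estimates the lag at which continuous labels best trail a simple acoustic proxy (frame energy here) via normalized correlation, then shifts the labels back by that lag; the proxy signal, the search criterion, and all names are assumptions.

```python
import numpy as np

def estimate_delay(acoustic_proxy: np.ndarray, annotation: np.ndarray, max_lag: int) -> int:
    """Return the lag (in frames) at which the annotation best trails the acoustics."""
    a = (acoustic_proxy - acoustic_proxy.mean()) / (acoustic_proxy.std() + 1e-8)
    best_lag, best_corr = 0, -np.inf
    for lag in range(max_lag + 1):
        b = annotation[lag:lag + len(a)]
        if len(b) < len(a):
            break                                   # annotation track too short for this lag
        b = (b - b.mean()) / (b.std() + 1e-8)
        corr = float(np.dot(a, b)) / len(a)         # normalized cross-correlation at this lag
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

# Toy data: pretend the annotators reacted ~30 frames late.
rng = np.random.default_rng(0)
signal = rng.standard_normal(700)
frame_energy = signal[:500]                         # stand-in for a per-frame acoustic feature
labels = np.roll(signal, 30)                        # delayed copy standing in for annotations

lag = estimate_delay(frame_energy, labels, max_lag=100)
aligned_labels = labels[lag:lag + len(frame_energy)]  # shift labels back before training
print(lag)                                            # -> 30
```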

    BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

    We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state of the art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw text into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase the state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark, and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
    Comment: v1.1 (fixed typos)
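    The two-stage design described above (an autoregressive Transformer that emits discrete speechcodes, followed by a streamable convolutional decoder) can be sketched structurally in a few dozen lines. This is a hedged toy sketch of that shape, not the BASE TTS implementation: the layer counts, vocabulary sizes, upsampling factors, and class names are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeechcodeLM(nn.Module):
    """Autoregressive Transformer over concatenated text tokens and speechcodes."""
    def __init__(self, text_vocab=256, code_vocab=1024, d_model=512, n_layers=6):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.code_emb = nn.Embedding(code_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, code_vocab)

    def forward(self, text_ids, code_ids):
        x = torch.cat([self.text_emb(text_ids), self.code_emb(code_ids)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=causal)
        # Logits over speechcodes at the code positions; with teacher forcing the
        # targets would be the codes shifted by one step.
        return self.head(h[:, text_ids.size(1):, :])

class ConvDecoder(nn.Module):
    """Convolutional decoder that upsamples speechcode embeddings to a waveform."""
    def __init__(self, code_vocab=1024, d_model=512):
        super().__init__()
        self.emb = nn.Embedding(code_vocab, d_model)
        self.net = nn.Sequential(
            nn.ConvTranspose1d(d_model, 128, kernel_size=16, stride=16),
            nn.GELU(),
            nn.ConvTranspose1d(128, 1, kernel_size=16, stride=16),
        )

    def forward(self, code_ids):
        x = self.emb(code_ids).transpose(1, 2)   # (batch, d_model, n_codes)
        return self.net(x).squeeze(1)            # (batch, n_samples)

text = torch.randint(0, 256, (1, 20))
codes = torch.randint(0, 1024, (1, 50))
logits = SpeechcodeLM()(text, codes)             # (1, 50, 1024) next-code logits
wave = ConvDecoder()(codes)                      # (1, ~12800) toy waveform
print(logits.shape, wave.shape)
```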

    Towards Better Understanding of Spoken Conversations: Assessment of Emotion and Sentiment

    Emotions play a vital role in our daily life as they help us convey information that is impossible to express verbally to other parties. While humans can easily perceive emotions, these are notoriously difficult to define and recognize by machines. However, automatically detecting the emotion of a spoken conversation can be useful for a diverse range of applications such as human-machine interaction and conversation analysis. In this thesis, we present several approaches based on machine learning to recognize emotion from isolated utterances and long recordings. Isolated utterances are usually shorter than 10 s in duration and are assumed to contain only one major emotion. One of the main obstacles to achieving high emotion recognition accuracy is the lack of large annotated datasets. We propose to mitigate this problem by using transfer learning and data augmentation techniques. We show that x-vector representations extracted from speaker recognition models (x-vector models) contain emotion-predictive information, and that adapting those models provides significant improvements in emotion recognition performance. To further improve the performance, we propose a novel perceptually motivated data augmentation method, Copy-Paste, on isolated utterances. This method is based on the assumption that the presence of emotions other than neutral dictates a speaker's overall perceived emotion in a recording. As isolated utterances are assumed to contain only one emotion, the proposed models make predictions at the utterance level. However, these models cannot be directly applied to conversations that can contain multiple emotions unless we know the locations of emotion boundaries. In this work, we propose to recognize emotions in conversations by performing frame-level classification, where predictions are made at regular intervals. We compare models trained on isolated utterances and on conversations. We propose a data augmentation method, DiverseCatAugment, based on the attention operation, to improve transformer models. To further improve the performance, we incorporate the turn-taking structure of the conversations into our models. Annotating utterances with emotions is not a simple task, and the effort it requires depends on the number of emotions used for annotation. However, annotation schemes can be changed to reduce annotation effort depending on the application. We consider one such application: predicting customer satisfaction (CSAT) in call center conversations, where the goal is to predict the overall sentiment of the customer. We conduct a comprehensive search for adequate acoustic and lexical representations at different levels of granularity in conversations. We show that the methods that use transfer learning (x-vectors and CSAT Tracker) perform best. Our error analysis shows that calls where customers accomplished their goal but were still dissatisfied are the most difficult to predict correctly, and that the customer's speech is more emotional than the agent's.
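    The Copy-Paste idea above rests on the stated assumption that any non-neutral emotion present in a recording dictates its overall perceived emotion. A minimal way to act on that assumption is sketched below: concatenate a neutral waveform with an emotional one and give the result the emotional label. The cross-fade, segment ordering, and function names are assumptions for illustration, not the thesis's exact recipe.

```python
import numpy as np

def copy_paste(neutral_wav: np.ndarray, emotional_wav: np.ndarray,
               emotional_label: str, sr: int = 16000, fade_ms: int = 20):
    """Concatenate a neutral and an emotional utterance with a short cross-fade."""
    fade = int(sr * fade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, fade)
    head, tail = neutral_wav.copy(), emotional_wav.copy()
    head[-fade:] *= ramp[::-1]          # fade out the neutral segment
    tail[:fade] *= ramp                 # fade in the emotional segment
    augmented = np.concatenate([head, tail])
    return augmented, emotional_label   # overall label follows the emotional part

# Toy usage with random audio standing in for real recordings.
neutral = np.random.randn(16000).astype(np.float32)
angry = np.random.randn(24000).astype(np.float32)
wav, label = copy_paste(neutral, angry, "angry")
print(wav.shape, label)
```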

    CWI-evaluation - Progress Report 1993-1998


    A Critical Study on the Effect of Dimensionality Reduction on Intrusion Detection in Water Storage Critical Infrastructure

    Supervisory control and data acquisition (SCADA) systems are often imperiled by cyber-attacks, which can often be detected using intrusion detection systems (IDSs). However, the performance and efficiency of IDSs can be affected by several factors, including the quality of the data, the curse of dimensionality, and computational cost. Feature reduction techniques can overcome most of these challenges by eliminating redundant and non-informative features, thereby increasing detection accuracy. This study aims to show the importance of feature reduction for intrusion detection performance. To do this, a multi-modular IDS is designed and connected to the SCADA system of a water storage tank. A comparative study is also performed by employing advanced feature selection and dimensionality reduction techniques. The utilized feature reduction techniques improve the IDS's efficiency by reducing memory usage and providing data of better quality, which in turn increases detection accuracy. The obtained results are analyzed in terms of F1-score and accuracy.
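    As a concrete illustration of the comparison the study describes, the sketch below pits a full-feature baseline against a filter-based feature selector and a PCA projection in front of the same classifier, reporting F1-score and accuracy. The data here is synthetic and the model choices are assumptions; in the study the features come from the water-storage SCADA testbed and the IDS is multi-modular.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in for SCADA intrusion data (attacks are the minority class).
X, y = make_classification(n_samples=2000, n_features=60, n_informative=12,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

candidates = {
    "all-features": Pipeline([("clf", RandomForestClassifier(random_state=0))]),
    "select-k-best": Pipeline([("fs", SelectKBest(mutual_info_classif, k=12)),
                               ("clf", RandomForestClassifier(random_state=0))]),
    "pca": Pipeline([("dr", PCA(n_components=12)),
                     ("clf", RandomForestClassifier(random_state=0))]),
}

for name, pipe in candidates.items():
    pred = pipe.fit(X_tr, y_tr).predict(X_te)
    print(f"{name:>14}: F1={f1_score(y_te, pred):.3f} "
          f"acc={accuracy_score(y_te, pred):.3f}")
```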