
    Distributing Recognition in Computational Paralinguistics


    Recent developments in openSMILE, the Munich open-source multimedia feature extractor

    We present recent developments in the openSMILE feature extraction toolkit. Version 2.0 now unites feature extraction paradigms from speech, music, and general sound events with basic video features for multi-modal processing. Descriptors from audio and video can be processed jointly in a single framework allowing for time synchronization of parameters, on-line incremental processing as well as off-line and batch processing, and the extraction of statistical functionals (feature summaries), such as moments, peaks, regression parameters, etc. Postprocessing of the features includes statistical classifiers such as support vector machine models or file export for popular toolkits such as Weka or HTK. Available low-level descriptors include popular speech, music and video features including Mel-frequency and similar cepstral and spectral coefficients, Chroma, CENS, auditory model based loudness, voice quality, local binary pattern, color, and optical flow histograms. Besides, voice activity detection, pitch tracking and face detection are supported. openSMILE is implemented in C++, using standard open source libraries for on-line audio and video input. It is fast, runs on Unix and Windows platforms, and has a modular, component based architecture which makes extensions via plug-ins easy. openSMILE 2.0 is distributed under a research license and can be downloaded from the project website.
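
    A minimal sketch (not from the paper) of extracting functionals with the opensmile Python wrapper, a later companion to the C++ SMILExtract tool described above; the feature set and file name are assumptions.

```python
# Sketch: batch feature extraction with the opensmile Python wrapper.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,     # large brute-forced set
    feature_level=opensmile.FeatureLevel.Functionals,  # statistical summaries
)

features = smile.process_file("example.wav")  # one row of functionals per file
print(features.shape)
```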

    Computer audition for emotional wellbeing

    This thesis is focused on the application of computer audition (i.e., machine listening) methodologies for monitoring states of emotional wellbeing. Computer audition is a growing field and has been successfully applied to an array of use cases in recent years. There are several advantages to audio-based computational analysis; for example, audio can be recorded non-invasively, stored economically, and can capture rich information on happenings in a given environment, e.g., human behaviour. With this in mind, maintaining emotional wellbeing is a challenge for humans, and emotion-altering conditions, including stress and anxiety, have become increasingly common in recent years. Such conditions manifest in the body, inherently changing how we express ourselves. Research shows these alterations are perceivable within vocalisation, suggesting that speech-based audio monitoring may be valuable for developing artificially intelligent systems that target improved wellbeing. Furthermore, computer audition applies machine learning and other computational techniques to audio understanding, and so by combining computer audition with applications in the domain of computational paralinguistics and emotional wellbeing, this research concerns the broader field of empathy for Artificial Intelligence (AI). To this end, speech-based audio modelling that incorporates and understands paralinguistic wellbeing-related states may be a vital cornerstone for improving the degree of empathy that an artificial intelligence has. To summarise, this thesis investigates the extent to which speech-based computer audition methodologies can be utilised to understand human emotional wellbeing. A fundamental background on the fields in question as they pertain to emotional wellbeing is first presented, followed by an outline of the applied audio-based methodologies. Next, detail is provided for several machine learning experiments focused on emotional wellbeing applications, including analysis and recognition of under-researched phenomena in speech, e.g., anxiety and markers of stress. Core contributions from this thesis include the collection of several related datasets, hybrid fusion strategies for an emotional gold standard, novel machine learning strategies for data interpretation, and an in-depth acoustic-based computational evaluation of several human states. All of these contributions focus on ascertaining the advantage of audio in the context of modelling emotional wellbeing. Given the sensitive nature of human wellbeing, the ethical implications involved in developing and applying such systems are discussed throughout.
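
    A hedged sketch of the kind of speech-based pipeline such experiments typically involve: per-utterance acoustic functionals fed to an SVM under speaker-independent cross-validation. All data, labels, and speaker IDs below are synthetic placeholders, not material from the thesis.

```python
# Sketch: functionals + SVM with speaker-grouped cross-validation.
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 88))            # e.g., 88 eGeMAPS functionals per utterance
y = rng.integers(0, 2, size=200)          # e.g., low vs. high reported anxiety
speakers = rng.integers(0, 20, size=200)  # speaker IDs for grouped CV

clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
scores = cross_val_score(clf, X, y, groups=speakers,
                         cv=GroupKFold(n_splits=5), scoring="balanced_accuracy")
print("balanced accuracy per fold:", scores)
```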

    Time- and value-continuous explainable affect estimation in-the-wild

    Today, the relevance of Affective Computing, i.e., of making computers recognise and simulate human emotions, cannot be overstated. All technology giants (from manufacturers of laptops to mobile phones to smart speakers) are in a fierce competition to make their devices understand not only what is being said, but also how it is being said, in order to recognise the user’s emotions. The goals have evolved from predicting the basic emotions (e.g., happy, sad) to now the more nuanced affective states (e.g., relaxed, bored) in real time. The databases used in such research have evolved too, from featuring acted behaviours earlier to spontaneous behaviours now. There is a more powerful shift lately, called in-the-wild affect recognition, i.e., taking the research out of the laboratory into the uncontrolled real world. This thesis discusses, for the very first time, affect recognition for two unique in-the-wild audiovisual databases, GRAS2 and SEWA. GRAS2 is, to date, the only database with time- and value-continuous affect annotations for Labov effect-free affective behaviours, i.e., without the participant’s awareness of being recorded (which otherwise is known to affect the naturalness of one’s affective behaviour). SEWA features participants from six different cultural backgrounds, conversing using a video-calling platform. Thus, SEWA features in-the-wild recordings further corrupted by unpredictable artifacts, such as network-induced delays, frame freezing, and echoes. The two databases present a unique opportunity to study time- and value-continuous affect estimation that is truly in-the-wild. A novel ‘Evaluator Weighted Estimation’ formulation is proposed to generate a gold-standard sequence from several annotations. An illustration is presented demonstrating that the moving bag-of-words (BoW) representation better preserves the temporal context of the features, while remaining more robust against outliers compared to other statistical summaries, e.g., the moving average. A novel, data-independent randomised codebook is proposed for the BoW representation; it is especially useful for cross-corpus model generalisation testing when the feature spaces of the databases differ drastically. Various deep learning models and support vector regressors are used to predict affect dimensions time- and value-continuously. The better generalisability of the models trained on GRAS2, despite the smaller training size, makes a strong case for the collection and use of Labov effect-free data. A further foundational contribution is the discovery of the missing many-to-many mapping between the mean square error (MSE) and the concordance correlation coefficient (CCC), i.e., between two of the most popular utility functions to date. The newly invented cost function |MSE_{XY}/σ_{XY}| has been evaluated in experiments aimed at demystifying the inner workings of a well-performing, simple, low-cost neural network that effectively utilises the BoW text features. Also proposed herein is the shallowest possible convolutional neural network (CNN) that uses the facial action unit (FAU) features. The CNN exploits sequential context, but unlike RNNs, it also inherently allows data- and process-parallelism. Interestingly, for the most part, these white-box AI models have been shown to utilise the provided features in a way consistent with the human perception of emotion expression.
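
    A sketch making the MSE/CCC relationship mentioned above concrete: with population statistics, MSE = var_x + var_y + (mean_x - mean_y)^2 - 2*cov_xy, so CCC = 2*cov_xy / (MSE + 2*cov_xy). The toy data and variable names below are illustrative only, not the thesis's experiments.

```python
# Sketch: MSE, CCC, and the |MSE_XY / sigma_XY| cost on a toy prediction trace.
import numpy as np

def ccc(x, y):
    mx, my = x.mean(), y.mean()
    sxy = ((x - mx) * (y - my)).mean()            # covariance
    return 2.0 * sxy / (x.var() + y.var() + (mx - my) ** 2)

def mse(x, y):
    return ((x - y) ** 2).mean()

def mse_over_cov(x, y):
    # the |MSE_XY / sigma_XY| cost mentioned in the abstract
    sxy = ((x - x.mean()) * (y - y.mean())).mean()
    return abs(mse(x, y) / sxy)

gold = np.sin(np.linspace(0.0, 6.0, 500))              # toy gold standard
pred = 0.8 * gold + 0.1 + 0.05 * np.random.randn(500)  # toy prediction
print(ccc(gold, pred), mse(gold, pred), mse_over_cov(gold, pred))
```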

    Machine Learning-based Methods for Driver Identification and Behavior Assessment: Applications for CAN and Floating Car Data

    The exponential growth of car-generated data, the increased connectivity, and the advances in artificial intelligence (AI) enable novel mobility applications. This dissertation focuses on two use-cases of driving data, namely distraction detection and driver identification (ID). Low- and middle-income countries account for 93% of traffic deaths; moreover, a major contributing factor to road crashes is distracted driving. Motivated by this, the first part of this thesis explores the possibility of an easy-to-deploy solution to distracted-driving detection. Most of the related work uses sophisticated sensors or cameras, which raises privacy concerns and increases the cost. Therefore, a machine learning (ML) approach is proposed that only uses signals from the CAN-bus and the inertial measurement unit (IMU). It is then evaluated against a hand-annotated dataset of 13 drivers and delivers reasonable accuracy. This approach is limited in detecting short-term distractions but demonstrates that a viable solution is possible. In the second part, the focus is on the effective identification of drivers using their driving behavior. The aim is to address the shortcomings of the state-of-the-art methods. First, a driver ID mechanism based on discriminative classifiers is used to find a set of suitable signals and features. It uses five signals from the CAN-bus with hand-engineered features, which is an improvement over the current state of the art, which mainly focused on external sensors. The second approach is based on Gaussian mixture models (GMMs); although it uses only two signals and fewer features, it shows improved accuracy. In this system, the enrollment of a new driver does not require retraining of the models, which was a limitation of the previous approach. In order to reduce the amount of training data, a triplet network is used to train a deep neural network (DNN) that learns to discriminate drivers. The training of the DNN does not require any driving data from the target set of drivers. The DNN encodes pieces of driving data into an embedding space so that examples of the same driver appear close to each other and far from examples of other drivers. This technique reduces the amount of data needed for accurate prediction to under a minute of driving. These three solutions are validated against a real-world dataset of 57 drivers. Lastly, the possibility of a driver ID system that only uses floating car data (FCD), in particular GPS data from smartphones, is explored. A DNN architecture is then designed that encodes the routes, origin, and destination coordinates, as well as various other features computed based on contextual information. The proposed model is then evaluated against a dataset of 678 drivers and shows high accuracy. In a nutshell, this work demonstrates that proper driver ID is achievable. The constraints imposed by the use-case and data availability negatively affect the performance; in such cases, the efficient use of the available data is crucial.
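
    A hedged sketch of the triplet-embedding idea: a small network maps windows of driving signals into a space where segments from the same driver end up close together and segments from other drivers far apart. The network, window length, and signal count are illustrative assumptions, not the dissertation's architecture.

```python
# Sketch: driver embeddings trained with a triplet margin loss.
import torch
import torch.nn as nn

encoder = nn.Sequential(           # encodes a 60-step window of 5 CAN signals
    nn.Flatten(),
    nn.Linear(60 * 5, 128), nn.ReLU(),
    nn.Linear(128, 32),            # 32-dimensional driver embedding
)
triplet_loss = nn.TripletMarginLoss(margin=1.0)

anchor   = torch.randn(16, 60, 5)  # segments from driver A
positive = torch.randn(16, 60, 5)  # other segments from driver A
negative = torch.randn(16, 60, 5)  # segments from other drivers

loss = triplet_loss(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()                    # gradients pull same-driver pairs together
print(loss.item())
```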

    Voice Stress Analysis: A New Framework for Voice and Effort in Human Performance

    People rely on speech for communication, both in a personal and professional context, and often under different conditions of physical, cognitive and/or emotional load. Since vocalization is entirely integrated within both our central (CNS) and autonomic nervous system (ANS), a mounting number of studies have examined the relationship between voice output and the impact of stress. In the current paper, we will outline the different stages of voice output, i.e., breathing, phonation and resonance, in relation to a neurovisceral integrated perspective on stress and human performance. In reviewing the function of these three stages of voice output, we will give an overview of the voice parameters encountered in studies on voice stress analysis (VSA) and review the impact of the different types of physiological, cognitive and/or emotional load. In the “Discussion” section, with regard to physical load, a competition is described between the ventilation processes required to speak and those required to meet the metabolic demand of exercised muscles. With regard to cognitive and emotional load, we will present the “Model for Voice and Effort” (MoVE), which comprises the integration of ongoing top-down and bottom-up activity under different types of load and combined patterns of voice output. In the MoVE, it is proposed that the fundamental frequency (F0) values as well as jitter give insight into bottom-up/arousal activity and the effort a subject is capable of generating, whereas the F0 range and variance are related to ongoing top-down processes and the amount of control a subject can maintain. Within the MoVE, a key role is given to the anterior cingulate cortex (ACC), which is known to be involved both in the equilibration between bottom-up arousal and top-down regulation and in vocal activity. Moreover, the connectivity between the ACC and the nervus vagus (NV) is underlined as an indication of the importance of respiration. Since respiration is the driving force of both stress and voice production, it is hypothesized to be the missing link in our understanding of the underlying mechanisms of the dynamic between speech and stress.
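
    An illustration of the two voice parameters the MoVE centres on, F0 and jitter: pYIN gives a frame-level F0 contour, and the jitter value below is a rough frame-based proxy (true jitter is normally computed from glottal cycle periods, e.g., in Praat). The file name and frequency range are assumptions.

```python
# Sketch: F0 contour and a frame-based jitter proxy from a speech recording.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)
f0, voiced, _ = librosa.pyin(y, fmin=70, fmax=400, sr=sr)

f0_voiced = f0[voiced & ~np.isnan(f0)]
periods = 1.0 / f0_voiced
jitter_proxy = np.mean(np.abs(np.diff(periods))) / np.mean(periods)

print("mean F0: %.1f Hz, F0 range: %.1f Hz, jitter (proxy): %.3f"
      % (f0_voiced.mean(), np.ptp(f0_voiced), jitter_proxy))
```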

    Emotion on the Road—Necessity, Acceptance, and Feasibility of Affective Computing in the Car

    Besides the reduction of energy consumption, which implies alternative actuation and lightweight construction, the main research domain in automobile development in the near future is dominated by driver assistance and natural driver-car communication. The ability of a car to understand natural speech and to provide a human-like driver assistance system can be expected to be a factor decisive for market success, on par with automatic driving systems. Emotional factors and affective states are thereby crucial for enhanced safety and comfort. This paper gives an extensive literature overview of work related to the influence of emotions on driving safety and comfort, automatic recognition and control of emotions, and the improvement of in-car interfaces through affect-sensitive technology. Various use-case scenarios are outlined as possible applications for emotion-oriented technology in the vehicle. The possible acceptance of such future technology by drivers is assessed in a Wizard-of-Oz user study, and the feasibility of automatically recognising various driver states is demonstrated by an example system for monitoring driver attentiveness. An accuracy of 91.3% is reported for classifying in real time whether the driver is attentive or distracted.

    The Multi-Dimensional Contributions of Prefrontal Circuits to Emotion Regulation during Adulthood and Critical Stages of Development

    The prefrontal cortex (PFC) plays a pivotal role in regulating our emotions. The importance of ventromedial regions in emotion regulation, including the ventral sector of the medial PFC, the medial sector of the orbital cortex, and the subgenual cingulate cortex, has long been recognized. However, it is increasingly apparent that lateral and dorsal regions of the PFC, as well as the neighbouring dorsal anterior cingulate cortex, also play a role. Defining the underlying psychological mechanisms by which these functionally distinct regions modulate emotions, and the nature and extent of their interactions, is a critical step towards better stratification of the symptoms of mood and anxiety disorders. It is also important to extend our understanding of these prefrontal circuits into development. Specifically, it is important to determine whether they exhibit differential sensitivity to perturbations by known risk factors, such as stress and inflammation, at distinct developmental epochs. This Special Issue brings together the most recent research in humans and other animals that addresses these important issues and, in doing so, highlights the value of the translational approach.