14,150 research outputs found

    Multimodal Affect Recognition: Current Approaches and Challenges

    Get PDF
    Many factors render multimodal affect recognition approaches appealing. First, humans employ a multimodal approach in emotion recognition. It is only fitting that machines, which attempt to reproduce elements of the human emotional intelligence, employ the same approach. Second, the combination of multiple-affective signals not only provides a richer collection of data but also helps alleviate the effects of uncertainty in the raw signals. Lastly, they potentially afford us the flexibility to classify emotions even when one or more source signals are not possible to retrieve. However, the multimodal approach presents challenges pertaining to the fusion of individual signals, dimensionality of the feature space, and incompatibility of collected signals in terms of time resolution and format. In this chapter, we explore the aforementioned challenges while presenting the latest scholarship on the topic. Hence, we first discuss the various modalities used in affect classification. Second, we explore the fusion of modalities. Third, we present publicly accessible multimodal datasets designed to expedite work on the topic by eliminating the laborious task of dataset collection. Fourth, we analyze representative works on the topic. Finally, we summarize the current challenges in the field and provide ideas for future research directions

    Apprentissage automatique pour le codage cognitif de la parole

    Get PDF
    Depuis les années 80, les codecs vocaux reposent sur des stratégies de codage à court terme qui fonctionnent au niveau de la sous-trame ou de la trame (généralement 5 à 20 ms). Les chercheurs ont essentiellement ajusté et combiné un nombre limité de technologies disponibles (transformation, prédiction linéaire, quantification) et de stratégies (suivi de forme d'onde, mise en forme du bruit) pour construire des architectures de codage de plus en plus complexes. Dans cette thèse, plutôt que de s'appuyer sur des stratégies de codage à court terme, nous développons un cadre alternatif pour la compression de la parole en codant les attributs de la parole qui sont des caractéristiques perceptuellement importantes des signaux vocaux. Afin d'atteindre cet objectif, nous résolvons trois problèmes de complexité croissante, à savoir la classification, la prédiction et l'apprentissage des représentations. La classification est un élément courant dans les conceptions de codecs modernes. Dans un premier temps, nous concevons un classifieur pour identifier les émotions, qui sont parmi les attributs à long terme les plus complexes de la parole. Dans une deuxième étape, nous concevons un prédicteur d'échantillon de parole, qui est un autre élément commun dans les conceptions de codecs modernes, pour mettre en évidence les avantages du traitement du signal de parole à long terme et non linéaire. Ensuite, nous explorons les variables latentes, un espace de représentations de la parole, pour coder les attributs de la parole à court et à long terme. Enfin, nous proposons un réseau décodeur pour synthétiser les signaux de parole à partir de ces représentations, ce qui constitue notre dernière étape vers la construction d'une méthode complète de compression de la parole basée sur l'apprentissage automatique de bout en bout. Bien que chaque étape de développement proposée dans cette thèse puisse faire partie d'un codec à elle seule, chaque étape fournit également des informations et une base pour la prochaine étape de développement jusqu'à ce qu'un codec entièrement basé sur l'apprentissage automatique soit atteint. Les deux premières étapes, la classification et la prédiction, fournissent de nouveaux outils qui pourraient remplacer et améliorer des éléments des codecs existants. Dans la première étape, nous utilisons une combinaison de modèle source-filtre et de machine à état liquide (LSM), pour démontrer que les caractéristiques liées aux émotions peuvent être facilement extraites et classées à l'aide d'un simple classificateur. Dans la deuxième étape, un seul réseau de bout en bout utilisant une longue mémoire à court terme (LSTM) est utilisé pour produire des trames vocales avec une qualité subjective élevée pour les applications de masquage de perte de paquets (PLC). Dans les dernières étapes, nous nous appuyons sur les résultats des étapes précédentes pour concevoir un codec entièrement basé sur l'apprentissage automatique. un réseau d'encodage, formulé à l'aide d'un réseau neuronal profond (DNN) et entraîné sur plusieurs bases de données publiques, extrait et encode les représentations de la parole en utilisant la prédiction dans un espace latent. Une approche d'apprentissage non supervisé basée sur plusieurs principes de cognition est proposée pour extraire des représentations à partir de trames de parole courtes et longues en utilisant l'information mutuelle et la perte contrastive. La capacité de ces représentations apprises à capturer divers attributs de la parole à court et à long terme est démontrée. Enfin, une structure de décodage est proposée pour synthétiser des signaux de parole à partir de ces représentations. L'entraînement contradictoire est utilisé comme une approximation des mesures subjectives de la qualité de la parole afin de synthétiser des échantillons de parole à consonance naturelle. La haute qualité perceptuelle de la parole synthétisée ainsi obtenue prouve que les représentations extraites sont efficaces pour préserver toutes sortes d'attributs de la parole et donc qu'une méthode de compression complète est démontrée avec l'approche proposée.Abstract: Since the 80s, speech codecs have relied on short-term coding strategies that operate at the subframe or frame level (typically 5 to 20ms). Researchers essentially adjusted and combined a limited number of available technologies (transform, linear prediction, quantization) and strategies (waveform matching, noise shaping) to build increasingly complex coding architectures. In this thesis, rather than relying on short-term coding strategies, we develop an alternative framework for speech compression by encoding speech attributes that are perceptually important characteristics of speech signals. In order to achieve this objective, we solve three problems of increasing complexity, namely classification, prediction and representation learning. Classification is a common element in modern codec designs. In a first step, we design a classifier to identify emotions, which are among the most complex long-term speech attributes. In a second step, we design a speech sample predictor, which is another common element in modern codec designs, to highlight the benefits of long-term and non-linear speech signal processing. Then, we explore latent variables, a space of speech representations, to encode both short-term and long-term speech attributes. Lastly, we propose a decoder network to synthesize speech signals from these representations, which constitutes our final step towards building a complete, end-to-end machine-learning based speech compression method. The first two steps, classification and prediction, provide new tools that could replace and improve elements of existing codecs. In the first step, we use a combination of source-filter model and liquid state machine (LSM), to demonstrate that features related to emotions can be easily extracted and classified using a simple classifier. In the second step, a single end-to-end network using long short-term memory (LSTM) is shown to produce speech frames with high subjective quality for packet loss concealment (PLC) applications. In the last steps, we build upon the results of previous steps to design a fully machine learning-based codec. An encoder network, formulated using a deep neural network (DNN) and trained on multiple public databases, extracts and encodes speech representations using prediction in a latent space. An unsupervised learning approach based on several principles of cognition is proposed to extract representations from both short and long frames of data using mutual information and contrastive loss. The ability of these learned representations to capture various short- and long-term speech attributes is demonstrated. Finally, a decoder structure is proposed to synthesize speech signals from these representations. Adversarial training is used as an approximation to subjective speech quality measures in order to synthesize natural-sounding speech samples. The high perceptual quality of synthesized speech thus achieved proves that the extracted representations are efficient at preserving all sorts of speech attributes and therefore that a complete compression method is demonstrated with the proposed approach

    Emotion Recognition from Speech with Acoustic, Non-Linear and Wavelet-based Features Extracted in Different Acoustic Conditions

    Get PDF
    ABSTRACT: In the last years, there has a great progress in automatic speech recognition. The challenge now it is not only recognize the semantic content in the speech but also the called "paralinguistic" aspects of the speech, including the emotions, and the personality of the speaker. This research work aims in the development of a methodology for the automatic emotion recognition from speech signals in non-controlled noise conditions. For that purpose, different sets of acoustic, non-linear, and wavelet based features are used to characterize emotions in different databases created for such purpose

    Continuous Estimation of Emotions in Speech by Dynamic Cooperative Speaker Models

    Get PDF
    Automatic emotion recognition from speech has been recently focused on the prediction of time-continuous dimensions (e.g., arousal and valence) of spontaneous and realistic expressions of emotion, as found in real-life interactions. However, the automatic prediction of such emotions poses several challenges, such as the subjectivity found in the definition of a gold standard from a pool of raters and the issue of data scarcity in training models. In this work, we introduce a novel emotion recognition system, based on ensemble of single-speaker-regression-models (SSRMs). The estimation of emotion is provided by combining a subset of the initial pool of SSRMs selecting those that are most concordance among them. The proposed approach allows the addition or removal of speakers from the ensemble without the necessity to re-build the entire machine learning system. The simplicity of this aggregation strategy, coupled with the flexibility assured by the modular architecture, and the promising results obtained on the RECOLA database highlight the potential implications of the proposed method in a real-life scenario and in particular in WEB-based applications

    Evaluation of room acoustic qualities and defects by use of auralization

    Get PDF

    On the mechanism of response latencies in auditory nerve fibers

    Get PDF
    Despite the structural differences of the middle and inner ears, the latency pattern in auditory nerve fibers to an identical sound has been found similar across numerous species. Studies have shown the similarity in remarkable species with distinct cochleae or even without a basilar membrane. This stimulus-, neuron-, and species- independent similarity of latency cannot be simply explained by the concept of cochlear traveling waves that is generally accepted as the main cause of the neural latency pattern. An original concept of Fourier pattern is defined, intended to characterize a feature of temporal processing—specifically phase encoding—that is not readily apparent in more conventional analyses. The pattern is created by marking the first amplitude maximum for each sinusoid component of the stimulus, to encode phase information. The hypothesis is that the hearing organ serves as a running analyzer whose output reflects synchronization of auditory neural activity consistent with the Fourier pattern. A combined research of experimental, correlational and meta-analysis approaches is used to test the hypothesis. Manipulations included phase encoding and stimuli to test their effects on the predicted latency pattern. Animal studies in the literature using the same stimulus were then compared to determine the degree of relationship. The results show that each marking accounts for a large percentage of a corresponding peak latency in the peristimulus-time histogram. For each of the stimuli considered, the latency predicted by the Fourier pattern is highly correlated with the observed latency in the auditory nerve fiber of representative species. The results suggest that the hearing organ analyzes not only amplitude spectrum but also phase information in Fourier analysis, to distribute the specific spikes among auditory nerve fibers and within a single unit. This phase-encoding mechanism in Fourier analysis is proposed to be the common mechanism that, in the face of species differences in peripheral auditory hardware, accounts for the considerable similarities across species in their latency-by-frequency functions, in turn assuring optimal phase encoding across species. Also, the mechanism has the potential to improve phase encoding of cochlear implants

    A Review on Emotion Recognition Algorithms using Speech Analysis

    Get PDF
    In recent years, there is a growing interest in speech emotion recognition (SER) by analyzing input speech. SER can be considered as simply pattern recognition task which includes features extraction, classifier, and speech emotion database. The objective of this paper is to provide a comprehensive review on various literature available on SER. Several audio features are available, including linear predictive coding coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and Teager energy based features. While for classifier, many algorithms are available including hidden Markov model (HMM), Gaussian mixture model (GMM), vector quantization (VQ), artificial neural networks (ANN), and deep neural networks (DNN). In this paper, we also reviewed various speech emotion database. Finally, recent related works on SER using DNN will be discussed
    corecore