1,212 research outputs found

    A framework for emotion and sentiment predicting supported in ensembles

    Humans comprehend each other’s emotions through subtle body movements and facial expressions, and they adapt how they deliver messages accordingly. Machines, user interfaces, and robots need this ability in order to shift from traditional “human-computer interaction” to “human-machine cooperation”, where the machine provides the “right” information and functionality, at the “right” time, and in the “right” way. This dissertation presents a framework for emotion classification based on facial, speech, and text emotion prediction sources, supported by an ensemble of open-source code retrieved from off-the-shelf methods. The main contribution is integrating outputs from different sources and methods into a single prediction, consistent with the emotions presented by the system’s user. For each source, an initial aggregation of primary classifiers was implemented. For facial emotion classification, the aggregation achieved an accuracy above 73% on both the FER2013 and RAF-DB datasets. For speech emotion classification, four datasets were used: RAVDESS, TESS, CREMA-D, and SAVEE; the aggregation of primary classifiers achieved an accuracy above 86% on a combination of three of these datasets. The text emotion aggregation was tested on the EMOTIONLINES dataset, where it achieved an accuracy above 53%. Finally, integrating all the methods in a single framework yields an emotion multi-source aggregator (EMsA), which aggregates the primary emotion classifications from the different sources (facial, speech, text, etc.). We describe the EMsA and report results on the RAVDESS dataset, where the EMsA using a combination of faces and speech achieved 81.99% accuracy.
Finally, we present an initial approach for sentiment classification. Humans are prepared to understand each other’s emotions through subtle body movements or facial expressions; that is, the way these movements and expressions are conveyed changes how messages are delivered when humans communicate. Machines, user interfaces, and robots need to develop this capability in order to shift from traditional “human-computer interaction” to “human-machine cooperation”, where the machine provides the “right” information and functionality, at the “right” time, and in the “right” way. This dissertation presents a framework (an ensemble of models) for emotion classification based on multiple sources, namely facial, speech, and text emotion prediction. The base classifiers rely on open-source code associated with methods available in the literature (primary classifiers). The main contribution is integrating different sources and different methods (the primary classifiers) into a single prediction consistent with the emotions presented by the system’s user. In this context, the state-of-the-art review of the different ways of classifying human emotions shows that body emotion recognition (excluding the face) exists. However, no published open-source code was found for primary classifiers that could be used within the scope of this dissertation. For speech and text emotion recognition, there were also some difficulties in finding primary classifiers that met the requirements, especially for text: many models exist, but they cover numerous emotions other than the six basic emotions considered (sadness, fear, surprise, disgust, anger, and joy). For text, it was also found that more models predict sentiment than emotions.
Separately for each source, i.e., for each component analyzed (face, speech, and text), a Python framework was developed that implements a primary aggregator with n primary classifiers (this dissertation uses n = 3). To run the tests and obtain the results of each primary aggregator, a specific dataset is used and its data is sent to the aggregator: an image for the facial aggregator, an audio clip for the speech aggregator, and a sentence for the text aggregator. Each dataset was split into training, validation, and test files. When the framework finishes processing the input, it generates the corresponding results: the file name or input identifier, the results of the first, second, and third primary classifiers, and the dataset ground truth. The primary classifier results are then sent to the final classifier of that primary aggregator, for which four classifiers were tested: (a) voting, which for n = 3 compares the emotion predicted by each primary classifier: if two primary classifiers predict the same emotion, that emotion is the voting result, and if all classifiers disagree, no result is chosen; (b) Random Forest; (c) AdaBoost; and (d) MLP (multilayer perceptron). Once the framework for each primary aggregator was complete, a super-aggregator was developed on the same principle as the primary aggregators, but instead of aggregating the results of only 3 primary classifiers, it receives n × 3 primary classifier results (n from the face, n from speech, and n from text).
Regarding the aggregators for each source (face, speech, and text), facial emotion classification reached an accuracy above 73% on the FER2013 and RAF-DB datasets. For speech emotion classification, four datasets were used (RAVDESS, TESS, CREMA-D, and SAVEE), and the best accuracy, above 86%, was obtained with a combination of 3 of the 4 datasets. Text emotion classification was tested on the EMOTIONLINES dataset, where the best accuracy obtained was 53%. Integrating all the primary classifiers into a single framework made it possible to develop the emotion multi-source aggregator (EMsA), in which the final emotion classification is extracted, as noted above, from the aggregation of the primary emotion classifiers from the different sources. For the EMsA, results are reported on the RAVDESS dataset, where an accuracy of 81.99% was reached when the EMsA used a combination of faces and speech. It was not possible to test the EMsA on a dataset recognized in the literature that contains text, face, and speech information simultaneously. Finally, an initial approach for sentiment classification was presented.
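The voting rule described in this abstract (with three primary classifiers, two matching predictions decide the emotion, and a three-way disagreement yields no result) can be sketched in a few lines of Python. This is only an illustration, not the dissertation's code: the function names `vote` and `super_vote` are hypothetical, the generalization to a strict majority for n other than 3 is an assumption, and applying the same rule to the pooled n × 3 predictions of the super-aggregator is likewise assumed rather than stated in the abstract.

```python
from collections import Counter
from typing import Optional, Sequence


def vote(predictions: Sequence[str]) -> Optional[str]:
    """Return the emotion chosen by a strict majority, or None if there is none.

    For n = 3 this matches the rule in the abstract: two agreeing classifiers
    decide the result, and three different predictions give no result.
    Using a strict majority for other values of n is an assumption.
    """
    if not predictions:
        return None
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else None


def super_vote(face: Sequence[str],
               speech: Sequence[str],
               text: Sequence[str]) -> Optional[str]:
    """Hypothetical super-aggregator: pool the n x 3 primary predictions and vote."""
    return vote([*face, *speech, *text])


if __name__ == "__main__":
    print(vote(["happy", "happy", "sad"]))   # two classifiers agree
    print(vote(["happy", "sad", "fear"]))    # all three disagree
```

The abstract also reports trained final classifiers (Random Forest, AdaBoost, MLP) in place of this rule; those would take the same primary predictions as input features.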

    Multimodaalsel emotsioonide tuvastamisel põhineva inimese-roboti suhtluse arendamine

    The electronic version of the dissertation does not contain the publications. One of the main subjects of interest in affective computing is multimodal emotion recognition, which is applied mainly in human-computer interaction. To recognize emotion, these systems examine both facial expressions and speech. This thesis studies the visual and acoustic features of human emotions and their expression in order to develop an automatic multimodal emotion recognition system. From speech, Mel-frequency cepstral coefficients, the energies of different components of the audio signal, and prosodic features are computed. Two different strategies are used to analyze facial expressions. First, various geometric relations between the key points of the face are computed. Second, each emotional video is summarized into a reduced set of key-frames, which are fed to a convolutional neural network to visually discriminate between emotions. The outputs of the three classifiers (1 acoustic, 2 visual) form a new set of features that is used for learning in the final stage of the system. The system was tested on SAVEE, the Polish and Serbian emotional speech databases, eNTERFACE’05, and RML. The results show that, compared with existing systems, the system developed in this work achieves higher emotion recognition accuracy. In addition, we review systems proposed in the literature that can recognize emotion-related gestures. The aim of this review is to help identify new research directions that would add gesture-based emotion recognition to the system developed here and further increase its emotion recognition accuracy. Automatic multimodal emotion recognition is a fundamental subject of interest in affective computing. Its main applications are in human-computer interaction.
The systems developed for this purpose consider combinations of different modalities, based on vocal and visual cues. This thesis takes these modalities into account in order to develop an automatic multimodal emotion recognition system. More specifically, it uses information extracted from speech and face signals. From speech signals, Mel-frequency cepstral coefficients, filter-bank energies, and prosodic features are extracted. Two different strategies are considered for analyzing the facial data. First, geometric relations between facial landmarks, i.e. distances and angles, are computed. Second, each emotional video is summarized into a reduced set of key-frames, and a convolutional neural network applied to these key-frames is trained to visually discriminate between the emotions. Afterward, the output confidence values of all the classifiers from both modalities are used to define a new feature space. Lastly, these values are learned for the final emotion label prediction, in a late fusion. The experiments are conducted on the SAVEE, Polish, Serbian, eNTERFACE'05, and RML datasets. The results show significant performance improvements by the proposed system in comparison to the existing alternatives, defining the current state-of-the-art on all the datasets. Additionally, we provide a review of emotional body gesture recognition systems proposed in the literature. The aim of this review is to help identify possible future research directions for enhancing the performance of the proposed system. In particular, we suggest that incorporating data representing gestures, which constitute another major component of the visual modality, can result in a more efficient framework.
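The late-fusion step in this abstract (the confidence values of all classifiers define a new feature space, which a final learner is trained on) can be illustrated with a small, dependency-free Python sketch. The abstract does not name the final learner, so the nearest-centroid rule below is an assumed stand-in, and the function names `fuse`, `fit_centroids`, and `predict` as well as the toy confidence values are hypothetical.

```python
from typing import Dict, List, Sequence


def fuse(confidences: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-classifier confidence vectors into one fused feature vector."""
    return [value for vector in confidences for value in vector]


def fit_centroids(features: Sequence[Sequence[float]],
                  labels: Sequence[str]) -> Dict[str, List[float]]:
    """Assumed final learner: store the mean fused vector of each emotion label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids: Dict[str, List[float]], x: Sequence[float]) -> str:
    """Return the label whose centroid is closest (squared Euclidean) to x."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda y: sq_dist(centroids[y], x))


if __name__ == "__main__":
    # Toy data: three classifiers (1 acoustic, 2 visual), each emitting
    # confidences over two emotions in the order [angry, happy].
    train = [
        fuse([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]),
        fuse([[0.2, 0.8], [0.1, 0.9], [0.4, 0.6]]),
    ]
    centroids = fit_centroids(train, ["angry", "happy"])
    sample = fuse([[0.6, 0.4], [0.9, 0.1], [0.3, 0.7]])
    print(predict(centroids, sample))
```

The point of the design is that the final stage sees every classifier's confidence for every emotion, so it can learn which classifier to trust for which emotion instead of treating all of them equally.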

    EXTENDED SPEECH EMOTION RECOGNITION AND PREDICTION

    Humans are considered to reason and act rationally, and this is believed to be their fundamental difference from other living beings. Modern approaches in psychology, however, underline that humans, as thinking creatures, are also sentimental and emotional organisms. There are fifteen universal extended emotions plus a neutral state: hot anger, cold anger, panic, fear, anxiety, despair, sadness, elation, happiness, interest, boredom, shame, pride, disgust, contempt, and neutral. The scope of the current research is to understand the emotional state of a human being by capturing the speech utterances used during an ordinary conversation. We show that, given enough acoustic evidence, the emotional state of a person can be classified by a set of majority voting classifiers. The proposed set is based on three main classifiers: kNN, C4.5, and SVM with an RBF kernel. This set achieves better performance than each basic classifier taken separately. It is compared with two other sets of classifiers: a one-against-all (OAA) multiclass SVM with hybrid kernels, and a set consisting of two basic classifiers, C5.0 and a neural network. The proposed variant achieves better performance than both. The paper thus addresses emotion classification by a set of majority voting classifiers that combines three types of basic classifiers with low computational complexity. The basic classifiers stem from different theoretical backgrounds in order to avoid bias and redundancy, which gives the proposed set the ability to generalize in the emotion domain space.

    A Robust Interpretable Deep Learning Classifier for Heart Anomaly Detection Without Segmentation

    Traditionally, abnormal heart sound classification is framed as a three-stage process. The first stage involves segmenting the phonocardiogram to detect fundamental heart sounds, after which features are extracted and classification is performed. Some researchers in the field argue that the segmentation step is an unwanted computational burden, whereas others embrace it as a prior step to feature extraction. When comparing the accuracies achieved by studies that segmented heart sounds before analysis with those that skipped that step, the question of whether to segment heart sounds before feature extraction is still open. In this study, we explicitly examine the importance of heart sound segmentation as a prior step for heart sound classification, and then apply the obtained insights to propose a robust classifier for abnormal heart sound detection. Furthermore, recognizing the pressing need for explainable Artificial Intelligence (AI) models in the medical domain, we also unveil hidden representations learned by the classifier using model interpretation techniques. Experimental results demonstrate that segmentation plays an essential role in abnormal heart sound classification. Our new classifier is also shown to be robust, stable and, most importantly, explainable, with an accuracy of almost 100% on the widely used PhysioNet dataset.

    Multi-Classifier Interactive Learning for Ambiguous Speech Emotion Recognition

    In recent years, speech emotion recognition technology has become highly significant in industrial applications such as call centers, social robots, and health care. Combining speech recognition with speech emotion recognition can improve feedback efficiency and quality of service. Speech emotion recognition has therefore attracted much attention in both industry and academia. Since the emotions present in an entire utterance may have varied probabilities, speech emotion is likely to be ambiguous, which poses great challenges to recognition tasks. However, previous studies commonly assigned a single label or multiple labels to each utterance with certainty. Their algorithms therefore yield low accuracies because of this inappropriate representation. Inspired by the optimally interacting theory, we address ambiguous speech emotions by proposing a novel multi-classifier interactive learning (MCIL) method. In MCIL, multiple different classifiers first mimic several individuals who have inconsistent cognitions of ambiguous emotions, and construct new ambiguous labels (the emotion probability distribution). Then, they are retrained with the new labels to interact with their cognitions. This procedure enables each classifier to learn better representations of ambiguous data from the others, and further improves its recognition ability. Experiments on three benchmark corpora (MAS, IEMOCAP, and FAU-AIBO) demonstrate that MCIL not only improves each classifier's performance, but also raises their recognition consistency from moderate to substantial. Comment: 10 pages, 4 figures.

    Learning to Detect Human Emotions in Digital World by Integrating Ensemble Voting Classifiers

    Due to the expansion of the internet and the rapid adoption of social media platforms, information can now be exchanged in ways never previously imagined in human history. A social networking site like Twitter offers a forum where people can interact, discuss, and respond to specific issues via short entries, such as tweets of 140 characters or fewer. Users can engage through the comment, like, and share functions on texts, videos, images, and other content. Because social media platforms are now so widely used, individuals create and share far more information than before, some of which may be incorrect or unconnected to reality. It is difficult to automatically identify erroneous or inaccurate statements in textual content and to detect people's emotions. In this paper, we propose an ensemble method for sentiment and emotion analysis. Different textual features related to emotion and sentiment are used. We used a publicly accessible Twitter sentiment analysis dataset containing a total of 48,247 authenticated tweets, of which 23,947 were positive texts labeled as binary 0s and 24,300 were negative texts labeled as binary 1s. To assess our approach, we used well-known machine learning (ML) techniques: Logistic Regression (LR), AdaBoost, Decision Tree (DT), SGD, XGBoost, and Naive Bayes. To obtain more accurate results, we built a multi-model sentiment and emotion analysis system using the ensemble approach and the classifiers listed above. According to an experimental study, our proposed ensemble learner outperforms the individual learners.

    Multimodal Affect Recognition: Current Approaches and Challenges

    Many factors render multimodal affect recognition approaches appealing. First, humans employ a multimodal approach in emotion recognition. It is only fitting that machines, which attempt to reproduce elements of human emotional intelligence, employ the same approach. Second, the combination of multiple affective signals not only provides a richer collection of data but also helps alleviate the effects of uncertainty in the raw signals. Lastly, multimodal approaches potentially afford us the flexibility to classify emotions even when one or more source signals cannot be retrieved. However, the multimodal approach presents challenges pertaining to the fusion of individual signals, the dimensionality of the feature space, and the incompatibility of collected signals in terms of time resolution and format. In this chapter, we explore these challenges while presenting the latest scholarship on the topic. First, we discuss the various modalities used in affect classification. Second, we explore the fusion of modalities. Third, we present publicly accessible multimodal datasets designed to expedite work on the topic by eliminating the laborious task of dataset collection. Fourth, we analyze representative works on the topic. Finally, we summarize the current challenges in the field and provide ideas for future research directions.