Evaluation of Support Vector Machine and Decision Tree for Emotion Recognition of Malay Folklores
In this paper, the performance of Support Vector Machine (SVM) and Decision Tree (DT) classifiers in recognising emotions from Malay folklores is presented. This work continues our storytelling speech synthesis work by adding emotions for more natural storytelling. A total of 100 documents from children's short stories were collected and used as the dataset for the text-based emotion recognition experiment. Term Frequency-Inverse Document Frequency (TF-IDF) features are extracted from the text documents and classified using SVM and DT. Four common emotions, namely happy, angry, fearful, and sad, are classified using the two classifiers. Results showed that DT outperformed SVM by more than 22.2% in accuracy. However, the overall emotion recognition accuracy is only moderate, suggesting that improvement is needed in future work, for example by using semantic feature extractors or by incorporating deep learning for classification.
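The TF-IDF weighting described above can be illustrated with a minimal sketch. The toy corpus and labels below are hypothetical stand-ins, not the paper's Malay folklore data, and the classifiers themselves are omitted; the sketch only shows how the feature vectors that an SVM or decision tree would consume are computed.

```python
import numpy as np

# Toy corpus standing in for the folklore documents (hypothetical examples).
docs = [
    "the hero laughs and sings happy songs",
    "the giant roars in anger and smashes the door",
    "the child hides trembling in fear of the dark",
    "the mother weeps sad tears alone",
]
labels = ["happy", "angry", "fearful", "sad"]

# Vocabulary over the whole corpus.
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Term frequency: raw counts normalised per document.
tf = np.zeros((len(docs), len(vocab)))
for r, d in enumerate(docs):
    for w in d.split():
        tf[r, idx[w]] += 1
tf /= tf.sum(axis=1, keepdims=True)

# Inverse document frequency: log(N / document frequency).
df = (tf > 0).sum(axis=0)
idf = np.log(len(docs) / df)

# One TF-IDF vector per document, ready for an SVM or decision tree.
tfidf = tf * idf
```

A word such as "the" that occurs in every document gets an IDF of zero and so contributes nothing to the features. In practice, `sklearn.feature_extraction.text.TfidfVectorizer` with `SVC` or `DecisionTreeClassifier` would replace most of this code, using a slightly different smoothed IDF.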
An ongoing review of speech emotion recognition
User emotional status recognition is becoming a key feature in advanced Human-Computer Interfaces (HCI). A key source of emotional information is spoken expression, which may be part of the interaction between the human and the machine. Speech emotion recognition (SER) is a very active area of research that involves the application of current machine learning and neural network tools. This ongoing review covers recent and classical approaches to SER reported in the literature. This work has been carried out with the support of project PID2020-116346GB-I00, funded by the Spanish MICIN.
Intention Detection Based on Siamese Neural Network With Triplet Loss
Understanding the user's intention is an essential task for the spoken language understanding (SLU) module in a dialogue system, as it provides vital information for managing and generating future actions and responses. In this paper, we propose a triplet training framework based on the multiclass classification approach for the intention detection task. Specifically, we utilize a Siamese neural network architecture with metric learning to construct a robust and discriminative utterance feature embedding model. We modify the RMCNN model and fine-tune the BERT model as Siamese encoders to train utterance triplets from different semantic aspects. The triplet loss can effectively distinguish the details of two inputs by learning a mapping from utterance sequences to a compact Euclidean space. After learning the mapping, the intention detection task can be easily implemented using standard techniques with the pre-trained embeddings as feature vectors. In addition, we use a fusion strategy to enhance the utterance feature representation in the downstream intention detection task. We conduct experiments on several benchmark datasets for intention detection: the Snips, ATIS, Facebook multilingual task-oriented, Daily Dialogue, and MRDA datasets. The results illustrate that the proposed method can effectively improve the recognition performance on these datasets, achieving new state-of-the-art results on the single-turn task-oriented datasets (Snips, Facebook) and a multi-turn dataset (Daily Dialogue).
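The triplet loss at the core of this framework can be sketched as follows. The 2-D embeddings are illustrative values, not outputs of the RMCNN or BERT encoders, and the margin is an assumed hyperparameter.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss over Euclidean distances.

    Pulls the anchor towards a same-intent utterance (positive) and pushes
    it at least `margin` further from a different-intent one (negative).
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

# Illustrative embeddings such as a trained Siamese encoder might emit.
a = np.array([1.0, 0.0])   # anchor utterance
p = np.array([0.9, 0.1])   # same intent, close to the anchor
n = np.array([-1.0, 0.0])  # different intent, far from the anchor
```

Here `triplet_loss(a, p, n)` is zero because the negative is already more than `margin` further away than the positive; swapping the roles of `p` and `n` yields a positive loss that training would drive down.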
Audio-Visual Speaker Verification via Joint Cross-Attention
Speaker verification has been widely explored using speech signals and has shown significant improvement using deep models. Recently, there has been a surge in exploring faces and voices together, as they can offer more complementary and comprehensive information than relying on the single modality of speech signals. Though current methods in the literature on the fusion of faces and voices have shown improvement over individual face or voice modalities, the potential of audio-visual fusion is not fully explored for speaker verification. Most existing methods based on audio-visual fusion rely either on score-level fusion or on simple feature concatenation. In this work, we have explored cross-modal joint attention to fully leverage the inter-modal complementary information and the intra-modal information for speaker verification. Specifically, we estimate the cross-attention weights based on the correlation between the joint feature representation and the individual feature representations in order to effectively capture both intra-modal as well as inter-modal relationships among the faces and voices. We have shown that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification. The performance of the proposed approach has been evaluated on the VoxCeleb1 dataset. Results show that the proposed approach significantly outperforms the state-of-the-art methods of audio-visual fusion for speaker verification.
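The idea of deriving cross-attention weights from the correlation between a joint representation and each modality's own features can be sketched in numpy. The concatenation-plus-projection joint representation, the projection `W_j`, and all shapes below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 5, 8                                  # time steps, feature dim (toy sizes)
audio = rng.standard_normal((T, d))          # per-frame voice features
video = rng.standard_normal((T, d))          # per-frame face features

# Joint representation: concatenate the modalities and project back to d dims.
W_j = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
joint = np.concatenate([audio, video], axis=1) @ W_j

# Cross-attention weights from the correlation between the joint features
# and each modality's own features, then used to reweight that modality.
att_a = softmax(joint @ audio.T / np.sqrt(d))
att_v = softmax(joint @ video.T / np.sqrt(d))
audio_att = att_a @ audio
video_att = att_v @ video
```

Because the weights are computed against the joint representation, each modality's attended features are informed by both itself and the other modality, which is the intra- plus inter-modal coupling the abstract describes.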
EmoCNN: Encoding Emotional Expression from Text to Word Vector and Classifying Emotions—A Case Study in Thai Social Network Conversation
We present EmoCNN, a specially-trained word embedding layer combined with a convolutional neural network model for the classification of conversational texts into 4 types of emotion. This model is part of a chatbot for depression evaluation. The difficulty in classifying emotion from conversational text is that most word embeddings are trained on emotionally neutral corpora such as Wikipedia or news articles, where emotional words appear rarely or not at all and the language style is formal writing. We trained a new word embedding based on the word2vec architecture in an unsupervised manner and then fine-tuned it on soft-labelled data obtained by mining Twitter with emotion keywords. We show that this emotion word embedding can differentiate between words that have the same polarity and words that have opposite polarity, as well as find similar words with the same polarity, while the standard word embedding cannot. We then used this new embedding as the first layer of EmoCNN, which classifies conversational text into the 4 emotions. EmoCNN achieved a macro-averaged f1-score of 0.76 over the test set. We compared EmoCNN against three different models: a shallow fully-connected neural network, fine-tuned RoBERTa, and ULMFiT, which achieved best macro-averaged f1-scores of 0.5556, 0.6402, and 0.7386, respectively.
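The claimed polarity separation is typically measured with cosine similarity between embedding vectors. The vectors below are hypothetical illustrations, not values from the trained EmoCNN embedding.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical emotion-tuned vectors: same-emotion words cluster together
# while opposite-polarity words point away (values are illustrative only).
emb = {
    "happy":  np.array([0.9, 0.1]),
    "joyful": np.array([0.8, 0.2]),
    "sad":    np.array([-0.9, -0.1]),
}

same = cosine(emb["happy"], emb["joyful"])      # high: same polarity
opposite = cosine(emb["happy"], emb["sad"])     # negative: opposite polarity
```

A generic embedding trained on neutral text often places "happy" and "sad" close together because they share contexts; the fine-tuning step is what pushes their cosine similarity apart.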
Learning to adapt in dialogue systems: data-driven models for personality recognition and generation.
Dialogue systems are artefacts that converse with human users in order to achieve some task. Each step of the dialogue requires understanding the user's input, deciding on what to reply, and generating an output utterance. Although there are many ways to express any given content, most dialogue systems do not take linguistic variation into account in either the understanding or the generation phase, i.e. the user's linguistic style is typically ignored, and the style conveyed by the system is chosen once for all interactions at development time. We believe that modelling linguistic variation can greatly improve the interaction in dialogue systems, such as in intelligent tutoring systems, video games, or information retrieval systems, which all require specific linguistic styles. Previous work has shown that linguistic style affects many aspects of users' perceptions, even when the dialogue is task-oriented. Moreover, users attribute a consistent personality to machines, even when exposed to a limited set of cues, so dialogue systems manifest personality whether it is designed into the system or not. Over the past few years, psychologists have identified the main dimensions of individual differences in human behaviour: the Big Five personality traits. We hypothesise that the Big Five provide a useful computational framework for modelling important aspects of linguistic variation. This thesis first explores the possibility of recognising the user's personality using data-driven models trained on essays and conversational data. We then test whether it is possible to generate language varying consistently along each personality dimension in the information presentation domain. We present PERSONAGE: a language generator modelling findings from psychological studies to project various personality traits. We use PERSONAGE to compare various generation paradigms: (1) rule-based generation, (2) overgenerate and select, and (3) generation using parameter estimation models, a novel approach that learns to produce recognisable variation along meaningful stylistic dimensions without the computational cost incurred by overgeneration techniques. We also present the first human evaluation of a data-driven generation method that projects multiple stylistic dimensions simultaneously and on a continuous scale.
Effects of cultural characteristics on building an emotion classifier through facial expression analysis
Facial expressions are an important demonstration of humanity's humors and emotions. Algorithms capable of recognizing facial expressions and associating them with emotions were developed and employed to compare the expressions that different cultural groups use to show their emotions. Static pictures of predominantly occidental and oriental subjects from public datasets were used to train machine learning algorithms, whereas local binary patterns, histograms of oriented gradients (HOG), and Gabor filters were employed to describe the facial expressions for six different basic emotions. The most consistent combination, formed by the association of HOG features and support vector machines, was then used to classify the other cultural group: there was a strong drop in accuracy, meaning that the subtle differences in the facial expressions of each culture affected the classifier performance. Finally, a classifier was trained with images from both occidental and oriental subjects and its accuracy was higher on multicultural data, evidencing the need for a multicultural training set to build an efficient classifier. (C) 2015 SPIE and IS&T. This work was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) [2011/22749-8, 2014/04020-9] and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) [307113/2012-4].
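A simplified sketch of the HOG descriptor behind the winning HOG + SVM combination: it computes an orientation histogram for a single image cell only, leaving out the block structure, the full descriptor, and the SVM itself.

```python
import numpy as np

def hog_cell(patch, n_bins=9):
    """Orientation histogram for one image cell (simplified HOG sketch):
    gradient magnitudes are accumulated into unsigned-orientation bins."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180          # unsigned, [0, 180)
    hist = np.zeros(n_bins)
    bins = (ang / (180 / n_bins)).astype(int) % n_bins
    for b, m in zip(bins.ravel(), mag.ravel()):
        hist[b] += m
    return hist / (np.linalg.norm(hist) + 1e-9)         # L2-normalised

# Toy 8x8 cell containing a vertical edge: all gradient energy is horizontal,
# so the 0-degree orientation bin should dominate.
cell = np.zeros((8, 8))
cell[:, 4:] = 1.0
h = hog_cell(cell)
```

Concatenating such histograms over a grid of cells yields the feature vector fed to the SVM; production code would use a library implementation such as `skimage.feature.hog` with interpolated binning.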
MIS-AVoiDD: Modality Invariant and Specific Representation for Audio-Visual Deepfake Detection
Deepfakes are synthetic media generated using deep generative algorithms and have posed a severe societal and political threat. Apart from facial manipulation and synthetic voice, a novel kind of deepfake has recently emerged with either the audio or the visual modality manipulated. In this regard, a new generation of multimodal audio-visual deepfake detectors is being investigated to focus collectively on audio and visual data for multimodal manipulation detection. Existing multimodal (audio-visual) deepfake detectors are often based on the fusion of the audio and visual streams from the video. Existing studies suggest that these multimodal detectors often obtain performance merely equivalent to unimodal audio and visual deepfake detectors. We conjecture that the heterogeneous nature of the audio and visual signals creates distributional modality gaps and poses a significant challenge to effective fusion and efficient performance. In this paper, we tackle the problem at the representation level to aid the fusion of audio and visual streams for multimodal deepfake detection. Specifically, we propose the joint use of modality (audio and visual) invariant and specific representations. This ensures that the common patterns and the patterns specific to each modality representing pristine or fake content are preserved and fused for multimodal deepfake manipulation detection. Our experimental results on the FakeAVCeleb and KoDF audio-visual deepfake datasets suggest the enhanced accuracy of our proposed method over SOTA unimodal and multimodal audio-visual deepfake detectors by % and %, respectively, thus obtaining state-of-the-art performance. Comment: 8 pages, 3 figures
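A common way to realise modality-invariant and modality-specific representations is with a shared projection, per-modality projections, a similarity loss, and an orthogonality penalty. The sketch below assumes that generic formulation with illustrative dimensions; it is not the paper's exact losses or network.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 16, 8                                  # input and projected dims (toy)
audio_feat = rng.standard_normal(d)
video_feat = rng.standard_normal(d)

# A shared projection yields modality-invariant views; per-modality
# projections yield modality-specific views.
W_shared = rng.standard_normal((d, k)) / np.sqrt(d)
W_audio = rng.standard_normal((d, k)) / np.sqrt(d)
W_video = rng.standard_normal((d, k)) / np.sqrt(d)

inv_a, inv_v = audio_feat @ W_shared, video_feat @ W_shared
spec_a, spec_v = audio_feat @ W_audio, video_feat @ W_video

# A similarity loss pulls the invariant views of the two modalities together,
# while an orthogonality penalty keeps the specific views carrying
# information different from the invariant ones.
sim_loss = np.mean((inv_a - inv_v) ** 2)
orth_loss = (inv_a @ spec_a) ** 2 + (inv_v @ spec_v) ** 2

# All four views are fused for the downstream deepfake classifier head.
fused = np.concatenate([inv_a, inv_v, spec_a, spec_v])
```

Minimising `sim_loss` closes the distributional modality gap in the shared subspace, while `orth_loss` stops the specific subspaces from collapsing onto the shared one, so both common and per-modality manipulation cues survive fusion.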
Data-driven Attention and Data-independent DCT based Global Context Modeling for Text-independent Speaker Recognition
Learning an effective speaker representation is crucial for achieving reliable performance in speaker verification tasks. Speech signals are high-dimensional, long, variable-length sequences that entail a complex hierarchical structure. Signals may contain diverse information at each time-frequency (TF) location; for example, it may be more beneficial to focus on high-energy parts for phoneme classes such as fricatives. The standard convolutional layer, which operates on neighboring local regions, cannot capture this complex TF global context information. In this study, a general global time-frequency context modeling framework is proposed to leverage the context information specifically for speaker representation modeling. First, a data-driven attention-based context model is introduced to capture the long-range and non-local relationships across different time-frequency locations. Second, a data-independent 2D-DCT based context model is proposed to improve model interpretability. A multi-DCT attention mechanism is presented to improve modeling power with alternate DCT base forms. Finally, the global context information is used to recalibrate salient time-frequency locations by computing the similarity between the global context and the local features. The proposed lightweight blocks can be easily incorporated into a speaker model with little additional computational cost and effectively improve speaker verification performance by a large margin compared to the standard ResNet model and Squeeze-and-Excitation block. Detailed ablation studies are also performed to analyze various factors that may impact the performance of the proposed individual modules. Results from experiments show that the proposed global context modeling framework can efficiently improve the learned speaker representations by achieving channel-wise and time-frequency feature recalibration.
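The data-independent 2D-DCT context model can be sketched with an orthonormal DCT-II basis. The sigmoid gating used for recalibration below is a simplified stand-in for the paper's similarity computation, and the shapes are toy values.

```python
import numpy as np

def dct_basis(n):
    """Orthonormal DCT-II basis matrix: fixed, data-independent."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    B = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    B[0] /= np.sqrt(2.0)
    return B

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
F, T = 8, 8
feat = rng.standard_normal((F, T))      # one time-frequency feature map

# 2D-DCT of the feature map: a global summary with no learned parameters.
B = dct_basis(F)
coeffs = B @ feat @ dct_basis(T).T
context = coeffs[0, 0]                  # lowest-frequency (DC) coefficient

# Recalibrate each TF location by gating on its agreement with the context.
gate = sigmoid(feat * context)
recalibrated = feat * gate
```

Because the DCT basis is fixed rather than learned, the context is interpretable as specific spatial frequencies of the TF map; a multi-DCT variant would read off several `coeffs` entries instead of only the DC term.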