Multi-Channel Auto-Encoder for Speech Emotion Recognition
Inferring a user's emotional state from spoken queries plays an important role
in enhancing voice dialogue applications. Although several related works have
obtained satisfactory results, performance can still be improved. In this
paper, we propose a novel framework for emotion recognition from acoustic
information, named the multi-channel auto-encoder (MTC-AE). MTC-AE contains
multiple local DNNs based on different low-level descriptors with different
statistics functions, which are partly concatenated together; this structure
lets the model consider both local and global features simultaneously.
Experiments on the benchmark IEMOCAP dataset show that our method
significantly outperforms existing state-of-the-art results, achieving a
leave-one-speaker-out unweighted accuracy higher than the best previously
reported result on this dataset.
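
As a rough sketch of the multi-channel idea (not the authors' exact
configuration: the descriptor-group sizes, layer widths, and four-class output
below are invented for illustration), each low-level-descriptor group feeds
its own local DNN, and the local codes are partly concatenated with the raw
global features before classification:

    import torch
    import torch.nn as nn

    class LocalBranch(nn.Module):
        """One local DNN over a single low-level-descriptor group."""
        def __init__(self, in_dim, hid_dim=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        def forward(self, x):
            return self.net(x)

    class MultiChannelNet(nn.Module):
        def __init__(self, group_dims, n_classes=4):
            super().__init__()
            self.branches = nn.ModuleList([LocalBranch(d) for d in group_dims])
            fused_dim = 64 * len(group_dims) + sum(group_dims)
            self.classifier = nn.Sequential(
                nn.Linear(fused_dim, 128), nn.ReLU(), nn.Linear(128, n_classes))
        def forward(self, groups):
            local = [b(g) for b, g in zip(self.branches, groups)]  # local views
            fused = torch.cat(local + list(groups), dim=-1)        # plus global
            return self.classifier(fused)

    model = MultiChannelNet(group_dims=[32, 48, 16])
    logits = model([torch.randn(8, 32), torch.randn(8, 48), torch.randn(8, 16)])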
Phonocardiographic Sensing using Deep Learning for Abnormal Heartbeat Detection
Cardiac auscultation involves expert interpretation of abnormalities in heart
sounds using a stethoscope. Deep learning based cardiac auscultation is of
significant interest to the healthcare community, as it can help reduce the
burden of manual auscultation through automated detection of abnormal heartbeats.
However, the problem of automatic cardiac auscultation is complicated due to
the requirement of reliability and high accuracy, and due to the presence of
background noise in the heartbeat sound. In this work, we propose a Recurrent
Neural Networks (RNNs) based automated cardiac auscultation solution. Our
choice of RNNs is motivated by the great success of deep learning in medical
applications and by the observation that RNNs represent the deep learning
configuration most suitable for dealing with sequential or temporal data even
in the presence of noise. We explore the use of various RNN models and
demonstrate that they deliver significantly improved abnormal-heartbeat
classification scores. Our proposed approach using RNNs can potentially be
used for real-time abnormal heartbeat detection in the Internet of Medical
Things for remote monitoring applications.
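
A minimal sketch of such a model in PyTorch, assuming per-frame spectral
features such as MFCCs (the paper's exact RNN variant, feature set, and layer
sizes are not given here, so these are illustrative choices):

    import torch
    import torch.nn as nn

    class HeartbeatRNN(nn.Module):
        """Bidirectional LSTM over per-frame spectral features of a
        heart-sound clip, classifying it as normal vs. abnormal."""
        def __init__(self, n_feats=13, hid=64):
            super().__init__()
            self.rnn = nn.LSTM(n_feats, hid, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hid, 2)
        def forward(self, x):                 # x: (batch, time, n_feats)
            h, _ = self.rnn(x)
            return self.out(h[:, -1])         # classify from the final step

    clips = torch.randn(4, 200, 13)           # 4 clips, 200 frames, 13 MFCCs
    logits = HeartbeatRNN()(clips)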
Multimodal Speech Emotion Recognition and Ambiguity Resolution
Identifying emotion from speech is a non-trivial task owing to the ambiguous
definition of emotion itself. In this work, we adopt a
feature-engineering based approach to tackle the task of speech emotion
recognition. Formalizing our problem as a multi-class classification problem,
we compare the performance of two categories of models. For both, we extract
eight hand-crafted features from the audio signal. In the first approach, the
extracted features are used to train six traditional machine learning
classifiers, whereas the second approach is based on deep learning wherein a
baseline feed-forward neural network and an LSTM-based classifier are trained
over the same features. In order to resolve ambiguity in communication, we also
include features from the text domain. We report accuracy, F-score, precision,
and recall for the different experimental settings in which we evaluated our models.
Overall, we show that lighter machine learning based models trained over a few
hand-crafted features are able to achieve performance comparable to the current
deep learning based state-of-the-art method for emotion recognition.
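
To illustrate the "lighter" route (the actual eight features and six
classifiers are not reproduced here; the random-forest choice and synthetic
data below are placeholders), a handful of hand-crafted features per clip can
already be cross-validated with scikit-learn:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X = np.random.rand(500, 8)           # 8 hand-crafted audio features per clip
    y = np.random.randint(0, 4, 500)     # 4 emotion classes (synthetic labels)

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    print(cross_val_score(clf, X, y, scoring="f1_macro", cv=5).mean())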
Group Emotion Recognition Using Machine Learning
Automatic facial emotion recognition is a challenging task that has gained
significant scientific interest over the past few years, but the problem of
emotion recognition for a group of people has been less extensively studied.
However, it is slowly gaining popularity due to the massive amount of data
available on social networking sites containing images of groups of people
participating in various social events. Group emotion recognition is a
challenging problem due to obstacles such as head and body pose variations,
occlusions, variable lighting conditions, variation among actors, varied indoor
and outdoor settings, and image quality. The objective of this task is to classify a
group's perceived emotion as Positive, Neutral or Negative. In this report, we
describe our solution which is a hybrid machine learning system that
incorporates deep neural networks and Bayesian classifiers. Deep Convolutional
Neural Networks (CNNs) work from bottom to top, analysing facial expressions
expressed by individual faces extracted from the image. The Bayesian network
works from top to bottom, inferring the global emotion for the image, by
integrating the visual features of the contents of the image obtained through a
scene descriptor. In the final pipeline, the group emotion category predicted
by an ensemble of CNNs in the bottom-up module is passed as input to the
Bayesian Network in the top-down module and an overall prediction for the image
is obtained. Experimental results show that the proposed system achieves 65.27%
accuracy on the validation set which is in line with state-of-the-art results.
As an outcome of this project, a Progressive Web Application and an
accompanying Android app with a simple and intuitive user interface are
presented, allowing users to test out the system with their own pictures.
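
A toy sketch of the bottom-up/top-down fusion (the report's actual Bayesian
network is richer than this naive-Bayes combination; the probabilities below
are made up):

    import numpy as np

    def fuse_group_emotion(face_probs, scene_prior):
        """Combine per-face CNN predictions (bottom-up) with a scene-level
        prior (top-down). face_probs: (n_faces, 3) over Positive/Neutral/
        Negative; scene_prior: (3,)."""
        bottom_up = face_probs.mean(axis=0)   # pooled face-level evidence
        post = bottom_up * scene_prior        # posterior ∝ likelihood * prior
        return post / post.sum()

    faces = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])
    prior = np.array([0.5, 0.3, 0.2])         # from the scene descriptor
    print(fuse_group_emotion(faces, prior))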
An Integrated Autoencoder-Based Filter for Sparse Big Data
We propose a novel filter for sparse big data, called an integrated
autoencoder (IAE), which utilizes auxiliary information to mitigate data
sparsity. The proposed model achieves an appropriate balance between prediction
accuracy, convergence speed, and complexity. We implement experiments on a GPS
trajectory dataset, and the results demonstrate that the IAE is more accurate
and robust than some state-of-the-art methods.
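
One plausible reading of the idea, sketched in PyTorch (the IAE's real
architecture and the GPS feature dimensions are not specified here, so
everything below is an assumption): concatenate the sparse input with its
auxiliary features, reconstruct the sparse part, and train only on observed
entries:

    import torch
    import torch.nn as nn

    class AuxAutoencoder(nn.Module):
        """Autoencoder that conditions on auxiliary features when
        reconstructing a sparse input vector."""
        def __init__(self, sparse_dim, aux_dim, code=32):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(sparse_dim + aux_dim, code), nn.ReLU())
            self.dec = nn.Linear(code, sparse_dim)
        def forward(self, x_sparse, x_aux):
            return self.dec(self.enc(torch.cat([x_sparse, x_aux], dim=-1)))

    def masked_mse(pred, target, mask):
        # penalize only the entries that were actually observed
        return ((pred - target) ** 2 * mask).sum() / mask.sum()

    ae = AuxAutoencoder(sparse_dim=100, aux_dim=10)
    x, aux = torch.rand(4, 100), torch.rand(4, 10)
    mask = (x > 0.8).float()                  # pretend only these are observed
    loss = masked_mse(ae(x, aux), x, mask)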
Toward Automated Classroom Observation: Multimodal Machine Learning to Estimate CLASS Positive Climate and Negative Climate
In this work we present a multi-modal machine learning-based system, which we
call ACORN, to analyze videos of school classrooms for the Positive Climate
(PC) and Negative Climate (NC) dimensions of the CLASS observation protocol
that is widely used in educational research. ACORN uses convolutional neural
networks to analyze spectral audio features, the faces of teachers and
students, and the pixels of each image frame, and then integrates this
information over time using Temporal Convolutional Networks. The audiovisual
ACORN's PC and NC predictions correlate with ground-truth scores provided by
expert CLASS coders on the UVA Toddler dataset (cross-validation on 15-min
video segments), and a purely auditory ACORN predicts PC and NC on the MET
dataset (held-out test set of video segments). These correlations are similar
to the inter-coder reliability of human coders. Finally, using Graph
Convolutional Networks we make early strides toward predicting the specific
moments (45-90 sec clips) when the PC is particularly weak/strong. Our findings
inform the design of automatic classroom observation as well as more general
video activity recognition and summarization systems. (Note: the authors
subsequently discovered that the results are not reproducible.)
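
For the temporal-integration step, a generic dilated-convolution head like the
following captures the flavor (this is not ACORN's code; the feature
dimension, depths, and pooling are guesses):

    import torch
    import torch.nn as nn

    class TCNHead(nn.Module):
        """Dilated 1-D convolutions that integrate per-frame features over
        time, ending in a single climate-score regression output."""
        def __init__(self, in_dim=128, hid=64):
            super().__init__()
            self.tcn = nn.Sequential(
                nn.Conv1d(in_dim, hid, 3, padding=1, dilation=1), nn.ReLU(),
                nn.Conv1d(hid, hid, 3, padding=2, dilation=2), nn.ReLU(),
                nn.Conv1d(hid, hid, 3, padding=4, dilation=4), nn.ReLU())
            self.out = nn.Linear(hid, 1)
        def forward(self, feats):                 # (batch, time, in_dim)
            h = self.tcn(feats.transpose(1, 2))   # -> (batch, hid, time)
            return self.out(h.mean(dim=2))        # pool over time, regress

    score = TCNHead()(torch.randn(2, 900, 128))   # e.g. 900 frames per segment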
GrCAN: Gradient Boost Convolutional Autoencoder with Neural Decision Forest
Random forests and deep neural networks are two effective families of
classification methods in machine learning. While random forests are robust
irrespective of the data domain, deep neural networks have advantages in
handling high-dimensional data. Since a differentiable neural decision forest
can be added to a neural network to exploit the benefits of both models, in
our work we further combine a convolutional autoencoder with a neural decision
forest, where the autoencoder has the advantage of finding hidden
representations of the input data. We develop a gradient boost module and embed
it into the proposed convolutional autoencoder with neural decision forest to
improve the performance. The idea of gradient boost is to learn and use the
residual in the prediction. In addition, we design a structure to learn the
parameters of the neural decision forest and the gradient boost module in
consecutive steps. Extensive experiments on several public datasets
demonstrate that our proposed model achieves good efficiency and prediction
performance compared with a series of baseline methods.
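
The residual idea itself is easy to demonstrate in isolation (a decision tree
stands in for the paper's gradient boost module; the data are synthetic):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    X = np.linspace(0, 6, 200).reshape(-1, 1)
    y = np.sin(X).ravel()
    base_pred = np.full_like(y, y.mean())     # stage 1: a crude base model

    # stage 2: fit a second module to the residual, then add it back
    booster = DecisionTreeRegressor(max_depth=3).fit(X, y - base_pred)
    boosted = base_pred + booster.predict(X)
    print(np.mean((y - boosted) ** 2) < np.mean((y - base_pred) ** 2))  # True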
audEERING's approach to the One-Minute-Gradual Emotion Challenge
This paper describes audEERING's submissions as well as additional
evaluations for the One-Minute-Gradual (OMG) emotion recognition challenge. We
provide the results for audio and video processing on subject (in)dependent
evaluations. On the provided Development set, we achieved a Concordance
Correlation Coefficient (CCC) of 0.343 for arousal (from audio) and 0.401 for
valence (from video).
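
For reference, the CCC used here is the standard concordance measure,
computable in a few lines:

    import numpy as np

    def ccc(x, y):
        """Concordance Correlation Coefficient between predictions x and
        gold annotations y."""
        mx, my = x.mean(), y.mean()
        cov = ((x - mx) * (y - my)).mean()
        return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

    print(ccc(np.array([0.1, 0.4, 0.8]), np.array([0.2, 0.5, 0.7])))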
Deep video gesture recognition using illumination invariants
In this paper we present architectures based on deep neural nets for gesture
recognition in videos, which are invariant to local scaling. We amalgamate
autoencoder and predictor architectures using an adaptive weighting scheme to
cope with a small labeled dataset while enriching our models with enormous
unlabeled sets. We further improve robustness to lighting conditions by
introducing a new adaptive filter based on temporal local scale normalization.
We provide superior results over known methods, including recently reported
approaches based on neural nets.
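
As a rough sketch of temporal local scale normalization (the paper's adaptive
filter is surely more elaborate; the window size and the simple mean are
assumptions), each pixel can be divided by a local temporal mean so that slow
lighting drift cancels out:

    import torch
    import torch.nn.functional as F

    def temporal_local_norm(frames, win=5, eps=1e-6):
        """Divide each pixel by its local temporal mean.
        frames: (T, H, W) video tensor with values in [0, 1]."""
        x = frames[None, None]                # (1, 1, T, H, W)
        local_mean = F.avg_pool3d(x, kernel_size=(win, 1, 1),
                                  stride=1, padding=(win // 2, 0, 0))
        return (x / (local_mean + eps))[0, 0]

    video = torch.rand(30, 64, 64)            # 30 frames of 64x64 video
    normed = temporal_local_norm(video)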
Finding Good Representations of Emotions for Text Classification
It is important for machines to interpret human emotions properly for better
human-machine communications, as emotion is an essential part of human-to-human
communications. One aspect of emotion is reflected in the language we use. How
to represent emotions in texts is a challenge in natural language processing
(NLP). Although continuous vector representations like word2vec have become the
new norm for NLP problems, their limitation is that they do not take emotions
into consideration and can unintentionally encode bias toward certain
identities, such as gender.
This thesis focuses on improving existing representations at both the word and
sentence levels by explicitly taking the emotions in text and model bias into
account during training. Our improved representations can help to
build more robust machine learning models for affect-related text
classification like sentiment/emotion analysis and abusive language detection.
We first propose representations called emotional word vectors (EVEC), which
are learned from a convolutional neural network model using an emotion-labeled
corpus constructed from hashtags. Secondly, we extend this approach to
sentence-level representations, learned from a huge corpus of texts with the
pseudo-task of recognizing emojis. Our results show that, with the representations trained
from millions of tweets with weakly supervised labels such as hashtags and
emojis, we can solve sentiment/emotion analysis tasks more effectively.
Lastly, as an example of model bias in existing representations, we explore
the specific problem of automatically detecting abusive language. We address
the issue of gender bias in various neural network models by conducting
experiments to measure and reduce these biases in the representations, in
order to build more robust classification models. (HKUST MPhil thesis, 87 pages.)
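
A compact sketch of the EVEC training setup (vocabulary size, dimensions, and
emotion count are placeholders; the thesis's exact model is not reproduced):
train a small text CNN on hashtag-derived emotion labels, then keep the
embedding table as the emotional word vectors:

    import torch
    import torch.nn as nn

    class EmotionCNN(nn.Module):
        def __init__(self, vocab=20000, dim=100, n_emotions=6):
            super().__init__()
            self.emb = nn.Embedding(vocab, dim)       # -> EVEC after training
            self.conv = nn.Conv1d(dim, 64, kernel_size=3, padding=1)
            self.out = nn.Linear(64, n_emotions)
        def forward(self, tokens):                    # tokens: (batch, seq)
            h = self.conv(self.emb(tokens).transpose(1, 2)).relu()
            return self.out(h.max(dim=2).values)      # max-pool over positions

    model = EmotionCNN()
    logits = model(torch.randint(0, 20000, (8, 40)))
    # after training, rows of model.emb.weight are the emotional word vectors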