Video Based Emotion Recognition Using CNN and BRNN
Video-based emotion recognition is more challenging than still-image vision tasks: it must model the spatial information of each image frame as well as the temporal contextual correlations among sequential frames. For this purpose, we propose a hierarchical deep network architecture to extract high-level spatio-temporal features. Two classic neural networks, a convolutional neural network (CNN) and a bi-directional recurrent neural network (BRNN), are employed to capture facial textural characteristics in the spatial domain and dynamic emotion changes in the temporal domain. We coordinate the two networks by optimizing each of them to boost emotion recognition performance and to achieve greater accuracy than the baselines.
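The CNN-then-BRNN pipeline described above can be sketched in a few lines of pure Python. The toy `cnn_features` extractor, the tanh cell, and all weights below are illustrative assumptions, not the authors' implementation; the point is only the flow of per-frame spatial features into a forward and a backward recurrent pass whose states are concatenated per frame.

```python
# Minimal sketch of a CNN -> BRNN pipeline (illustrative stand-ins only).
import math

def cnn_features(frame):
    """Stand-in for a CNN: map a raw frame (list of pixel values)
    to a small spatial feature vector (here: mean and variance)."""
    mean = sum(frame) / len(frame)
    var = sum((p - mean) ** 2 for p in frame) / len(frame)
    return [mean, var]

def rnn_pass(features, direction=1):
    """One recurrent pass over the frame features; a toy tanh cell
    with one hidden unit per feature dimension."""
    h = [0.0] * len(features[0])
    states = []
    for feat in features[::direction]:
        h = [math.tanh(0.5 * f + 0.5 * hi) for f, hi in zip(feat, h)]
        states.append(h)
    return states[::direction]  # restore forward time order

def brnn(frames):
    feats = [cnn_features(f) for f in frames]
    fwd = rnn_pass(feats, direction=1)   # past -> future context
    bwd = rnn_pass(feats, direction=-1)  # future -> past context
    # concatenate both directions per frame, as in a BRNN
    return [f + b for f, b in zip(fwd, bwd)]

video = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
out = brnn(video)  # one 4-dim spatio-temporal feature per frame
```

In a real system the per-frame vectors would come from a trained CNN and the recurrent cells would be learned, but the bidirectional concatenation step is the same.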
Convolutional Neural Network Array for Sign Language Recognition using Wearable IMUs
Advancements in gesture recognition algorithms have led to a significant
growth in sign language translation. By making use of efficient intelligent
models, signs can be recognized with precision. The proposed work presents a
novel one-dimensional Convolutional Neural Network (CNN) array architecture for
recognition of signs from the Indian sign language using signals recorded from
a custom designed wearable IMU device. The IMU device makes use of tri-axial
accelerometer and gyroscope. The signals recorded using the IMU device are
segregated on the basis of their context, such as whether they correspond to
signing for a general sentence or an interrogative sentence. The array
comprises two individual CNNs, one classifying the general sentences and the
other the interrogative sentences. The performances of the individual CNNs
in the array architecture are compared to that of a conventional CNN
classifying the unsegregated dataset. The proposed CNN array achieves peak
classification accuracies of 94.20% for general sentences and 95.00% for
interrogative sentences, compared with 93.50% for the conventional CNN,
supporting the suitability of the proposed approach. Comment: https://doi.org/10.1109/SPIN.2019.871174
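The "array" idea above, routing each signal to a context-specific classifier, can be sketched as follows. The context check and both classifiers are hypothetical stand-ins (the real ones are 1-D CNNs over IMU signals), assumed here only to show the dispatch structure.

```python
# Illustrative sketch of the two-CNN array: signals are segregated by
# context, then routed to a context-specific classifier.

def is_interrogative(signal):
    """Hypothetical context check; in the paper, segregation is done
    on the recorded IMU signals themselves."""
    return signal["context"] == "interrogative"

def general_cnn(features):
    # stand-in for the general-sentence CNN
    return "general-sign-" + str(round(sum(features), 1))

def interrogative_cnn(features):
    # stand-in for the interrogative-sentence CNN
    return "question-sign-" + str(round(sum(features), 1))

def cnn_array(signal):
    """Route the signal to the CNN trained on its context."""
    if is_interrogative(signal):
        return interrogative_cnn(signal["imu"])
    return general_cnn(signal["imu"])

label = cnn_array({"context": "general", "imu": [0.2, 0.3]})
```

Each CNN then only has to separate classes within its own context, which is the intuition behind the accuracy gain over a single CNN on the unsegregated data.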
Beyond Short Snippets: Deep Networks for Video Classification
Convolutional neural networks (CNNs) have been extensively applied for image
recognition problems giving state-of-the-art results on recognition, detection,
segmentation and retrieval. In this work we propose and evaluate several deep
neural network architectures to combine image information across a video over
longer time periods than previously attempted. We propose two methods capable
of handling full length videos. The first method explores various convolutional
temporal feature pooling architectures, examining the various design choices
which need to be made when adapting a CNN for this task. The second proposed
method explicitly models the video as an ordered sequence of frames. For this
purpose we employ a recurrent neural network that uses Long Short-Term Memory
(LSTM) cells which are connected to the output of the underlying CNN. Our best
networks exhibit significant performance improvements over previously published
results on the Sports-1M dataset (73.1% vs. 60.9%) and the UCF-101
dataset with (88.6% vs. 88.0%) and without additional optical flow information
(82.6% vs. 72.8%).
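The first method above, temporal feature pooling over per-frame CNN features, can be sketched directly: an element-wise pooling across the time axis collapses a variable-length video into one fixed-size descriptor. The feature values and dimensions below are illustrative.

```python
# Sketch of temporal feature pooling over per-frame CNN features.

def temporal_max_pool(frame_features):
    """Element-wise max across the time axis."""
    return [max(col) for col in zip(*frame_features)]

def temporal_avg_pool(frame_features):
    """Element-wise mean across the time axis."""
    n = len(frame_features)
    return [sum(col) / n for col in zip(*frame_features)]

feats = [[0.1, 0.9], [0.5, 0.2], [0.3, 0.4]]  # 3 frames, 2-dim features
pooled = temporal_max_pool(feats)
```

The second method in the abstract replaces this order-agnostic pooling with an LSTM over the same per-frame features, preserving the frame ordering that pooling throws away.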
Adaptation and contextualization of deep neural network models
The ability of Deep Neural Networks (DNNs) to provide very high accuracy in classification and recognition problems makes them the major tool for developments in such problems. It is, however, known that DNNs are currently used in a ‘black box’ manner, lacking transparency and interpretability of their decision-making process. Moreover, DNNs should use prior information on data classes, or object categories, so as to provide efficient classification of new data, or objects, without forgetting their previous knowledge. In this paper, we propose a novel class of systems that are able to adapt and contextualize the structure of trained DNNs, providing ways to handle the above-mentioned problems. A hierarchical and distributed system memory is generated and used for this purpose. The main memory is composed of the trained DNN architecture for classification/prediction, i.e., its structure and weights, as well as of an extracted, equivalent, Clustered Representation Set (CRS) generated by the DNN during training at its final hidden layer (before the output). The latter includes centroids (‘points of attraction’) which link the extracted representation to a specific area in the existing system memory. Drift detection, occurring, for example, in personalized data analysis, can be accomplished by comparing the distances of new data from the centroids, taking into account the intra-cluster distances. Moreover, using the generated CRS, the system is able to contextualize its decision-making process when new data become available. A new public medical database on Parkinson’s disease is used as a testbed to illustrate the capabilities of the proposed architecture.
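The drift-detection step described above can be sketched concretely: a new sample is compared against the CRS centroids, and it is flagged as drift when its distance to the nearest centroid exceeds that cluster's intra-cluster radius. The centroid values, radii, and the hard-threshold rule below are assumptions for illustration.

```python
# Sketch of centroid-based drift detection over a CRS-like memory.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def detect_drift(sample, centroids, radii):
    """centroids: cluster centers from the final hidden layer;
    radii: matching intra-cluster distances. Returns True when the
    sample falls outside every known cluster's radius."""
    dists = [euclidean(sample, c) for c in centroids]
    i = dists.index(min(dists))
    return dists[i] > radii[i]

centroids = [[0.0, 0.0], [5.0, 5.0]]
radii = [1.0, 1.5]
inside = detect_drift([0.2, 0.1], centroids, radii)   # near cluster 0
drifted = detect_drift([2.5, 2.5], centroids, radii)  # far from both
```

A production system might use a soft margin or a per-cluster covariance instead of a single radius, but the comparison of centroid distance against intra-cluster distance is the core of the check.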
Multimodal Content Analysis for Effective Advertisements on YouTube
The rapid advances in e-commerce and Web 2.0 technologies have greatly
increased the impact of commercial advertisements on the general public. As a
key enabling technology, a multitude of recommender systems exist which
analyze user features and browsing patterns to recommend appealing
advertisements to users. In this work, we seek to study the attributes
that characterize an effective advertisement and recommend a useful
set of features to aid the designing and production processes of commercial
advertisements. We analyze the temporal patterns from multimedia content of
advertisement videos including auditory, visual and textual components, and
study their individual roles and synergies in the success of an advertisement.
The objective of this work is then to measure the effectiveness of an
advertisement, and to recommend a useful set of features to advertisement
designers to make it more successful and approachable to users. Our proposed
framework employs the signal processing technique of cross modality feature
learning where data streams from different components are employed to train
separate neural network models and are then fused together to learn a shared
representation. Subsequently, a neural network model trained on this joint
feature embedding representation is utilized as a classifier to predict
advertisement effectiveness. We validate our approach using subjective ratings
from a dedicated user study, the sentiment strength of online viewer comments,
and a viewer opinion metric of the ratio of the Likes and Views received by
each advertisement from an online platform. Comment: 11 pages, 5 figures, ICDM 201
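The cross-modality fusion step above, separate per-modality encoders whose outputs are combined into a shared representation, can be sketched as a simple late-fusion by concatenation. The toy `encode` function and the feature values are illustrative assumptions; the paper trains separate neural networks per stream before fusing.

```python
# Illustrative sketch of late fusion across modality streams.

def encode(stream):
    """Stand-in per-modality encoder: mean and peak of the stream."""
    return [sum(stream) / len(stream), max(stream)]

def fuse(audio, visual, text):
    """Concatenate the modality embeddings into one joint embedding,
    which a downstream classifier would consume."""
    return encode(audio) + encode(visual) + encode(text)

joint = fuse(audio=[0.2, 0.4], visual=[0.9, 0.1], text=[0.5, 0.5])
```

In the actual framework the fused representation is learned jointly rather than concatenated, but the shape of the pipeline, per-stream encoding followed by a shared embedding fed to an effectiveness classifier, is the same.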
Unsupervised Adversarial Domain Adaptation for Cross-Lingual Speech Emotion Recognition
Cross-lingual speech emotion recognition (SER) is a crucial task for many
real-world applications. The performance of SER systems is often degraded by
the differences in the distributions of training and test data. These
differences become more apparent when training and test data belong to
different languages, causing a significant performance gap between the
validation and test scores. It is imperative to build more robust models that
can fit in practical applications of SER systems. Therefore, in this paper, we
propose a Generative Adversarial Network (GAN)-based model for multilingual
SER. Our choice of a GAN is motivated by its great success in learning
the underlying data distribution. The proposed model is designed so that
it can learn language-invariant representations without requiring
target-language data labels. We evaluate our proposed model on four different
language emotional datasets, including an Urdu-language dataset to also
incorporate alternative languages for which labelled data is difficult to find
and which have not been studied much by the mainstream community. Our results
show that our proposed model can significantly improve the baseline
cross-lingual SER performance for all the considered datasets including the
non-mainstream Urdu language data without requiring any labels. Comment: Accepted in Affective Computing & Intelligent Interaction (ACII 2019)
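The adversarial objective behind such language-invariant training can be sketched numerically: the feature encoder is rewarded for emotion accuracy and penalized when a language discriminator succeeds on its representation (gradient-reversal style). The loss form, the lambda weight, and the toy probabilities below are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of an adversarial objective for language-invariant SER features.
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])

def encoder_objective(emotion_probs, emotion_label,
                      language_probs, language_label, lam=0.3):
    emo_loss = cross_entropy(emotion_probs, emotion_label)
    lang_loss = cross_entropy(language_probs, language_label)
    # Minimizing this drives the encoder to predict emotion well while
    # making the language hard to recover from its representation.
    return emo_loss - lam * lang_loss

# Confident correct emotion, language discriminator at chance:
loss = encoder_objective([0.7, 0.3], 0, [0.5, 0.5], 1)
```

Note that the encoder objective decreases as the language discriminator's loss grows, which is exactly what pushes the learned representation toward language invariance without needing target-language labels.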