23 research outputs found

    Video Based Emotion Recognition Using CNN and BRNN

    Video-based emotion recognition is a rather challenging vision task. It needs to model the spatial information of each image frame as well as the temporal contextual correlations among sequential frames. For this purpose, we propose a hierarchical deep network architecture to extract high-level spatio-temporal features. Two classic neural networks, a convolutional neural network (CNN) and a bi-directional recurrent neural network (BRNN), are employed to capture facial textural characteristics in the spatial domain and dynamic emotion changes in the temporal domain. We coordinate the two networks by optimizing each of them, aiming to boost emotion recognition performance and to achieve greater accuracy than the baselines.
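
The bi-directional idea in this abstract can be illustrated with a toy sketch. It assumes each frame has already been reduced to a single scalar feature (standing in for a CNN output) and uses a plain tanh recurrence with hand-picked weights; the names `rnn_pass` and `bidirectional` and the weights `w_in` and `w_rec` are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def rnn_pass(inputs: Sequence[float], w_in: float, w_rec: float) -> List[float]:
    """Run a scalar tanh recurrence over the inputs, returning every hidden state."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional(
    inputs: Sequence[float], w_in: float = 0.5, w_rec: float = 0.5
) -> List[Tuple[float, float]]:
    """Pair each frame's forward state (past context) with its backward state (future context)."""
    forward = rnn_pass(inputs, w_in, w_rec)
    # Process the reversed sequence, then flip back so index i again refers to frame i.
    backward = rnn_pass(list(reversed(inputs)), w_in, w_rec)[::-1]
    return list(zip(forward, backward))


# Hypothetical per-frame features: only the first frame carries a signal.
states = bidirectional([1.0, 0.0, 0.0])
```

In this toy run the last frame's forward state is still non-zero because it carries information from the first frame, while its backward state is zero because no later frame exists. Combining both directions gives every frame access to context from the whole clip.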

    Convolutional Neural Network Array for Sign Language Recognition using Wearable IMUs

    Advancements in gesture recognition algorithms have led to significant growth in sign language translation. By making use of efficient intelligent models, signs can be recognized with precision. The proposed work presents a novel one-dimensional Convolutional Neural Network (CNN) array architecture for recognition of signs from the Indian sign language using signals recorded from a custom-designed wearable IMU device. The IMU device makes use of a tri-axial accelerometer and gyroscope. The signals recorded using the IMU device are segregated on the basis of their context, such as whether they correspond to signing for a general sentence or an interrogative sentence. The array comprises two individual CNNs, one classifying the general sentences and the other classifying the interrogative sentences. The performance of the individual CNNs in the array architecture is compared to that of a conventional CNN classifying the unsegregated dataset. Peak classification accuracies of 94.20% for general sentences and 95.00% for interrogative sentences, achieved with the proposed CNN array, in comparison to 93.50% for the conventional CNN, assert the suitability of the proposed approach. Comment: https://doi.org/10.1109/SPIN.2019.871174

    Beyond Short Snippets: Deep Networks for Video Classification

    Convolutional neural networks (CNNs) have been extensively applied to image recognition problems, giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full-length videos. The first method explores various convolutional temporal feature pooling architectures, examining the design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101 datasets with (88.6% vs. 88.0%) and without additional optical flow information (82.6% vs. 72.8%).
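
As a rough illustration of the first method, temporal feature pooling collapses a variable number of per-frame CNN feature vectors into one fixed-size video descriptor. The sketch below assumes element-wise max as the pooling operator and uses made-up three-dimensional features; the function name `temporal_max_pool` is hypothetical, and the abstract does not say which pooling variant performed best.

```python
from typing import List, Sequence


def temporal_max_pool(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise max over per-frame feature vectors (one vector per frame)."""
    if not frame_features:
        raise ValueError("need at least one frame")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frames must share the same feature dimension")
    # For each feature dimension, keep the strongest activation seen in any frame.
    return [max(f[i] for f in frame_features) for i in range(dim)]


# Hypothetical CNN outputs for a three-frame clip.
frames = [[0.1, 0.9, 0.0], [0.4, 0.2, 0.3], [0.2, 0.5, 0.8]]
video_descriptor = temporal_max_pool(frames)  # [0.4, 0.9, 0.8]
```

The descriptor has the same length no matter how many frames the clip contains, which is what lets a fixed-size classifier handle full-length videos. Pooling discards frame order, which is the gap the second, LSTM-based method addresses.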

    Adaptation and contextualization of deep neural network models

    The ability of Deep Neural Networks (DNNs) to provide very high accuracy in classification and recognition problems makes them the major tool for developments in such problems. It is, however, known that DNNs are currently used in a ‘black box’ manner, lacking transparency and interpretability of their decision-making process. Moreover, DNNs should use prior information on data classes, or object categories, so as to provide efficient classification of new data, or objects, without forgetting their previous knowledge. In this paper, we propose a novel class of systems that are able to adapt and contextualize the structure of trained DNNs, providing ways of handling the above-mentioned problems. A hierarchical and distributed system memory is generated and used for this purpose. The main memory is composed of the trained DNN architecture for classification/prediction, i.e., its structure and weights, as well as an extracted, equivalent Clustered Representation Set (CRS) generated by the DNN during training at its final hidden layer (the one before the output). The latter includes centroids (‘points of attraction’) which link the extracted representation to a specific area in the existing system memory. Drift detection, occurring, for example, in personalized data analysis, can be accomplished by comparing the distances of new data from the centroids, taking into account the intra-cluster distances. Moreover, using the generated CRS, the system is able to contextualize its decision-making process when new data become available. A new public medical database on Parkinson’s disease is used as a testbed to illustrate the capabilities of the proposed architecture.
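
The drift check described in this abstract can be sketched as a distance test against cluster centroids. The version below is a minimal sketch under stated assumptions: Euclidean distance, mean intra-cluster distance as the per-cluster scale, and a `tolerance` multiplier of 2.0. The paper's actual decision rule is not given in the abstract, and all function names here are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points: Sequence[Point]) -> List[float]:
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def mean_intra_cluster_distance(points: Sequence[Point], center: Point) -> float:
    return sum(euclidean(p, center) for p in points) / len(points)


def is_drift(sample: Point, clusters: Sequence[Sequence[Point]], tolerance: float = 2.0) -> bool:
    """Flag drift when the sample lies outside every cluster's scaled radius."""
    for points in clusters:
        center = centroid(points)
        radius = mean_intra_cluster_distance(points, center)
        if euclidean(sample, center) <= tolerance * radius:
            return False  # close enough to a known cluster
    return True


# Hypothetical hidden-layer representations grouped into two clusters.
clusters = [
    [[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]],
    [[10.0, 10.0], [10.0, 12.0], [12.0, 10.0], [12.0, 12.0]],
]
near = is_drift([1.5, 1.5], clusters)  # False: inside the first cluster
far = is_drift([5.0, 5.0], clusters)   # True: far from both centroids
```

Scaling the threshold by each cluster's own spread is one way to take intra-cluster distances into account: a tight cluster flags drift sooner than a diffuse one.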

    Multimodal Content Analysis for Effective Advertisements on YouTube

    The rapid advances in e-commerce and Web 2.0 technologies have greatly increased the impact of commercial advertisements on the general public. As a key enabling technology, a multitude of recommender systems exist which analyze user features and browsing patterns to recommend appealing advertisements to users. In this work, we seek to study the attributes that characterize an effective advertisement and recommend a useful set of features to aid the design and production processes of commercial advertisements. We analyze the temporal patterns from the multimedia content of advertisement videos, including auditory, visual and textual components, and study their individual roles and synergies in the success of an advertisement. The objective of this work is then to measure the effectiveness of an advertisement, and to recommend a useful set of features to advertisement designers to make it more successful and approachable to users. Our proposed framework employs the signal processing technique of cross-modality feature learning, where data streams from different components are employed to train separate neural network models and are then fused together to learn a shared representation. Subsequently, a neural network model trained on this joint feature embedding representation is utilized as a classifier to predict advertisement effectiveness. We validate our approach using subjective ratings from a dedicated user study, the sentiment strength of online viewer comments, and a viewer opinion metric of the ratio of the Likes and Views received by each advertisement from an online platform. Comment: 11 pages, 5 figures, ICDM 201

    Unsupervised Adversarial Domain Adaptation for Cross-Lingual Speech Emotion Recognition

    Cross-lingual speech emotion recognition (SER) is a crucial task for many real-world applications. The performance of SER systems is often degraded by the differences in the distributions of training and test data. These differences become more apparent when training and test data belong to different languages, which causes a significant performance gap between the validation and test scores. It is imperative to build more robust models that can fit in practical applications of SER systems. Therefore, in this paper, we propose a Generative Adversarial Network (GAN)-based model for multilingual SER. Our choice of GANs is motivated by their great success in learning the underlying data distribution. The proposed model is designed in such a way that it can learn language-invariant representations without requiring target-language data labels. We evaluate our proposed model on four different language emotional datasets, including an Urdu-language dataset to also incorporate alternative languages for which labelled data is difficult to find and which have not been studied much by the mainstream community. Our results show that our proposed model can significantly improve the baseline cross-lingual SER performance for all the considered datasets, including the non-mainstream Urdu language data, without requiring any labels. Comment: Accepted in Affective Computing & Intelligent Interaction (ACII 2019