
    Baseline CNN structure analysis for facial expression recognition

    We present a baseline convolutional neural network (CNN) structure and an image preprocessing methodology to improve facial expression recognition with CNNs. To identify the most efficient network structure, we investigated four network structures known to perform well in facial expression recognition. We also investigated the effect of input image preprocessing: five types of data input (raw, histogram equalization, isotropic smoothing, diffusion-based normalization, difference of Gaussians) were tested and their accuracies compared. We trained 20 different CNN models (4 networks × 5 data input types) and verified the performance of each network with test images from five different databases. The experimental results showed that a three-layer structure consisting of simple convolutional and max pooling layers, with histogram-equalized image input, was the most efficient. We describe the detailed training procedure and analyze the test accuracy based on extensive observation. Comment: 6 pages, RO-MAN 2016 Conference
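    The abstract does not include code; the following is a minimal sketch of the configuration it identifies as most efficient, assuming OpenCV for histogram equalization and PyTorch for the network. The channel counts, kernel sizes, 48×48 input resolution, and 7-class output are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch: histogram-equalized input feeding a simple
# three-layer conv/max-pool CNN. Layer widths, kernel sizes, and the
# 48x48 grayscale input size are assumptions, not the paper's values.
import cv2
import torch
import torch.nn as nn

def preprocess(gray_image):
    """Histogram equalization (the best-performing input type).

    Expects an 8-bit single-channel HxW array; returns a 1xHxW tensor.
    """
    eq = cv2.equalizeHist(gray_image)
    return torch.from_numpy(eq).float().div(255.0).unsqueeze(0)

class BaselineCNN(nn.Module):
    """Three blocks of convolution + max pooling, then a linear classifier."""
    def __init__(self, num_classes=7):  # 7 basic expressions (assumed)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(128 * 6 * 6, num_classes)  # for 48x48 input

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```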

    On the Audio-Visual Emotion Recognition using Convolutional Neural Networks and Extreme Learning Machine

    Advances in artificial intelligence and machine learning for emotion recognition have been enormous, progressing in previously inconceivable ways. Inspired by this promising evolution in human-computer interaction, this paper develops a multimodal emotion recognition system. The research encompasses two input modalities, namely speech and video. In the proposed model, the input video samples undergo image pre-processing to obtain image frames. For the audio input, the signal is pre-processed and transformed into the frequency domain to obtain a Mel-spectrogram, which is processed further as an image. Convolutional neural networks with different configurations are used for training and feature extraction for both audio and video. The outputs of the two CNNs are fused using two extreme learning machines, and a support vector machine performs the final classification. The model is evaluated on three databases, namely eNTERFACE, RML, and SAVEE. For the eNTERFACE dataset, the accuracy obtained without and with augmentation was 87.2% and 94.91%, respectively. The RML dataset yielded an accuracy of 98.5%, and for the SAVEE dataset, the accuracy reached 97.77%. These results illustrate the effectiveness of the proposed system.
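    As a rough illustration of the audio path described above, the sketch below (assuming librosa; the sample rate, mel-band count, and FFT parameters are illustrative choices, not the paper's) converts a speech signal into a Mel-spectrogram array that can be treated as an image for CNN input.

```python
# Hypothetical sketch of the audio pre-processing step: convert a
# speech waveform to a normalized Mel-spectrogram "image".
# Sample rate, n_mels, n_fft, and hop_length are assumptions.
import numpy as np
import librosa

def mel_spectrogram_image(wav_path, sr=16000, n_mels=128):
    y, sr = librosa.load(wav_path, sr=sr)  # mono waveform
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         n_fft=1024, hop_length=512)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # log-amplitude scale
    # Normalize to [0, 1] so the 2D array can be handled like an image.
    return (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
```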

    Modeling Multimodal Cues in a Deep Learning-based Framework for Emotion Recognition in the Wild

    In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities: static facial features, motion patterns that capture the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNNs), while the third is a pretrained audio network used to extract useful deep acoustic signals from the video. In the audio branch, we also apply Long Short-Term Memory (LSTM) networks to capture the temporal evolution of the audio features. To identify and exploit possible relationships among the different modalities, we propose a fusion network that merges their cues into one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve accuracies of 50.39% and 49.92% on the validation and testing data, respectively.
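    The abstract does not specify the fusion network's internals; the following is a minimal sketch of the late-fusion idea, assuming PyTorch. The per-branch feature dimensions, hidden size, dropout rate, and 7-class output are illustrative assumptions only.

```python
# Hypothetical sketch of the fusion network: concatenate per-modality
# feature vectors (2D-CNN static, 3D-CNN motion, LSTM audio) into one
# representation and classify. All dimensions are assumptions.
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, dims=(512, 512, 256), num_classes=7):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(sum(dims), 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, static_feat, motion_feat, audio_feat):
        # Merge cues from the three modalities into one representation.
        joint = torch.cat([static_feat, motion_feat, audio_feat], dim=1)
        return self.fuse(joint)

# Usage with random stand-in features for a batch of 4 clips:
# logits = FusionNet()(torch.randn(4, 512), torch.randn(4, 512),
#                      torch.randn(4, 256))
```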