Personalized face and gesture analysis using hierarchical neural networks
The video-based computational analyses of human face and gesture signals encompass a myriad of challenging research problems involving computer vision, machine learning and human-computer interaction. In this thesis, we focus on the following challenges: a) the classification of hand and body gestures along with the temporal localization of their occurrence in a continuous stream, b) the recognition of facial expressivity levels in people with Parkinson's disease using multimodal feature representations, c) the prediction of student learning outcomes in intelligent tutoring systems using affect signals, and d) the personalization of machine learning models, which can adapt to subject and group-specific nuances in facial and gestural behavior. Specifically, we first conduct a quantitative comparison of two approaches to the problem of segmenting and classifying gestures on two benchmark gesture datasets: a method that simultaneously segments and classifies gestures versus a cascaded method that performs the tasks sequentially. Second, we introduce a framework that computationally predicts an accurate score for facial expressivity and validate it on a dataset of interview videos of people with Parkinson's disease. Third, based on a unique dataset of videos of students interacting with MathSpring, an intelligent tutoring system, collected by our collaborative research team, we build models to predict learning outcomes from their facial affect signals. Finally, we propose a novel solution to a relatively unexplored area in automatic face and gesture analysis research: personalization of models to individuals and groups. We develop hierarchical Bayesian neural networks to overcome the challenges posed by group or subject-specific variations in face and gesture signals. We successfully validate our formulation on the problems of personalized subject-specific gesture classification, context-specific facial expressivity recognition and student-specific learning outcome prediction. We demonstrate the flexibility of our hierarchical framework by validating the utility of both fully connected and recurrent neural architectures.
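As a rough illustration of the personalization idea described in this abstract (not the thesis implementation), the following Python sketch treats subject-specific weights as Gaussian perturbations around shared group-level weights, which is one common way to let a model adapt to per-subject nuances; the layer size, class count, and prior scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared (group-level) weights, assumed to be learned from all subjects.
W_group = rng.normal(0.0, 0.1, size=(64, 10))

def subject_weights(W_group, sigma=0.05, rng=rng):
    # Subject-specific weights drawn around the group weights,
    # i.e. the hierarchical prior W_s ~ N(W_group, sigma^2 I).
    return W_group + rng.normal(0.0, sigma, size=W_group.shape)

def predict(features, W):
    # One linear layer followed by a softmax over gesture classes.
    logits = features @ W
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Personalization: each subject gets its own perturbation of W_group,
# which could then be refined further on that subject's data.
W_subject = subject_weights(W_group)
probs = predict(rng.normal(size=(5, 64)), W_subject)  # 5 dummy samples
```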
The Improved Artificial Neural Network Based on Cosine Similarity in Facial Emotion Recognition
In this study, we present an improved artificial neural network based on cosine similarity for facial emotion recognition. We apply a shifting window that employs a neural network for two concurrent processes: face detection and emotion recognition. To prevent slow and wasteful computation, non-face areas need to be filtered out of the neurons on each network layer, so we propose an improved artificial neural network based on cosine similarity. Cosine similarity is employed to bypass the processing of non-face areas in the neural network. The accuracy of the proposed method reaches 0.84, while the accuracy of the original neural network method reaches 0.74. It can be concluded that our method works accurately and that the proposed method is superior to the state-of-the-art algorithms.
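A minimal sketch, assuming a grayscale image, a mean-face template, and a hypothetical emotion_net callable, of how cosine similarity can be used to skip non-face windows before running the emotion network; this is not the authors' code, and the window size and threshold are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two flattened patch vectors.
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def sliding_windows(image, size=48, stride=16):
    # Yield square patches and their top-left coordinates.
    h, w = image.shape
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            yield (y, x), image[y:y + size, x:x + size]

def classify_faces(image, face_template, emotion_net, threshold=0.7):
    # Run the (hypothetical) emotion network only on windows that
    # resemble the mean-face template; other windows are bypassed.
    results = []
    for (y, x), patch in sliding_windows(image):
        if cosine_similarity(patch, face_template) < threshold:
            continue  # non-face area: skip the expensive network pass
        results.append(((y, x), emotion_net(patch)))
    return results
```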
Learning deep physiological models of affect
Feature extraction and feature selection are crucial phases in the process of affective modeling. Both, however, incorporate substantial limitations that hinder the development of reliable and accurate models of affect. For the purpose of modeling affect manifested through physiology, this paper builds on recent advances in machine learning with deep learning (DL) approaches. The efficiency of DL algorithms that train artificial neural network models is tested and compared against standard feature extraction and selection approaches followed in the literature. Results on a game data corpus, containing players' physiological signals (i.e. skin conductance and blood volume pulse) and subjective self-reports of affect, reveal that DL outperforms manual ad-hoc feature extraction as it yields significantly more accurate affective models. Moreover, it appears that DL meets and even outperforms affective models that are boosted by automatic feature selection, for several of the scenarios examined. As the DL method is generic and applicable to any affective modeling task, the key findings of the paper suggest that ad-hoc feature extraction and, to a lesser degree, feature selection could be bypassed.

The authors would like to thank Tobias Mahlmann for his work on the development and administration of the cluster used to run the experiments. Special thanks for proofreading goes to Yana Knight. Thanks also go to the Theano development team, to all participants in our experiments, and to Ubisoft, NSERC and Canada Research Chairs for funding. This work is funded, in part, by the ILearnRW (project no: 318803) and the C2Learn (project no. 318480) FP7 ICT EU projects.
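As a hedged illustration of the kind of deep model this paper compares against manual feature extraction (not the authors' exact architecture), a small 1-D convolutional network over the two physiological channels (skin conductance and blood volume pulse) might look as follows; the channel count, kernel sizes, window length and number of affect classes are assumptions.

```python
import torch
import torch.nn as nn

class AffectNet(nn.Module):
    # Small 1-D CNN over raw physiological signals (2 channels:
    # skin conductance, blood volume pulse) -> affect class logits.
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(2, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):            # x: (batch, 2, time)
        h = self.features(x).squeeze(-1)
        return self.classifier(h)

model = AffectNet()
logits = model(torch.randn(8, 2, 1024))  # 8 signal windows, 1024 samples each
```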
Computerized analysis of hypomimia and hypokinetic dysarthria for improved diagnosis of Parkinson's disease
Background and Objective: An aging society requires easy-to-use approaches for diagnosis and monitoring of neurodegenerative disorders, such as Parkinson's disease (PD), so that clinicians can effectively adjust a treatment policy and improve patients' quality of life. Current methods of PD diagnosis and monitoring usually require the patients to come to a hospital, where they undergo several neurological and neuropsychological examinations. These examinations are usually time-consuming, expensive, and performed just a few times per year. Hence, this study explores the possibility of fusing computerized analysis of hypomimia and hypokinetic dysarthria (two motor symptoms manifested in the majority of PD patients) with the goal of proposing a new methodology of PD diagnosis that could be easily integrated into mHealth systems. Methods: We enrolled 73 PD patients and 46 age- and gender-matched healthy controls, who performed several speech/voice tasks while recorded by a microphone and a camera. Acoustic signals were parametrized in the fields of phonation, articulation and prosody. Video recordings of the face were analyzed in terms of facial landmark movement. Both modalities were subsequently modeled by the XGBoost algorithm. Results: The acoustic analysis enabled diagnosis of PD with 77% balanced accuracy, while in the case of the facial analysis, we observed 81% balanced accuracy. The fusion of both modalities increased the balanced accuracy to 83% (88% sensitivity and 78% specificity). The most informative speech exercise in the multimodal system turned out to be a tongue twister. Additionally, we identified muscle movements that are characteristic of hypomimia. Conclusions: The introduced methodology, which is based on a battery of speech exercises captured in both audio and video modalities, allows for the detection of PD with an accuracy of up to 83%. The tongue-twister exercise proved to be the most valuable from the clinical point of view. Additionally, the clinical interpretation of the created models is illustrated. The presented computer-supported methodology could serve as an extra tool for neurologists in PD detection, and the proposed mHealth solution would make life easier for both patients and doctors.
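A minimal sketch of the two-modality, decision-level fusion described above, using XGBoost as the abstract states; the feature matrices here are random placeholders so the sketch runs, and the probability averaging, split and hyperparameters are illustrative assumptions rather than the authors' pipeline.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

def fit_modality(X, y):
    # One gradient-boosted model per modality (acoustic or facial).
    clf = XGBClassifier(n_estimators=200, max_depth=3)
    clf.fit(X, y)
    return clf

# Placeholder data: 73 PD patients + 46 controls, random feature vectors.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=119)
X_acoustic = rng.normal(size=(119, 40))   # phonation/articulation/prosody features
X_facial = rng.normal(size=(119, 30))     # facial landmark movement features

idx_train, idx_test = train_test_split(np.arange(119), test_size=0.3,
                                       stratify=y, random_state=0)
clf_a = fit_modality(X_acoustic[idx_train], y[idx_train])
clf_f = fit_modality(X_facial[idx_train], y[idx_train])

# Decision-level fusion: average the two class-probability estimates.
p = (clf_a.predict_proba(X_acoustic[idx_test]) +
     clf_f.predict_proba(X_facial[idx_test])) / 2
print(balanced_accuracy_score(y[idx_test], p.argmax(axis=1)))
```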
Automatic Context-Driven Inference of Engagement in HMI: A Survey
An integral part of seamless human-human communication is engagement, the process by which two or more participants establish, maintain, and end their perceived connection. Therefore, to develop successful human-centered human-machine interaction applications, automatic engagement inference is one of the tasks required to achieve engaging interactions between humans and machines, and to make machines attuned to their users, hence enhancing user satisfaction and technology acceptance. Several factors contribute to engagement state inference, including the interaction context and the interactants' behaviours and identity. Indeed, engagement is a multi-faceted and multi-modal construct that requires high accuracy in the analysis and interpretation of contextual, verbal and non-verbal cues. Thus, the development of an automated and intelligent system that accomplishes this task has proven challenging so far. This paper presents a comprehensive survey of previous work on engagement inference for human-machine interaction, covering interdisciplinary definitions, engagement components and factors, publicly available datasets, ground truth assessment, and the most commonly used features and methods, serving as a guide for the development of future human-machine interaction interfaces with reliable context-aware engagement inference capability. An in-depth review across embodied and disembodied interaction modes, and an emphasis on the interaction context in which engagement perception modules are integrated, set the presented survey apart from existing surveys.
Development of human-robot interaction based on multimodal emotion recognition
The electronic version of the thesis does not include the publications. Automatic multimodal emotion recognition is a fundamental subject of interest in affective computing. Its main applications are in human-computer interaction. The systems developed for this purpose consider combinations of different modalities, based on vocal and visual cues. This thesis takes these modalities into account in order to develop an automatic multimodal emotion recognition system. More specifically, it takes advantage of the information extracted from speech and face signals. From speech signals, Mel-frequency cepstral coefficients, filter-bank energies and prosodic features are extracted. Moreover, two different strategies are considered for analyzing the facial data. First, geometric relations between facial landmarks, i.e. distances and angles, are computed. Second, each emotional video is summarized into a reduced set of key-frames, and a convolutional neural network is applied to these key-frames to visually discriminate between the emotions. Afterward, the output confidence values of all the classifiers from both modalities are used to define a new feature space, and a final classifier is trained on these values to predict the emotion label in a late fusion. The experiments are conducted on the SAVEE, Polish, Serbian, eNTERFACE'05 and RML datasets. The results show significant performance improvements by the proposed system in comparison to the existing alternatives, defining the current state of the art on all the datasets. Additionally, we provide a review of emotional body gesture recognition systems proposed in the literature, with the aim of identifying possible future research directions for enhancing the performance of the proposed system.
Specifically, we suggest that incorporating gesture data, which constitutes another major component of the visual modality, could result in a more effective framework.
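A minimal sketch of the feature extraction and late-fusion steps described in this abstract; librosa is used here as a reasonable stand-in for MFCC extraction, and the specific landmark distance features and fusion vector are illustrative, not the thesis code.

```python
import numpy as np
import librosa

def acoustic_features(wav_path, n_mfcc=13):
    # MFCCs summarised over time (mean/std), one vector per utterance.
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def geometric_features(landmarks):
    # Pairwise distances between facial landmarks (N, 2) -> flat vector;
    # angles could be added in the same way.
    diffs = landmarks[:, None, :] - landmarks[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(landmarks), k=1)
    return dists[iu]

def late_fusion(p_audio, p_face_geom, p_face_cnn):
    # Stack per-modality class confidences into a new feature vector,
    # on which a final classifier would then be trained.
    return np.concatenate([p_audio, p_face_geom, p_face_cnn])
```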
Data-driven Communicative Behaviour Generation: A Survey
The development of data-driven behaviour generating systems has recently become the focus of considerable attention in the fields of human–agent interaction and human–robot interaction. Although rule-based approaches were dominant for years, these proved inflexible and expensive to develop. The difficulty of developing production rules, as well as the need for manual configuration to generate artificial behaviours, places a limit on how complex and diverse rule-based behaviours can be. In contrast, actual human–human interaction data collected using tracking and recording devices makes humanlike multimodal co-speech behaviour generation possible using machine learning and specifically, in recent years, deep learning. This survey provides an overview of the state of the art of deep learning-based co-speech behaviour generation models and offers an outlook for future research in this area.
- …