236 research outputs found

    Personalized face and gesture analysis using hierarchical neural networks

    Full text link
    The video-based computational analyses of human face and gesture signals encompass a myriad of challenging research problems involving computer vision, machine learning and human computer interaction. In this thesis, we focus on the following challenges: a) the classification of hand and body gestures along with the temporal localization of their occurrence in a continuous stream, b) the recognition of facial expressivity levels in people with Parkinson's Disease using multimodal feature representations, c) the prediction of student learning outcomes in intelligent tutoring systems using affect signals, and d) the personalization of machine learning models, which can adapt to subject and group-specific nuances in facial and gestural behavior. Specifically, we first conduct a quantitative comparison of two approaches to the problem of segmenting and classifying gestures on two benchmark gesture datasets: a method that simultaneously segments and classifies gestures versus a cascaded method that performs the tasks sequentially. Second, we introduce a framework that computationally predicts an accurate score for facial expressivity and validate it on a dataset of interview videos of people with Parkinson's disease. Third, based on a unique dataset of videos of students interacting with MathSpring, an intelligent tutoring system, collected by our collaborative research team, we build models to predict learning outcomes from their facial affect signals. Finally, we propose a novel solution to a relatively unexplored area in automatic face and gesture analysis research: personalization of models to individuals and groups. We develop hierarchical Bayesian neural networks to overcome the challenges posed by group or subject-specific variations in face and gesture signals. We successfully validate our formulation on the problems of personalized subject-specific gesture classification, context-specific facial expressivity recognition and student-specific learning outcome prediction. We demonstrate the flexibility of our hierarchical framework by validating the utility of both fully connected and recurrent neural architectures

    The Improved Artificial Neural Network Based on Cosine Similarity in Facial Emotion Recognition

    Get PDF
    In this study, we present the improved artificial neural network based on cosine similarity in facial emotion recognition. We apply a shifting window that employs neural network for two concurrent processes consisting of face detection and emotional recognition. In order to prevent the slow and futile computations, non-face areas need to be filtered from neurons on each network layer, thus we propose the improved artificial neural network based on cosine similarity. Cosine similarity is employed to bypass the process of non-face areas in neural network. The accuracy of the proposed method reaches 0.84, while the accuracy of the original neural network method reaches 0.74. It can be concluded that our methods work accurately.proposed method is superior to the state-of-the-art algorithms

    Learning deep physiological models of affect

    Get PDF
    Feature extraction and feature selection are crucial phases in the process of affective modeling. Both, however, incorporate substantial limitations that hinder the development of reliable and accurate models of affect. For the purpose of modeling affect manifested through physiology, this paper builds on recent advances in machine learning with deep learning (DL) approaches. The efficiency of DL algorithms that train artificial neural network models is tested and compared against standard feature extraction and selection approaches followed in the literature. Results on a game data corpus — containing players’ physiological signals (i.e. skin conductance and blood volume pulse) and subjective self-reports of affect — reveal that DL outperforms manual ad-hoc feature extraction as it yields significantly more accurate affective models. Moreover, it appears that DL meets and even outperforms affective models that are boosted by automatic feature selection, for several of the scenarios examined. As the DL method is generic and applicable to any affective modeling task, the key findings of the paper suggest that ad-hoc feature extraction and selection — to a lesser degree — could be bypassed.The authors would like to thank Tobias Mahlmann for his work on the development and administration of the cluster used to run the experiments. Special thanks for proofreading goes to Yana Knight. Thanks also go to the Theano development team, to all participants in our experiments, and to Ubisoft, NSERC and Canada Research Chairs for funding. This work is funded, in part, by the ILearnRW (project no: 318803) and the C2Learn (project no. 318480) FP7 ICT EU projects.peer-reviewe

    Computerized analysis of hypomimia and hypokinetic dysarthria for improved diagnosis of Parkinson's disease

    Get PDF
    Background and Objective: An aging society requires easy-to-use approaches for diagnosis and monitoring of neurodegenerative disorders, such as Parkinson's disease (PD), so that clinicians can effectively adjust a treatment policy and improve patients' quality of life. Current methods of PD diagnosis and monitoring usually require the patients to come to a hospital, where they undergo several neurological and neuropsychological examinations. These examinations are usually time-consuming, expensive, and performed just a few times per year. Hence, this study explores the possibility of fusing computerized analysis of hypomimia and hypokinetic dysarthria (two motor symptoms manifested in the majority of PD patients) with the goal of proposing a new methodology of PD diagnosis that could be easily integrated into mHealth systems. Methods: We enrolled 73 PD patients and 46 age- and gender-matched healthy controls, who performed several speech/voice tasks while recorded by a microphone and a camera. Acoustic signals were parametrized in the fields of phonation, articulation and prosody. Video recordings of a face were analyzed in terms of facial landmarks movement. Both modalities were consequently modeled by the XGBoost algorithm. Results: The acoustic analysis enabled diagnosis of PD with 77% balanced accuracy, while in the case of the facial analysis, we observed 81% balanced accuracy. The fusion of both modalities increased the balanced accuracy to 83% (88% sensitivity and 78% specificity). The most informative speech exercise in the multimodality system turned out to be a tongue twister. Additionally, we identified muscle movements that are characteristic of hypomimia. Conclusions: The introduced methodology, which is based on the myriad of speech exercises likewise audio and video modality, allows for the detection of PD with an accuracy of up to 83%. The speech exercise - tongue twisters occurred to be the most valuable from the clinical point of view. Additionally, the clinical interpretation of the created models is illustrated. The presented computer-supported methodology could serve as an extra tool for neurologists in PD detection and the proposed potential solution of mHealth will facilitate the patient's and doctor's life.Peer reviewe

    Automatic Context-Driven Inference of Engagement in HMI: A Survey

    Full text link
    An integral part of seamless human-human communication is engagement, the process by which two or more participants establish, maintain, and end their perceived connection. Therefore, to develop successful human-centered human-machine interaction applications, automatic engagement inference is one of the tasks required to achieve engaging interactions between humans and machines, and to make machines attuned to their users, hence enhancing user satisfaction and technology acceptance. Several factors contribute to engagement state inference, which include the interaction context and interactants' behaviours and identity. Indeed, engagement is a multi-faceted and multi-modal construct that requires high accuracy in the analysis and interpretation of contextual, verbal and non-verbal cues. Thus, the development of an automated and intelligent system that accomplishes this task has been proven to be challenging so far. This paper presents a comprehensive survey on previous work in engagement inference for human-machine interaction, entailing interdisciplinary definition, engagement components and factors, publicly available datasets, ground truth assessment, and most commonly used features and methods, serving as a guide for the development of future human-machine interaction interfaces with reliable context-aware engagement inference capability. An in-depth review across embodied and disembodied interaction modes, and an emphasis on the interaction context of which engagement perception modules are integrated sets apart the presented survey from existing surveys

    Multimodaalsel emotsioonide tuvastamisel pÔhineva inimese-roboti suhtluse arendamine

    Get PDF
    VĂ€itekirja elektrooniline versioon ei sisalda publikatsiooneÜks afektiivse arvutiteaduse peamistest huviobjektidest on mitmemodaalne emotsioonituvastus, mis leiab rakendust peamiselt inimese-arvuti interaktsioonis. Emotsiooni Ă€ratundmiseks uuritakse nendes sĂŒsteemides nii inimese nĂ€oilmeid kui kakĂ”net. KĂ€esolevas töös uuritakse inimese emotsioonide ja nende avaldumise visuaalseid ja akustilisi tunnuseid, et töötada vĂ€lja automaatne multimodaalne emotsioonituvastussĂŒsteem. KĂ”nest arvutatakse mel-sageduse kepstri kordajad, helisignaali erinevate komponentide energiad ja prosoodilised nĂ€itajad. NĂ€oilmeteanalĂŒĂŒsimiseks kasutatakse kahte erinevat strateegiat. Esiteks arvutatakse inimesenĂ€o tĂ€htsamate punktide vahelised erinevad geomeetrilised suhted. Teiseks vĂ”etakse emotsionaalse sisuga video kokku vĂ€hendatud hulgaks pĂ”hikaadriteks, misantakse sisendiks konvolutsioonilisele tehisnĂ€rvivĂ”rgule emotsioonide visuaalsekseristamiseks. Kolme klassifitseerija vĂ€ljunditest (1 akustiline, 2 visuaalset) koostatakse uus kogum tunnuseid, mida kasutatakse Ă”ppimiseks sĂŒsteemi viimasesetapis. Loodud sĂŒsteemi katsetati SAVEE, Poola ja Serbia emotsionaalse kĂ”neandmebaaside, eNTERFACE’05 ja RML andmebaaside peal. Saadud tulemusednĂ€itavad, et vĂ”rreldes olemasolevatega vĂ”imaldab kĂ€esoleva töö raames loodudsĂŒsteem suuremat tĂ€psust emotsioonide Ă€ratundmisel. Lisaks anname kĂ€esolevastöös ĂŒlevaate kirjanduses vĂ€ljapakutud sĂŒsteemidest, millel on vĂ”imekus tunda Ă€raemotsiooniga seotud ̆zeste. Selle ĂŒlevaate eesmĂ€rgiks on hĂ”lbustada uute uurimissuundade leidmist, mis aitaksid lisada töö raames loodud sĂŒsteemile ̆zestipĂ”hiseemotsioonituvastuse vĂ”imekuse, et veelgi enam tĂ”sta sĂŒsteemi emotsioonide Ă€ratundmise tĂ€psust.Automatic multimodal emotion recognition is a fundamental subject of interest in affective computing. Its main applications are in human-computer interaction. The systems developed for the foregoing purpose consider combinations of different modalities, based on vocal and visual cues. This thesis takes the foregoing modalities into account, in order to develop an automatic multimodal emotion recognition system. More specifically, it takes advantage of the information extracted from speech and face signals. From speech signals, Mel-frequency cepstral coefficients, filter-bank energies and prosodic features are extracted. Moreover, two different strategies are considered for analyzing the facial data. First, facial landmarks' geometric relations, i.e. distances and angles, are computed. Second, we summarize each emotional video into a reduced set of key-frames. Then they are taught to visually discriminate between the emotions. In order to do so, a convolutional neural network is applied to the key-frames summarizing the videos. Afterward, the output confidence values of all the classifiers from both of the modalities are used to define a new feature space. Lastly, the latter values are learned for the final emotion label prediction, in a late fusion. The experiments are conducted on the SAVEE, Polish, Serbian, eNTERFACE'05 and RML datasets. The results show significant performance improvements by the proposed system in comparison to the existing alternatives, defining the current state-of-the-art on all the datasets. Additionally, we provide a review of emotional body gesture recognition systems proposed in the literature. The aim of the foregoing part is to help figure out possible future research directions for enhancing the performance of the proposed system. More clearly, we imply that incorporating data representing gestures, which constitute another major component of the visual modality, can result in a more efficient framework

    Data-driven Communicative Behaviour Generation: A Survey

    Get PDF
    The development of data-driven behaviour generating systems has recently become the focus of considerable attention in the fields of human–agent interaction and human–robot interaction. Although rule-based approaches were dominant for years, these proved inflexible and expensive to develop. The difficulty of developing production rules, as well as the need for manual configuration to generate artificial behaviours, places a limit on how complex and diverse rule-based behaviours can be. In contrast, actual human–human interaction data collected using tracking and recording devices makes humanlike multimodal co-speech behaviour generation possible using machine learning and specifically, in recent years, deep learning. This survey provides an overview of the state of the art of deep learning-based co-speech behaviour generation models and offers an outlook for future research in this area.</jats:p

    Towards Player-Driven Procedural Content Generation

    Get PDF
    • 

    corecore