    Facial expression recognition in the wild : from individual to group

    The progress in computing technology has increased the demand for smart systems capable of understanding human affect and emotional manifestations. One of the crucial factors in designing systems equipped with such intelligence is accurate automatic Facial Expression Recognition (FER). In computer vision, automatic facial expression analysis has been an active field of research for over two decades. However, many questions remain unanswered. The research presented in this thesis attempts to address some of the key issues of FER in challenging conditions: 1) creating a facial expressions database representing real-world conditions; 2) devising Head Pose Normalisation (HPN) methods that are independent of the location of facial parts; 3) creating automatic methods for analysing the mood of a group of people. The central hypothesis of the thesis is that extracting close to real-world data from movies and performing facial expression analysis on it is a stepping stone towards the analysis of faces in real-world, unconstrained conditions. A temporal facial expressions database, Acted Facial Expressions in the Wild (AFEW), is proposed. The database is constructed and labelled using a semi-automatic process based on keyword search over closed caption subtitles. Currently, AFEW is the largest facial expressions database representing challenging conditions available to the research community. To give researchers a common platform for evaluating and extending their state-of-the-art FER methods, the first Emotion Recognition in the Wild (EmotiW) challenge, based on AFEW, is proposed. An image-only facial expressions database, Static Facial Expressions In The Wild (SFEW), extracted from AFEW, is also proposed. Furthermore, the thesis focuses on HPN for real-world images. Earlier methods were based on fiducial points. However, as fiducial point detection is an open problem for real-world images, such HPN can be error-prone. An HPN method based on response maps generated from part detectors is proposed. The proposed shape-constrained method does not require fiducial points or head pose information, which makes it suitable for real-world images. Data from movies and the internet, representing real-world conditions, poses another major challenge to the research community: the presence of multiple subjects. This defines another focus of this thesis, where a novel approach for modeling the perception of the mood of a group of people in an image is presented. A new database is constructed from Flickr based on keywords related to social events. Three models are proposed: the averaging-based Group Expression Model (GEM), the Weighted Group Expression Model (GEM_w) and the Augmented Group Expression Model (GEM_LDA). GEM_w is based on social contextual attributes, which are used as weights on each person's contribution towards the overall group's mood. Further, GEM_LDA is based on a topic model and feature augmentation. The proposed framework is applied to group candid shot selection and event summarisation. The application of the Structural SIMilarity (SSIM) index metric is explored for finding similar facial expressions. The proposed framework is applied to the problems of creating image albums based on facial expressions and finding corresponding expressions for training facial performance transfer algorithms.
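
    The difference between the averaging-based GEM and the weighted GEM_w can be illustrated with a minimal sketch. The abstract only states that GEM averages and that GEM_w uses social contextual attributes as per-person weights; the per-face intensity scores, the weight values, and the function names below are hypothetical and are not taken from the thesis.

```python
from typing import Sequence


def gem_average(intensities: Sequence[float]) -> float:
    """Averaging-based group estimate: every face contributes equally."""
    if not intensities:
        raise ValueError("need at least one face")
    return sum(intensities) / len(intensities)


def gem_weighted(intensities: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted group estimate: each face's contribution is scaled by a weight.

    The weights stand in for social contextual attributes; how such
    weights would be derived is an assumption left outside this sketch.
    """
    if len(intensities) != len(weights):
        raise ValueError("one weight per face is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(i * w for i, w in zip(intensities, weights)) / total


if __name__ == "__main__":
    # Hypothetical per-face expression intensities in [0, 1].
    faces = [0.9, 0.5, 0.1]
    # Hypothetical weights, e.g. the first face is assumed more prominent.
    weights = [2.0, 1.0, 1.0]
    print(round(gem_average(faces), 3))
    print(round(gem_weighted(faces, weights), 3))
```

    With equal weights the two functions agree, so the weighted form can be read as a generalisation of plain averaging.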

    Affective Computing

    This book provides an overview of state-of-the-art research in Affective Computing. It presents new ideas, original results and practical experiences in this increasingly important research field. The book consists of 23 chapters categorized into four sections. Since one of the most important means of human communication is facial expression, the first section of this book (Chapters 1 to 7) presents research on the synthesis and recognition of facial expressions. Given that we use not only the face but also body movements to express ourselves, the second section (Chapters 8 to 11) presents research on the perception and generation of emotional expressions using full-body motions. The third section of the book (Chapters 12 to 16) presents computational models of emotion, as well as findings from neuroscience research. The last section of the book (Chapters 17 to 22) presents applications related to affective computing.

    Intelligent Sensors for Human Motion Analysis

    The book, "Intelligent Sensors for Human Motion Analysis," contains 17 articles published in the Special Issue of the Sensors journal. These articles deal with many aspects related to the analysis of human movement. New techniques and methods for pose estimation, gait recognition, and fall detection have been proposed and verified. Some of them will trigger further research, and some may become the backbone of commercial systems

    Survey on Emotional Body Gesture Recognition

    Automatic emotion recognition has become a trending research topic in the past decade. While works based on facial expressions or speech abound, recognizing affect from body gestures remains a less explored topic. We present a new comprehensive survey hoping to boost research in the field. We first introduce emotional body gestures as a component of what is commonly known as "body language" and comment on general aspects such as gender differences and culture dependence. We then define a complete framework for automatic emotional body gesture recognition. We introduce person detection and comment on static and dynamic body pose estimation methods in both RGB and 3D. We then review the recent literature on representation learning and emotion recognition from images of emotionally expressive gestures. We also discuss multi-modal approaches that combine speech or face with body gestures for improved emotion recognition. While pre-processing methodologies (e.g., human detection and pose estimation) are nowadays mature technologies fully developed for robust large-scale analysis, we show that for emotion recognition labelled data is scarce. There is no agreement on clearly defined output spaces, and the representations are shallow and largely based on naive geometrical representations.

    Animation and Interaction of Responsive, Expressive, and Tangible 3D Virtual Characters

    This thesis is framed within the field of 3D Character Animation. Virtual characters are used in many Human Computer Interaction applications such as video games and serious games. Within these virtual worlds they move and act in ways similar to humans, controlled by users through some form of interface or by artificial intelligence. This work addresses the challenges of developing smoother movements and more natural behaviors, driving motions in real time, intuitively, and accurately. The interaction between virtual characters and intelligent objects is also explored. By researching these subjects, the work contributes to creating more responsive, expressive, and tangible virtual characters. Navigation within virtual worlds uses locomotion such as walking and running. To achieve maximum realism, actors' movements are captured and used to animate virtual characters. This is the philosophy of motion graphs: a structure that embeds movements, where a continuous motion stream is generated by concatenating motion pieces. However, locomotion synthesis using motion graphs involves a tradeoff between the number of possible transitions between different kinds of locomotion and the quality of these transitions, that is, how smoothly one pose leads into the next. To overcome this drawback, we propose the method of progressive transitions using Body Part Motion Graphs (BPMGs). This method deals with partial movements and generates specific, synchronized transitions for each body part (group of joints) within a window of time. Therefore, the connectivity within the system is not linked to the similarity between global poses, which allows us to find more and better-quality transition points while increasing the speed of response and execution of these transitions compared with the standard motion graph method. Secondly, beyond getting faster transitions and smoother movements, virtual characters also interact with each other and with users by speaking. This interaction requires the creation of gestures appropriate to the speech they reproduce. Gestures are the nonverbal language that accompanies spoken language. The credibility of virtual characters when speaking is linked to the naturalness of their movements in sync with the speech and its intonation. Consequently, we analyzed the relationship between speech and the gestures performed along with it. We defined intensity indicators for both gestures (GSI, Gesture Strength Indicator) and speech (PSI, Pitch Strength Indicator). We studied the relationship in time and intensity of these cues in order to establish synchronicity and intensity rules. We then adapted these rules to select the appropriate gestures for the speech input (tagged text from the speech signal) in the Gesture Motion Graph (GMG). The evaluation of the resulting animations shows that, beyond time synchronization, relating the intensity of speech and gestures is important for generating believable animations. Subsequently, we present BodySpeech, a system that automatically generates gestures and facial animation from a speech signal. This system also includes animation improvements such as increased use of input data, more flexible time synchronization, and new features like editing the style of output animations. In addition, facial animation also takes speech intonation into account. Finally, we have moved virtual characters from virtual environments to the physical world in order to explore their interaction possibilities with real objects. To this end, we present AvatARs, virtual characters that have a tangible representation and are integrated into reality through augmented reality apps on mobile devices. Users choose a physical object to manipulate in order to control the animation. They can select and configure the animation, which serves as a support for the virtual character represented. Then, we explored the interaction of AvatARs with intelligent physical objects like the Pleo social robot. Pleo is used to assist hospitalized children in therapy or simply for playing. Despite its benefits, there is a lack of emotional relationship and interaction between the children and Pleo, which makes children lose interest eventually. This is why we have created a mixed reality scenario where Vleo (AvatAR as Pleo, virtual element) and Pleo (real element) interact naturally.

    Computer Game Innovation

    Faculty of Technical Physics, Information Technology and Applied Mathematics. Institute of Information Technology. The "Computer Game Innovations" series is an international forum designed to enable the exchange of knowledge and expertise in the field of video game development. Addressing both academic research and industrial needs, the series aims at advancing innovative industry-academia collaboration. The monograph provides a unique set of articles presenting original research conducted in the leading academic centres which specialise in video games education. The goal of the publication is, among others, to enhance networking opportunities for industry and university representatives seeking to form R&D partnerships. This publication covers the key focus areas specified in the GAMEINN sectoral programme supported by the National Centre for Research and Development.

    Similarity, Retrieval, and Classification of Motion Capture Data

    Three-dimensional motion capture data is a digital representation of the complex spatio-temporal structure of human motion. Mocap data is widely used for the synthesis of realistic computer-generated characters in data-driven computer animation and also plays an important role in motion analysis tasks such as activity recognition. For reasons of both efficiency and cost, methods for the reuse of large collections of motion clips are gaining in importance in the field of computer animation. Here, an active field of research is the application of morphing and blending techniques for the creation of new, realistic motions from prerecorded motion clips. This requires the identification and extraction of logically related motions scattered within some data set. Such content-based retrieval of motion capture data, which is a central topic of this thesis, constitutes a difficult problem due to possible spatio-temporal deformations between logically related motions. Recent approaches to motion retrieval apply techniques such as dynamic time warping, which, however, are not applicable to large data sets due to their quadratic space and time complexity. In our approach, we introduce various kinds of relational features describing boolean geometric relations between specified body points and show how these features induce a temporal segmentation of motion capture data streams. By incorporating spatio-temporal invariance into the relational features and induced segments, we are able to adopt indexing methods allowing for flexible and efficient content-based retrieval in large motion capture databases. As a further application of relational motion features, a new method for fully automatic motion classification and retrieval is presented. We introduce the concept of motion templates (MTs), by which the spatio-temporal characteristics of an entire motion class can be learned from training data, yielding an explicit, compact matrix representation. The resulting class MT has a direct, semantic interpretation, and it can be manually edited, mixed, combined with other MTs, extended, and restricted. Furthermore, a class MT exhibits the characteristic as well as the variational aspects of the underlying motion class at a semantically high level. Classification is then performed by comparing a set of precomputed class MTs with unknown motion data and labeling matching portions with the respective motion class label. Here, the crucial point is that the variational (hence uncharacteristic) motion aspects encoded in the class MT are automatically masked out in the comparison, which can be thought of as locally adaptive feature selection.
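
    The idea that boolean relational features induce a temporal segmentation can be sketched as follows. This is only an illustration under stated assumptions: the two features, the joint names, the axis convention (y up, z forward) and all function names are hypothetical, since the abstract does not specify them. Consecutive frames with the same boolean feature vector are merged into one segment, so the same movement performed at different speeds yields the same sequence of segments.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]   # (x, y, z); y up and z forward are assumed
Frame = Dict[str, Point]             # hypothetical joint name -> 3D position
FeatureVector = Tuple[bool, ...]


def right_hand_above_head(frame: Frame) -> bool:
    """Hypothetical relational feature: is the right hand higher than the head?"""
    return frame["right_hand"][1] > frame["head"][1]


def right_foot_ahead_of_left(frame: Frame) -> bool:
    """Hypothetical relational feature: is the right foot in front of the left?"""
    return frame["right_foot"][2] > frame["left_foot"][2]


FEATURES = (right_hand_above_head, right_foot_ahead_of_left)


def feature_vector(frame: Frame) -> FeatureVector:
    """Evaluate every boolean feature on a single frame."""
    return tuple(feature(frame) for feature in FEATURES)


def segment(frames: Sequence[Frame]) -> List[Tuple[FeatureVector, int, int]]:
    """Merge runs of frames with identical feature vectors.

    Returns (feature_vector, first_frame_index, last_frame_index) triples.
    """
    segments: List[Tuple[FeatureVector, int, int]] = []
    for idx, frame in enumerate(frames):
        vec = feature_vector(frame)
        if segments and segments[-1][0] == vec:
            prev_vec, start, _ = segments[-1]
            segments[-1] = (prev_vec, start, idx)   # extend the current run
        else:
            segments.append((vec, idx, idx))        # start a new segment
    return segments


def signature(frames: Sequence[Frame]) -> List[FeatureVector]:
    """Segment durations dropped: only the order of feature vectors remains."""
    return [vec for vec, _, _ in segment(frames)]


def make_frame(hand_y: float, right_foot_z: float) -> Frame:
    """Build a toy frame for the demonstration (all values are made up)."""
    return {
        "head": (0.0, 1.7, 0.0),
        "right_hand": (0.3, hand_y, 0.0),
        "right_foot": (0.1, 0.0, right_foot_z),
        "left_foot": (-0.1, 0.0, 0.0),
    }


if __name__ == "__main__":
    fast = [make_frame(1.0, -0.2), make_frame(1.9, 0.2)]
    slow = [make_frame(1.0, -0.2)] * 3 + [make_frame(1.9, 0.2)] * 2
    print(segment(slow))
    print(signature(fast) == signature(slow))
```

    Because the signature discards how long each segment lasts, it can serve as a key in an ordinary index, which is one way to avoid a quadratic frame-by-frame alignment between two clips.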