Synthesis and Control of High Resolution Facial Expressions for Visual Interactions
The synthesis of facial expression with control of intensity and personal styles is important in intelligent and affective human-computer interaction, especially in face-to-face interaction between human and intelligent agent. We present a facial expression animation system that facilitates control of expressiveness and style. We learn a decomposable generative model for the nonlinear deformation of facial expressions by analyzing the mapping space between low dimensional embedded representation and high resolution tracking data. Bilinear analysis of the mapping space provides a compact representation of the nonlinear generative model for facial expressions. The decomposition allows synthesis of new facial expressions by control of geometry and expression style. The generative model provides control of expressiveness preserving nonlinear deformation in the expressions with simple parameters and allows synthesis of stylized facial geometry. In addition, we can directly extract the MPEG-4 Facial Animation Parameters (FAPs) from the synthesized data, which allows using any animation engine that supports FAPs to animate new synthesized expressions.
06241 Abstracts Collection -- Human Motion - Understanding, Modeling, Capture and Animation. 13th Workshop
From 11.06.06 to 16.06.06, the Dagstuhl Seminar 06241 "Human Motion - Understanding, Modeling, Capture and Animation. 13th Workshop 'Theoretical Foundations of Computer Vision'" was held
in the International Conference and Research Center (IBFI),
Schloss Dagstuhl.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general
Going Deeper into Action Recognition: A Survey
Understanding human actions in visual data is tied to advances in
complementary research areas including object recognition, human dynamics,
domain adaptation and semantic segmentation. Over the last decade, human action
analysis evolved from earlier schemes that are often limited to controlled
environments to nowadays advanced solutions that can learn from millions of
videos and apply to almost all daily activities. Given the broad range of
applications from video surveillance to human-computer interaction, scientific
milestones in action recognition are achieved more rapidly, eventually leading
to the demise of what used to be good in a short time. This motivated us to
provide a comprehensive review of the notable steps taken towards recognizing
human actions. To this end, we start our discussion with the pioneering methods
that use handcrafted representations, and then, navigate into the realm of deep
learning based approaches. We aim to remain objective throughout this survey,
touching upon encouraging improvements as well as inevitable fallbacks, in the
hope of raising fresh questions and motivating new research directions for the
reader
An Efficient Boosted Classifier Tree-Based Feature Point Tracking System for Facial Expression Analysis
The study of facial movement and expression has been a prominent area of research since the early work of Charles Darwin. The Facial Action Coding System (FACS), developed by Paul Ekman, introduced the first universal method of coding and measuring facial movement. Human-Computer Interaction seeks to make human interaction with computer systems more effective, easier, safer, and more seamless. Facial expression recognition can be broken down into three distinctive subsections: Facial Feature Localization, Facial Action Recognition, and Facial Expression Classification. The first and most important stage in any facial expression analysis system is the localization of key facial features. Localization must be accurate and efficient to ensure reliable tracking and leave time for computation and comparisons to learned facial models while maintaining real-time performance. Two possible methods for localizing facial features are discussed in this dissertation.
The Active Appearance Model is a statistical model describing an object's parameters through the use of both shape and texture models, resulting in appearance. Statistical model-based training for object recognition takes multiple instances of the object class of interest, or positive samples, and multiple negative samples, i.e., images that do not contain objects of interest. Viola and Jones present a highly robust real-time face detection system, and a statistically boosted attentional detection cascade composed of many weak feature detectors. A basic algorithm for the elimination of unnecessary sub-frames while using Viola-Jones face detection is presented to further reduce image search time.
A real-time emotion detection system is presented which is capable of identifying seven affective states (agreeing, concentrating, disagreeing, interested, thinking, unsure, and angry) from a near-infrared video stream. The Active Appearance Model is used to place 23 landmark points around key areas of the eyes, brows, and mouth. A prioritized binary decision tree then detects, based on the actions of these key points, if one of the seven emotional states occurs as frames pass. The completed system runs accurately and achieves a real-time frame rate of approximately 36 frames per second.
A novel facial feature localization technique utilizing a nested cascade classifier tree is proposed. A coarse-to-fine search is performed in which the regions of interest are defined by the response of Haar-like features comprising the cascade classifiers. The individual responses of the Haar-like features are also used to activate finer-level searches. A specially cropped training set derived from the Cohn-Kanade AU-Coded database is also developed and tested. Extensions of this research include further testing to verify the novel facial feature localization technique presented for a full 26-point face model, and implementation of a real-time intensity sensitive automated Facial Action Coding System
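The Haar-like features and attentional cascade mentioned in this abstract can be illustrated with a minimal sketch. It is not the dissertation's nested classifier tree: the function names, the single two-rectangle feature, and the thresholds are assumptions chosen for illustration. It shows how an integral image makes each rectangle sum a constant-time lookup and how a cascade rejects a window at the first failing stage.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) summed-area table for a 2D list of numbers."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def haar_two_rect_horizontal(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: left half minus right half (width even)."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum


def cascade_accepts(ii, window, stages):
    """Hypothetical cascade: stages is a list of (feature_fn, threshold).

    The window is rejected as soon as one stage's response falls below
    its threshold, so most windows cost only a few feature evaluations.
    """
    top, left, height, width = window
    for feature_fn, threshold in stages:
        if feature_fn(ii, top, left, height, width) < threshold:
            return False
    return True
```

In a real detector each stage would be a boosted combination of many such features with learned thresholds; the single-feature stages here only show the early-rejection control flow.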
Towards perceptual intelligence : statistical modeling of human individual and interactive behaviors
Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Architecture, 2000. Includes bibliographical references (p. 279-297). This thesis presents a computational framework for the automatic recognition and prediction of different kinds of human behaviors from video cameras and other sensors, via perceptually intelligent systems that automatically sense and correctly classify human behaviors, by means of Machine Perception and Machine Learning techniques. In the thesis I develop the statistical machine learning algorithms (dynamic graphical models) necessary for detecting and recognizing individual and interactive behaviors. In the case of interactions, two Hidden Markov Models (HMMs) are coupled in a novel architecture called Coupled Hidden Markov Models (CHMMs) that explicitly captures the interactions between them. The algorithms for learning the parameters from data as well as for doing inference with those models are developed and described. Four systems that experimentally evaluate the proposed paradigm are presented: (1) LAFTER, an automatic face detection and tracking system with facial expression recognition; (2) a Tai-Chi gesture recognition system; (3) a pedestrian surveillance system that recognizes typical human-to-human interactions; and (4) a SmartCar for driver maneuver recognition. These systems capture human behaviors of different nature and increasing complexity: first, isolated, single-user facial expressions, then two-hand gestures and human-to-human interactions, and finally complex behaviors where human performance is mediated by a machine, more specifically, a car. The metric that is used for quantifying the quality of the behavior models is their accuracy: how well they are able to recognize the behaviors on testing data. Statistical machine learning usually suffers from a lack of data for estimating all the parameters in the models. In order to alleviate this problem, synthetically generated data are used to bootstrap the models, creating 'prior models' that are further trained using much less real data than would otherwise be required. The Bayesian nature of the approach lets us do so. The predictive power of these models lets us categorize human actions very soon after the beginning of the action. Because of the generic nature of the typical behaviors of each of the implemented systems, there is reason to believe that this approach to modeling human behavior would generalize to other dynamic human-machine systems. This would allow us to automatically recognize people's intended actions, and thus build control systems that dynamically adapt to suit the human's purposes better. by Nuria M. Oliver. Ph.D
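Recognizing a behavior with an HMM usually comes down to scoring an observation sequence under each candidate model and picking the most likely one. The sketch below shows that step for a plain discrete HMM using the forward algorithm; it does not implement the coupled architecture (CHMM) described in the thesis, and the function names and the tuple layout of a model are assumptions made for illustration.

```python
def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) for a discrete HMM via the forward algorithm.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability of emitting symbol o in state s
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
            for s in range(n)
        ]
    return sum(alpha)


def classify(obs, models):
    """Pick the model name whose HMM assigns obs the highest likelihood.

    models maps a (hypothetical) behavior name to a (start, trans, emit) tuple.
    """
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))
```

For long sequences the raw probabilities underflow, so a practical version would work with scaled or log-domain values; that detail is left out to keep the recursion visible.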
Image based human body rendering via regression & MRF energy minimization
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. A machine learning method for synthesising human images is explored to create new images without relying on 3D modelling. Machine learning allows the creation of new images through prediction from existing data based on the use of training images. In the present study, image synthesis is performed at two levels: contour and pixel. A class of learning-based methods is formulated to create object contours from the training image for the synthetic image that allow pixel synthesis within the contours in the second level. The methods rely on applying robust object descriptions, dynamic learning models after appropriate motion segmentation, and machine learning-based frameworks.
Image-based human image synthesis using machine learning is a research focus that has recently gained considerable attention in the field of computer graphics. It makes use of techniques from image/motion analysis in computer vision. The problem lies in the estimation of methods for image-based object configuration (i.e. segmentation, contour outline). Using the results of these analysis methods as bases, the research adopts the machine learning approach, in which human images are synthesised by executing the synthesis of contour and pixels through the learning from training image.
Firstly, the thesis shows how an accurate silhouette is distilled using developed background subtraction for accuracy and efficiency. The traditional vector machine approach is used to avoid ambiguities within the regression process. Images can be represented as a class of accurate and efficient vectors for single images as well as sequences. Secondly, the framework is explored using a unique view of machine learning methods, i.e., support vector regression (SVR), to obtain the convergence result of vectors for contour allocation. The changing relationship between the synthetic image and the training image is expressed as a vector and represented in functions. Finally, a pixel synthesis is performed based on belief propagation.
This thesis proposes a novel image-based rendering method for colour image synthesis using SVR and belief propagation for generalisation to enable the prediction of contour and colour information from input colour images. The methods rely on using appropriately defined and robust input colour images, optimising the input contour images within a sparse SVR framework. Firstly, the thesis shows how contour can effectively and efficiently be predicted from small numbers of input contour images. In addition, the thesis exploits the sparse properties of SVR efficiency, and makes use of SVR to estimate regression function. The image-based rendering method employed in this study enables contour synthesis for the prediction of small numbers of input source images. This procedure avoids the use of complex models and geometry information. Secondly, the method used for human body contour colouring is extended to define eight differently connected pixels, and construct a link distance field via the belief propagation method. The link distance, which acts as the message in propagation, is transformed by improving the low-envelope method in fast distance transform. Finally, the methodology is tested by considering human facial and human body clothing information. The accuracy of the test results for the human body model confirms the efficiency of the proposed method
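The abstract refers to the lower-envelope method used in fast distance transforms. As a point of reference, here is a minimal sketch of the standard one-dimensional version of that technique (the lower envelope of parabolas), not the improved variant the thesis proposes. The function name is an assumption, and the input costs are assumed to be finite numbers.

```python
def distance_transform_1d(f):
    """Compute d[q] = min over p of (q - p)**2 + f[p] in linear time.

    f is a list of finite costs. The result is the lower envelope of the
    parabolas rooted at each index p, sampled at every integer q.
    """
    n = len(f)
    inf = float("inf")
    d = [0] * n
    v = [0] * n            # indices of parabolas that form the envelope
    z = [0.0] * (n + 1)    # boundaries between consecutive envelope parabolas
    k = 0
    z[0], z[1] = -inf, inf
    for q in range(1, n):
        while True:
            p = v[k]
            # Horizontal position where parabola q intersects parabola p
            s = ((f[q] + q * q) - (f[p] + p * p)) / (2 * q - 2 * p)
            if s <= z[k]:
                k -= 1     # parabola p is hidden by parabola q, drop it
            else:
                break
        k += 1
        v[k] = q
        z[k] = s
        z[k + 1] = inf
    k = 0
    for q in range(n):
        while z[k + 1] < q:
            k += 1
        d[q] = (q - v[k]) ** 2 + f[v[k]]
    return d
```

A two-dimensional transform is commonly obtained by running this pass along every row and then along every column of the result.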
On the Design, Implementation and Application of Novel Multi-disciplinary Techniques for explaining Artificial Intelligence Models
284 p. Artificial Intelligence is a fast-moving field of research that has experienced remarkable growth in recent decades. Some of the reasons for this apparently exponential growth are the improvements in computational power, sensing capabilities and data storage, which have resulted in a huge increase in data availability. However, this growth has been mostly led by a performance-based mindset that has pushed models towards a black-box nature. The performance prowess of these methods, along with the rising demand for their implementation, has triggered the birth of a new research field: Explainable Artificial Intelligence (XAI). Like any new field, XAI falls short in cohesiveness. Added to the consequences of dealing with concepts that do not come from the natural sciences (explanations), the tumultuous scene is palpable. This thesis contributes to the field from two different perspectives: a theoretical one and a practical one. The former is based on a thorough literature review that resulted in two main contributions: 1) the proposition of a new definition for Explainable Artificial Intelligence and 2) the creation of a new taxonomy for the field. The latter is composed of two XAI frameworks that address some of the pressing gaps found in the field, namely: 1) an XAI framework for Echo State Networks (ESNs) and 2) an XAI framework for the generation of counterfactuals. The first accounts for the gap concerning randomized neural networks, since they have never been considered within the field of XAI. Unfortunately, choosing the right parameters to initialize these reservoirs depends more on luck and the past experience of the scientist than on sound reasoning. The current approach for assessing whether a reservoir is suited for a particular task is to observe if it yields accurate results, either by handcrafting the values of the reservoir parameters or by automating their configuration via an external optimizer. All in all, this poses tough questions to address when developing an ESN for a certain application, since knowing whether the created structure is optimal for the problem at hand is not possible without actually training it. However, one of the main reasons for not pursuing their application is the mistrust generated by their "black-box" nature. The second presents a new paradigm for counterfactual generation. Among the alternatives for reaching a universal understanding of model explanations, counterfactual examples are arguably the one that best conforms to the principles of human understanding when faced with unknown phenomena. Indeed, discerning what would happen should the initial conditions differ in a plausible fashion is a mechanism often adopted by humans when attempting to understand any unknown. The search for counterfactuals proposed in this thesis is governed by three different objectives. As opposed to the classical approach, in which counterfactuals are just generated following a minimum distance approach of some type, this framework allows for an in-depth analysis of a target model by means of counterfactuals responding to: Adversarial Power, Plausibility and Change Intensity
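The classical minimum-distance approach that the abstract contrasts itself with can be sketched for the simplest possible case. The example assumes a linear binary classifier with weights w and bias b, which is not a model taken from the thesis, and the function names are hypothetical. For such a model the closest counterfactual in Euclidean distance is the projection of the input onto the decision boundary, pushed just across it.

```python
def predict(w, b, x):
    """Linear classifier: class 1 if w.x + b >= 0, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0


def min_distance_counterfactual(w, b, x, margin=1e-6):
    """Return the nearest point to x (L2 distance) that flips the prediction.

    The point is moved along w until its score equals -margin or +margin,
    i.e. just on the other side of the hyperplane w.x + b = 0.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_sq = sum(wi * wi for wi in w)
    target = -margin if score >= 0 else margin
    step = (target - score) / norm_sq
    return [xi + step * wi for xi, wi in zip(x, w)]
```

A counterfactual found this way is close to the input but says nothing about whether it is plausible or how strongly it changes the input, which are the kinds of additional objectives the abstract lists.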
Bridging the gap between reconstruction and synthesis
Embargo applied from the date of the defence until 15 January 2022. 3D reconstruction and image synthesis are two of the main pillars in computer vision. Early works focused on simple tasks such as multi-view reconstruction and texture synthesis. With the rise of Deep Learning, the field has rapidly progressed, making it possible to achieve more complex and high level tasks. For example, the 3D reconstruction results of traditional multi-view approaches are currently obtained with single view methods. Similarly, early pattern based texture synthesis works have resulted in techniques that allow generating novel high-resolution images.
In this thesis we have developed a hierarchy of tools that cover this whole range of problems, lying at the intersection of computer vision, graphics and machine learning. We tackle the problem of 3D reconstruction and synthesis in the wild. Importantly, we advocate for a paradigm in which not everything should be learned. Instead of applying Deep Learning naively we propose novel representations, layers and architectures that directly embed prior 3D geometric knowledge for the task of 3D reconstruction and synthesis. We apply these techniques to problems including scene/person reconstruction and photo-realistic rendering. We first address methods to reconstruct a scene and the clothed people in it while estimating the camera position. Then, we tackle image and video synthesis for clothed people in the wild. Finally, we bridge the gap between reconstruction and synthesis under the umbrella of a unique novel formulation. Extensive experiments conducted throughout this thesis show that the proposed techniques improve the performance of Deep Learning models in terms of the quality of the reconstructed 3D shapes / synthesised images, while reducing the amount of supervision and training data required to train them.
In summary, we provide a variety of low, mid and high level algorithms that can be used to incorporate prior knowledge into different stages of the Deep Learning pipeline and improve performance in tasks of 3D reconstruction and image synthesis. Postprint (published version)