
    3D Morphable Face Models -- Past, Present and Future

    No full text
    In this paper, we provide a detailed survey of 3D Morphable Face Models over the 20 years since they were first proposed. The challenges in building and applying these models, namely capture, modeling, image formation, and image analysis, are still active research topics, and we review the state of the art in each of these areas. We also look ahead, identifying unsolved challenges, proposing directions for future research, and highlighting the broad range of current and future applications.
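    To make the "modeling" ingredient concrete, below is a minimal sketch of the linear generative model at the heart of a 3D morphable face model: a face shape is the mean shape plus a weighted combination of PCA basis directions. All names and dimensions (mean_shape, shape_basis, n_components) are illustrative, and the random "basis" here stands in for components that would be learned by PCA over registered 3D scans.

```python
import numpy as np

# Minimal sketch of the linear 3DMM generative equation:
#   shape(coeffs) = mean_shape + shape_basis @ (stddev * coeffs)
n_vertices = 5000                    # vertices in the mesh template (illustrative)
n_components = 80                    # retained PCA shape components (illustrative)

rng = np.random.default_rng(0)
mean_shape = rng.standard_normal(3 * n_vertices)                   # flattened (x, y, z) mean mesh
shape_basis = rng.standard_normal((3 * n_vertices, n_components))  # stand-in for PCA directions
shape_stddev = np.linspace(10.0, 0.1, n_components)                # per-component std devs

def synthesize_shape(coeffs: np.ndarray) -> np.ndarray:
    """Generate a face mesh from low-dimensional shape coefficients."""
    return mean_shape + shape_basis @ (shape_stddev * coeffs)

# Sampling coefficients from a standard normal yields plausible new faces
# when the basis comes from real scan data.
new_face = synthesize_shape(rng.standard_normal(n_components))
print(new_face.shape)  # (15000,) -> reshape to (5000, 3) for rendering
```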

    Human Motion Analysis for Efficient Action Recognition

    Get PDF
    Automatic understanding of human actions is at the core of several application domains, such as content-based indexing, human-computer interaction, surveillance, and sports video analysis. Recent advances in digital platforms and the exponential growth of video and image data have brought an urgent quest for intelligent frameworks that automatically analyze human motion and predict the corresponding actions from visual data and sensor signals. This thesis presents a collection of methods that target human action recognition using different action modalities. The first method uses the appearance modality and classifies human actions based on heterogeneous global and local features of scene and human-body appearance. The second method harnesses 2D and 3D articulated human poses and analyzes body motion using a discriminative combination of histograms of the body parts' velocities, locations, and correlations for action recognition. The third method presents an optimal scheme for combining the probabilistic predictions from different action modalities by solving a constrained quadratic optimization problem. In addition to the action classification task, we present a study that compares the utility of different pose variants in motion analysis for human action recognition; in particular, we compare the recognition performance when 2D and 3D poses are used. Finally, we demonstrate the efficiency of our pose-based method by spotting and segmenting motion gestures in real time from a continuous input video stream for the recognition of Italian sign language gestures.
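    As an illustration of the fusion step in the third method, the sketch below learns convex weights over per-modality class-probability predictions by solving a small constrained quadratic program (squared error against one-hot validation labels, with the weights constrained to the probability simplex). It is a generic stand-in under these stated assumptions, not the thesis's exact formulation; all data here is simulated.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n_samples, n_classes, n_modalities = 200, 10, 3

# Simulated per-modality class-probability predictions (rows sum to 1) and labels.
probs = rng.dirichlet(np.ones(n_classes), size=(n_modalities, n_samples))
labels = rng.integers(0, n_classes, size=n_samples)
onehot = np.eye(n_classes)[labels]

def objective(w):
    # Fuse modalities with weights w, then measure squared error to labels.
    fused = np.tensordot(w, probs, axes=1)          # (n_samples, n_classes)
    return np.sum((fused - onehot) ** 2)

# Quadratic objective, simplex constraints: w >= 0 and sum(w) = 1.
res = minimize(
    objective,
    x0=np.full(n_modalities, 1.0 / n_modalities),
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n_modalities,
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
)
print("fusion weights:", np.round(res.x, 3))
```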

    Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

    Full text link
    Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through the integration of multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community, given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this paper provides an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining two key principles, modality heterogeneity and interconnection, that have driven subsequent innovations, and propose a taxonomy of six core technical challenges covering historical and recent trends: representation, alignment, reasoning, generation, transference, and quantification. Recent technical achievements are presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research identified by our taxonomy.

    Robust subspace learning for static and dynamic affect and behaviour modelling

    Get PDF
    Machine analysis of human affect and behavior in naturalistic contexts has witnessed growing attention in the last decade from various disciplines, ranging from social and cognitive sciences to machine learning and computer vision. Endowing machines with the ability to seamlessly detect, analyze, model, and predict, as well as simulate and synthesize, manifestations of internal emotional and behavioral states in real-world data is deemed essential for the deployment of next-generation, emotionally and socially competent human-centered interfaces. In this thesis, we are primarily motivated by the problem of modeling, recognizing, and predicting spontaneous expressions of non-verbal human affect and behavior manifested through either low-level facial attributes in static images or high-level semantic events in image sequences. Both visual data and annotations of naturalistic affect and behavior naturally contain noisy measurements of unbounded magnitude at random locations, commonly referred to as ‘outliers’. We present machine learning methods that are robust to such gross, sparse noise. First, we deal with static analysis of face images, viewing the latter as a superposition of mutually incoherent, low-complexity components corresponding to facial attributes, such as facial identity, expressions, and activation of atomic facial muscle actions. We develop a robust, discriminant dictionary learning framework to extract these components from grossly corrupted training data and combine it with sparse representation to recognize the associated attributes. We demonstrate that our framework can jointly address interrelated classification tasks such as face and facial expression recognition. Inspired by the well-documented importance of the temporal aspect in perceiving affect and behavior, we direct the bulk of our research efforts to continuous-time modeling of dimensional affect and social behavior. Having identified a gap in the literature, namely the lack of data containing annotations of social attitudes in continuous time and scale, we first curate a new audio-visual database of multi-party conversations from political debates, annotated frame-by-frame in terms of real-valued conflict intensity, and use it to conduct the first study on continuous-time conflict intensity estimation. Our experimental findings corroborate previous evidence of the inability of existing classifiers to capture the hidden temporal structure of affective and behavioral displays. We then present a novel dynamic behavior analysis framework which models temporal dynamics explicitly, based on the natural assumption that continuous-time annotations of smoothly varying affect or behavior can be viewed as outputs of a low-complexity linear dynamical system in which behavioral cues (features) act as system inputs. A novel robust structured rank minimization framework is proposed to estimate the system parameters in the presence of gross corruptions and partially missing data. Experiments on prediction of dimensional conflict and affect, as well as multi-object tracking from detections, validate the effectiveness of our predictive framework and demonstrate for the first time that complex human behavior and affect can be learned and predicted from small training sets of person(s)-specific observations.
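    To illustrate the dynamical-system view of continuous annotations, the sketch below fits a low-order ARX model (autoregressive with exogenous feature inputs) by ordinary least squares. This is a deliberately simple stand-in for the robust structured rank-minimization estimator the thesis proposes, and all signals are simulated; names like lag and n_feats are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n_feats, lag = 500, 4, 2

# Simulate a smoothly varying annotation y_t (e.g. conflict intensity)
# driven by behavioural cue features u_t through a low-order linear system.
u = rng.standard_normal((T, n_feats))
y = np.zeros(T)
for t in range(lag, T):
    y[t] = 0.7 * y[t - 1] - 0.2 * y[t - 2] + u[t - 1] @ np.array([0.5, -0.3, 0.2, 0.1])
y += 0.05 * rng.standard_normal(T)   # annotation noise

# Fit the regression y_t ~ [y_{t-1}, y_{t-2}, u_{t-1}] by least squares.
X = np.column_stack([y[lag - 1:-1], y[lag - 2:-2], u[lag - 1:-1]])
theta, *_ = np.linalg.lstsq(X, y[lag:], rcond=None)
print("recovered parameters:", np.round(theta, 2))
```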

    Weakly-Labeled Data and Identity-Normalization for Facial Image Analysis

    Get PDF
    This thesis deals with improving facial recognition and facial expression analysis using weak sources of information. Labeled data is often scarce, but unlabeled data often contains information that is helpful for learning a model. This thesis describes two examples of using this insight. The first is a novel method for face recognition based on leveraging weakly or noisily labeled data. Unlabeled data can be acquired in a way that provides additional features. These features, while not available for the labeled data, may still be useful with some foresight. This thesis discusses combining a labeled face recognition dataset with face images extracted from videos on YouTube and face images returned by a search engine. The web search engine and the video search engine can be viewed as very weak alternative classifiers that provide “weak labels.” Using the results of these two types of search queries as different forms of weak labels, a robust method for classification can be developed. This method is based on graphical models, but also incorporates a probabilistic margin. More specifically, using a model inspired by the variational relevance vector machine (RVM), a probabilistic alternative to transductive support vector machines (TSVMs) is developed. In contrast to previous formulations of the RVM, an Exponential hyperprior is introduced to produce an approximation to the L1 penalty. Experimental results where noisy labels are simulated, and separate experiments with noisy labels from image and video search results using names as queries, both indicate that weak label information can be successfully leveraged. Since the model depends heavily on sparse kernel regression methods, these methods are reviewed and discussed in detail. Several algorithms using sparsity-inducing priors are described in detail, and experiments illustrate the behavior of each of these priors. Used in conjunction with logistic regression, each sparsity-inducing prior is shown to have varying effects in terms of sparsity and model fit. Extending the approach to other machine learning methods is straightforward, since it is grounded firmly in Bayesian probability. An experiment in structured prediction, using conditional random fields on a medical imaging task, illustrates how sparse priors can easily be incorporated into other tasks and can yield improved results. Labeled data may also contain weak sources of information that are not necessarily used to maximum effect. For example, facial image datasets for tasks such as performance-driven facial animation, emotion recognition, and facial key-point or landmark prediction often contain labels other than those for the task at hand. In emotion recognition data, for example, emotion labels are often scarce. This may be because the images are extracted from a video in which only a small segment depicts the labeled emotion. As a result, many images of the subject in the same setting, captured with the same camera, go unused. However, this data can be used to improve the ability of learning techniques to generalize to new and unseen individuals by explicitly modeling previously seen variations related to identity and expression. Once identity and expression variation are separated, simpler supervised approaches can generalize quite well to unseen subjects. More specifically, in this thesis, probabilistic modeling of these sources of variation is used to “identity-normalize” various facial image representations. A variety of experiments are described in which performance on emotion recognition, markerless performance-driven facial animation, and facial key-point tracking is consistently improved. This includes an algorithm showing how this kind of normalization can be used for facial key-point localization. In many cases, facial images come with additional sources of information that can be used to improve the tasks of interest. This includes weak labels provided during data gathering, such as the search query used to acquire the data, as well as identity information in the case of many experimental image databases. The main argument of this thesis is that this information should be used, and it describes methods for doing so using the tools of probability.
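    A minimal sketch of the sparsity mechanism mentioned above: an Exponential hyperprior on a Gaussian weight's variance yields a Laplace marginal prior, whose MAP estimate coincides with L1-penalized regression. The example below uses off-the-shelf L1 logistic regression from scikit-learn rather than the variational RVM developed in the thesis, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, d, d_relevant = 300, 50, 5

# Synthetic classification problem where only the first 5 features matter.
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:d_relevant] = rng.standard_normal(d_relevant) * 3.0
y = (X @ w_true + 0.5 * rng.standard_normal(n) > 0).astype(int)

# L1 penalty = MAP under a Laplace prior (the Exponential-hyperprior marginal).
for C in (0.01, 0.1, 1.0):  # smaller C = stronger prior = sparser model
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    n_nonzero = np.count_nonzero(clf.coef_)
    print(f"C={C:<5} nonzero weights: {n_nonzero}/{d}")
```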

    Non-acyclicity of coset lattices and generation of finite groups

    Get PDF

    Journal of Telecommunications and Information Technology, 2006, no. 1

    Get PDF
    Quarterly journal.

    Efficient Design, Training, and Deployment of Artificial Neural Networks

    Get PDF
    Over the last decade, artificial neural networks, especially deep neural networks, have emerged as the main modeling tool in Machine Learning, allowing us to tackle an increasing number of real-world problems in various fields, most notably computer vision, natural language processing, and biomedical and financial analysis. The success of deep neural networks can be attributed to many factors, namely the increasing amount of available data, the development of dedicated hardware, advancements in optimization techniques, and especially the invention of novel neural network architectures. Nowadays, state-of-the-art neural networks that achieve the best performance in any field are usually formed of many layers comprising millions, or even billions, of parameters. Despite their spectacular performance, optimizing a single state-of-the-art neural network often requires a tremendous amount of computation, which can take several days on high-end hardware. More importantly, it took several years of experimentation for the community to gradually discover effective neural network architectures, moving from AlexNet and VGGNet to ResNet and then DenseNet. In addition to the expensive and time-consuming experimentation process, deep neural networks, which require powerful processors to operate during the deployment phase, cannot easily be deployed to mobile or embedded devices. For these reasons, improving the design, training, and deployment of deep neural networks has become an important area of research in the Machine Learning field. This thesis makes several contributions in this area, which can be grouped into two main categories. The first category consists of research that focuses on designing neural network architectures that are efficient not only in terms of accuracy but also of computational complexity. In the first contribution under this category, computational efficiency is addressed at the filter level through a handcrafted design for convolutional neural networks, which are the basis of most deep neural networks. More specifically, the multilinear convolution filter is proposed to replace the linear convolution filter, a fundamental element of a convolutional neural network. The new filter design not only better captures the multidimensional structures inherent in CNNs but also requires far fewer parameters to be estimated (see the sketch below). While using efficient algebraic transforms and approximation techniques to tackle the design problem can significantly reduce the memory and computational footprint of neural network models, this approach requires a great deal of trial and error. In addition, the simple neuron model used in most neural networks today, which only performs a linear transformation followed by a nonlinear activation, cannot effectively mimic the diverse activities of biological neurons. For this reason, the second and third contributions transition from a handcrafted, manual design approach to an algorithmic approach in which the type of transformations performed by each neuron, as well as the topology of the neural network, are optimized in a systematic and completely data-dependent manner. As a result, the algorithms proposed in the second and third contributions are capable of designing highly accurate and compact neural networks while requiring minimal human effort or intervention in the design process.
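    As a concrete illustration of why a multilinear filter can be far cheaper than a full linear one, the sketch below factorizes a (channels x height x width) convolution kernel as a rank-1 outer product of three vectors. This is a generic rank-1 illustration of the parameter-saving idea, not necessarily the exact parameterization proposed in the thesis.

```python
import numpy as np

# A full linear filter over a (C, K, K) receptive field stores C*K*K weights;
# a rank-1 multilinear filter stores only C + 2K numbers whose outer product
# reconstructs the full kernel.
C, K = 64, 3
rng = np.random.default_rng(5)
w_channel = rng.standard_normal(C)
w_row = rng.standard_normal(K)
w_col = rng.standard_normal(K)

kernel = np.einsum("c,i,j->cij", w_channel, w_row, w_col)   # full (C, K, K) kernel
full_params = C * K * K          # 576 weights in the linear filter
multi_params = C + 2 * K         # 70 weights in the rank-1 factorization
print(kernel.shape, full_params, multi_params)

# The filter response on a patch x is <kernel, x>, computable without ever
# materializing the full kernel: contract each mode separately.
x = rng.standard_normal((C, K, K))
resp = np.einsum("cij,c,i,j->", x, w_channel, w_row, w_col)
assert np.isclose(resp, np.sum(kernel * x))
```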
    Although significant progress has been made in reducing the runtime complexity of neural network models on embedded devices, the majority of these methods have been demonstrated on powerful embedded devices, which are costly in applications that require large-scale deployment, such as surveillance systems. In such scenarios, complete on-device processing can be infeasible; hybrid solutions, in which some preprocessing is conducted on the client side while the heavy computation takes place on the server side, are more practical. The second category of contributions made in this thesis focuses on efficient learning methodologies for hybrid solutions that take into account both the signal acquisition and inference steps. More concretely, the first contribution under this category is the formulation of the Multilinear Compressive Learning framework, in which multidimensional signals are compressively acquired and inference is made based on the compressed signals, bypassing the signal reconstruction step. In the second contribution, the relationships between the input signal resolution, the compression rate, and the learning performance of Multilinear Compressive Learning systems are systematically and empirically analyzed, leading to the discovery of a surrogate performance indicator that can be used to approximately rank the learning performance of different sensor configurations without conducting the entire optimization process. Nowadays, many communication protocols support adaptive data transmission to maximize data throughput and minimize energy consumption depending on the strength of the network connection. The last contribution of this thesis proposes an extension of the Multilinear Compressive Learning framework with an adaptive compression capability, which makes it possible to take advantage of the adaptive-rate transmission features of existing communication protocols to maximize the informational throughput of the whole system. Finally, all methodological contributions of this thesis are accompanied by extensive empirical analyses demonstrating their performance and computational advantages over existing methods in different computer vision applications, such as object recognition, face verification, human activity classification, and visual information retrieval.
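    A minimal sketch of the compressive acquisition step in Multilinear Compressive Learning: the multidimensional signal is compressed by separate random sensing matrices applied along each mode (a Tucker-style projection), and the small measurement tensor, rather than a reconstruction, is what the downstream classifier consumes. All shapes and names, and the adaptive-rate trick shown at the end, are illustrative assumptions.

```python
import numpy as np

H, W, C = 128, 128, 3        # input resolution (illustrative)
h, w, c = 16, 16, 2          # measurement resolution (~1% of the input entries)

rng = np.random.default_rng(4)
P1 = rng.standard_normal((h, H))   # mode-1 sensing matrix
P2 = rng.standard_normal((w, W))   # mode-2 sensing matrix
P3 = rng.standard_normal((c, C))   # mode-3 sensing matrix

def acquire(x):
    """Mode-wise compressive measurement: y = x x1 P1 x2 P2 x3 P3."""
    return np.einsum("ai,bj,ck,ijk->abc", P1, P2, P3, x)

image = rng.standard_normal((H, W, C))
measurement = acquire(image)
print(measurement.shape)     # (16, 16, 2): fed directly to the classifier

# Adaptive-rate variant: keeping only the first rows of each sensing matrix
# yields a coarser measurement without re-designing the matrices.
coarse = np.einsum("ai,bj,ck,ijk->abc", P1[:8], P2[:8], P3, image)
print(coarse.shape)          # (8, 8, 2)
```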