    Robust correlated and individual component analysis

    © 1979-2012 IEEE. Recovering correlated and individual components of two, possibly temporally misaligned, sets of data is a fundamental task in disciplines such as image, vision, and behavior computing, with application to problems such as multi-modal fusion (via correlated components), predictive analysis, and clustering (via the individual ones). Here, we study the extraction of correlated and individual components under real-world conditions, namely i) the presence of gross non-Gaussian noise and ii) temporally misaligned data. In this light, we propose a method for the Robust Correlated and Individual Component Analysis (RCICA) of two sets of data in the presence of gross, sparse errors. We furthermore extend RCICA in order to handle temporal incongruities arising in the data. To this end, two suitable optimization problems are solved. The generality of the proposed methods is demonstrated by applying them to four applications, namely i) heterogeneous face recognition, ii) multi-modal feature fusion for human behavior analysis (i.e., audio-visual prediction of interest and conflict), iii) face clustering, and iv) the temporal alignment of facial expressions. Experimental results on 2 synthetic and 7 real-world datasets indicate the robustness and effectiveness of the proposed methods on these application domains, outperforming other state-of-the-art methods in the field.
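
For readers unfamiliar with "correlated components", the following is a minimal sketch of classical (non-robust) Canonical Correlation Analysis in NumPy. It is not the RCICA method of the abstract above, which additionally models individual components and gross sparse errors. The function name `cca`, the ridge term `reg` and the toy data are assumptions made for illustration only.

```python
import numpy as np

def cca(X, Y, n_components=1, reg=1e-8):
    """Classical CCA sketch. X is (n_samples, p), Y is (n_samples, q).

    Returns projection matrices Wx, Wy and the canonical correlations.
    `reg` is a small ridge term (an assumption) that keeps the covariance
    matrices positive definite.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    # Whiten both views with Cholesky factors: Cxx = Lx Lx^T, Cyy = Ly Ly^T.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    # T = Lx^{-1} Cxy Ly^{-T}; its singular values are the canonical correlations.
    A = np.linalg.solve(Lx, Cxy)
    T = np.linalg.solve(Ly, A.T).T
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    # Map the singular vectors back to the original coordinates.
    Wx = np.linalg.solve(Lx.T, U[:, :n_components])
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :n_components])
    return Wx, Wy, s[:n_components]

# Toy data: two views driven by one shared latent signal plus small noise.
rng = np.random.default_rng(0)
z = rng.standard_normal(500)
X = np.outer(z, [1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal((500, 3))
Y = np.outer(z, [2.0, 1.0]) + 0.1 * rng.standard_normal((500, 2))
Wx, Wy, corr = cca(X, Y)
print("first canonical correlation:", corr[0])
```

Because plain CCA relies on sample covariances, a few grossly corrupted entries can distort the recovered subspace, which is the situation robust variants are designed for.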

    Machine learning for automatic analysis of affective behaviour

    The automated analysis of affect has attracted rapidly increasing attention from researchers over the past two decades, as it constitutes a fundamental step towards achieving next-generation computing technologies and integrating them into everyday life (e.g. via affect-aware, user-adaptive interfaces, medical imaging, health assessment, ambient intelligence etc.). The work presented in this thesis focuses on several fundamental problems that arise on the way to reliable, accurate and robust affect sensing systems. In more detail, the motivation behind this work lies in recent developments in the field, namely (i) the creation of large, audiovisual databases for affect analysis in the so-called "Big-Data" era, along with (ii) the need to deploy systems under demanding, real-world conditions. These developments led to the requirement for the analysis of emotion expressions continuously in time, instead of merely processing static images, thus unveiling the wide range of temporal dynamics related to human behaviour to researchers. The latter entails another deviation from the traditional line of research in the field: instead of focusing on predicting posed, discrete basic emotions (happiness, surprise etc.), it became necessary to focus on spontaneous, naturalistic expressions captured under settings closer to real-world conditions, utilising more expressive emotion descriptions than a set of discrete labels. To this end, the main motivation of this thesis is to deal with challenges arising from the adoption of continuous dimensional emotion descriptions under naturalistic scenarios, which are considered to capture a much wider spectrum of expressive variability than basic emotions and, most importantly, to model emotional states that humans commonly express in their everyday life. In the first part of this thesis, we attempt to demystify the largely unexplored problem of predicting continuous emotional dimensions.
This work is amongst the first to explore the problem of predicting emotion dimensions via multi-modal fusion, utilising facial expressions, auditory cues and shoulder gestures. A major contribution of the work presented in this thesis lies in proposing the utilisation of various relationships exhibited by emotion dimensions in order to improve the prediction accuracy of machine learning methods - an idea which has since been taken up by other researchers in the field. In order to experimentally evaluate this, we extend methods such as the Long Short-Term Memory Neural Networks (LSTM), the Relevance Vector Machine (RVM) and Canonical Correlation Analysis (CCA) in order to exploit output relationships in learning. As shown, this increases the accuracy of machine learning models applied to this task. The annotation of continuous dimensional emotions is a tedious task, highly prone to the influence of various types of noise. Performed in real time by several annotators (usually experts), the annotation process can be heavily biased by factors such as subjective interpretations of the emotional states observed, the inherent ambiguity of labels related to human behaviour, the varying reaction lags exhibited by each annotator as well as other factors such as input device noise and annotation errors. In effect, the annotations manifest a strong spatio-temporal annotator-specific bias. Failing to properly deal with annotation bias and noise leads to an inaccurate ground truth, and therefore to machine learning models that generalise poorly. This makes the proper fusion of multiple annotations, and the inference of a clean, corrected version of the "ground truth", one of the most significant challenges in the area. A highly important contribution of this thesis lies in the introduction of Dynamic Probabilistic Canonical Correlation Analysis (DPCCA), a method aimed at fusing noisy continuous annotations.
By adopting a private-shared space model, we isolate the individual characteristics that are annotator-specific and not shared, while most importantly we model the common, underlying annotation which is shared by annotators (i.e., the derived ground truth). By further learning temporal dynamics and incorporating a time-warping process, we are able to derive a clean version of the ground truth given multiple annotations, eliminating temporal discrepancies and other nuisances. The integration of the temporal alignment process within the proposed private-shared space model makes DPCCA suitable for the problem of temporally aligning human behaviour; that is, given temporally unsynchronised sequences (e.g., videos of two persons smiling), the goal is to generate the temporally synchronised sequences (e.g., the smile apex should co-occur in the videos). Temporal alignment is an important problem for many applications where multiple datasets need to be aligned in time. Furthermore, it is particularly suitable for the analysis of facial expressions, where the activation of facial muscles (Action Units) typically follows a set of predefined temporal phases. A highly challenging scenario is when the observations are perturbed by gross, non-Gaussian noise (e.g., occlusions), as is often the case when analysing data acquired under real-world conditions. To account for non-Gaussian noise, a robust variant of Canonical Correlation Analysis (RCCA) for robust fusion and temporal alignment is proposed. The model captures the shared, low-rank subspace of the observations, isolating the gross noise in a sparse noise term. RCCA is amongst the first robust variants of CCA proposed in the literature and, as we show in related experiments, outperforms other state-of-the-art methods for related tasks such as the fusion of multiple modalities under gross noise.
Beyond private-shared space models, Component Analysis (CA) is an integral component of most computer vision systems, particularly in terms of reducing the usually high-dimensional input spaces in a meaningful manner pertaining to the task-at-hand (e.g., prediction, clustering). A final, significant contribution of this thesis lies in proposing the first unifying framework for probabilistic component analysis. The proposed framework covers most well-known CA methods, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Locality Preserving Projections (LPP) and Slow Feature Analysis (SFA), providing further theoretical insights into the workings of CA. Moreover, the proposed framework is highly flexible, enabling novel CA methods to be generated by simply manipulating the connectivity of latent variables (i.e. the latent neighbourhood). As shown experimentally, methods derived via the proposed framework outperform other equivalents in several problems related to affect sensing and facial expression analysis, while providing advantages such as reduced complexity and explicit variance modelling.
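
The temporal alignment goal described in this abstract (for example, making the smile apex co-occur in two sequences) can be illustrated with classical dynamic time warping. This is a hedged sketch of textbook DTW on 1-D signals, not the warping procedure used inside DPCCA or RCCA; the function name `dtw` and the toy "intensity" sequences are assumptions.

```python
import numpy as np

def dtw(a, b):
    """Classical dynamic time warping between two 1-D sequences.

    Returns the accumulated alignment cost and the warping path as a list
    of (index_in_a, index_in_b) pairs.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # D[i, j] = minimal cost of aligning a[:i] with b[:j].
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # Backtrack from the end to recover one optimal path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return D[n, m], path

# Two toy expression-intensity curves whose apex (value 2) occurs at
# different frames: frame 3 in the first, frame 2 in the second.
seq_a = [0, 0, 1, 2, 1, 0]
seq_b = [0, 1, 2, 1, 0, 0]
total_cost, warping_path = dtw(seq_a, seq_b)
print(total_cost, warping_path)
```

In this toy case the recovered path pairs the two apex frames, which is the co-occurrence property the abstract refers to. Plain DTW operates directly on the observations, so it has no mechanism for separating shared from individual components or for handling gross noise.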

    Robust subspace learning for static and dynamic affect and behaviour modelling

    Machine analysis of human affect and behavior in naturalistic contexts has attracted growing attention in the last decade from various disciplines ranging from social and cognitive sciences to machine learning and computer vision. Endowing machines with the ability to seamlessly detect, analyze, model, predict as well as simulate and synthesize manifestations of internal emotional and behavioral states in real-world data is deemed essential for the deployment of next-generation, emotionally- and socially-competent human-centered interfaces. In this thesis, we are primarily motivated by the problem of modeling, recognizing and predicting spontaneous expressions of non-verbal human affect and behavior manifested through either low-level facial attributes in static images or high-level semantic events in image sequences. Both visual data and annotations of naturalistic affect and behavior naturally contain noisy measurements of unbounded magnitude at random locations, commonly referred to as ‘outliers’. We present here machine learning methods that are robust to such gross, sparse noise. First, we deal with static analysis of face images, viewing the latter as a superposition of mutually-incoherent, low-complexity components corresponding to facial attributes, such as facial identity, expressions and activation of atomic facial muscle actions. We develop a robust, discriminant dictionary learning framework to extract these components from grossly corrupted training data and combine it with sparse representation to recognize the associated attributes. We demonstrate that our framework can jointly address interrelated classification tasks such as face and facial expression recognition. Inspired by the well-documented importance of the temporal aspect in perceiving affect and behavior, we direct the bulk of our research efforts into continuous-time modeling of dimensional affect and social behavior.
Having identified a gap in the literature, namely the lack of data containing annotations of social attitudes in continuous time and scale, we first curate a new audio-visual database of multi-party conversations from political debates annotated frame-by-frame in terms of real-valued conflict intensity and use it to conduct the first study on continuous-time conflict intensity estimation. Our experimental findings corroborate previous evidence indicating the inability of existing classifiers to capture the hidden temporal structures of affective and behavioral displays. We present here a novel dynamic behavior analysis framework which models temporal dynamics in an explicit way, based on the natural assumption that continuous-time annotations of smoothly-varying affect or behavior can be viewed as outputs of a low-complexity linear dynamical system when behavioral cues (features) act as system inputs. A novel robust structured rank minimization framework is proposed to estimate the system parameters in the presence of gross corruptions and partially missing data. Experiments on prediction of dimensional conflict and affect as well as multi-object tracking from detection validate the effectiveness of our predictive framework and demonstrate for the first time that complex human behavior and affect can be learned and predicted based on small training sets of person(s)-specific observations.

    Machine learning for efficient recognition of anatomical structures and abnormalities in biomedical images

    Three studies have been carried out to investigate new approaches to efficient image segmentation and anomaly detection. The first study investigates the use of deep learning in patch-based segmentation. Current approaches to patch-based segmentation use low-level features such as the sum of squared differences between patches. We argue that better segmentation can be achieved by harnessing the power of deep neural networks. Currently these networks make extensive use of convolutional layers. However, we argue that in the context of patch-based segmentation, convolutional layers have little advantage over the canonical artificial neural network architecture. This is because a patch is small, and does not need decomposition and thus will not benefit from convolution. Instead, we make use of the canonical architecture in which neurons only compute dot products, but also incorporate modern techniques of deep learning. The resulting classifier is much faster and less memory-hungry than convolution-based networks. In a test application to the segmentation of the hippocampus in human brain MR images, we significantly outperformed prior art with a median Dice score up to 90.98% at a near real-time speed (<1s). The second study is an investigation into mouse phenotyping, and develops a high-throughput framework to detect morphological abnormality in mouse embryo micro-CT images. Existing work in this line is centred on either the detection of phenotype-specific features or comparative analytics. The former approach lacks generality and the latter can often fail, for example, when the abnormality is not associated with severe volume variation. Both these approaches often require image segmentation as a pre-requisite, which is very challenging when applied to embryo phenotyping. A new approach to this problem, in which non-rigid registration is combined with robust principal component analysis (RPCA), is proposed.
The new framework is able to efficiently perform abnormality detection in a batch of images. It is sensitive to both volumetric and non-volumetric variations, and does not require image segmentation. In a validation study, it successfully distinguished the abnormal VSD and polydactyly phenotypes from the normal, respectively, at 85.19% and 88.89% specificities, with 100% sensitivity in both cases. The third study investigates the RPCA technique in more depth. RPCA is an extension of PCA that tolerates certain levels of data distortion during feature extraction, and is able to decompose images into regular and singular components. It has previously been applied to many computer vision problems (e.g. video surveillance), attaining excellent performance. However, these applications commonly rest on a critical condition: in the majority of images being processed, there is a background with very little variation. By contrast, in biomedical imaging there is significant natural variation across different images, resulting from inter-subject variability and physiological movements. Non-rigid registration can go some way towards reducing this variance, but cannot eliminate it entirely. To address this problem we propose a modified framework (RPCA-P) that is able to incorporate natural variation priors and adjust outlier tolerance locally, so that voxels associated with structures of higher variability are compensated with a higher tolerance in regularity estimation. An experimental study was carried out on the same mouse embryo micro-CT data, and notably improved the detection specificity to 94.12% for the VSD and 90.97% for the polydactyly, while maintaining the sensitivity at 100%.
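
The decomposition into "regular and singular components" mentioned above is commonly posed as Principal Component Pursuit: split a data matrix M into a low-rank part L and a sparse part S. Below is a minimal sketch of the standard augmented-Lagrangian (alternating directions) solver with a single global sparsity weight; it is not the RPCA-P variant of the abstract, which adjusts tolerance locally. The function names, default parameter choices and synthetic data are assumptions for illustration.

```python
import numpy as np

def soft_threshold(X, tau):
    """Entrywise shrinkage: proximal operator of tau * l1-norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Shrink singular values: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def rpca_pcp(M, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Sketch of Principal Component Pursuit: M ~ L (low rank) + S (sparse).

    Minimises ||L||_* + lam * ||S||_1 subject to L + S = M. The defaults for
    `lam` and `mu` follow commonly used heuristics and are assumptions here;
    M must be non-zero.
    """
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))
    if mu is None:
        mu = m * n / (4.0 * np.abs(M).sum())
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # Lagrange multipliers for the constraint L + S = M
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = singular_value_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        R = M - L - S
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * norm_M:
            break
    return L, S

# Synthetic check: rank-2 matrix plus about 5% large sparse corruptions.
rng = np.random.default_rng(0)
L_true = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 60))
S_true = np.zeros((60, 60))
mask = rng.random((60, 60)) < 0.05
S_true[mask] = 5.0 * rng.choice([-1.0, 1.0], size=int(mask.sum()))
L_hat, S_hat = rpca_pcp(L_true + S_true)
print("relative error of L:", np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true))
```

The single weight `lam` treats every entry alike, which is the limitation the abstract identifies for biomedical images with spatially varying natural variation.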

    Large-area visually augmented navigation for autonomous underwater vehicles

    Submitted to the Joint Program in Applied Ocean Science & Engineering in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology and the Woods Hole Oceanographic Institution, June 2005. This thesis describes a vision-based, large-area, simultaneous localization and mapping (SLAM) algorithm that respects the low-overlap imagery constraints typical of autonomous underwater vehicles (AUVs) while exploiting the inertial sensor information that is routinely available on such platforms. We adopt a systems-level approach exploiting the complementary aspects of inertial sensing and visual perception from a calibrated pose-instrumented platform. This systems-level strategy yields a robust solution to underwater imaging that overcomes many of the unique challenges of a marine environment (e.g., unstructured terrain, low-overlap imagery, moving light source). Our large-area SLAM algorithm recursively incorporates relative-pose constraints using a view-based representation that exploits exact sparsity in the Gaussian canonical form. This sparsity allows for efficient O(n) update complexity in the number of images composing the view-based map by utilizing recent multilevel relaxation techniques. We show that our algorithmic formulation is inherently sparse unlike other feature-based canonical SLAM algorithms, which impose sparseness via pruning approximations. In particular, we investigate the sparsification methodology employed by sparse extended information filters (SEIFs) and offer new insight as to why, and how, its approximation can lead to inconsistencies in the estimated state errors. Lastly, we present a novel algorithm for efficiently extracting consistent marginal covariances useful for data association from the information matrix.
In summary, this thesis advances the current state-of-the-art in underwater visual navigation by demonstrating end-to-end automatic processing of the largest visually navigated dataset to date using data collected from a survey of the RMS Titanic (path length over 3 km and 3100 m² of mapped area). This accomplishment embodies the summed contributions of this thesis to several current SLAM research issues including scalability, 6-degree-of-freedom motion, unstructured environments, and visual perception. This work was funded in part by the CenSSIS ERC of the National Science Foundation under grant EEC-9986821, in part by the Woods Hole Oceanographic Institution through a grant from the Penzance Foundation, and in part by a NDSEG Fellowship awarded through the Department of Defense.
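
The "exact sparsity in the Gaussian canonical form" can be seen in a toy setting: when each measurement relates only two poses, fusing it touches only the corresponding entries of the information matrix. The sketch below uses scalar (1-D) poses and a hypothetical helper `add_relative_constraint`; it is an illustration under those assumptions, not the thesis's 6-degree-of-freedom algorithm, and the dense inverse is used only to make the contrast visible.

```python
import numpy as np

def add_relative_constraint(Lambda, eta, i, j, z, sigma):
    """Fuse a 1-D relative-pose measurement z = x_j - x_i + noise (std sigma).

    In information (canonical) form this adds J^T J / sigma^2 to Lambda and
    J^T z / sigma^2 to eta, where J has -1 at index i and +1 at index j, so
    only the (i, i), (j, j), (i, j) and (j, i) entries change.
    """
    w = 1.0 / sigma**2
    Lambda[i, i] += w
    Lambda[j, j] += w
    Lambda[i, j] -= w
    Lambda[j, i] -= w
    eta[i] -= w * z
    eta[j] += w * z

n_poses = 5
Lambda = np.zeros((n_poses, n_poses))
eta = np.zeros(n_poses)

# Prior anchoring the first pose at 0 (assumed std 0.1).
Lambda[0, 0] += 1.0 / 0.1**2

# Sequential relative-pose constraints between consecutive poses.
for k in range(n_poses - 1):
    add_relative_constraint(Lambda, eta, k, k + 1, z=1.0, sigma=0.2)

# One loop-closure style constraint between the first and last pose.
add_relative_constraint(Lambda, eta, 0, n_poses - 1, z=4.0, sigma=0.2)

mean = np.linalg.solve(Lambda, eta)   # state estimate
Sigma = np.linalg.inv(Lambda)         # dense covariance, for comparison only
print("non-zeros in information matrix:", np.count_nonzero(Lambda))
print("non-zeros in covariance matrix:", np.count_nonzero(Sigma))
print("estimated poses:", mean)
```

The information matrix keeps one off-diagonal pair per constraint while its inverse is fully dense, which is why recovering marginal covariances cheaply is a separate problem.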

    Feature regression for continuous pose estimation of object categories

    Video-based face alignment using efficient sparse and low-rank approach.

    Wu, King Keung. "August 2011." Thesis (M.Phil.)--Chinese University of Hong Kong, 2011. Includes bibliographical references (p. 119-126). Abstracts in English and Chinese.

    Abstract
    Acknowledgement
    Chapter 1 --- Introduction
      1.1 --- Overview of Face Alignment Algorithms
        1.1.1 --- Objectives
        1.1.2 --- Motivation: Photo-realistic Talking Head
        1.1.3 --- Existing methods
      1.2 --- Contributions
      1.3 --- Outline of the Thesis
    Chapter 2 --- Sparse Signal Representation
      2.1 --- Introduction
      2.2 --- Problem Formulation
        2.2.1 --- l0-norm minimization
        2.2.2 --- Uniqueness
      2.3 --- Basis Pursuit
        2.3.1 --- From l0-norm to l1-norm
        2.3.2 --- l0-l1 Equivalence
      2.4 --- l1-Regularized Least Squares
        2.4.1 --- Noisy case
        2.4.2 --- Over-determined systems of linear equations
      2.5 --- Summary
    Chapter 3 --- Sparse Corruptions and Principal Component Pursuit
      3.1 --- Introduction
      3.2 --- Sparse Corruptions
        3.2.1 --- Sparse Corruptions and l1-Error
        3.2.2 --- l1-Error and Least Absolute Deviations
        3.2.3 --- l1-Regularized l1-Error
      3.3 --- Robust Principal Component Analysis (RPCA) and Principal Component Pursuit
        3.3.1 --- Principal Component Analysis (PCA) and RPCA
        3.3.2 --- Principal Component Pursuit
      3.4 --- Experiments of Sparse and Low-rank Approach on Surveillance Video
        3.4.1 --- Least Squares
        3.4.2 --- l1-Regularized Least Squares
        3.4.3 --- l1-Error
        3.4.4 --- l1-Regularized l1-Error
      3.5 --- Summary
    Chapter 4 --- Split Bregman Algorithm for l1-Problem
      4.1 --- Introduction
      4.2 --- Bregman Distance
      4.3 --- Bregman Iteration for Constrained Optimization
      4.4 --- Split Bregman Iteration for l1-Regularized Problem
        4.4.1 --- Formulation
        4.4.2 --- Advantages of Split Bregman Iteration
      4.5 --- Fast l1 Algorithms
        4.5.1 --- l1-Regularized Least Squares
        4.5.2 --- l1-Error
        4.5.3 --- l1-Regularized l1-Error
      4.6 --- Summary
    Chapter 5 --- Face Alignment Using Sparse and Low-rank Decomposition
      5.1 --- Robust Alignment by Sparse and Low-rank Decomposition for Linearly Correlated Images (RASL)
      5.2 --- Problem Formulation
        5.2.1 --- Theory
        5.2.2 --- Algorithm
      5.3 --- Direct Extension of RASL: Multi-RASL
        5.3.1 --- Formulation
        5.3.2 --- Algorithm
      5.4 --- Matlab Implementation Details
        5.4.1 --- Preprocessing
        5.4.2 --- Transformation
        5.4.3 --- Jacobian Ji
      5.5 --- Experiments
        5.5.1 --- Qualitative Evaluations Using Small Dataset
        5.5.2 --- Large Dataset Test
        5.5.3 --- Conclusion
      5.6 --- Sensitivity analysis on selection of references
        5.6.1 --- References from consecutive frames
        5.6.2 --- References from RASL-aligned images
      5.7 --- Summary
    Chapter 6 --- Extension of RASL for video: One-by-One Approach
      6.1 --- One-by-One Approach
        6.1.1 --- Motivation
        6.1.2 --- Algorithm
      6.2 --- Choices of Optimization
        6.2.1 --- l1-Regularized Least Squares
        6.2.2 --- l1-Error
        6.2.3 --- l1-Regularized l1-Error
      6.3 --- Experiments
        6.3.1 --- Evaluation for Different l1 Algorithms
        6.3.2 --- Conclusion
      6.4 --- Exploiting Property of Video
      6.5 --- Summary
    Chapter 7 --- Conclusion and Future Work
    Chapter A --- Appendix
    Bibliography