    Online Geometric Human Interaction Segmentation and Recognition

    The goal of this work is the temporal localization and recognition of two-person interactions in video. Human-human interaction detection is one of the core problems in video analysis. It has many applications, such as video surveillance, video search and retrieval, human-computer interaction, and behavior analysis for safety and security. Despite the sizeable literature on activity and action modeling and recognition, the vast majority of approaches assume that the beginning and the end of the video portion containing the action or activity of interest are known. In other words, while significant effort has been devoted to recognition, the spatial and temporal localization of activities, i.e. the detection problem, has received considerably less attention. This is even more the case when detection has to be performed online rather than offline. Almost all state-of-the-art approaches require offline processing, which makes them intrinsically unsuited to real-time use. In this thesis, the problem of event localization and recognition is addressed in an online fashion. The main assumption is that an interaction or activity is modeled by a temporal sequence. One of the main challenges is the development of a modeling framework able to capture the complex variability of activities, which are described by high-dimensional features. This is addressed by combining linear models with kernel methods. In particular, the parity space theory for detection, based on Euclidean geometry, is extended to work with kernels through the use of geometric operators in Hilbert space. While this approach is general, here it is applied to the detection of human interactions. It is tested on a publicly available dataset and on a large and challenging, newly collected dataset.
Extensive testing of the approach indicates that it sets a new state of the art under several performance measures, and that it holds the promise of becoming an effective building block for the real-time analysis of human behavior from video.
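
As a rough illustration of the Euclidean parity-space idea mentioned in this abstract, the minimal sketch below removes from an observation its component inside a model subspace and uses the norm of what remains as a detection score. It covers only the linear case, not the kernel extension developed in the thesis. The function names (`orthonormal_basis`, `parity_residual`) and the toy vectors are assumptions made for illustration and do not come from the thesis.

```python
import math

def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def parity_residual(y, basis):
    """Norm of the component of y orthogonal to the model subspace."""
    r = list(y)
    for u in basis:
        c = dot(r, u)
        r = [ri - c * ui for ri, ui in zip(r, u)]
    return math.sqrt(dot(r, r))

# Hypothetical model subspace in R^3 (the x-y plane).
model = orthonormal_basis([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
inside = parity_residual([2.0, -3.0, 0.0], model)   # explained by the model
outside = parity_residual([2.0, -3.0, 4.0], model)  # leaves the subspace
```

In this toy setting, an observation explained by the model yields a residual near zero, while a deviation from the model yields a large residual that a detector could compare against a threshold.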

    Time-slice analysis of dyadic human activity

    Recognizing human activities from video data is routinely leveraged for surveillance and human-computer interaction applications. The main focus has been classifying videos into one of k action classes from fully observed videos. However, intelligent systems must make decisions under uncertainty and based on incomplete information. This need motivates us to introduce the problem of analysing the uncertainty associated with human activities and to move to a new level of generality in the action analysis problem. We also present the problem of time-slice activity recognition, which aims to explore human activity at a small temporal granularity. Time-slice recognition is able to infer human behaviours from a short temporal window. It has been shown that temporal slice analysis is helpful for motion characterization and for video content representation in general. These studies motivate us to consider time-slices for analysing the uncertainty associated with human activities. We report the degree of certainty with which each activity is occurring throughout the video, from definitely not occurring to definitely occurring. In this research, we propose three frameworks for time-slice analysis of dyadic human activity under uncertainty.
i) We present a new family of spatio-temporal descriptors which are optimized for early prediction with time-slice action annotations. Our predictive spatio-temporal interest point (Predict-STIP) representation is based on the intuition of temporal contingency between time-slices. ii) We exploit state-of-the-art techniques to extract interest points in order to represent time-slices. We also present an accumulative uncertainty measure to depict the uncertainty associated with partially observed videos for the task of early activity recognition. iii) We use Convolutional Neural Network-based unary and pairwise relations between human body joints in each time-slice. The unary term captures the local appearance of the joints, while the pairwise term captures the local contextual relations between the parts. We extract these features from each frame in a time-slice and examine different temporal aggregations to generate a descriptor for the whole time-slice. Furthermore, we create a novel dataset which is annotated at multiple short temporal windows, allowing the modelling of the inherent uncertainty in time-slice activity recognition. All three methods have been evaluated on the TAP dataset. Experimental results demonstrate the effectiveness of our framework in the analysis of dyadic activities under uncertainty.
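
The step of aggregating per-frame features into one time-slice descriptor can be sketched as follows. The abstract does not say which aggregations were examined, so mean and max pooling are assumed here purely as common examples, and the function name `aggregate_time_slice` and the toy feature values are hypothetical.

```python
def aggregate_time_slice(frame_features, mode="mean"):
    """Pool a list of per-frame feature vectors into one descriptor.

    `frame_features` holds one equal-length vector per frame of the
    time-slice. `mode` selects the pooling rule (assumed options).
    """
    if not frame_features:
        raise ValueError("a time-slice needs at least one frame")
    columns = list(zip(*frame_features))  # one tuple per feature dimension
    if mode == "mean":
        return [sum(c) / len(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    raise ValueError("unknown aggregation mode: " + mode)

# Three frames with two features each (toy values).
frames = [[0.0, 2.0], [1.0, 4.0], [2.0, 0.0]]
mean_descriptor = aggregate_time_slice(frames, "mean")
max_descriptor = aggregate_time_slice(frames, "max")
```

Both rules give a descriptor whose length does not depend on the number of frames in the slice, which is what allows slices of different durations to be compared.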

    Subspace Representations and Learning for Visual Recognition

    Pervasive and affordable sensor and storage technology enables the acquisition of an ever-rising amount of visual data. The ability to extract semantic information by interpreting, indexing and searching visual data is impacting domains such as surveillance, robotics, intelligence, human-computer interaction, navigation, healthcare, and several others. This further stimulates the investigation of automated extraction techniques that are more efficient, and robust against the many sources of noise affecting the already complex visual data, which carries the semantic information of interest. We address the problem by designing novel visual data representations, based on learning data subspace decompositions that are invariant against noise, while being informative for the task at hand. We use this guiding principle to tackle several visual recognition problems, including detection and recognition of human interactions from surveillance video, face recognition in unconstrained environments, and domain generalization for object recognition. By interpreting visual data with a simple additive noise model, we consider the subspaces spanned by the model portion (model subspace) and the noise portion (variation subspace). We observe that decomposing the variation subspace against the model subspace gives rise to the so-called parity subspace. Decomposing the model subspace against the variation subspace instead gives rise to what we name the invariant subspace. We extend the use of kernel techniques to the parity subspace. This enables modeling the highly non-linear temporal trajectories describing human behavior, and performing detection and recognition of human interactions. In addition, we introduce supervised low-rank matrix decomposition techniques for learning the invariant subspace for two other tasks.
We learn invariant representations for face recognition from grossly corrupted images, and we learn object recognition classifiers that are invariant to the so-called domain bias. Extensive experiments using the publicly available benchmark datasets for each of the three tasks show that learning representations based on subspace decompositions invariant to the sources of noise leads to results comparable to or better than the state of the art.

    Learning Explainable Facial Features from Noisy Unconstrained Visual Data

    Attributes are semantic features of objects, people, and activities. They allow computers to describe people and things in the way humans would, which makes them very useful for recognition. Facial attributes (gender, hair color, makeup, eye color, etc.) are useful for a variety of tasks, including face verification and recognition, user interface applications, and surveillance, to name a few. The problem of predicting facial attributes is still relatively new in computer vision. Because facial attribute recognition is not a long-studied problem, a lack of publicly available data is a major challenge. As with many problems in computer vision, a large portion of facial attribute research is dedicated to improving performance on benchmark datasets. However, it has been shown that research progress on a benchmark dataset does not necessarily translate to a genuine solution for the problem. This dissertation focuses on learning models for facial attributes that are robust to changes in data, i.e. models that perform well on unseen data. We do this by taking cues from human recognition and translating these ideas into deep learning techniques for robust facial attribute recognition. Towards this goal, we introduce several techniques for learning from noisy unconstrained visual data: utilizing relationships among attributes, a selective learning approach for multi-label balancing, a temporal coherence constraint and a motion-attention mechanism for recognizing attributes in video, and parsing faces according to attributes for improved localization. We know that facial attributes are related, e.g. heavy makeup and wearing lipstick, or male and goatee. Humans are capable of recognizing and taking advantage of these relationships. For example, if the face of a subject is occluded but facial hair can be seen, then the likelihood that the subject is male should increase.
We introduce several methods for implicitly and explicitly utilizing attribute relationships for improved prediction. Some attributes are more common than others in the real world, e.g. male vs. bald. These disparities are even more pronounced in datasets consisting of posed celebrities on the red carpet (e.g. there are very few celebrities not wearing makeup). These imbalances can cause a facial attribute model to learn the bias in the dataset, rather than a true representation of the attribute. To alleviate this problem, we introduce selective learning, a method of balancing each batch in a deep learning algorithm according to each attribute, given a target distribution. Selective learning allows a deep learning algorithm to learn from a balanced set of data at each iteration during training, removing the bias caused by label imbalance. Learning a facial attribute model from image data and testing it on video data gives unexpected results (e.g. gender changing between frames). When working with video, it is important to account for the temporal and motion aspects of the data. In order to stabilize attribute predictions in video, we utilize weakly-labeled data and introduce time and motion constraints in the model learning process. Introducing temporal coherence and motion-attention constraints during the learning of an attribute model allows the use of weakly-labeled data, which is essential when working with video. By framing the problem of facial attribute recognition as one of semantic segmentation, where the goal is to predict attributes at each pixel, we are able to reduce the effect of unwanted relationships between attributes (e.g. high cheekbones and smiling). Robust facial attribute recognition algorithms are necessary for improving the applications which use these attributes.
Given limited data for training, we develop several methods for learning explainable facial features from noisy unconstrained visual data, introducing several new datasets labeled with facial attributes and improving over the state of the art.
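
The batch-balancing idea behind selective learning can be illustrated with a small sketch. It is a reweighting variant written under stated assumptions: per-sample loss weights are chosen so that the weighted positive fraction of one attribute in a batch matches a target. The dissertation's actual procedure may differ (for instance, it may resample rather than reweight), and the function name `balance_weights` and the example labels are hypothetical.

```python
def balance_weights(labels, target_pos):
    """Per-sample weights for one binary attribute in a batch.

    `labels` is a list of 0/1 values; `target_pos` is the desired
    positive fraction. The weights sum to the batch size, and the
    weighted positive fraction equals `target_pos`.
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        return [1.0] * n  # a single-class batch cannot be rebalanced
    w_pos = target_pos * n / n_pos
    w_neg = (1.0 - target_pos) * n / n_neg
    return [w_pos if y == 1 else w_neg for y in labels]

# A batch where the attribute (say, "bald") is rare: one positive in four.
batch = [1, 0, 0, 0]
weights = balance_weights(batch, target_pos=0.5)
pos_mass = sum(w for w, y in zip(weights, batch) if y == 1)
weighted_pos_fraction = pos_mass / sum(weights)
```

For a multi-label problem, the same computation would be repeated independently for each attribute column, giving one weight per sample and attribute that could multiply the corresponding loss term.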

    Domain Transfer Learning for Object and Action Recognition

    Visual recognition has always been a fundamental problem in computer vision. Its task is to learn visual categories using labeled training data and then identify unlabeled new instances of those categories. However, due to the large variations in visual data, visual recognition is still a challenging problem. Handling the variations in captured images is important for real-world applications where unconstrained data acquisition scenarios are widely prevalent. In this dissertation, we first address the variations between training and testing data. In particular, for cross-domain object recognition, we propose a Grassmann manifold-based domain adaptation approach that models the domain shift using the geodesic connecting the source and target domains. We further measure the distance between two data points from different domains by integrating the distance between their projections through all the intermediate subspaces along the geodesic. Because it exploits all the intermediate subspaces along the geodesic, our proposed approach produces a more accurate metric. For cross-view action recognition, we present two effective approaches to learn transferable dictionaries and view-invariant sparse representations. In the first approach, we learn a set of transferable dictionaries where each dictionary corresponds to one camera view. The set of dictionaries is learned simultaneously from sets of correspondence videos taken at different views, with the aim of encouraging each video in the set to have the same sparse representation. In the second approach, we relax this constraint by encouraging correspondence videos to have similar sparse representations. In addition, we learn a common dictionary that is incoherent to the view-specific dictionaries for cross-view action recognition. The set of view-specific dictionaries is learned for specific views, while the common dictionary is shared across different views.
In this way, we can align view-specific features in the sparse feature spaces spanned by the view-specific dictionary set and transfer the view-shared features in the sparse feature space spanned by the common dictionary. In order to handle the more general variations in captured images, we also exploit semantic information to learn discriminative feature representations for visual recognition. Class labels are often organized in a hierarchical taxonomy based on their semantic meanings. We propose a novel multi-layer hierarchical dictionary learning framework for region tagging. Specifically, we learn a node-specific dictionary for each semantic label in the taxonomy and preserve the hierarchical semantic structure in the relationships among these node dictionaries. Our approach can also transfer knowledge from semantic labels at higher levels to help learn the classifiers for semantic labels at lower levels. Moreover, we exploit semantic attributes to boost the performance of visual recognition. We encode objects or actions based on attributes that describe them as high-level concepts. We consider two types of attributes. One type is generated by humans, while the second type consists of data-driven attributes extracted from data using dictionary learning methods. Attribute-based representations may exhibit variations due to noisy and redundant attributes. We propose a discriminative and compact attribute-based representation by selecting a subset of discriminative attributes from a large attribute set. Three attribute selection criteria are proposed and formulated as a submodular optimization problem. A greedy optimization algorithm is presented, and its solution is guaranteed to be at least a (1-1/e)-approximation of the optimum.
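
The greedy selection mentioned at the end of this abstract can be sketched as follows. The (1-1/e) guarantee is the classical result for maximizing a monotone submodular function under a cardinality constraint, which is the setting assumed here. The abstract does not spell out the three selection criteria, so a simple set-coverage objective stands in for them, and the names `greedy_select`, `coverage`, and `covers` and the toy data are hypothetical.

```python
def greedy_select(candidates, k, objective):
    """Pick up to k items, each time adding the largest marginal gain."""
    selected = []
    remaining = sorted(candidates)  # fixed order for reproducible ties
    for _ in range(k):
        base = objective(selected)
        best, best_gain = None, 0
        for c in remaining:
            gain = objective(selected + [c]) - base
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:  # no candidate improves the objective
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Stand-in objective: number of classes "covered" by the chosen attributes.
covers = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6, 7},
    "d": {1, 2},
}

def coverage(subset):
    """Monotone submodular set function: size of the union of covered classes."""
    covered = set()
    for name in subset:
        covered |= covers[name]
    return len(covered)

chosen = greedy_select(covers.keys(), 2, coverage)
```

Each round evaluates the objective once per remaining candidate, so selecting k items out of n costs on the order of n times k objective evaluations.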

    Deep Open Representative Learning for Image and Text Classification

    An essential goal of artificial intelligence is to support the knowledge discovery process, from data to knowledge that is useful in decision making. The challenges in the knowledge discovery process typically arise for the following reasons. First, real-world data are typically noisy, sparse, or derived from heterogeneous sources. Second, it is easy neither to build robust predictive models nor to validate them with such real-world data. Third, the "black-box" nature of deep learning models makes it hard to interpret what they produce. It is essential to bridge the gap between the models and their support in decisions with something potentially understandable and interpretable. To address this gap, we focus on designing critical representatives of the discovery process, from data to knowledge that can be used to perform reasoning. In this dissertation, a novel model named Class Representative Learning (CRL) is proposed. It is a class-based classifier with the following contributions in machine learning, specifically for image and text classification: i) a latent feature vector, the class representative (CR), which represents a class in an abstract embedding space built from features extracted by a deep neural network trained on either images or text; ii) parallel zero-shot learning (ZSL) algorithms with class representative learning; iii) a novel projection-based inferencing method that uses the vector space model to reconcile the dominant difference between seen and unseen classes; iv) a CR-Graph that represents the relationships between CRs, where a node represents a CR and an edge represents the similarity between two CRs. Furthermore, we designed the CR-Graph model with the aim of making the models explainable, which is crucial for decision-making.
Although this CR-Graph does not have full reasoning capability, it is equipped with the class representatives and their inter-dependent network formed through similar neighboring classes. Additionally, semantic information and external information are added to the CR-Graph so that decisions can better handle real-world data. The ability to add semantic information to the graph automatically is illustrated with a case study in biomedical research, through ontology generation from text and ontology-to-ontology mapping.
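
A minimal sketch of the class-representative idea is given below. It rests on assumptions that the abstract does not confirm: each CR is taken to be the mean of the feature vectors of its class, similarity is taken to be cosine similarity, and the graph keeps only edges above a threshold. All function names and the toy two-dimensional features are hypothetical; in practice the features would come from a deep network.

```python
import math

def class_representatives(features_by_class):
    """Assumed CR: the mean feature vector of each class."""
    reps = {}
    for label, vectors in features_by_class.items():
        n = len(vectors)
        reps[label] = [sum(col) / n for col in zip(*vectors)]
    return reps

def cosine(a, b):
    """Cosine similarity, defined as 0.0 if either vector is zero."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def classify(x, reps):
    """Assign x to the class whose representative is most similar."""
    return max(reps, key=lambda label: cosine(x, reps[label]))

def cr_graph(reps, threshold):
    """Edges between CR pairs whose similarity reaches the threshold."""
    labels = sorted(reps)
    edges = {}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            sim = cosine(reps[a], reps[b])
            if sim >= threshold:
                edges[(a, b)] = sim
    return edges

# Toy features for three classes.
features = {
    "cat": [[1.0, 0.1], [0.9, 0.0]],
    "dog": [[0.8, 0.3], [0.9, 0.4]],
    "car": [[0.0, 1.0], [0.1, 0.9]],
}
reps = class_representatives(features)
label_a = classify([1.0, 0.0], reps)
label_b = classify([0.0, 1.0], reps)
edges = cr_graph(reps, threshold=0.9)
```

Because a representative is computed per class and independently of the others, a new class could be added by computing one more vector without retraining, which is one plausible reading of how such a design would support unseen classes.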