Representation and recognition of human actions in video
Automated human action recognition plays a critical role in the development of human-machine
communication, by aiming for a more natural interaction between artificial intelligence and the
human society. Recent developments in technology have permitted a shift from a traditional
human action recognition performed in a well-constrained laboratory environment to realistic
unconstrained scenarios. This advancement has given rise to new problems and challenges still
not addressed by the available methods. Thus, the aim of this thesis is to study innovative approaches
that address the challenging problems of human action recognition from video captured
in unconstrained scenarios. To this end, novel action representations, feature selection methods,
fusion strategies and classification approaches are formulated.
More specifically, a novel interest-point-based action representation is first introduced. This
representation seeks to describe actions as clouds of interest points accumulated at different temporal
scales. The idea behind this method consists of extracting holistic features from the point
clouds and explicitly and globally describing the spatial and temporal action dynamic. Since
the proposed clouds of points representation exploits alternative and complementary information
compared to the conventional interest points-based methods, a more solid representation is then
obtained by fusing the two representations, adopting a Multiple Kernel Learning strategy. The
validity of the proposed approach in recognising action from a well-known benchmark dataset is
demonstrated as well as the superior performance achieved by fusing representations.
Since the proposed method appears limited by the presence of a dynamic background and fast
camera movements, a novel trajectory-based representation is formulated. Different from interest
points, trajectories can simultaneously retain motion and appearance information even in noisy
and crowded scenarios. Additionally, they can handle drastic camera movements and support a robust
region of interest estimation. An equally important contribution is the proposed collaborative
feature selection performed to remove redundant and noisy components. In particular, a novel
feature selection method based on Multi-Class Delta Latent Dirichlet Allocation (MC-DLDA)
is introduced. Crucially, to enrich the final action representation, the trajectory representation is
adaptively fused with a conventional interest point representation. The proposed approach is
extensively validated on different datasets, and the reported performances are comparable with
the best state-of-the-art. The obtained results also confirm the fundamental contribution of both
collaborative feature selection and adaptive fusion.
Finally, the problem of realistic human action classification in very ambiguous scenarios is
taken into account. In these circumstances, standard feature selection methods and multi-class
classifiers appear inadequate due to sparse training sets, high intra-class variation and inter-class
similarity. Thus, both the feature selection and classification problems need to be redesigned.
The proposed idea is to iteratively decompose the classification task into subtasks and select the
optimal feature set and classifier in accordance with the subtask context. To this end, a cascaded
feature selection and action classification approach is introduced. The proposed cascade aims to
classify actions by exploiting as much information as possible, while at the same time
simplifying the multi-class classification into a cascade of binary separations. Specifically, instead of
separating multiple action classes simultaneously, the overall task is automatically divided into
easier binary sub-tasks. Experiments have been carried out using challenging public datasets;
the obtained results demonstrate that with identical action representation, the cascaded classifier
significantly outperforms standard multi-class classifiers.
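The cascade idea described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the feature names, thresholds, and action labels are hypothetical, and the simple threshold tests stand in for whatever per-stage feature selection and classifiers the thesis actually uses.

```python
# Illustrative cascade of binary separations.
# Assumption: each stage is (feature_name, threshold, label); all values are hypothetical.
def classify_with_cascade(features, stages, fallback_label):
    """Run binary stages in order; the first stage that fires assigns its label."""
    for feature_name, threshold, label in stages:
        # Each stage looks only at its own feature, mirroring per-subtask feature selection.
        if features[feature_name] > threshold:
            return label
    # No stage fired, so the sample falls through to the last remaining class.
    return fallback_label


STAGES = [
    ("vertical_motion", 0.8, "jump"),   # stage 1: "jump" vs. everything else
    ("horizontal_speed", 0.5, "run"),   # stage 2: "run" vs. the remaining classes
]

sample = {"vertical_motion": 0.1, "horizontal_speed": 0.9}
predicted = classify_with_cascade(sample, STAGES, fallback_label="walk")
print(predicted)
```

Each stage only has to solve one binary problem, which is the simplification the abstract attributes to the cascaded design.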
Improving less constrained iris recognition
The iris has been one of the most reliable biometric traits for automatic human authentication due to its highly stable and distinctive patterns. Traditional iris recognition algorithms have achieved remarkable performance in strictly constrained environments, with the subject standing still and with the iris captured at a close distance. This enables the wide deployment of iris recognition systems in applications such as border control and access control. However, in less constrained environments with the subject at-a-distance and on-the-move, the iris recognition performance is significantly deteriorated, since such environments induce noise and degradations in iris captures. This restricts the applicability and practicality of iris recognition technology for some real-world applications with more open capturing conditions, such as surveillance, forensic and mobile device security applications. Therefore, robust algorithms for less constrained iris recognition are desirable for the wider deployment of iris recognition systems.
This thesis focuses on improving less constrained iris recognition. Five methods are proposed to improve the performance of different stages in less constrained iris recognition. First, a robust iris segmentation algorithm is developed using l1-norm regression and model selection. This algorithm formulates iris segmentation as robust l1-norm regression problems. To further enhance the robustness, multiple segmentation results are produced by applying l1-norm regression to different models, and a model selection technique is used to select the most reliable result. Second, an iris liveness detection method using regional features is investigated. This method seeks not only low-level features, but also high-level feature distributions for more accurate and robust iris liveness detection. Third, a signal-level information fusion algorithm is presented to mitigate the noise in less constrained iris captures. Given multiple noisy iris captures, this algorithm uses a sparse-error low-rank matrix factorization model to separate noiseless iris structures from noise. The noiseless structures are preserved and emphasised during the fusion process, while the noise is suppressed, in order to obtain more reliable signals for recognition. Fourth, a method to generate optimal iris codes is proposed. This method considers iris code generation from the perspective of optimization. It formulates the traditional iris code generation method as an optimization problem; an additional objective term modelling the spatial correlations in iris codes is added to this optimization problem to produce more effective iris codes. Fifth, an iris weight map method is studied for robust iris matching. This method considers both intra-class bit stability and inter-class bit discriminability in iris codes. It emphasises highly stable and discriminative bits for iris matching, enhancing the robustness of iris matching.
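As a rough illustration of how a weight map can emphasise reliable bits during matching, the sketch below computes a weighted fractional Hamming distance between two binary iris codes. It is a generic formulation rather than the thesis's method: the function name, the example codes, and the weight values are all assumptions, and how the weights are learned from bit stability and discriminability is not shown.

```python
def weighted_hamming(code_a, code_b, weights):
    """Weighted fractional Hamming distance between two binary iris codes.

    Assumption: `weights` holds one non-negative value per bit, with larger
    values for bits considered more stable and discriminative.
    """
    if not (len(code_a) == len(code_b) == len(weights)):
        raise ValueError("codes and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not all be zero")
    # Sum the weights of the bits on which the two codes disagree.
    disagreement = sum(w for a, b, w in zip(code_a, code_b, weights) if a != b)
    return disagreement / total


# Hypothetical 4-bit codes; bits 1 and 3 are treated as less reliable.
code_a = [1, 0, 1, 1]
code_b = [1, 1, 1, 0]
weights = [1.0, 0.5, 1.0, 0.5]
distance = weighted_hamming(code_a, code_b, weights)
uniform = weighted_hamming(code_a, code_b, [1.0] * 4)
```

In this toy case the two codes disagree only on the down-weighted bits, so the weighted distance is smaller than the unweighted one.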
Comprehensive experimental analyses are performed on benchmark datasets for each of the above methods. The results indicate that the presented methods are effective for less constrained iris recognition, generally improving state-of-the-art performance.
Visual object category discovery in images and videos
The current trend in visual recognition research is to place a strict division between the supervised and unsupervised learning paradigms, which is problematic for two main reasons. On the one hand, supervised methods require training data for each and every category that the system learns; training data may not always be available and is expensive to obtain. On the other hand, unsupervised methods must determine the optimal visual cues and distance metrics that distinguish one category from another to group images into semantically meaningful categories; however, for unlabeled data, these are unknown a priori.
I propose a visual category discovery framework that transcends the two paradigms and learns accurate models with few labeled exemplars. The main insight is to automatically focus on the prevalent objects in images and videos, and learn models from them for category grouping, segmentation, and summarization.
To implement this idea, I first present a context-aware category discovery framework that discovers novel categories by leveraging context from previously learned categories. I devise a novel object-graph descriptor to model the interaction between a set of known categories and the unknown to-be-discovered categories, and group regions that have similar appearance and similar object-graphs. I then present a collective segmentation framework that simultaneously discovers the segmentations and groupings of objects by leveraging the shared patterns in the unlabeled image collection. It discovers an ensemble of representative instances for each unknown category, and builds top-down models from them to refine the segmentation of the remaining instances. Finally, building on these techniques, I show how to produce compact visual summaries for first-person egocentric videos that focus on the important people and objects. The system leverages novel egocentric and high-level saliency features to predict important regions in the video, and produces a concise visual summary that is driven by those regions.
I compare against existing state-of-the-art methods for category discovery and segmentation on several challenging benchmark datasets. I demonstrate that we can discover visual concepts more accurately by focusing on the prevalent objects in images and videos, and show clear advantages of departing from the status quo division between the supervised and unsupervised learning paradigms. The main impact of my thesis is that it lays the groundwork for building large-scale visual discovery systems that can automatically discover visual concepts with minimal human supervision.
Real-world Human Re-identification: Attributes and Beyond.
Surveillance systems capable of performing a diverse range of tasks that support human intelligence
and analytical efforts are becoming widespread and crucial due to increasing threats
upon national infrastructure and evolving business and governmental analytical requirements.
Surveillance data can be critical for crime-prevention, forensic analysis, and counter-terrorism
activities in both civilian and governmental agencies alike. However, visual surveillance data
must currently be parsed by trained human operators, so any utility is offset by the
inherent training and staffing costs. The automated analysis of surveillance video is
therefore of great scientific interest. One of the open problems within this area is that of reliably
matching humans between disjoint surveillance camera views, termed re-identification.
Automated re-identification facilitates human operational efficiency in the grouping of disparate
and fragmented people observations through space and time into individual personal identities,
a pre-requisite for higher-level surveillance tasks. However, due to the complex nature of real-world
scenes and the highly variable nature of human appearance, reliably re-identifying people
is non-trivial.
Most re-identification approaches developed so far rely on low-level visual feature matching
approaches that aim to match human detections against a known gallery of potential matches.
However, for many applications an initial detection of a human may be unavailable or a low-level
feature representation may not be sufficiently invariant to photometric or geometric variability
inherent between camera views. This thesis begins by proposing a "mid-level" human-semantic
representation that exploits expert human knowledge of surveillance task execution to the task
of re-identifying people in order to compute an attribute-based description of a human. It further
shows how this attribute-based description is synergistic with low-level data-derived features
to enhance re-identification accuracy and subsequently gain further performance benefits
by employing a discriminatively learned distance metric. Finally, a novel "zero-shot" scenario is
proposed in which a visual probe is unavailable but re-identification is still possible via a manually
provided semantic attribute description. The approach is extensively evaluated using several
public benchmark datasets.
One challenge in constructing an attribute-based and human-semantic representation is the
requirement for extensive annotation. Mitigating this annotation cost in order to present a realistic
and scalable re-identification system is the motivation for the second technical area of this thesis,
where transfer learning and data mining are investigated in two different approaches. Discriminative
methods trade annotation cost for enhanced performance. Because discriminative person
re-identification models operate between two camera views, annotation cost scales
quadratically with the number of cameras in the entire network. For practical re-identification, this
is an unreasonable expectation and prohibitively expensive. By leveraging flexible multi-source
transfer of re-identification models, part of this cost may be alleviated. Specifically, it is possible
to leverage prior re-identification models learned for a set of source-view pairs (domains), and
flexibly combine those to obtain good re-identification performance for a given target-view pair
with greatly reduced annotation requirements.
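The quadratic growth mentioned above follows from counting camera pairs. The short sketch below makes the arithmetic concrete, assuming one annotated model per unordered pair of views; the function name and the example network sizes are illustrative only.

```python
def camera_view_pairs(num_cameras):
    """Number of unordered camera pairs, n * (n - 1) / 2.

    Assumption: one pairwise re-identification model (and one annotation
    effort) is needed for every pair of camera views in the network.
    """
    if num_cameras < 0:
        raise ValueError("num_cameras must be non-negative")
    return num_cameras * (num_cameras - 1) // 2


# Doubling the network size roughly quadruples the number of view pairs.
pairs_10 = camera_view_pairs(10)
pairs_20 = camera_view_pairs(20)
print(pairs_10, pairs_20)
```

Reusing models learned on existing source-view pairs, as the abstract proposes, is aimed at avoiding fresh annotation for every one of these pairs.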
The volume of exhaustive annotation effort required for attribute-driven re-identification
scales linearly with the number of cameras and attributes. Real-world operation of an attribute-enabled,
distributed camera network would also require prohibitive quantities of annotation effort
by human experts. This effort is completely avoided by taking a data-driven approach to attribute
computation, by learning an effective associated representation by crawling large volumes of
Internet data. By training on a larger and more diverse array of examples, this representation
is more view-invariant and generalisable than attributes trained on conventional scales. These
automatically discovered attributes are shown to provide a valuable representation that significantly
improves re-identification performance. Moreover, a method to map them onto existing
expert-annotated-ontologies is contributed.
In the final contribution of this thesis, the underlying assumptions about visual surveillance
equipment and re-identification are challenged and the thesis motivates a novel research area
using dynamic, mobile platforms. Such platforms violate the common assumption shared by
most previous research, namely that surveillance devices are always stationary, relative to the
observed scene. The most important new challenge discovered in this exciting area is that the unconstrained
video is too challenging for traditional approaches to applying discriminative methods
that rely on the explicit modelling of appearance translations when modelling view-pairs,
or even a single view. A new dataset was collected by a remote-operated vehicle using control
software developed to simulate a fully-autonomous re-identification unmanned aerial vehicle programmed
to fly in proximity with humans until images of sufficient quality for re-identification
are obtained. Variations of the standard re-identification model are investigated in an enhanced
re-identification paradigm, and new challenges with this distinct form of re-identification are elucidated.
Finally, conventional wisdom regarding re-identification in light of these observations is
re-examined.
Fisher Motion Descriptor for Multiview Gait Recognition
The goal of this paper is to identify individuals by analyzing their gait. Instead of using binary silhouettes
as input data (as done in many previous works) we propose and evaluate the use of motion descriptors based
on densely sampled short-term trajectories. We take advantage of state-of-the-art people detectors to define
custom spatial configurations of the descriptors around the target person, obtaining a rich representation of
the gait motion. The local motion features (described by the Divergence-Curl-Shear descriptor [1]) extracted
on the different spatial areas of the person are combined into a single high-level gait descriptor by using
the Fisher Vector encoding [2]. The proposed approach, coined Pyramidal Fisher Motion, is experimentally
validated on the 'CASIA' dataset [3] (parts B and C), the 'TUM GAID' dataset [4], the 'CMU MoBo' dataset [5] and the
recent `AVA Multiview Gait' dataset [6]. The results show that this new approach achieves state-of-the-art
results on the problem of gait recognition, making it possible to recognize walking people from diverse viewpoints in
single and multiple camera setups, wearing different clothes, carrying bags, walking at diverse speeds and
not limited to straight walking paths.
GCoNet+: A Stronger Group Collaborative Co-Salient Object Detector
In this paper, we present a novel end-to-end group collaborative learning
network, termed GCoNet+, which can effectively and efficiently (250 fps)
identify co-salient objects in natural scenes. The proposed GCoNet+ achieves
the new state-of-the-art performance for co-salient object detection (CoSOD)
through mining consensus representations based on the following two essential
criteria: 1) intra-group compactness to better formulate the consistency among
co-salient objects by capturing their inherent shared attributes using our
novel group affinity module (GAM); 2) inter-group separability to effectively
suppress the influence of noisy objects on the output by introducing our new
group collaborating module (GCM) conditioning on the inconsistent consensus. To
further improve the accuracy, we design a series of simple yet effective
components as follows: i) a recurrent auxiliary classification module (RACM)
promoting the model learning at the semantic level; ii) a confidence
enhancement module (CEM) helping the model to improve the quality of the final
predictions; and iii) a group-based symmetric triplet (GST) loss guiding the
model to learn more discriminative features. Extensive experiments on three
challenging benchmarks, i.e., CoCA, CoSOD3k, and CoSal2015, demonstrate that
our GCoNet+ outperforms 12 existing cutting-edge models. Code has been
released at https://github.com/ZhengPeng7/GCoNet_plus
Camera Pose Estimation from Street-view Snapshots and Point Clouds
This PhD thesis targets two research problems: (1) how to efficiently and robustly estimate the camera pose of a query image with a map that contains street-view snapshots and point clouds; (2) given the estimated camera pose of a query image, how to create meaningful and intuitive applications with the map data.
To address the first research problem, we systematically investigated indirect, direct and hybrid camera pose estimation strategies. We implemented state-of-the-art methods and performed comprehensive experiments on two public benchmark datasets covering outdoor environmental changes from ideal to extremely challenging cases. Our key findings are: (1) the indirect method is usually more accurate than the direct method when there are enough consistent feature correspondences; (2) the direct method is sensitive to initialization, but under extreme outdoor environmental changes, the mutual-information-based direct method is more robust than the feature-based methods; (3) the hybrid method combines the strengths of both the direct and indirect methods and outperforms them on challenging datasets.
To explore the second research problem, we considered inspiring and useful applications that exploit the camera pose together with the map data. Firstly, we invented a 3D-map augmented photo gallery application, where images' geo-metadata are extracted with an indirect camera pose estimation method and the photo sharing experience is improved by augmentation with a 3D map. Secondly, we designed an interactive video playback application, where an indirect method estimates video frames' camera pose and the video playback is augmented with a 3D map. Thirdly, we proposed a 3D visual primitive based indoor object and outdoor scene recognition method, where the 3D primitives are accumulated from the multiview images.
Rich probabilistic models for semantic labeling
The goal of this monograph is to explore the methods and applications of semantic labeling. Our contributions to this rapidly developing topic concern particular aspects of modeling and inference in probabilistic models and their applications in the interdisciplinary fields of computer vision, medical image processing and remote sensing.