26 research outputs found

    Representation and recognition of human actions in video

    PhD
    Automated human action recognition plays a critical role in the development of human-machine communication, aiming for a more natural interaction between artificial intelligence and human society. Recent developments in technology have permitted a shift from traditional human action recognition performed in a well-constrained laboratory environment to realistic unconstrained scenarios. This advancement has given rise to new problems and challenges still not addressed by the available methods. Thus, the aim of this thesis is to study innovative approaches that address the challenging problems of human action recognition from video captured in unconstrained scenarios. To this end, novel action representations, feature selection methods, fusion strategies and classification approaches are formulated. More specifically, a novel interest-point-based action representation is first introduced. This representation seeks to describe actions as clouds of interest points accumulated at different temporal scales. The idea behind this method consists of extracting holistic features from the point clouds and explicitly and globally describing the spatial and temporal action dynamics. Since the proposed clouds-of-points representation exploits alternative and complementary information compared to the conventional interest-point-based methods, a more solid representation is then obtained by fusing the two representations, adopting a Multiple Kernel Learning strategy. The validity of the proposed approach in recognising actions from a well-known benchmark dataset is demonstrated, as well as the superior performance achieved by fusing representations. Since the proposed method appears limited by the presence of a dynamic background and fast camera movements, a novel trajectory-based representation is formulated. Different from interest points, trajectories can simultaneously retain motion and appearance information even in noisy and crowded scenarios. 
Additionally, they can handle drastic camera movements and support a robust region of interest estimation. An equally important contribution is the proposed collaborative feature selection performed to remove redundant and noisy components. In particular, a novel feature selection method based on Multi-Class Delta Latent Dirichlet Allocation (MC-DLDA) is introduced. Crucially, to enrich the final action representation, the trajectory representation is adaptively fused with a conventional interest point representation. The proposed approach is extensively validated on different datasets, and the reported performances are comparable with the best state-of-the-art. The obtained results also confirm the fundamental contribution of both collaborative feature selection and adaptive fusion. Finally, the problem of realistic human action classification in very ambiguous scenarios is taken into account. In these circumstances, standard feature selection methods and multi-class classifiers appear inadequate due to sparse training sets, high intra-class variation and inter-class similarity. Thus, both the feature selection and classification problems need to be redesigned. The proposed idea is to iteratively decompose the classification task into subtasks and select the optimal feature set and classifier in accordance with the subtask context. To this end, a cascaded feature selection and action classification approach is introduced. The proposed cascade aims to classify actions by exploiting as much information as possible, while simplifying the multi-class classification into a cascade of binary separations. Specifically, instead of separating multiple action classes simultaneously, the overall task is automatically divided into easier binary sub-tasks. 
Experiments have been carried out using challenging public datasets; the obtained results demonstrate that, with identical action representation, the cascaded classifier significantly outperforms standard multi-class classifiers.
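
The idea of replacing one multi-class decision with a cascade of binary separations, each using its own feature subset, can be illustrated with a minimal sketch. This is not the thesis's algorithm: the stage layout, the nearest-centroid rule, and the names `make_stage` and `classify_cascade` are assumptions chosen for illustration, and the stages here are written by hand rather than selected automatically.

```python
from math import dist

def make_stage(label, feature_idx, pos_centroid, neg_centroid):
    """One hypothetical binary stage: 'label' versus 'everything else',
    decided on a stage-specific subset of the features."""
    return {"label": label, "idx": feature_idx,
            "pos": pos_centroid, "neg": neg_centroid}

def classify_cascade(x, stages, fallback):
    """Run the binary stages in order and return the first label that fires.

    Each stage looks only at its own feature subset and compares the
    distance to a positive and a negative centroid (an assumed, very
    simple binary classifier)."""
    for stage in stages:
        sub = [x[i] for i in stage["idx"]]
        if dist(sub, stage["pos"]) < dist(sub, stage["neg"]):
            return stage["label"]
    # No stage claimed the sample: assign the remaining class.
    return fallback

# Toy cascade over three action classes with hand-picked stages.
stages = [
    make_stage("wave", [0], [1.0], [0.0]),
    make_stage("run", [1, 2], [1.0, 1.0], [0.0, 0.0]),
]
print(classify_cascade([0.9, 0.0, 0.0], stages, "walk"))
print(classify_cascade([0.1, 0.8, 0.9], stages, "walk"))
print(classify_cascade([0.1, 0.1, 0.2], stages, "walk"))
```

Each stage only has to solve an easier two-way problem, and it is free to use different features from the stage before it.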

    Improving less constrained iris recognition

    The iris has been one of the most reliable biometric traits for automatic human authentication due to its highly stable and distinctive patterns. Traditional iris recognition algorithms have achieved remarkable performance in strictly constrained environments, with the subject standing still and with the iris captured at a close distance. This enables the wide deployment of iris recognition systems in applications such as border control and access control. However, in less constrained environments with the subject at-a-distance and on-the-move, the iris recognition performance is significantly deteriorated, since such environments induce noise and degradations in iris captures. This restricts the applicability and practicality of iris recognition technology for some real-world applications with more open capturing conditions, such as surveillance, forensic and mobile device security applications. Therefore, robust algorithms for less constrained iris recognition are desirable for the wider deployment of iris recognition systems. This thesis focuses on improving less constrained iris recognition. Five methods are proposed to improve the performance of different stages in less constrained iris recognition. First, a robust iris segmentation algorithm is developed using l1-norm regression and model selection. This algorithm formulates iris segmentation as robust l1-norm regression problems. To further enhance the robustness, multiple segmentation results are produced by applying l1-norm regression to different models, and a model selection technique is used to select the most reliable result. Second, an iris liveness detection method using regional features is investigated. This method seeks not only low level features, but also high level feature distributions for more accurate and robust iris liveness detection. Third, a signal-level information fusion algorithm is presented to mitigate the noise in less constrained iris captures. 
With multiple noisy iris captures, this algorithm proposes a sparse-error low rank matrix factorization model to separate noiseless iris structures and noise. The noiseless structures are preserved and emphasised during the fusion process, while the noise is suppressed, in order to obtain more reliable signals for recognition. Fourth, a method to generate optimal iris codes is proposed. This method considers iris code generation from the perspective of optimization. It formulates the traditional iris code generation method as an optimization problem; an additional objective term modelling the spatial correlations in iris codes is applied to this optimization problem to produce more effective iris codes. Fifth, an iris weight map method is studied for robust iris matching. This method considers both intra-class bit stability and inter-class bit discriminability in iris codes. It emphasises highly stable and discriminative bits for iris matching, enhancing the robustness of iris matching. Comprehensive experimental analyses are performed on benchmark datasets for each of the above methods. The results indicate that the presented methods are effective for less constrained iris recognition, generally improving state-of-the-art performance.
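
The weight-map idea in the fifth method can be sketched as a weighted Hamming distance, in which bits with larger weights contribute more to the match score. The sketch below is an assumption about how such a map might be applied; the function name `weighted_hamming`, the optional validity mask, and the example weights are hypothetical, and how the thesis actually derives its weights from bit stability and discriminability is not reproduced here.

```python
def weighted_hamming(code_a, code_b, weights, mask=None):
    """Weighted fractional Hamming distance between two binary iris codes.

    weights: per-bit non-negative weights (assumed to come from some
             stability/discriminability analysis, not computed here).
    mask:    optional per-bit validity flags (1 = usable, 0 = occluded).
    """
    if mask is None:
        mask = [1] * len(code_a)
    if not (len(code_a) == len(code_b) == len(weights) == len(mask)):
        raise ValueError("codes, weights and mask must have equal length")

    disagreement = 0.0  # total weight of valid bits that differ
    total = 0.0         # total weight of all valid bits
    for a, b, w, m in zip(code_a, code_b, weights, mask):
        if m:
            total += w
            if a != b:
                disagreement += w
    if total == 0:
        raise ValueError("no valid bits to compare")
    return disagreement / total

a = [1, 0, 1, 1]
b = [1, 1, 1, 0]
print(weighted_hamming(a, b, [1.0, 1.0, 1.0, 1.0]))  # plain Hamming: 0.5
print(weighted_hamming(a, b, [1.0, 0.5, 1.0, 0.5]))  # down-weights the two differing bits
```

With uniform weights this reduces to the ordinary fractional Hamming distance, which makes the effect of the weight map easy to compare against a baseline.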

    Visual object category discovery in images and videos

    The current trend in visual recognition research is to place a strict division between the supervised and unsupervised learning paradigms, which is problematic for two main reasons. On the one hand, supervised methods require training data for each and every category that the system learns; training data may not always be available and is expensive to obtain. On the other hand, unsupervised methods must determine the optimal visual cues and distance metrics that distinguish one category from another to group images into semantically meaningful categories; however, for unlabeled data, these are unknown a priori. I propose a visual category discovery framework that transcends the two paradigms and learns accurate models with few labeled exemplars. The main insight is to automatically focus on the prevalent objects in images and videos, and learn models from them for category grouping, segmentation, and summarization. To implement this idea, I first present a context-aware category discovery framework that discovers novel categories by leveraging context from previously learned categories. I devise a novel object-graph descriptor to model the interaction between a set of known categories and the unknown to-be-discovered categories, and group regions that have similar appearance and similar object-graphs. I then present a collective segmentation framework that simultaneously discovers the segmentations and groupings of objects by leveraging the shared patterns in the unlabeled image collection. It discovers an ensemble of representative instances for each unknown category, and builds top-down models from them to refine the segmentation of the remaining instances. Finally, building on these techniques, I show how to produce compact visual summaries for first-person egocentric videos that focus on the important people and objects. 
The system leverages novel egocentric and high-level saliency features to predict important regions in the video, and produces a concise visual summary that is driven by those regions. I compare against existing state-of-the-art methods for category discovery and segmentation on several challenging benchmark datasets. I demonstrate that we can discover visual concepts more accurately by focusing on the prevalent objects in images and videos, and show clear advantages of departing from the status quo division between the supervised and unsupervised learning paradigms. The main impact of my thesis is that it lays the groundwork for building large-scale visual discovery systems that can automatically discover visual concepts with minimal human supervision.
    Electrical and Computer Engineering

    Real-world Human Re-identification: Attributes and Beyond.

    PhD
    Surveillance systems capable of performing a diverse range of tasks that support human intelligence and analytical efforts are becoming widespread and crucial due to increasing threats upon national infrastructure and evolving business and governmental analytical requirements. Surveillance data can be critical for crime-prevention, forensic analysis, and counter-terrorism activities in both civilian and governmental agencies alike. However, visual surveillance data must currently be parsed by trained human operators, and therefore any utility is offset by the inherent training and staffing costs. The automated analysis of surveillance video is therefore of great scientific interest. One of the open problems within this area is that of reliably matching humans between disjoint surveillance camera views, termed re-identification. Automated re-identification facilitates human operational efficiency in the grouping of disparate and fragmented people observations through space and time into individual personal identities, a pre-requisite for higher-level surveillance tasks. However, due to the complex nature of real-world scenes and the highly variable nature of human appearance, reliably re-identifying people is non-trivial. Most re-identification approaches developed so far rely on low-level visual feature matching approaches that aim to match human detections against a known gallery of potential matches. However, for many applications an initial detection of a human may be unavailable or a low-level feature representation may not be sufficiently invariant to photometric or geometric variability inherent between camera views. This thesis begins by proposing a “mid-level” human-semantic representation that applies expert human knowledge of surveillance task execution to the task of re-identifying people in order to compute an attribute-based description of a human. 
It further shows how this attribute-based description is synergistic with low-level data-derived features to enhance re-identification accuracy and subsequently gain further performance benefits by employing a discriminatively learned distance metric. Finally, a novel “zero-shot” scenario is proposed in which a visual probe is unavailable but re-identification is still possible via a manually provided semantic attribute description. The approach is extensively evaluated using several public benchmark datasets. One challenge in constructing an attribute-based and human-semantic representation is the requirement for extensive annotation. Mitigating this annotation cost in order to present a realistic and scalable re-identification system is the motivation for the second technical area of this thesis, where transfer learning and data mining are investigated in two different approaches. Discriminative methods trade annotation cost for enhanced performance. Because discriminative person re-identification models operate between two camera views, annotation cost scales quadratically with the number of cameras in the entire network. For practical re-identification, this is an unreasonable expectation and prohibitively expensive. By leveraging flexible multi-source transfer of re-identification models, part of this cost may be alleviated. Specifically, it is possible to leverage prior re-identification models learned for a set of source-view pairs (domains), and flexibly combine those to obtain good re-identification performance for a given target-view pair with greatly reduced annotation requirements. The volume of exhaustive annotation effort required for attribute-driven re-identification scales linearly with the number of cameras and attributes. Real-world operation of an attribute-enabled, distributed camera network would also require prohibitive quantities of annotation effort by human experts. 
This effort is completely avoided by taking a data-driven approach to attribute computation, by learning an effective associated representation by crawling large volumes of Internet data. By training on a larger and more diverse array of examples, this representation is more view-invariant and generalisable than attributes trained on conventional scales. These automatically discovered attributes are shown to provide a valuable representation that significantly improves re-identification performance. Moreover, a method to map them onto existing expert-annotated ontologies is contributed. In the final contribution of this thesis, the underlying assumptions about visual surveillance equipment and re-identification are challenged and the thesis motivates a novel research area using dynamic, mobile platforms. Such platforms violate the common assumption shared by most previous research, namely that surveillance devices are always stationary relative to the observed scene. The most important new challenge discovered in this exciting area is that the unconstrained video is too challenging for traditional approaches to applying discriminative methods that rely on the explicit modelling of appearance translations when modelling view-pairs, or even a single view. A new dataset was collected by a remote-operated vehicle using control software developed to simulate a fully-autonomous re-identification unmanned aerial vehicle programmed to fly in proximity with humans until images of sufficient quality for re-identification are obtained. Variations of the standard re-identification model are investigated in an enhanced re-identification paradigm, and new challenges with this distinct form of re-identification are elucidated. Finally, conventional wisdom regarding re-identification in light of these observations is re-examined.
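
The scaling claims in this abstract (quadratic in the number of cameras for pairwise discriminative models, linear in cameras and attributes for attribute annotation) can be made concrete with a small arithmetic sketch. The assumption that one model is needed per unordered camera pair, the per-pair and per-attribute label counts, and the function names are all hypothetical and are not taken from the thesis.

```python
def camera_pairs(num_cameras):
    """Number of unordered camera pairs in a network: n * (n - 1) / 2."""
    return num_cameras * (num_cameras - 1) // 2

def pairwise_annotation_cost(num_cameras, labels_per_pair):
    """Labels needed if every camera pair gets its own annotated training set
    (assumed cost model, grows quadratically with the number of cameras)."""
    return camera_pairs(num_cameras) * labels_per_pair

def attribute_annotation_cost(num_cameras, num_attributes, labels_per_attribute):
    """Labels needed if each camera is annotated per attribute
    (assumed cost model, grows linearly with cameras and attributes)."""
    return num_cameras * num_attributes * labels_per_attribute

for n in (10, 20, 40):
    print(n,
          pairwise_annotation_cost(n, labels_per_pair=100),
          attribute_annotation_cost(n, num_attributes=20, labels_per_attribute=50))
```

Doubling the number of cameras roughly quadruples the pairwise cost but only doubles the attribute cost, which is the gap the transfer-learning and data-mining contributions aim to reduce.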

    Fisher Motion Descriptor for Multiview Gait Recognition

    The goal of this paper is to identify individuals by analyzing their gait. Instead of using binary silhouettes as input data (as done in many previous works) we propose and evaluate the use of motion descriptors based on densely sampled short-term trajectories. We take advantage of state-of-the-art people detectors to define custom spatial configurations of the descriptors around the target person, obtaining a rich representation of the gait motion. The local motion features (described by the Divergence-Curl-Shear descriptor [1]) extracted on the different spatial areas of the person are combined into a single high-level gait descriptor by using the Fisher Vector encoding [2]. The proposed approach, coined Pyramidal Fisher Motion, is experimentally validated on the 'CASIA' dataset [3] (parts B and C), the 'TUM GAID' dataset [4], the 'CMU MoBo' dataset [5] and the recent 'AVA Multiview Gait' dataset [6]. The results show that this new approach achieves state-of-the-art results in the problem of gait recognition, allowing the recognition of walking people from diverse viewpoints on single and multiple camera setups, wearing different clothes, carrying bags, walking at diverse speeds and not limited to straight walking paths.

    GCoNet+: A Stronger Group Collaborative Co-Salient Object Detector

    In this paper, we present a novel end-to-end group collaborative learning network, termed GCoNet+, which can effectively and efficiently (250 fps) identify co-salient objects in natural scenes. The proposed GCoNet+ achieves the new state-of-the-art performance for co-salient object detection (CoSOD) through mining consensus representations based on the following two essential criteria: 1) intra-group compactness to better formulate the consistency among co-salient objects by capturing their inherent shared attributes using our novel group affinity module (GAM); 2) inter-group separability to effectively suppress the influence of noisy objects on the output by introducing our new group collaborating module (GCM) conditioning on the inconsistent consensus. To further improve the accuracy, we design a series of simple yet effective components as follows: i) a recurrent auxiliary classification module (RACM) promoting the model learning at the semantic level; ii) a confidence enhancement module (CEM) helping the model to improve the quality of the final predictions; and iii) a group-based symmetric triplet (GST) loss guiding the model to learn more discriminative features. Extensive experiments on three challenging benchmarks, i.e., CoCA, CoSOD3k, and CoSal2015, demonstrate that our GCoNet+ outperforms the existing 12 cutting-edge models. Code has been released at https://github.com/ZhengPeng7/GCoNet_plus
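
For readers unfamiliar with triplet-style objectives, the following is a minimal sketch of the standard triplet margin loss. It is not the paper's group-based symmetric triplet (GST) loss, whose exact form is not given in this abstract; the function name, the Euclidean distance, and the margin value are assumptions made for illustration.

```python
from math import dist

def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Standard triplet margin loss on plain feature vectors.

    Penalises the case where the anchor is not at least `margin` closer
    to the positive (same group) than to the negative (other group)."""
    d_pos = dist(anchor, positive)
    d_neg = dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = [0.0, 0.0]
same_group = [0.0, 1.0]   # distance 1 from the anchor
other_group = [3.0, 4.0]  # distance 5 from the anchor

print(triplet_margin_loss(anchor, same_group, other_group))  # already well separated
print(triplet_margin_loss(anchor, other_group, same_group))  # roles swapped, large loss
```

Minimising a loss of this shape pulls features of the same group together and pushes features of different groups apart, which is the general behaviour the abstract describes as learning more discriminative features.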

    Camera Pose Estimation from Street-view Snapshots and Point Clouds

    This PhD thesis targets two research problems: (1) how to efficiently and robustly estimate the camera pose of a query image with a map that contains street-view snapshots and point clouds; (2) given the estimated camera pose of a query image, how to create meaningful and intuitive applications with the map data. To address the first research problem, we systematically investigated indirect, direct and hybrid camera pose estimation strategies. We implemented state-of-the-art methods and performed comprehensive experiments on two public benchmark datasets, considering outdoor environmental changes from ideal to extremely challenging cases. Our key findings are: (1) the indirect method is usually more accurate than the direct method when there are enough consistent feature correspondences; (2) the direct method is sensitive to initialization, but under extreme outdoor environmental changes, the mutual-information-based direct method is more robust than the feature-based methods; (3) the hybrid method combines the strengths of both the direct and indirect methods and outperforms them on challenging datasets. To explore the second research problem, we considered inspiring and useful applications that exploit the camera pose together with the map data. Firstly, we invented a 3D-map augmented photo gallery application, where images’ geo-meta data are extracted with an indirect camera pose estimation method and the photo sharing experience is improved with the augmentation of a 3D map. Secondly, we designed an interactive video playback application, where an indirect method estimates video frames’ camera pose and the video playback is augmented with a 3D map. Thirdly, we proposed a 3D visual primitive based indoor object and outdoor scene recognition method, where the 3D primitives are accumulated from the multiview images.
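
The mutual-information criterion mentioned in finding (2) can be sketched as follows. This is a generic histogram-based estimate of mutual information between two equally sized intensity lists, not the thesis's implementation; the function name, the bin count, and the 8-bit intensity range are assumptions.

```python
from collections import Counter
from math import log2

def mutual_information(img_a, img_b, bins=8, max_val=256):
    """Mutual information (in bits) between two flattened grayscale images.

    Intensities are assumed to be integers in [0, max_val) and are
    quantised into `bins` equal-width bins before counting."""
    if len(img_a) != len(img_b) or not img_a:
        raise ValueError("images must be non-empty and equally sized")

    qa = [v * bins // max_val for v in img_a]
    qb = [v * bins // max_val for v in img_b]
    n = len(qa)

    joint = Counter(zip(qa, qb))  # joint histogram
    hist_a = Counter(qa)          # marginal histograms
    hist_b = Counter(qb)

    mi = 0.0
    for (i, j), count in joint.items():
        # p(i,j) * log2( p(i,j) / (p(i) * p(j)) ), written with raw counts
        mi += (count / n) * log2(count * n / (hist_a[i] * hist_b[j]))
    return mi

reference = [0, 0, 255, 255]
inverted = [255, 255, 0, 0]   # same structure, opposite intensities
unrelated = [0, 255, 0, 255]  # statistically independent of the reference

print(mutual_information(reference, inverted))
print(mutual_information(reference, unrelated))
```

Because the score depends on the statistical relationship between intensities rather than on their raw values, an intensity-inverted image still scores as highly as an identical one, which gives some intuition for why such a criterion can tolerate strong appearance changes.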

    Rich probabilistic models for semantic labeling

    The aim of this monograph is to investigate the methods and applications of semantic labeling. Our contributions to this rapidly developing topic are particular aspects of modeling and inference in probabilistic models and their applications in the interdisciplinary fields of computer vision, medical image processing, and remote sensing.