
    Capturing Hands in Action using Discriminative Salient Points and Physics Simulation

    Hand motion capture is a popular research field, recently gaining more attention due to the ubiquity of RGB-D sensors. However, even the most recent approaches focus on the case of a single isolated hand. In this work, we focus on hands that interact with other hands or objects and present a framework that successfully captures motion in such interaction scenarios for both rigid and articulated objects. Our framework combines a generative model with discriminatively trained salient points to achieve a low tracking error, and with collision detection and physics simulation to achieve physically plausible estimates even in case of occlusions and missing visual data. Since all components are unified in a single objective function which is almost everywhere differentiable, it can be optimized with standard optimization techniques. Our approach works for monocular RGB-D sequences as well as setups with multiple synchronized RGB cameras. For a qualitative and quantitative evaluation, we captured 29 sequences with a large variety of interactions and up to 150 degrees of freedom. Comment: Accepted for publication by the International Journal of Computer Vision (IJCV) on 16.02.2016 (submitted on 17.10.14). A combination into a single framework of an ECCV'12 multicamera-RGB and a monocular-RGBD GCPR'14 hand tracking paper with several extensions, additional experiments and detail

    Metaheuristic Optimization Techniques for Articulated Human Tracking

    Four adaptive metaheuristic optimization algorithms are proposed and demonstrated: Adaptive Parameter Particle Swarm Optimization (AP-PSO), Modified Artificial Bat (MAB), Differential Mutated Artificial Immune System (DM-AIS) and hybrid Particle Swarm Accelerated Artificial Immune System (PSO-AIS). The algorithms adapt their search parameters on the basis of the fitness of obtained solutions such that a good fitness value favors local search, while a poor fitness value favors global search. This efficient feedback of solution quality imparts excellent global and local search characteristics to the proposed algorithms. The algorithms are tested on the challenging Articulated Human Tracking (AHT) problem, whose objective is to infer human pose, expressed in terms of joint angles, from a continuous video stream. The Particle Filter (PF) algorithms, widely applied in generative model based AHT, suffer from the 'curse of dimensionality' and 'degeneracy' challenges. The four proposed algorithms show stable performance throughout the course of numerical experiments. DM-AIS performs best among the proposed algorithms, followed in order by PSO-AIS, AP-PSO, and MAB in terms of Most Appropriate Pose (MAP) tracking error. The MAP tracking error of the proposed algorithms is compared with four heuristic approaches: generic PF, Annealed Particle Filter (APF), Partitioned Sampled Annealed Particle Filter (PSAPF) and Hierarchical Particle Swarm Optimization (HPSO). They are found to outperform generic PF with a confidence level of 95%, and PSAPF and HPSO with a confidence level of 85%, while DM-AIS and PSO-AIS outperform APF with a confidence level of 80%. Further, it is noted that the proposed algorithms outperform PSAPF and HPSO using a significantly lower number of function evaluations, 2500 versus 7200. The proposed algorithms demonstrate reduced particle requirements, hence improving computational efficiency and helping to alleviate the 'curse of dimensionality'.
The adaptive nature of the algorithms is found to guide the whole swarm towards the optimal solution by sharing information and exploring a wider solution space, which resolves the 'degeneracy' challenge. Furthermore, the decentralized structure of the algorithms renders them insensitive to accumulation of error and allows them to recover from catastrophic failures due to loss of image data, sudden change in motion pattern or discrete instances of algorithmic failure. The performance enhancements demonstrated by the proposed algorithms, attributed to their balanced local and global search capabilities, make real-time AHT applications feasible. Finally, the utility of the proposed algorithms in low-dimensional system identification problems as well as high-dimensional AHT problems demonstrates their applicability in various problem domains.
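The fitness-driven adaptation described in this abstract can be illustrated with a generic particle swarm. The sketch below is not the authors' AP-PSO: the linear inertia rule, the inertia range 0.4 to 0.9, the acceleration coefficients, and the name `adaptive_pso` are all assumptions chosen for illustration. It only shows the stated idea that a particle with good fitness searches locally (small inertia) while a particle with poor fitness searches globally (large inertia).

```python
import random


def adaptive_pso(objective, dim, lo, hi, n_particles=20, n_iters=100, seed=0):
    """Minimize `objective` with a per-particle, fitness-adaptive inertia weight.

    Illustrative sketch only; all constants below are assumed, not from the paper.
    """
    rng = random.Random(seed)
    w_min, w_max = 0.4, 0.9  # assumed inertia range
    c1 = c2 = 1.5            # assumed acceleration coefficients

    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    fit = [objective(p) for p in pos]
    pbest = [p[:] for p in pos]
    pbest_fit = fit[:]
    g = min(range(n_particles), key=lambda i: fit[i])
    gbest, gbest_fit = pos[g][:], fit[g]

    for _ in range(n_iters):
        best, worst = min(fit), max(fit)
        spread = worst - best
        for i in range(n_particles):
            # Relative fitness in [0, 1]: 0 = best in the swarm, 1 = worst.
            rel = (fit[i] - best) / spread if spread > 0 else 0.0
            # Good fitness -> small inertia (local search);
            # poor fitness -> large inertia (global search).
            w = w_min + (w_max - w_min) * rel
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the new position to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            fit[i] = objective(pos[i])
            if fit[i] < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit[i]
                if fit[i] < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit[i]
    return gbest, gbest_fit


if __name__ == "__main__":
    # Toy objective standing in for a pose-fitting error (hypothetical).
    sphere = lambda x: sum(v * v for v in x)
    print(adaptive_pso(sphere, dim=3, lo=-5.0, hi=5.0))
```

In a tracking setting, `objective` would score a candidate vector of joint angles against the current video frame; the sphere function here is only a stand-in.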

    Robust and real-time hand detection and tracking in monocular video

    In recent years, personal computing devices such as laptops, tablets and smartphones have become ubiquitous. Moreover, intelligent sensors are being integrated into many consumer devices such as eyeglasses, wristwatches and smart televisions. With the advent of touchscreen technology, a new human-computer interaction (HCI) paradigm arose that allows users to interface with their device in an intuitive manner. Using simple gestures, such as swipe or pinch movements, a touchscreen can be used to directly interact with a virtual environment. Nevertheless, touchscreens still form a physical barrier between the virtual interface and the real world. An increasingly popular field of research that tries to overcome this limitation is video-based gesture recognition, hand detection and hand tracking. Gesture-based interaction allows users to directly interact with the computer in a natural manner by exploring a virtual reality using nothing but their own body language. In this dissertation, we investigate how robust hand detection and tracking can be accomplished under real-time constraints. In the context of human-computer interaction, real-time is defined as both low latency and low complexity, such that a complete video frame can be processed before the next one becomes available. Furthermore, for practical applications, the algorithms should be robust to illumination changes, camera motion, and cluttered backgrounds in the scene. Finally, the system should be able to initialize automatically, and to detect and recover from tracking failure. We study a wide variety of existing algorithms, and propose significant improvements and novel methods to build a complete detection and tracking system that meets these requirements. Hand detection, hand tracking and hand segmentation are related yet technically different challenges.
Whereas detection deals with finding an object in a static image, tracking considers temporal information and is used to track the position of an object over time, throughout a video sequence. Hand segmentation is the task of estimating the hand contour, thereby separating the object from its background. Detection of hands in individual video frames allows us to automatically initialize our tracking algorithm, and to detect and recover from tracking failure. Human hands are highly articulated objects, consisting of finger parts that are connected with joints. As a result, the appearance of a hand can vary greatly, depending on the assumed hand pose. Traditional detection algorithms often assume that the appearance of the object of interest can be described using a rigid model and therefore cannot be used to robustly detect human hands. Therefore, we developed an algorithm that detects hands by exploiting their articulated nature. Instead of resorting to a template-based approach, we probabilistically model the spatial relations between different hand parts and the centroid of the hand. Detecting hand parts, such as fingertips, is much easier than detecting a complete hand. Based on our model of the spatial configuration of hand parts, the detected parts can be used to obtain an estimate of the complete hand's position. To comply with the real-time constraints, we developed techniques to speed up the process by efficiently discarding unimportant information in the image. Experimental results show that our method is competitive with the state-of-the-art in object detection while reducing computational complexity by a factor of 1,000. Furthermore, we showed that our algorithm can also be used to detect other articulated objects such as persons or animals and is therefore not restricted to the task of hand detection. Once a hand has been detected, a tracking algorithm can be used to continuously track its position in time.
We developed a probabilistic tracking method that can cope with uncertainty caused by image noise, incorrect detections, changing illumination, and camera motion. Furthermore, our tracking system automatically determines the number of hands in the scene, and can cope with hands entering or leaving the video canvas. We introduced several novel techniques that greatly increase tracking robustness, and that can also be applied in domains other than hand tracking. To achieve real-time processing, we investigated several techniques to reduce the search space of the problem, and deliberately employ methods that are easily parallelized on modern hardware. Experimental results indicate that our methods outperform the state-of-the-art in hand tracking, while providing a much lower computational complexity. One of the methods used by our probabilistic tracking algorithm is optical flow estimation. Optical flow is defined as a 2D vector field describing the apparent velocities of objects in a 3D scene, projected onto the image plane. Optical flow is known to be used by many insects and birds to visually track objects and to estimate their ego-motion. However, most optical flow estimation methods described in the literature are either too slow to be used in real-time applications, or are not robust to illumination changes and fast motion. We therefore developed an optical flow algorithm that can cope with large displacements, and that is illumination independent. Furthermore, we introduce a regularization technique that ensures a smooth flow-field. This regularization scheme effectively reduces the number of noisy and incorrect flow-vector estimates, while maintaining the ability to handle motion discontinuities caused by object boundaries in the scene. The above methods are combined into a hand tracking framework which can be used for interactive applications in unconstrained environments.
To demonstrate the possibilities of gesture-based human-computer interaction, we developed a new type of computer display. This display is completely transparent, allowing multiple users to perform collaborative tasks while maintaining eye contact. Furthermore, our display produces an image that seems to float in thin air, such that users can touch the virtual image with their hands. This floating imaging display has been showcased at several national and international events and tradeshows. The research that is described in this dissertation has been evaluated thoroughly by comparing detection and tracking results with those obtained by state-of-the-art algorithms. These comparisons show that the proposed methods outperform most algorithms in terms of accuracy, while achieving a much lower computational complexity, resulting in a real-time implementation. Results are discussed in depth at the end of each chapter. This research further resulted in an international journal publication; a second journal paper that has been submitted and is under review at the time of writing this dissertation; nine international conference publications; a national conference publication; a commercial license agreement concerning the research results; two hardware prototypes of a new type of computer display; and a software demonstrator.

    REPRESENTATION LEARNING FOR ACTION RECOGNITION

    The objective of this research work is to develop discriminative representations for human actions. The motivation stems from the fact that there are many issues encountered while capturing actions in videos, such as intra-action variations (due to actors, viewpoints, and duration), inter-action similarity, background motion, and occlusion of actors. Hence, obtaining a representation which can address all the variations in the same action while maintaining discrimination with other actions is a challenging task. In the literature, actions have been represented using either low-level or high-level features. Low-level features describe the motion and appearance in small spatio-temporal volumes extracted from a video. Due to the limited space-time volume used for extracting low-level features, they are not able to account for viewpoint and actor variations or variable length actions. On the other hand, high-level features handle variations in actors, viewpoints, and duration, but the resulting representation is often high-dimensional, which introduces the curse of dimensionality. In this thesis, we propose new representations for describing actions by combining the advantages of both low-level and high-level features. Specifically, we investigate various linear and non-linear decomposition techniques to extract meaningful attributes in both high-level and low-level features. In the first approach, the sparsity of high-level feature descriptors is leveraged to build action-specific dictionaries. Each dictionary retains only the discriminative information for a particular action and hence reduces inter-action similarity. Then, a sparsity-based classification method is proposed to classify the low-rank representation of clips obtained using these dictionaries. We show that this representation based on dictionary learning improves the classification performance across actions.
Also, a few of the actions consist of rapid body deformations that hinder the extraction of local features from body movements. Hence, we propose to use a dictionary which is trained on convolutional neural network (CNN) features of the human body in various poses to reliably identify actors from the background. Particularly, we demonstrate the efficacy of sparse representation in the identification of the human body under rapid and substantial deformation. In the first two approaches, sparsity-based representation is developed to improve discriminability using class-specific dictionaries that utilize action labels. However, developing an unsupervised representation of actions is more beneficial as it can be used to both recognize similar actions and localize actions. We propose to exploit inter-action similarity to train a universal attribute model (UAM) in order to learn action attributes (common and distinct) implicitly across all the actions. Using maximum a posteriori (MAP) adaptation, a high-dimensional super action-vector (SAV) for each clip is extracted. As this SAV contains redundant attributes of all other actions, we use factor analysis to extract a novel low-dimensional action-vector representation for each clip. Action-vectors are shown to suppress background motion and highlight actions of interest in both trimmed and untrimmed clips, which contributes to action recognition without the help of any classifiers. It is observed during our experiments that the action-vector cannot effectively discriminate between actions which are visually similar to each other. Hence, we subject action-vectors to supervised linear embedding using linear discriminant analysis (LDA) and probabilistic LDA (PLDA) to enforce discrimination. Particularly, we show that leveraging complementary information across action-vectors using different local features followed by discriminative embedding provides the best classification performance.
Further, we explore non-linear embedding of action-vectors using Siamese networks, especially for fine-grained action recognition. A visualization of the hidden layer output in Siamese networks shows its ability to effectively separate visually similar actions. This leads to better classification performance than linear embedding on fine-grained action recognition. All of the above approaches are presented on large unconstrained datasets with hundreds of examples per action. However, actions in surveillance videos like snatch thefts are difficult to model because of the diverse variety of scenarios in which they occur and the very few labeled examples available. Hence, we propose to utilize the universal attribute model (UAM) trained on large action datasets to represent such actions. Specifically, we show that there are similarities between certain actions in the large datasets and snatch thefts, which help in extracting a representation for snatch thefts using the attributes from the UAM. This representation is shown to be effective in distinguishing snatch thefts from regular actions with high accuracy. In summary, this thesis proposes both supervised and unsupervised approaches for representing actions which provide better discrimination than existing representations. The first approach presents a dictionary learning based sparse representation for effective discrimination of actions. Also, we propose a sparse representation for the human body based on dictionaries in order to recognize actions with rapid body deformations. In the next approach, a low-dimensional representation called action-vector for unsupervised action recognition is presented. Further, linear and non-linear embedding of action-vectors is proposed for addressing inter-action similarity and fine-grained action recognition, respectively. Finally, we propose a representation for locating snatch thefts among thousands of regular interactions in surveillance videos.