
    GrabCut-Based Human Segmentation in Video Sequences

    In this paper, we present a fully automatic Spatio-Temporal GrabCut human segmentation methodology that combines tracking and segmentation. GrabCut initialization is performed by HOG-based subject detection, face detection, and a skin color model. Spatial information is included by Mean Shift clustering, whereas temporal coherence is enforced through the history of Gaussian Mixture Models. Moreover, full face and pose recovery is obtained by combining human segmentation with Active Appearance Models and Conditional Random Fields. Results on public datasets and on a new Human Limb dataset show robust segmentation and recovery of both face and pose using the presented methodology.
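    As a rough illustration of the detection-seeded initialization described above, the following sketch uses OpenCV's stock HOG pedestrian detector to build a GrabCut trimap from a detection box. The window stride, the file name, and the use of the whole box as probable foreground are illustrative assumptions, not the authors' exact initialization (which additionally uses face detection and a skin color model):

        import cv2
        import numpy as np

        img = cv2.imread("frame.png")  # hypothetical input frame

        # 1. Detect the subject with OpenCV's default HOG pedestrian detector.
        hog = cv2.HOGDescriptor()
        hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
        boxes, weights = hog.detectMultiScale(img, winStride=(8, 8))

        if len(boxes) > 0:
            x, y, w, h = boxes[0]  # take one detection for simplicity

            # 2. Build the trimap: outside the detection box is definite
            # background, the interior is probable foreground.
            mask = np.full(img.shape[:2], cv2.GC_BGD, dtype=np.uint8)
            mask[y:y + h, x:x + w] = cv2.GC_PR_FGD

            # 3. Run mask-initialized GrabCut; the foreground/background
            # Gaussian Mixture Models are refined over a few iterations.
            bgd_model = np.zeros((1, 65), np.float64)
            fgd_model = np.zeros((1, 65), np.float64)
            cv2.grabCut(img, mask, None, bgd_model, fgd_model, 5,
                        cv2.GC_INIT_WITH_MASK)

            # Pixels labeled (probable) foreground form the person mask.
            person = np.where(
                (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0
            ).astype(np.uint8)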

    From pixels to gestures: learning visual representations for human analysis in color and depth data sequences

    The visual analysis of humans from images is an important topic of interest due to its relevance to many computer vision applications like pedestrian detection, monitoring and surveillance, human-computer interaction, e-health or content-based image retrieval, among others. In this dissertation we aim to learn different visual representations of the human body that are helpful for the visual analysis of humans in images and video sequences. To that end, we analyze both RGB and depth image modalities and address the problem from three different research lines, at different levels of abstraction, from pixels to gestures: human segmentation, human pose estimation and gesture recognition. First, we show how binary segmentation (object vs. background) of the human body in image sequences is helpful to remove all the background clutter present in the scene. The presented method, based on Graph cuts optimization, enforces spatio-temporal consistency of the produced segmentation masks among consecutive frames. Secondly, we present a framework for multi-label segmentation for obtaining much more detailed segmentation masks: instead of just obtaining a binary representation separating the human body from the background, finer segmentation masks can be obtained separating and categorizing the different body parts. At a higher level of abstraction, we aim for a simpler yet descriptive representation of the human body. Human pose estimation methods usually rely on skeletal models of the human body, formed by segments (or rectangles) that represent the body limbs, appropriately connected following the kinematic constraints of the human body. In practice, such skeletal models must fulfill some constraints in order to allow for efficient inference, while actually limiting the expressiveness of the model. In order to cope with this, we introduce a top-down approach for predicting the position of the body parts in the model, using a mid-level part representation based on Poselets. Finally, we propose a framework for gesture recognition based on the bag of visual words framework. We leverage the benefits of RGB and depth image modalities by combining modality-specific visual vocabularies in a late fusion fashion. A new rotation-invariant depth descriptor is presented, yielding better results than other state-of-the-art descriptors. Moreover, spatio-temporal pyramids are used to encode rough spatial and temporal structure. In addition, we present a probabilistic reformulation of Dynamic Time Warping for gesture segmentation in video sequences. A Gaussian-based probabilistic model of each gesture is learnt, implicitly encoding possible deformations in both the spatial and time domains.
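    To make the probabilistic Dynamic Time Warping idea concrete, here is a toy sketch under the following assumption: each model frame t of a gesture is represented by a learnt Gaussian (mu_t, Sigma_t), and the DTW local cost of matching an observed frame x to model frame t is the negative log-likelihood -log N(x; mu_t, Sigma_t). Function and variable names are illustrative, not the thesis code:

        import numpy as np

        def gaussian_nll(x, mu, cov):
            """Negative log-likelihood of x under a Gaussian N(mu, cov)."""
            d = x - mu
            _, logdet = np.linalg.slogdet(cov)
            k = len(mu)
            return 0.5 * (d @ np.linalg.solve(cov, d)
                          + logdet + k * np.log(2 * np.pi))

        def probabilistic_dtw(seq, means, covs):
            """DTW score of an observed sequence against a Gaussian gesture model.

            seq:   (N, D) observed per-frame feature vectors
            means: (M, D) per-frame model means
            covs:  (M, D, D) per-frame model covariances
            """
            n, m = len(seq), len(means)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    # Local cost: how unlikely the observation is under the
                    # model's Gaussian for frame j.
                    cost = gaussian_nll(seq[i - 1], means[j - 1], covs[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1],
                                         D[i - 1, j - 1])
            return D[n, m]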

    MonoPerfCap: Human Performance Capture from Monocular Video

    We present the first marker-less approach for temporally coherent 3D performance capture of a human with general clothing from monocular video. Our approach reconstructs articulated human skeleton motion as well as medium-scale non-rigid surface deformations in general scenes. Human performance capture is a challenging problem due to the large range of articulation, potentially fast motion, and considerable non-rigid deformations, even from multi-view data. Reconstruction from monocular video alone is drastically more challenging, since strong occlusions and the inherent depth ambiguity lead to a highly ill-posed reconstruction problem. We tackle these challenges with a novel approach that employs sparse 2D and 3D human pose detections from a convolutional neural network using a batch-based pose estimation strategy. Joint recovery of per-batch motion allows us to resolve the ambiguities of the monocular reconstruction problem based on a low dimensional trajectory subspace. In addition, we propose refinement of the surface geometry based on fully automatically extracted silhouettes to enable medium-scale non-rigid alignment. We demonstrate state-of-the-art performance capture results that enable exciting applications such as video editing and free viewpoint video, previously infeasible from monocular video. Our qualitative and quantitative evaluation demonstrates that our approach significantly outperforms previous monocular methods in terms of accuracy, robustness and the scene complexity that can be handled. (Accepted to ACM TOG 2018, presented at SIGGRAPH 2018.)
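    The "low dimensional trajectory subspace" mentioned above can be illustrated with one standard construction (an assumption here, not necessarily the paper's exact basis): each joint trajectory over a batch of F frames is re-projected onto the first K vectors of a discrete cosine transform basis, suppressing the high-frequency components that monocular depth ambiguity tends to corrupt:

        import numpy as np

        def dct_basis(F, K):
            """First K DCT-II basis vectors of length F, as an (F, K) matrix."""
            t = np.arange(F)
            B = np.stack(
                [np.cos(np.pi * (t + 0.5) * k / F) for k in range(K)], axis=1
            )
            return B / np.linalg.norm(B, axis=0)

        def project_trajectory(traj, K=8):
            """Smooth an (F, 3) joint trajectory by truncated-DCT re-projection."""
            B = dct_basis(traj.shape[0], K)  # (F, K) orthonormal basis
            coeffs = B.T @ traj              # (K, 3) low-dimensional code
            return B @ coeffs                # (F, 3) subspace-constrained motion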

    A Methodology for Extracting Human Bodies from Still Images

    Monitoring and surveillance of humans is one of the most prominent applications today, and it is expected to be part of many future aspects of our life, for safety reasons, assisted living and many others. Many efforts have been made towards automatic and robust solutions, but the general problem is very challenging and remains open. In this PhD dissertation we examine the problem from many perspectives. First, we study the performance of a hardware architecture designed for large-scale surveillance systems. Then, we focus on the general problem of human activity recognition, present an extensive survey of methodologies that deal with this subject and propose a maturity metric to evaluate them. Image segmentation is one of the most popular algorithms for image processing found in the field, and we propose a blind metric to evaluate segmentation results with respect to the activity at local regions. Finally, we propose a fully automatic system for segmenting and extracting human bodies from challenging single images, which is the main contribution of the dissertation. Our methodology is a novel bottom-up approach relying mostly on anthropometric constraints and is facilitated by our research in the fields of face, skin and hands detection. Experimental results and comparison with state-of-the-art methodologies demonstrate the success of our approach.
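    One common building block behind the face/skin/hands cues mentioned above is a skin-color test in YCrCb space. The sketch below uses illustrative Cr/Cb bounds popular in the literature, not the dissertation's trained skin model:

        import cv2
        import numpy as np

        img = cv2.imread("person.jpg")  # hypothetical input image
        ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)

        # Widely used Cr/Cb bounds for skin; Y (luminance) is unconstrained.
        lower = np.array([0, 133, 77], dtype=np.uint8)
        upper = np.array([255, 173, 127], dtype=np.uint8)
        skin_mask = cv2.inRange(ycrcb, lower, upper)

        # Morphological opening removes small speckles from the mask.
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
        skin_mask = cv2.morphologyEx(skin_mask, cv2.MORPH_OPEN, kernel)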

    Survey on 2D and 3D human pose recovery

    Human Pose Recovery approaches have been studied in the field of Computer Vision for the last 40 years. Several approaches have been reported, and significant improvements have been obtained in both data representation and model design. However, the problem of Human Pose Recovery in uncontrolled environments is far from being solved. In this paper, we define a global taxonomy to group the model-based methods and discuss their main advantages and drawbacks.

    2D Articulated Human Pose Estimation and Retrieval in (Almost) Unconstrained Still Images

    We present a technique for estimating the spatial layout of humans in still images—the position of the head, torso and arms. The theme we explore is that once a person is localized using an upper body detector, the search for their body parts can be considerably simplified using weak constraints on position and appearance arising from that detection. Our approach is capable of estimating upper body pose in highly challenging uncontrolled images, without prior knowledge of background, clothing, lighting, or the location and scale of the person in the image. People are only required to be upright and seen from the front or the back (not the side). We evaluate the stages of our approach experimentally using ground truth layout annotation on a variety of challenging material, such as images from the PASCAL VOC 2008 challenge and video frames from TV shows and feature films. We also propose and evaluate techniques for searching a video dataset for people in a specific pose. To this end, we develop three new pose descriptors and compare their classification and retrieval performance to two baselines built on state-of-the-art object detection models.
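    As a toy illustration of pose-based retrieval, the sketch below builds a simple layout descriptor from stick-figure part orientations relative to the torso and ranks a database by descriptor distance. This is an illustrative stand-in, not one of the paper's three descriptors:

        import numpy as np

        def pose_descriptor(parts):
            """parts maps a part name to its stick endpoints ((x1, y1), (x2, y2))."""
            (tx1, ty1), (tx2, ty2) = parts["torso"]
            torso_angle = np.arctan2(ty2 - ty1, tx2 - tx1)
            desc = []
            for name in ("head", "left_arm", "right_arm"):
                (x1, y1), (x2, y2) = parts[name]
                rel = np.arctan2(y2 - y1, x2 - x1) - torso_angle
                # Encode each relative angle as a 2D unit vector so the
                # descriptor is continuous across the +/- pi boundary.
                desc.extend([np.cos(rel), np.sin(rel)])
            return np.array(desc)

        def retrieve(query_parts, database):
            """Rank database poses (same dict format) by distance to the query."""
            q = pose_descriptor(query_parts)
            dists = [np.linalg.norm(q - pose_descriptor(p)) for p in database]
            return np.argsort(dists)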

    Analysis of the hands in egocentric vision: A survey

    Egocentric vision (a.k.a. first-person vision, FPV) applications have thrived over the past few years, thanks to the availability of affordable wearable cameras and large annotated datasets. The position of the wearable camera (usually mounted on the head) allows recording exactly what the camera wearers have in front of them, in particular hands and manipulated objects. This intrinsic advantage enables the study of the hands from multiple perspectives: localizing hands and their parts within the images; understanding what actions and activities the hands are involved in; and developing human-computer interfaces that rely on hand gestures. In this survey, we review the literature that focuses on the hands using egocentric vision, categorizing the existing approaches into: localization (where are the hands or parts of them?); interpretation (what are the hands doing?); and application (e.g., systems that use egocentric hand cues for solving a specific problem). Moreover, a list of the most prominent datasets with hand-based annotations is provided.

    Depth-Assisted Semantic Segmentation, Image Enhancement and Parametric Modeling

    This dissertation addresses the problem of employing 3D depth information to solve a number of traditionally challenging computer vision/graphics problems. Humans have the ability to perceive depth information in the 3D world, which enables them to reconstruct layouts, recognize objects and understand the geometric space and semantic meanings of the visual world. It is therefore worth exploring how 3D depth information can be utilized by computer vision systems to mimic such abilities. This dissertation aims at employing 3D depth information to solve vision/graphics problems in the following aspects: scene understanding, image enhancement, and 3D reconstruction and modeling. In addressing the scene understanding problem, we present a framework for semantic segmentation and object recognition on urban video sequences using only dense depth maps recovered from the video. Five view-independent 3D features that vary with object class are extracted from dense depth maps and used for segmenting and recognizing different object classes in street scene images. We demonstrate a scene parsing algorithm that uses only dense 3D depth information and outperforms approaches using sparse 3D or 2D appearance features. In addressing the image enhancement problem, we present a framework to overcome the imperfections of personal photographs of tourist sites using the rich information provided by large-scale internet photo collections (IPCs). By augmenting personal 2D images with 3D information reconstructed from IPCs, we address a number of traditionally challenging image enhancement techniques and achieve high-quality results using simple and robust algorithms. In addressing the 3D reconstruction and modeling problem, we focus on parametric modeling of flower petals, the most distinctive part of a plant. The complex structure, severe occlusions and wide variations make the reconstruction of their 3D models a challenging task. We overcome these challenges by combining data-driven modeling techniques with domain knowledge from botany. Taking a 3D point cloud of an input flower scanned from a single view, each segmented petal is fitted with a scale-invariant morphable petal shape model, which is constructed from individually scanned 3D exemplar petals. Novel constraints based on botany studies are incorporated into the fitting process for realistically reconstructing occluded regions and maintaining correct 3D spatial relations. The main contribution of the dissertation is the intelligent use of 3D depth information to solve traditionally challenging vision/graphics problems. By developing advanced algorithms, either automatically or with minimum user interaction, this dissertation demonstrates that the computed 3D depth behind multiple images contains rich information about the visual world and can therefore be intelligently utilized to recognize and understand the semantic meanings of scenes, efficiently enhance and augment single 2D images, and reconstruct high-quality 3D models.
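    One typical example of a view-independent feature computable from a dense depth map alone is the per-pixel surface normal, estimated from depth gradients. The sketch below is a generic illustration; the dissertation's five features are not reproduced here:

        import numpy as np

        def depth_normals(depth):
            """Estimate unit surface normals from an (H, W) depth map."""
            dz_dy, dz_dx = np.gradient(depth.astype(np.float64))
            # The surface z = depth(x, y) has (unnormalized) normal
            # (-dz/dx, -dz/dy, 1) at every pixel.
            n = np.dstack([-dz_dx, -dz_dy, np.ones(depth.shape)])
            return n / np.maximum(np.linalg.norm(n, axis=2, keepdims=True), 1e-8)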
