8 research outputs found

    From pixels to gestures: learning visual representations for human analysis in color and depth data sequences

    The visual analysis of humans from images is an important topic of interest due to its relevance to many computer vision applications like pedestrian detection, monitoring and surveillance, human-computer interaction, e-health or content-based image retrieval, among others. In this dissertation we are interested in learning different visual representations of the human body that are helpful for the visual analysis of humans in images and video sequences.
    To that end, we analyze both RGB and depth image modalities and address the problem from three different research lines, at different levels of abstraction, from pixels to gestures: human segmentation, human pose estimation and gesture recognition. First, we show how binary segmentation (object vs. background) of the human body in image sequences is helpful to remove all the background clutter present in the scene. The presented method, based on "Graph cuts" optimization, enforces spatio-temporal consistency of the produced segmentation masks among consecutive frames. Secondly, we present a framework for multi-label segmentation for obtaining much more detailed segmentation masks: instead of just obtaining a binary representation separating the human body from the background, finer segmentation masks can be obtained separating the different body parts. At a higher level of abstraction, we aim for a simpler yet descriptive representation of the human body. Human pose estimation methods usually rely on skeletal models of the human body, formed by segments (or rectangles) that represent the body limbs, appropriately connected following the kinematic constraints of the human body. In practice, such skeletal models must fulfill some constraints in order to allow for efficient inference, while actually limiting the expressiveness of the model. In order to cope with this, we introduce a top-down approach for predicting the position of the body parts in the model, using a mid-level part representation based on Poselets. Finally, we propose a framework for gesture recognition based on the bag of visual words framework. We leverage the benefits of RGB and depth image modalities by combining modality-specific visual vocabularies in a late fusion fashion. A new rotation-invariant depth descriptor is presented, yielding better results than other state-of-the-art descriptors. Moreover, spatio-temporal pyramids are used to encode rough spatial and temporal structure. In addition, we present a probabilistic reformulation of Dynamic Time Warping for gesture segmentation in video sequences. A Gaussian-based probabilistic model of a gesture is learnt, implicitly encoding possible deformations in both the spatial and time domains.
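The late-fusion idea described above (modality-specific visual vocabularies whose classifier scores are combined afterwards) can be illustrated with a minimal bag-of-visual-words sketch. The vocabulary sizes, descriptor shapes, and the weighted-sum fusion rule below are illustrative assumptions, not the thesis's actual pipeline:

```python
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    """Quantize local descriptors against a visual vocabulary and
    return an L1-normalized bag-of-visual-words histogram."""
    # Squared distance from every descriptor to every visual word.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                      # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

def late_fusion_score(rgb_scores, depth_scores, alpha=0.5):
    """Hypothetical late fusion: weighted sum of per-class scores
    produced by separate RGB and depth classifiers."""
    return alpha * rgb_scores + (1.0 - alpha) * depth_scores
```

In a full system the two vocabularies would be learned separately (e.g., by clustering RGB and depth descriptors), and one classifier per modality would produce the score vectors that `late_fusion_score` combines.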

    Analyzing Complex Events and Human Actions in "in-the-wild" Videos

    We are living in a world where it is easy to acquire videos of events ranging from private picnics to public concerts, and to share them publicly via websites such as YouTube. The ability of smartphones to create these videos and upload them to the internet has led to an explosion of video data, which in turn has led to interesting research directions involving the analysis of "in-the-wild" videos. To process these types of videos, various recognition tasks such as pose estimation, action recognition, and event recognition become important in computer vision. This thesis presents various recognition problems and proposes mid-level models to address them. First, a discriminative deformable part model is presented for the recovery of qualitative pose, inferring coarse pose labels (e.g., left, front-right, back), a task more robust to common confounding factors that hinder the inference of exact 2D or 3D joint locations. Our approach automatically selects parts that are predictive of qualitative pose and trains their appearance and deformation costs to best discriminate between qualitative poses. Unlike previous approaches, our parts are both selected and trained to improve qualitative pose discrimination and are shared by all the qualitative pose models. This leads to both increased accuracy and higher efficiency, since fewer part models are evaluated for each image. In comparisons with two state-of-the-art approaches on a public dataset, our model shows superior performance. Second, the thesis proposes the use of a robust pose feature based on part-based human detectors (Poselets) for the task of action recognition in relatively unconstrained videos, i.e., collected from the web. This feature, based on the original poselets activation vector, coarsely models pose and its transitions over time.
Our main contributions are that we improve the original feature's compactness and discriminability by greedy set cover over subsets of joint configurations, and incorporate it into a unified video-based action recognition framework. Experiments shows that the pose feature alone is extremely informative, yielding performance that matches most state-of-the-art approaches but only using our proposed improvements to its compactness and discriminability. By combining our pose feature with motion and shape, the proposed method outperforms state-of-the-art approaches on two public datasets. Third, clauselets, sets of concurrent actions and their temporal relationships, are proposed and explored their application to video event analysis. Clauselets are trained in two stages. Initially, clauselet detectors that find a limited set of actions in particular qualitative temporal configurations based on Allen's interval relations is trained. In the second stage, the first level detectors are applied to training videos, and discriminatively learn temporal patterns between activations that involve more actions over longer durations and lead to improved second level clauselet models. The utility of clauselets is demonstrated by applying them to the task of ``in-the-wild'' video event recognition on the TRECVID MED 11 dataset. Not only do clauselets achieve state-of-the-art results on this task, but qualitative results suggest that they may also lead to semantically meaningful descriptions of videos in terms of detected actions and their temporal relationships. Finally, the thesis addresses the task of searching for videos given text queries that are not known at training time, which typically involves zero-shot learning, where detectors for a large set of concepts, attributes, or objects parts are learned under the assumption that, once the search query is known, they can be combined to detect novel complex visual categories. 
    These detectors are typically trained on annotated training data that is time-consuming and expensive to obtain, and a successful system requires many of them to generalize well at test time. In addition, these detectors are so general that they are not well-tuned to the specific query or target data, since neither is known at training. Our approach addresses the annotation problem by searching the web to discover visual examples of short text phrases. Top-ranked search results are used to learn general, potentially noisy, visual phrase detectors. Given a search query and a target dataset, the visual phrase detectors are adapted to both the query and unlabeled target data to remove the influence of incorrect training examples or correct examples that are irrelevant to the search query. Our adaptation process exploits the spatio-temporal co-occurrence of visual phrases that are found in the target data and which are relevant to the search query by iteratively refining both the visual phrase detectors and spatio-temporally grouped phrase detections ("clauselets"). Our approach is demonstrated on the challenging TRECVID MED13 EK0 dataset, and we show that, using visual features alone, our approach outperforms state-of-the-art approaches that use visual, audio, and text (OCR) features.
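The first-stage clauselet detectors above rely on Allen's interval relations to describe how two action intervals relate in time. The helper below is a generic textbook implementation of Allen's thirteen relations, not code from the dissertation; intervals are assumed to be (start, end) pairs with start < end:

```python
def allen_relation(a, b):
    """Return the Allen interval relation of interval a with respect to b.
    Intervals are (start, end) pairs with start < end."""
    (a0, a1), (b0, b1) = a, b
    if a1 < b0:
        return "before"
    if a1 == b0:
        return "meets"
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0:
        return "starts" if a1 < b1 else "started-by"
    if a1 == b1:
        return "finishes" if a0 > b0 else "finished-by"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 and b1 < a1:
        return "contains"
    if a0 < b0 < a1 < b1:
        return "overlaps"
    if b0 < a0 < b1 < a1:
        return "overlapped-by"
    if b1 < a0:
        return "after"
    return "met-by"  # remaining case: b1 == a0
```

A clauselet detector could then be phrased as a predicate over pairs of detected action intervals, e.g. "jump overlaps fall".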

    2D Human Pose Estimation in Stereo Images

    Human Pose Estimation (HPE) is the task of obtaining the spatial configuration of human body parts from images. Methods recovering the human pose from a single image are called monocular approaches, while those using image pairs are called stereo approaches. Stereo images provide extra information that can be employed to improve the results obtained by monocular approaches. This Thesis considers the problem of 2D human pose estimation on stereo images. To this end, three contributions are provided. The first contribution of this thesis is a new technique to automatically detect and estimate the 2D pose of humans in stereo images. The proposed method is based on a similarity constraint that promotes a collaboration between two pose estimators. We show experimentally that our proposal improves the accuracy of the estimated poses when compared to standard HPE techniques running independently on each image. The second contribution is a dataset for the problem of human pose estimation in stereo images. To experimentally validate our approach, we have created a new annotated dataset of 630 stereo images from stereo videos depicting people in different backgrounds, clothing, lighting or locations in the image frames. The dataset contains upright people in a great variety of arm poses, covering the space of possible configurations quite uniformly. The third contribution is a new method to estimate the 2D pose of humans in stereo video sequences. The proposed pipeline starts by constraining the possible locations of body joints by exploiting color and disparity information, and adding location priors to the most structured joints. Finally, a body limb recombination method is applied along the stereo sequence to obtain the best configuration of the body joints. The experiments show that our method obtains better average results than the state of the art.
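The similarity constraint between two single-view estimators can be pictured as a joint scoring of candidate pose pairs: the pair keeping both detectors happy while agreeing across views (up to a horizontal stereo shift) wins. The sketch below is a toy illustration; the candidate-pair enumeration, the disparity handling (subtracting the mean horizontal offset), and the weight `lam` are hypothetical choices, not the thesis's formulation:

```python
import numpy as np

def best_stereo_pair(left_poses, left_scores, right_poses, right_scores, lam=1.0):
    """Pick the (left, right) candidate pose pair maximizing the sum of the
    two single-view detector scores minus a cross-view similarity penalty.
    Poses are (n_joints, 2) arrays of 2D joint positions."""
    best, best_val = None, -np.inf
    for i, (pl, sl) in enumerate(zip(left_poses, left_scores)):
        for j, (pr, sr) in enumerate(zip(right_poses, right_scores)):
            d = pl - pr
            d[:, 0] -= d[:, 0].mean()   # discount a global horizontal disparity
            penalty = (d ** 2).sum()    # residual cross-view disagreement
            val = sl + sr - lam * penalty
            if val > best_val:
                best, best_val = (i, j), val
    return best, best_val
```

In a real pipeline the candidates would come from the top detections of a monocular estimator run on each view.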

    Vision for Social Robots: Human Perception and Pose Estimation

    In order to extract the underlying meaning from a scene captured from the surrounding world in a single still image, social robots will need to learn the human ability to detect different objects, understand their arrangement and relationships relative both to their own parts and to each other, and infer the dynamics under which they are evolving. Furthermore, they will need to develop and hold a notion of context to allow assigning different meanings (semantics) to the same visual configuration (syntax) of a scene. The underlying thread of this Thesis is the investigation of new ways for enabling interactions between social robots and humans, by advancing the visual perception capabilities of robots when they process images and videos in which humans are the main focus of attention. First, we analyze the general problem of scene understanding, as social robots moving through the world need to be able to interpret scenes without having been assigned a specific preset goal. Throughout this line of research, i) we observe that human actions and interactions which can be visually discriminated from an image follow a very heavy-tailed distribution; ii) we develop an algorithm that can obtain a spatial understanding of a scene by only using cues arising from the effect of perspective on a picture of a person’s face; and iii) we define a novel taxonomy of errors for the task of estimating the 2D body pose of people in images to better explain the behavior of algorithms and highlight their underlying causes of error. Second, we focus on the specific task of 3D human pose and motion estimation from monocular 2D images using weakly supervised training data, as accurately predicting human pose will open up the possibility of richer interactions between humans and social robots. 
    We show that when 3D ground-truth data is only available in small quantities, or not at all, it is possible to leverage knowledge about the physical properties of the human body, along with additional constraints related to alternative types of supervisory signals, to learn models that can regress the full 3D pose of the human body and predict its motions from monocular 2D images. Taken in its entirety, the intent of this Thesis is to highlight the importance of, and provide novel methodologies for, social robots' ability to interpret their surrounding environment, learn in a way that is robust to low data availability, and generalize previously observed behaviors to unknown situations in a similar way to humans.
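One common way to exploit "knowledge about the physical properties of the human body" in this weakly supervised setting is to penalize 3D predictions whose limb lengths deviate from known anthropometric values while still matching the observed 2D joints. The toy objective below illustrates the idea only; the orthographic camera, the weight `w`, and the input format are illustrative assumptions, not the methodology of the thesis:

```python
import numpy as np

def weak_supervision_loss(pred_3d, joints_2d, bones, bone_lengths, w=1.0):
    """Toy weakly supervised objective: 2D reprojection error under an
    assumed orthographic camera, plus a penalty keeping predicted bone
    lengths close to known limb lengths (the physical prior).
    `pred_3d` is (n_joints, 3), `joints_2d` is (n_joints, 2), and
    `bones` lists (parent, child) joint indices."""
    # Orthographic reprojection: drop the depth coordinate.
    reproj = ((pred_3d[:, :2] - joints_2d) ** 2).sum()
    length_err = 0.0
    for (p, c), length in zip(bones, bone_lengths):
        length_err += (np.linalg.norm(pred_3d[c] - pred_3d[p]) - length) ** 2
    return reproj + w * length_err
```

Such a loss needs no 3D ground truth: the 2D term uses ordinary keypoint annotations, while the bone-length term encodes the prior.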

    Deep learning for visual understanding

    With the dramatic growth of image data on the web, there is an increasing demand for algorithms capable of understanding visual information automatically. Deep learning, as one of the most significant breakthroughs, has brought revolutionary success in diverse visual applications, including image classification, object detection, image segmentation and image captioning. The purpose of this thesis is to explore and design new deep learning algorithms that improve visual understanding. To fulfill this, it focuses on two visual applications: image classification and image captioning. Image classification aims to classify images into pre-defined categories, helping to identify what objects the images contain. Image captioning attempts to generate a sentence describing an image; in addition to the objects, the generated sentence should also capture their actions and relations.

    From motion capture to interactive virtual worlds : towards unconstrained motion-capture algorithms for real-time performance-driven character animation

    This dissertation takes performance-driven character animation as a representative application and advances motion capture algorithms and animation methods to meet its high demands. Existing approaches either have coarse resolution and a restricted capture volume, require expensive and complex multi-camera systems, or use intrusive suits and controllers. For motion capture, set-up time is reduced using fewer cameras, accuracy is increased despite occlusions and general environments, initialization is automated, and free roaming is enabled by egocentric cameras. For animation, increased robustness enables the use of low-cost sensor input, custom control gesture definition is guided to support novice users, and animation expressiveness is increased. The important contributions are: 1) an analytic and differentiable visibility model for pose optimization under strong occlusions, 2) a volumetric contour model for automatic actor initialization in general scenes, 3) a method to annotate and augment image-pose databases automatically, 4) the utilization of unlabeled examples for character control, and 5) the generalization and disambiguation of cyclical gestures for faithful character animation. In summary, the whole process of human motion capture, processing, and application to animation is advanced. These advances to the state of the art have the potential to improve many interactive applications, within and outside virtual reality.
