6,979 research outputs found

    Deep Autoencoder for Combined Human Pose Estimation and body Model Upscaling

    Get PDF
    We present a method for simultaneously estimating 3D human pose and body shape from a sparse set of wide-baseline camera views. We train a symmetric convolutional autoencoder with a dual loss that enforces learning of a latent representation that encodes skeletal joint positions, and at the same time learns a deep representation of volumetric body shape. We harness the latter to up-scale input volumetric data by a factor of 4×4 \times, whilst recovering a 3D estimate of joint positions with equal or greater accuracy than the state of the art. Inference runs in real-time (25 fps) and has the potential for passive human behaviour monitoring where there is a requirement for high fidelity estimation of human body shape and pose

    Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation

    Full text link
    This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field. We show how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images. The architecture can exploit structural domain constraints such as geometric relationships between body joint locations. We show that joint training of these two model paradigms improves performance and allows us to significantly outperform existing state-of-the-art techniques

    3D point cloud video segmentation oriented to the analysis of interactions

    Get PDF
    Given the widespread availability of point cloud data from consumer depth sensors, 3D point cloud segmentation becomes a promising building block for high level applications such as scene understanding and interaction analysis. It benefits from the richer information contained in real world 3D data compared to 2D images. This also implies that the classical color segmentation challenges have shifted to RGBD data, and new challenges have also emerged as the depth information is usually noisy, sparse and unorganized. Meanwhile, the lack of 3D point cloud ground truth labeling also limits the development and comparison among methods in 3D point cloud segmentation. In this paper, we present two contributions: a novel graph based point cloud segmentation method for RGBD stream data with interacting objects and a new ground truth labeling for a previously published data set. This data set focuses on interaction (merge and split between ’object’ point clouds), which differentiates itself from the few existing labeled RGBD data sets which are more oriented to Simultaneous Localization And Mapping (SLAM) tasks. The proposed point cloud segmentation method is evaluated with the 3D point cloud ground truth labeling. Experiments show the promising result of our approach.Postprint (published version

    Bridging the gap between reconstruction and synthesis

    Get PDF
    Aplicat embargament des de la data de defensa fins el 15 de gener de 20223D reconstruction and image synthesis are two of the main pillars in computer vision. Early works focused on simple tasks such as multi-view reconstruction and texture synthesis. With the spur of Deep Learning, the field has rapidly progressed, making it possible to achieve more complex and high level tasks. For example, the 3D reconstruction results of traditional multi-view approaches are currently obtained with single view methods. Similarly, early pattern based texture synthesis works have resulted in techniques that allow generating novel high-resolution images. In this thesis we have developed a hierarchy of tools that cover all these range of problems, lying at the intersection of computer vision, graphics and machine learning. We tackle the problem of 3D reconstruction and synthesis in the wild. Importantly, we advocate for a paradigm in which not everything should be learned. Instead of applying Deep Learning naively we propose novel representations, layers and architectures that directly embed prior 3D geometric knowledge for the task of 3D reconstruction and synthesis. We apply these techniques to problems including scene/person reconstruction and photo-realistic rendering. We first address methods to reconstruct a scene and the clothed people in it while estimating the camera position. Then, we tackle image and video synthesis for clothed people in the wild. Finally, we bridge the gap between reconstruction and synthesis under the umbrella of a unique novel formulation. Extensive experiments conducted along this thesis show that the proposed techniques improve the performance of Deep Learning models in terms of the quality of the reconstructed 3D shapes / synthesised images, while reducing the amount of supervision and training data required to train them. In summary, we provide a variety of low, mid and high level algorithms that can be used to incorporate prior knowledge into different stages of the Deep Learning pipeline and improve performance in tasks of 3D reconstruction and image synthesis.La reconstrucció 3D i la síntesi d'imatges són dos dels pilars fonamentals en visió per computador. Els estudis previs es centren en tasques senzilles com la reconstrucció amb informació multi-càmera i la síntesi de textures. Amb l'aparició del "Deep Learning", aquest camp ha progressat ràpidament, fent possible assolir tasques molt més complexes. Per exemple, per obtenir una reconstrucció 3D, tradicionalment s'utilitzaven mètodes multi-càmera, en canvi ara, es poden obtenir a partir d'una sola imatge. De la mateixa manera, els primers treballs de síntesi de textures basats en patrons han donat lloc a tècniques que permeten generar noves imatges completes en alta resolució. En aquesta tesi, hem desenvolupat una sèrie d'eines que cobreixen tot aquest ventall de problemes, situats en la intersecció entre la visió per computador, els gràfics i l'aprenentatge automàtic. Abordem el problema de la reconstrucció i la síntesi 3D en el món real. És important destacar que defensem un paradigma on no tot s'ha d'aprendre. Enlloc d'aplicar el "Deep Learning" de forma naïve, proposem representacions novedoses i arquitectures que incorporen directament els coneixements geomètrics ja existents per a aconseguir la reconstrucció 3D i la síntesi d'imatges. Nosaltres apliquem aquestes tècniques a problemes com ara la reconstrucció d'escenes/persones i a la renderització d'imatges fotorealistes. Primer abordem els mètodes per reconstruir una escena, les persones vestides que hi ha i la posició de la càmera. A continuació, abordem la síntesi d'imatges i vídeos de persones vestides en situacions quotidianes. I finalment, aconseguim, a través d'una nova formulació única, connectar la reconstrucció amb la síntesi. Els experiments realitzats al llarg d'aquesta tesi demostren que les tècniques proposades milloren el rendiment dels models de "Deepp Learning" pel que fa a la qualitat de les reconstruccions i les imatges sintetitzades alhora que redueixen la quantitat de dades necessàries per entrenar-los. En resum, proporcionem una varietat d'algoritmes de baix, mitjà i alt nivell que es poden utilitzar per incorporar els coneixements previs a les diferents etapes del "Deep Learning" i millorar el rendiment en tasques de reconstrucció 3D i síntesi d'imatges.Postprint (published version

    VIBE: Video Inference for Human Body Pose and Shape Estimation

    Full text link
    Human motion is fundamental to understanding behavior. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose Video Inference for Body Pose and Shape Estimation (VIBE), which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a temporal network architecture and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. We perform extensive experimentation to analyze the importance of motion and demonstrate the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving state-of-the-art performance. Code and pretrained models are available at https://github.com/mkocabas/VIBE.Comment: CVPR-2020 camera ready. Code is available at https://github.com/mkocabas/VIB

    3DHumanGAN: 3D-Aware Human Image Generation with 3D Pose Mapping

    Full text link
    We present 3DHumanGAN, a 3D-aware generative adversarial network that synthesizes photorealistic images of full-body humans with consistent appearances under different view-angles and body-poses. To tackle the representational and computational challenges in synthesizing the articulated structure of human bodies, we propose a novel generator architecture in which a 2D convolutional backbone is modulated by a 3D pose mapping network. The 3D pose mapping network is formulated as a renderable implicit function conditioned on a posed 3D human mesh. This design has several merits: i) it leverages the strength of 2D GANs to produce high-quality images; ii) it generates consistent images under varying view-angles and poses; iii) the model can incorporate the 3D human prior and enable pose conditioning. Project page: https://3dhumangan.github.io/.Comment: 9 pages, 8 figure
    • …
    corecore