7,057 research outputs found

    Learning to Predict Diverse Human Motions from a Single Image via Mixture Density Networks

    Full text link
    Human motion prediction, which plays a key role in computer vision, generally requires a past motion sequence as input. However, in real applications, a complete and correct past motion sequence can be too expensive to achieve. In this paper, we propose a novel approach to predicting future human motions from a much weaker condition, i.e., a single image, with mixture density networks (MDN) modeling. Contrary to most existing deep human motion prediction approaches, the multimodal nature of MDN enables the generation of diverse future motion hypotheses, which well compensates for the strong stochastic ambiguity aggregated by the single input and human motion uncertainty. In designing the loss function, we further introduce the energy-based formulation to flexibly impose prior losses over the learnable parameters of MDN to maintain motion coherence as well as improve the prediction accuracy by customizing the energy functions. Our trained model directly takes an image as input and generates multiple plausible motions that satisfy the given condition. Extensive experiments on two standard benchmark datasets demonstrate the effectiveness of our method in terms of prediction diversity and accuracy

    DiffuPose: Monocular 3D Human Pose Estimation via Denoising Diffusion Probabilistic Model

    Full text link
    Thanks to the development of 2D keypoint detectors, monocular 3D human pose estimation (HPE) via 2D-to-3D uplifting approaches have achieved remarkable improvements. Still, monocular 3D HPE is a challenging problem due to the inherent depth ambiguities and occlusions. To handle this problem, many previous works exploit temporal information to mitigate such difficulties. However, there are many real-world applications where frame sequences are not accessible. This paper focuses on reconstructing a 3D pose from a single 2D keypoint detection. Rather than exploiting temporal information, we alleviate the depth ambiguity by generating multiple 3D pose candidates which can be mapped to an identical 2D keypoint. We build a novel diffusion-based framework to effectively sample diverse 3D poses from an off-the-shelf 2D detector. By considering the correlation between human joints by replacing the conventional denoising U-Net with graph convolutional network, our approach accomplishes further performance improvements. We evaluate our method on the widely adopted Human3.6M and HumanEva-I datasets. Comprehensive experiments are conducted to prove the efficacy of the proposed method, and they confirm that our model outperforms state-of-the-art multi-hypothesis 3D HPE methods

    MHR-Net: Multiple-Hypothesis Reconstruction of Non-Rigid Shapes from 2D Views

    Full text link
    We propose MHR-Net, a novel method for recovering Non-Rigid Shapes from Motion (NRSfM). MHR-Net aims to find a set of reasonable reconstructions for a 2D view, and it also selects the most likely reconstruction from the set. To deal with the challenging unsupervised generation of non-rigid shapes, we develop a new Deterministic Basis and Stochastic Deformation scheme in MHR-Net. The non-rigid shape is first expressed as the sum of a coarse shape basis and a flexible shape deformation, then multiple hypotheses are generated with uncertainty modeling of the deformation part. MHR-Net is optimized with reprojection loss on the basis and the best hypothesis. Furthermore, we design a new Procrustean Residual Loss, which reduces the rigid rotations between similar shapes and further improves the performance. Experiments show that MHR-Net achieves state-of-the-art reconstruction accuracy on Human3.6M, SURREAL and 300-VW datasets.Comment: Accepted to ECCV 202

    Generative Approach for Probabilistic Human Mesh Recovery using Diffusion Models

    Full text link
    This work focuses on the problem of reconstructing a 3D human body mesh from a given 2D image. Despite the inherent ambiguity of the task of human mesh recovery, most existing works have adopted a method of regressing a single output. In contrast, we propose a generative approach framework, called "Diffusion-based Human Mesh Recovery (Diff-HMR)" that takes advantage of the denoising diffusion process to account for multiple plausible outcomes. During the training phase, the SMPL parameters are diffused from ground-truth parameters to random distribution, and Diff-HMR learns the reverse process of this diffusion. In the inference phase, the model progressively refines the given random SMPL parameters into the corresponding parameters that align with the input image. Diff-HMR, being a generative approach, is capable of generating diverse results for the same input image as the input noise varies. We conduct validation experiments, and the results demonstrate that the proposed framework effectively models the inherent ambiguity of the task of human mesh recovery in a probabilistic manner. The code is available at https://github.com/hanbyel0105/Diff-HMRComment: Accepted to ICCV 2023 CV4Metaverse Worksho

    GloPro: Globally-Consistent Uncertainty-Aware 3D Human Pose Estimation & Tracking in the Wild

    Full text link
    An accurate and uncertainty-aware 3D human body pose estimation is key to enabling truly safe but efficient human-robot interactions. Current uncertainty-aware methods in 3D human pose estimation are limited to predicting the uncertainty of the body posture, while effectively neglecting the body shape and root pose. In this work, we present GloPro, which to the best of our knowledge the first framework to predict an uncertainty distribution of a 3D body mesh including its shape, pose, and root pose, by efficiently fusing visual clues with a learned motion model. We demonstrate that it vastly outperforms state-of-the-art methods in terms of human trajectory accuracy in a world coordinate system (even in the presence of severe occlusions), yields consistent uncertainty distributions, and can run in real-time.Comment: IEEE International Conference on Intelligent Robots and Systems (IROS) 202

    Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks

    Full text link
    Sign languages are multi-channel visual languages, where signers use a continuous 3D space to communicate.Sign Language Production (SLP), the automatic translation from spoken to sign languages, must embody both the continuous articulation and full morphology of sign to be truly understandable by the Deaf community. Previous deep learning-based SLP works have produced only a concatenation of isolated signs focusing primarily on the manual features, leading to a robotic and non-expressive production. In this work, we propose a novel Progressive Transformer architecture, the first SLP model to translate from spoken language sentences to continuous 3D multi-channel sign pose sequences in an end-to-end manner. Our transformer network architecture introduces a counter decoding that enables variable length continuous sequence generation by tracking the production progress over time and predicting the end of sequence. We present extensive data augmentation techniques to reduce prediction drift, alongside an adversarial training regime and a Mixture Density Network (MDN) formulation to produce realistic and expressive sign pose sequences. We propose a back translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging PHOENIX14T dataset and setting baselines for future research. We further provide a user evaluation of our SLP model, to understand the Deaf reception of our sign pose productions

    3D hand pose estimation using convolutional neural networks

    Get PDF
    3D hand pose estimation plays a fundamental role in natural human computer interactions. The problem is challenging due to complicated variations caused by complex articulations, multiple viewpoints, self-similar parts, severe self-occlusions, different shapes and sizes. To handle these challenges, the thesis makes the following contributions. First, the problem of the multiple viewpoints and complex articulations of hand pose estimation is tackled by decomposing and transforming the input and output space by spatial transformations following the hand structure. By the transformation, both the variation of the input space and output is reduced, which makes the learning easier. The second contribution is a probabilistic framework integrating all the hierarchical regressions. Variants with/without sampling, using different regressors and optimization methods are constructed and compared to provide an insight of the components under this framework. The third contribution is based on the observation that for images with occlusions, there exist multiple plausible configurations for the occluded parts. A hierarchical mixture density network is proposed to handle the multi-modality of the locations for occluded hand joints. It leverages the state-of-the-art hand pose estimators based on Convolutional Neural Networks to facilitate feature learning while models the multiple modes in a two-level hierarchy to reconcile single-valued (for visible joints) and multi-valued (for occluded joints) mapping in its output. In addition, a complete labeled real hand datasets is collected by a tracking system with six 6D magnetic sensors and inverse kinematics to automatically obtain 21-joints hand pose annotations of depth maps.Open Acces

    Efficient Belief Propagation for Perception and Manipulation in Clutter

    Full text link
    Autonomous service robots are required to perform tasks in common human indoor environments. To achieve goals associated with these tasks, the robot should continually perceive, reason its environment, and plan to manipulate objects, which we term as goal-directed manipulation. Perception remains the most challenging aspect of all stages, as common indoor environments typically pose problems in recognizing objects under inherent occlusions with physical interactions among themselves. Despite recent progress in the field of robot perception, accommodating perceptual uncertainty due to partial observations remains challenging and needs to be addressed to achieve the desired autonomy. In this dissertation, we address the problem of perception under uncertainty for robot manipulation in cluttered environments using generative inference methods. Specifically, we aim to enable robots to perceive partially observable environments by maintaining an approximate probability distribution as a belief over possible scene hypotheses. This belief representation captures uncertainty resulting from inter-object occlusions and physical interactions, which are inherently present in clutterred indoor environments. The research efforts presented in this thesis are towards developing appropriate state representations and inference techniques to generate and maintain such belief over contextually plausible scene states. We focus on providing the following features to generative inference while addressing the challenges due to occlusions: 1) generating and maintaining plausible scene hypotheses, 2) reducing the inference search space that typically grows exponentially with respect to the number of objects in a scene, 3) preserving scene hypotheses over continual observations. To generate and maintain plausible scene hypotheses, we propose physics informed scene estimation methods that combine a Newtonian physics engine within a particle based generative inference framework. The proposed variants of our method with and without a Monte Carlo step showed promising results on generating and maintaining plausible hypotheses under complete occlusions. We show that estimating such scenarios would not be possible by the commonly adopted 3D registration methods without the notion of a physical context that our method provides. To scale up the context informed inference to accommodate a larger number of objects, we describe a factorization of scene state into object and object-parts to perform collaborative particle-based inference. This resulted in the Pull Message Passing for Nonparametric Belief Propagation (PMPNBP) algorithm that caters to the demands of the high-dimensional multimodal nature of cluttered scenes while being computationally tractable. We demonstrate that PMPNBP is orders of magnitude faster than the state-of-the-art Nonparametric Belief Propagation method. Additionally, we show that PMPNBP successfully estimates poses of articulated objects under various simulated occlusion scenarios. To extend our PMPNBP algorithm for tracking object states over continuous observations, we explore ways to propose and preserve hypotheses effectively over time. This resulted in an augmentation-selection method, where hypotheses are drawn from various proposals followed by the selection of a subset using PMPNBP that explained the current state of the objects. We discuss and analyze our augmentation-selection method with its counterparts in belief propagation literature. Furthermore, we develop an inference pipeline for pose estimation and tracking of articulated objects in clutter. In this pipeline, the message passing module with the augmentation-selection method is informed by segmentation heatmaps from a trained neural network. In our experiments, we show that our proposed pipeline can effectively maintain belief and track articulated objects over a sequence of observations under occlusion.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/163159/1/kdesingh_1.pd
    • …
    corecore