7,057 research outputs found
Learning to Predict Diverse Human Motions from a Single Image via Mixture Density Networks
Human motion prediction, which plays a key role in computer vision, generally
requires a past motion sequence as input. However, in real applications, a
complete and correct past motion sequence can be too expensive to obtain. In
this paper, we propose a novel approach to predicting future human motions from
a much weaker condition, i.e., a single image, with mixture density network
(MDN) modeling. In contrast to most existing deep human motion prediction
approaches, the multimodal nature of the MDN enables the generation of diverse
future motion hypotheses, which compensates for the strong stochastic
ambiguity introduced by the single-image input and the inherent uncertainty of
human motion. In designing the loss function, we further introduce an
energy-based formulation that flexibly imposes prior losses over the learnable
parameters of the MDN, maintaining motion coherence and improving prediction
accuracy through customized energy functions. Our trained model directly takes
an image as input and generates multiple plausible motions that satisfy the
given condition. Extensive experiments on two standard benchmark datasets
demonstrate the effectiveness of our method in terms of both prediction
diversity and accuracy.
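To make the multimodal idea concrete, here is a minimal, hedged sketch of how a mixture density head can emit several distinct hypotheses for one input. It is not the authors' implementation; the parameter shapes and the two-mode toy values are assumptions for illustration only:

```python
import numpy as np

def sample_mdn(pi, mu, sigma, n_samples, rng=None):
    """Draw diverse hypotheses from a Gaussian mixture density.

    pi:    (K,) mixture weights, summing to 1
    mu:    (K, D) component means (e.g. future pose parameters)
    sigma: (K, D) per-dimension standard deviations
    """
    rng = np.random.default_rng(rng)
    K = pi.shape[0]
    # Pick a mixture component per sample, then perturb around its mean.
    comps = rng.choice(K, size=n_samples, p=pi)
    eps = rng.standard_normal((n_samples, mu.shape[1]))
    return mu[comps] + sigma[comps] * eps

# Two well-separated modes stand in for two plausible future motions.
pi = np.array([0.5, 0.5])
mu = np.array([[-5.0, 0.0], [5.0, 0.0]])
sigma = np.full((2, 2), 0.1)
hyps = sample_mdn(pi, mu, sigma, n_samples=100, rng=0)
```

A single-output regressor would average the two modes into an implausible middle; sampling from the mixture instead returns hypotheses near both modes.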
DiffuPose: Monocular 3D Human Pose Estimation via Denoising Diffusion Probabilistic Model
Thanks to the development of 2D keypoint detectors, monocular 3D human pose
estimation (HPE) via 2D-to-3D uplifting approaches has achieved remarkable
improvements. Still, monocular 3D HPE is a challenging problem due to inherent
depth ambiguities and occlusions. To handle this problem, many previous works
exploit temporal information to mitigate such difficulties. However, there are
many real-world applications where frame sequences are not accessible. This
paper focuses on reconstructing a 3D pose from a single 2D keypoint detection.
Rather than exploiting temporal information, we alleviate the depth ambiguity
by generating multiple 3D pose candidates that map to an identical 2D
keypoint. We build a novel diffusion-based framework to effectively sample
diverse 3D poses from an off-the-shelf 2D detector. By replacing the
conventional denoising U-Net with a graph convolutional network that captures
the correlations between human joints, our approach achieves further
performance improvements. We evaluate our method on the widely adopted
Human3.6M and HumanEva-I datasets. Comprehensive experiments are conducted to
prove the efficacy of the proposed method, and they confirm that our model
outperforms state-of-the-art multi-hypothesis 3D HPE methods.
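The depth ambiguity the abstract exploits can be shown in a few lines: under a pinhole camera, every depth along a pixel's viewing ray yields a 3D point that projects back to the same 2D keypoint. This is an illustrative sketch, not DiffuPose itself; the focal length and depths are made-up values:

```python
import numpy as np

def backproject_candidates(uv, f, depths):
    """Lift one 2D keypoint to several 3D candidates along its camera ray.
    Every candidate projects to the same pixel (u, v) under a pinhole
    camera with focal length f -- the ambiguity that multi-hypothesis
    methods resolve by sampling many plausible 3D poses."""
    u, v = uv
    return np.array([[u * z / f, v * z / f, z] for z in depths])

def project(xyz, f):
    """Pinhole projection back to 2D: (x, y, z) -> (f*x/z, f*y/z)."""
    return xyz[:, :2] * f / xyz[:, 2:3]

cands = backproject_candidates((100.0, 50.0), f=1000.0, depths=[2.0, 3.0, 5.0])
uv_back = project(cands, f=1000.0)
```

All three 3D candidates reproject exactly to (100, 50), so 2D evidence alone cannot distinguish them.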
MHR-Net: Multiple-Hypothesis Reconstruction of Non-Rigid Shapes from 2D Views
We propose MHR-Net, a novel method for recovering non-rigid shapes via
Non-Rigid Structure-from-Motion (NRSfM). MHR-Net aims to find a set of reasonable reconstructions for a
2D view, and it also selects the most likely reconstruction from the set. To
deal with the challenging unsupervised generation of non-rigid shapes, we
develop a new Deterministic Basis and Stochastic Deformation scheme in MHR-Net.
The non-rigid shape is first expressed as the sum of a coarse shape basis and a
flexible shape deformation, then multiple hypotheses are generated with
uncertainty modeling of the deformation part. MHR-Net is optimized with
reprojection loss on the basis and the best hypothesis. Furthermore, we design
a new Procrustean Residual Loss, which reduces the rigid rotations between
similar shapes and further improves the performance. Experiments show that
MHR-Net achieves state-of-the-art reconstruction accuracy on Human3.6M, SURREAL
and 300-VW datasets.
Comment: Accepted to ECCV 202
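The Procrustean idea of removing rigid rotations between similar shapes can be sketched with the classic orthogonal Procrustes solution. This is a generic alignment sketch, not the paper's exact loss; the SVD-based rotation estimate and the toy shapes are assumptions:

```python
import numpy as np

def procrustes_rotation(X, Y):
    """Best rotation R aligning shape X (N,3) to Y (N,3) in least squares,
    via the orthogonal Procrustes solution: SVD of the cross-covariance."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])   # guard against reflections
    return U @ D @ Vt

def procrustean_residual(X, Y):
    """Shape difference left after factoring out the rigid rotation."""
    R = procrustes_rotation(X, Y)
    return np.linalg.norm(X @ R.T - Y)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
theta = 0.7
R0 = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
Y = X @ R0.T   # Y is X rotated; no non-rigid deformation
```

Because Y differs from X only by a rotation, the residual is (numerically) zero: a loss built on this quantity penalizes genuine deformation while ignoring rigid motion.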
Generative Approach for Probabilistic Human Mesh Recovery using Diffusion Models
This work focuses on the problem of reconstructing a 3D human body mesh from
a given 2D image. Despite the inherent ambiguity of the task of human mesh
recovery, most existing works have adopted a method of regressing a single
output. In contrast, we propose a generative framework, called
"Diffusion-based Human Mesh Recovery (Diff-HMR)", that takes advantage of the
denoising diffusion process to account for multiple plausible outcomes. During
the training phase, the SMPL parameters are diffused from the ground-truth
parameters to a random distribution, and Diff-HMR learns the reverse process of
this diffusion. In the inference phase, the model progressively refines the
given random SMPL parameters into the corresponding parameters that align with
the input image. Diff-HMR, being a generative approach, is capable of
generating diverse results for the same input image as the input noise varies.
We conduct validation experiments, and the results demonstrate that the
proposed framework effectively models the inherent ambiguity of the task of
human mesh recovery in a probabilistic manner. The code is available at
https://github.com/hanbyel0105/Diff-HMR
Comment: Accepted to ICCV 2023 CV4Metaverse Workshop
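The forward half of the process the abstract describes, diffusing clean parameters toward noise, has a simple closed form. Below is a minimal sketch assuming a linear noise schedule; the 72-dimensional zero vector merely stands in for SMPL pose parameters and is not taken from the paper:

```python
import numpy as np

def diffuse(x0, t, betas, rng=None):
    """Forward diffusion q(x_t | x_0): blend clean parameters x0 with
    Gaussian noise according to the cumulative schedule alpha_bar.
    Returns the noised sample and the noise (the target a denoiser
    would learn to predict in the reverse process)."""
    rng = np.random.default_rng(rng)
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

betas = np.linspace(1e-4, 0.02, 1000)   # linear schedule (an assumption)
x0 = np.zeros(72)                       # stand-in for SMPL pose parameters
xt, eps = diffuse(x0, t=999, betas=betas, rng=0)
```

At the final step the signal is essentially gone and x_t is almost pure noise; inference then runs this process in reverse, so different starting noise yields different plausible meshes for the same image.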
GloPro: Globally-Consistent Uncertainty-Aware 3D Human Pose Estimation & Tracking in the Wild
An accurate and uncertainty-aware 3D human body pose estimation is key to
enabling truly safe but efficient human-robot interactions. Current
uncertainty-aware methods in 3D human pose estimation are limited to predicting
the uncertainty of the body posture, while effectively neglecting the body
shape and root pose. In this work, we present GloPro, which is, to the best of
our knowledge, the first framework to predict an uncertainty distribution of a
3D body mesh, including its shape, pose, and root pose, by efficiently fusing
visual cues with a learned motion model. We demonstrate that it vastly
outperforms state-of-the-art methods in terms of human trajectory accuracy in a
world coordinate system (even in the presence of severe occlusions), yields
consistent uncertainty distributions, and can run in real-time.
Comment: IEEE International Conference on Intelligent Robots and Systems
(IROS) 202
Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks
Sign languages are multi-channel visual languages, where signers use a
continuous 3D space to communicate. Sign Language Production (SLP), the
automatic translation from spoken to sign languages, must embody both the
continuous articulation and full morphology of sign to be truly understandable
by the Deaf community. Previous deep learning-based SLP works have produced
only a concatenation of isolated signs focusing primarily on the manual
features, leading to a robotic and non-expressive production.
In this work, we propose a novel Progressive Transformer architecture, the
first SLP model to translate from spoken language sentences to continuous 3D
multi-channel sign pose sequences in an end-to-end manner. Our transformer
network architecture introduces a counter decoding that enables variable length
continuous sequence generation by tracking the production progress over time
and predicting the end of sequence. We present extensive data augmentation
techniques to reduce prediction drift, alongside an adversarial training regime
and a Mixture Density Network (MDN) formulation to produce realistic and
expressive sign pose sequences.
We propose a back translation evaluation mechanism for SLP, presenting
benchmark quantitative results on the challenging PHOENIX14T dataset and
setting baselines for future research. We further provide a user evaluation of
our SLP model, to understand the Deaf reception of our sign pose productions.
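The training side of an MDN formulation like the one above comes down to a mixture negative log-likelihood. Here is a hedged one-dimensional sketch (real pose sequences are high-dimensional, and this is not the paper's code); the loss rewards any component that explains the target, which lets the network keep several distinct modes instead of averaging them:

```python
import math

def mdn_nll(pi, mu, sigma, y):
    """Negative log-likelihood of target y under a 1-D Gaussian mixture.
    pi, mu, sigma are per-component lists of weights, means, and stds."""
    density = 0.0
    for p, m, s in zip(pi, mu, sigma):
        # weighted Gaussian density of component (m, s) at y
        density += p * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return -math.log(density)

# Bimodal toy mixture: two plausible pose values at -1 and +1.
pi, mu, sigma = [0.5, 0.5], [-1.0, 1.0], [0.1, 0.1]
```

A target sitting on either mode gets a low loss, while the midpoint between modes, exactly what a unimodal regressor would predict, is heavily penalized.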
3D hand pose estimation using convolutional neural networks
3D hand pose estimation plays a fundamental role in natural human-computer interaction. The problem is challenging due to complicated variations caused by complex articulations, multiple viewpoints, self-similar parts, severe self-occlusions, and different shapes and sizes.
To handle these challenges, the thesis makes the following contributions. First, the problem of multiple viewpoints and complex articulations in hand pose estimation is tackled by decomposing and transforming the input and output spaces with spatial transformations that follow the hand structure. This transformation reduces the variation of both the input and output spaces, which makes learning easier.
The second contribution is a probabilistic framework integrating all the hierarchical regressions. Variants with and without sampling, using different regressors and optimization methods, are constructed and compared to provide insight into the components of this framework.
The third contribution is based on the observation that for images with occlusions, there exist multiple plausible configurations for the occluded parts.
A hierarchical mixture density network is proposed to handle the multi-modality of the locations of occluded hand joints. It leverages state-of-the-art hand pose estimators based on Convolutional Neural Networks to facilitate feature learning, while modeling the multiple modes in a two-level hierarchy to reconcile the single-valued (for visible joints) and multi-valued (for occluded joints) mappings in its output.
In addition, a completely labeled real hand dataset is collected by a tracking system with six 6D magnetic sensors and inverse kinematics, automatically obtaining 21-joint hand pose annotations for depth maps.
Efficient Belief Propagation for Perception and Manipulation in Clutter
Autonomous service robots are required to perform tasks in common human indoor environments. To achieve the goals associated with these tasks, the robot should continually perceive and reason about its environment, and plan to manipulate objects, which we term goal-directed manipulation. Perception remains the most challenging stage, as common indoor environments typically pose problems in recognizing objects under inherent occlusions and physical interactions among the objects themselves. Despite recent progress in the field of robot perception, accommodating perceptual uncertainty due to partial observations remains challenging and needs to be addressed to achieve the desired autonomy.
In this dissertation, we address the problem of perception under uncertainty for robot manipulation in cluttered environments using generative inference methods. Specifically, we aim to enable robots to perceive partially observable environments by maintaining an approximate probability distribution as a belief over possible scene hypotheses. This belief representation captures uncertainty resulting from inter-object occlusions and physical interactions, which are inherently present in cluttered indoor environments. The research efforts presented in this thesis are towards developing appropriate state representations and inference techniques to generate and maintain such a belief over contextually plausible scene states. We focus on providing the following features to generative inference while addressing the challenges due to occlusions: 1) generating and maintaining plausible scene hypotheses, 2) reducing the inference search space that typically grows exponentially with respect to the number of objects in a scene, 3) preserving scene hypotheses over continual observations.
To generate and maintain plausible scene hypotheses, we propose physics-informed scene estimation methods that combine a Newtonian physics engine within a particle-based generative inference framework. The proposed variants of our method, with and without a Monte Carlo step, showed promising results in generating and maintaining plausible hypotheses under complete occlusions. We show that estimating such scenarios would not be possible with the commonly adopted 3D registration methods, which lack the notion of a physical context that our method provides.
To scale up the context informed inference to accommodate a larger number of objects, we describe a factorization of scene state into object and object-parts to perform collaborative particle-based inference. This resulted in the Pull Message Passing for Nonparametric Belief Propagation (PMPNBP) algorithm that caters to the demands of the high-dimensional multimodal nature of cluttered scenes while being computationally tractable. We demonstrate that PMPNBP is orders of magnitude faster than the state-of-the-art Nonparametric Belief Propagation method. Additionally, we show that PMPNBP successfully estimates poses of articulated objects under various simulated occlusion scenarios.
To extend our PMPNBP algorithm to tracking object states over continuous observations, we explore ways to propose and preserve hypotheses effectively over time. This resulted in an augmentation-selection method, where hypotheses are drawn from various proposals, followed by the selection of a subset using PMPNBP that explains the current state of the objects. We discuss and analyze our augmentation-selection method against its counterparts in the belief propagation literature. Furthermore, we develop an inference pipeline for pose estimation and tracking of articulated objects in clutter. In this pipeline, the message passing module with the augmentation-selection method is informed by segmentation heatmaps from a trained neural network. In our experiments, we show that our proposed pipeline can effectively maintain belief and track articulated objects over a sequence of observations under occlusion.
PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163159/1/kdesingh_1.pd
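The "pull" direction of the message passing can be illustrated in miniature: rather than pushing samples from the sender to the receiver, samples already drawn at the receiver are weighted by how well the sender's weighted particles support them through the pairwise potential. This is a hedged 1-D sketch of that weighting step only, not the full PMPNBP algorithm (which operates over factored object-part graphs); the Gaussian potential and toy particles are assumptions:

```python
import numpy as np

def pull_message(receiver_samples, sender_particles, sender_weights, pair_potential):
    """Weight each receiver sample by the sender's belief, marginalizing
    over the sender's weighted particles, then renormalize."""
    w = np.array([
        sum(ws * pair_potential(x_s, x_t)
            for x_s, ws in zip(sender_particles, sender_weights))
        for x_t in receiver_samples
    ])
    return w / w.sum()

# Toy example: the sender believes its state is near 0; the pairwise
# potential prefers receiver states close to the sender's state.
potential = lambda xs, xt: np.exp(-0.5 * (xs - xt) ** 2)
weights = pull_message(
    receiver_samples=[0.1, 5.0],
    sender_particles=[-0.2, 0.0, 0.2],
    sender_weights=[1 / 3, 1 / 3, 1 / 3],
    pair_potential=potential,
)
```

The receiver sample consistent with the sender's belief (0.1) ends up with nearly all of the normalized weight, while the inconsistent one (5.0) is suppressed.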