Depth Enhancement and Surface Reconstruction with RGB/D Sequence
Surface reconstruction and 3D modeling are challenging tasks that have been explored for decades by the computer vision, computer graphics, and machine learning communities. They are fundamental to many applications such as robot navigation, animation and scene understanding, industrial control, and medical diagnosis. In this dissertation, I take advantage of consumer depth sensors for surface reconstruction. Considering their limited ability to capture detailed surface geometry, a depth enhancement approach is first proposed to recover small, rich geometric details from the captured depth and color sequences. In addition to enhancing the spatial resolution, I present a hybrid camera to improve the temporal resolution of the consumer depth sensor and propose an optimization framework to capture high-speed motion and generate high-speed depth streams. Given the partial scans from the depth sensor, we also develop a novel fusion approach that builds complete and watertight human models with a template-guided registration method. Finally, the problem of surface reconstruction for non-Lambertian objects, on which current depth sensors fail, is addressed by exploiting multi-view images captured with a hand-held color camera, and we propose a visual-hull-based approach to recover the 3D model.
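As one concrete, deliberately simplified illustration of color-guided depth enhancement, the sketch below upsamples a low-resolution depth map with a joint bilateral filter driven by a high-resolution color image. This is a generic baseline, not the dissertation's actual algorithm; the window radius, the two Gaussian sigmas, and the synthetic inputs are illustrative assumptions.

```python
# Minimal sketch of color-guided depth upsampling via a joint bilateral filter.
# Generic baseline, not the dissertation's method; parameters are illustrative.
import numpy as np

def joint_bilateral_upsample(depth_lr, color_hr, scale, radius=3,
                             sigma_spatial=2.0, sigma_range=0.1):
    """Upsample a low-res depth map guided by a high-res color image."""
    h, w = color_hr.shape[:2]
    depth_hr = np.zeros((h, w), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            num, den = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if not (0 <= yy < h and 0 <= xx < w):
                        continue
                    # Low-res depth sample corresponding to the high-res neighbour.
                    d = depth_lr[min(yy // scale, depth_lr.shape[0] - 1),
                                 min(xx // scale, depth_lr.shape[1] - 1)]
                    w_s = np.exp(-(dy * dy + dx * dx) / (2 * sigma_spatial ** 2))
                    diff = color_hr[y, x] - color_hr[yy, xx]
                    w_r = np.exp(-np.dot(diff, diff) / (2 * sigma_range ** 2))
                    num += w_s * w_r * d
                    den += w_s * w_r
            depth_hr[y, x] = num / max(den, 1e-8)
    return depth_hr

# Toy example with random data: 16x16 depth guided by a 64x64 color image.
depth_lr = np.random.rand(16, 16)
color_hr = np.random.rand(64, 64, 3)
depth_hr = joint_bilateral_upsample(depth_lr, color_hr, scale=4)
print(depth_hr.shape)  # (64, 64)
```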
Multiple depth maps integration for 3D reconstruction using geodesic graph cuts
Depth images, in particular depth maps estimated from stereo vision, may contain a substantial number of outliers, resulting in inaccurate 3D modelling and reconstruction. To address this challenging issue, in this paper a graph-cut based multiple depth maps integration approach is proposed to obtain smooth and watertight surfaces. First, confidence maps for the depth images are estimated to suppress noise, based on which reliable patches covering the object surface are determined. These patches are then exploited to estimate the path weight for 3D geodesic distance computation, where an adaptive regional term is introduced to deal with the “shorter-cuts” problem caused by the minimal surface bias. Finally, the adaptive regional term and the boundary term constructed from the patches are combined in the graph-cut framework for more accurate and smoother 3D modelling. We demonstrate the superior performance of our algorithm on the well-known Middlebury multi-view database as well as on real-world multiple depth images captured by Kinect. The experimental results show that our method preserves object protrusions and details while maintaining surface smoothness.
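The combination of a regional (unary) term and a boundary (pairwise) term in a graph cut can be illustrated with a toy labelling problem. The sketch below labels a short scanline of voxels as inside or outside the surface via a minimum cut computed with networkx; all weights are made up for the example, and the paper's geodesic path weights and adaptive regional term are not reproduced.

```python
# Toy inside/outside voxel labelling with a graph cut combining a regional
# term (terminal links) and a boundary term (neighbour links).
# Weights below are hypothetical; this is not the paper's formulation.
import networkx as nx

regional = [2.0, 1.5, 0.2, -0.3, -1.8, -2.5]   # hypothetical unary evidence (>0 favours "inside")
boundary = 0.8                                  # smoothness penalty between neighbours

G = nx.DiGraph()
for i, r in enumerate(regional):
    # t-links: capacity to the source encodes the inside preference,
    # capacity to the sink encodes the outside preference.
    G.add_edge('src', i, capacity=max(r, 0.0))
    G.add_edge(i, 'sink', capacity=max(-r, 0.0))
for i in range(len(regional) - 1):
    # n-links: the boundary term penalises cutting between neighbours.
    G.add_edge(i, i + 1, capacity=boundary)
    G.add_edge(i + 1, i, capacity=boundary)

cut_value, (inside, outside) = nx.minimum_cut(G, 'src', 'sink')
print(sorted(v for v in inside if v != 'src'))   # voxels labelled "inside"
```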
A Generative Human-Robot Motion Retargeting Approach Using a Single RGBD Sensor
The goal of human-robot motion retargeting is to let a robot follow the movements performed by a human subject. In previous approaches, the human poses are typically precomputed by a human pose tracking system, after which explicit joint mapping strategies are specified to apply the estimated poses to a target robot. However, there is no generic mapping strategy that maps human joints to robots with different kinds of configurations. In this paper, we present a novel motion retargeting approach that combines human pose estimation and the motion retargeting procedure in a unified generative framework without relying on any explicit mapping. First, a 3D parametric human-robot (HUMROB) model is proposed which has the same joint and stability configurations as the target robot while its shape conforms to the source human subject. The robot configurations, including its skeleton proportions, joint limitations, and DoFs, are enforced in the HUMROB model and preserved during the tracking procedure. Using a single RGBD camera to monitor the human pose, we take the raw RGB and depth sequence as input. The HUMROB model is deformed to fit the input point cloud, from which the joint angles of the model are calculated and applied to the target robot for retargeting. In this way, instead of being fitted individually for each joint, the robot joint angles are fitted globally so that the surface of the deformed model is as consistent as possible with the input point cloud. In the end, no explicit or pre-defined joint mapping strategies are needed. To demonstrate its effectiveness for human-robot motion retargeting, the approach is tested both in simulation and on real robots whose skeleton configurations and joint degrees of freedom (DoFs) differ considerably from those of the source human subjects.
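The key idea of fitting the robot's joint angles globally to an observed point cloud, with joint limits enforced, can be illustrated on a much-simplified 2D kinematic chain. The sketch below optimizes the chain's joint angles so that points on the chain match a synthetic target cloud, with joint limits as bound constraints; the link lengths, joint limits, and cost function are hypothetical, and the HUMROB model and its deformation are not reproduced.

```python
# Simplified 2-D analogue of "fit globally, read off joint angles":
# optimise the joint angles of a planar chain to match a target point cloud.
import numpy as np
from scipy.optimize import minimize

LINK_LENGTHS = [1.0, 0.8, 0.6]                 # hypothetical robot links

def chain_points(angles, samples_per_link=10):
    """Points along a planar kinematic chain with the given joint angles."""
    pts, base, theta = [], np.zeros(2), 0.0
    for length, a in zip(LINK_LENGTHS, angles):
        theta += a
        tip = base + length * np.array([np.cos(theta), np.sin(theta)])
        for t in np.linspace(0.0, 1.0, samples_per_link):
            pts.append(base + t * (tip - base))
        base = tip
    return np.array(pts)

def fit_cost(angles, target):
    """Sum of nearest-neighbour distances from model points to the target cloud."""
    model = chain_points(angles)
    d = np.linalg.norm(model[:, None, :] - target[None, :, :], axis=2)
    return d.min(axis=1).sum()

# Synthetic "observation": a cloud sampled from an unknown pose plus noise.
true_angles = np.array([0.4, -0.6, 0.9])
cloud = chain_points(true_angles) + 0.01 * np.random.randn(30, 2)

joint_limits = [(-np.pi / 2, np.pi / 2)] * 3    # assumed joint limits
res = minimize(fit_cost, x0=np.zeros(3), args=(cloud,),
               bounds=joint_limits, method='L-BFGS-B')
print('recovered joint angles:', np.round(res.x, 2))
```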
Event-based Human Pose Tracking by Spiking Spatiotemporal Transformer
The event camera, an emerging biologically inspired vision sensor for capturing motion dynamics, presents new potential for 3D human pose tracking, or video-based 3D human pose estimation. However, existing works in pose tracking either require additional gray-scale images to establish a solid starting pose, or ignore temporal dependencies altogether by collapsing segments of event streams into static event frames. Meanwhile, although the effectiveness of Artificial Neural Networks (ANNs, a.k.a. dense deep learning) has been showcased in many event-based tasks, the use of ANNs tends to neglect the fact that, compared to dense frame-based image sequences, the occurrence of events from an event camera is spatiotemporally much sparser. Motivated by these issues, we present in this paper a dedicated end-to-end sparse deep learning approach for event-based pose tracking: 1) to our knowledge this is the first time that 3D human pose tracking is obtained from events only, eliminating the need to access any frame-based images as part of the input; 2) our approach is based entirely upon the framework of Spiking Neural Networks (SNNs), consisting of a Spike-Element-Wise (SEW) ResNet and a novel Spiking Spatiotemporal Transformer; 3) a large-scale synthetic dataset, named SynEventHPD, is constructed that features a broad and diverse set of annotated 3D human motions as well as longer hours of event stream data. Empirical experiments demonstrate that, with superior performance over state-of-the-art (SOTA) ANN counterparts, our approach also achieves a significant computation reduction of 80% in FLOPs. Furthermore, our proposed method also outperforms SOTA SNNs on the regression task of human pose tracking. Our implementation is available at https://github.com/JimmyZou/HumanPoseTracking_SNN and the dataset will be released upon paper acceptance.
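To make the sparse, event-driven computation concrete, the sketch below runs a single layer of leaky integrate-and-fire (LIF) neurons over a binary spike tensor, which is the basic operation an SNN applies to event input. It is a generic LIF model, not the paper's SEW-ResNet or Spiking Spatiotemporal Transformer; the time constant, threshold, and random inputs are illustrative assumptions.

```python
# Generic leaky integrate-and-fire (LIF) layer over sparse binary event input.
# Illustrative only; not the paper's architecture.
import numpy as np

def lif_layer(spike_input, weights, tau=2.0, v_threshold=1.0):
    """Run T timesteps of LIF dynamics; returns the output spike train."""
    T, _ = spike_input.shape
    n_out = weights.shape[1]
    v = np.zeros(n_out)                      # membrane potentials
    out_spikes = np.zeros((T, n_out))
    for t in range(T):
        current = spike_input[t] @ weights   # synaptic input at this timestep
        v = v + (current - v) / tau          # leaky integration
        fired = v >= v_threshold
        out_spikes[t] = fired.astype(float)
        v = np.where(fired, 0.0, v)          # hard reset after a spike
    return out_spikes

# Sparse binary "event" input: 20 timesteps x 64 input channels, ~5% active.
rng = np.random.default_rng(0)
events = (rng.random((20, 64)) < 0.05).astype(float)
w = 0.5 * rng.standard_normal((64, 16))
spikes = lif_layer(events, w)
print('output spike rate:', spikes.mean())
```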
Towards 4D Human Video Stylization
We present a first step towards 4D (3D and time) human video stylization, which addresses style transfer, novel view synthesis, and human animation within a unified framework. While numerous video stylization methods have been developed, they are often restricted to rendering images from the specific viewpoints of the input video, lacking the capability to generalize to novel views and novel poses in dynamic scenes. To overcome these limitations, we leverage Neural Radiance Fields (NeRFs) to represent videos, conducting stylization in the rendered feature space. Our approach represents both the human subject and the surrounding scene with two separate NeRFs. This dual representation facilitates the animation of human subjects across various poses and novel viewpoints. Specifically, we introduce a novel geometry-guided tri-plane representation, significantly enhancing feature representation robustness compared to direct tri-plane optimization. Following video reconstruction, stylization is performed within the NeRFs' rendered feature space. Extensive experiments demonstrate that the proposed method strikes a superior balance between stylized textures and temporal coherence, surpassing existing approaches. Furthermore, our framework uniquely extends its capabilities to accommodate novel poses and viewpoints, making it a versatile tool for creative human video stylization.
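A tri-plane representation, which the geometry-guided variant above builds on, stores features on three axis-aligned planes and queries them per 3D point. The sketch below shows that lookup: project the point onto the XY, XZ, and YZ planes, bilinearly sample each plane, and sum the results. The plane resolution, channel count, and the choice of summation are assumptions, and the paper's geometry-guided construction of the planes is not reproduced.

```python
# Sketch of a plain tri-plane feature lookup; illustrative choices throughout.
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample a (R, R, C) feature plane at continuous coords in [0, 1]."""
    R = plane.shape[0]
    x, y = u * (R - 1), v * (R - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, R - 1), min(y0 + 1, R - 1)
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * plane[x0, y0] + fx * (1 - fy) * plane[x1, y0]
            + (1 - fx) * fy * plane[x0, y1] + fx * fy * plane[x1, y1])

def triplane_feature(planes, p):
    """Feature for a 3-D point p in [0, 1]^3 from the XY, XZ and YZ planes."""
    xy, xz, yz = planes
    return (bilinear_sample(xy, p[0], p[1])
            + bilinear_sample(xz, p[0], p[2])
            + bilinear_sample(yz, p[1], p[2]))

rng = np.random.default_rng(0)
planes = [rng.standard_normal((32, 32, 8)) for _ in range(3)]  # 3 planes, 8 channels
print(triplane_feature(planes, np.array([0.2, 0.7, 0.5])).shape)  # (8,)
```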
Dynamic Non-Rigid Objects Reconstruction with a Single RGB-D Sensor
This paper addresses the 3D reconstruction problem for dynamic non-rigid objects with a single RGB-D sensor. It is a challenging task, as the almost inevitable accumulation of error in previous sequential fusion methods and the possible failure of surface tracking over a long sequence must both be handled. Therefore, we propose a global non-rigid registration framework and tackle the drifting problem via an explicit loop closure. Our scheme starts with a fusion step to obtain multiple partial scans from the input sequence, followed by a pairwise non-rigid registration and loop detection step to obtain correspondences between neighboring partial pieces and those pieces that form a loop. Then, we perform a global registration procedure to align all the pieces into a consistent canonical space, guided by the matches we have established. Finally, our proposed model-update step helps fix potential misalignments that still exist after the global registration. Both geometric and appearance constraints are enforced during the alignment; therefore, we are able to obtain the recovered model with accurate geometry as well as high-fidelity color maps for the mesh. Experiments on both synthetic and various real datasets demonstrate the capability of our approach to reconstruct complete and watertight deformable objects.
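Although the registration here is non-rigid, pairwise alignment of partial scans rests on the same closed-form least-squares step used in rigid ICP. The sketch below shows only that rigid core (the Kabsch/Procrustes solution for known correspondences) on synthetic points; the non-rigid deformation model, loop detection, and global optimization are not reproduced.

```python
# Rigid alignment of corresponding point sets (Kabsch/Procrustes), the basic
# building block of pairwise registration. Synthetic data for illustration.
import numpy as np

def rigid_align(src, dst):
    """Best-fit rotation R and translation t mapping src onto dst (known pairs)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)          # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                           # proper rotation (det = +1)
    t = c_dst - R @ c_src
    return R, t

# Toy check: recover a known rotation/translation from corresponding points.
rng = np.random.default_rng(1)
src = rng.standard_normal((100, 3))
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
dst = src @ R_true.T + np.array([0.5, -0.2, 1.0])
R, t = rigid_align(src, dst)
print('rotation error:', np.linalg.norm(R - R_true))
```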
TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration
We propose a novel task for generating 3D dance movements that simultaneously incorporate both text and music modalities. Unlike existing works that generate dance movements using a single modality such as music, our goal is to produce richer dance movements guided by the instructive information provided by the text. However, the lack of paired motion data with both music and text modalities limits the ability to generate dance movements that integrate both. To alleviate this challenge, we propose to utilize a 3D human motion VQ-VAE to project the motions of the two datasets into a latent space consisting of quantized vectors, which effectively mixes the motion tokens from the two datasets with different distributions for training. Additionally, we propose a cross-modal transformer to integrate text instructions into the motion generation architecture, producing 3D dance movements without degrading the performance of music-conditioned dance generation. To better evaluate the quality of the generated motion, we introduce two novel metrics, namely Motion Prediction Distance (MPD) and Freezing Score, to measure the coherence and freezing percentage of the generated motion. Extensive experiments show that our approach can generate realistic and coherent dance movements conditioned on both text and music while maintaining performance comparable to the two single-modality baselines. Code will be available at: https://garfield-kh.github.io/TM2D/
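The VQ-VAE bottleneck that turns continuous motion features into discrete tokens can be sketched as a nearest-codebook lookup, shown below. The codebook size, feature dimension, and random inputs are illustrative assumptions; the motion encoder/decoder and the cross-modal transformer are not shown.

```python
# Sketch of the vector-quantisation step in a motion VQ-VAE: each continuous
# motion feature is replaced by its nearest codebook entry, yielding discrete
# motion tokens. Shapes and values are illustrative.
import numpy as np

def quantize(features, codebook):
    """Map each feature vector (T, D) to its nearest codebook entry (K, D)."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    tokens = d2.argmin(axis=1)           # discrete motion token ids
    return tokens, codebook[tokens]      # ids and the quantised features

rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 64))     # K=512 codes, D=64 dims (assumed)
motion_feats = rng.standard_normal((24, 64))  # 24 encoded motion frames
tokens, quantised = quantize(motion_feats, codebook)
print(tokens[:8], quantised.shape)            # first 8 token ids, (24, 64)
```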