Learning to Dress 3D People in Generative Clothing
Three-dimensional human body models are widely used in the analysis of human
pose and motion. Existing models, however, are learned from minimally-clothed
3D scans and thus do not generalize to the complexity of dressed people in
common images and videos. Additionally, current models lack the expressive
power needed to represent the complex non-linear geometry of pose-dependent
clothing shapes. To address this, we learn a generative 3D mesh model of
clothed people from 3D scans with varying pose and clothing. Specifically, we
train a conditional Mesh-VAE-GAN to learn the clothing deformation from the
SMPL body model, making clothing an additional term in SMPL. Our model is
conditioned on both pose and clothing type, giving the ability to draw samples
of clothing to dress different body shapes in a variety of styles and poses. To
preserve wrinkle detail, our Mesh-VAE-GAN extends patchwise discriminators to
3D meshes. Our model, named CAPE, represents global shape and fine local
structure, effectively extending the SMPL body model to clothing. To our
knowledge, this is the first generative model that directly dresses 3D human
body meshes and generalizes to different poses. The model, code and data are
available for research purposes at https://cape.is.tue.mpg.de.
Comment: CVPR 2020 camera ready.
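The idea of making clothing "an additional term in SMPL" can be illustrated with a minimal numeric sketch. Everything below is a hypothetical stand-in (the placeholder `smpl_body` and `clothing_displacement` functions, the latent size, and the conditioning scale are all invented, not CAPE's actual decoder); it only shows the additive structure: clothed vertices are body vertices plus a pose- and clothing-conditioned displacement decoded from a sampled latent code.

```python
import numpy as np

rng = np.random.default_rng(0)

N_VERTS = 6890      # SMPL mesh resolution
LATENT_DIM = 8      # hypothetical clothing latent size

def smpl_body(pose, shape):
    """Stand-in for the SMPL body model: returns (N_VERTS, 3) vertices."""
    return np.outer(np.linspace(0.0, 1.7, N_VERTS), [0.0, 1.0, 0.0])

def clothing_displacement(z, pose, clothing_type):
    """Stand-in for the conditional decoder: maps a latent sample z, the
    pose, and a clothing-type label to per-vertex offsets (invented math)."""
    scale = 0.01 * (1 + clothing_type)          # hypothetical conditioning
    return scale * np.tanh(z[:3])[None, :] * np.ones((N_VERTS, 3))

# "clothing as an additional term in SMPL": clothed = body + displacement
pose, shape = np.zeros(72), np.zeros(10)
z = rng.standard_normal(LATENT_DIM)             # draw one clothing sample
body = smpl_body(pose, shape)
clothed = body + clothing_displacement(z, pose, clothing_type=1)

assert clothed.shape == body.shape              # same topology, offset verts
```

Because the displacement is a separate additive term, drawing different `z` samples dresses the same body in different clothing while the underlying SMPL parameters stay untouched.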
Multi-person Implicit Reconstruction from a Single Image
We present a new end-to-end learning framework to obtain detailed and spatially coherent reconstructions of multiple people from a single image. Existing multi-person methods suffer from two main drawbacks: they are often model-based and therefore cannot capture accurate 3D models of people with loose clothing and hair; or they require manual intervention to resolve occlusions or interactions. Our method addresses both limitations by introducing the first end-to-end learning approach to perform model-free implicit reconstruction for realistic 3D capture of multiple clothed people in arbitrary poses (with occlusions) from a single image. Our network simultaneously estimates the 3D geometry of each person and their 6DOF spatial locations to obtain a coherent multi-human reconstruction. In addition, we introduce a new synthetic dataset that depicts images with a varying number of inter-occluded humans and a variety of clothing and hair styles. We demonstrate robust, high-resolution reconstructions on images of multiple humans with complex occlusions, loose clothing and a large variety of poses and scenes. Our quantitative evaluation on both synthetic and real-world datasets demonstrates state-of-the-art performance, with significant improvements in the accuracy and completeness of the reconstructions over competing approaches.
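The combination of per-person implicit geometry with 6DOF spatial placement can be sketched in a few lines. The occupancy function and rigid poses below are toy stand-ins (a sphere per person instead of a learned network, identity rotations, invented translations), but they show how a coherent multi-person scene is composed as the union of implicit shapes, each queried in its own local frame.

```python
import numpy as np

def sphere_occupancy(x, radius=0.5):
    """Toy per-person implicit function: inside-sphere occupancy in the
    person's canonical frame (real methods learn this from data)."""
    return (np.linalg.norm(x, axis=-1) < radius).astype(float)

def make_pose(R, t):
    """6DOF placement: maps world points into the person's local frame
    (for orthonormal R, right-multiplication applies R^T row-wise)."""
    return lambda x: (x - t) @ R

def scene_occupancy(x, people):
    """Coherent multi-human scene: the union (pointwise max) of the
    per-person occupancies, each evaluated in its own local frame."""
    return np.max([occ(to_local(x)) for occ, to_local in people], axis=0)

I3 = np.eye(3)
people = [(sphere_occupancy, make_pose(I3, np.array([0.0, 0.0, 0.0]))),
          (sphere_occupancy, make_pose(I3, np.array([2.0, 0.0, 0.0])))]

queries = np.array([[0.0, 0.0, 0.0],   # inside person 1
                    [2.0, 0.0, 0.0],   # inside person 2
                    [1.0, 0.0, 0.0]])  # empty space between them
occ = scene_occupancy(queries, people)
```

Estimating the translations (and rotations) jointly with the geometry is what makes the composed reconstruction spatially coherent rather than a pile of per-person crops.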
SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation
Human-centric video frame interpolation has great potential for improving
people's entertainment experiences and finding commercial applications in the
sports analysis industry, e.g., synthesizing slow-motion videos. Although there
are multiple benchmark datasets available in the community, none is dedicated
to human-centric scenarios. To bridge this gap, we introduce
SportsSloMo, a benchmark consisting of more than 130K video clips and 1M video
frames of high-resolution (720p) slow-motion sports videos crawled from
YouTube. We re-train several state-of-the-art methods on our benchmark, and the
results show a decrease in their accuracy compared to other datasets. This
highlights the difficulty of our benchmark: it poses significant challenges
even for the best-performing methods, as human bodies
are highly deformable and occlusions are frequent in sports videos. To improve
the accuracy, we introduce two loss terms that incorporate human-aware priors,
adding auxiliary supervision for panoptic segmentation and human keypoint
detection, respectively. The loss terms are model-agnostic and can be easily
plugged into any video frame interpolation approach. Experimental results
validate the effectiveness of the proposed loss terms, leading to consistent
performance improvements over 5 existing models and establishing strong
baselines on our benchmark. The dataset and code can be found at:
https://neu-vi.github.io/SportsSlomo/.
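The model-agnostic, human-aware loss terms can be sketched as follows. This is a hypothetical illustration, not the paper's actual objective: the `l1` distance, the weights `w_seg`/`w_kp`, and the tensor shapes are all assumptions. It shows only the structure that makes the terms pluggable: a base frame-reconstruction loss plus weighted auxiliary terms on segmentation and keypoint outputs.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two arrays."""
    return float(np.mean(np.abs(a - b)))

def human_aware_loss(pred_frame, gt_frame,
                     pred_seg, gt_seg, pred_kp, gt_kp,
                     w_seg=0.1, w_kp=0.1):
    """Sketch of a human-aware objective: frame reconstruction plus
    auxiliary supervision for segmentation and keypoints. The weights
    are invented; the auxiliary terms attach to any model's outputs."""
    return (l1(pred_frame, gt_frame)
            + w_seg * l1(pred_seg, gt_seg)
            + w_kp * l1(pred_kp, gt_kp))

# toy tensors standing in for network outputs and labels
rng = np.random.default_rng(0)
f_pred, f_gt = rng.random((8, 8, 3)), rng.random((8, 8, 3))
s_pred, s_gt = rng.random((8, 8)), rng.random((8, 8))
k_pred, k_gt = rng.random((17, 2)), rng.random((17, 2))

loss = human_aware_loss(f_pred, f_gt, s_pred, s_gt, k_pred, k_gt)
```

Because the auxiliary terms touch only the model's outputs, not its internals, the same two lines can be added to the training loop of any interpolation network.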
SHARE: Single-view Human Adversarial REconstruction
The accuracy of 3D Human Pose and Shape reconstruction (HPS) from an image is
progressively improving. Yet, no known method is robust across all image
distortions. To address issues caused by variations in camera pose, we introduce
SHARE, a novel fine-tuning method that utilizes adversarial data augmentation
to enhance the robustness of existing HPS techniques. We perform a
comprehensive analysis of the impact of camera poses on HPS reconstruction
outcomes. We first generate large-scale image datasets captured systematically
from diverse camera perspectives. We then establish a mapping between camera
poses and reconstruction errors as a continuous function that characterizes the
relationship between camera pose and HPS quality. Leveraging this
representation, we introduce RoME (Regions of Maximal Error), a novel sampling
technique for our adversarial fine-tuning method.
The SHARE framework is generalizable across various single-view HPS methods,
and we demonstrate its performance on HMR, SPIN, PARE, CLIFF and ExPose. Our
results show a reduction in mean joint error across single-view HPS
techniques for images captured from multiple camera positions, without
compromising baseline performance. In many challenging cases, our method
surpasses existing models, highlighting its practical significance for diverse
real-world applications.
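The idea of drawing fine-tuning cameras from regions of maximal error can be sketched as below. The `error_surface` is a hypothetical stand-in for the fitted pose-to-error function, and the grid plus error-proportional sampling are assumptions for illustration, not the paper's exact RoME procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def error_surface(elev_deg, azim_deg):
    """Hypothetical continuous pose-to-error map; the real one is fit
    to measured HPS errors over systematically varied cameras."""
    return 0.1 + np.abs(np.sin(np.radians(elev_deg)))

# evaluate the surface on a grid of candidate camera poses
elev = np.linspace(-80, 80, 17)
azim = np.linspace(0, 350, 36)
E, A = np.meshgrid(elev, azim, indexing="ij")
err = error_surface(E, A)

# RoME-style sampling (sketch): draw fine-tuning poses with probability
# proportional to predicted error, so high-error regions dominate
p = (err / err.sum()).ravel()
idx = rng.choice(p.size, size=100, p=p)
sampled_elev = E.ravel()[idx]

# steep-|elevation| poses, where this toy surface is worst, dominate
assert np.mean(np.abs(sampled_elev)) > np.mean(np.abs(elev))
```

Concentrating fine-tuning data where the error function peaks is what makes the augmentation adversarial: the model is retrained precisely on the camera poses it currently handles worst.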
3D Segmentation of Humans in Point Clouds with Synthetic Data
Segmenting humans in 3D indoor scenes has become increasingly important with
the rise of human-centered robotics and AR/VR applications. To this end, we
propose the task of joint 3D human semantic segmentation, instance segmentation
and multi-human body-part segmentation. Few works have attempted to directly
segment humans in cluttered 3D scenes, which is largely due to the lack of
annotated training data of humans interacting with 3D scenes. We address this
challenge and propose a framework for generating training data of synthetic
humans interacting with real 3D scenes. Furthermore, we propose a novel
transformer-based model, Human3D, which is the first end-to-end model for
segmenting multiple human instances and their body-parts in a unified manner.
The key advantage of our synthetic data generation framework is its ability to
generate diverse and realistic human-scene interactions, with highly accurate
ground truth. Our experiments show that pre-training on synthetic data improves
performance on a wide variety of 3D human segmentation tasks. Finally, we
demonstrate that Human3D outperforms even task-specific state-of-the-art 3D
segmentation methods.
Comment: project page: https://human-3d.github.io
Towards Geometric Understanding of Motion
The motion of the world is inherently dependent on the spatial structure of the world and its geometry. Therefore, classical optical flow methods try to model this geometry to solve for the motion. However, recent deep learning methods take a completely different approach. They try to predict optical flow by learning from labelled data. Although deep networks have shown state-of-the-art performance on classification problems in computer vision, they have not been as effective in solving optical flow. The key reason is that deep learning methods do not explicitly model the structure of the world in a neural network, and instead expect the network to learn about the structure from data. We hypothesize that it is difficult for a network to learn about motion without any constraint on the structure of the world. Therefore, we explore several approaches to explicitly model the geometry of the world and its spatial structure in deep neural networks.
The spatial structure in images can be captured by representing it at multiple scales. To represent multiple scales of images in deep neural nets, we introduce a Spatial Pyramid Network (SPyNet). Such a network can leverage global information for estimating large motions and local information for estimating small motions. We show that SPyNet significantly improves over previous optical flow networks while also being the smallest and fastest neural network for motion estimation. SPyNet achieves a 97% reduction in model parameters over previous methods and is more accurate.
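The coarse-to-fine pyramid structure described above can be sketched as follows. Here `residual_flow` is a zero-returning stand-in for the learned per-level network, so only the pyramid plumbing is shown (downsampling the images, upsampling and rescaling the flow, refining with a residual at each level), not the learning itself; all shapes and the level count are illustrative.

```python
import numpy as np

def downsample(img):
    """Halve resolution by 2x2 average pooling."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample_flow(flow):
    """Double resolution and scale flow vectors to match the finer grid."""
    return 2.0 * np.repeat(np.repeat(flow, 2, axis=0), 2, axis=1)

def residual_flow(img1, img2, init_flow):
    """Stand-in for the per-level network, which predicts a flow
    *residual* from the images and the upsampled coarse flow.
    Returns zeros here, so only the structure is demonstrated."""
    return np.zeros_like(init_flow)

def pyramid_flow(img1, img2, levels=3):
    """Coarse-to-fine estimation over a spatial pyramid: global motion
    is resolved at coarse scales, local detail refined at fine scales."""
    pyr1, pyr2 = [img1], [img2]
    for _ in range(levels - 1):
        pyr1.append(downsample(pyr1[-1]))
        pyr2.append(downsample(pyr2[-1]))
    flow = np.zeros(pyr1[-1].shape + (2,))      # coarsest level: zero init
    for i1, i2 in zip(reversed(pyr1), reversed(pyr2)):
        if flow.shape[:2] != i1.shape:
            flow = upsample_flow(flow)
        flow = flow + residual_flow(i1, i2, flow)
    return flow

flow = pyramid_flow(np.zeros((16, 16)), np.zeros((16, 16)))
```

Predicting only a small residual per level is what keeps each per-level network tiny, which is where the large parameter reduction comes from.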
The spatial structure of the world extends to people and their motion. Humans have a very well-defined structure, and this information is useful in estimating optical flow for humans. To leverage this information, we create a synthetic dataset for human optical flow using a statistical human body model and motion capture sequences. We use this dataset to train deep networks and see significant improvement in the ability of the networks to estimate human optical flow.
The structure and geometry of the world affects the motion. Therefore, learning about the structure of the scene together with the motion can benefit both problems. To facilitate this, we introduce Competitive Collaboration, where several neural networks are constrained by geometry and can jointly learn about structure and motion in the scene without any labels. To this end, we show that jointly learning single view depth prediction, camera motion, optical flow and motion segmentation using Competitive Collaboration achieves state-of-the-art results among unsupervised approaches.
Our findings support our hypothesis that explicit constraints on the structure and geometry of the world lead to better methods for motion estimation.
Modeling Humans at Rest with Applications to Robot Assistance
Humans spend a large part of their lives resting. Machine perception of this class of body poses would be beneficial to numerous applications, but it is complicated by line-of-sight occlusion from bedding. Pressure sensing mats are a promising alternative, but data is challenging to collect at scale. To overcome this, we use modern physics engines to simulate bodies resting on a soft bed with a pressure sensing mat. This method can efficiently generate data at scale for training deep neural networks. We present a deep model trained on this data that infers 3D human pose and body shape from a pressure image, and show that it transfers well to real-world data. We also present a model that infers pose, shape and contact pressure from a depth image facing the person in bed, and it does so in the presence of blankets. This model similarly benefits from synthetic data, which is created by simulating blankets on the bodies in bed. We evaluate this model on real-world data and compare it to an existing method that requires RGB, depth, thermal and pressure imagery in the input. Our model only requires an input depth image, yet it is 12% more accurate. Our methods are relevant to applications in healthcare, including patient acuity monitoring and pressure injury prevention. We demonstrate this work in the context of robotic caregiving assistance, by using it to control a robot to move to locations on a person's body in bed.
Ph.D. thesis.
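The simulation-to-pressure-image idea can be caricatured in a few lines. The rasterizer below is a toy stand-in for the physics engine: the mat dimensions, grid resolution, contact points, and loads are all invented. It shows only how body contact points and their loads become a pressure image of the kind used for training.

```python
import numpy as np

def pressure_image(contact_xy, loads, grid=(64, 27), size=(1.9, 0.8)):
    """Toy stand-in for physics simulation: rasterize body contact
    points (in meters) and per-point load onto a pressure-mat grid.
    Grid resolution and mat size are invented, not a real sensor's."""
    img = np.zeros(grid)
    rows = np.clip((contact_xy[:, 0] / size[0] * grid[0]).astype(int),
                   0, grid[0] - 1)
    cols = np.clip((contact_xy[:, 1] / size[1] * grid[1]).astype(int),
                   0, grid[1] - 1)
    np.add.at(img, (rows, cols), loads)   # accumulate overlapping loads
    return img

# a supine "body": a few contact points (head, shoulders, hips, heels)
pts = np.array([[0.15, 0.4], [0.45, 0.25], [0.45, 0.55],
                [1.0, 0.4], [1.75, 0.3], [1.75, 0.5]])
loads = np.array([8.0, 12.0, 12.0, 30.0, 5.0, 5.0])  # kg, made up

img = pressure_image(pts, loads)
assert img.sum() == loads.sum()           # all load lands on the mat
```

A real pipeline simulates soft-body contact rather than point loads, but the output has the same form: a low-resolution image whose per-cell values a pose-regression network can consume.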
Autonome Autos
Traffic is culture. It determines what is located where and along which routes, who encounters whom and who does not: it forms the basis of the networks that people and things form with one another. With the automation of traffic, the establishment of driver-assistance systems, and the development of self-driving cars, it is not only the relations between human and non-human road users that are called into question. Not only the ethical and legal foundations of road traffic, but also the basic conditions of everyday interaction must be renegotiated. The contributors to this volume analyze, from a media and cultural studies perspective, the transformations of mobility, the mobility transition, and the car as a multiply charged object.