GroomGen: A High-Quality Generative Hair Model Using Hierarchical Latent Representations
Despite recent successes in hair acquisition, where a high-dimensional hair
model is fitted to a specific input subject, generative hair models, which
establish general embedding spaces for encoding, editing, and sampling diverse
hairstyles, remain far less explored. In this paper, we present GroomGen, the
first generative model designed for hair geometry composed of highly-detailed
dense strands. Our approach is motivated by two key ideas. First, we construct
hair latent spaces covering both individual strands and hairstyles. The latent
spaces are compact, expressive, and well-constrained for high-quality and
diverse sampling. Second, we adopt a hierarchical hair representation that
parameterizes a complete hair model at three levels: single strands, sparse
guide hairs, and complete dense hairs. This representation is critical to the
compactness of latent spaces, the robustness of training, and the efficiency of
inference. Based on this hierarchical latent representation, our proposed
pipeline consists of a strand-VAE and a hairstyle-VAE that encode an individual
strand and a set of guide hairs into their respective latent spaces, and a
hybrid densification step that expands sparse guide hairs into a dense hair
model.
GroomGen not only enables novel hairstyle sampling and plausible hairstyle
interpolation, but also supports interactive editing of complex hairstyles, or
can serve as a strong data-driven prior for hairstyle reconstruction from images.
We demonstrate the superiority of our approach with qualitative examples of
diverse sampled hairstyles and quantitative evaluations of generation quality
for each individual component and for the entire pipeline.
Comment: SIGGRAPH Asia 2023
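To make the hierarchy concrete, here is a minimal PyTorch sketch of a strand-level VAE that encodes a single strand, sampled as a polyline of 3D points, into a compact latent vector. The 64-point discretization, layer sizes, and 16-dimensional latent are illustrative assumptions, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class StrandVAE(nn.Module):
    """Minimal strand-level VAE: one hair strand, sampled as
    num_points 3D positions, mapped to a compact latent code."""
    def __init__(self, num_points=64, latent_dim=16):
        super().__init__()
        in_dim = num_points * 3
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),   # mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, in_dim),
        )

    def forward(self, strand):
        # strand: (batch, num_points, 3) polyline samples
        h = self.encoder(strand.flatten(1))
        mu, logvar = h.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        recon = self.decoder(z).view_as(strand)
        return recon, mu, logvar
```

In the full pipeline, the hairstyle-VAE plays the analogous role one level up, encoding the latent codes of a set of sparse guide hairs, before the hybrid densification step produces the dense hair model.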
Synthesizing Diverse Human Motions in 3D Indoor Scenes
We present a novel method for populating 3D indoor scenes with virtual humans
that can navigate the environment and interact with objects in a realistic
manner. Existing approaches rely on high-quality training sequences that
capture a diverse range of human motions in 3D scenes. However, such motion
data is costly and difficult to obtain, and it can never cover the full range of
plausible human-scene interactions in complex indoor environments. To address
these challenges, we propose a reinforcement learning-based approach to learn
policy networks that predict latent variables of a powerful generative motion
model that is trained on a large-scale motion capture dataset (AMASS). For
navigating in a 3D environment, we propose a scene-aware policy training scheme
with a novel collision avoidance reward function. Combined with the powerful
generative motion model, we can synthesize highly diverse human motions
that navigate 3D indoor scenes while effectively avoiding obstacles. For
detailed human-object interactions, we carefully curate interaction-aware
reward functions by leveraging a marker-based body representation and the
signed distance field (SDF) representation of the 3D scene. With a number of
important training design choices, our method can synthesize realistic and
diverse human-object interactions (e.g., sitting on a chair and then getting
up) even for out-of-distribution test scenarios with different object shapes,
orientations, starting body positions, and poses. Experimental results
demonstrate that our approach outperforms state-of-the-art human-scene
interaction synthesis frameworks in terms of both motion naturalness and
diversity. Video results are available on the project page:
https://zkf1997.github.io/DIMOS
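The collision-avoidance term can be illustrated with the scene SDF the abstract mentions: body markers that come closer to scene geometry than a safety margin are penalized. This is only a sketch; the `scene_sdf` callable, the margin, and the penalty form are assumptions rather than the paper's exact reward.

```python
import numpy as np

def collision_reward(marker_positions, scene_sdf, margin=0.02):
    """Hypothetical collision-avoidance reward term.

    marker_positions: (M, 3) body marker locations.
    scene_sdf: callable mapping (M, 3) points to signed distances,
               negative inside scene geometry.
    Returns a non-positive reward penalizing penetration.
    """
    d = scene_sdf(marker_positions)               # signed distance per marker
    penetration = np.clip(margin - d, 0.0, None)  # > 0 when closer than margin
    return -float(penetration.sum())
```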
Fast Nonlinear Least Squares Optimization of Large-Scale Semi-Sparse Problems
Many problems in computer graphics and vision can be formulated as nonlinear least squares optimization problems, for which numerous off-the-shelf solvers are readily available. Depending on the structure of the problem, however, existing solvers may be more or less suitable, and in some cases the solution comes at the cost of lengthy convergence times. One such case is semi-sparse optimization problems, emerging for example in localized facial performance reconstruction, where the nonlinear least squares problem can be composed of hundreds of thousands of cost functions, each one involving many of the optimization parameters. While such problems can be solved with existing solvers, the computation time can severely hinder the applicability of these methods. We introduce a novel iterative solver for nonlinear least squares optimization of large-scale semi-sparse problems. We use the Levenberg-Marquardt method to locally linearize the problem in parallel, based on its first-order approximation. Then, we decompose the linear problem into small blocks, using the local Schur complement, leading to a more compact linear system without loss of information. The resulting system is dense, but its size is small enough to be solved using a parallel direct method in a short amount of time. The main benefit of this approach is that the overall optimization process is entirely parallel and scalable, making it suitable for mapping onto graphics hardware (GPU). Using our minimizer, results are obtained up to one order of magnitude faster than with other existing solvers, without sacrificing the generality or the accuracy of the model. We provide a detailed analysis of our approach and validate our results on performance-based facial capture using a recently proposed anatomical local face deformation model.
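The central linear-algebra step described above, eliminating the cheap block of the damped normal equations via the local Schur complement so that only a small dense system remains for a direct solve, can be sketched in NumPy as follows. The two-block partitioning and names are illustrative assumptions; the actual solver performs the elimination per local block, in parallel on the GPU.

```python
import numpy as np

def schur_solve(H, g, n_local):
    """Solve H @ delta = -g by eliminating the first n_local variables.

    H is a damped Gauss-Newton Hessian (J.T @ J + lam * I) partitioned as
        [[A,   B],
         [B.T, C]],
    where A covers the variables that are cheap to eliminate. The reduced
    system on the remaining variables uses the Schur complement
        S = C - B.T @ inv(A) @ B,
    which is dense but small, so a direct solve is fast.
    """
    A, B, C = H[:n_local, :n_local], H[:n_local, n_local:], H[n_local:, n_local:]
    g1, g2 = g[:n_local], g[n_local:]

    A_inv_B = np.linalg.solve(A, B)
    A_inv_g1 = np.linalg.solve(A, g1)
    S = C - B.T @ A_inv_B                            # Schur complement
    d2 = np.linalg.solve(S, -(g2 - B.T @ A_inv_g1))  # reduced dense solve
    d1 = -A_inv_g1 - A_inv_B @ d2                    # back-substitution
    return np.concatenate([d1, d2])
```

In a full Levenberg-Marquardt loop, H = J.T @ J + lam * I and g = J.T @ r at the current linearization point, with the damping lam adapted per iteration.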
ETH-XGaze: A Large Scale Dataset for Gaze Estimation under Extreme Head Pose and Gaze Variation
Gaze estimation is a fundamental task in many applications of computer
vision, human-computer interaction, and robotics. Many state-of-the-art methods
are trained and tested on custom datasets, making comparison across methods
challenging. Furthermore, existing gaze estimation datasets have limited head
pose and gaze variations, and the evaluations are conducted using different
protocols and metrics. In this paper, we propose a new gaze estimation dataset
called ETH-XGaze, consisting of over one million high-resolution images of
varying gaze under extreme head poses. We collect this dataset from 110
participants with a custom hardware setup including 18 digital SLR cameras and
adjustable illumination conditions, and a calibrated system to record ground
truth gaze targets. We show that our dataset can significantly improve the
robustness of gaze estimation methods across different head poses and gaze
angles. Additionally, we define a standardized experimental protocol and
evaluation metric on ETH-XGaze, to better unify gaze estimation research going
forward. The dataset and benchmark website are available at
https://ait.ethz.ch/projects/2020/ETH-XGaze
Comment: Accepted at ECCV 2020 (Spotlight)
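Gaze estimation benchmarks typically report the mean angular error between predicted and ground-truth gaze directions; a minimal NumPy version of this standard metric is sketched below, on the assumption that the ETH-XGaze protocol uses the usual formulation.

```python
import numpy as np

def angular_error_deg(pred, gt):
    """Mean angular error in degrees between predicted and
    ground-truth 3D gaze direction vectors of shape (N, 3)."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos = np.clip((pred * gt).sum(axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```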
Interactive Sculpting of Digital Faces Using an Anatomical Modeling Paradigm
Digitally sculpting 3D human faces is a very challenging task. It typically requires either 1) highly skilled artists using complex software packages for high-quality results, or 2) highly constrained, simple interfaces for consumer-level avatar creation, such as in game engines. We propose a novel interactive method for the creation of digital faces that is simple and intuitive to use, even for novice users, while consistently producing plausible 3D face geometry, and allowing editing freedom beyond traditional video game avatar creation. At the core of our system lies a specialized anatomical local face model (ALM), which is constructed from a dataset of several hundred 3D face scans. User edits are propagated as constraints to an optimization of our data-driven ALM, ensuring the resulting face remains plausible even for simple edits like clicking and dragging surface points. We show how several natural interaction methods can be implemented in our framework, including direct control of the surface, indirect control of semantic features like age, ethnicity, gender, and BMI, as well as indirect control through manipulating the underlying bony structures. The result is a simple new method for creating digital human faces, for artists and novice users alike. Our method is attractive for low-budget VFX and animation productions, and our anatomical modeling paradigm can complement traditional game engine avatar design packages.
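To illustrate how user edits become constraints on a data-driven model, here is a minimal sketch that fits the coefficients of a plain linear vertex basis to a few dragged surface points, with a regularizer keeping the result close to the prior. The paper's anatomical local model and optimization are considerably more elaborate, so every name below is an illustrative stand-in.

```python
import numpy as np

def fit_to_edits(mean, basis, edit_ids, edit_targets, reg=1e-2):
    """Hypothetical edit propagation through a linear face model.

    mean:         (3V,) mean face vertices, flattened.
    basis:        (3V, K) learned deformation basis.
    edit_ids:     indices of the flattened coordinates the user dragged.
    edit_targets: (len(edit_ids),) desired values for those coordinates.
    Solves min ||basis[ids] @ w - (targets - mean[ids])||^2 + reg * ||w||^2,
    so sparse point edits yield a globally plausible face.
    """
    A = basis[edit_ids]                  # rows for the constrained coordinates
    b = edit_targets - mean[edit_ids]
    K = basis.shape[1]
    w = np.linalg.solve(A.T @ A + reg * np.eye(K), A.T @ b)
    return mean + basis @ w              # full edited face
```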
ITI-GEN: Inclusive Text-to-Image Generation
Text-to-image generative models often reflect the biases of the training
data, leading to unequal representations of underrepresented groups. This study
investigates inclusive text-to-image generative models that generate images
based on human-written prompts and ensure the resulting images are uniformly
distributed across attributes of interest. Unfortunately, directly expressing
the desired attributes in the prompt often leads to sub-optimal results due to
linguistic ambiguity or model misrepresentation. Hence, this paper proposes a
drastically different approach that adheres to the maxim that "a picture is
worth a thousand words". We show that, for some attributes, images can
represent concepts more expressively than text. For instance, categories of
skin tones are typically hard to specify by text but can be easily represented
by example images. Building upon these insights, we propose a novel approach,
ITI-GEN, that leverages readily available reference images for Inclusive
Text-to-Image GENeration. The key idea is learning a set of prompt embeddings
to generate images that can effectively represent all desired attribute
categories. More importantly, ITI-GEN requires no model fine-tuning, making it
computationally efficient to augment existing text-to-image models. Extensive
experiments demonstrate that ITI-GEN substantially improves over state-of-the-art
models in generating inclusive images from a prompt. Project page:
https://czhang0528.github.io/iti-gen
Comment: Accepted to ICCV 2023 (Oral Presentation)
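The key mechanism, learning prompt embeddings from reference images while every pretrained model stays frozen, can be sketched as follows. The encoder interfaces and the simple cosine-alignment objective are assumptions standing in for ITI-GEN's actual training losses.

```python
import torch
import torch.nn.functional as F

def learn_category_token(ref_images, image_encoder, text_encoder,
                         base_prompt_embeds, steps=200, lr=1e-2):
    """Hypothetical sketch: optimize one extra prompt-token embedding
    per attribute category so the prompt's text feature moves toward
    reference images of that category; no model fine-tuning.

    ref_images:         (N, C, H, W) reference images for the category.
    image_encoder:      frozen module, images -> (N, D) features.
    text_encoder:       frozen module, (1, T, E) token embeddings -> (1, D).
    base_prompt_embeds: (1, T-1, E) embeddings of the human-written prompt.
    """
    with torch.no_grad():
        feats = F.normalize(image_encoder(ref_images), dim=-1)
        img_feat = F.normalize(feats.mean(0), dim=0)   # category prototype

    token = torch.zeros(1, 1, base_prompt_embeds.shape[-1], requires_grad=True)
    opt = torch.optim.Adam([token], lr=lr)
    for _ in range(steps):
        prompt = torch.cat([base_prompt_embeds, token], dim=1)  # append token
        txt_feat = F.normalize(text_encoder(prompt), dim=-1)[0]
        loss = 1.0 - txt_feat @ img_feat        # cosine distance to prototype
        opt.zero_grad()
        loss.backward()
        opt.step()
    return token.detach()   # inject into prompts to sample this category
```

At generation time, the learned token is appended to the original prompt embedding so the frozen text-to-image model samples the desired category.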