EVA3D: Compositional 3D Human Generation from 2D Image Collections
Inverse graphics aims to recover 3D models from 2D observations. Utilizing
differentiable rendering, recent 3D-aware generative models have shown
impressive results on rigid object generation using 2D images. However, it
remains challenging to generate articulated objects, like human bodies, due to
their complexity and diversity in poses and appearances. In this work, we
propose EVA3D, an unconditional 3D human generative model learned from 2D
image collections only. EVA3D can sample 3D humans with detailed geometry and
render high-quality images (up to 512x256) without bells and whistles (e.g.
super resolution). At the core of EVA3D is a compositional human NeRF
representation, which divides the human body into local parts. Each part is
represented by an individual volume. This compositional representation enables
1) inherent human priors, 2) adaptive allocation of network parameters, and 3)
efficient training and rendering. Moreover, to accommodate the
characteristics of sparse 2D human image collections (e.g. imbalanced pose
distribution), we propose a pose-guided sampling strategy for better GAN
learning. Extensive experiments validate that EVA3D achieves state-of-the-art
3D human generation performance regarding both geometry and texture quality.
Notably, EVA3D demonstrates great potential and scalability to
"inverse-graphics" diverse human bodies with a clean framework.Comment: Project Page at https://hongfz16.github.io/projects/EVA3D.htm
Exploiting Hierarchical Interactions for Protein Surface Learning
Predicting interactions between proteins is one of the most important yet
challenging problems in structural bioinformatics. Intrinsically, potential
function sites in protein surfaces are determined by both geometric and
chemical features. However, existing works only consider handcrafted or
individually learned chemical features from the atom type and extract geometric
features independently. Here, we identify two key properties of effective
protein surface learning: 1) relationship among atoms: atoms are linked with
each other by covalent bonds to form biomolecules instead of appearing alone,
leading to the significance of modeling the relationship among atoms in
chemical feature learning. 2) hierarchical feature interaction: the neighboring
residue effect validates the significance of hierarchical feature interaction
among atoms and between surface points and atoms (or residues). In this paper,
we present a principled framework based on deep learning techniques, namely
Hierarchical Chemical and Geometric Feature Interaction Network (HCGNet), for
protein surface analysis by bridging chemical and geometric features with
hierarchical interactions. Extensive experiments demonstrate that our method
outperforms the prior state-of-the-art method by 2.3% on the site prediction
task and by 3.2% on the interaction matching task. Our code is available at
https://github.com/xmed-lab/HCGNet.
Comment: Accepted to J-BH
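As a rough sketch of the two properties identified in the abstract, the PyTorch snippet below pairs one round of message passing over a covalent-bond graph (modeling relationships among atoms) with cross-attention from surface points to atom features (one level of hierarchical interaction). The layer choices, dimensions, and random inputs are assumptions for illustration and do not reproduce HCGNet's architecture.

```python
import torch
import torch.nn as nn

class AtomGraphLayer(nn.Module):
    """Message passing over the covalent-bond graph to learn chemical features."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, atom_feats, edges):
        # atom_feats: (A, D); edges: (E, 2) index pairs of bonded atoms
        src, dst = edges[:, 0], edges[:, 1]
        m = torch.relu(self.msg(torch.cat([atom_feats[src], atom_feats[dst]], dim=-1)))
        agg = torch.zeros_like(atom_feats).index_add_(0, dst, m)  # sum messages per atom
        return self.upd(agg, atom_feats)

class SurfaceAtomInteraction(nn.Module):
    """Cross-attention from surface points (geometric) to atoms (chemical)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, surf_feats, atom_feats):
        # surf_feats: (1, S, D); atom_feats: (1, A, D)
        fused, _ = self.attn(surf_feats, atom_feats, atom_feats)
        return surf_feats + fused  # residual fusion of chemical context

dim = 32
atom_feats = torch.rand(100, dim)        # hypothetical atom embeddings
edges = torch.randint(0, 100, (300, 2))  # hypothetical covalent bonds
surface = torch.rand(1, 500, dim)        # hypothetical surface-point features
atom_feats = AtomGraphLayer(dim)(atom_feats, edges)
surface = SurfaceAtomInteraction(dim)(surface, atom_feats.unsqueeze(0))
```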
HyperDreamer: Hyper-Realistic 3D Content Generation and Editing from a Single Image
3D content creation from a single image is a long-standing yet highly
desirable task. Recent advances introduce 2D diffusion priors, yielding
reasonable results. However, existing methods are not hyper-realistic enough
for post-generation usage, as users cannot view, render, and edit the resulting
3D content from a full range of viewpoints. To address these challenges, we
introduce HyperDreamer with several key designs and appealing properties: 1)
Viewable: 360-degree mesh modeling with high-resolution textures enables the creation of
visually compelling 3D models from a full range of observation points. 2)
Renderable: Fine-grained semantic segmentation and data-driven priors are
incorporated as guidance to learn reasonable albedo, roughness, and specular
properties of the materials, enabling semantic-aware arbitrary material
estimation. 3) Editable: For a generated model or their own data, users can
interactively select any region via a few clicks and efficiently edit the
texture with text-based guidance. Extensive experiments demonstrate the
effectiveness of HyperDreamer in modeling region-aware materials with
high-resolution textures and enabling user-friendly editing. We believe that
HyperDreamer holds promise for advancing 3D content creation and finding
applications in various domains.
Comment: SIGGRAPH Asia 2023 (conference track). Project page:
https://ys-imtech.github.io/HyperDreamer
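The snippet below sketches the semantic-aware material estimation idea in PyTorch: a small head predicts per-pixel albedo, roughness, and specular values, and a semantic segmentation map is used to share roughness and specular within each region. The per-region averaging is a simplistic stand-in for the data-driven priors mentioned above, and all shapes and layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MaterialHead(nn.Module):
    """Predict per-pixel albedo (3), roughness (1), and specular (1) from features."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.net = nn.Conv2d(feat_dim, 5, kernel_size=1)

    def forward(self, feats, seg):
        # feats: (B, C, H, W) image features; seg: (B, H, W) integer semantic labels
        out = torch.sigmoid(self.net(feats))
        albedo, rough, spec = out[:, :3], out[:, 3:4], out[:, 4:5]
        # Semantic-aware regularization (illustrative): share one roughness and
        # specular value per semantic region.
        return albedo, _pool_by_region(rough, seg), _pool_by_region(spec, seg)

def _pool_by_region(x, seg):
    # x: (B, 1, H, W); replace each value by the mean over its semantic region
    out = x.clone()
    for b in range(x.shape[0]):
        for label in seg[b].unique():
            mask = seg[b] == label
            out[b, 0][mask] = x[b, 0][mask].mean()
    return out

feats = torch.rand(1, 32, 64, 64)
seg = torch.randint(0, 4, (1, 64, 64))   # hypothetical segmentation map
albedo, rough, spec = MaterialHead()(feats, seg)
```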
Towards Real-World Visual Tracking with Temporal Contexts
Visual tracking has made significant improvements in the past few decades.
Most existing state-of-the-art trackers 1) merely aim for performance in ideal
conditions while overlooking real-world conditions; 2) adopt the
tracking-by-detection paradigm, neglecting rich temporal contexts; 3) only
integrate the temporal information into the template, where temporal contexts
among consecutive frames are far from being fully utilized. To handle those
problems, we propose a two-level framework (TCTrack) that can exploit temporal
contexts efficiently. Based on it, we propose a stronger version for real-world
visual tracking, i.e., TCTrack++. It boils down to two levels: features and
similarity maps. Specifically, for feature extraction, we propose an
attention-based temporally adaptive convolution to enhance the spatial features
using temporal information, which is achieved by dynamically calibrating the
convolution weights. For similarity map refinement, we introduce an adaptive
temporal transformer to encode the temporal knowledge efficiently and decode it
for the accurate refinement of the similarity map. To further improve the
performance, we additionally introduce a curriculum learning strategy. Also, we
adopt online evaluation to measure performance in real-world conditions.
Exhaustive experiments on 8 well-known benchmarks demonstrate the superiority of
TCTrack++. Real-world tests directly verify that TCTrack++ can be readily used
in real-world applications.
Comment: Accepted by IEEE TPAMI. Code:
https://github.com/vision4robotics/TCTrac
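To illustrate the flavor of temporally adaptive feature extraction, here is a simplified PyTorch sketch: per-channel gates computed from pooled past-frame features modulate the output of a shared convolution, which is equivalent to scaling the corresponding filters. This is only a stand-in under assumed shapes; the paper's attention-based weight calibration and the adaptive temporal transformer for similarity-map refinement are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporallyAdaptiveConv(nn.Module):
    """Convolution whose effective weights are calibrated by temporal context:
    per-output-channel gates scale the result of a shared spatial convolution."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.02)
        self.calib = nn.Sequential(nn.Linear(c_in, c_out), nn.Sigmoid())

    def forward(self, x, temporal_ctx):
        # x: (B, C_in, H, W) current frame; temporal_ctx: (B, C_in) pooled past features
        scale = self.calib(temporal_ctx)           # (B, C_out) calibration gates
        out = F.conv2d(x, self.weight, padding=1)  # shared spatial weights
        return out * scale[:, :, None, None]       # temporal calibration

frames = torch.rand(4, 3, 16, 32, 32)              # (B, T, C, H, W) toy clip
ctx = frames[:, :-1].mean(dim=(1, 3, 4))           # aggregate past frames -> (B, C)
feat = TemporallyAdaptiveConv(16, 32)(frames[:, -1], ctx)
```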
SHERF: Generalizable Human NeRF from a Single Image
Existing Human NeRF methods for reconstructing 3D humans typically rely on
multiple 2D images from multi-view cameras or monocular videos captured from
fixed camera views. However, in real-world scenarios, human images are often
captured from random camera angles, presenting challenges for high-quality 3D
human reconstruction. In this paper, we propose SHERF, the first generalizable
Human NeRF model for recovering animatable 3D humans from a single input image.
SHERF extracts and encodes 3D human representations in canonical space,
enabling rendering and animation from free views and poses. To achieve
high-fidelity novel view and pose synthesis, the encoded 3D human
representations should capture both global appearance and local fine-grained
textures. To this end, we propose a bank of 3D-aware hierarchical features,
including global, point-level, and pixel-aligned features, to facilitate
informative encoding. Global features enhance the information extracted from
the single input image and complement the information missing from the partial
2D observation. Point-level features provide strong clues of 3D human
structure, while pixel-aligned features preserve more fine-grained details. To
effectively integrate the 3D-aware hierarchical feature bank, we design a
feature fusion transformer. Extensive experiments on THuman, RenderPeople,
ZJU_MoCap, and HuMMan datasets demonstrate that SHERF achieves state-of-the-art
performance, with better generalizability for novel view and pose synthesis.
Comment: Accepted by ICCV 2023. Project webpage:
https://skhu101.github.io/SHERF
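The sketch below shows one way the hierarchical feature bank described above could be assembled and fused in PyTorch: for each 3D query point, a global image feature, a point-level feature, and a pixel-aligned feature sampled at the projected image location are treated as three tokens and fused by a small transformer encoder. Feature dimensions, the sampling step, and the averaging over tokens are illustrative assumptions, not SHERF's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalFeatureFusion(nn.Module):
    """Fuse global, point-level, and pixel-aligned features per 3D query point."""
    def __init__(self, dim=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(dim, 4)  # RGB + density

    def forward(self, global_feat, point_feat, img_feat, uv):
        # global_feat: (B, D); point_feat: (B, N, D)
        # img_feat: (B, D, H, W); uv: (B, N, 2) projected coords in [-1, 1]
        B, N, D = point_feat.shape
        pixel_feat = F.grid_sample(img_feat, uv.unsqueeze(2), align_corners=True)
        pixel_feat = pixel_feat.squeeze(-1).permute(0, 2, 1)        # (B, N, D)
        tokens = torch.stack(
            [global_feat.unsqueeze(1).expand(-1, N, -1), point_feat, pixel_feat],
            dim=2,
        )                                                           # (B, N, 3, D)
        fused = self.fuse(tokens.reshape(B * N, 3, D)).mean(dim=1)  # (B*N, D)
        return self.out(fused).reshape(B, N, 4)

model = HierarchicalFeatureFusion()
out = model(torch.rand(2, 64), torch.rand(2, 100, 64),
            torch.rand(2, 64, 32, 32), torch.rand(2, 100, 2) * 2 - 1)
```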
ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance
Generating high-quality 3D assets from a given image is highly desirable in
various applications such as AR/VR. Recent advances in single-image 3D
generation explore feed-forward models that learn to infer the 3D model of an
object without optimization. Though promising results have been achieved in
single object generation, these methods often struggle to model complex 3D
assets that inherently contain multiple objects. In this work, we present
ComboVerse, a 3D generation framework that produces high-quality 3D assets with
complex compositions by learning to combine multiple models. 1) We first
perform an in-depth analysis of this "multi-object gap" from both model and
data perspectives. 2) Next, with reconstructed 3D models of different objects,
we seek to adjust their sizes, rotation angles, and locations to create a 3D
asset that matches the given image. 3) To automate this process, we apply
spatially-aware score distillation sampling (SSDS) from pretrained diffusion
models to guide the positioning of objects. Compared with standard score
distillation sampling, our framework emphasizes the spatial alignment of
objects and thus achieves more accurate results. Extensive experiments
validate that ComboVerse achieves clear improvements over existing methods in
generating compositional 3D assets.
Comment: https://cyw-3d.github.io/ComboVerse
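The sketch below illustrates the layout-optimization step in PyTorch: per-object scale and translation parameters are optimized by gradient descent on a guidance loss (rotations are omitted for brevity). The loss here merely pulls each object's centroid toward a target layout and is only a placeholder for the spatially-aware score distillation sampling used in the paper; the object point clouds and targets are hypothetical.

```python
import torch

def compose(objects, scales, offsets):
    # objects: list of (N_i, 3) per-object point clouds; place them into one scene
    return torch.cat([obj * s + t for obj, s, t in zip(objects, scales, offsets)], dim=0)

def guidance_loss(scene_points, targets):
    # Placeholder for SSDS: pull each object's centroid toward a target position.
    loss, start = 0.0, 0
    for n, center in targets:
        loss = loss + (scene_points[start:start + n].mean(0) - center).pow(2).sum()
        start += n
    return loss

# Two hypothetical objects reconstructed in their own canonical spaces.
objects = [torch.rand(100, 3), torch.rand(80, 3)]
scales = [torch.ones(1, requires_grad=True) for _ in objects]
offsets = [torch.zeros(3, requires_grad=True) for _ in objects]
targets = [(100, torch.tensor([0.0, 0.0, 0.0])), (80, torch.tensor([1.0, 0.0, 0.0]))]

opt = torch.optim.Adam(scales + offsets, lr=0.05)
for _ in range(200):
    opt.zero_grad()
    guidance_loss(compose(objects, scales, offsets), targets).backward()
    opt.step()
```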
DiffMimic: Efficient Motion Mimicking with Differentiable Physics
Motion mimicking is a foundational task in physics-based character animation.
However, most existing motion mimicking methods are built upon reinforcement
learning (RL) and suffer from heavy reward engineering, high variance, and slow
convergence with hard explorations. Specifically, they usually take tens of
hours or even days of training to mimic a simple motion sequence, resulting in
poor scalability. In this work, we leverage differentiable physics simulators
(DPS) and propose an efficient motion mimicking method dubbed DiffMimic. Our
key insight is that DPS casts a complex policy learning task into a much simpler
state-matching problem. In particular, DPS learns a stable policy via analytical
gradients with ground-truth physical priors, leading to significantly faster and
more stable convergence than RL-based methods. Moreover, to escape from
local optima, we utilize a Demonstration Replay mechanism to enable stable
gradient backpropagation in a long horizon. Extensive experiments on standard
benchmarks show that DiffMimic has better sample efficiency and time
efficiency than existing methods (e.g., DeepMimic). Notably, DiffMimic allows a
physically simulated character to learn Backflip after 10 minutes of training
and be able to cycle it after 3 hours of training, while the existing approach
may require about a day of training to cycle Backflip. More importantly, we
hope DiffMimic can benefit more differentiable animation systems with
techniques like differentiable clothes simulation in future research.
Comment: ICLR 2023. Code is at https://github.com/jiawei-ren/diffmimic Project
page is at https://diffmimic.github.io
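To convey the state-matching-with-analytic-gradients idea in code, here is a toy PyTorch sketch: a point-mass simulator stands in for a real differentiable physics simulator, the policy is trained by backpropagating a state-matching loss through the rollout, and the rollout restarts from the reference state when it drifts, loosely mimicking the Demonstration Replay mechanism. The dynamics, reference trajectory, threshold, and network are all assumptions for illustration.

```python
import torch
import torch.nn as nn

def step(state, action, dt=0.05):
    """Toy differentiable dynamics (point mass): stand-in for a real DPS."""
    pos, vel = state[..., :2], state[..., 2:]
    vel = vel + action * dt
    pos = pos + vel * dt
    return torch.cat([pos, vel], dim=-1)

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
ref = torch.cumsum(torch.rand(100, 4) * 0.01, dim=0)  # hypothetical reference states

for epoch in range(10):
    state, loss = ref[0].clone(), 0.0
    for t in range(1, len(ref)):
        state = step(state, policy(state))
        loss = loss + (state - ref[t]).pow(2).sum()    # analytic state-matching loss
        # Demonstration-replay-style reset: restart from the reference state when
        # the rollout drifts too far, keeping long-horizon gradients stable.
        if (state - ref[t]).norm() > 0.5:
            state = ref[t].clone()
    opt.zero_grad()
    loss.backward()
    opt.step()
```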