Learning Character-Agnostic Motion for Motion Retargeting in 2D
Analyzing human motion is a challenging task with a wide variety of
applications in computer vision and in graphics. One such application, of
particular importance in computer animation, is the retargeting of motion from
one performer to another. While humans move in three dimensions, the vast
majority of human motions are captured using video, requiring 2D-to-3D pose and
camera recovery, before existing retargeting approaches may be applied. In this
paper, we present a new method for retargeting video-captured motion between
different human performers, without the need to explicitly reconstruct 3D poses
and/or camera parameters. In order to achieve our goal, we learn to extract,
directly from a video, a high-level latent motion representation, which is
invariant to the skeleton geometry and the camera view. Our key idea is to
train a deep neural network to decompose temporal sequences of 2D poses into
three components: motion, skeleton, and camera view-angle. Having extracted
such a representation, we are able to re-combine motion with novel skeletons
and camera views, and decode a retargeted temporal sequence, which we compare
to a ground truth from a synthetic dataset. We demonstrate that our framework
can be used to robustly extract human motion from videos, bypassing 3D
reconstruction, and outperforming existing retargeting methods, when applied to
videos in-the-wild. It also enables additional applications, such as
performance cloning, video-driven cartoons, and motion retrieval.
Comment: SIGGRAPH 2019. arXiv admin note: text overlap with arXiv:1804.05653 by other authors.
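Below is a minimal sketch, in PyTorch, of the decomposition described in the abstract above. It is an illustration of the idea rather than the authors' implementation: all module names and layer sizes are assumptions. Three temporal encoders map a 2D pose sequence to motion, skeleton, and view-angle codes (the latter two pooled over time), and a decoder recombines them; retargeting then amounts to feeding the motion branch one performer's sequence and the skeleton branch another's.

```python
# Minimal sketch (not the authors' code): a three-branch temporal encoder that
# decomposes a 2D pose sequence into motion, skeleton, and view-angle codes,
# plus a decoder that recombines them. Names and sizes are illustrative.
import torch
import torch.nn as nn

class PoseSequenceAutoencoder(nn.Module):
    def __init__(self, n_joints=15, motion_dim=128, skeleton_dim=64, view_dim=8):
        super().__init__()
        c_in = n_joints * 2                          # (x, y) per joint, per frame
        def branch(c_out):                           # temporal 1D-conv encoder
            return nn.Sequential(
                nn.Conv1d(c_in, 128, kernel_size=7, padding=3), nn.LeakyReLU(0.2),
                nn.Conv1d(128, c_out, kernel_size=7, padding=3))
        self.enc_motion = branch(motion_dim)         # time-varying code
        self.enc_skeleton = branch(skeleton_dim)     # static code (pooled over time)
        self.enc_view = branch(view_dim)             # static code (pooled over time)
        self.decoder = nn.Sequential(
            nn.Conv1d(motion_dim + skeleton_dim + view_dim, 256, 7, padding=3),
            nn.LeakyReLU(0.2),
            nn.Conv1d(256, c_in, 7, padding=3))

    def forward(self, x_motion, x_skeleton, x_view):
        # Each input: (batch, joints*2, frames). Feeding different sequences to
        # the three branches is what enables retargeting at test time.
        m = self.enc_motion(x_motion)
        s = self.enc_skeleton(x_skeleton).mean(dim=2, keepdim=True).expand(-1, -1, m.shape[2])
        v = self.enc_view(x_view).mean(dim=2, keepdim=True).expand(-1, -1, m.shape[2])
        return self.decoder(torch.cat([m, s, v], dim=1))
```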
Neuron-level Selective Context Aggregation for Scene Segmentation
Contextual information provides important cues for disambiguating visually
similar pixels in scene segmentation. In this paper, we introduce a
neuron-level Selective Context Aggregation (SCA) module for scene segmentation,
comprised of a contextual dependency predictor and a context aggregation
operator. The dependency predictor is implicitly trained to infer contextual
dependencies between different image regions. The context aggregation operator
augments local representations with global context, which is aggregated
selectively at each neuron according to its on-the-fly predicted dependencies.
The proposed mechanism enables data-driven inference of contextual
dependencies, and facilitates context-aware feature learning. The proposed
method improves strong baselines built upon VGG16 on challenging scene
segmentation datasets, which demonstrates its effectiveness in modeling context
information.
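The module described above can be read as a form of per-position attention over global context. The following is a hypothetical sketch of that reading, not the paper's code: the dependency predictor scores how much each spatial position depends on every other position, and the aggregation operator mixes global context into each local feature according to those scores.

```python
# Hypothetical implementation of neuron-level selective context aggregation:
# a dependency predictor plus a context aggregation operator.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveContextAggregation(nn.Module):
    def __init__(self, channels, key_dim=64):
        super().__init__()
        self.query = nn.Conv2d(channels, key_dim, 1)   # dependency predictor
        self.key = nn.Conv2d(channels, key_dim, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, K)
        k = self.key(x).flatten(2)                     # (B, K, HW)
        v = self.value(x).flatten(2).transpose(1, 2)   # (B, HW, C)
        dep = F.softmax(q @ k, dim=-1)                 # predicted dependencies
        ctx = (dep @ v).transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([x, ctx], dim=1))   # augment local features
```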
SAGNet: Structure-aware Generative Network for 3D-Shape Modeling
We present SAGNet, a structure-aware generative model for 3D shapes. Given a
set of segmented objects of a certain class, the geometry of their parts and
the pairwise relationships between them (the structure) are jointly learned and
embedded in a latent space by an autoencoder. The encoder intertwines the
geometry and structure features into a single latent code, while the decoder
disentangles the features and reconstructs the geometry and structure of the 3D
model. Our autoencoder consists of two branches, one for the structure and one
for the geometry. The key idea is that during the analysis, the two branches
exchange information between them, thereby learning the dependencies between
structure and geometry and encoding two augmented features, which are then
fused into a single latent code. This explicit intertwining of information
enables separately controlling the geometry and the structure of the generated
models. We evaluate the performance of our method and conduct an ablation
study. We explicitly show that encoding of shapes accounts for both
similarities in structure and geometry. A variety of quality results generated
by SAGNet are presented. The data and code are at
https://github.com/zhijieW-94/SAGNet.
Comment: Accepted by SIGGRAPH 2019.
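A minimal sketch of the two-branch encoder/decoder layout described above, assuming simple fully connected branches; the released SAGNet uses richer geometry and structure representations, so everything below is illustrative only.

```python
# Illustrative two-branch autoencoder: geometry and structure branches exchange
# information before being fused into a single latent code; the decoder
# disentangles the code and reconstructs both.
import torch
import torch.nn as nn

class TwoBranchEncoder(nn.Module):
    def __init__(self, geo_dim, struct_dim, hidden=256, latent=128):
        super().__init__()
        self.geo_enc = nn.Linear(geo_dim, hidden)
        self.struct_enc = nn.Linear(struct_dim, hidden)
        self.geo_from_struct = nn.Linear(hidden, hidden)   # information exchange
        self.struct_from_geo = nn.Linear(hidden, hidden)
        self.fuse = nn.Linear(2 * hidden, latent)           # single latent code

    def forward(self, geometry, structure):
        g = torch.relu(self.geo_enc(geometry))
        s = torch.relu(self.struct_enc(structure))
        g_aug = g + torch.relu(self.geo_from_struct(s))      # augmented features
        s_aug = s + torch.relu(self.struct_from_geo(g))
        return self.fuse(torch.cat([g_aug, s_aug], dim=-1))

class TwoBranchDecoder(nn.Module):
    def __init__(self, geo_dim, struct_dim, latent=128, hidden=256):
        super().__init__()
        self.expand = nn.Linear(latent, 2 * hidden)
        self.geo_dec = nn.Linear(hidden, geo_dim)            # disentangle and
        self.struct_dec = nn.Linear(hidden, struct_dim)      # reconstruct both

    def forward(self, z):
        g_h, s_h = torch.relu(self.expand(z)).chunk(2, dim=-1)
        return self.geo_dec(g_h), self.struct_dec(s_h)
```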
DiDA: Disentangled Synthesis for Domain Adaptation
Unsupervised domain adaptation aims at learning a shared model for two
related, but not identical, domains by leveraging supervision from a source
domain to an unsupervised target domain. A number of effective domain
adaptation approaches rely on the ability to extract discriminative, yet
domain-invariant, latent factors which are common to both domains. Extracting
latent commonality is also useful for disentanglement analysis, enabling
separation between the common and the domain-specific features of both domains.
In this paper, we present a method for boosting domain adaptation performance
by leveraging disentanglement analysis. The key idea is that by learning to
separately extract both the common and the domain-specific features, one can
synthesize more target domain data with supervision, thereby boosting the
domain adaptation performance. Better common feature extraction, in turn, helps
further improve the disentanglement analysis and disentangled synthesis. We
show that iterating between domain adaptation and disentanglement analysis can
consistently improve each other on several unsupervised domain adaptation
tasks, for various domain adaptation backbone models.
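The synthesis step can be sketched as follows, with hypothetical encoders and a decoder standing in for whatever disentanglement model is used: source "common" features carry the class label, target "specific" features carry the domain appearance, and recombining them yields extra labeled, target-like samples for the next adaptation round.

```python
# Minimal sketch of disentangled synthesis (hypothetical modules, not the
# paper's implementation).
import torch
import torch.nn as nn

class Disentangler(nn.Module):
    def __init__(self, in_dim=784, common_dim=64, specific_dim=16):
        super().__init__()
        self.enc_common = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                        nn.Linear(256, common_dim))
        self.enc_specific = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                          nn.Linear(256, specific_dim))
        self.decoder = nn.Sequential(nn.Linear(common_dim + specific_dim, 256),
                                     nn.ReLU(), nn.Linear(256, in_dim))

    def synthesize(self, x_source, x_target):
        # Class content from the source, domain style from the target.
        c = self.enc_common(x_source)
        s = self.enc_specific(x_target)
        return self.decoder(torch.cat([c, s], dim=-1))

model = Disentangler()
x_src, y_src = torch.randn(32, 784), torch.randint(0, 10, (32,))
x_tgt = torch.randn(32, 784)
x_synth = model.synthesize(x_src, x_tgt)   # labeled as y_src, looks target-like
```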
Printed Perforated Lampshades for Continuous Projective Images
We present a technique for designing 3D-printed perforated lampshades, which
project continuous grayscale images onto the surrounding walls. Given the
geometry of the lampshade and a target grayscale image, our method computes a
distribution of tiny holes over the shell, such that the combined footprints of
the light emanating through the holes form the target image on a nearby diffuse
surface. Our objective is to approximate the continuous tones and the spatial
detail of the target image, to the extent possible within the constraints of
the fabrication process.
To ensure structural integrity, there are lower bounds on the thickness of
the shell, the radii of the holes, and the minimal distances between adjacent
holes. Thus, the holes are realized as thin tubes distributed over the
lampshade surface. The amount of light passing through a single tube may be
controlled by the tube's radius and by its direction (tilt angle). The core of
our technique thus consists of determining a suitable configuration of the
tubes: their distribution across the relevant portion of the lampshade, as well
as the parameters (radius, tilt angle) of each tube. This is achieved by
computing a capacity-constrained Voronoi tessellation over a suitably defined
density function, and embedding a tube inside the maximal inscribed circle of
each tessellation cell. The density function for a particular target image is
derived from a series of simulated images, each corresponding to a different
uniform density tube pattern on the lampshade.
Comment: 10 pages.
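As a rough illustration of the recipe (and only that: the paper uses a capacity-constrained Voronoi tessellation, which is replaced here by naive rejection sampling with a minimal-spacing check), one could sample hole centers with density proportional to the target brightness and then set each hole's radius within fabrication bounds:

```python
# Illustrative stand-in for the paper's solver, using NumPy.
import numpy as np

def sample_hole_centers(target, n_holes, min_dist, rng=np.random.default_rng(0)):
    """Rejection-sample hole centers with density ~ target brightness,
    enforcing a minimal spacing between holes."""
    h, w = target.shape
    centers, attempts = [], 0
    while len(centers) < n_holes and attempts < 100_000:
        attempts += 1
        y, x = rng.integers(h), rng.integers(w)
        if rng.random() > target[y, x]:
            continue                                  # keep density ~ brightness
        if all((y - cy) ** 2 + (x - cx) ** 2 >= min_dist ** 2 for cy, cx in centers):
            centers.append((y, x))
    return np.array(centers)

def hole_radii(target, centers, r_min=0.5, r_max=1.5):
    # Transmitted light grows roughly with hole area, so radius ~ sqrt(tone),
    # clamped to the structurally safe range.
    tones = target[centers[:, 0], centers[:, 1]]
    return np.clip(r_min + (r_max - r_min) * np.sqrt(tones), r_min, r_max)

target = np.clip(np.random.rand(64, 64), 0, 1)        # grayscale target in [0, 1]
centers = sample_hole_centers(target, n_holes=200, min_dist=2.0)
radii = hole_radii(target, centers)
```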
Unsupervised multi-modal Styled Content Generation
The emergence of deep generative models has recently enabled the automatic
generation of massive amounts of graphical content, both in 2D and in 3D.
Generative Adversarial Networks (GANs) and style control mechanisms, such as
Adaptive Instance Normalization (AdaIN), have proved particularly effective in
this context, culminating in the state-of-the-art StyleGAN architecture. While
such models are able to learn diverse distributions, provided a sufficiently
large training set, they are not well-suited for scenarios where the
distribution of the training data exhibits a multi-modal behavior. In such
cases, reshaping a uniform or normal distribution over the latent space into a
complex multi-modal distribution in the data domain is challenging, and the
generator might fail to sample the target distribution well. Furthermore,
existing unsupervised generative models are not able to control the mode of the
generated samples independently of the other visual attributes, despite the
fact that they are typically disentangled in the training data.
In this paper, we introduce UMMGAN, a novel architecture designed to better
model multi-modal distributions, in an unsupervised fashion. Building upon the
StyleGAN architecture, our network learns multiple modes, in a completely
unsupervised manner, and combines them using a set of learned weights. We
demonstrate that this approach is capable of effectively approximating a
complex distribution as a superposition of multiple simple ones. We further
show that UMMGAN effectively disentangles between modes and style, thereby
providing an independent degree of control over the generated content.
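One plausible reading of the mode-mixing mechanism described above is sketched below; the layer names and the way the mixing weights are produced are assumptions, not the released UMMGAN code. The network keeps K learned mode vectors and feeds the generator a softmax-weighted combination of them, so the effective latent prior becomes a learned superposition of simple modes.

```python
# Hypothetical mode-mixing layer: K learned modes combined with learned weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModeMixer(nn.Module):
    def __init__(self, n_modes=8, latent_dim=512):
        super().__init__()
        self.modes = nn.Parameter(torch.randn(n_modes, latent_dim))  # learned modes
        self.to_weights = nn.Linear(latent_dim, n_modes)             # learned mixing

    def forward(self, z):                            # z ~ N(0, I), (batch, latent_dim)
        w = F.softmax(self.to_weights(z), dim=-1)    # per-sample mode weights
        return w @ self.modes                        # superposition of simple modes
```

In this sketch the mixed vector would then drive a StyleGAN-style synthesis network, with style applied through AdaIN independently of the selected mode.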
CrossNet: Latent Cross-Consistency for Unpaired Image Translation
Recent GAN-based architectures have been able to deliver impressive
performance on the general task of image-to-image translation. In particular,
it was shown that a wide variety of image translation operators may be learned
from two image sets, containing images from two different domains, without
establishing an explicit pairing between the images. This was made possible by
introducing clever regularizers to overcome the under-constrained nature of the
unpaired translation problem. In this work, we introduce a novel architecture
for unpaired image translation, and explore several new regularizers enabled by
it. Specifically, our architecture comprises a pair of GANs, as well as a pair
of translators between their respective latent spaces. These cross-translators
enable us to impose several regularizing constraints on the learnt image
translation operator, collectively referred to as latent cross-consistency. Our
results show that our proposed architecture and latent cross-consistency
constraints are able to outperform the existing state-of-the-art on a variety
of image translation tasks.
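A minimal sketch of the latent cross-consistency terms, with simple linear translators standing in for the cross-translators between the two GANs' latent spaces (illustrative only):

```python
# Illustrative cross-translators and round-trip consistency losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLatent(nn.Module):
    def __init__(self, latent=256):
        super().__init__()
        self.t_ab = nn.Linear(latent, latent)   # translator Z_A -> Z_B
        self.t_ba = nn.Linear(latent, latent)   # translator Z_B -> Z_A

    def cross_consistency(self, z_a, z_b):
        # Round trips A -> B -> A and B -> A -> B should recover the input codes.
        loss = F.l1_loss(self.t_ba(self.t_ab(z_a)), z_a)
        loss = loss + F.l1_loss(self.t_ab(self.t_ba(z_b)), z_b)
        return loss
```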
Shape-Pose Disentanglement using SE(3)-equivariant Vector Neurons
We introduce an unsupervised technique for encoding point clouds into a
canonical shape representation, by disentangling shape and pose. Our encoder is
stable and consistent, meaning that the shape encoding is purely
pose-invariant, while the extracted rotation and translation are able to
semantically align different input shapes of the same class to a common
canonical pose. Specifically, we design an auto-encoder based on Vector Neuron
Networks, a rotation-equivariant neural network, whose layers we extend to
provide translation-equivariance in addition to the original rotation-equivariance. The
resulting encoder produces pose-invariant shape encoding by construction,
enabling our approach to focus on learning a consistent canonical pose for a
class of objects. Quantitative and qualitative experiments validate the
superior stability and consistency of our approach.
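Conceptually, the encoder is trained to factor each input cloud into a translation, a rotation, and a pose-invariant canonical shape. The sketch below illustrates that factorization only; `estimate_rotation` is a stand-in for the SE(3)-equivariant Vector Neuron branch, and the rotation convention is illustrative.

```python
# Conceptual decomposition of a point cloud into shape, rotation, and translation.
import torch

def canonicalize(points, estimate_rotation):
    """points: (B, N, 3). Returns canonical shape, rotation, translation."""
    t = points.mean(dim=1, keepdim=True)          # translation = centroid
    centered = points - t                         # translation-invariant input
    R = estimate_rotation(centered)               # (B, 3, 3), equivariant branch
    canonical = centered @ R                      # undo rotation (convention-dependent)
    return canonical, R, t.squeeze(1)

# Placeholder rotation estimator (identity) just to make the sketch runnable.
def identity_rotation(centered):
    b = centered.shape[0]
    return torch.eye(3).expand(b, 3, 3)

shape, R, t = canonicalize(torch.randn(2, 1024, 3), identity_rotation)
```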
Cross-Domain Cascaded Deep Feature Translation
In recent years we have witnessed tremendous progress in unpaired
image-to-image translation methods, propelled by the emergence of DNNs and
adversarial training strategies. However, most existing methods focus on
transfer of style and appearance, rather than on shape translation. The latter
task is challenging, due to its intricate non-local nature, which calls for
additional supervision. We mitigate this by descending the deep layers of a
pre-trained network, where the deep features contain more semantics, and
applying the translation from and between these deep features. Specifically, we
leverage VGG, which is a classification network, pre-trained with large-scale
semantic supervision. Our translation is performed in a cascaded,
deep-to-shallow, fashion, along the deep feature hierarchy: we first translate
between the deepest layers that encode the higher-level semantic content of the
image, proceeding to translate the shallower layers, conditioned on the deeper
ones. We show that our method is able to translate between different domains,
which exhibit significantly different shapes. We evaluate our method both
qualitatively and quantitatively and compare it to state-of-the-art
image-to-image translation methods. Our code and trained models will be made
available.
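The cascade can be sketched as follows, with hypothetical translator modules and the VGG feature extraction elided: the deepest feature map is translated first, and each shallower translator is conditioned on the (upsampled) result of the deeper one.

```python
# Illustrative deep-to-shallow cascaded translation over a feature hierarchy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedTranslator(nn.Module):
    """Translate a feature map, conditioned on the already-translated deeper map."""
    def __init__(self, channels, deeper_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + deeper_channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, feat, deeper_translated):
        up = F.interpolate(deeper_translated, size=feat.shape[2:])
        return self.net(torch.cat([feat, up], dim=1))

def cascade_translate(feats, translators, deepest_translator):
    out = deepest_translator(feats[0])             # translate deepest layer first
    outputs = [out]
    for feat, tr in zip(feats[1:], translators):   # then shallower, conditioned
        out = tr(feat, outputs[-1])
        outputs.append(out)
    return outputs

# Toy usage with VGG-like channel counts (deepest-first).
deep = nn.Sequential(nn.Conv2d(512, 512, 3, padding=1))
trs = [ConditionedTranslator(256, 512), ConditionedTranslator(128, 256)]
feats = [torch.randn(1, 512, 8, 8), torch.randn(1, 256, 16, 16), torch.randn(1, 128, 32, 32)]
results = cascade_translate(feats, trs, deep)
```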
Synthesizing Training Images for Boosting Human 3D Pose Estimation
Human 3D pose estimation from a single image is a challenging task with
numerous applications. Convolutional Neural Networks (CNNs) have recently
achieved superior performance on the task of 2D pose estimation from a single
image, by training on images with 2D annotations collected by crowdsourcing.
This suggests that similar success could be achieved for direct estimation of
3D poses. However, 3D poses are much harder to annotate, and the lack of
suitable annotated training images hinders attempts towards end-to-end
solutions. To address this issue, we opt to automatically synthesize training
images with ground truth pose annotations. Our work is a systematic study along
this road. We find that pose space coverage and texture diversity are the key
ingredients for the effectiveness of synthetic training data. We present a
fully automatic, scalable approach that samples the human pose space for
guiding the synthesis procedure and extracts clothing textures from real
images. Furthermore, we explore domain adaptation for bridging the gap between
our synthetic training images and real testing photos. We demonstrate that CNNs
trained with our synthetic images outperform those trained with real photos on
3D pose estimation tasks.
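An illustrative sketch of the data flow described above; every function here is a simplified stand-in for the paper's pipeline. Poses are sampled to cover the pose space, paired with clothing textures extracted from real photos, and the sampled 3D pose is kept as the ground-truth label for the rendered image.

```python
# Toy data-synthesis loop with placeholder sampling and rendering.
import numpy as np

rng = np.random.default_rng(0)

def sample_pose(mocap_poses, noise_std=0.05):
    """Pose-space sampling: interpolate and perturb MoCap poses (J joints x 3)."""
    a, b = mocap_poses[rng.integers(len(mocap_poses), size=2)]
    t = rng.random()
    return (1 - t) * a + t * b + rng.normal(0, noise_std, a.shape)

def synthesize_dataset(mocap_poses, textures, render, n_samples):
    images, labels = [], []
    for _ in range(n_samples):
        pose = sample_pose(mocap_poses)                  # ground-truth 3D pose
        texture = textures[rng.integers(len(textures))]  # clothing from real photos
        images.append(render(pose, texture))             # synthetic training image
        labels.append(pose)
    return images, labels

# Toy usage with a dummy renderer, just to show the data flow.
mocap = rng.normal(size=(100, 16, 3))
tex = [rng.random((64, 64, 3)) for _ in range(10)]
imgs, poses = synthesize_dataset(mocap, tex, render=lambda p, t: t, n_samples=4)
```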