DRIT++: Diverse Image-to-Image Translation via Disentangled Representations
Image-to-image translation aims to learn the mapping between two visual
domains. There are two main challenges for this task: 1) lack of aligned
training pairs and 2) multiple possible outputs from a single input image. In
this work, we present an approach based on disentangled representation for
generating diverse outputs without paired training images. To synthesize
diverse outputs, we propose to embed images onto two spaces: a domain-invariant
content space capturing shared information across domains and a domain-specific
attribute space. Our model takes the encoded content features extracted from a
given input and attribute vectors sampled from the attribute space to
synthesize diverse outputs at test time. To handle unpaired training data, we
introduce a cross-cycle consistency loss based on disentangled representations.
Qualitative results show that our model can generate diverse and realistic
images on a wide range of tasks without paired training data. For quantitative
evaluations, we measure realism with a user study and the Fréchet inception distance, and measure diversity with the perceptual distance metric, the Jensen-Shannon divergence, and the number of statistically-different bins.
Comment: IJCV journal extension of the ECCV 2018 paper "Diverse Image-to-Image Translation via Disentangled Representations" (arXiv:1808.00948). Project page: http://vllab.ucmerced.edu/hylee/DRIT_pp/ Code: https://github.com/HsinYingLee/DRI
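To make the two-space embedding and the cross-cycle consistency loss concrete, here is a minimal PyTorch sketch. The module shapes, the shared encoders, and the additive attribute injection are illustrative assumptions, not the authors' released architecture; it only shows how content and attribute codes would be swapped across domains and then swapped back to recover the inputs.

```python
# Hedged sketch of content/attribute disentanglement and cross-cycle consistency.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentEncoder(nn.Module):          # domain-invariant content space
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class AttributeEncoder(nn.Module):        # domain-specific attribute vector
    def __init__(self, dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(32, dim))
    def forward(self, x):
        return self.net(x)

class Generator(nn.Module):               # decodes (content, attribute) to an image
    def __init__(self, dim=8):
        super().__init__()
        self.fc = nn.Linear(dim, 64)
        self.net = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh())
    def forward(self, content, attr):
        mod = self.fc(attr).unsqueeze(-1).unsqueeze(-1)   # broadcast attribute code
        return self.net(content + mod)

E_c, E_a = ContentEncoder(), AttributeEncoder()
G_a, G_b = Generator(), Generator()        # one generator per domain

x_a, x_b = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
c_a, c_b = E_c(x_a), E_c(x_b)
s_a, s_b = E_a(x_a), E_a(x_b)

# First translation: swap attribute codes across domains.
u = G_b(c_a, s_b)                          # content of a, attribute of b
v = G_a(c_b, s_a)                          # content of b, attribute of a

# Second translation: swap back; the originals should be recovered.
x_a_rec = G_a(E_c(u), E_a(v))
x_b_rec = G_b(E_c(v), E_a(u))
cross_cycle_loss = F.l1_loss(x_a_rec, x_a) + F.l1_loss(x_b_rec, x_b)
print(cross_cycle_loss.item())
# At test time, diverse outputs come from sampling the attribute vector,
# e.g. G_b(c_a, torch.randn(1, 8)).
```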
Diverse Image-to-Image Translation via Disentangled Representations
Image-to-image translation aims to learn the mapping between two visual
domains. There are two main challenges for many applications: 1) the lack of
aligned training pairs and 2) multiple possible outputs from a single input
image. In this work, we present an approach based on disentangled
representation for producing diverse outputs without paired training images. To
achieve diversity, we propose to embed images onto two spaces: a
domain-invariant content space capturing shared information across domains and
a domain-specific attribute space. Our model takes the encoded content features
extracted from a given input and the attribute vectors sampled from the
attribute space to produce diverse outputs at test time. To handle unpaired
training data, we introduce a novel cross-cycle consistency loss based on
disentangled representations. Qualitative results show that our model can
generate diverse and realistic images on a wide range of tasks without paired
training data. For quantitative comparisons, we measure realism with a user study and diversity with a perceptual distance metric. We apply the proposed model to domain adaptation and show competitive performance compared to the state-of-the-art on the MNIST-M and LineMod datasets.
Comment: ECCV 2018 (Oral). Project page: http://vllab.ucmerced.edu/hylee/DRIT/ Code: https://github.com/HsinYingLee/DRIT
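The diversity measurement mentioned above (a perceptual distance metric) is commonly computed with LPIPS as the average pairwise distance among outputs sampled from one input. The sketch below assumes the third-party `lpips` package and uses random tensors as stand-ins for translated outputs; it is an illustration of the metric, not the paper's evaluation code.

```python
# Hedged sketch: average pairwise LPIPS distance as a diversity score.
import itertools
import torch
import lpips  # pip install lpips

metric = lpips.LPIPS(net='alex')                       # perceptual distance network
samples = [torch.rand(1, 3, 256, 256) * 2 - 1          # placeholders for sampled outputs,
           for _ in range(5)]                          # images scaled to [-1, 1]

with torch.no_grad():
    dists = [metric(a, b).item()
             for a, b in itertools.combinations(samples, 2)]
diversity = sum(dists) / len(dists)                    # higher = more diverse outputs
print(f"average pairwise LPIPS: {diversity:.4f}")
```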
TransGaGa: Geometry-Aware Unsupervised Image-to-Image Translation
Unsupervised image-to-image translation aims at learning a mapping between
two visual domains. However, learning a translation across large geometry variations often ends in failure. In this work, we present a novel disentangle-and-translate framework to tackle image-to-image translation for complex objects. Instead of learning the mapping on the image space directly, we disentangle the image space into a Cartesian product of the
appearance and the geometry latent spaces. Specifically, we first introduce a
geometry prior loss and a conditional VAE loss to encourage the network to
learn independent but complementary representations. The translation is then
built on appearance and geometry space separately. Extensive experiments
demonstrate the superior performance of our method over other state-of-the-art approaches, especially on the challenging near-rigid and non-rigid object
translation tasks. In addition, by taking different exemplars as the appearance
references, our method also supports multimodal translation. Project page:
https://wywu.github.io/projects/TGaGa/TGaGa.html
Comment: Accepted to CVPR 2019. Project page: https://wywu.github.io/projects/TGaGa/TGaGa.htm
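The appearance/geometry factorization described above can be illustrated with a small sketch. The landmark-heatmap geometry code and the single KL term below are stand-ins chosen for illustration; they are not the paper's exact geometry prior loss or conditional VAE loss.

```python
# Hedged sketch of splitting an image into appearance and geometry latent codes.
import torch
import torch.nn as nn

class GeometryEncoder(nn.Module):
    """Predicts K spatial heatmaps as a structure (geometry) code."""
    def __init__(self, k=16):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(32, k, 4, 2, 1))
    def forward(self, x):
        h = self.net(x)                                   # (B, K, H/4, W/4)
        return torch.softmax(h.flatten(2), dim=-1).view_as(h)

class AppearanceEncoder(nn.Module):
    """Outputs mean/log-variance of a Gaussian appearance code (VAE-style)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.mu, self.logvar = nn.Linear(32, dim), nn.Linear(32, dim)
    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

x = torch.randn(4, 3, 64, 64)
geo = GeometryEncoder()(x)
mu, logvar = AppearanceEncoder()(x)
z_app = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterize

# KL term pushing the appearance code toward a standard Gaussian prior,
# so appearance stays independent of (and complementary to) geometry.
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
print(geo.shape, z_app.shape, kl.item())
```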
Towards Instance-level Image-to-Image Translation
Unpaired image-to-image translation is a newly rising and challenging vision problem that aims to learn a mapping between unaligned image pairs in diverse
domains. Recent advances in this field like MUNIT and DRIT mainly focus on
disentangling content and style/attribute from a given image first, then
directly adopting the global style to guide the model to synthesize new domain
images. However, this kind of approach incurs severe contradictions when the target-domain images are content-rich with multiple discrepant objects. In this paper, we present a simple yet effective instance-aware image-to-image translation approach (INIT), which applies fine-grained local (instance) and global styles to the target image spatially. The proposed INIT exhibits three important advantages: (1) the instance-level objective loss can help learn a
more accurate reconstruction and incorporate diverse attributes of objects; (2)
the local/global styles used for the target domain come from corresponding spatial regions in the source domain, which is intuitively a more reasonable mapping; (3) the joint training process can benefit both fine and coarse
granularity and incorporates instance information to improve the quality of
global translation. We also collect a large-scale benchmark for the new
instance-level translation task. We observe that our synthetic images can even
benefit real-world vision tasks like generic object detection.
Comment: Accepted to CVPR 2019. Project page: http://zhiqiangshen.com/projects/INIT/index.htm
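A minimal sketch of the local/global style idea follows: a global style comes from the whole target-domain exemplar, while a fine-grained style comes from one instance region and is applied only to the corresponding region of the source features. The AdaIN-like modulation and the hard-coded box are illustrative assumptions, not the INIT training pipeline.

```python
# Hedged sketch: instance-level style applied to a spatial region, global style elsewhere.
import torch

def adain_params(feat):
    """Mean/std style statistics of a feature map."""
    mu = feat.mean(dim=(2, 3), keepdim=True)
    std = feat.std(dim=(2, 3), keepdim=True) + 1e-5
    return mu, std

def stylize(content, style_mu, style_std):
    mu, std = adain_params(content)
    return (content - mu) / std * style_std + style_mu

content = torch.randn(1, 64, 32, 32)          # encoded source-domain features
target = torch.randn(1, 64, 32, 32)           # encoded target-domain exemplar

g_mu, g_std = adain_params(target)            # global style
box = (8, 8, 20, 20)                          # instance region (y0, x0, y1, x1)
inst = target[:, :, box[0]:box[2], box[1]:box[3]]
i_mu, i_std = adain_params(inst)              # fine-grained instance style

out = stylize(content, g_mu, g_std)           # global translation
out[:, :, box[0]:box[2], box[1]:box[3]] = stylize(
    content[:, :, box[0]:box[2], box[1]:box[3]], i_mu, i_std)  # local override
print(out.shape)
```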
Attribute Guided Unpaired Image-to-Image Translation with Semi-supervised Learning
Unpaired Image-to-Image Translation (UIT) focuses on translating images among
different domains by using unpaired data, which has received increasing
research focus due to its practical usage. However, existing UIT schemes suffer from the need for supervised training, as well as a lack of encoded domain information. In this paper, we propose an Attribute Guided UIT model termed AGUIT to tackle these two challenges. AGUIT considers the multi-modal and multi-domain tasks of UIT jointly under a novel semi-supervised setting, which also benefits representation disentanglement and fine control of outputs. In particular, AGUIT benefits from two designs: (1) It adopts a novel semi-supervised
learning process by translating attributes of labeled data to unlabeled data,
and then reconstructing the unlabeled data by a cycle consistency operation.
(2) It decomposes image representation into domain-invariant content code and
domain-specific style code. The redesigned style code embeds the image style into two variables drawn from a standard Gaussian distribution and the distribution of domain labels, which facilitates fine control of translation due to the continuity of both variables. Finally, we introduce a new challenge for UIT models, i.e., disentangled transfer, which adopts the disentangled representation to translate data less related to the training set. Extensive experiments demonstrate the capacity of AGUIT over existing state-of-the-art models.
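The redesigned style code can be pictured as the concatenation of a standard-Gaussian noise part and a continuous attribute-label part; sliding one label entry then gives smooth control over one attribute. The dimensions and the simple concatenation below are assumptions for illustration, not AGUIT's exact parameterization.

```python
# Hedged sketch of a style code combining Gaussian noise with continuous attribute labels.
import torch

noise_dim, label_dim = 8, 5
noise = torch.randn(1, noise_dim)              # ~ N(0, I): appearance variation
labels = torch.tensor([[1., 0., 0.3, 0., 1.]]) # soft attribute labels (continuous)

style = torch.cat([noise, labels], dim=1)      # (1, 13) style code fed to the decoder

# Fine control: interpolate one attribute (index 2) while keeping the rest fixed.
for alpha in torch.linspace(0, 1, 5):
    labels_t = labels.clone()
    labels_t[0, 2] = alpha
    style_t = torch.cat([noise, labels_t], dim=1)
    print(alpha.item(), style_t[0, noise_dim + 2].item())
```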
Multi-mapping Image-to-Image Translation via Learning Disentanglement
Recent advances in image-to-image translation focus on learning the one-to-many mapping from two aspects: multi-modal translation and multi-domain translation. However, existing methods only consider one of the two perspectives, which prevents each from solving the other's problem. To address
this issue, we propose a novel unified model, which bridges these two
objectives. First, we disentangle the input images into latent representations using an encoder-decoder architecture with conditional adversarial training in the feature space. Then, we encourage the generator to
learn multi-mappings by a random cross-domain translation. As a result, we can
manipulate different parts of the latent representations to perform multi-modal
and multi-domain translations simultaneously. Experiments demonstrate that our
method outperforms state-of-the-art methods.
Comment: Accepted by NeurIPS 2019. Code will be available at https://github.com/Xiaoming-Yu/DMI
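The "random cross-domain translation" step can be sketched as follows: one generator receives the domain-invariant content together with a randomly sampled target-domain label and a randomly sampled style code, so varying the label gives multi-domain outputs and varying the style gives multi-modal outputs. The layer sizes and conditioning scheme are placeholders, not the paper's network.

```python
# Hedged sketch of random cross-domain translation with one shared generator.
import torch
import torch.nn as nn

n_domains, style_dim = 3, 8

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(n_domains + style_dim, 64)
        self.net = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh())
    def forward(self, content, domain_onehot, style):
        cond = self.fc(torch.cat([domain_onehot, style], dim=1))
        return self.net(content + cond.unsqueeze(-1).unsqueeze(-1))

content = torch.randn(1, 64, 16, 16)                 # domain-invariant content features
target = torch.randint(n_domains, (1,))              # randomly chosen target domain
domain_onehot = torch.eye(n_domains)[target]         # (1, n_domains)
style = torch.randn(1, style_dim)                    # random modality

fake = Generator()(content, domain_onehot, style)
print(fake.shape)   # vary `target` for multi-domain, `style` for multi-modal outputs
```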
ChildPredictor: A Child Face Prediction Framework with Disentangled Learning
The appearances of children are inherited from their parents, which makes it
feasible to predict them. Predicting realistic children's faces may help settle
many social problems, such as age-invariant face recognition, kinship
verification, and missing child identification. It can be regarded as an
image-to-image translation task. Existing approaches usually assume that the domain information in image-to-image translation can be interpreted as "style", i.e., via the separation of image content and style. However, such separation is improper for child face prediction, because the facial contours of children and parents are not the same. To address this issue, we propose a new
disentangled learning strategy for children's face prediction. We assume that
children's faces are determined by genetic factors (compact family features,
e.g., face contour), external factors (facial attributes irrelevant to
prediction, such as moustaches and glasses), and variety factors (individual
properties for each child). On this basis, we formulate predictions as a
mapping from parents' genetic factors to children's genetic factors, and
disentangle them from external and variety factors. In order to obtain accurate
genetic factors and perform the mapping, we propose a ChildPredictor framework.
It transfers human faces to genetic factors by encoders and back by generators.
Then, it learns the relationship between the genetic factors of parents and
children through a mapping function. To ensure the generated faces are
realistic, we collect a large Family Face Database to train ChildPredictor and
evaluate it on the FF-Database validation set. Experimental results demonstrate
that ChildPredictor is superior to other well-known image-to-image translation
methods in predicting realistic and diverse child faces. Implementation code can be found at https://github.com/zhaoyuzhi/ChildPredictor.
Comment: Accepted to IEEE Transactions on Multimedia
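The parent-to-child mapping in the genetic-factor space can be pictured with a toy pipeline: encode both parents to genetic codes, combine them with a sampled variety factor, map to a child genetic code, and decode a face. The tiny MLPs, code dimensions, and concatenation scheme are illustrative placeholders for ChildPredictor's encoders, mapping function, and generator.

```python
# Hedged sketch: parents' genetic factors -> child's genetic factor -> child face.
import torch
import torch.nn as nn

gene_dim, variety_dim = 32, 8
encode = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, gene_dim))    # face -> genetic code
mapping = nn.Sequential(nn.Linear(2 * gene_dim + variety_dim, 64), nn.ReLU(),
                        nn.Linear(64, gene_dim))                          # parents -> child code
generate = nn.Sequential(nn.Linear(gene_dim, 3 * 64 * 64), nn.Tanh())     # code -> face

father, mother = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
g_f, g_m = encode(father), encode(mother)
variety = torch.randn(1, variety_dim)          # individual variation per child
g_child = mapping(torch.cat([g_f, g_m, variety], dim=1))
child_face = generate(g_child).view(1, 3, 64, 64)
print(child_face.shape)                        # resample `variety` for diverse children
```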
MISO: Mutual Information Loss with Stochastic Style Representations for Multimodal Image-to-Image Translation
Unpaired multimodal image-to-image translation is a task of translating a
given image in a source domain into diverse images in the target domain,
overcoming the limitation of one-to-one mapping. Existing multimodal translation models are mainly based on disentangled representations with an image reconstruction loss. We propose two approaches to improve multimodal
translation quality. First, we use a content representation from the source
domain conditioned on a style representation from the target domain. Second,
rather than using a typical image reconstruction loss, we design MILO (Mutual
Information LOss), a new stochastically-defined loss function based on
information theory. This loss function directly reflects the interpretation of latent variables as random variables. We show that our proposed model, MISO (Mutual Information with StOchastic Style Representations), achieves state-of-the-art performance through extensive experiments on various real-world datasets.
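The stochastic treatment of latents can be illustrated by scoring reconstructions in expectation over samples from the style posterior, rather than with a single deterministic reconstruction. The sketch below conveys that idea only; it is not the exact MILO objective, and the decoder and posterior parameters are placeholders.

```python
# Hedged sketch: expected reconstruction loss over stochastic style samples.
import torch
import torch.nn as nn
import torch.nn.functional as F

decoder = nn.Sequential(nn.Linear(16, 3 * 32 * 32), nn.Tanh())

x = torch.rand(1, 3, 32, 32) * 2 - 1                   # target image in [-1, 1]
mu, logvar = torch.zeros(1, 16), torch.zeros(1, 16)    # stochastic style posterior

n_samples, losses = 8, []
for _ in range(n_samples):
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterize
    recon = decoder(z).view(1, 3, 32, 32)
    losses.append(F.mse_loss(recon, x))        # Gaussian negative log-likelihood up to scale
expected_recon_loss = torch.stack(losses).mean()
print(expected_recon_loss.item())
```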
Unified cross-modality feature disentangler for unsupervised multi-domain MRI abdomen organs segmentation
Our contribution is a unified cross-modality feature disentangling approach for multi-domain image translation and multiple organ segmentation. Using CT as the labeled source domain, our approach learns to segment multi-modal (T1-weighted and T2-weighted) MRI with no labeled data. Our approach uses a
variational auto-encoder (VAE) to disentangle the image content from style. The
VAE constrains the style feature encoding to match a universal prior (Gaussian)
that is assumed to span the styles of all the source and target modalities. The
extracted image style is converted into a latent style scaling code, which modulates the generator to produce multi-modality images from the image content features according to the target domain code. Finally, we introduce a
joint distribution matching discriminator that combines the translated images
with task-relevant segmentation probability maps to further constrain and
regularize image-to-image (I2I) translations. We performed extensive
comparisons to multiple state-of-the-art I2I translation and segmentation
methods. Our approach resulted in the lowest average multi-domain image reconstruction error of 1.34±0.04. Our approach produced an average Dice similarity coefficient (DSC) of 0.85 for T1w and 0.90 for T2w MRI for multi-organ segmentation, which was highly comparable to a fully supervised MRI multi-organ segmentation network (DSC of 0.86 for T1w and 0.90 for T2w MRI).
Comment: This paper has been accepted by MICCAI202
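The VAE constraint on the style encoding and the style-scaling modulation can be sketched as below: the style posterior is pulled toward one universal Gaussian prior shared by all modalities, and a sampled style code is turned into a channel-wise scale applied to the content features. Layer sizes and the affine modulation are assumptions for illustration only.

```python
# Hedged sketch: Gaussian-prior style encoding and style-scaling modulation.
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, 32, 4, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.mu, self.logvar = nn.Linear(32, dim), nn.Linear(32, dim)
    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

image = torch.randn(2, 1, 64, 64)              # CT or MRI slice (single channel)
content = torch.randn(2, 64, 16, 16)           # content features from a separate encoder

mu, logvar = StyleEncoder()(image)
z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

# Universal Gaussian prior assumed to span the styles of all modalities.
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

# Latent style scaling code modulates the generator's content features channel-wise.
scale = nn.Linear(16, 64)(z).unsqueeze(-1).unsqueeze(-1)
modulated = content * (1 + scale)
print(kl.item(), modulated.shape)
```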
Image-to-Image Translation via Group-wise Deep Whitening-and-Coloring Transformation
Recently, unsupervised exemplar-based image-to-image translation, conditioned on a given exemplar without paired data, has achieved substantial advances. In order to transfer the information from an exemplar to an input
image, existing methods often use a normalization technique, e.g., adaptive
instance normalization, that controls the channel-wise statistics of an input
activation map at a particular layer, such as the mean and the variance.
Meanwhile, style transfer approaches, which by nature tackle a task similar to image translation, have demonstrated superior performance by using higher-order statistics, such as the covariance among channels, to represent a style. In detail, this works via whitening (given a zero-mean input feature, transforming its covariance matrix into the identity), followed by coloring (changing the covariance matrix of the whitened feature to that of the style feature). However, applying this
approach in image translation is computationally intensive and error-prone due
to the expensive time complexity and its non-trivial backpropagation. In
response, this paper proposes an end-to-end approach tailored for image
translation that efficiently approximates this transformation with our novel
regularization methods. We further extend our approach to a group-wise form for
memory and time efficiency as well as image quality. Extensive qualitative and
quantitative experiments demonstrate that our proposed method is fast, both in
training and inference, and highly effective in reflecting the style of an
exemplar. Finally, our code is available at https://github.com/WonwoongCho/GDWCT.
Comment: CVPR 2019 (oral)
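For reference, the whitening-and-coloring transformation described in the abstract can be written out directly with an eigen-decomposition, as in the sketch below. This is the costly baseline operation that GDWCT approximates with its regularizers and group-wise form, not the paper's proposed method; the feature shapes are placeholders.

```python
# Hedged sketch of the plain whitening-and-coloring transformation (WCT).
import torch

def wct(content, style, eps=1e-5):
    """content, style: (C, H*W) feature matrices."""
    c = content - content.mean(dim=1, keepdim=True)
    s = style - style.mean(dim=1, keepdim=True)

    # Whitening: make the content covariance the identity.
    cov_c = c @ c.t() / (c.shape[1] - 1) + eps * torch.eye(c.shape[0])
    ec, vc = torch.linalg.eigh(cov_c)
    whitened = vc @ torch.diag(ec.clamp_min(eps).rsqrt()) @ vc.t() @ c

    # Coloring: impose the style covariance on the whitened feature.
    cov_s = s @ s.t() / (s.shape[1] - 1) + eps * torch.eye(s.shape[0])
    es, vs = torch.linalg.eigh(cov_s)
    colored = vs @ torch.diag(es.clamp_min(eps).sqrt()) @ vs.t() @ whitened
    return colored + style.mean(dim=1, keepdim=True)

content = torch.randn(64, 32 * 32)     # flattened content feature map (C x HW)
style = torch.randn(64, 32 * 32)       # flattened style feature map
out = wct(content, style)
print(out.shape)                       # (64, 1024), now carrying the style's covariance
```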