Toward Multimodal Image-to-Image Translation
Many image-to-image translation problems are ambiguous, as a single input
image may correspond to multiple possible outputs. In this work, we aim to
model a \emph{distribution} of possible outputs in a conditional generative
modeling setting. The ambiguity of the mapping is distilled into a
low-dimensional latent vector, which can be randomly sampled at test time. A
generator learns to map the given input, combined with this latent code, to the
output. We explicitly encourage the connection between output and the latent
code to be invertible. This helps prevent a many-to-one mapping from the latent
code to the output during training, also known as the problem of mode collapse,
and produces more diverse results. We explore several variants of this approach
by employing different training objectives, network architectures, and methods
of injecting the latent code. Our proposed method encourages bijective
consistency between the latent encoding and output modes. We present a
systematic comparison of our method and other variants on both perceptual
realism and diversity.
Comment: NIPS 2017 Final paper. v4 updated acknowledgment. Website: https://junyanz.github.io/BicycleGAN
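As an illustration of the invertibility constraint described above, here is a minimal PyTorch-style sketch of a latent-regression term: the generator maps (input, latent code) to an output, and an encoder maps the output back to a latent code. The modules G and E are hypothetical placeholders, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def latent_regression_loss(G, E, x, z_dim=8):
    """Encourage G(x, z) -> y_hat to be invertible back to z via E."""
    z = torch.randn(x.size(0), z_dim, device=x.device)  # sample a latent code
    y_hat = G(x, z)                                      # conditional generation
    z_hat = E(y_hat)                                     # recover the code from the output
    return F.l1_loss(z_hat, z)                           # penalize many-to-one (mode-collapsed) mappings
```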
MISO: Mutual Information Loss with Stochastic Style Representations for Multimodal Image-to-Image Translation
Unpaired multimodal image-to-image translation is a task of translating a
given image in a source domain into diverse images in the target domain,
overcoming the limitation of one-to-one mapping. Existing multimodal
translation models are mainly based on the disentangled representations with an
image reconstruction loss. We propose two approaches to improve multimodal
translation quality. First, we use a content representation from the source
domain conditioned on a style representation from the target domain. Second,
rather than using a typical image reconstruction loss, we design MILO (Mutual
Information LOss), a new stochastically-defined loss function based on
information theory. This loss function directly reflects the interpretation of
latent variables as random variables. We show that our proposed model, Mutual
Information with StOchastic Style Representation (MISO), achieves
state-of-the-art performance through extensive experiments on various
real-world datasets.
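The abstract does not give the exact form of MILO, so the sketch below shows a generic variational (InfoGAN-style) lower bound on the mutual information between a sampled style code and the translated image, which is one way such a stochastic loss can be realized; the generator G and posterior network Q are hypothetical placeholders, not the paper's architecture.

```python
import torch

def mi_style_loss(G, Q, content, style_dim=8):
    s = torch.randn(content.size(0), style_dim, device=content.device)  # stochastic style code
    y = G(content, s)                       # translate with the sampled style
    mu, log_var = Q(y)                      # q(s | y) modeled as a diagonal Gaussian
    # Negative Gaussian log-likelihood of the true style under q(s | y);
    # minimizing it maximizes a variational lower bound on I(s; y).
    nll = 0.5 * (log_var + (s - mu) ** 2 / log_var.exp()).sum(dim=1)
    return nll.mean()
```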
Diverse Image-to-Image Translation via Disentangled Representations
Image-to-image translation aims to learn the mapping between two visual
domains. There are two main challenges for many applications: 1) the lack of
aligned training pairs and 2) multiple possible outputs from a single input
image. In this work, we present an approach based on disentangled
representation for producing diverse outputs without paired training images. To
achieve diversity, we propose to embed images onto two spaces: a
domain-invariant content space capturing shared information across domains and
a domain-specific attribute space. Our model takes the encoded content features
extracted from a given input and the attribute vectors sampled from the
attribute space to produce diverse outputs at test time. To handle unpaired
training data, we introduce a novel cross-cycle consistency loss based on
disentangled representations. Qualitative results show that our model can
generate diverse and realistic images on a wide range of tasks without paired
training data. For quantitative comparisons, we measure realism with a user
study and diversity with a perceptual distance metric. We apply the proposed
model to domain adaptation and show competitive performance compared to the
state-of-the-art on the MNIST-M and LineMod datasets.
Comment: ECCV 2018 (Oral). Project page: http://vllab.ucmerced.edu/hylee/DRIT/
Code: https://github.com/HsinYingLee/DRIT
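A condensed sketch of a cross-cycle consistency term of the kind described above follows: attribute codes are swapped across domains, translated, and swapped back, which should recover the original images. The content encoders (Ec_A, Ec_B), attribute encoders (Ea_A, Ea_B), and generators (G_A, G_B) are placeholder modules, and the paper's exact formulation may differ.

```python
import torch.nn.functional as F

def cross_cycle_loss(x, y, Ec_A, Ec_B, Ea_A, Ea_B, G_A, G_B):
    # First translation: swap attribute codes across domains.
    u = G_B(Ec_A(x), Ea_B(y))        # x's content rendered with y's attribute (domain B)
    v = G_A(Ec_B(y), Ea_A(x))        # y's content rendered with x's attribute (domain A)
    # Second translation: swap back, which should reconstruct the originals.
    x_rec = G_A(Ec_B(u), Ea_A(v))
    y_rec = G_B(Ec_A(v), Ea_B(u))
    return F.l1_loss(x_rec, x) + F.l1_loss(y_rec, y)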
Bridging Dialogue Generation and Facial Expression Synthesis
Spoken dialogue systems that assist users to solve complex tasks such as
movie ticket booking have become an emerging research topic in the areas of
artificial intelligence and natural language processing. With a well-designed
dialogue system as an intelligent personal assistant, people can accomplish
certain tasks more easily via natural language interactions. Today there are
several virtual intelligent assistants on the market; however, most systems
focus only on a single modality, such as textual or vocal interaction. A
multimodal interface has various advantages: (1) it allows humans to communicate
with machines in a natural and concise form, using the mixture of modalities
that most precisely conveys the intention and satisfies the communication need,
and (2) it provides a more engaging experience through natural and human-like
feedback. This
paper explores a brand new research direction, which aims at bridging dialogue
generation and facial expression synthesis for better multimodal interaction.
The goal is to generate dialogue responses and simultaneously synthesize
corresponding visual expressions on faces, which is also an ultimate step
toward more human-like virtual assistants.
DRIT++: Diverse Image-to-Image Translation via Disentangled Representations
Image-to-image translation aims to learn the mapping between two visual
domains. There are two main challenges for this task: 1) lack of aligned
training pairs and 2) multiple possible outputs from a single input image. In
this work, we present an approach based on disentangled representation for
generating diverse outputs without paired training images. To synthesize
diverse outputs, we propose to embed images onto two spaces: a domain-invariant
content space capturing shared information across domains and a domain-specific
attribute space. Our model takes the encoded content features extracted from a
given input and attribute vectors sampled from the attribute space to
synthesize diverse outputs at test time. To handle unpaired training data, we
introduce a cross-cycle consistency loss based on disentangled representations.
Qualitative results show that our model can generate diverse and realistic
images on a wide range of tasks without paired training data. For quantitative
evaluations, we measure realism with a user study and the Fréchet inception
distance, and measure diversity with a perceptual distance metric, the
Jensen-Shannon divergence, and the number of statistically-different bins.
Comment: IJCV journal extension of the ECCV 2018 paper "Diverse Image-to-Image
Translation via Disentangled Representations" (arXiv:1808.00948). Project page:
http://vllab.ucmerced.edu/hylee/DRIT_pp/ Code: https://github.com/HsinYingLee/DRIT
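For the diversity measurement mentioned above, one common recipe is the mean pairwise perceptual (LPIPS) distance between outputs generated from the same input with different sampled attribute vectors. The sketch below assumes the third-party `lpips` package and is not part of the paper's released evaluation code.

```python
import itertools
import lpips  # pip install lpips

def diversity_score(samples):
    """samples: list of image tensors in [-1, 1], each of shape (1, 3, H, W)."""
    dist_fn = lpips.LPIPS(net='alex')
    dists = [dist_fn(a, b).item() for a, b in itertools.combinations(samples, 2)]
    return sum(dists) / len(dists)   # mean pairwise perceptual distance
```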
Harmonizing Maximum Likelihood with GANs for Multimodal Conditional Generation
Recent advances in conditional image generation tasks, such as image-to-image
translation and image inpainting, are largely attributed to the success of
conditional GAN models, which are often optimized by the joint use of the GAN
loss with the reconstruction loss. However, we reveal that this training recipe
shared by almost all existing methods causes one critical side effect: lack of
diversity in output samples. In order to accomplish both training stability and
multimodal output generation, we propose novel training schemes with a new set
of losses named moment reconstruction losses that simply replace the
reconstruction loss. We show that our approach is applicable to any conditional
generation tasks by performing thorough experiments on image-to-image
translation, super-resolution, and image inpainting using the Cityscapes and CelebA
datasets. Quantitative evaluations also confirm that our methods achieve great
diversity in outputs while retaining or even improving the visual fidelity of
generated samples.
Comment: Accepted as a conference paper at ICLR 2019
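As a rough illustration of the moment-matching idea, the sketch below draws several samples per input, estimates the conditional mean and variance, and scores the ground truth under that Gaussian instead of applying an L1/L2 reconstruction to a single sample; it approximates the spirit of the proposed losses and is not the paper's exact formulation. The generator G and latent dimension are placeholders.

```python
import torch

def moment_reconstruction_loss(G, x, y, z_dim=8, k=4, eps=1e-6):
    zs = torch.randn(k, x.size(0), z_dim, device=x.device)
    samples = torch.stack([G(x, z) for z in zs], dim=0)   # (k, B, C, H, W)
    mu = samples.mean(dim=0)                               # conditional mean estimate
    var = samples.var(dim=0, unbiased=False) + eps         # conditional variance estimate
    # Per-pixel Gaussian negative log-likelihood of the real image y.
    nll = 0.5 * (torch.log(var) + (y - mu) ** 2 / var)
    return nll.mean()
```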
iParaphrasing: Extracting Visually Grounded Paraphrases via an Image
A paraphrase is a restatement of the meaning of a text in other words.
Paraphrases have been studied to enhance the performance of many natural
language processing tasks. In this paper, we propose a novel task iParaphrasing
to extract visually grounded paraphrases (VGPs), which are different phrasal
expressions describing the same visual concept in an image. These extracted
VGPs have the potential to improve language and image multimodal tasks such as
visual question answering and image captioning. How to model the similarity
between VGPs is the key to iParaphrasing. We apply various existing methods as
well as propose a novel neural network-based method with image attention, and
report the results of the first attempt at iParaphrasing.
Comment: COLING 2018
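A generic sketch of the kind of similarity model described above follows: each phrase embedding attends over image region features, and the attended representations are compared. The shapes and the fusion scheme are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def vgp_similarity(phrase_a, phrase_b, regions):
    """phrase_*: (d,) phrase embeddings; regions: (num_regions, d) image region features."""
    def attend(q):
        weights = F.softmax(regions @ q, dim=0)   # attention weights over image regions
        ctx = weights @ regions                   # (d,) attended visual context
        return torch.cat([q, ctx])                # fuse phrase and visual context
    return F.cosine_similarity(attend(phrase_a), attend(phrase_b), dim=0)
```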
SingleGAN: Image-to-Image Translation by a Single-Generator Network using Multiple Generative Adversarial Learning
Image translation is a burgeoning field in computer vision where the goal is
to learn the mapping between an input image and an output image. However, most
recent methods require multiple generators for modeling different domain
mappings, which are inefficient and ineffective on some multi-domain image
translation tasks. In this paper, we propose a novel method, SingleGAN, to
perform multi-domain image-to-image translations with a single generator. We
introduce a domain code to explicitly control the different generative tasks
and integrate multiple optimization goals to ensure the translation.
Experimental results on several unpaired datasets show superior performance of
our model in translation between two domains. In addition, we explore variants of
SingleGAN for different tasks, including one-to-many domain translation,
many-to-many domain translation and one-to-one domain translation with
multimodality. The extended experiments show the universality and extensibility
of our model.
Comment: Accepted in ACCV 2018. Code is available at https://github.com/Xiaoming-Yu/SingleGAN
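One common way to realize a domain code of the kind described above is to tile a one-hot target-domain label spatially and concatenate it with the input image channels, as in the sketch below; the actual injection mechanism in SingleGAN may differ, and `backbone` is a placeholder network.

```python
import torch

def translate(backbone, x, domain_id, num_domains):
    """x: (B, C, H, W) input images; domain_id: (B,) target-domain indices."""
    code = torch.nn.functional.one_hot(domain_id, num_domains).float()   # (B, D) domain code
    code = code[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))   # tile to (B, D, H, W)
    return backbone(torch.cat([x, code], dim=1))                         # one generator, many domains
```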
Streetscape augmentation using generative adversarial networks: insights related to health and wellbeing
Deep learning using neural networks has provided advances in image style
transfer, merging the content of one image (e.g., a photo) with the style of
another (e.g., a painting). Our research shows this concept can be extended to
analyse the design of streetscapes in relation to health and wellbeing
outcomes. An Australian population health survey (n=34,000) was used to
identify the spatial distribution of health and wellbeing outcomes, including
general health and social capital. For each outcome, the most and least
desirable locations formed two domains. Streetscape design was sampled using
around 80,000 Google Street View images per domain. Generative adversarial
networks translated these images from one domain to the other, preserving the
main structure of the input image, but transforming the `style' from locations
where self-reported health was bad to locations where it was good. These
translations indicate that areas in Melbourne with good general health are
characterised by sufficient green space and compactness of the urban
environment, whilst streetscape imagery related to high social capital
contained more and wider footpaths, fewer fences and more grass. Beyond
identifying relationships, the method is a first step towards
computer-generated design interventions that have the potential to improve
population health and wellbeing.
Comment: 20 pages, 8 figures. Preprint accepted for publication in Sustainable Cities and Society
Exploring Models and Data for Remote Sensing Image Caption Generation
Inspired by recent developments in satellite technology, remote sensing images
have attracted extensive attention. Recently, noticeable progress has been made
in scene classification and target detection. However, it is still not clear how
to describe the remote sensing image content with accurate and concise
sentences. In this paper, we investigate how to describe remote sensing images
with accurate and flexible sentences. First, annotation instructions are
presented to better describe remote sensing images, considering their special
characteristics. Second, in order to exhaustively
exploit the contents of remote sensing images, a large-scale aerial image data
set is constructed for remote sensing image captioning. Finally, a comprehensive
review is presented on the proposed data set to fully advance the task of
remote sensing image captioning. Extensive experiments on the proposed data set
demonstrate that the content of a remote sensing image can be comprehensively
described by the generated language descriptions. The data set is available at
https://github.com/201528014227051/RSICD_optimal
Comment: 14 pages, 8 figures
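A compact encoder-decoder sketch of the caption-generation setup evaluated above: CNN image features initialize an LSTM that emits a word distribution at each step. Layer sizes and module names are illustrative assumptions, not the benchmark's reference code.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial hidden state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, captions):
        """image_feats: (B, feat_dim) CNN features; captions: (B, T) token ids."""
        h0 = torch.tanh(self.init_h(image_feats)).unsqueeze(0)   # (1, B, hidden)
        c0 = torch.zeros_like(h0)
        hidden, _ = self.lstm(self.embed(captions), (h0, c0))    # (B, T, hidden)
        return self.out(hidden)                                  # per-step vocabulary logits
```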