119 research outputs found
Text-Guided Neural Image Inpainting
The image inpainting task requires filling a corrupted image with content
coherent with the context. This research field has achieved promising progress
through neural image inpainting methods. Nevertheless, there is still a
critical challenge in guessing the missing content from the context pixels alone.
The goal of this paper is to fill in the semantic information in corrupted images
according to the provided descriptive text. Unlike existing text-guided
image generation works, inpainting models are required to compare the
semantic content of the given text and the remaining part of the image, then
find out the semantic content that should be filled in for the missing part. To
fulfill such a task, we propose a novel inpainting model named Text-Guided Dual
Attention Inpainting Network (TDANet). Firstly, a dual multimodal attention
mechanism is designed to extract the explicit semantic information about the
corrupted regions, which is done by comparing the descriptive text and
complementary image areas through reciprocal attention. Secondly, an image-text
matching loss is applied to maximize the semantic similarity of the generated
image and the text. Experiments are conducted on two open datasets. Results
show that the proposed TDANet model reaches a new state of the art on both
quantitative and qualitative measures. Result analysis suggests that the
generated images are consistent with the guidance text, enabling the generation
of various results by providing different descriptions. Code is available at
https://github.com/idealwhite/TDANet
Comment: ACM MM'2020 (Oral). 9 pages, 4 tables, 7 figures
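The reciprocal attention described in the TDANet abstract can be illustrated with plain scaled dot-product attention applied in both directions. This is a minimal sketch under stated assumptions, not the TDANet implementation: the function names `softmax` and `attend`, the toy 2-D features, and the use of unlearned dot-product scores are all hypothetical choices made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention: each query gets a weighted mix of values."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs

# Hypothetical toy features: 3 word embeddings and 2 visible-region embeddings.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
regions = [[1.0, 0.0], [0.5, 0.5]]

# Attention in both directions: words attend to regions, regions attend to words.
word_context = attend(words, regions, regions)
region_context = attend(regions, words, words)
```

Each row of `word_context` is a convex combination of the region features, and each row of `region_context` is a convex combination of the word features, so the two outputs can be compared to see which words are poorly supported by the visible image areas.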
Hierarchical Fashion Design with Multi-stage Diffusion Models
Cross-modal fashion synthesis and editing offer intelligent support to
fashion designers by enabling the automatic generation and local modification
of design drafts. While current diffusion models demonstrate commendable
stability and controllability in image synthesis, they still face significant
challenges in generating fashion designs from abstract design elements and
in fine-grained editing. Abstract sensory expressions, e.g., office, business, and
party, form the high-level design concepts, while measurable aspects like
sleeve length, collar type, and pant length are considered the low-level
attributes of clothing. Controlling and editing fashion images using lengthy
text descriptions is difficult. In this paper, we propose HieraFashDiff, a
novel fashion design method using a shared multi-stage diffusion model
encompassing high-level design concepts and low-level clothing attributes in a
hierarchical structure. Specifically, we categorize the input text into
different levels and feed them to the diffusion model at different time steps
according to the criteria of professional clothing designers. HieraFashDiff
allows designers to add low-level attributes after high-level prompts for
incremental interactive editing. In addition, we design a differentiable loss
function in the sampling process with a mask to preserve non-edited
areas. Comprehensive experiments performed on our newly constructed hierarchical
fashion dataset demonstrate that our proposed method outperforms other
state-of-the-art competitors.
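The idea of feeding different text levels at different time steps can be sketched as a simple prompt schedule. This is an illustrative assumption rather than the HieraFashDiff procedure: the function `prompt_for_step`, the `switch_fraction` threshold, the ordering (concepts first, attributes later), and the example prompts are all hypothetical.

```python
def prompt_for_step(step, total_steps, high_level, low_level, switch_fraction=0.5):
    """Return the text condition for one denoising step.

    Steps count from 0 (noisiest) to total_steps - 1 (cleanest).
    switch_fraction is an assumed hyperparameter, not a value from the paper.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    if step < switch_fraction * total_steps:
        # Early, noisy steps: condition only on the abstract design concept.
        return high_level
    # Later steps: add the measurable clothing attributes.
    return f"{high_level}, {low_level}"

# Hypothetical prompts for a 4-step sampling run.
schedule = [
    prompt_for_step(s, 4, "office style", "long sleeves, stand collar")
    for s in range(4)
]
```

With these inputs the first half of the schedule carries only the high-level concept and the second half carries both levels, which mirrors adding low-level attributes after high-level prompts.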
3D GANs and Latent Space: A comprehensive survey
Generative Adversarial Networks (GANs) have emerged as a significant player
in generative modeling by mapping lower-dimensional random noise to
higher-dimensional spaces. These networks have been used to generate
high-resolution images and 3D objects. The efficient modeling of 3D objects and
human faces is crucial in the development process of 3D graphical environments
such as games or simulations. 3D GANs are a new type of generative model used
for 3D reconstruction, point cloud reconstruction, and 3D semantic scene
completion. The choice of distribution for noise is critical as it represents
the latent space. Understanding a GAN's latent space is essential for
fine-tuning the generated samples, as demonstrated by the morphing of
semantically meaningful parts of images. In this work, we explore the latent
space and 3D GANs, examine several GAN variants and training methods to gain
insights into improving 3D GAN training, and suggest potential future
directions for further research.
CLIP2GAN: Towards Bridging Text with the Latent Space of GANs
In this work, we address text-guided image generation and propose a
novel framework, CLIP2GAN, that leverages the CLIP model and StyleGAN. The key
idea of our CLIP2GAN is to bridge the output feature embedding space of CLIP
and the input latent space of StyleGAN, which is realized by introducing a
mapping network. In the training stage, we encode an image with CLIP and map
the output feature to a latent code, which is further used to reconstruct the
image. In this way, the mapping network is optimized in a self-supervised
manner. In the inference stage, since CLIP can embed both image and text
into a shared feature embedding space, we replace the CLIP image encoder in the
training architecture with the CLIP text encoder, while keeping the subsequent
mapping network as well as the StyleGAN model. As a result, we can flexibly input a
text description to generate an image. Moreover, by simply adding the mapped text
features of an attribute to a mapped CLIP image feature, we can effectively
edit that attribute in the image. Extensive experiments demonstrate the superior
performance of our proposed CLIP2GAN compared to previous methods.
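The attribute edit described in the CLIP2GAN abstract, adding a mapped text feature to a mapped image feature, amounts to vector addition in the latent space. The sketch below assumes a toy linear mapping network and made-up 2-D features; `apply_mapping`, `edit_latent`, and the `strength` parameter are hypothetical names, and no real CLIP or StyleGAN model is involved.

```python
def apply_mapping(weights, feature):
    """Hypothetical linear mapping network: one matrix-vector product."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

def edit_latent(image_latent, attribute_latent, strength=1.0):
    """Add a mapped attribute direction to a mapped image latent."""
    return [z + strength * a for z, a in zip(image_latent, attribute_latent)]

# Toy stand-ins for CLIP features (2-D) and a mapping into a 3-D latent space.
mapping = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_feature = [0.5, -0.5]
text_feature = [0.0, 1.0]  # stands in for the embedding of an attribute phrase

image_latent = apply_mapping(mapping, image_feature)
attribute_latent = apply_mapping(mapping, text_feature)
edited = edit_latent(image_latent, attribute_latent, strength=0.5)
```

The `strength` scale is an added convenience for controlling how far the latent moves along the attribute direction; the abstract itself only states that the mapped features are added.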
- …