Towards Arbitrary Text-driven Image Manipulation via Space Alignment
Recent GAN inversion methods can successfully invert a real input image to a
corresponding editable latent code in StyleGAN. By combining inversion with the
vision-language model CLIP, several text-driven image manipulation methods have
been proposed. However, these methods incur extra cost, requiring optimization
for each image or each new attribute editing mode. To
achieve a more efficient editing method, we propose a new Text-driven image
Manipulation framework via Space Alignment (TMSA). The Space Alignment module
aims to align the same semantic regions in the CLIP and StyleGAN spaces. The
text input can then be mapped directly into the StyleGAN space and used to find
the semantic shift corresponding to the text description. The framework supports
arbitrary image editing modes without additional cost. Our work provides the
user with an interface to control the attributes of a given image according to
text input and get the result in real time. Extensive experiments demonstrate
our superior performance over prior works.
Comment: 8 pages, 12 figures
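The alignment idea lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch version: a small MLP maps CLIP embeddings into the StyleGAN W space, trained on paired (CLIP image embedding, inverted latent) data, after which a text pair defines an editing direction. The dimensions, loss, and editing rule are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a space-alignment mapper; module names,
# dimensions, loss, and the editing rule are assumptions for
# illustration, not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

CLIP_DIM, W_DIM = 512, 512  # assumed embedding sizes

class SpaceAligner(nn.Module):
    """Maps CLIP embeddings into the StyleGAN W space."""
    def __init__(self):
        super().__init__()
        self.map = nn.Sequential(
            nn.Linear(CLIP_DIM, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, W_DIM),
        )

    def forward(self, clip_emb):
        return self.map(clip_emb)

aligner = SpaceAligner()
opt = torch.optim.Adam(aligner.parameters(), lr=1e-4)

# Toy stand-ins for paired (CLIP image embedding, inverted W code) data.
clip_embs = torch.randn(256, CLIP_DIM)
w_codes = torch.randn(256, W_DIM)

for step in range(100):
    pred_w = aligner(clip_embs)
    loss = 1 - F.cosine_similarity(pred_w, w_codes).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

def edit(w, src_text_emb, tgt_text_emb, alpha=1.0):
    """Shift a latent along the direction implied by a text pair."""
    with torch.no_grad():
        direction = aligner(tgt_text_emb) - aligner(src_text_emb)
    return w + alpha * direction
```

Because the mapper is trained once, each subsequent edit reduces to a single forward pass, which is consistent with the real-time, no-per-edit-optimization claim above.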
SinFusion: Training Diffusion Models on a Single Image or Video
Diffusion models have exhibited tremendous progress in image and video
generation, exceeding GANs in quality and diversity. However, they are usually trained on
very large datasets and are not naturally adapted to manipulate a given input
image or video. In this paper we show how this can be resolved by training a
diffusion model on a single input image or video. Our image/video-specific
diffusion model (SinFusion) learns the appearance and dynamics of the single
image or video, while utilizing the conditioning capabilities of diffusion
models. It can solve a wide array of image/video-specific manipulation tasks.
In particular, our model can learn, from only a few frames, the motion and dynamics of a
single input video. It can then generate diverse new video samples of the same
dynamic scene, extrapolate short videos into long ones (both forward and
backward in time) and perform video upsampling. When trained on a single image,
our model shows comparable performance and capabilities to previous
single-image models in various image manipulation tasks.
Comment: Project Page: https://yanivnik.github.io/sinfusio
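To make the single-input training concrete, here is a toy sketch of training a denoising diffusion model on random crops of one image. The crop-based sampling, the tiny denoiser (which skips timestep conditioning for brevity), and the noise schedule are illustrative assumptions rather than SinFusion's actual design.

```python
# Toy sketch of training a diffusion model on crops of one image,
# in the spirit of SinFusion; all design choices here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1 - betas, dim=0)

denoiser = nn.Sequential(  # minimal stand-in for a UNet
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=2e-4)

image = torch.rand(1, 3, 128, 128) * 2 - 1  # the single training image

def random_crop(img, size=64):
    _, _, h, w = img.shape
    y = torch.randint(0, h - size + 1, (1,)).item()
    x = torch.randint(0, w - size + 1, (1,)).item()
    return img[:, :, y:y + size, x:x + size]

for step in range(200):
    x0 = random_crop(image)
    t = torch.randint(0, T, (1,))
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # forward process
    loss = F.mse_loss(denoiser(xt), noise)  # predict the added noise
    opt.zero_grad()
    loss.backward()
    opt.step()
```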
Scaling up GANs for Text-to-Image Synthesis
The recent success of text-to-image synthesis has taken the world by storm
and captured the general public's imagination. From a technical standpoint, it
also marked a drastic change in the favored architecture to design generative
image models. GANs used to be the de facto choice, with techniques like
StyleGAN. With DALL-E 2, auto-regressive and diffusion models became the new
standard for large-scale generative models overnight. This rapid shift raises a
fundamental question: can we scale up GANs to benefit from large datasets like
LAION? We find that naïvely increasing the capacity of the StyleGAN
architecture quickly becomes unstable. We introduce GigaGAN, a new GAN
architecture that far exceeds this limit, demonstrating GANs as a viable option
for text-to-image synthesis. GigaGAN offers three major advantages. First, it
is orders of magnitude faster at inference time, taking only 0.13 seconds to
synthesize a 512px image. Second, it can synthesize high-resolution images, for
example, 16-megapixel images in 3.66 seconds. Finally, GigaGAN supports various
latent space editing applications such as latent interpolation, style mixing,
and vector arithmetic operations.
Comment: CVPR 2023. Project webpage at https://mingukkang.github.io/GigaGAN
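The three editing applications listed above only require a generator with a layered latent space. Here is a toy sketch; `ToyGenerator` is a stand-in, not GigaGAN's actual API, and the layer counts are assumptions.

```python
# Toy sketch of latent interpolation, style mixing, and vector
# arithmetic against a generic style-based generator.
import torch
import torch.nn as nn

W_DIM, NUM_LAYERS = 512, 14  # assumed latent layout

class ToyGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.to_rgb = nn.Linear(W_DIM, 3 * 8 * 8)

    def forward(self, w_per_layer):  # (NUM_LAYERS, W_DIM)
        return self.to_rgb(w_per_layer.mean(0)).view(3, 8, 8)

G = ToyGenerator()
w_a = torch.randn(NUM_LAYERS, W_DIM)
w_b = torch.randn(NUM_LAYERS, W_DIM)

# 1) Latent interpolation: blend two codes.
img_mid = G(0.5 * w_a + 0.5 * w_b)

# 2) Style mixing: coarse layers from A, fine layers from B.
img_mix = G(torch.cat([w_a[:6], w_b[6:]], dim=0))

# 3) Vector arithmetic: push a code along a semantic direction.
direction = torch.randn(W_DIM)  # e.g. found via PCA or supervision
img_edit = G(w_a + 0.8 * direction)
```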
Unified Discrete Diffusion for Simultaneous Vision-Language Generation
The recently developed discrete diffusion models perform extraordinarily well
in the text-to-image task, showing significant promise for handling
multi-modality signals. In this work, we harness these traits and present a
unified multimodal generation model that can conduct both the "modality
translation" and "multi-modality generation" tasks using a single model,
performing text-based, image-based, and even vision-language simultaneous
generation. Specifically, we unify the discrete diffusion process for
multimodal signals by proposing a unified transition matrix. Moreover, we
design a mutual attention module with a fused embedding layer and a unified
objective function to emphasise the inter-modal linkages, which are vital for
multi-modality generation. Extensive experiments indicate that our proposed
method performs comparably to state-of-the-art solutions in various
generation tasks.
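Here is a hypothetical sketch of what a unified transition matrix over a joint text+image token vocabulary could look like, using a single absorbing [MASK] state shared by both modalities. The vocabulary sizes and masking schedule are assumptions, and the paper's actual matrix may differ.

```python
# Hypothetical unified transition matrix for discrete diffusion over a
# joint text+image vocabulary with a shared absorbing [MASK] state.
import torch

TEXT_V, IMG_V = 1000, 1024
V = TEXT_V + IMG_V + 1   # joint vocabulary plus one [MASK] token
MASK = V - 1
gamma = 0.1              # per-step probability of masking a token

# Q[i, j] = P(x_t = j | x_{t-1} = i): keep a token with prob 1 - gamma,
# otherwise jump to [MASK]; the [MASK] row itself is absorbing.
Q = torch.eye(V) * (1 - gamma)
Q[:, MASK] += gamma

def corrupt(tokens):
    """Apply one forward-diffusion step to a 1-D token sequence."""
    probs = Q[tokens]  # (seq_len, V), row-stochastic
    return torch.multinomial(probs, 1).squeeze(-1)

text = torch.randint(0, TEXT_V, (16,))
image = torch.randint(TEXT_V, TEXT_V + IMG_V, (64,))
x0 = torch.cat([text, image])  # one unified multimodal sequence
x1 = corrupt(x0)               # tokens of both modalities may -> [MASK]
```

Because both modalities share one vocabulary and one transition matrix, the same corruption and denoising machinery covers text-to-image, image-to-text, and simultaneous generation.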
Controllable Neural Synthesis for Natural Images and Vector Art
Neural image synthesis approaches have become increasingly popular over the last years due to their ability to generate photorealistic images useful for several applications, such as digital entertainment, mixed reality, synthetic dataset creation, and computer art, to name a few. Despite this progress, current approaches have two important limitations: (a) they often fail to capture long-range interactions in the image and, as a result, fail to generate scenes with complex dependencies between their different objects or parts; and (b) they often ignore the underlying 3D geometry of the shape/scene in the image and, as a result, frequently lose coherency and details.

My thesis proposes novel solutions to the above problems. First, I propose a neural transformer architecture that captures long-range interactions and context for image synthesis at high resolutions, leading to the synthesis of interesting phenomena in scenes, such as reflections of landscapes onto water or flora consistent with the rest of the landscape, which were not possible to generate reliably with previous ConvNet- and other transformer-based approaches. The key idea of the architecture is to sparsify the transformer's attention matrix at high resolutions, guided by dense attention extracted at lower image resolution. I present qualitative and quantitative results, along with user studies, demonstrating the effectiveness of the method and its superiority compared to the state-of-the-art.

Second, I propose a method that generates artistic images with the guidance of input 3D shapes. In contrast to previous methods, the use of a geometric representation of 3D shape enables the synthesis of more precise stylized drawings with fewer artifacts. My method outputs the synthesized images in a vector representation, enabling richer downstream analysis or editing in interactive applications. I also show that the method produces substantially better results than existing image-based methods, both in terms of predicting artists' drawings and in user evaluation of results.
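To illustrate the guided-sparsification idea from the first contribution, here is a toy sketch in which a dense attention map computed at low resolution is upsampled and used to select, per query, a small set of high-resolution positions to attend to. The shapes and the top-k selection rule are assumptions for illustration, not the thesis' exact architecture.

```python
# Toy sketch: dense low-resolution attention guides which high-
# resolution pairs are kept in a sparse attention pattern.
import torch
import torch.nn.functional as F

lo, hi, d = 16, 64, 32  # low-res / high-res token counts, feature dim

def dense_attn_map(q, k):
    return torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)

q_lo, k_lo = torch.randn(lo, d), torch.randn(lo, d)
attn_lo = dense_attn_map(q_lo, k_lo)  # (lo, lo) dense map, cheap

# Upsample the low-res map to the high-res grid and keep only the
# strongest k entries per query as the sparse attention pattern.
attn_up = F.interpolate(attn_lo[None, None], size=(hi, hi),
                        mode="bilinear", align_corners=False)[0, 0]
k_keep = 8
topk = attn_up.topk(k_keep, dim=-1).indices  # (hi, k_keep)
mask = torch.zeros(hi, hi, dtype=torch.bool).scatter_(1, topk, True)

q_hi, k_hi, v_hi = (torch.randn(hi, d) for _ in range(3))
scores = q_hi @ k_hi.T / d ** 0.5
scores = scores.masked_fill(~mask, float("-inf"))  # prune most pairs
out = torch.softmax(scores, dim=-1) @ v_hi  # sparse high-res attention
```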
LoopDraw: a Loop-Based Autoregressive Model for Shape Synthesis and Editing
There is no settled universal 3D representation for geometry; many
alternatives exist, such as point clouds, meshes, implicit functions, and
voxels, to name a few. In this work, we present a new, compelling alternative for
representing shapes using a sequence of cross-sectional closed loops. The loops
across all planes form an organizational hierarchy which we leverage for
autoregressive shape synthesis and editing. Loops are a non-local description
of the underlying shape, as simple loop manipulations (such as shifts) result
in significant structural changes to the geometry. This is in contrast to
manipulating local primitives, such as points in a point cloud or triangles in
a triangle mesh. We further demonstrate that loops are an intuitive and natural
primitive for analyzing and editing shapes, both computationally and for users.
- …
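To make the loop representation concrete, here is a small, hypothetical sketch: a shape is stored as a stack of closed 2D loops on parallel cutting planes, and shifting a single loop produces a structural, non-local change. The data layout and the cylinder example are illustrative assumptions, not LoopDraw's actual format.

```python
# Hypothetical cross-sectional loop representation: a shape as a
# stack of closed 2-D loops on parallel cutting planes.
import math
import torch

N_PLANES, N_PTS = 8, 32

def circle_loop(radius, n=N_PTS):
    t = torch.linspace(0, 2 * math.pi, n + 1)[:-1]
    return torch.stack([radius * t.cos(), radius * t.sin()], dim=-1)

# A cylinder: identical circular loops stacked along z.
heights = torch.linspace(0.0, 1.0, N_PLANES)
loops = torch.stack([circle_loop(0.5) for _ in heights])  # (planes, pts, 2)

# A simple non-local edit: shifting one loop in-plane bends the whole
# shape, a structural change a single point or triangle edit cannot make.
edited = loops.clone()
edited[4] += torch.tensor([0.3, 0.0])

def to_point_cloud(loops, heights):
    """Flatten the loops back into 3-D points for downstream use."""
    z = heights.view(-1, 1, 1).expand(-1, loops.shape[1], 1)
    return torch.cat([loops, z], dim=-1).reshape(-1, 3)

pts = to_point_cloud(edited, heights)  # (planes * pts, 3)
```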