Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images
Stable diffusion, a generative model used in text-to-image synthesis,
frequently encounters resolution-induced composition problems when generating
images of varying sizes. This issue primarily stems from the model being
trained on pairs of single-scale images and their corresponding text
descriptions. Moreover, direct training on images of unlimited sizes is
unfeasible, as it would require an immense number of text-image pairs and
entail substantial computational expenses. To overcome these challenges, we
propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to
efficiently generate well-composed images of any size, while minimizing the
need for high-memory GPU resources. Specifically, the initial stage, dubbed Any
Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a
restricted range of ratios to optimize the text-conditional diffusion model,
thereby improving its ability to adjust composition to accommodate diverse
image sizes. To support the creation of images at any desired size, we further
introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the
subsequent stage. This method allows for the rapid enlargement of the ASD
output to any high-resolution size, avoiding seaming artifacts or memory
overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks
demonstrate that ASD can produce well-structured images of arbitrary sizes,
cutting down the inference time by 2x compared to the traditional tiled
algorithm.
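
For orientation, tiled upscaling schemes generally split the target canvas into overlapping tiles so that each one fits in GPU memory. The sketch below only computes such a tile layout. The function names, the 512-pixel tile size, and the 64-pixel overlap are assumptions chosen for illustration. It is not the FSTD procedure itself, whose details the abstract does not give.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of overlapping tiles that together cover [0, length)."""
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the far border
    return starts


def tile_boxes(width, height, tile=512, overlap=64):
    """(left, top, right, bottom) boxes of overlapping tiles over an image."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_starts(height, tile, overlap)
        for x in tile_starts(width, tile, overlap)
    ]


if __name__ == "__main__":
    for box in tile_boxes(1024, 768):
        print(box)
```

A real pipeline would process each box separately and blend the overlapping strips, which is where seam artifacts are normally handled.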
Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators
Diffusion models are capable of generating impressive images conditioned on
text descriptions, and extensions of these models allow users to edit images at
a relatively coarse scale. However, the ability to precisely edit the layout,
position, pose, and shape of objects in images with diffusion models is still
difficult. To this end, we propose motion guidance, a zero-shot technique that
allows a user to specify dense, complex motion fields that indicate where each
pixel in an image should move. Motion guidance works by steering the diffusion
sampling process with the gradients through an off-the-shelf optical flow
network. Specifically, we design a guidance loss that encourages the sample to
have the desired motion, as estimated by a flow network, while also being
visually similar to the source image. By simultaneously sampling from a
diffusion model and guiding the sample to have low guidance loss, we can obtain
a motion-edited image. We demonstrate that our technique works on complex
motions and produces high-quality edits of real and generated images.
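
The guidance loss described above combines two goals: the estimated motion should match the requested motion, and the sample should stay visually close to the source. A minimal sketch of that combination follows. The flat-list representation, the mean-absolute-difference terms, and the `flow_weight` and `color_weight` parameters are all assumptions for illustration. In the actual method the flow comes from an optical flow network and gradients are propagated through it, which this toy version does not model.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equal-length flat sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def guidance_loss(estimated_flow, target_flow, sample, source,
                  flow_weight=1.0, color_weight=1.0):
    """Weighted sum of a motion term and an appearance term (hypothetical form)."""
    flow_term = mean_abs_diff(estimated_flow, target_flow)   # desired motion
    color_term = mean_abs_diff(sample, source)               # visual similarity
    return flow_weight * flow_term + color_weight * color_term


if __name__ == "__main__":
    print(guidance_loss([0.0, 0.0], [1.0, 1.0], [0.2, 0.4], [0.2, 0.5]))
```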
Generative Prior for Unsupervised Image Restoration
The challenge of restoring real-world low-quality images is due to a lack of appropriate training data and difficulty in determining how the image was degraded. Recently, generative models have demonstrated great potential for creating high-quality images by utilizing the rich and diverse information contained within the model’s trained weights and learned latent representations. One popular type of generative model is the generative adversarial network (GAN). Many new methods have been developed to harness the information found in GANs for image manipulation. Our proposed approach is to utilize generative models for both understanding the degradation of an image and restoring it. We propose using a combination of cycle consistency losses and self-attention to enhance face images by first learning the degradation and then using this information to train a style-based neural network. We also aim to use the latent representation to achieve a high level of magnification for face images (×64). By incorporating the weights of a pre-trained StyleGAN into a restoration network with a vision transformer layer, we hope to improve the current state of the art in face image restoration. Finally, we present a projection-based image-denoising algorithm named Noise2Code in the latent space of the VQGAN model with a fixed-point regularization strategy. The fixed-point condition follows the observation that the pre-trained VQGAN affects the clean and noisy images in a drastically different way. Unlike previous projection-based image restoration in the latent space, both the denoising network and VQGAN model parameters are jointly trained, although the latter is not needed during testing. We report experimental results to demonstrate that the proposed Noise2Code approach is conceptually simple, computationally efficient, and generalizable to real-world degradation scenarios.
Quilt-1M: One Million Image-Text Pairs for Histopathology
Recent accelerations in multi-modal applications have been made possible with
the plethora of image and text data available online. However, the scarcity of
analogous data in the medical field, specifically in histopathology, has halted
comparable progress. To enable similar representation learning for
histopathology, we turn to YouTube, an untapped resource of videos, offering
hours of valuable educational histopathology videos from expert
clinicians. From YouTube, we curate Quilt: a large-scale vision-language
dataset consisting of image and text pairs. Quilt was automatically
curated using a mixture of models, including large language models, handcrafted
algorithms, human knowledge databases, and automatic speech recognition. In
comparison, the most comprehensive datasets curated for histopathology amass
only around K samples. We combine Quilt with datasets from other sources,
including Twitter, research papers, and the internet in general, to create an
even larger dataset: Quilt-1M, with 1M paired image-text samples, marking it
as the largest vision-language histopathology dataset to date. We demonstrate
the value of Quilt-1M by fine-tuning a pre-trained CLIP model. Our model
outperforms state-of-the-art models on both zero-shot and linear probing tasks
for classifying new histopathology images across diverse patch-level
datasets of different sub-pathologies and cross-modal retrieval tasks.
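
Zero-shot classification with a CLIP-style model is usually done by comparing an image embedding against text embeddings of candidate labels and picking the closest one. The sketch below illustrates that step only. The embeddings are hand-written toy vectors and the label names are hypothetical. A real setup would obtain both from the fine-tuned image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is most similar to the image."""
    return max(
        label_embeddings,
        key=lambda label: cosine(image_embedding, label_embeddings[label]),
    )


if __name__ == "__main__":
    # Toy vectors standing in for encoder outputs (assumed, not real data).
    labels = {"tumor": [0.9, 0.0], "normal": [0.0, 1.0]}
    print(zero_shot_classify([1.0, 0.1], labels))
```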
State of the Art on Diffusion Models for Visual Computing
The field of visual computing is rapidly advancing due to the emergence of
generative artificial intelligence (AI), which unlocks unprecedented
capabilities for the generation, editing, and reconstruction of images, videos,
and 3D scenes. In these domains, diffusion models are the generative AI
architecture of choice. Within the last year alone, the literature on
diffusion-based tools and applications has seen exponential growth and relevant
papers are published across the computer graphics, computer vision, and AI
communities with new works appearing daily on arXiv. This rapid growth of the
field makes it difficult to keep up with all recent developments. The goal of
this state-of-the-art report (STAR) is to introduce the basic mathematical
concepts of diffusion models, implementation details and design choices of the
popular Stable Diffusion model, as well as to give an overview of important aspects of these
generative AI tools, including personalization, conditioning, and inversion, among
others. Moreover, we give a comprehensive overview of the rapidly growing
literature on diffusion-based generation and editing, categorized by the type
of generated medium, including 2D images, videos, 3D objects, locomotion, and
4D scenes. Finally, we discuss available datasets, metrics, open challenges,
and social implications. This STAR provides an intuitive starting point to
explore this exciting topic for researchers, artists, and practitioners alike.
Advanced Image Acquisition, Processing Techniques and Applications
"Advanced Image Acquisition, Processing Techniques and Applications" is the first book of a series that provides image processing principles and practical software implementation on a broad range of applications. The book integrates material from leading researchers on Applied Digital Image Acquisition and Processing. An important feature of the book is its emphasis on software tools and scientific computing in order to enhance results and arrive at problem solution
Face Hallucination via Deep Neural Networks.
We first address aligned low-resolution (LR) face images (i.e., 16×16 pixels) by designing a discriminative generative network, named URDGN. URDGN is composed of two networks: a generative model and a discriminative model.
We introduce a pixel-wise L2 regularization term to the generative model and exploit the feedback of the discriminative network to make the upsampled face images more similar to real ones.
We present an end-to-end transformative discriminative neural network (TDN) devised for super-resolving unaligned tiny face images. TDN embeds spatial transformation layers to enforce local receptive fields to line-up with similar spatial supports. To upsample noisy unaligned LR face images, we propose decoder-encoder-decoder networks. A transformative discriminative decoder network is employed to upsample and denoise LR inputs simultaneously. Then we project the intermediate HR faces to aligned and noise-free LR faces by a transformative encoder network. Finally, high-quality hallucinated HR images are generated by our second decoder. Furthermore, we present an end-to-end multiscale transformative discriminative neural network (MTDN) to super-resolve unaligned LR face images of different resolutions in a unified framework.
We propose a method that explicitly incorporates structural information of faces into the face super-resolution process by using a multi-task convolutional neural network (CNN). Our method uses not only low-level information (i.e., intensity similarity) but also middle-level information (i.e., face structure) to further explore spatial constraints of facial components from LR input images.
We demonstrate that supplementing residual images or feature maps with additional facial attribute information can significantly reduce the ambiguity in face super-resolution. To explore this idea, we develop an attribute-embedded upsampling network. In this manner, our method is able to super-resolve LR faces by a large upscaling factor while reducing the uncertainty of one-to-many mappings remarkably.
We further push the boundaries of hallucinating a tiny, non-frontal face image to understand how much of this is possible by leveraging the availability of large datasets and deep networks. To this end, we introduce a novel Transformative Adversarial Neural Network (TANN) to jointly frontalize very LR out-of-plane rotated face images (including profile views) and aggressively super-resolve them by 8×, regardless of their original poses and without using any 3D information. Besides recovering an HR face image from an LR version, this thesis also addresses the task of restoring realistic faces from stylized portrait images, which can also be regarded as face hallucination.
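
The URDGN description earlier in this abstract pairs a pixel-wise L2 term with feedback from a discriminator. One common way to write such a generator objective is sketched below. The `-log D` adversarial form, the `adv_weight` value of 0.01, and the flat-list image representation are assumptions for illustration, since the abstract does not state the exact formulation.

```python
import math


def pixel_l2(upsampled, target):
    """Mean squared per-pixel error between two equal-length flat images."""
    if len(upsampled) != len(target):
        raise ValueError("images must have the same number of pixels")
    return sum((u - t) ** 2 for u, t in zip(upsampled, target)) / len(target)


def generator_loss(upsampled, target, disc_prob_real, adv_weight=0.01):
    """Pixel-wise L2 plus a weighted adversarial term (hypothetical form).

    disc_prob_real is the discriminator's probability that the upsampled
    face is real. The loss shrinks as that probability approaches 1.
    """
    eps = 1e-12  # guards against log(0)
    adversarial = -math.log(max(disc_prob_real, eps))
    return pixel_l2(upsampled, target) + adv_weight * adversarial


if __name__ == "__main__":
    print(generator_loss([0.2, 0.8], [0.0, 1.0], disc_prob_real=0.3))
```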