CLIPAG: Towards Generator-Free Text-to-Image Generation
Perceptually Aligned Gradients (PAG) refer to an intriguing property observed
in robust image classification models, wherein their input gradients align with
human perception and pose semantic meanings. While this phenomenon has gained
significant research attention, it was solely studied in the context of
unimodal vision-only architectures. In this work, we extend the study of PAG to
Vision-Language architectures, which form the foundations for diverse
image-text tasks and applications. Through an adversarial robustification
finetuning of CLIP, we demonstrate that robust Vision-Language models exhibit
PAG in contrast to their vanilla counterparts. This work reveals the merits of
CLIP with PAG (CLIPAG) in several vision-language generative tasks. Notably, we
show that seamlessly integrating CLIPAG in a "plug-n-play" manner leads to
substantial improvements in vision-language generative applications.
Furthermore, leveraging its PAG property, CLIPAG enables text-to-image
generation without any generative model, which typically requires huge
generators.
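
The abstract does not spell out the generation procedure. One plausible reading of "generation without any generative model" is direct gradient ascent on the pixels to increase image-text similarity. The sketch below illustrates only that loop, under loud assumptions: the "encoder" `W` is a hypothetical fixed linear map, not CLIP or CLIPAG, and `text_emb`, `x0`, `lr`, and `steps` are made-up illustrative values. It does not demonstrate PAG itself.

```python
import math

# Hypothetical stand-in for an image encoder: a fixed linear map with
# orthonormal rows (4 "pixels" -> 2-dimensional embedding). This is NOT CLIP.
W = [[0.5, 0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5, -0.5]]

def encode(x):
    """Apply the toy linear encoder to a flat list of pixel values."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def similarity_grad(x, text_emb):
    """Gradient of cosine(encode(x), text_emb) with respect to the pixels x."""
    u = encode(x)
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in text_emb))
    c = cosine(u, text_emb)
    # Derivative of the cosine with respect to the embedding u ...
    g_u = [t / (nu * nv) - c * a / (nu * nu) for a, t in zip(u, text_emb)]
    # ... then the chain rule through the linear encoder (multiply by W transposed).
    return [sum(W[i][j] * g_u[i] for i in range(len(W))) for j in range(len(x))]

def synthesize(x0, text_emb, steps=50, lr=0.5):
    """Plain gradient ascent on the pixels; no generator network is involved."""
    x = list(x0)
    history = [cosine(encode(x), text_emb)]
    for _ in range(steps):
        g = similarity_grad(x, text_emb)
        x = [xi + lr * gi for xi, gi in zip(x, g)]
        history.append(cosine(encode(x), text_emb))
    return x, history

text_emb = [1.0, 0.0]            # made-up "text" embedding
x0 = [0.5, -0.5, 0.5, -0.5]      # starting image whose embedding is orthogonal to text_emb
image, history = synthesize(x0, text_emb)
print(round(history[0], 3), round(history[-1], 3))
```

With a real vision-language model, the gradient would presumably come from automatic differentiation rather than a hand-written formula. Whether the resulting pixels look meaningful would then depend on the perceptual alignment of those gradients, which is the property the abstract attributes to the robust model.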
Which Models have Perceptually-Aligned Gradients? An Explanation via Off-Manifold Robustness
One of the remarkable properties of robust computer vision models is that
their input-gradients are often aligned with human perception, referred to in
the literature as perceptually-aligned gradients (PAGs). Despite only being
trained for classification, PAGs cause robust models to have rudimentary
generative capabilities, including image generation, denoising, and
in-painting. However, the underlying mechanisms behind these phenomena remain
unknown. In this work, we provide a first explanation of PAGs via
off-manifold robustness, which states that models must be more robust
off the data manifold than they are on-manifold. We first demonstrate
theoretically that off-manifold robustness leads input gradients to lie
approximately on the data manifold, explaining their perceptual alignment. We
then show that Bayes optimal models satisfy off-manifold robustness, and
confirm the same empirically for robust models trained via gradient norm
regularization, noise augmentation, and randomized smoothing. Quantifying the
perceptual alignment of model gradients via their similarity with the gradients
of generative models, we show that off-manifold robustness correlates well with
perceptual alignment. Finally, based on the levels of on- and off-manifold
robustness, we identify three different regimes of robustness that affect both
perceptual alignment and model accuracy: weak robustness, Bayes-aligned
robustness, and excessive robustness.
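
The claim that off-manifold robustness pushes input gradients onto the data manifold can be illustrated with a toy decomposition. The sketch below assumes a flat "manifold" spanned by a known orthonormal basis and two made-up gradient vectors. The function names and the norm-fraction measure are hypothetical illustrations, not the paper's metric, which compares model gradients with those of generative models.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(g, basis):
    """Project g onto the span of an orthonormal basis (the toy 'manifold')."""
    proj = [0.0] * len(g)
    for b in basis:
        coeff = dot(g, b)
        proj = [p + coeff * bi for p, bi in zip(proj, b)]
    return proj

def on_manifold_fraction(g, basis):
    """Share of the gradient norm lying along the manifold's tangent directions."""
    proj = project(g, basis)
    return math.sqrt(dot(proj, proj)) / math.sqrt(dot(g, g))

# Assumed tangent directions of a flat 2-D data manifold embedded in 3-D input space.
tangent_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Made-up input gradients: the first model is insensitive to the off-manifold
# direction (third coordinate); the second is highly sensitive to it.
grad_off_manifold_robust = [3.0, 4.0, 0.0]
grad_non_robust = [3.0, 4.0, 12.0]

frac_robust = on_manifold_fraction(grad_off_manifold_robust, tangent_basis)
frac_non_robust = on_manifold_fraction(grad_non_robust, tangent_basis)
print(frac_robust, frac_non_robust)
```

In this toy setting, a model that barely reacts to off-manifold perturbations has a gradient lying almost entirely within the tangent directions, while the second gradient is dominated by its off-manifold component.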
Inverting Adversarially Robust Networks for Image Synthesis
Recent research in adversarially robust classifiers suggests their
representations tend to be aligned with human perception, which makes them
attractive for image synthesis and restoration applications. Despite favorable
empirical results on a few downstream tasks, their advantages are limited to
slow and sensitive optimization-based techniques. Moreover, their use on
generative models remains unexplored. This work proposes the use of robust
representations as a perceptual primitive for feature inversion models, and
shows its benefits over standard non-robust image features. We
empirically show that adopting robust representations as an image prior
significantly improves the reconstruction accuracy of CNN-based feature
inversion models. Furthermore, it allows reconstructing images at multiple
scales out-of-the-box. Following these findings, we propose an
encoding-decoding network based on robust representations and show its
advantages for applications such as anomaly detection, style transfer and image
denoising.
MAGIC: Mask-Guided Image Synthesis by Inverting a Quasi-Robust Classifier
We offer a method for one-shot mask-guided image synthesis that allows
controlling manipulations of a single image by inverting a quasi-robust
classifier equipped with strong regularizers. Our proposed method, entitled
MAGIC, leverages structured gradients from a pre-trained quasi-robust
classifier to better preserve the input semantics while preserving its
classification accuracy, thereby guaranteeing credibility in the synthesis.
Unlike current methods that use complex primitives to supervise the process or
use attention maps as a weak supervisory signal, MAGIC aggregates gradients
over the input, driven by a guide binary mask that enforces a strong, spatial
prior. MAGIC implements a series of manipulations within a single framework,
achieving shape and location control, intense non-rigid shape deformations, and
copy/move operations in the presence of repeating objects, and gives users firm
control over the synthesis by requiring them only to specify binary guide masks.
Our study and findings are supported by various qualitative comparisons with
the state-of-the-art on the same images sampled from ImageNet and quantitative
analysis using machine perception along with a user survey of 100+ participants
that endorse our synthesis quality. Project page at
https://mozhdehrouhsedaghat.github.io/magic.html. Code is available at
https://github.com/mozhdehrouhsedaghat/magic
Comment: Accepted to the Thirty-Seventh Conference on Artificial Intelligence
(AAAI) 2023 - 12 pages, 9 figures