75 research outputs found
Image Aesthetics Assessment Using Composite Features from off-the-Shelf Deep Models
Deep convolutional neural networks have recently achieved great success on
the image aesthetics assessment task. In this paper, we propose an efficient
method that takes the global, local, and scene-aware information of an image
into consideration and classifies the composite features extracted from the
corresponding pretrained deep models with a support vector machine. Contrary
to popular methods that require fine-tuning or
training a new model from scratch, our training-free method directly takes the
deep features generated by off-the-shelf models for image classification and
scene recognition. We also analyze two factors that could influence
performance: the architecture of the deep neural network and the contribution
of local and scene-aware information. We find that deep residual networks
produce more aesthetics-aware image representations and that composite
features improve overall performance. Experiments
on common large-scale aesthetics assessment benchmarks demonstrate that our
method outperforms state-of-the-art results in photo aesthetics assessment.
Comment: Accepted by ICIP 201
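The training-free pipeline described above can be sketched as follows. This is a hypothetical illustration, not the authors' code: the feature vectors and SVM weights are invented toy values, whereas in practice the features would come from pretrained classification and scene-recognition CNNs and the weights from a fitted SVM.

```python
# Sketch of the training-free pipeline: concatenate features from
# off-the-shelf models, then classify with a linear SVM decision rule.

def composite_features(global_f, local_f, scene_f):
    """Concatenate global, local, and scene-aware feature vectors."""
    return global_f + local_f + scene_f

def linear_svm_decision(features, weights, bias):
    """Sign of w.x + b separates high- from low-aesthetics images."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1

# Toy 2-dim features per source (real deep features are ~2048-dim).
g, l, s = [0.9, 0.1], [0.4, 0.6], [0.7, 0.2]
x = composite_features(g, l, s)          # 6-dim composite vector
label = linear_svm_decision(x, [1.0, -0.5, 0.3, 0.2, 0.8, -0.1], -0.4)
```

The design point is that only the final linear classifier is fitted; the deep feature extractors are used as-is.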
Controlling Text-to-Image Diffusion by Orthogonal Finetuning
Large text-to-image diffusion models have impressive capabilities in
generating photorealistic images from text prompts. How to effectively guide or
control these powerful models to perform different downstream tasks becomes an
important open problem. To tackle this challenge, we introduce a principled
finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image
diffusion models to downstream tasks. Unlike existing methods, OFT can provably
preserve hyperspherical energy which characterizes the pairwise neuron
relationship on the unit hypersphere. We find that this property is crucial for
preserving the semantic generation ability of text-to-image diffusion models.
To improve finetuning stability, we further propose Constrained Orthogonal
Finetuning (COFT) which imposes an additional radius constraint to the
hypersphere. Specifically, we consider two important text-to-image finetuning
tasks: subject-driven generation where the goal is to generate subject-specific
images given a few images of a subject and a text prompt, and controllable
generation where the goal is to enable the model to take in additional control
signals. We empirically show that our OFT framework outperforms existing
methods in generation quality and convergence speed.
Comment: NeurIPS 2023 (43 pages, 34 figures, project page:
https://oft.wyliu.com/)
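The core idea behind OFT can be illustrated with a small sketch (not the authors' code): rather than adding deltas to pretrained weights, multiply them by an orthogonal matrix, which preserves the pairwise angles between neuron weight vectors, and hence the hyperspherical energy the abstract refers to. Here the orthogonal matrix is built from a single skew-symmetric parameter via the Cayley transform, a standard parametrization of orthogonal matrices; the 2-D case has a closed form.

```python
# Q = (I + S)(I - S)^-1 for skew-symmetric S = [[0, a], [-a, 0]].
# In 2-D this reduces to a rotation matrix, which is orthogonal.

def cayley_2x2(a):
    d = 1.0 + a * a
    return [[(1 - a * a) / d, 2 * a / d],
            [-2 * a / d, (1 - a * a) / d]]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(2)) for i in range(2)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

Q = cayley_2x2(0.5)                   # [[0.6, 0.8], [-0.8, 0.6]]
w1, w2 = [1.0, 0.0], [0.6, 0.8]       # toy "neuron" weight vectors
r1, r2 = matvec(Q, w1), matvec(Q, w2)

# Orthogonal finetuning preserves norms and pairwise inner products:
assert abs(dot(r1, r1) - dot(w1, w1)) < 1e-9
assert abs(dot(r1, r2) - dot(w1, w2)) < 1e-9
```

Because inner products are preserved exactly, the relational structure of the pretrained neurons survives finetuning; COFT's radius constraint would additionally bound how far `Q` may move from the identity.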
Image Aesthetic Assessment: A Comparative Study of Hand-Crafted & Deep Learning Models
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks
Neural network based computer vision systems are typically built on a
backbone, a pretrained or randomly initialized feature extractor. Several years
ago, the default option was an ImageNet-trained convolutional neural network.
However, the recent past has seen the emergence of countless backbones
pretrained using various algorithms and datasets. While this abundance of
choice has led to performance increases for a range of systems, it is difficult
for practitioners to make informed decisions about which backbone to choose.
Battle of the Backbones (BoB) makes this choice easier by benchmarking a
diverse suite of pretrained models, including vision-language models, those
trained via self-supervised learning, and the Stable Diffusion backbone, across
a diverse set of computer vision tasks ranging from classification to object
detection to OOD generalization and more. Furthermore, BoB sheds light on
promising directions for the research community to advance computer vision by
illuminating the strengths and weaknesses of existing approaches through a
comprehensive analysis conducted on more than 1500 training runs. While vision
transformers (ViTs) and self-supervised learning (SSL) are increasingly
popular, we find that convolutional neural networks pretrained in a supervised
fashion on large training sets still perform best on most tasks among the
models we consider. Moreover, in apples-to-apples comparisons on the same
architectures and similarly sized pretraining datasets, we find that SSL
backbones are highly competitive, indicating that future works should perform
SSL pretraining with advanced architectures and larger pretraining datasets. We
release the raw results of our experiments along with code that allows
researchers to put their own backbones through the gauntlet here:
https://github.com/hsouri/Battle-of-the-Backbones
Comment: Accepted to NeurIPS 202
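One simple way a benchmark like BoB can aggregate heterogeneous results is to rank backbones within each task and then average the ranks, so that tasks with different score scales contribute equally. The sketch below uses invented backbone names and scores purely for illustration; it is not BoB's actual aggregation code.

```python
# Hypothetical per-task scores: task -> {backbone: accuracy-like score}.
scores = {
    "classification": {"supervised_cnn": 0.84, "ssl_vit": 0.82, "clip": 0.80},
    "detection":      {"supervised_cnn": 0.55, "ssl_vit": 0.52, "clip": 0.50},
    "ood":            {"supervised_cnn": 0.61, "ssl_vit": 0.63, "clip": 0.60},
}

def average_rank(scores):
    """Rank backbones per task (1 = best), then average across tasks."""
    ranks = {b: [] for b in next(iter(scores.values()))}
    for task_scores in scores.values():
        ordered = sorted(task_scores, key=task_scores.get, reverse=True)
        for r, backbone in enumerate(ordered, start=1):
            ranks[backbone].append(r)
    return {b: sum(rs) / len(rs) for b, rs in ranks.items()}

avg = average_rank(scores)
best = min(avg, key=avg.get)   # lowest average rank wins overall
```

Averaging ranks rather than raw scores avoids letting one high-variance task dominate the comparison.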
CLIPAG: Towards Generator-Free Text-to-Image Generation
Perceptually Aligned Gradients (PAG) refer to an intriguing property observed
in robust image classification models, wherein their input gradients align
with human perception and carry semantic meaning. While this phenomenon has
gained significant research attention, it has so far been studied only in the
context of unimodal, vision-only architectures. In this work, we extend the study of PAG to
Vision-Language architectures, which form the foundations for diverse
image-text tasks and applications. Through an adversarial robustification
finetuning of CLIP, we demonstrate that robust Vision-Language models exhibit
PAG in contrast to their vanilla counterparts. This work reveals the merits of
CLIP with PAG (CLIPAG) in several vision-language generative tasks. Notably, we
show that seamlessly integrating CLIPAG in a "plug-n-play" manner leads to
substantial improvements in vision-language generative applications.
Furthermore, leveraging its PAG property, CLIPAG enables text-to-image
generation without any generative model, a task that typically requires huge
generators.
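The generator-free generation described above boils down to iteratively updating the input itself by ascending the gradient of a text-image score. As a minimal sketch (assuming a stand-in score, not CLIP): real CLIPAG ascends the gradient of a robust CLIP similarity, whereas here the score is a toy quadratic whose gradient is available in closed form, so the loop's behavior can be checked exactly.

```python
# Generation by gradient ascent on a score function, the mechanism CLIPAG
# exploits. Stand-in score: score(x) = -||x - target||^2, maximized at target.

def score_grad(x, target):
    """Closed-form gradient of -||x - target||^2 with respect to x."""
    return [-2 * (xi - ti) for xi, ti in zip(x, target)]

def generate(x, target, lr=0.1, steps=100):
    """Repeatedly move the 'image' x uphill on the score."""
    for _ in range(steps):
        g = score_grad(x, target)
        x = [xi + lr * gi for xi, gi in zip(x, g)]
    return x

target = [0.3, -0.7, 1.2]      # stand-in for "features the text describes"
x = generate([0.0, 0.0, 0.0], target)
assert all(abs(xi - ti) < 1e-6 for xi, ti in zip(x, target))
```

With a robust (PAG-exhibiting) score, each ascent step moves the input in a perceptually meaningful direction, which is why no separate generator network is needed.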
VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining
Assessing the aesthetics of an image is challenging, as it is influenced by
multiple factors including composition, color, style, and high-level semantics.
Existing image aesthetic assessment (IAA) methods primarily rely on
human-labeled rating scores, which oversimplify the visual aesthetic
information that humans perceive. Conversely, user comments offer more
comprehensive information and are a more natural way to express human opinions
and preferences regarding image aesthetics. In light of this, we propose
learning image aesthetics from user comments, and exploring vision-language
pretraining methods to learn multimodal aesthetic representations.
Specifically, we pretrain an image-text encoder-decoder model with
image-comment pairs, using contrastive and generative objectives to learn rich
and generic aesthetic semantics without human labels. To efficiently adapt the
pretrained model for downstream IAA tasks, we further propose a lightweight
rank-based adapter that employs text as an anchor to learn the aesthetic
ranking concept. Our results show that our pretrained aesthetic vision-language
model outperforms prior works on image aesthetic captioning over the
AVA-Captions dataset, and it has powerful zero-shot capability for aesthetic
tasks such as zero-shot style classification and zero-shot IAA, surpassing many
supervised baselines. By finetuning only the lightweight adapter's
parameters, our model achieves state-of-the-art IAA performance
over the AVA dataset.
Comment: CVPR 2023,
https://github.com/google-research/google-research/tree/master/vil
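The contrastive objective used in image-comment pretraining can be sketched as follows. This is an illustration with invented toy embeddings, not VILA's code: matched image/comment pairs should score higher than mismatched ones, enforced by a softmax cross-entropy over pairwise similarities (the InfoNCE loss).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(img_embs, txt_embs, temperature=0.1):
    """Average -log softmax probability of each image's own comment."""
    loss = 0.0
    for i, img in enumerate(img_embs):
        logits = [dot(img, txt) / temperature for txt in txt_embs]
        m = max(logits)                      # stabilize the log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]            # -log p(matching comment)
    return loss / len(img_embs)

# Toy embeddings where each image matches its own comment best:
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.9, 0.1], [0.1, 0.9]]
aligned = contrastive_loss(imgs, txts)
shuffled = contrastive_loss(imgs, txts[::-1])
assert aligned < shuffled   # lower loss when pairs are correctly aligned
```

Minimizing this loss pulls each image embedding toward its own comment's embedding and away from the others, which is how aesthetic semantics can be learned from comments without rating labels.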
Face Cartoonisation For Various Poses Using StyleGAN
This paper presents a novel approach to face cartoonisation that preserves
the original identity and accommodates various poses. Unlike
previous methods in this field that relied on conditional-GANs, which posed
challenges related to dataset requirements and pose training, our approach
leverages the expressive latent space of StyleGAN. We achieve this by
introducing an encoder that captures both pose and identity information from
images and generates a corresponding embedding within the StyleGAN latent
space. By subsequently passing this embedding through a pre-trained generator,
we obtain the desired cartoonised output. While many other approaches based on
StyleGAN necessitate a dedicated and fine-tuned StyleGAN model, our method
stands out by utilizing an already-trained StyleGAN designed to produce
realistic facial images. Through extensive experimentation, we show how our
encoder adapts the StyleGAN output to better preserve identity when the
objective is cartoonisation.
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
We present Stable Video Diffusion - a latent video diffusion model for
high-resolution, state-of-the-art text-to-video and image-to-video generation.
Recently, latent diffusion models trained for 2D image synthesis have been
turned into generative video models by inserting temporal layers and finetuning
them on small, high-quality video datasets. However, training methods in the
literature vary widely, and the field has yet to agree on a unified strategy
for curating video data. In this paper, we identify and evaluate three
different stages for successful training of video LDMs: text-to-image
pretraining, video pretraining, and high-quality video finetuning. Furthermore,
we demonstrate the necessity of a well-curated pretraining dataset for
generating high-quality videos and present a systematic curation process to
train a strong base model, including captioning and filtering strategies. We
then explore the impact of finetuning our base model on high-quality data and
train a text-to-video model that is competitive with closed-source video
generation models. We also show that our base model provides a powerful motion
representation for downstream tasks such as image-to-video generation and
adaptability to camera motion-specific LoRA modules. Finally, we demonstrate
that our model provides a strong multi-view 3D-prior and can serve as a base to
finetune a multi-view diffusion model that jointly generates multiple views of
objects in a feedforward fashion, outperforming image-based methods at a
fraction of their compute budget. We release code and model weights at
https://github.com/Stability-AI/generative-models