4 research outputs found
Neural Aesthetic Image Reviewer
Recently, there has been rising interest in perceiving image aesthetics. Existing
works treat image aesthetics as a classification or regression problem. To
extend cognition from rating to reasoning, a deeper understanding of aesthetics
should be grounded in revealing why a high or low aesthetic score is assigned
to an image. From this point of view,
we propose a model referred to as Neural Aesthetic Image Reviewer, which can
not only give an aesthetic score for an image, but also generate a textual
description explaining why the image leads to a plausible rating score.
Specifically, we propose two multi-task architectures that combine shared
aesthetically semantic layers with high-level task-specific embedding layers to
improve performance on both tasks. To facilitate research
on this problem, we collect the AVA-Reviews dataset, which contains 52,118
images and 312,708 comments in total. Through multi-task learning, the proposed
models can rate aesthetic images as well as produce comments in an end-to-end
manner. Evaluation on the AVA-Reviews dataset confirms that the proposed models
outperform the baselines. Moreover,
we demonstrate experimentally that our model can generate textual reviews
related to aesthetics, which are consistent with human perception.

Comment: 8 pages, 13 figures
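The shared-plus-task-specific design described above can be sketched minimally in NumPy. All dimensions, weight matrices, and function names below are hypothetical illustrations of the general multi-task pattern (one shared representation feeding a rating head and a caption head), not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): image-feature, shared, vocabulary.
FEAT, SHARED, VOCAB = 512, 128, 1000

# One shared aesthetically-semantic layer feeding two task-specific heads.
W_shared  = rng.normal(0, 0.02, (FEAT, SHARED))
W_score   = rng.normal(0, 0.02, (SHARED, 1))      # aesthetic-rating head
W_caption = rng.normal(0, 0.02, (SHARED, VOCAB))  # caption word-logit head

def forward(image_feat):
    """Multi-task forward pass: the shared representation h feeds both
    the aesthetic-score head and a (first-step) caption-word head."""
    h = np.tanh(image_feat @ W_shared)            # shared representation
    z = (h @ W_score).item()
    score = 1.0 / (1.0 + np.exp(-z))              # rating squashed into (0, 1)
    word_logits = h @ W_caption                   # logits over the vocabulary
    return score, word_logits

score, logits = forward(rng.normal(size=FEAT))
```

Training such a model end to end simply sums a rating loss and a captioning loss over the two heads, so gradients through the shared layer serve both tasks.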
VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining
Assessing the aesthetics of an image is challenging, as it is influenced by
multiple factors including composition, color, style, and high-level semantics.
Existing image aesthetic assessment (IAA) methods primarily rely on
human-labeled rating scores, which oversimplify the visual aesthetic
information that humans perceive. Conversely, user comments offer more
comprehensive information and are a more natural way to express human opinions
and preferences regarding image aesthetics. In light of this, we propose
learning image aesthetics from user comments and exploring vision-language
pretraining methods to learn multimodal aesthetic representations.
Specifically, we pretrain an image-text encoder-decoder model with
image-comment pairs, using contrastive and generative objectives to learn rich
and generic aesthetic semantics without human labels. To efficiently adapt the
pretrained model for downstream IAA tasks, we further propose a lightweight
rank-based adapter that employs text as an anchor to learn the aesthetic
ranking concept. Our results show that our pretrained aesthetic vision-language
model outperforms prior works on image aesthetic captioning over the
AVA-Captions dataset, and it has powerful zero-shot capability for aesthetic
tasks such as zero-shot style classification and zero-shot IAA, surpassing many
supervised baselines. With only minimal finetuning parameters using the
proposed adapter module, our model achieves state-of-the-art IAA performance
over the AVA dataset.

Comment: CVPR 2023,
https://github.com/google-research/google-research/tree/master/vil
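The contrastive half of an image-comment pretraining objective can be sketched in plain NumPy. Everything here (embedding dimension, batch size, temperature, helper names) is a hypothetical illustration of an InfoNCE-style symmetric loss, not VILA's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of image/comment embedding
    pairs: matching pairs sit on the diagonal of the similarity matrix."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (B, B) cosine similarities

    def xent(l):
        # cross-entropy with the diagonal (the matching pair) as the target
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy batch: 4 image/comment pairs with hypothetical 32-dim embeddings.
img = rng.normal(size=(4, 32))
txt = img + 0.01 * rng.normal(size=(4, 32))       # near-aligned pairs
loss_aligned = info_nce_loss(img, txt)
loss_random  = info_nce_loss(img, rng.normal(size=(4, 32)))
```

Well-aligned image/comment pairs drive this loss toward zero, while mismatched pairs keep it high, which is what lets such pretraining learn aesthetic semantics from comments without human rating labels.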