Cultural heritage applications and advanced machine learning models are
creating a fruitful synergy to provide effective and accessible ways of
interacting with artworks. Smart audio-guides, personalized art-related content
and gamification approaches are just a few examples of how technology can be
exploited to provide additional value to artists or exhibitions. Nonetheless,
from a machine learning point of view, the amount of available artistic data is
often not enough to train effective models. Off-the-shelf computer vision
modules can still be exploited to some extent, yet a severe domain shift exists
between art images and the standard natural image datasets used to train such
models, which can lead to degraded performance. This paper
introduces a novel approach to address the challenges of limited annotated data
and domain shifts in the cultural heritage domain. By leveraging generative
vision-language models, we augment art datasets by generating diverse
variations of artworks conditioned on their captions. This augmentation
strategy enhances dataset diversity, bridges the gap between natural images
and artworks, and improves the alignment of visual cues with knowledge from
general-purpose datasets. The generated variations help train vision-and-language
models that acquire a deeper understanding of artistic characteristics and can
generate better captions with appropriate jargon.

Comment: Accepted at ICCV 2023 4th Workshop on e-Heritage
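As an illustrative aside, the following is a minimal sketch of the kind of caption-conditioned augmentation described above, assuming a Stable Diffusion image-to-image pipeline from the Hugging Face diffusers library; the checkpoint name, file paths, strength, and guidance values are placeholders and not the authors' exact configuration.

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Load a text-conditioned image-to-image diffusion pipeline (placeholder checkpoint).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# An artwork and its caption, taken from a hypothetical art dataset.
init_image = Image.open("artwork.jpg").convert("RGB").resize((512, 512))
caption = "An impressionist oil painting of a garden at dusk"

# Generate several variations of the artwork conditioned on its caption;
# strength controls how far each variation may drift from the original image.
variations = pipe(
    prompt=caption,
    image=init_image,
    num_images_per_prompt=4,
    strength=0.5,
    guidance_scale=7.5,
).images

for i, img in enumerate(variations):
    img.save(f"artwork_variation_{i}.png")

The generated images can then be paired with the original caption and added to the training set, which is one way to realise the dataset-diversity goal stated in the abstract.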