ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with Unpaired Stylistic Corpora
Generating visually grounded image captions with specific linguistic styles
using unpaired stylistic corpora is a challenging task, especially since we
expect stylized captions with a wide variety of stylistic patterns. In this
paper, we propose a novel framework to generate Accurate and Diverse Stylized
Captions (ADS-Cap). Our ADS-Cap first uses a contrastive learning module to
align the image and text features, which unifies paired factual and unpaired
stylistic corpora during the training process. A conditional variational
auto-encoder is then used to automatically memorize diverse stylistic patterns
in latent space and enhance diversity through sampling. We also design a simple
but effective recheck module to boost style accuracy by filtering
style-specific captions. Experimental results on two widely used stylized image
captioning datasets show that regarding consistency with the image, style
accuracy and diversity, ADS-Cap achieves outstanding performance compared to
various baselines. We finally conduct extensive analyses to understand the
effectiveness of our method. Our code is available at
https://github.com/njucckevin/ADS-Cap.
Comment: Accepted at Natural Language Processing and Chinese Computing (NLPCC) 202
Manipulating Attributes of Natural Scenes via Hallucination
In this study, we explore building a two-stage framework for enabling users
to directly manipulate high-level attributes of a natural scene. The key to our
approach is a deep generative network which can hallucinate images of a scene
as if they were taken at a different season (e.g. during winter), weather
condition (e.g. on a cloudy day) or time of day (e.g. at sunset). Once the
scene is hallucinated with the given attributes, the corresponding look is then
transferred to the input image while keeping the semantic details intact,
giving a photo-realistic manipulation result. As the proposed framework
hallucinates what the scene will look like, it does not require any reference
style image as commonly utilized in most of the appearance or style transfer
approaches. Moreover, it allows a given scene to be manipulated simultaneously
according to a diverse set of transient attributes within a single model,
eliminating the need to train a separate network for each translation task.
Our comprehensive set of qualitative and quantitative results demonstrates the
effectiveness of our approach against the competing methods.
Comment: Accepted for publication in ACM Transactions on Graphic
AI-generated Content for Various Data Modalities: A Survey
AI-generated content (AIGC) methods aim to produce text, images, videos, 3D
assets, and other media using AI algorithms. Owing to their wide range of
applications and the demonstrated potential of recent work, AIGC developments
have attracted considerable attention recently, and AIGC methods have been
developed for various data modalities, such as image, video, text, 3D shape (as
voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human
avatar (body and head), 3D motion, and audio -- each presenting different
characteristics and challenges. Furthermore, there have also been many
significant developments in cross-modality AIGC methods, where generative
methods can receive conditioning input in one modality and produce outputs in
another. Examples include going from various modalities to image, video, 3D
shape, 3D scene, 3D avatar (body and head), 3D motion (skeleton and avatar),
and audio modalities. In this paper, we provide a comprehensive review of AIGC
methods across different data modalities, including both single-modality and
cross-modality methods, highlighting the various challenges, representative
works, and recent technical directions in each setting. We also survey the
representative datasets throughout the modalities, and present comparative
results for various modalities. We also discuss the challenges and
potential future research directions.
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision
Many high-level skills that are required for computer vision tasks, such as
parsing questions, comparing and contrasting semantics, and writing
descriptions, are also required in other domains such as natural language
processing. In this paper, we ask whether it is possible to learn those skills
from text data and then transfer them to vision tasks without ever training on
visual training data. Key to our approach is exploiting the joint embedding
space of contrastively trained vision and language encoders. In practice, there
can be systematic differences between embedding spaces for different modalities
in contrastive models, and we analyze how these differences affect our approach
and study strategies to mitigate this concern. We produce models using only
text training data on four representative tasks: image captioning, visual
entailment, visual question answering and visual news captioning, and evaluate
them on standard benchmarks using images. We find these models perform close to
models trained on images, while surpassing prior work for captioning and visual
entailment in this text-only setting by over 9 points, and outperforming all
prior work on visual news by over 30 points. We also showcase a variety of
stylistic image captioning models that are trained using no image data and no
human-curated language data, but instead using readily-available text data from
books, the web, or language models.
Comment: website (https://prior.allenai.org/projects/close), code
(https://github.com/allenai/close