Learning to Globally Edit Images with Textual Description
We show how we can globally edit images using textual instructions: given a
source image and a textual instruction for the edit, generate a new image
transformed under this instruction. To tackle this novel problem, we develop
three different trainable models based on RNNs and Generative Adversarial
Networks (GANs). The models (bucket, filter bank, and end-to-end) differ in how
much expert knowledge is encoded, with the most general version being purely
end-to-end. To train these systems, we use Amazon Mechanical Turk to collect
textual descriptions for around 2000 image pairs sampled from several datasets.
Experimental results on our dataset validate our approaches. In
addition, given that the filter bank model is a good compromise between
generality and performance, we investigate it further by replacing the RNN
with a Graph RNN, which improves performance. To the best of our
knowledge, this is the first computational photography work on global image
editing that is purely based on free-form textual instructions.
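To make the filter-bank idea concrete, here is a minimal sketch, not the authors' implementation: a hypothetical GRU text encoder reads the instruction and predicts one scalar parameter per global filter in an assumed bank of brightness, contrast, and gamma operations, which are then applied to the source image in sequence.

```python
import torch
import torch.nn as nn

def brightness(img, p):                        # p in [-1, 1]
    return (img + p).clamp(0, 1)

def contrast(img, p):
    mean = img.mean(dim=(-3, -2, -1), keepdim=True)
    return ((img - mean) * (1 + p) + mean).clamp(0, 1)

def gamma(img, p):                             # exponentiated so gamma > 0
    return img.clamp(1e-4, 1) ** torch.exp(p)

FILTERS = [brightness, contrast, gamma]        # illustrative, assumed bank

class FilterBankEditor(nn.Module):
    """Encode the instruction with a GRU, predict one scalar parameter
    per global filter, and apply the filters to the image in sequence."""
    def __init__(self, vocab_size, emb=64, hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.rnn = nn.GRU(emb, hid, batch_first=True)
        self.head = nn.Linear(hid, len(FILTERS))

    def forward(self, img, tokens):            # img: (B,3,H,W), tokens: (B,T)
        _, h = self.rnn(self.embed(tokens))    # h: (1, B, hid)
        params = torch.tanh(self.head(h[-1]))  # (B, n_filters), in [-1, 1]
        for i, f in enumerate(FILTERS):
            img = f(img, params[:, i].view(-1, 1, 1, 1))
        return img
```

Such a design keeps the expert knowledge in the hand-chosen filter bank while learning only the text-to-parameter mapping, which is what makes it a compromise between the bucket and end-to-end variants.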
Adjusting Image Attributes of Localized Regions with Low-level Dialogue
Natural Language Image Editing (NLIE) aims to use natural language
instructions to edit images. Since novices are inexperienced with image editing
techniques, their instructions are often ambiguous and contain high-level
abstractions that correspond to complex sequences of editing steps. Motivated
by this inexperience, we aim to smooth the learning curve by teaching novices
to edit images using low-level command terminology.
Towards this end, we develop a task-oriented dialogue system to investigate
low-level instructions for NLIE. Our system grounds language on the level of
edit operations, and suggests options for a user to choose from. Although
users are compelled to express themselves in low-level terms, a user
evaluation shows that 25% of them found our system easy to use, resonating
with our motivation. An analysis
shows that users generally adapt to utilizing the proposed low-level language
interface. In this study, we identify object segmentation as the key factor
in user satisfaction. Our work demonstrates the advantages of the
low-level, direct language-action mapping approach that can be applied to other
problem domains beyond image editing, such as audio editing or industrial
design.
Comment: Accepted as a Poster presentation at the 12th International
Conference on Language Resources and Evaluation (LREC 2020).
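As an illustration of the direct language-action mapping described above, here is a minimal sketch with a hypothetical command grammar; the paper's actual operation set and dialogue policy are not specified here. Commands that parse are grounded directly to an edit action, and anything else triggers the option-suggestion fallback.

```python
import re

# Hypothetical low-level command grammar: "<operation> <object> <amount>",
# e.g. "brighten sky +20" or "saturate shirt -10".
OPERATIONS = {"brighten", "darken", "saturate", "desaturate", "sharpen"}
CMD = re.compile(r"^(\w+)\s+(\w+)\s+([+-]?\d+)$")

def parse_command(utterance):
    """Ground an utterance directly to an edit action, or return the
    option menu the system would suggest when grounding fails."""
    m = CMD.match(utterance.strip().lower())
    if m and m.group(1) in OPERATIONS:
        op, target, amount = m.groups()
        return {"op": op, "target": target, "amount": int(amount)}
    return {"suggest": sorted(OPERATIONS)}  # dialogue fallback: offer options

print(parse_command("Brighten sky +20"))
# {'op': 'brighten', 'target': 'sky', 'amount': 20}
print(parse_command("make it pop"))
# {'suggest': ['brighten', 'darken', 'desaturate', 'saturate', 'sharpen']}
```

The fallback branch is where the dialogue character of the system lives: rather than guessing at an ambiguous high-level request, it steers the user toward the unambiguous low-level vocabulary.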
Expressing Visual Relationships via Language
Describing images with text is a fundamental problem in vision-language
research. Current studies in this domain mostly focus on single image
captioning. However, in various real applications (e.g., image editing,
difference interpretation, and retrieval), generating relational captions for
two images can also be very useful. This important problem has remained
largely unexplored due to a lack of datasets and effective models. To push forward
the research in this direction, we first introduce a new language-guided image
editing dataset that contains a large number of real image pairs with
corresponding editing instructions. We then propose a new relational speaker
model based on an encoder-decoder architecture with static relational attention
and sequential multi-head attention. We also extend the model with dynamic
relational attention, which calculates visual alignment while decoding. Our
models are evaluated on our newly collected and two public datasets consisting
of image pairs annotated with relationship sentences. Experimental results,
based on both automatic and human evaluation, demonstrate that our model
outperforms all baselines and existing methods on all the datasets.
Comment: ACL 2019 (11 pages).
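A simplified sketch of the dynamic relational attention idea follows; the exact formulation in the paper may differ. At each decoding step, scores over region pairs from the two images are computed from the decoder state plus a cross-image alignment term, normalized jointly, and pooled into a relational context vector for the speaker.

```python
import torch
import torch.nn.functional as F

def dynamic_relational_attention(dec_state, feats1, feats2):
    """One decoding step of a simplified dynamic relational attention:
    attend over regions of both images, conditioned on the decoder state,
    while aligning the two images through a cross-image score.

    dec_state: (B, d)    current decoder hidden state
    feats1:    (B, N, d) region features of the first image
    feats2:    (B, M, d) region features of the second image
    """
    # decoder-to-region scores for each image
    s1 = torch.einsum("bd,bnd->bn", dec_state, feats1)    # (B, N)
    s2 = torch.einsum("bd,bmd->bm", dec_state, feats2)    # (B, M)
    # cross-image alignment scores between region pairs
    cross = torch.einsum("bnd,bmd->bnm", feats1, feats2)  # (B, N, M)
    # joint score over region pairs, normalized together
    joint = s1.unsqueeze(2) + s2.unsqueeze(1) + cross     # (B, N, M)
    attn = F.softmax(joint.flatten(1), dim=-1).view_as(joint)
    # attended relational context fed back into the decoder
    ctx1 = torch.einsum("bnm,bnd->bd", attn, feats1)
    ctx2 = torch.einsum("bnm,bmd->bd", attn, feats2)
    return ctx1 + ctx2
```

Because the attention is recomputed from the decoder state at every step, the alignment between the two images shifts as the caption unfolds, which is the distinction from a static relational attention computed once before decoding.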
A Benchmark and Baseline for Language-Driven Image Editing
Language-driven image editing can significantly reduce laborious image
editing work and is friendly to photography novices. However, most existing
work can only handle a specific image domain or perform global retouching. To
solve this new task, we first present a new language-driven
image editing dataset that supports both local and global editing with editing
operation and mask annotations. In addition, we propose a baseline method that
fully utilizes these annotations to solve the problem. Our method treats each
editing operation as a sub-module and can automatically predict operation
parameters. Such an approach not only performs well on challenging user data
but is also highly interpretable. We believe our work, including both the
benchmark and the baseline, will advance the image editing area towards a more
general and free-form level.
Comment: Accepted by ACCV 2020.
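A minimal sketch of the sub-module design, with hypothetical Brightness and Saturation operations (the paper's operation set and predictor differ): a shared image+instruction embedding selects an operation and predicts its parameters, and the chosen sub-module applies the edit under the predicted mask, so every step of the pipeline remains inspectable.

```python
import torch
import torch.nn as nn

class Brightness(nn.Module):
    n_params = 1
    def forward(self, img, mask, p):
        # apply the brightness shift only inside the predicted mask
        return (img + mask * p.view(-1, 1, 1, 1)).clamp(0, 1)

class Saturation(nn.Module):
    n_params = 1
    def forward(self, img, mask, p):
        gray = img.mean(dim=1, keepdim=True)
        edited = gray + (img - gray) * (1 + p.view(-1, 1, 1, 1))
        return torch.where(mask.bool(), edited.clamp(0, 1), img)

class ModularEditor(nn.Module):
    """Predict which operation to run and its parameters from a joint
    image+instruction embedding ctx, then dispatch to that sub-module."""
    def __init__(self, ops, ctx_dim=256):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.op_head = nn.Linear(ctx_dim, len(ops))
        self.param_heads = nn.ModuleList(
            [nn.Linear(ctx_dim, op.n_params) for op in ops])

    def forward(self, img, mask, ctx):
        # simplification: the whole batch shares one predicted operation
        op_idx = self.op_head(ctx).argmax(dim=-1)[0].item()
        params = torch.tanh(self.param_heads[op_idx](ctx))
        return self.ops[op_idx](img, mask, params)
```

Interpretability falls out of the structure: the predicted operation name, mask, and scalar parameters can all be shown to the user directly, unlike a monolithic image-to-image network.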