Multimodal Color Recommendation in Vector Graphic Documents
Color selection plays a critical role in graphic document design and requires
careful consideration of various contexts. However, recommending colors that
harmonize with the other colors and the textual context in a document is
challenging, even for experienced designers. In this study, we
propose a multimodal masked color model that integrates both color and textual
contexts to provide text-aware color recommendation for graphic documents. Our
proposed model comprises self-attention networks to capture the relationships
between colors in multiple palettes, and cross-attention networks that
incorporate both color and CLIP-based text representations. Our proposed method
primarily focuses on color palette completion, which recommends colors based on
the given colors and text. Additionally, it is applicable to another color
recommendation task, full palette generation, which generates a complete color
palette corresponding to the given text. Experimental results demonstrate that
our approach surpasses previous color palette completion methods in accuracy,
color distribution, and user experience, and outperforms full palette
generation methods in color diversity and similarity to the ground-truth
palettes.
Comment: Accepted to ACM MM 2023
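A minimal PyTorch sketch of the idea described above, assuming colors are quantized into a discrete vocabulary and document text is encoded by a frozen CLIP-style model; all class names, dimensions, and the mask-token convention are illustrative, not the authors' implementation:

import torch
import torch.nn as nn

class MultimodalMaskedColorModel(nn.Module):
    def __init__(self, color_vocab=4096, text_dim=512, d_model=256, n_layers=4):
        super().__init__()
        self.mask_id = color_vocab                    # extra [MASK] token id
        self.color_emb = nn.Embedding(color_vocab + 1, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        # A decoder layer gives self-attention over palette colors plus
        # cross-attention to the text memory, matching the described design.
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, color_vocab)

    def forward(self, color_ids, text_feats):
        # color_ids: (B, L) quantized palette tokens, some set to mask_id
        # text_feats: (B, T, text_dim) CLIP-style embeddings of document text
        x = self.color_emb(color_ids)
        mem = self.text_proj(text_feats)
        h = self.decoder(x, mem)                      # colors attend to text
        return self.head(h)                           # (B, L, vocab) logits

model = MultimodalMaskedColorModel()
colors = torch.randint(0, 4096, (2, 6))
colors[:, 2] = model.mask_id                          # slot to recommend
logits = model(colors, torch.randn(2, 10, 512))       # stand-in text features
print(logits[:, 2].argmax(-1))                        # recommended color ids

Masking different subsets of the input covers both tasks: masking a few slots gives palette completion, while masking every slot corresponds to full palette generation conditioned on the text alone.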
Color Recommendation for Vector Graphic Documents based on Multi-Palette Representation
Vector graphic documents present multiple visual elements, such as images,
shapes, and texts. Choosing appropriate colors for multiple visual elements is
a difficult but crucial task for both amateurs and professional designers.
Instead of creating a single color palette for all elements, we extract
multiple color palettes from each visual element in a graphic document, and
then combine them into a color sequence. We propose a masked color model for
color sequence completion that recommends the most probable colors for
specified positions based on the color context across the multiple palettes.
We train the model and build a
color recommendation system on a large-scale dataset of vector graphic
documents. The proposed color recommendation method outperformed other
state-of-the-art methods in both quantitative and qualitative evaluations of
color prediction, and our color recommendation system received positive
feedback from professional designers in an interview study.
Comment: Accepted to WACV 2023
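A rough sketch of this masked-color-sequence idea, again in PyTorch; the quantized color vocabulary, the [SEP]/[MASK] token ids, and all sizes are assumptions for illustration:

import torch
import torch.nn as nn

VOCAB, SEP, MASK = 4096, 4096, 4097    # quantized colors + special tokens

class MaskedColorModel(nn.Module):
    def __init__(self, d_model=256, n_layers=4, max_len=64):
        super().__init__()
        self.tok = nn.Embedding(VOCAB + 2, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, ids):            # ids: (B, L) color/SEP/MASK tokens
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.enc(self.tok(ids) + self.pos(pos))
        return self.head(h)            # (B, L, VOCAB) logits per position

# Two per-element palettes joined into one sequence with a separator; the
# masked slot is filled from its multi-palette context.
seq = torch.tensor([[11, 52, SEP, 840, MASK, 97]])
probs = MaskedColorModel()(seq).softmax(-1)
print(probs[0, 4].topk(3).indices)     # top-3 recommended colors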
KnowIT VQA: Answering Knowledge-Based Questions about Videos
We propose a novel video understanding task by fusing knowledge-based and
video question answering. First, we introduce KnowIT VQA, a video dataset with
24,282 human-generated question-answer pairs about a popular sitcom. The
dataset combines visual, textual, and temporal coherence reasoning together
with knowledge-based questions, which require experience gained from watching
the series to be answered. Second, we propose a video understanding
model by combining the visual and textual video content with specific knowledge
about the show. Our main findings are: (i) the incorporation of knowledge
produces outstanding improvements for VQA in video, and (ii) the performance on
KnowIT VQA still lags well behind human accuracy, indicating its usefulness for
studying current video modelling limitations.
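Purely as an illustration of the kind of fusion involved (the paper's actual architecture differs, and these feature extractors and dimensions are assumptions), a late-fusion scorer that ranks candidate answers from visual, subtitle, and show-knowledge features might look like:

import torch
import torch.nn as nn

class FusionAnswerScorer(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=768, kn_dim=768, d=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + txt_dim + kn_dim + txt_dim, d),
            nn.ReLU(),
            nn.Linear(d, 1))           # one score per candidate answer

    def forward(self, vis, txt, knowledge, answers):
        # vis/txt/knowledge: (B, dim) pooled features; answers: (B, A, txt_dim)
        A = answers.size(1)
        ctx = torch.cat([vis, txt, knowledge], dim=-1)
        ctx = ctx.unsqueeze(1).expand(-1, A, -1)      # repeat context per answer
        return self.fuse(torch.cat([ctx, answers], -1)).squeeze(-1)  # (B, A)

scorer = FusionAnswerScorer()
scores = scorer(torch.randn(2, 512), torch.randn(2, 768),
                torch.randn(2, 768), torch.randn(2, 4, 768))
print(scores.argmax(-1))               # predicted answer index per question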
LayoutDM: Discrete Diffusion Model for Controllable Layout Generation
Controllable layout generation aims at synthesizing a plausible arrangement of
element bounding boxes with optional constraints, such as the type or position of a
specific element. In this work, we try to solve a broad range of layout
generation tasks in a single model that is based on discrete state-space
diffusion models. Our model, named LayoutDM, naturally handles the structured
layout data in the discrete representation and learns to progressively infer a
noiseless layout from the initial input, where we model the layout corruption
process by modality-wise discrete diffusion. For conditional generation, we
propose to inject layout constraints in the form of masking or logit adjustment
during inference. We show in the experiments that our LayoutDM successfully
generates high-quality layouts and outperforms both task-specific and
task-agnostic baselines on several layout tasks.
Comment: To be published in CVPR 2023. Project page: https://cyberagentailab.github.io/layout-dm
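A minimal sketch (not the released LayoutDM code) of how the two constraint mechanisms can be injected into a discrete-diffusion sampling step; denoise_step and denoiser are illustrative names, with denoiser standing in for any network that maps noisy layout tokens to per-token categorical logits:

import torch

def denoise_step(denoiser, tokens, t, known_mask=None, known_ids=None,
                 logit_bias=None):
    logits = denoiser(tokens, t)               # (B, L, V) categorical logits
    if logit_bias is not None:                 # soft constraint: adjust logits
        logits = logits + logit_bias
    probs = logits.softmax(-1)
    sampled = torch.multinomial(probs.flatten(0, 1), 1).view(tokens.shape)
    if known_mask is not None:                 # hard constraint: keep the
        sampled = torch.where(known_mask, known_ids, sampled)  # given fields
    return sampled

denoiser = lambda x, t: torch.randn(*x.shape, 128)   # stand-in network
toks = torch.randint(0, 128, (1, 20))
known = torch.zeros(1, 20, dtype=torch.bool)
known[0, 0] = True                                   # e.g., fix one element's type
for t in reversed(range(10)):                        # reverse diffusion loop
    toks = denoise_step(denoiser, toks, t, known, torch.full((1, 20), 5))

Overwriting sampled tokens with the given ones enforces hard constraints exactly, while logit adjustment only biases the distribution, e.g. toward a preferred position.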
Towards Flexible Multi-modal Document Models
Creative workflows for generating graphical documents involve complex
inter-related tasks, such as aligning elements, choosing appropriate fonts, or
employing aesthetically harmonious colors. In this work, we attempt to build
a holistic model that can jointly solve many different design tasks. Our model,
which we denote by FlexDM, treats vector graphic documents as a set of
multi-modal elements, and learns to predict masked fields such as element type,
position, styling attributes, image, or text, using a unified architecture.
Through the use of explicit multi-task learning and in-domain pre-training, our
model can better capture the multi-modal relationships among the different
document fields. Experimental results corroborate that our single FlexDM is
able to successfully solve a multitude of different design tasks, while
achieving performance that is competitive with task-specific and costly
baselines.
Comment: To be published in CVPR 2023 (highlight). Project page: https://cyberagentailab.github.io/flex-dm
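A schematic sketch of masked multi-modal field prediction in this spirit; the field set, vocabularies, and fusion-by-summation are assumptions for illustration rather than FlexDM's actual design:

import torch
import torch.nn as nn

FIELDS = {"type": 16, "position": 64, "color": 4096}  # illustrative vocabs

class MaskedFieldModel(nn.Module):
    def __init__(self, d=256, n_layers=4):
        super().__init__()
        self.emb = nn.ModuleDict(
            {f: nn.Embedding(v + 1, d) for f, v in FIELDS.items()})  # +[MASK]
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.heads = nn.ModuleDict(
            {f: nn.Linear(d, v) for f, v in FIELDS.items()})

    def forward(self, fields):          # fields[f]: (B, N) per-element tokens
        x = sum(self.emb[f](ids) for f, ids in fields.items())
        h = self.enc(x)                 # attention across document elements
        return {f: self.heads[f](h) for f in fields}

doc = {f: torch.randint(0, v, (1, 5)) for f, v in FIELDS.items()}
doc["color"][0, 2] = FIELDS["color"]    # mask one element's color field
out = MaskedFieldModel()(doc)
print(out["color"][0, 2].argmax(-1))    # predicted value for the masked field

Choosing which fields to mask selects the design task: masking positions yields layout, masking colors yields colorization, and so on, which is how a single masked model can cover many tasks.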
Generative Colorization of Structured Mobile Web Pages
Color is a critical design factor for web pages, affecting factors such as
viewer emotions and the overall trust in and satisfaction with a website.
Effective coloring requires design knowledge and expertise, but if this process
could be automated through data-driven modeling, efficient exploration and
alternative workflows would be possible. However, this direction remains
underexplored due to the lack of a formalization of the web page colorization
problem, datasets, and evaluation protocols. In this work, we propose a new
dataset consisting of e-commerce mobile web pages in a tractable format, which
are created by simplifying the pages and extracting canonical color styles with
a common web browser. The web page colorization problem is then formalized as a
task of estimating plausible color styles for given web page content with a
given hierarchical structure of the elements. We present several
Transformer-based methods that are adapted to this task by prepending
structural message passing to capture hierarchical relationships between
elements. Experimental results, including a quantitative evaluation designed
for this task, demonstrate the advantages of our methods over statistical and
image colorization methods. The code is available at
https://github.com/CyberAgentAILab/webcolor.
Comment: Accepted to WACV 2023
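A minimal sketch (not the released webcolor code) of prepending structural message passing to a Transformer, where each element first aggregates its parent's state along the given hierarchy before global self-attention runs; the features, dimensions, and parent-index encoding are assumptions:

import torch
import torch.nn as nn

class StructuralColorizer(nn.Module):
    def __init__(self, feat_dim=64, d=128, n_colors=512, mp_steps=2):
        super().__init__()
        self.inp = nn.Linear(feat_dim, d)
        self.msg = nn.Linear(2 * d, d)        # child <- (child, parent)
        self.mp_steps = mp_steps
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_colors)    # quantized color styles

    def forward(self, feats, parent):
        # feats: (B, N, feat_dim) element content; parent: (B, N) parent index
        h = self.inp(feats)
        for _ in range(self.mp_steps):        # pass messages down the tree
            par = torch.gather(h, 1, parent.unsqueeze(-1).expand_as(h))
            h = torch.relu(self.msg(torch.cat([h, par], -1)))
        return self.head(self.enc(h))         # (B, N, n_colors) logits

parent = torch.tensor([[0, 0, 0, 1, 1]])      # node 0 is the root (own parent)
logits = StructuralColorizer()(torch.randn(1, 5, 64), parent)
print(logits.argmax(-1))                      # a color style id per element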