182 research outputs found
Visual Programming for Text-to-Image Generation and Evaluation
As large language models have demonstrated impressive performance in many
domains, recent works have adopted language models (LMs) as controllers of
visual modules for vision-and-language tasks. While existing work focuses on
equipping LMs with visual understanding, we propose two novel
interpretable/explainable visual programming frameworks for text-to-image (T2I)
generation and evaluation. First, we introduce VPGen, an interpretable
step-by-step T2I generation framework that decomposes T2I generation into three
steps: object/count generation, layout generation, and image generation. We
employ an LM to handle the first two steps (object/count generation and layout
generation), by finetuning it on text-layout pairs. Our step-by-step T2I
generation framework provides stronger spatial control than end-to-end models,
the dominant approach for this task. Furthermore, we leverage the world
knowledge of pretrained LMs, overcoming the limitation of previous
layout-guided T2I works that can only handle predefined object classes. We
demonstrate that our VPGen provides stronger control over object counts, spatial
relations, and scales than state-of-the-art T2I generation models.
Second, we introduce VPEval, an interpretable and explainable evaluation
framework for T2I generation based on visual programming. Unlike previous T2I
evaluations with a single scoring model that is accurate in some skills but
unreliable in others, VPEval produces evaluation programs that invoke a set of
visual modules that are experts in different skills, and also provides
visual+textual explanations of the evaluation results. Our analysis shows
VPEval provides evaluations that correlate better with human judgments, on both
skill-specific and open-ended prompts, than widely used single-model evaluation. We hope our
work encourages future progress on interpretable/explainable generation and
evaluation for T2I models. Website: https://vp-t2i.github.ioComment: 18 pages; Project website: https://vp-t2i.github.i
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
We present Perceiver-VL, a vision-and-language framework that efficiently
handles high-dimensional multimodal inputs such as long videos and text.
Powered by the iterative latent cross-attention of Perceiver, our framework
scales with linear complexity, in contrast to the quadratic complexity of
self-attention used in many state-of-the-art transformer-based models. To
further improve the efficiency of our framework, we also study applying
LayerDrop on cross-attention layers and introduce a mixed-stream architecture
for cross-modal retrieval. We evaluate Perceiver-VL on diverse video-text and
image-text benchmarks, where Perceiver-VL achieves the lowest GFLOPs and
latency while maintaining competitive performance. In addition, we also provide
comprehensive analyses of various aspects of our framework, including
pretraining data, scalability of latent size and input size, dropping
cross-attention layers at inference to reduce latency, modality aggregation
strategy, positional encoding, and weight initialization strategy. Our code and
checkpoints are available at: https://github.com/zinengtang/Perceiver_VL
Comment: WACV 2023 (first two authors contributed equally)
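The linear-complexity claim above follows from Perceiver-style cross-attention: a small, fixed-size latent array queries the long input, so cost scales with input length times latent size rather than input length squared. A toy NumPy sketch (shapes and layer count are illustrative, not Perceiver-VL's configuration):

```python
import numpy as np

# Toy iterative latent cross-attention in the Perceiver style.
rng = np.random.default_rng(0)
d = 16           # feature dimension
n_inputs = 1000  # long multimodal input (e.g., video + text tokens)
n_latents = 32   # small fixed latent array -> cost linear in n_inputs

inputs = rng.normal(size=(n_inputs, d))
latents = rng.normal(size=(n_latents, d))

def cross_attend(latents, inputs):
    """Latents (queries) attend over inputs (keys/values): O(n_latents * n_inputs)."""
    scores = latents @ inputs.T / np.sqrt(d)            # (n_latents, n_inputs)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over inputs
    return weights @ inputs                             # (n_latents, d)

# Iteratively refine the latent array over a few "layers".
for _ in range(4):
    latents = latents + cross_attend(latents, inputs)

print(latents.shape)  # (32, 16)
```

Doubling `n_inputs` doubles the attention cost here, whereas full self-attention over the inputs would quadruple it; dropping some of these cross-attention layers at inference, as the paper studies, trades accuracy for further latency savings.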
DiagrammerGPT: Generating Open-Domain, Open-Platform Diagrams via LLM Planning
Text-to-image (T2I) generation has seen significant growth over the past few
years. Despite this, there has been little work on generating diagrams with T2I
models. A diagram is a symbolic/schematic representation that explains
information using structurally rich and spatially complex visualizations (e.g.,
a dense combination of related objects, text labels, directional arrows,
connection lines, etc.). Existing state-of-the-art T2I models often fail at
diagram generation because they lack fine-grained object layout control when
many objects are densely connected via complex relations such as arrows/lines
and also often fail to render comprehensible text labels. To address this gap,
we present DiagrammerGPT, a novel two-stage text-to-diagram generation
framework that leverages the layout guidance capabilities of LLMs (e.g., GPT-4)
to generate more accurate open-domain, open-platform diagrams. In the first
stage, we use LLMs to generate and iteratively refine 'diagram plans' (in a
planner-auditor feedback loop) which describe all the entities (objects and
text labels), their relationships (arrows or lines), and their bounding box
layouts. In the second stage, we use a diagram generator, DiagramGLIGEN, and a
text label rendering module to generate diagrams following the diagram plans.
To benchmark the text-to-diagram generation task, we introduce AI2D-Caption, a
densely annotated diagram dataset built on top of the AI2D dataset. We show
quantitatively and qualitatively that our DiagrammerGPT framework produces more
accurate diagrams, outperforming existing T2I models. We also provide
comprehensive analysis including open-domain diagram generation, vector graphic
diagram generation in different platforms, human-in-the-loop diagram plan
editing, and multimodal planner/auditor LLMs (e.g., GPT-4Vision). We hope our
work can inspire further research on diagram generation via T2I models and
LLMs.
Comment: Project page: https://diagrammerGPT.github.io
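A 'diagram plan' as described above bundles entities (objects and text labels), relationships (arrows or lines), and bounding-box layouts; the auditor step can then check the plan for consistency before rendering. The schema below is an assumption for illustration, not the paper's exact format:

```python
# Hypothetical 'diagram plan' with entities, relationships, and normalized
# bounding boxes. Field names are illustrative assumptions.
diagram_plan = {
    "entities": [
        {"id": "e0", "type": "object", "name": "sun",   "bbox": [0.05, 0.05, 0.25, 0.25]},
        {"id": "e1", "type": "object", "name": "plant", "bbox": [0.40, 0.55, 0.65, 0.95]},
        {"id": "t0", "type": "text", "text": "light energy", "bbox": [0.20, 0.30, 0.45, 0.38]},
    ],
    "relationships": [
        {"type": "arrow", "from": "e0", "to": "e1"},  # sun -> plant
    ],
}

def audit(plan) -> bool:
    """Auditor-style structural check: every relationship endpoint must be a known entity."""
    ids = {e["id"] for e in plan["entities"]}
    return all(r["from"] in ids and r["to"] in ids for r in plan["relationships"])

print(audit(diagram_plan))
```

Because the plan is plain structured data, the human-in-the-loop editing the paper analyzes amounts to modifying this object before the second-stage generator and text-rendering module consume it.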
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
Although recent text-to-video (T2V) generation methods have seen significant
advancements, most of these works focus on producing short video clips of a
single event with a single background (i.e., single-scene videos). Meanwhile,
recent large language models (LLMs) have demonstrated their capability in
generating layouts and programs to control downstream visual modules such as
image generation models. This raises an important question: can we leverage the
knowledge embedded in these LLMs for temporally consistent long video
generation? In this paper, we propose VideoDirectorGPT, a novel framework for
consistent multi-scene video generation that uses the knowledge of LLMs for
video content planning and grounded video generation. Specifically, given a
single text prompt, we first ask our video planner LLM (GPT-4) to expand it
into a 'video plan', which involves generating the scene descriptions, the
entities with their respective layouts, the background for each scene, and
consistency groupings of the entities and backgrounds. Next, guided by this
output from the video planner, our video generator, Layout2Vid, has explicit
control over spatial layouts and can maintain temporal consistency of
entities/backgrounds across scenes, while being trained only with image-level
annotations. Our experiments demonstrate that the VideoDirectorGPT framework
substantially improves layout and movement control in both single- and
multi-scene video generation and can generate multi-scene videos with visual
consistency across scenes, while achieving competitive performance with SOTAs
in open-domain single-scene T2V generation. We also demonstrate that our
framework can dynamically control the strength of layout guidance and can also
generate videos with user-provided images. We hope our framework can inspire
future work on better integrating the planning ability of LLMs into consistent
long video generation.
Comment: Project page: https://videodirectorgpt.github.i
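A 'video plan' as described above has four parts: scene descriptions, entities with layouts, a background per scene, and consistency groupings tying entities/backgrounds together across scenes. The field names in this sketch are assumptions, not the paper's schema:

```python
# Hypothetical 'video plan' produced by the planner LLM. All field names
# are illustrative assumptions.
video_plan = {
    "scenes": [
        {"description": "A chef kneads dough in a kitchen",
         "background": "kitchen",
         "entities": [{"name": "chef", "bbox": [0.3, 0.2, 0.7, 0.9]}]},
        {"description": "The same chef slides bread into an oven",
         "background": "kitchen",
         "entities": [{"name": "chef", "bbox": [0.1, 0.2, 0.5, 0.9]}]},
    ],
    # Scenes listed in one group must reuse the same identity for that
    # entity/background, which is what yields cross-scene consistency.
    "consistency_groups": {"chef": [0, 1], "kitchen": [0, 1]},
}

def grouping_satisfied(plan, name) -> bool:
    """Check the named entity/background appears in every scene of its group."""
    return all(
        any(e["name"] == name for e in plan["scenes"][i]["entities"])
        or plan["scenes"][i]["background"] == name
        for i in plan["consistency_groups"][name]
    )

print(grouping_satisfied(video_plan, "chef"))
```

The grounded generator (Layout2Vid in the paper) then conditions each scene on its layout while sharing representations within each consistency group.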
Disentangling Structure and Style: Political Bias Detection in News by Inducing Document Hierarchy
We address an important gap in detection of political bias in news articles.
Previous works that perform supervised document classification can be biased
towards the writing style of each news outlet, leading to overfitting and
limited generalizability. Our approach overcomes this limitation by considering
both the sentence-level semantics and the document-level rhetorical structure,
resulting in a more robust and style-agnostic approach to detecting political
bias in news articles. We introduce a novel multi-head hierarchical attention
model that effectively encodes the structure of long documents through a
diverse ensemble of attention heads. While journalism follows a formalized
rhetorical structure, the writing style may vary by news outlet. We demonstrate
that our method overcomes this domain dependency and outperforms previous
approaches in both robustness and accuracy. Further analysis demonstrates the
ability of our model to capture the discourse structures commonly used in the
journalism domain.
Comment: Preprint. Under review
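The two-level design the abstract describes, sentence-level semantics pooled under document-level structure with a diverse ensemble of attention heads, can be sketched with plain NumPy. Dimensions and the pooling scheme here are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

# Toy two-level (hierarchical) attention: words -> sentence vectors ->
# document representation via multiple document-level attention heads.
rng = np.random.default_rng(1)
d = 8
doc = [rng.normal(size=(n_words, d)) for n_words in (5, 9, 7)]  # 3 sentences

def attend(vectors, query):
    """Attention-weighted pooling of row vectors under a learned query."""
    scores = vectors @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ vectors

# Sentence level: pool each sentence's words into one vector.
word_query = rng.normal(size=d)
sentences = np.stack([attend(words, word_query) for words in doc])  # (3, d)

# Document level: each head has its own query, giving one structural
# "view" of the document; concatenate the views for classification.
heads = [rng.normal(size=d) for _ in range(4)]
doc_repr = np.concatenate([attend(sentences, q) for q in heads])

print(doc_repr.shape)  # (32,)
```

Because the heads attend over sentence positions rather than outlet-specific word choices, a classifier on `doc_repr` leans on rhetorical structure, which is the intuition behind the style-agnostic claim.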