Image-Graph-Image Translation via Auto-Encoding
This work presents the first convolutional neural network that learns an
image-to-graph translation task without needing external supervision. Obtaining
graph representations of image content, where objects are represented as nodes
and their relationships as edges, is an important task in scene understanding.
Current methods are fully supervised and therefore require meticulous
annotations. To overcome this, we are the first to present a
self-supervised approach based on a fully-differentiable auto-encoder in which
the bottleneck encodes the graph's nodes and edges. This self-supervised
approach can currently encode simple line drawings into graphs and achieves
results comparable to a fully-supervised baseline in terms of F1 score on
triplet matching. Beyond these promising results, we outline several
directions for future research on extending our approach to more complex
imagery.
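As a rough illustration of the auto-encoding idea described above, the following is a minimal PyTorch sketch of an image-to-graph-to-image auto-encoder whose bottleneck is a set of node embeddings plus a dense edge (adjacency) matrix, trained purely by reconstructing the input. All layer sizes, the fixed node count, and the reconstruction loss are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch: image -> graph bottleneck -> image, self-supervised via
# reconstruction. Shapes and the node/edge decoding scheme are assumptions.
import torch
import torch.nn as nn

class ImageGraphAutoEncoder(nn.Module):
    def __init__(self, num_nodes=8, node_dim=32):
        super().__init__()
        self.num_nodes, self.node_dim = num_nodes, node_dim
        # CNN encoder: 64x64 grayscale image -> flat feature vector
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        feat = 32 * 16 * 16
        # Bottleneck heads: node embeddings and dense edge (adjacency) logits
        self.to_nodes = nn.Linear(feat, num_nodes * node_dim)
        self.to_edges = nn.Linear(feat, num_nodes * num_nodes)
        # Decoder: graph bottleneck -> reconstructed image
        self.decoder = nn.Sequential(
            nn.Linear(num_nodes * node_dim + num_nodes * num_nodes, feat),
            nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        nodes = self.to_nodes(h).view(-1, self.num_nodes, self.node_dim)
        edges = torch.sigmoid(self.to_edges(h)).view(-1, self.num_nodes, self.num_nodes)
        z = torch.cat([nodes.flatten(1), edges.flatten(1)], dim=1)
        return self.decoder(z), nodes, edges

model = ImageGraphAutoEncoder()
imgs = torch.rand(4, 1, 64, 64)                 # batch of line drawings
recon, nodes, edges = model(imgs)
loss = nn.functional.mse_loss(recon, imgs)      # self-supervised: no graph labels
loss.backward()
```

Because the entire pipeline is differentiable, the reconstruction loss alone forces the bottleneck to carry enough node and edge structure to redraw the image, which is the sense in which no external graph supervision is needed.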
Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training
Generating images from graph-structured inputs, such as scene graphs, is
uniquely challenging due to the difficulty of aligning nodes and connections in
graphs with objects and their relations in images. Most existing methods
address this challenge by using scene layouts, which are image-like
representations of scene graphs designed to capture the coarse structures of
scene images. Because scene layouts are manually crafted, the alignment with
images may not be fully optimized, causing suboptimal compliance between the
generated images and the original scene graphs. To tackle this issue, we
propose to learn scene graph embeddings by directly optimizing their alignment
with images. Specifically, we pre-train an encoder to extract both global and
local information from scene graphs that is predictive of the corresponding
images, using two loss functions: a masked autoencoding loss and a contrastive
loss. The former trains embeddings by reconstructing randomly masked image
regions, while the latter trains embeddings to discriminate between compliant
and non-compliant images according to the scene graph. Given these embeddings,
we build a latent diffusion model to generate images from scene graphs. The
resulting method, called SGDiff, allows for the semantic manipulation of
generated images by modifying scene graph nodes and connections. On the Visual
Genome and COCO-Stuff datasets, we demonstrate that SGDiff outperforms
state-of-the-art methods, as measured by both the Inception Score and Fréchet
Inception Distance (FID) metrics. We will release our source code and trained
models at https://github.com/YangLing0818/SGDiff.
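To make the two pre-training objectives concrete, here is an illustrative PyTorch sketch of a combined masked-autoencoding and contrastive (InfoNCE-style) loss over scene-graph and image embeddings. The patch decoder, the embedding dimensions, and the temperature are placeholder assumptions, not SGDiff's actual implementation.

```python
# Sketch of the two pre-training losses: masked autoencoding (reconstruct
# masked image patches from the scene-graph embedding) plus a contrastive
# loss over matching graph/image pairs. All names and shapes are assumptions.
import torch
import torch.nn.functional as F

def pretraining_losses(graph_emb, img_emb, patches, patch_mask, decoder,
                       temperature=0.07):
    """graph_emb: (B, D) scene-graph embeddings; img_emb: (B, D) image
    embeddings; patches: (B, N, P) flattened image patches;
    patch_mask: (B, N) bool, True where a patch was masked out."""
    # Masked autoencoding: predict masked patch pixels from the graph embedding.
    pred = decoder(graph_emb)                       # (B, N, P)
    mae_loss = F.mse_loss(pred[patch_mask], patches[patch_mask])

    # Contrastive (InfoNCE): matching graph/image pairs on the diagonal are
    # positives; every other pairing in the batch is a negative.
    g = F.normalize(graph_emb, dim=-1)
    v = F.normalize(img_emb, dim=-1)
    logits = g @ v.t() / temperature                # (B, B) similarity matrix
    targets = torch.arange(len(g))
    nce_loss = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
    return mae_loss + nce_loss

# Toy usage with random tensors and a trivial patch decoder (assumption):
B, D, N, P = 4, 128, 16, 48
decoder = torch.nn.Sequential(torch.nn.Linear(D, N * P),
                              torch.nn.Unflatten(1, (N, P)))
loss = pretraining_losses(torch.randn(B, D), torch.randn(B, D),
                          torch.randn(B, N, P), torch.rand(B, N) > 0.5, decoder)
loss.backward()
```

The two terms are complementary: the reconstruction term forces graph embeddings to carry local pixel-level detail, while the contrastive term shapes them to discriminate compliant from non-compliant images at the global level.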