    Image-Graph-Image Translation via Auto-Encoding

    This work presents the first convolutional neural network that learns an image-to-graph translation task without external supervision. Obtaining graph representations of image content, where objects are represented as nodes and their relationships as edges, is an important task in scene understanding. Current approaches are fully supervised and therefore require meticulous annotations. To overcome this, we present the first self-supervised approach, based on a fully-differentiable auto-encoder in which the bottleneck encodes the graph's nodes and edges. This self-supervised approach can currently encode simple line drawings into graphs and obtains results comparable to a fully-supervised baseline in terms of F1 score on triplet matching. Beyond these promising results, we outline several directions for future research on extending our approach to more complex imagery.
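The core idea in this abstract, an auto-encoder whose bottleneck is a graph (node features plus a soft adjacency matrix) trained only on pixel reconstruction, can be illustrated with a minimal sketch. The linear encoder/decoder, the bottleneck sizes, and all variable names here are illustrative assumptions standing in for the paper's CNN, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a 16x16 line drawing flattened to a vector.
image = rng.random(16 * 16)

# Hypothetical bottleneck: N nodes with D-dim features plus an
# N x N soft adjacency matrix -- together, the graph bottleneck.
N, D = 4, 8

# Random linear maps stand in for the paper's convolutional
# encoder/decoder (assumption, for illustration only).
W_nodes = rng.normal(0, 0.01, (image.size, N * D))
W_edges = rng.normal(0, 0.01, (image.size, N * N))
W_dec = rng.normal(0, 0.01, (N * D + N * N, image.size))

def encode(x):
    """Map an image to node features and a soft adjacency matrix."""
    nodes = (x @ W_nodes).reshape(N, D)
    # Sigmoid keeps edge strengths in (0, 1), like soft edges.
    edges = (1.0 / (1.0 + np.exp(-(x @ W_edges)))).reshape(N, N)
    return nodes, edges

def decode(nodes, edges):
    """Reconstruct the image from the graph bottleneck."""
    z = np.concatenate([nodes.ravel(), edges.ravel()])
    return z @ W_dec

nodes, edges = encode(image)
recon = decode(nodes, edges)
# Self-supervised objective: pixel reconstruction error.
# No graph annotations are needed anywhere in this loop.
loss = float(np.mean((recon - image) ** 2))
print(nodes.shape, edges.shape, loss)
```

Because the only training signal is the reconstruction loss, the graph structure emerges as whatever bottleneck best preserves the image, which is the sense in which the method is self-supervised.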

    Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training

    Generating images from graph-structured inputs, such as scene graphs, is uniquely challenging due to the difficulty of aligning nodes and connections in graphs with objects and their relations in images. Most existing methods address this challenge by using scene layouts, which are image-like representations of scene graphs designed to capture the coarse structure of scene images. Because scene layouts are manually crafted, their alignment with images may not be fully optimized, causing suboptimal compliance between the generated images and the original scene graphs. To tackle this issue, we propose to learn scene graph embeddings by directly optimizing their alignment with images. Specifically, we pre-train an encoder to extract both global and local information from scene graphs that is predictive of the corresponding images, relying on two loss functions: a masked autoencoding loss and a contrastive loss. The former trains embeddings by reconstructing randomly masked image regions, while the latter trains embeddings to discriminate between compliant and non-compliant images according to the scene graph. Given these embeddings, we build a latent diffusion model to generate images from scene graphs. The resulting method, called SGDiff, allows for the semantic manipulation of generated images by modifying scene graph nodes and connections. On the Visual Genome and COCO-Stuff datasets, we demonstrate that SGDiff outperforms state-of-the-art methods, as measured by both the Inception Score and Fréchet Inception Distance (FID) metrics. We will release our source code and trained models at https://github.com/YangLing0818/SGDiff.
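The contrastive objective described above, training graph embeddings to score their own (compliant) image higher than other images in the batch, is commonly realized as an InfoNCE-style loss. The following is a minimal numpy sketch of that idea; the embedding dimensions, temperature, and the way paired embeddings are simulated are all illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: B scene-graph embeddings and their paired
# image embeddings (compliant pairs are close by construction).
B, D = 4, 16
graph_emb = rng.normal(size=(B, D))
image_emb = graph_emb + 0.1 * rng.normal(size=(B, D))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(g, v, temperature=0.1):
    """InfoNCE-style loss: each graph embedding should match its own
    image (diagonal) better than the other, non-compliant images."""
    g, v = l2_normalize(g), l2_normalize(v)
    logits = g @ v.T / temperature               # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal (matched graph/image pairs).
    return float(-np.mean(np.diag(log_probs)))

matched = contrastive_loss(graph_emb, image_emb)
shuffled = contrastive_loss(graph_emb, image_emb[::-1])
print(matched, shuffled)
```

With compliant pairs on the diagonal, the loss for matched pairs comes out lower than for mismatched ones, which is exactly the discrimination signal the abstract describes.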