1,520 research outputs found

    Scene Graph Parsing as Dependency Parsing

    Get PDF
    In this paper, we study the problem of parsing structured knowledge graphs from textual descrip- tions. In particular, we consider the scene graph representation that considers objects together with their attributes and relations: this representation has been proved useful across a variety of vision and language applications. We begin by introducing an alternative but equivalent edge-centric view of scene graphs that connect to dependency parses. Together with a careful redesign of label and action space, we combine the two-stage pipeline used in prior work (generic dependency parsing followed by simple post-processing) into one, enabling end-to-end training. The scene graphs generated by our learned neural dependency parser achieve an F-score similarity of 49.67% to ground truth graphs on our evaluation set, surpassing best previous approaches by 5%. We further demonstrate the effective- ness of our learned parser on image retrieval applications.This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216

    Scene Graph Parsing as Dependency Parsing

    Get PDF
    In this paper, we study the problem of parsing structured knowledge graphs from textual descriptions. In particular, we consider the scene graph representation that considers objects together with their attributes and relations: this representation has been proved useful across a variety of vision and language applications. We begin by introducing an alternative but equivalent edge-centric view of scene graphs that connect to dependency parses. Together with a careful redesign of label and action space, we combine the two-stage pipeline used in prior work (generic dependency parsing followed by simple post-processing) into one, enabling end-to-end training. The scene graphs generated by our learned neural dependency parser achieve an F-score similarity of 49.67% to ground truth graphs on our evaluation set, surpassing best previous approaches by 5%. We further demonstrate the effectiveness of our learned parser on image retrieval applications.Comment: To appear in NAACL 2018 as oral. Code is available at https://github.com/Yusics/bist-parser/tree/sgparse

    Unpaired Image Captioning via Scene Graph Alignments

    Full text link
    Most of current image captioning models heavily rely on paired image-caption datasets. However, getting large scale image-caption paired data is labor-intensive and time-consuming. In this paper, we present a scene graph-based approach for unpaired image captioning. Our framework comprises an image scene graph generator, a sentence scene graph generator, a scene graph encoder, and a sentence decoder. Specifically, we first train the scene graph encoder and the sentence decoder on the text modality. To align the scene graphs between images and sentences, we propose an unsupervised feature alignment method that maps the scene graph features from the image to the sentence modality. Experimental results show that our proposed model can generate quite promising results without using any image-caption training pairs, outperforming existing methods by a wide margin.Comment: Accepted in ICCV 201

    FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing

    Full text link
    Textual scene graph parsing has become increasingly important in various vision-language applications, including image caption evaluation and image retrieval. However, existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors. First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness. Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations. To address these challenges, we propose a novel dataset, which involves re-annotating the captions in Visual Genome (VG) using a new intermediate representation called FACTUAL-MR. FACTUAL-MR can be directly converted into faithful and consistent scene graph annotations. Our experimental results clearly demonstrate that the parser trained on our dataset outperforms existing approaches in terms of faithfulness and consistency. This improvement leads to a significant performance boost in both image caption evaluation and zero-shot image retrieval tasks. Furthermore, we introduce a novel metric for measuring scene graph similarity, which, when combined with the improved scene graph parser, achieves state-of-the-art (SOTA) results on multiple benchmark datasets for the aforementioned tasks. The code and dataset are available at https://github.com/zhuang-li/FACTUAL .Comment: 9 pages, ACL 2023 (findings

    Auto-Encoding Scene Graphs for Image Captioning

    Full text link
    We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions. Intuitively, we humans use the inductive bias to compose collocations and contextual inference in discourse. For example, when we see the relation `person on bike', it is natural to replace `on' with `ride' and infer `person riding bike on a road' even the `road' is not evident. Therefore, exploiting such bias as a language prior is expected to help the conventional encoder-decoder models less likely overfit to the dataset bias and focus on reasoning. Specifically, we use the scene graph --- a directed graph (G\mathcal{G}) where an object node is connected by adjective nodes and relationship nodes --- to represent the complex structural layout of both image (I\mathcal{I}) and sentence (S\mathcal{S}). In the textual domain, we use SGAE to learn a dictionary (D\mathcal{D}) that helps to reconstruct sentences in the S→G→D→S\mathcal{S}\rightarrow \mathcal{G} \rightarrow \mathcal{D} \rightarrow \mathcal{S} pipeline, where D\mathcal{D} encodes the desired language prior; in the vision-language domain, we use the shared D\mathcal{D} to guide the encoder-decoder in the I→G→D→S\mathcal{I}\rightarrow \mathcal{G}\rightarrow \mathcal{D} \rightarrow \mathcal{S} pipeline. Thanks to the scene graph representation and shared dictionary, the inductive bias is transferred across domains in principle. We validate the effectiveness of SGAE on the challenging MS-COCO image captioning benchmark, e.g., our SGAE-based single-model achieves a new state-of-the-art 127.8127.8 CIDEr-D on the Karpathy split, and a competitive 125.5125.5 CIDEr-D (c40) on the official server even compared to other ensemble models
    • …
    corecore