Scene Graph Generation by Iterative Message Passing

Author: Choy Christopher B.
Fei-Fei Li
Xu Danfei
Zhu Yuke
Publication date: 12/04/2017

Understanding a visual scene goes beyond recognizing individual objects in isolation. Relationships between objects also constitute rich semantic information about the scene. In this work, we explicitly model the objects and their relationships using scene graphs, a visually-grounded graphical structure of an image. We propose a novel end-to-end model that generates such a structured scene representation from an input image. The model solves the scene graph inference problem using standard RNNs and learns to iteratively improve its predictions via message passing. Our joint inference model can take advantage of contextual cues to make better predictions on objects and their relationships. The experiments show that our model significantly outperforms previous methods at generating scene graphs on the Visual Genome dataset and at inferring support relations on the NYU Depth v2 dataset. Comment: CVPR 201
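As a rough illustration of what "iteratively improving predictions via message passing" can look like, the sketch below lets object (node) states and relationship (edge) states exchange averaged messages for a few rounds. It is not the authors' model: the abstract says the real model uses RNNs, whereas the fixed blending weight `mix`, the plain averaging, and all function and variable names here are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Vector = List[float]
NodeStates = Dict[str, Vector]
EdgeStates = Dict[Tuple[str, str], Vector]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def blend(state: Vector, message: Vector, mix: float) -> Vector:
    """Hypothetical update rule: a fixed blend stands in for a learned RNN update."""
    return [(1 - mix) * s + mix * m for s, m in zip(state, message)]


def message_passing(
    node_state: NodeStates,
    edge_state: EdgeStates,
    steps: int = 2,
    mix: float = 0.5,
) -> Tuple[NodeStates, EdgeStates]:
    """Refine node and edge states by exchanging messages for `steps` rounds."""
    nodes = {name: list(state) for name, state in node_state.items()}
    edges = {pair: list(state) for pair, state in edge_state.items()}
    for _ in range(steps):
        # Each node gathers messages from every edge it takes part in.
        new_nodes = {}
        for name, state in nodes.items():
            incident = [e for (subj, obj), e in edges.items() if name in (subj, obj)]
            message = mean(incident) if incident else state
            new_nodes[name] = blend(state, message, mix)
        # Each edge gathers messages from its subject and object nodes.
        new_edges = {}
        for (subj, obj), state in edges.items():
            message = mean([nodes[subj], nodes[obj]])
            new_edges[(subj, obj)] = blend(state, message, mix)
        # Both updates read the previous round's states, then swap in together.
        nodes, edges = new_nodes, new_edges
    return nodes, edges


# Toy input: two objects and one candidate relationship between them.
nodes0 = {"person": [1.0, 0.0], "bike": [0.0, 1.0]}
edges0 = {("person", "bike"): [0.0, 0.0]}
nodes1, edges1 = message_passing(nodes0, edges0, steps=1)
```

After one round, the edge state has moved toward the average of its two endpoint states, and each node state has moved toward its incident edge, which is the sense in which context from neighbours refines each prediction.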

arXiv.org e-Print Archive

Crossref

Auto-Encoding Scene Graphs for Image Captioning

Author: Cai Jianfei
Tang Kaihua
Yang Xu
Zhang Hanwang
Publication date: 10/12/2018

We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions. Intuitively, we humans use the inductive bias to compose collocations and contextual inference in discourse. For example, when we see the relation `person on bike', it is natural to replace `on' with `ride' and infer `person riding bike on a road' even if the `road' is not evident. Therefore, exploiting such bias as a language prior is expected to make conventional encoder-decoder models less likely to overfit to the dataset bias and more focused on reasoning. Specifically, we use the scene graph, a directed graph ($\mathcal{G}$) where an object node is connected by adjective nodes and relationship nodes, to represent the complex structural layout of both image ($\mathcal{I}$) and sentence ($\mathcal{S}$). In the textual domain, we use SGAE to learn a dictionary ($\mathcal{D}$) that helps to reconstruct sentences in the $\mathcal{S}\rightarrow \mathcal{G} \rightarrow \mathcal{D} \rightarrow \mathcal{S}$ pipeline, where $\mathcal{D}$ encodes the desired language prior; in the vision-language domain, we use the shared $\mathcal{D}$ to guide the encoder-decoder in the $\mathcal{I}\rightarrow \mathcal{G}\rightarrow \mathcal{D} \rightarrow \mathcal{S}$ pipeline. Thanks to the scene graph representation and shared dictionary, the inductive bias is transferred across domains in principle. We validate the effectiveness of SGAE on the challenging MS-COCO image captioning benchmark, e.g., our SGAE-based single model achieves a new state-of-the-art 127.8 CIDEr-D on the Karpathy split, and a competitive 125.5 CIDEr-D (c40) on the official server even compared to other ensemble models.
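The scene graph described in this abstract, a directed graph in which object nodes are connected to adjective and relationship nodes, can be pictured with a small data structure. The sketch below is only an illustration of that idea: the class name `SceneGraph`, its methods, and the attribute `red` are hypothetical and do not come from the paper; only the `person on bike` relation is taken from the abstract.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class SceneGraph:
    """Toy scene graph: objects, their attributes, and directed relationships."""

    objects: List[str] = field(default_factory=list)
    # Each attribute entry is (object, attribute), e.g. ("bike", "red").
    attributes: List[Tuple[str, str]] = field(default_factory=list)
    # Each relationship entry is (subject, predicate, object).
    relationships: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_object(self, name: str, attributes: Iterable[str] = ()) -> None:
        """Register an object node once and attach any attribute nodes to it."""
        if name not in self.objects:
            self.objects.append(name)
        for attribute in attributes:
            self.attributes.append((name, attribute))

    def relate(self, subject: str, predicate: str, obj: str) -> None:
        """Add a directed subject -> predicate -> object relationship."""
        self.add_object(subject)
        self.add_object(obj)
        self.relationships.append((subject, predicate, obj))

    def triples(self) -> List[str]:
        """Render each relationship as a short phrase."""
        return [f"{s} {p} {o}" for s, p, o in self.relationships]


graph = SceneGraph()
graph.add_object("bike", attributes=["red"])  # "red" is an invented attribute
graph.relate("person", "on", "bike")          # relation quoted in the abstract
print(graph.triples())
```

The same structure can hold a graph parsed from a sentence or one detected in an image, which is what lets a shared representation sit between the two domains.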

arXiv.org e-Print Archive

Crossref

Monash University Research Portal

Unpaired Image Captioning via Scene Graph Alignments

Author: Cai Jianfei
Gu Jiuxiang
Joty Shafiq
Wang Gang
Yang Xu
Zhao Handong
Publication date: 01/01/2019

Most current image captioning models rely heavily on paired image-caption datasets. However, collecting large-scale paired image-caption data is labor-intensive and time-consuming. In this paper, we present a scene graph-based approach for unpaired image captioning. Our framework comprises an image scene graph generator, a sentence scene graph generator, a scene graph encoder, and a sentence decoder. Specifically, we first train the scene graph encoder and the sentence decoder on the text modality. To align the scene graphs between images and sentences, we propose an unsupervised feature alignment method that maps the scene graph features from the image modality to the sentence modality. Experimental results show that our proposed model can generate promising results without using any image-caption training pairs, outperforming existing methods by a wide margin. Comment: Accepted in ICCV 201

arXiv.org e-Print Archive

Crossref

Monash University Research Portal