We propose the Scene Graph Auto-Encoder (SGAE), which incorporates the language
inductive bias into the encoder-decoder image captioning framework for more
human-like captions. Intuitively, humans use this inductive bias to form
collocations and make contextual inferences in discourse. For example, when we see
the relation `person on bike', it is natural to replace `on' with `ride' and
infer `person riding bike on a road' even though the `road' is not evident. Therefore,
exploiting such bias as a language prior is expected to make the conventional
encoder-decoder models less prone to overfitting the dataset bias and more
focused on reasoning. Specifically, we use the scene graph --- a directed graph
($\mathcal{G}$) where an object node is connected by adjective nodes and
relationship nodes --- to represent the complex structural layout of both image
($\mathcal{I}$) and sentence ($\mathcal{S}$).
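To make this representation concrete, here is a minimal Python sketch of such a graph; the class and field names are illustrative assumptions, not the paper's implementation:

```python
# A scene graph as a directed graph: object nodes carry adjective
# (attribute) nodes, and relationship nodes connect a subject object
# to an object object, e.g. person -ride-> bike.
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    name: str                                        # e.g. "person"
    attributes: list = field(default_factory=list)   # adjective nodes

@dataclass
class RelationNode:
    subject: ObjectNode    # head of the directed edge
    predicate: str         # relationship label, e.g. "ride"
    obj: ObjectNode        # tail of the directed edge ("obj" avoids the builtin)

# Graph for the example "person riding bike on a road"
person = ObjectNode("person")
bike = ObjectNode("bike")
road = ObjectNode("road")
graph = [
    RelationNode(person, "ride", bike),
    RelationNode(bike, "on", road),
]

for r in graph:
    print(f"{r.subject.name} -{r.predicate}-> {r.obj.name}")
```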
In the textual domain, we use
SGAE to learn a dictionary ($\mathcal{D}$) that helps to reconstruct sentences
in the $\mathcal{S} \rightarrow \mathcal{G} \rightarrow \mathcal{D} \rightarrow \mathcal{S}$
pipeline, where $\mathcal{D}$ encodes the desired language prior;
in the vision-language domain, we use the shared $\mathcal{D}$ to guide the
encoder-decoder in the $\mathcal{I} \rightarrow \mathcal{G} \rightarrow \mathcal{D} \rightarrow \mathcal{S}$
pipeline. Thanks to the scene graph
representation and shared dictionary, the inductive bias is transferred across
domains in principle.
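The abstract does not give the re-encoding function itself; the numpy sketch below shows one plausible reading of how a shared dictionary could re-encode a scene-graph feature via attention over its entries, with all sizes and variable names chosen purely for illustration:

```python
# Re-encode a feature vector x by soft attention over the rows of a
# learned dictionary D, so the output x_hat is a mixture of dictionary
# entries and thus carries the language prior stored in D.
import numpy as np

rng = np.random.default_rng(0)
d, K = 128, 1000               # feature size and number of dictionary entries
D = rng.normal(size=(K, d))    # dictionary, learned during S->G->D->S training
x = rng.normal(size=d)         # a scene-graph node embedding (from S or I)

scores = D @ x                           # similarity to every dictionary entry
weights = np.exp(scores - scores.max())  # numerically stable softmax...
weights /= weights.sum()                 # ...giving attention weights over entries
x_hat = weights @ D                      # re-encoded feature

print(x_hat.shape)  # (128,)
```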
We validate the effectiveness of SGAE on the challenging
MS-COCO image captioning benchmark, e.g., our SGAE-based single model achieves
a new state-of-the-art 127.8 CIDEr-D on the Karpathy split, and a competitive
125.5 CIDEr-D (c40) on the official server even compared to other ensemble
models.