Target-Tailored Source-Transformation for Scene Graph Generation
Scene graph generation aims to provide a semantic and structural description
of an image, denoting the objects (with nodes) and their relationships (with
edges). The best performing works to date are based on exploiting the context
surrounding objects or relations, e.g., by passing information among objects. In
these approaches, transforming the representation of source objects is a
critical step in extracting information for use by target objects. In
this work, we argue that a source object should give what the target object needs
and give different objects different information rather than contributing
common information to all targets. To achieve this goal, we propose a
Target-Tailored Source-Transformation (TTST) method to efficiently propagate
information among object proposals and relations. Particularly, for a source
object proposal which will contribute information to other target objects, we
transform the source object feature to the target object feature domain by
simultaneously taking both the source and target into account. We further
explore more powerful representations by integrating language prior with the
visual context in the transformation for scene graph generation. By doing
so, the target object is able to extract target-specific information from the
source object and source relation accordingly to refine its representation. Our
framework is validated on the Visual Genome benchmark and demonstrates
state-of-the-art performance for scene graph generation. The experimental
results show that the performance of object detection and visual relationship
detection is mutually improved by our method.
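As a rough illustration of the idea that one source should send different information to different targets, the sketch below gates a source feature by a function of both source and target. The element-wise sigmoid gate and the function name `tailored_message` are assumptions made for this example; the abstract does not specify the actual TTST transformation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tailored_message(source, target):
    """Hypothetical target-conditioned message: each source dimension is
    scaled by a gate computed from the source and the target together."""
    gates = [sigmoid(s * t) for s, t in zip(source, target)]
    return [g * s for g, s in zip(gates, source)]

# Hand-picked toy features, for illustration only.
source = [2.0, -1.0]
target_a = [0.0, 0.0]
target_b = [1.0, 1.0]

msg_a = tailored_message(source, target_a)
msg_b = tailored_message(source, target_b)
```

The same source vector yields a different message for each target, whereas a target-agnostic transformation would send identical information to both.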
Context-Dependent Diffusion Network for Visual Relationship Detection
Visual relationship detection can bridge the gap between computer vision and
natural language for scene understanding of images. Different from pure object
recognition tasks, the relation triplets of subject-predicate-object lie in an
extremely diverse space, such as person-behind-person and
car-behind-building, while suffering from the problem of combinatorial
explosion. In this paper, we propose a context-dependent diffusion network
(CDDN) framework to deal with visual relationship detection. To capture the
interactions of different object instances, two types of graphs, word semantic
graph and visual scene graph, are constructed to encode global context
interdependency. The semantic graph is built through language priors to model
semantic correlations across objects, whilst the visual scene graph defines the
connections of scene objects so as to utilize the surrounding scene
information. For the graph-structured data, we design a diffusion network to
adaptively aggregate information from contexts, which can effectively learn
latent representations of visual relationships and is well suited to visual
relationship detection owing to its invariance to graph isomorphism.
Experiments on two widely-used datasets demonstrate that our proposed method is
more effective and achieves state-of-the-art performance.
Comment: 8 pages, 3 figures, 2018 ACM Multimedia Conference (MM'18)
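A minimal sketch of what aggregating information from graph context can look like: each node mixes its own feature with the mean of its neighbours' features. The mean aggregator, the mixing weight `alpha`, and the function name `diffusion_step` are assumptions for illustration; the abstract does not describe the actual CDDN layers.

```python
def diffusion_step(features, adjacency, alpha=0.5):
    """One hypothetical diffusion step: blend each node's feature with the
    mean feature of its neighbours (nodes without neighbours are unchanged)."""
    updated = []
    for i, own in enumerate(features):
        neighbours = [features[j] for j, linked in enumerate(adjacency[i])
                      if linked and j != i]
        if not neighbours:
            updated.append(list(own))
            continue
        mean = [sum(col) / len(neighbours) for col in zip(*neighbours)]
        updated.append([(1 - alpha) * o + alpha * m for o, m in zip(own, mean)])
    return updated

# Toy 3-node chain graph (0 - 1 - 2) with hand-picked features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
step1 = diffusion_step(features, adjacency)
```

Because the update depends only on which nodes are connected, relabelling the nodes simply relabels the outputs, which is the kind of property the abstract appeals to when it mentions invariance on graphs.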
Auto-Encoding Scene Graphs for Image Captioning
We propose the Scene Graph Auto-Encoder (SGAE), which incorporates the language
inductive bias into the encoder-decoder image captioning framework for more
human-like captions. Intuitively, we humans use the inductive bias to compose
collocations and contextual inference in discourse. For example, when we see
the relation `person on bike', it is natural to replace `on' with `ride' and
infer `person riding bike on a road' even if the `road' is not evident. Therefore,
exploiting such bias as a language prior is expected to make conventional
encoder-decoder models less likely to overfit to the dataset bias and more likely to focus on
reasoning. Specifically, we use the scene graph, a directed graph
(G) where an object node is connected by adjective nodes and
relationship nodes, to represent the complex structural layout of both image
(I) and sentence (S). In the textual domain, we use
SGAE to learn a dictionary (D) that helps to reconstruct sentences
in the S → G → D → S pipeline, where D encodes the desired language prior;
in the vision-language domain, we use the shared D to guide the
encoder-decoder in the I → G → D → S pipeline. Thanks to the scene graph
representation and shared dictionary, the inductive bias is transferred across
domains in principle. We validate the effectiveness of SGAE on the challenging
MS-COCO image captioning benchmark, e.g., our SGAE-based single model achieves
a new state-of-the-art CIDEr-D on the Karpathy split, and a competitive
CIDEr-D (c40) on the official server even compared to other ensemble
models.
TransNFCM: Translation-Based Neural Fashion Compatibility Modeling
Identifying mix-and-match relationships between fashion items is an urgent
task in a fashion e-commerce recommender system. It will significantly enhance
user experience and satisfaction. However, due to the challenges of inferring
the rich yet complicated set of compatibility patterns in a large e-commerce
corpus of fashion items, this task is still underexplored. Inspired by the
recent advances in multi-relational knowledge representation learning and deep
neural networks, this paper proposes a novel Translation-based Neural Fashion
Compatibility Modeling (TransNFCM) framework, which jointly optimizes fashion
item embeddings and category-specific complementary relations in a unified
space in an end-to-end manner. TransNFCM places items in a unified
embedding space where a category-specific relation (category-comp-category) is
modeled as a vector translation operating on the embeddings of compatible items
from the corresponding categories. In this way, we not only capture the
specific notion of compatibility conditioned on a specific pair of
complementary categories, but also preserve the global notion of compatibility.
We also design a deep fashion item encoder which exploits the complementary
characteristic of visual and textual features to represent the fashion
products. To the best of our knowledge, this is the first work that uses
category-specific complementary relations to model the category-aware
compatibility between items in a translation-based embedding space. Extensive
experiments demonstrate the effectiveness of TransNFCM over
state-of-the-art methods on two real-world datasets.
Comment: Accepted at the AAAI 2019 conference
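To make the "relation as a vector translation" idea concrete, the sketch below scores a pair of items by how close the first item's embedding, shifted by a category-pair relation vector, lands to the second item's embedding. The L2 distance, the 2-D hand-picked embeddings, and all names here are assumptions for illustration; the abstract does not state the distance function or training objective used by TransNFCM.

```python
import math

def translation_distance(head, relation, tail):
    """L2 distance between (head + relation) and tail.
    Under this assumed scoring rule, a smaller value means more compatible."""
    return math.sqrt(sum((h + r - t) ** 2
                         for h, r, t in zip(head, relation, tail)))

# Hypothetical embeddings, chosen by hand rather than learned.
top = [1.0, 0.0]
matching_skirt = [1.0, 1.0]
clashing_skirt = [4.0, 5.0]
top_comp_bottom = [0.0, 1.0]  # relation vector for the (top, bottom) category pair

d_match = translation_distance(top, top_comp_bottom, matching_skirt)
d_clash = translation_distance(top, top_comp_bottom, clashing_skirt)
```

Here `d_match` is 0.0 and `d_clash` is 5.0, so the first skirt would be ranked as the more compatible item for this top under the assumed rule. Using a separate relation vector per category pair is what lets the same embedding space express different notions of compatibility for, say, top-bottom versus top-shoes.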