Towards efficient deep neural networks with applications to visual recognition
The thesis focuses on the following two topics: designing energy-efficient neural
networks and hashing approaches to make deep learning more feasible for real applications;
and deep convolutional neural networks for visual recognition.
Thesis (Ph.D.) (Research by Publication) -- University of Adelaide, School of Computer Science, 201
Context-Dependent Diffusion Network for Visual Relationship Detection
Visual relationship detection can bridge the gap between computer vision and
natural language for scene understanding of images. Different from pure object
recognition tasks, the relation triplets of subject-predicate-object lie on an
extreme diversity space, such as \textit{person-behind-person} and
\textit{car-behind-building}, while suffering from the problem of combinatorial
explosion. In this paper, we propose a context-dependent diffusion network
(CDDN) framework to deal with visual relationship detection. To capture the
interactions of different object instances, two types of graphs, word semantic
graph and visual scene graph, are constructed to encode global context
interdependency. The semantic graph is built through language priors to model
semantic correlations across objects, whilst the visual scene graph defines the
connections of scene objects so as to utilize the surrounding scene
information. For the graph-structured data, we design a diffusion network to
adaptively aggregate information from contexts, which can effectively learn
latent representations of visual relationships and well cater to visual
relationship detection in view of its isomorphic invariance to graphs.
Experiments on two widely-used datasets demonstrate that our proposed method is
more effective and achieves state-of-the-art performance.
Comment: 8 pages, 3 figures, 2018 ACM Multimedia Conference (MM'18)
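The core mechanism the abstract describes — adaptively aggregating context by diffusing information over a graph — can be illustrated with a minimal feature-propagation sketch. This is a generic graph-diffusion toy example, not the paper's actual CDDN architecture; the adjacency matrix, feature dimensions, and mixing weight `alpha` are all hypothetical:

```python
import numpy as np

def diffuse(features, adjacency, steps=2, alpha=0.5):
    """Propagate node features over a graph by repeated neighborhood
    averaging, mixing the diffused signal with the original features.

    features : (N, D) node feature matrix
    adjacency: (N, N) symmetric 0/1 adjacency matrix
    alpha    : weight of the diffused signal vs. the original features
    """
    # Row-normalize the adjacency so each node averages over its neighbors.
    deg = adjacency.sum(axis=1, keepdims=True)
    norm_adj = adjacency / np.maximum(deg, 1)

    h = features
    for _ in range(steps):
        h = alpha * (norm_adj @ h) + (1 - alpha) * features
    return h

# Toy scene graph with 3 objects (e.g. person -- car -- building).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3)  # one-hot stand-in "features" for each object
H = diffuse(X, A)
print(H.shape)  # (3, 3)
```

Because each diffusion step is a convex combination of row-stochastic updates, each node's output is a weighted mixture of the features of the nodes reachable from it — the "context aggregation" the abstract refers to.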
Large-Scale Visual Relationship Understanding
Large-scale visual understanding is challenging, as it requires a model to
handle the widely-spread and imbalanced distribution of <subject, relation,
object> triples. In real-world scenarios with large numbers of objects and
relations, some are seen very commonly while others are barely seen. We develop
a new relationship detection model that embeds objects and relations into two
vector spaces where both discriminative capability and semantic affinity are
preserved. We learn both a visual and a semantic module that map features from
the two modalities into a shared space, where matched feature pairs must
discriminate against unmatched ones while also maintaining close distances to
semantically similar ones. Benefiting from this, our model can achieve superior
performance even when the visual entity categories scale up to more than
80,000, with extremely skewed class distribution. We demonstrate the efficacy
of our model on a large and imbalanced benchmark based on Visual Genome that
comprises 53,000+ objects and 29,000+ relations, a scale at which no previous
work has been evaluated. We show the superiority of our model over
carefully designed baselines on the original Visual Genome dataset with 80,000+
categories. We also show state-of-the-art performance on the VRD dataset and
on the scene graph dataset, a subset of Visual Genome with 200 categories.
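The shared-embedding idea in the abstract — matched visual/semantic pairs should score higher than unmatched ones — is commonly trained with a margin-based ranking loss. The sketch below is a generic illustration of that idea, not the paper's actual model; the embedding dimensions, margin value, and toy data are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # Project embeddings onto the unit sphere for cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_loss(visual, pos_sem, neg_sem, margin=0.2):
    """Hinge loss pushing each visual embedding closer to its matched
    semantic embedding than to an unmatched one, by at least `margin`."""
    v, p, n = normalize(visual), normalize(pos_sem), normalize(neg_sem)
    pos_sim = (v * p).sum(axis=-1)  # cosine similarity to the match
    neg_sim = (v * n).sum(axis=-1)  # cosine similarity to a mismatch
    return np.maximum(0.0, margin + neg_sim - pos_sim).mean()

# Toy embeddings: visual features lie close to their matched semantics.
sem = rng.normal(size=(4, 8))
vis = sem + 0.05 * rng.normal(size=(4, 8))
neg = rng.normal(size=(4, 8))

loss_matched = triplet_loss(vis, sem, neg)   # correct pairing
loss_swapped = triplet_loss(vis, neg, sem)   # deliberately mismatched
print(loss_matched, loss_swapped)
```

With correct pairings the hinge is (nearly) inactive, so the loss is far smaller than under the swapped pairing — the property that lets such a space stay discriminative even as the number of categories grows very large.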