Detecting Visual Relationships with Deep Relational Networks
Relationships among objects play a crucial role in image understanding.
Despite the great success of deep learning techniques in recognizing individual
objects, reasoning about the relationships among objects remains a challenging
task. Previous methods often treat this as a classification problem,
considering each type of relationship (e.g. "ride") or each distinct visual
phrase (e.g. "person-ride-horse") as a category. Such approaches are faced with
significant difficulties caused by the high diversity of visual appearance for
each kind of relationship or the large number of distinct visual phrases. We
propose an integrated framework to tackle this problem. At the heart of this
framework is the Deep Relational Network, a novel formulation designed
specifically for exploiting the statistical dependencies between objects and
their relationships. On two large datasets, the proposed method achieves
substantial improvement over the state of the art.
Comment: To appear in CVPR 2017 as an oral paper.
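The core idea of exploiting statistical dependencies between objects and their relationships can be illustrated with a minimal sketch: fuse appearance-only predicate scores with a co-occurrence prior conditioned on the subject and object classes. This is only the intuition, not the paper's DR-Net (which unrolls joint inference into network layers); the function name and toy numbers are hypothetical.

```python
import numpy as np

def refine_predicate_scores(pred_scores, cooccurrence_prior):
    """Fuse appearance-based predicate probabilities with a statistical
    co-occurrence prior P(predicate | subject class, object class),
    then renormalize to a distribution.

    pred_scores: (P,) appearance-only predicate probabilities
    cooccurrence_prior: (P,) prior over predicates for this object pair
    """
    fused = pred_scores * cooccurrence_prior
    return fused / fused.sum()

# Toy example: "ride" and "sit on" look similar in isolation,
# but the (person, horse) prior resolves the ambiguity.
scores = np.array([0.4, 0.4, 0.2])   # ride, sit_on, feed
prior = np.array([0.7, 0.2, 0.1])    # P(pred | person, horse), hypothetical
refined = refine_predicate_scores(scores, prior)
```

Here the prior breaks the tie between the two visually ambiguous predicates, which is the kind of dependency a purely per-category classifier cannot express.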
A Deep-structured Conditional Random Field Model for Object Silhouette Tracking
In this work, we introduce a deep-structured conditional random field
(DS-CRF) model for the purpose of state-based object silhouette tracking. The
proposed DS-CRF model consists of a series of state layers, where each state
layer spatially characterizes the object silhouette at a particular point in
time. The interactions between adjacent state layers are established by
inter-layer connectivity dynamically determined based on inter-frame optical
flow. By incorporating both spatial and temporal context in a dynamic fashion
within such a deep-structured probabilistic graphical model, the proposed
DS-CRF model allows us to develop a framework that can accurately and
efficiently track object silhouettes that can change greatly over time, as well
as under different situations such as occlusion and multiple targets within the
scene. Experimental results using video surveillance datasets containing
different scenarios such as occlusion and multiple targets showed that the
proposed DS-CRF approach provides strong object silhouette tracking performance
when compared to baseline methods such as mean-shift tracking, as well as
state-of-the-art methods such as context tracking and boosted particle
filtering.
Comment: 17 pages.
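The inter-layer connectivity described above, determined by inter-frame optical flow, can be sketched as warping the previous state layer's silhouette along the flow field to form a temporal prior for the current frame. This is a simplified illustration of the connectivity step only, not the full DS-CRF inference; the function name is hypothetical.

```python
import numpy as np

def propagate_silhouette(prev_mask, flow):
    """Warp the silhouette mask from time t-1 along per-pixel optical
    flow to produce a temporal prior for the state layer at time t.

    prev_mask: (H, W) binary silhouette at frame t-1
    flow: (H, W, 2) per-pixel (dy, dx) displacement from t-1 to t
    """
    H, W = prev_mask.shape
    prior = np.zeros_like(prev_mask)
    ys, xs = np.nonzero(prev_mask)
    for y, x in zip(ys, xs):
        dy, dx = flow[y, x]
        ny, nx = int(round(y + dy)), int(round(x + dx))
        if 0 <= ny < H and 0 <= nx < W:
            prior[ny, nx] = 1  # connect this node to the new location
    return prior

# A single silhouette pixel at (1, 1), uniform downward flow of 1 pixel.
mask = np.zeros((4, 4), dtype=int)
mask[1, 1] = 1
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0  # dy = 1 everywhere
prior = propagate_silhouette(mask, flow)
```

Because the connections follow the flow rather than a fixed neighborhood, the temporal links adapt to large silhouette motion between frames.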
Learning Action Maps of Large Environments via First-Person Vision
When people observe and interact with physical spaces, they are able to
associate functionality to regions in the environment. Our goal is to automate
dense functional understanding of large spaces by leveraging sparse activity
demonstrations recorded from an ego-centric viewpoint. The method we describe
enables functionality estimation in large scenes where people have behaved, as
well as novel scenes where no behaviors are observed. Our method learns and
predicts "Action Maps", which encode the ability for a user to perform
activities at various locations. By using an egocentric camera to
observe human activities, our method scales with the size of the scene without
the need for mounting multiple static surveillance cameras and is well-suited
to the task of observing activities up-close. We demonstrate that by capturing
appearance-based attributes of the environment and associating these attributes
with activity demonstrations, our proposed mathematical framework allows for
the prediction of Action Maps in new environments. Additionally, we offer a
preliminary glance of the applicability of Action Maps by demonstrating a
proof-of-concept application in which they are used in concert with activity
detections to perform localization.
Comment: To appear at CVPR 201
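The step of associating appearance-based attributes with sparse activity demonstrations, then predicting functionality densely, can be sketched as a regularized regression from appearance features to an affordance score. This is a minimal stand-in for the paper's actual framework; the ridge-regression choice, function name, and toy features are assumptions for illustration.

```python
import numpy as np

def fit_action_map(feats_demo, labels_demo, feats_scene, lam=1e-2):
    """Ridge-regress an activity-affordance score from appearance
    features at sparsely demonstrated locations, then predict it
    densely over every location in a (possibly novel) scene.

    feats_demo: (N, D) features where demonstrations were observed
    labels_demo: (N,) 1 if the activity was performed there, else 0
    feats_scene: (M, D) features at every scene location
    """
    D = feats_demo.shape[1]
    # Closed-form ridge solution: (F^T F + lam I) w = F^T y
    w = np.linalg.solve(feats_demo.T @ feats_demo + lam * np.eye(D),
                        feats_demo.T @ labels_demo)
    return feats_scene @ w

# Two demonstrated locations with distinct appearance; predict on a
# scene containing both appearance types.
feats_demo = np.array([[1.0, 0.0], [0.0, 1.0]])
labels_demo = np.array([1.0, 0.0])   # activity seen only at location 0
scores = fit_action_map(feats_demo, labels_demo, feats_demo)
```

Because the mapping is from appearance to affordance rather than from location to affordance, the same weights transfer to scenes where no behavior was ever observed, which is the property the abstract emphasizes.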
Target-Tailored Source-Transformation for Scene Graph Generation
Scene graph generation aims to provide a semantic and structural description
of an image, denoting the objects (with nodes) and their relationships (with
edges). The best performing works to date are based on exploiting the context
surrounding objects or relations, e.g., by passing information among objects. In
these approaches, transforming the representation of source objects is a
critical step in extracting information for use by target objects. In this
work, we argue that a source object should give each target object what it
needs, providing different information to different targets rather than
contributing common information to all. To achieve this goal, we propose a
Target-Tailored Source-Transformation (TTST) method to efficiently propagate
information among object proposals and relations. Particularly, for a source
object proposal which will contribute information to other target objects, we
transform the source object feature to the target object feature domain by
simultaneously taking both the source and target into account. We further
explore more powerful representations by integrating language prior with the
visual context in the transformation for the scene graph generation. By doing
so the target object is able to extract target-specific information from the
source object and source relation accordingly to refine its representation. Our
framework is validated on the Visual Genome benchmark and demonstrates
state-of-the-art performance for scene graph generation. The experimental
results show that object detection and visual relationship detection are
mutually improved by our method.
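The target-tailored transformation described above, where the message a source object sends depends on the target as well as the source, can be sketched with a target-conditioned gate on the transformed source feature. This is only the conditioning idea, not the paper's TTST architecture; the gating form and all weight matrices are hypothetical.

```python
import numpy as np

def target_tailored_message(src_feat, tgt_feat, W_s, W_t):
    """Compute the message a source object sends to one target.

    A target-dependent sigmoid gate modulates the transformed source
    feature, so different targets extract different information from
    the same source.
    """
    gate = 1.0 / (1.0 + np.exp(-(W_t @ tgt_feat)))  # in (0, 1), per-dim
    return gate * (W_s @ src_feat)

# One source, two different targets: the messages differ even though
# the source feature is identical.
src = np.array([1.0, 1.0])
W_s = np.eye(2)
W_t = np.eye(2)
msg_a = target_tailored_message(src, np.array([2.0, -2.0]), W_s, W_t)
msg_b = target_tailored_message(src, np.array([-2.0, 2.0]), W_s, W_t)
```

Contrast this with untailored message passing, where every target would receive the same `W_s @ src` regardless of its own state.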