Harvesting Information from Captions for Weakly Supervised Semantic Segmentation
Since acquiring pixel-wise annotations for training convolutional neural
networks for semantic image segmentation is time-consuming, weakly supervised
approaches that only require class tags have been proposed. In this work, we
propose another form of supervision, namely image captions as they can be found
on the Internet. These captions have two advantages. Unlike the clean class
tags used by current weakly supervised approaches, they do not require
additional curation, and they provide textual context for the classes
present in an image. To leverage such textual context, we deploy a multi-modal
network that learns a joint embedding of the visual representation of the image
and the textual representation of the caption. The network estimates text
activation maps (TAMs) for class names as well as compound concepts, i.e.
combinations of nouns and their attributes. The TAMs of compound concepts
describing classes of interest substantially improve the quality of the
estimated class activation maps which are then used to train a network for
semantic segmentation. We evaluate our method on the COCO dataset, where it
achieves state-of-the-art results for weakly supervised image segmentation.
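The idea of a text activation map can be illustrated with a minimal sketch. It assumes that, once image and caption share a joint embedding space, a TAM can be read off as the similarity between one text embedding and the visual feature at every spatial location. The abstract does not state how the map is actually computed, so the cosine scoring, the function names `cosine` and `text_activation_map`, and the toy numbers are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def text_activation_map(feature_grid, text_embedding):
    """Score each location of an H x W grid of feature vectors against one text embedding.

    Assumption: visual features and the text embedding already live in the same space.
    """
    return [[cosine(cell, text_embedding) for cell in row] for row in feature_grid]

# Toy 2 x 2 grid of 2-dimensional visual features (made-up values).
grid = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, 0.0]],
]
concept = [1.0, 0.0]  # hypothetical embedding of a class name or compound concept
tam = text_activation_map(grid, concept)
```

In this toy example the top-left location matches the concept exactly and the bottom-right location points in the opposite direction, so the map ranges from 1.0 down to -1.0.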
Referring Expression Comprehension: A Survey of Methods and Datasets
Referring expression comprehension (REC) aims to localize a target object in
an image described by a referring expression phrased in natural language.
Unlike the object detection task, in which the queried object labels are
pre-defined, REC only observes its queries at test time. It is thus more
challenging than a conventional computer vision problem. This task
has attracted a lot of attention from both the computer vision and natural language
processing communities, and several lines of work have been proposed, from
CNN-RNN models and modular networks to complex graph-based models. In this survey, we
first examine the state of the art by comparing modern approaches to the
problem. We classify methods by the mechanism they use to encode the visual and
textual modalities. In particular, we examine the common approach of jointly
embedding images and expressions into a common feature space. We also discuss
modular architectures and graph-based models that interface with structured
graph representations. In the second part of this survey, we review the datasets
available for training and evaluating REC systems. We then group results
according to datasets, backbone models, and settings so that they can be fairly
compared. Finally, we discuss promising future directions for the field, in
particular compositional referring expression comprehension, which requires
longer reasoning chains to address. Comment: Accepted to IEEE TM
Knowledge-guided Pairwise Reconstruction Network for Weakly Supervised Referring Expression Grounding
Weakly supervised referring expression grounding (REG) aims at localizing the
referential entity in an image according to a linguistic query, where the mapping
between the image region (proposal) and the query is unknown in the training
stage. In referring expressions, people usually describe a target entity in
terms of its relationship with other contextual entities as well as visual
attributes. However, previous weakly supervised REG methods rarely pay
attention to the relationship between the entities. In this paper, we propose a
knowledge-guided pairwise reconstruction network (KPRN), which models the
relationship between the target entity (subject) and the contextual entity (object)
and grounds both entities. Specifically, we first design a
knowledge extraction module to guide the proposal selection of subject and
object. The prior knowledge takes the form of semantic
similarities between each proposal and the subject/object. Second, guided by
such knowledge, we design subject and object attention modules to construct
the subject-object proposal pairs. The subject attention excludes the unrelated
proposals from the candidate proposals. The object attention selects the most
suitable proposal as the contextual proposal. Third, we introduce a pairwise
attention and an adaptive weighting scheme to learn the correspondence between
these proposal pairs and the query. Finally, a pairwise reconstruction module
is used to measure the grounding for weakly supervised learning. Extensive
experiments on four large-scale datasets show that our method outperforms existing
state-of-the-art methods by a large margin. Comment: Accepted by ACMMM 2019. arXiv admin note: text overlap with
arXiv:1908.1056
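The pair-construction step described in the abstract above can be sketched roughly as follows. This is not the KPRN implementation: a hard threshold and an argmax stand in for the subject and object attention, and the function name `build_pairs`, the `threshold` parameter, the rule that a proposal is never paired with itself, and the similarity values are all assumptions made for illustration.

```python
def build_pairs(subject_sim, object_sim, threshold=0.5):
    """Pair each plausible subject proposal with one contextual (object) proposal.

    subject_sim[i] and object_sim[i] are the semantic similarities of proposal i
    to the subject and object of the query. The threshold is an assumed value.
    """
    # Stand-in for subject attention: exclude proposals unrelated to the subject.
    candidates = [i for i, s in enumerate(subject_sim) if s >= threshold]
    # Stand-in for object attention: keep the single most suitable contextual proposal.
    context = max(range(len(object_sim)), key=lambda i: object_sim[i])
    # Assumption: a proposal is not paired with itself.
    return [(i, context) for i in candidates if i != context]

# Made-up similarities for four region proposals.
subject_sim = [0.9, 0.2, 0.7, 0.1]
object_sim = [0.1, 0.8, 0.3, 0.2]
pairs = build_pairs(subject_sim, object_sim)  # [(0, 1), (2, 1)]
```

Each resulting pair would then be scored against the query. The abstract does this with a pairwise attention, an adaptive weighting scheme, and a reconstruction module, none of which are modeled here.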