Conditional Image-Text Embedding Networks
This paper presents an approach for grounding phrases in images which jointly
learns multiple text-conditioned embeddings in a single end-to-end model. In
order to differentiate text phrases into semantically distinct subspaces, we
propose a concept weight branch that automatically assigns phrases to
embeddings, whereas prior works predefine such assignments. Our proposed
solution simplifies the representation requirements for individual embeddings
and allows the underrepresented concepts to take advantage of the shared
representations before feeding them into concept-specific layers. Comprehensive
experiments verify the effectiveness of our approach across three phrase
grounding datasets (Flickr30K Entities, ReferIt Game, and Visual Genome), where
we obtain improvements of 4%, 3%, and 4%, respectively, in grounding performance
over a strong region-phrase embedding baseline. Comment: ECCV 2018 accepted paper
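As a rough illustration of the idea in this abstract, the sketch below scores a region-phrase pair under several embeddings and mixes the scores with softmax weights standing in for the concept weight branch. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the use of plain linear projections, and the dot-product similarity are all hypothetical choices made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def conditional_score(region_feat, phrase_feat, region_projs, concept_logits):
    """Mix K per-embedding similarities with phrase-dependent weights.

    region_projs   : K hypothetical projection matrices (lists of rows),
                     one per text-conditioned embedding.
    concept_logits : K raw scores, assumed to come from a concept weight
                     branch that looks only at the phrase.
    """
    weights = softmax(concept_logits)
    scores = []
    for proj in region_projs:
        # Project the region into this embedding's subspace.
        projected = [dot(row, region_feat) for row in proj]
        # Assumed similarity: plain dot product with the phrase feature.
        scores.append(dot(projected, phrase_feat))
    return sum(w * s for w, s in zip(weights, scores)), weights

# Two toy embeddings: an identity map and a coordinate swap.
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
score, weights = conditional_score(
    region_feat=[1.0, 0.0],
    phrase_feat=[1.0, 0.0],
    region_projs=[identity, swap],
    concept_logits=[0.0, 0.0],  # equal logits give equal weights
)
```

With equal logits both embeddings contribute equally; a learned branch would instead push most of the weight toward the embedding that suits the phrase.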
Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment
Medical phrase grounding (MPG) aims to locate the most relevant region in a
medical image, given a phrase query describing certain medical findings, which
is an important task for medical image analysis and radiological diagnosis.
However, existing visual grounding methods rely on general visual features for
identifying objects in natural images and are not capable of capturing the
subtle and specialized features of medical findings, leading to sub-optimal
performance in MPG. In this paper, we propose MedRPG, an end-to-end approach
for MPG. MedRPG is built on a lightweight vision-language transformer encoder
and directly predicts the box coordinates of mentioned medical findings, which
can be trained with limited medical data, making it a valuable tool in medical
image analysis. To enable MedRPG to locate nuanced medical findings with better
region-phrase correspondences, we further propose Tri-attention Context
contrastive alignment (TaCo). TaCo seeks context alignment to pull both the
features and attention outputs of relevant region-phrase pairs close together
while pushing those of irrelevant regions far away. This ensures that the final
box prediction depends more on its finding-specific regions and phrases.
Experimental results on three MPG datasets demonstrate that our MedRPG
outperforms state-of-the-art visual grounding approaches by a large margin.
Additionally, the proposed TaCo strategy is effective in enhancing finding
localization ability and reducing spurious region-phrase correlations.
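The pull-together, push-apart objective described in this abstract can be sketched with a generic InfoNCE-style contrastive loss. This is only an assumption-laden illustration: it operates on plain feature vectors, uses cosine similarity and a hypothetical `temperature` parameter, and does not model the tri-attention component or the attention outputs that TaCo also aligns.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def contrastive_loss(phrase, positive_region, negative_regions, temperature=0.1):
    """Generic InfoNCE-style loss (a stand-in, not the paper's exact loss).

    The loss is small when the phrase is closer to its relevant region
    than to any of the irrelevant regions.
    """
    pos = cosine(phrase, positive_region) / temperature
    negs = [cosine(phrase, r) / temperature for r in negative_regions]
    logits = [pos] + negs
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - pos

# Aligned pair: the phrase matches its region and not the distractor.
aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# Misaligned pair: the phrase matches the distractor instead.
misaligned = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing a loss of this shape rewards features where the relevant region-phrase pair is the most similar one, which is the behavior the abstract attributes to context alignment.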
Geospatial phrase grounding and disambiguation
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 101-107).
GeoCoder is a spatial reasoning system that converts natural language inputs into a set of precise spatial coordinates to display on a map. GeoCoder's spatial knowledge is represented in a set of ontologies. GeoCoder parses input phrases and adds location reference individuals to its ontology model. Relationships between location references are recognized based on mid-level structural patterns in the parsed phrase. GeoCoder grounds (or finds possible geometries for) location references in an iterative process, in which locations are grounded based on their relationships to previously grounded locations. GeoCoder improves upon previous systems by grounding and disambiguating at the phrase level, interpreting parses with rules that match mid-level structure patterns, expressing disambiguation heuristics in ontologies, and improving scalability by separating grounding from reasoning about relationships.
by Amy Michelle Slagle. M.Eng.
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Key to tasks that require reasoning about natural language in visual contexts
is grounding words and phrases to image regions. However, observing this
grounding in contemporary models is complex, even if it is generally expected
to take place if the task is addressed in a way that is conducive to
generalization. We propose a framework to jointly study task performance and
phrase grounding, and propose three benchmarks to study the relation between
the two. Our results show that contemporary models demonstrate inconsistency
between their ability to ground phrases and solve tasks. We show how this can
be addressed through brute-force training on phrase grounding annotations, and
analyze the dynamics it creates. Code is available at
https://github.com/lil-lab/phrase_grounding