
    Conditional Image-Text Embedding Networks

    This paper presents an approach for grounding phrases in images which jointly learns multiple text-conditioned embeddings in a single end-to-end model. In order to differentiate text phrases into semantically distinct subspaces, we propose a concept weight branch that automatically assigns phrases to embeddings, whereas prior works predefine such assignments. Our proposed solution simplifies the representation requirements for individual embeddings and allows the underrepresented concepts to take advantage of the shared representations before feeding them into concept-specific layers. Comprehensive experiments verify the effectiveness of our approach across three phrase grounding datasets, Flickr30K Entities, ReferIt Game, and Visual Genome, where we obtain a (resp.) 4%, 3%, and 4% improvement in grounding performance over a strong region-phrase embedding baseline. Comment: ECCV 2018 accepted paper.
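
    As a rough illustration of the idea described above (several concept-specific embedding subspaces plus a concept weight branch that softly assigns each phrase to them), here is a minimal PyTorch-style sketch. It is not the authors' code; all module names, dimensions, and the softmax-weighted scoring are assumptions.

```python
# Hypothetical sketch of text-conditioned embeddings with a concept weight branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalEmbedding(nn.Module):
    def __init__(self, phrase_dim, region_dim, embed_dim=256, num_concepts=4):
        super().__init__()
        self.num_concepts = num_concepts
        # Shared projections that feed the concept-specific layers
        self.phrase_proj = nn.Linear(phrase_dim, embed_dim)
        self.region_proj = nn.Linear(region_dim, embed_dim)
        # One lightweight embedding head per latent concept
        self.phrase_heads = nn.ModuleList(
            nn.Linear(embed_dim, embed_dim) for _ in range(num_concepts))
        self.region_heads = nn.ModuleList(
            nn.Linear(embed_dim, embed_dim) for _ in range(num_concepts))
        # Concept weight branch: a distribution over concepts per phrase
        self.concept_branch = nn.Linear(phrase_dim, num_concepts)

    def forward(self, phrase_feats, region_feats):
        # phrase_feats: (P, phrase_dim), region_feats: (R, region_dim)
        w = F.softmax(self.concept_branch(phrase_feats), dim=-1)       # (P, K)
        p_shared = F.relu(self.phrase_proj(phrase_feats))              # (P, D)
        r_shared = F.relu(self.region_proj(region_feats))              # (R, D)
        sims = []
        for k in range(self.num_concepts):
            p_k = F.normalize(self.phrase_heads[k](p_shared), dim=-1)  # (P, D)
            r_k = F.normalize(self.region_heads[k](r_shared), dim=-1)  # (R, D)
            sims.append(p_k @ r_k.t())                                 # (P, R)
        sims = torch.stack(sims, dim=-1)                               # (P, R, K)
        # Weight each concept's similarity by the phrase's concept assignment
        return (sims * w[:, None, :]).sum(dim=-1)                      # (P, R)

# Usage: score 5 phrases against 10 candidate regions
model = ConditionalEmbedding(phrase_dim=300, region_dim=2048)
scores = model(torch.randn(5, 300), torch.randn(10, 2048))
print(scores.shape)  # torch.Size([5, 10])
```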

    Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment

    Medical phrase grounding (MPG) aims to locate the most relevant region in a medical image, given a phrase query describing certain medical findings, which is an important task for medical image analysis and radiological diagnosis. However, existing visual grounding methods rely on general visual features for identifying objects in natural images and are not capable of capturing the subtle and specialized features of medical findings, leading to sub-optimal performance in MPG. In this paper, we propose MedRPG, an end-to-end approach for MPG. MedRPG is built on a lightweight vision-language transformer encoder and directly predicts the box coordinates of mentioned medical findings, which can be trained with limited medical data, making it a valuable tool in medical image analysis. To enable MedRPG to locate nuanced medical findings with better region-phrase correspondences, we further propose Tri-attention Context contrastive alignment (TaCo). TaCo seeks context alignment to pull both the features and attention outputs of relevant region-phrase pairs close together while pushing those of irrelevant regions far away. This ensures that the final box prediction depends more on its finding-specific regions and phrases. Experimental results on three MPG datasets demonstrate that our MedRPG outperforms state-of-the-art visual grounding approaches by a large margin. Additionally, the proposed TaCo strategy is effective in enhancing finding localization ability and reducing spurious region-phrase correlations.
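
    The abstract describes TaCo as pulling the features of relevant region-phrase pairs together while pushing irrelevant regions away. Below is a minimal, hypothetical sketch of such a contrastive alignment term, not MedRPG's actual implementation; the InfoNCE-style formulation, temperature value, and all names are assumptions.

```python
# Hypothetical region-phrase contrastive alignment loss (symmetric InfoNCE).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(region_feats, phrase_feats, temperature=0.07):
    """region_feats, phrase_feats: (B, D) features for B matched pairs,
    where region i is the relevant region for phrase i."""
    r = F.normalize(region_feats, dim=-1)
    p = F.normalize(phrase_feats, dim=-1)
    logits = (p @ r.t()) / temperature               # (B, B) similarity matrix
    targets = torch.arange(r.size(0), device=r.device)
    # Symmetric cross-entropy: phrase-to-region and region-to-phrase directions
    loss_p2r = F.cross_entropy(logits, targets)
    loss_r2p = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_p2r + loss_r2p)

# Usage with random features for 8 region-phrase pairs
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```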

    Geospatial phrase grounding and disambiguation

    Thesis (M.Eng.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 101-107). By Amy Michelle Slagle.
    GeoCoder is a spatial reasoning system that converts natural language inputs into a set of precise spatial coordinates to display on a map. GeoCoder's spatial knowledge is represented in a set of ontologies. GeoCoder parses input phrases and adds location reference individuals to its ontology model. Relationships between location references are recognized based on mid-level structural patterns in the parsed phrase. GeoCoder grounds (or finds possible geometries for) location references in an iterative process, in which locations are grounded based on their relationships to previously grounded locations. GeoCoder improves upon previous systems by grounding and disambiguating at the phrase level, interpreting parses with rules that match mid-level structural patterns, expressing disambiguation heuristics in ontologies, and improving scalability by separating grounding from reasoning about relationships.
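
    The iterative grounding loop described above (resolving location references in passes, using previously grounded locations as anchors) could look roughly like the following sketch. This is a hypothetical illustration, not GeoCoder itself; the gazetteer data, relation handler, and all names are assumptions.

```python
# Hypothetical iterative grounding of location references.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class LocationRef:
    name: str
    relation: Optional[str] = None                   # e.g. "north_of"
    anchor: Optional[str] = None                     # name of the related reference
    geometry: Optional[Tuple[float, float]] = None   # None while ungrounded

# Stand-in for a gazetteer/ontology lookup (assumed data)
GAZETTEER = {"Boston": (42.36, -71.06), "Cambridge": (42.37, -71.11)}

def ground_relative(relation, anchor_geom):
    # Toy relation handler for illustration only
    lat, lon = anchor_geom
    return (lat + 0.1, lon) if relation == "north_of" else (lat, lon)

def ground_all(refs):
    # First pass: ground references that resolve directly against the gazetteer
    for ref in refs:
        if ref.anchor is None:
            ref.geometry = GAZETTEER.get(ref.name)
    # Later passes: ground the rest relative to already-grounded anchors,
    # repeating until no further references can be resolved
    changed = True
    while changed:
        changed = False
        grounded = {r.name: r.geometry for r in refs if r.geometry is not None}
        for ref in refs:
            if ref.geometry is None and ref.anchor in grounded:
                ref.geometry = ground_relative(ref.relation, grounded[ref.anchor])
                changed = True
    return refs

refs = [LocationRef("Boston"),
        LocationRef("the park", relation="north_of", anchor="Boston")]
for r in ground_all(refs):
    print(r.name, r.geometry)
```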

    A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models

    Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even if it is generally expected to take place if the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, and propose three benchmarks to study the relation between the two. Our results show that contemporary models demonstrate inconsistency between their ability to ground phrases and solve tasks. We show how this can be addressed through brute-force training on phrase grounding annotations, and analyze the dynamics it creates. Code and data are available at https://github.com/lil-lab/phrase_grounding
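
    As a loose illustration of training with phrase grounding supervision alongside the downstream task, here is a minimal sketch of a joint objective. It is our own illustration under assumed tensor shapes, names, and weighting, not the code released at the repository above.

```python
# Hypothetical joint objective: downstream task loss plus supervised grounding loss.
import torch
import torch.nn.functional as F

def joint_loss(task_logits, task_labels, grounding_scores, grounding_labels,
               grounding_weight=1.0):
    """task_logits: (B, C) model outputs for the downstream task.
    grounding_scores: (B, P, R) phrase-to-region scores.
    grounding_labels: (B, P) index of the annotated region for each phrase."""
    task_loss = F.cross_entropy(task_logits, task_labels)
    B, P, R = grounding_scores.shape
    grounding_loss = F.cross_entropy(grounding_scores.view(B * P, R),
                                     grounding_labels.view(B * P))
    return task_loss + grounding_weight * grounding_loss

# Usage with random tensors: 2 examples, 3 phrases, 5 candidate regions, 4 classes
loss = joint_loss(torch.randn(2, 4), torch.randint(0, 4, (2,)),
                  torch.randn(2, 3, 5), torch.randint(0, 5, (2, 3)))
print(loss.item())
```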