Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding
The prevailing framework for solving referring expression grounding is based
on a two-stage process: 1) detecting proposals with an object detector and 2)
grounding the referent to one of the proposals. Existing two-stage solutions
mostly focus on the grounding step, which aims to align the expressions with
the proposals. In this paper, we argue that these methods overlook an obvious
mismatch between the roles of proposals in the two stages: they generate
proposals solely based on the detection confidence (i.e., expression-agnostic),
hoping that the proposals contain all the right instances mentioned in the expression (i.e.,
expression-aware). Due to this mismatch, current two-stage methods suffer from
a severe performance drop between detected and ground-truth proposals. To this
end, we propose Ref-NMS, which is the first method to yield expression-aware
proposals at the first stage. Ref-NMS regards all nouns in the expression as
critical objects, and introduces a lightweight module to predict a score for
aligning each box with a critical object. These scores guide the NMS
operation to filter out boxes irrelevant to the expression, which increases the
recall of critical objects and leads to significantly improved grounding
performance. Since Ref-NMS is agnostic to the grounding step, it can be easily
integrated into any state-of-the-art two-stage method. Extensive ablation
studies on several backbones, benchmarks, and tasks consistently demonstrate
the superiority of Ref-NMS. Code is available at:
https://github.com/ChopinSharp/ref-nms. (Appears in AAAI 2021.)
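As a rough sketch of the idea, the following fuses the expression-agnostic detection confidence with a predicted expression-relatedness score before running NMS. The fusion rule, the weight alpha, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of expression-aware NMS: rank boxes by a fusion of detection
# confidence and expression relatedness instead of confidence alone.
# (Assumption: simple linear fusion; the paper's scheme may differ.)
import torch
from torchvision.ops import nms

def expression_aware_nms(boxes, det_scores, rel_scores, iou_thresh=0.5, alpha=0.5):
    """
    boxes:      (N, 4) proposal coordinates (x1, y1, x2, y2)
    det_scores: (N,) expression-agnostic detection confidences
    rel_scores: (N,) predicted relatedness of each box to the critical
                objects (nouns) mentioned in the referring expression
    alpha:      hypothetical weight trading off the two scores
    """
    fused = alpha * det_scores + (1.0 - alpha) * rel_scores
    keep = nms(boxes, fused, iou_thresh)
    return boxes[keep], fused[keep]

# The relatedness scores would come from a lightweight module that matches
# each box feature against embeddings of the expression's nouns, e.g. a
# dot product followed by a sigmoid.
```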
Top-Down Framework for Weakly-supervised Grounded Image Captioning
Weakly-supervised grounded image captioning (WSGIC) aims to generate the
caption and ground (localize) predicted object words in the input image without
using bounding box supervision. Recent two-stage solutions mostly apply a
bottom-up pipeline: (1) encode the input image into multiple region features
using an object detector; (2) leverage region features for captioning and
grounding. However, utilizing independent proposals produced by object
detectors tends to cause the subsequent grounded captioner to overfit in finding
the correct object words, overlook the relations between objects, and
select incompatible proposal regions for grounding. To address these issues,
we propose a one-stage weakly-supervised grounded captioner that directly takes
the RGB image as input to perform captioning and grounding at the top-down
image level. Specifically, we encode the image into visual token
representations and propose a Recurrent Grounding Module (RGM) in the decoder
to obtain precise Visual Language Attention Maps (VLAMs), which recognize the
spatial locations of the objects. In addition, we explicitly inject a relation
module into our one-stage framework to encourage relation understanding through
multi-label classification. These relation semantics serve as contextual
information that facilitates the prediction of relation and object words in the
caption. We observe that the relation semantics not only assist the grounded
captioner in generating a more accurate caption but also improve the grounding
performance. We validate the effectiveness of our proposed method on two
challenging datasets (Flickr30k Entities captioning and MSCOCO captioning). The
experimental results demonstrate that our method achieves state-of-the-art
grounding performance.
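A minimal sketch of the two ingredients described above, assuming PyTorch; the module names, dimensions, and attention formulation are illustrative stand-ins rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RecurrentGroundingModule(nn.Module):
    """Produces a spatial attention map (a stand-in for a VLAM) at each
    decoding step by matching the current word query against visual tokens."""
    def __init__(self, dim):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.key_proj = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)  # carries grounding state across steps

    def forward(self, word_query, visual_tokens, state):
        # word_query: (B, D), visual_tokens: (B, HW, D), state: (B, D)
        state = self.gru(word_query, state)
        q = self.query_proj(state).unsqueeze(1)            # (B, 1, D)
        k = self.key_proj(visual_tokens)                   # (B, HW, D)
        attn = torch.softmax((q * k).sum(-1) / k.size(-1) ** 0.5, dim=-1)
        context = torch.bmm(attn.unsqueeze(1), visual_tokens).squeeze(1)
        return attn, context, state                        # attn: (B, HW)

class RelationHead(nn.Module):
    """Multi-label relation classifier used as an auxiliary objective to
    inject relation semantics into the visual representation."""
    def __init__(self, dim, num_relations):
        super().__init__()
        self.fc = nn.Linear(dim, num_relations)

    def forward(self, pooled_visual):
        return self.fc(pooled_visual)  # train with nn.BCEWithLogitsLoss
```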
Understanding of Visual Domains via the Lens of Natural Language
A joint understanding of vision and language can enable intelligent systems to perceive, act, and communicate with humans for a wide range of applications. For example, they can assist a human in navigating an environment, edit the content of an image through natural language commands, or search through image collections using natural language queries. In this thesis, we aim to improve our understanding of visual domains through the lens of natural language. We specifically look into (1) images of categories within a fine-grained taxonomy, such as species of birds or variants of aircraft, (2) images of textures that describe local color, shape, and patterns, and (3) regions in images that correspond to objects, materials, and textures.
In one line of work, we investigate ways to discover a domain-specific language by asking annotators to describe visual differences between instances within a fine-grained taxonomy. We show that a system trained to describe these differences leads to an accurate and interpretable basis for categorization. In another line of work, we investigate the effectiveness of language and vision models for describing textures, a problem that, despite the ubiquity of textures, has not been sufficiently studied in the literature. Textures are diverse, yet their local nature allows for the description of the appearance of a wide range of visual categories. This locality also allows us to systematically generate synthetic variations to investigate how disentangled visual representations are for properties such as shape, color, and figure-ground segmentation. Finally, instead of modeling an image as a whole, we design a system that allows descriptions of regions within an image. A challenge is to handle the long-tail distribution of names and appearances of concepts within natural scenes. We design a modular framework that integrates object detection, semantic segmentation, and contextual reasoning with language, leading to better performance. In addition to methods and analysis, we contribute datasets and benchmarks to evaluate the performance of models in each of these domains.
The availability of large-scale pre-trained models for vision (e.g., ResNet) and language (e.g., BERT) has catalyzed improvements and novel applications in computer vision and natural language processing, but until recently similar models that could jointly reason about language and vision were not available. This has changed through the availability of models such as CLIP, which have been trained on a massive number of images with associated texts. Therefore, we analyze the effectiveness of CLIP-based representations for tasks posed in our earlier work. By comparing and contrasting these with the domain-specific ones presented in the earlier chapters, we shed some light on the nature of the learned representations and the biases they encode.
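As a minimal illustration of probing such a joint vision-language space, the snippet below scores one image against a few texture prompts with the public OpenAI CLIP package; the image file name and prompts are hypothetical placeholders, not examples from the thesis.

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("texture.jpg")).unsqueeze(0).to(device)
prompts = ["a striped texture", "a dotted texture", "a woven texture"]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)        # image-text similarities
    probs = logits_per_image.softmax(dim=-1)        # normalize over prompts

for prompt, prob in zip(prompts, probs[0].tolist()):
    print(f"{prompt}: {prob:.3f}")
```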
Modeling Visual Rhetoric and Semantics in Multimedia
Recent advances in machine learning have enabled computer vision algorithms to model complicated visual phenomena with accuracies unthinkable a mere decade ago. Their high performance on a plethora of vision-related tasks has enabled computer vision researchers to begin to move beyond traditional visual recognition problems to tasks requiring higher-level image understanding. However, most computer vision research still focuses on describing what images, text, or other media literally portray. In contrast, in this dissertation we focus on learning how and why such content is portrayed. Rather than viewing media for its content, we recast the problem as understanding visual communication and visual rhetoric. For example, the same content may be portrayed in different ways in order to present the story the author wishes to convey. We thus seek to model not only the content of the media, but also its authorial intent and latent messaging. Understanding how and why visual content is portrayed a certain way requires understanding higher-level abstract semantic concepts, which are themselves latent within visual media. By latent, we mean the concept is not readily visually accessible within a single image (e.g., right vs. left political bias), in contrast to explicit visual semantic concepts such as objects.
Specifically, we study the problems of modeling photographic style (how professional photographers portray their subjects), understanding visual persuasion in image advertisements, modeling political bias in multimedia (image and text) news articles, and learning cross-modal semantic representations. While most past research in vision and natural language processing studies the case where visual content and paired text are highly aligned (as in the case of image captions), we target the case where each modality conveys complementary information to tell a larger story. We particularly focus on the problem of learning cross-modal representations from multimedia exhibiting weak alignment between the image and text modalities. A variety of techniques are presented that improve modeling of multimedia rhetoric in real-world data and enable more robust artificially intelligent systems.
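A common way to learn such cross-modal representations from weakly aligned image-text pairs is a symmetric contrastive objective; the sketch below is an illustrative InfoNCE-style stand-in, not the dissertation's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of images and the (possibly only
    loosely related) texts drawn from the same documents."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Each image should score highest with its own document's text, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```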