Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding
The prevailing framework for solving referring expression grounding is based
on a two-stage process: 1) detecting proposals with an object detector and 2)
grounding the referent to one of the proposals. Existing two-stage solutions
mostly focus on the grounding step, which aims to align the expressions with
the proposals. In this paper, we argue that these methods overlook an obvious
mismatch between the roles of proposals in the two stages: they generate
proposals solely based on the detection confidence (i.e., expression-agnostic),
hoping that the proposals contain all the right instances mentioned in the expression (i.e.,
expression-aware). Due to this mismatch, current two-stage methods suffer from
a severe performance drop between detected and ground-truth proposals. To this
end, we propose Ref-NMS, which is the first method to yield expression-aware
proposals at the first stage. Ref-NMS regards all nouns in the expression as
critical objects, and introduces a lightweight module to predict a score for
aligning each box with a critical object. These scores guide the NMS
operation to filter out boxes irrelevant to the expression, which increases the
recall of critical objects and leads to significantly improved grounding
performance. Since Ref-NMS is agnostic to the grounding step, it can be easily
integrated into any state-of-the-art two-stage method. Extensive ablation
studies on several backbones, benchmarks, and tasks consistently demonstrate
the superiority of Ref-NMS. Code is available at:
https://github.com/ChopinSharp/ref-nms. (Appears in AAAI 2021.)
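As a rough sketch of the idea, the following fuses the expression-agnostic detection confidence with a predicted expression-relatedness score before running NMS. The fusion rule, the weight alpha, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of expression-aware NMS: rank boxes by a fusion of detection
# confidence and expression relatedness instead of confidence alone.
# (Assumption: simple linear fusion; the paper's scheme may differ.)
import torch
from torchvision.ops import nms

def expression_aware_nms(boxes, det_scores, rel_scores, iou_thresh=0.5, alpha=0.5):
    """
    boxes:      (N, 4) proposal coordinates (x1, y1, x2, y2)
    det_scores: (N,) expression-agnostic detection confidences
    rel_scores: (N,) predicted relatedness of each box to the critical
                objects (nouns) mentioned in the referring expression
    alpha:      hypothetical weight trading off the two scores
    """
    fused = alpha * det_scores + (1.0 - alpha) * rel_scores
    keep = nms(boxes, fused, iou_thresh)
    return boxes[keep], fused[keep]

# The relatedness scores would come from a lightweight module that matches
# each box feature against embeddings of the expression's nouns, e.g. a
# dot product followed by a sigmoid.
```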
Top-Down Framework for Weakly-supervised Grounded Image Captioning
Weakly-supervised grounded image captioning (WSGIC) aims to generate the
caption and ground (localize) predicted object words in the input image without
using bounding box supervision. Recent two-stage solutions mostly apply a
bottom-up pipeline: (1) encode the input image into multiple region features
using an object detector; (2) leverage region features for captioning and
grounding. However, utilizing independent proposals produced by object
detectors tends to cause the subsequent grounded captioner to overfit in finding
the correct object words, overlook the relations between objects, and
select incompatible proposal regions for grounding. To address these issues,
we propose a one-stage weakly-supervised grounded captioner that directly takes
the RGB image as input to perform captioning and grounding at the top-down
image level. Specifically, we encode the image into visual token
representations and propose a Recurrent Grounding Module (RGM) in the decoder
to obtain precise Visual Language Attention Maps (VLAMs), which recognize the
spatial locations of the objects. In addition, we explicitly inject a relation
module into our one-stage framework to encourage relation understanding through
multi-label classification. These relation semantics serve as contextual
information that facilitates the prediction of relation and object words in the
caption. We observe that the relation semantics not only assist the grounded
captioner in generating a more accurate caption but also improve the grounding
performance. We validate the effectiveness of our proposed method on two
challenging datasets (Flickr30k Entities captioning and MSCOCO captioning). The
experimental results demonstrate that our method achieves state-of-the-art
grounding performance.
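A minimal sketch of the two ingredients described above, assuming PyTorch; the module names, dimensions, and attention formulation are illustrative stand-ins rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RecurrentGroundingModule(nn.Module):
    """Produces a spatial attention map (a stand-in for a VLAM) at each
    decoding step by matching the current word query against visual tokens."""
    def __init__(self, dim):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.key_proj = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)  # carries grounding state across steps

    def forward(self, word_query, visual_tokens, state):
        # word_query: (B, D), visual_tokens: (B, HW, D), state: (B, D)
        state = self.gru(word_query, state)
        q = self.query_proj(state).unsqueeze(1)            # (B, 1, D)
        k = self.key_proj(visual_tokens)                   # (B, HW, D)
        attn = torch.softmax((q * k).sum(-1) / k.size(-1) ** 0.5, dim=-1)
        context = torch.bmm(attn.unsqueeze(1), visual_tokens).squeeze(1)
        return attn, context, state                        # attn: (B, HW)

class RelationHead(nn.Module):
    """Multi-label relation classifier used as an auxiliary objective to
    inject relation semantics into the visual representation."""
    def __init__(self, dim, num_relations):
        super().__init__()
        self.fc = nn.Linear(dim, num_relations)

    def forward(self, pooled_visual):
        return self.fc(pooled_visual)  # train with nn.BCEWithLogitsLoss
```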
Understanding of Visual Domains via the Lens of Natural Language
A joint understanding of vision and language can enable intelligent systems to perceive, act, and communicate with humans for a wide range of applications. For example, they can assist a human in navigating an environment, edit the content of an image through natural language commands, or search through image collections using natural language queries. In this thesis, we aim to improve our understanding of visual domains through the lens of natural language. We specifically look into (1) images of categories within a fine-grained taxonomy, such as species of birds or variants of aircraft, (2) images of textures that describe local color, shape, and patterns, and (3) regions in images that correspond to objects, materials, and textures.
In one line of work, we investigate ways to discover a domain-specific language by asking annotators to describe visual differences between instances within a fine-grained taxonomy. We show that a system trained to describe these differences leads to an accurate and interpretable basis for categorization. In another line of work, we investigate the effectiveness of language and vision models for describing textures, a problem that, despite the ubiquity of textures, has not been sufficiently studied in the literature. Textures are diverse, yet their local nature allows for the description of the appearance of a wide range of visual categories. This locality also allows us to systematically generate synthetic variations to investigate how disentangled visual representations are for properties such as shape, color, and figure-ground segmentation. Finally, instead of modeling an image as a whole, we design a system that allows descriptions of regions within an image. A challenge is to handle the long-tail distribution of names and appearances of concepts within natural scenes. We design a modular framework that integrates object detection, semantic segmentation, and contextual reasoning with language, leading to better performance. In addition to methods and analysis, we contribute datasets and benchmarks to evaluate the performance of models in each of these domains.
The availability of large-scale pre-trained models for vision (e.g., ResNet) and language (e.g., BERT) has catalyzed improvements and novel applications in computer vision and natural language processing, but until recently similar models that could jointly reason about language and vision were not available. This has changed through the availability of models such as CLIP, which have been trained on a massive number of images with associated texts. Therefore, we analyze the effectiveness of CLIP-based representations for tasks posed in our earlier work. By comparing and contrasting these with the domain-specific ones presented in the earlier chapters, we shed some light on the nature of the learned representations and the biases they encode.
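As a minimal illustration of probing such a joint vision-language space, the snippet below scores one image against a few texture prompts with the public OpenAI CLIP package; the image file name and prompts are hypothetical placeholders, not examples from the thesis.

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("texture.jpg")).unsqueeze(0).to(device)
prompts = ["a striped texture", "a dotted texture", "a woven texture"]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)        # image-text similarities
    probs = logits_per_image.softmax(dim=-1)        # normalize over prompts

for prompt, prob in zip(prompts, probs[0].tolist()):
    print(f"{prompt}: {prob:.3f}")
```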
Modeling Visual Rhetoric and Semantics in Multimedia
Recent advances in machine learning have enabled computer vision algorithms to model complicated visual phenomena with accuracies unthinkable a mere decade ago. Their high performance on a plethora of vision-related tasks has enabled computer vision researchers to begin to move beyond traditional visual recognition problems to tasks requiring higher-level image understanding. However, most computer vision research still focuses on describing what images, text, or other media literally portray. In contrast, in this dissertation we focus on learning how and why such content is portrayed. Rather than viewing media for its content, we recast the problem as understanding visual communication and visual rhetoric. For example, the same content may be portrayed in different ways in order to present the story the author wishes to convey. We thus seek to model not only the content of the media, but also its authorial intent and latent messaging. Understanding how and why visual content is portrayed a certain way requires understanding higher-level abstract semantic concepts, which are themselves latent within visual media. By latent, we mean the concept is not readily visually accessible within a single image (e.g., right vs. left political bias), in contrast to explicit visual semantic concepts such as objects.
Specifically, we study the problems of modeling photographic style (how professional photographers portray their subjects), understanding visual persuasion in image advertisements, modeling political bias in multimedia (image and text) news articles, and learning cross-modal semantic representations. While most past research in vision and natural language processing studies the case where visual content and paired text are highly aligned (as in the case of image captions), we target the case where each modality conveys complementary information to tell a larger story. We particularly focus on the problem of learning cross-modal representations from multimedia exhibiting weak alignment between the image and text modalities. A variety of techniques are presented that improve modeling of multimedia rhetoric in real-world data and enable more robust artificially intelligent systems.
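A common way to learn such cross-modal representations from weakly aligned image-text pairs is a symmetric contrastive objective; the sketch below is an illustrative InfoNCE-style stand-in, not the dissertation's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of images and the (possibly only
    loosely related) texts drawn from the same documents."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Each image should score highest with its own document's text, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```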