138 research outputs found
Why do These Match? Explaining the Behavior of Image Similarity Models
Explaining a deep learning model can help users understand its behavior and
allow researchers to discern its shortcomings. Recent work has primarily
focused on explaining models for tasks like image classification or visual
question answering. In this paper, we introduce Salient Attributes for Network
Explanation (SANE) to explain image similarity models, where a model's output
is a score measuring the similarity of two inputs rather than a classification
score. In this task, an explanation depends on both of the input images, so
standard methods do not apply. Each SANE explanation pairs a saliency map
identifying important image regions with an attribute that best explains the
match. We find that our explanations provide additional information not
typically captured by saliency maps alone, and can also improve performance on
the classic task of attribute recognition. Our approach's ability to generalize
is demonstrated on two datasets from diverse domains, Polyvore Outfits and
Animals with Attributes 2. Code available at:
https://github.com/VisionLearningGroup/SANE
Comment: Accepted at ECCV 2020
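The abstract notes that classifier-oriented saliency methods do not directly apply when the output is a pairwise similarity score. As a rough illustration of that adaptation problem, below is a minimal occlusion-style saliency sketch for a two-input similarity model; the `model` interface and patch size are assumptions, and this is not the SANE procedure itself.

```python
import torch

def similarity_saliency(model, img_a, img_b, patch=16):
    """Occlusion-style saliency for a two-input similarity model: mask
    one patch of img_a at a time and record how much the similarity
    with img_b drops. Larger drops mark regions important to the match."""
    model.eval()
    with torch.no_grad():
        base = model(img_a.unsqueeze(0), img_b.unsqueeze(0)).item()
        _, h, w = img_a.shape  # assumes a CHW image tensor
        sal = torch.zeros((h + patch - 1) // patch, (w + patch - 1) // patch)
        for i in range(0, h, patch):
            for j in range(0, w, patch):
                occluded = img_a.clone()
                occluded[:, i:i + patch, j:j + patch] = 0.0  # zero occluder
                score = model(occluded.unsqueeze(0), img_b.unsqueeze(0)).item()
                sal[i // patch, j // patch] = base - score  # drop = importance
    return sal
```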
Multilevel Language and Vision Integration for Text-to-Clip Retrieval
We address the problem of text-based activity retrieval in video. Given a
sentence describing an activity, our task is to retrieve matching clips from an
untrimmed video. To capture the inherent structures present in both text and
video, we introduce a multilevel model that integrates vision and language
features earlier and more tightly than prior work. First, we inject text
features early on when generating clip proposals, to help eliminate unlikely
clips and thus speed up processing and boost performance. Second, to learn a
fine-grained similarity metric for retrieval, we use visual features to
modulate the processing of query sentences at the word level in a recurrent
neural network. We also employ a multi-task loss with query re-generation as
an auxiliary task. Our approach significantly outperforms
prior work on two challenging benchmarks: Charades-STA and ActivityNet
Captions.
Comment: AAAI 2019
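One concrete reading of "visual features modulate the processing of query sentences at the word level" is gating each word embedding with the clip feature before the recurrent encoder. The sketch below is a FiLM-style stand-in under assumed dimensions, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn

class VisuallyModulatedEncoder(nn.Module):
    """Word-level modulation of a query by clip features: each word
    embedding is gated by a projection of the visual feature before the
    GRU step. Dimensions and the sigmoid-gate form are assumptions."""
    def __init__(self, vocab_size, word_dim=300, vis_dim=512, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gate = nn.Linear(vis_dim, word_dim)   # visual -> per-dim gate
        self.gru = nn.GRU(word_dim, hidden, batch_first=True)

    def forward(self, words, clip_feat):
        # words: (B, T) token ids; clip_feat: (B, vis_dim)
        e = self.embed(words)                      # (B, T, word_dim)
        g = torch.sigmoid(self.gate(clip_feat))    # (B, word_dim)
        e = e * g.unsqueeze(1)                     # modulate every word
        _, h = self.gru(e)                         # h: (1, B, hidden)
        return h.squeeze(0)                        # query embedding
```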
Solving Visual Madlibs with Multiple Cues
This paper focuses on answering fill-in-the-blank style multiple choice
questions from the Visual Madlibs dataset. Previous approaches to Visual
Question Answering (VQA) have mainly used generic image features from networks
trained on the ImageNet dataset, despite the wide scope of questions. In
contrast, our approach employs features derived from networks trained for
specialized tasks of scene classification, person activity prediction, and
person and object attribute prediction. We also present a method for selecting
sub-regions of an image that are relevant for evaluating the appropriateness of
a putative answer. Visual features are computed both from the whole image and
from local regions, while sentences are mapped to a common space using a simple
normalized canonical correlation analysis (CCA) model. Our results show a
significant improvement over the previous state of the art, and indicate that
answering different question types benefits from examining a variety of image
cues and carefully choosing informative image sub-regions.
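The "simple normalized CCA model" can be sketched as regularized CCA whose canonical dimensions are rescaled by their correlations raised to a power, as in common normalized-CCA practice; the regularizer and power below are assumed defaults, not values from the paper.

```python
import numpy as np

def normalized_cca(X, Y, dim=128, p=4, reg=1e-4):
    """Regularized CCA between visual features X (N, dx) and sentence
    features Y (N, dy); each canonical dimension is scaled by its
    correlation raised to the power p (the 'normalized' part)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Cxx = X.T @ X / len(X) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / len(Y) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)
    # Whiten each view, then take the SVD of the cross-covariance.
    inv_sqrt = lambda C: np.linalg.inv(np.linalg.cholesky(C)).T
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    A = Wx @ U[:, :dim] * (s[:dim] ** p)      # image-side projection
    B = Wy @ Vt.T[:, :dim] * (s[:dim] ** p)   # sentence-side projection
    return A, B
```

Candidate answers can then be ranked by cosine similarity between the projected views `X @ A` and `Y @ B`.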
Bias Mimicking: A Simple Sampling Approach for Bias Mitigation
Prior work has shown that Visual Recognition datasets frequently
underrepresent bias groups (e.g., Female) within class labels (e.g.,
Programmers). This dataset bias can lead to models that learn spurious
correlations between class labels and bias groups such as age, gender, or race.
Most recent methods that address this problem require significant architectural
changes or additional loss functions that demand extra hyper-parameter tuning.
Alternatively, data-sampling baselines from the class-imbalance literature
(e.g., Undersampling, Upweighting), which can often be implemented in a single
line of code and typically have no hyperparameters, offer a cheaper and more
efficient solution. However, these methods suffer from notable shortcomings.
For example, Undersampling drops a significant part of the input distribution per
epoch while Oversampling repeats samples, causing overfitting. To address these
shortcomings, we introduce a new class-conditioned sampling method: Bias
Mimicking (BM). The method is based on the observation that if a class $c$'s
bias distribution, i.e. $P_D(B \mid Y = c)$, is mimicked across every other
class $c' \neq c$, then the label $Y$ and the bias attribute $B$ are
statistically independent. Using this notion, BM, through
a novel training procedure, ensures that the model is exposed to the entire
distribution per epoch without repeating samples. Consequently, Bias Mimicking
improves the underrepresented-group accuracy of sampling methods by 3% across
four benchmarks while maintaining, and sometimes improving, performance
relative to nonsampling methods. Code: https://github.com/mqraitem/Bias-Mimicking
A Unified Framework for Connecting Noise Modeling to Boost Noise Detection
Noisy labels can impair model performance, making the study of learning with
noisy labels an important topic. Two conventional approaches are noise modeling
and noise detection. However, these two methods are typically studied
independently, and there has been limited work on their collaboration. In this
work, we explore the integration of these two approaches, proposing an
interconnected structure with three crucial blocks: noise modeling, source
knowledge identification, and enhanced noise detection using noise
source-knowledge-integration methods. This collaborative structure offers
advantages such as discriminating hard negatives and preserving genuinely clean
labels that might otherwise be flagged as suspiciously noisy. Our experiments on four datasets,
featuring three types of noise and different combinations of each block,
demonstrate the efficacy of these components' collaboration. Methods built on
our collaborative structure achieve up to a 10% increase in top-1
classification accuracy on datasets with synthesized noise and 3-5% on
real-world noisy datasets. The
results also suggest that these components make distinct contributions to
overall performance across various noise scenarios. These findings provide
valuable insights for designing noisy label learning methods customized for
specific noise scenarios in the future. Our code is publicly available.
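The abstract describes the three blocks only at a high level, so the following is a deliberately schematic illustration of how a modeled noise-transition matrix could confirm or veto a loss-based detector, e.g. preserving high-loss samples whose observed labels the noise model considers implausible corruptions. The threshold and decision rule are assumptions, not the paper's algorithm.

```python
import numpy as np

def detect_noisy(probs, labels, T, loss_quantile=0.7):
    """Toy collaboration of noise modeling and noise detection.

    probs:  (N, C) model softmax outputs; labels: (N,) observed labels;
    T:      (C, C) estimated noise transition matrix,
            T[i, j] = P(observed = j | true = i).
    Flag a sample as noisy only when its loss is high AND the noise model
    agrees the observed label is a plausible corruption of the predicted
    class; high-loss samples the model deems implausible corruptions are
    kept as hard-but-clean examples."""
    n = len(labels)
    loss = -np.log(probs[np.arange(n), labels] + 1e-12)  # cross-entropy
    high_loss = loss > np.quantile(loss, loss_quantile)
    pred = probs.argmax(1)
    plausible_noise = T[pred, labels] > 1.0 / T.shape[1]  # above uniform
    return high_loss & plausible_noise
```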
Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News
Large-scale dissemination of disinformation online intended to mislead or
deceive the general population is a major societal problem. Rapid progression
in image, video, and natural language generative models has only exacerbated
this situation and intensified our need for an effective defense mechanism.
While existing approaches have been proposed to defend against neural fake
news, they are generally constrained to the very limited setting where articles
only have text and metadata such as the title and authors. In this paper, we
introduce the more realistic and challenging task of defending against
machine-generated news that also includes images and captions. To identify the
possible weaknesses that adversaries can exploit, we create the NeuralNews
dataset, composed of four different types of generated articles, and conduct
a series of human user-study experiments based on this dataset. In addition to
the valuable insights gleaned from our user study experiments, we provide a
relatively effective approach based on detecting visual-semantic
inconsistencies, which will serve as an effective first line of defense and a
useful reference for future work in defending against machine-generated
disinformation.
Comment: Accepted at EMNLP 2020
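The abstract does not specify the detector's architecture, but the underlying signal, agreement between an article's images and its text, can be approximated with an off-the-shelf joint image-text encoder. A minimal sketch using the open_clip package (an assumption; any joint encoder would do), not the paper's detector:

```python
import torch
from PIL import Image
import open_clip  # assumed installed as open_clip_torch

def consistency_score(image_path, caption):
    """Score visual-semantic consistency of an article image and its
    caption; low scores hint at cross-modal mismatches of the kind a
    machine-generated article might exhibit."""
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="openai")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_f = model.encode_image(image)
        txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f * txt_f).sum().item()  # cosine similarity in [-1, 1]
```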
From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Bias
Visual recognition models are prone to learning spurious correlations induced
by a biased training set where certain conditions (e.g., Indoors) are
over-represented in certain classes (e.g., Big Dogs). Synthetic data from
generative models offers a promising direction to mitigate this issue by
augmenting underrepresented conditions in the real dataset. However, this
introduces another potential source of bias from generative model artifacts in
the synthetic data. Indeed, as we will show, prior work uses synthetic data to
resolve the model's bias toward the bias attribute $B$, but it does not
correct the model's bias toward the pair $(B, G)$, where $G$ denotes whether
the sample is real or synthetic. Thus, the model could simply learn signals
based on the pair $(B, G)$ (e.g., Synthetic Indoors) to make predictions about
the class label $Y$ (e.g., Big Dogs). To
address this issue, we propose a two-step training pipeline that we call From
Fake to Real (FFR). The first step of FFR pre-trains a model on balanced
synthetic data to learn robust representations across subgroups. In the second
step, FFR fine-tunes the model on real data using ERM or common loss-based bias
mitigation methods. By training on real and synthetic data separately, FFR
avoids the issue of bias toward signals from the pair $(B, G)$. In other words,
synthetic data in the first step provides effective unbiased representations
that boost performance in the second step. Indeed, our analysis of a high-bias
setting (99.9%) shows that FFR improves performance over the state of the art
by 7-14% across three datasets (CelebA, UTK-Face, and SpuCO Animals).
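The two-step pipeline itself is simple to express: pretrain on subgroup-balanced synthetic data, then fine-tune on real data, never mixing the two sources in one batch so the model cannot exploit real-vs-synthetic signals. A minimal sketch assuming standard PyTorch data loaders and plain ERM fine-tuning; epoch counts and the optimizer are illustrative, and the paper also allows loss-based bias-mitigation objectives in the second step.

```python
import torch

def ffr_train(model, synthetic_loader, real_loader, epochs=(5, 10), lr=1e-4):
    """Two-step pipeline in the spirit of FFR: (1) pretrain on synthetic
    data balanced across (class, bias-group) subgroups, (2) fine-tune on
    real data with ERM. The two sources are trained on separately, never
    mixed within a step."""
    criterion = torch.nn.CrossEntropyLoss()
    for loader, n_epochs in [(synthetic_loader, epochs[0]),
                             (real_loader, epochs[1])]:
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        model.train()
        for _ in range(n_epochs):
            for images, labels in loader:
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()
    return model
```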
Show and Write: Entity-aware Article Generation with Image Information
Many vision-language applications contain long articles of text paired with
images (e.g., news or Wikipedia articles). Prior work learning to encode and/or
generate these articles has primarily focused on understanding the article
itself and some related metadata like the title or date it was written.
However, the images and their captions or alt-text often contain crucial
information, such as named entities, that is difficult for language models to
recognize and predict correctly. To address this shortcoming, this
paper introduces an ENtity-aware article Generation method with Image
iNformation, ENGIN, to incorporate an article's image information into language
models. ENGIN represents articles conditioned both on the metadata used by
prior work and on information such as captions and named entities extracted
from images. Our key contribution is a novel Entity-aware mechanism to help our
model better recognize and predict the entity names in articles. We perform
experiments on three public datasets, GoodNews, VisualNews, and WikiText.
Quantitative results show that our approach improves generated article
perplexity by 4-5 points over the base models. Qualitative results demonstrate
the text generated by ENGIN is more consistent with embedded article images. We
also perform article quality annotation experiments on the generated articles
to validate that our model produces higher-quality articles. Finally, we
investigate the effect ENGIN has on methods that automatically detect
machine-generated articles.
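ENGIN's Entity-aware mechanism lives inside the model, but the input-side idea, conditioning generation on captions and named entities extracted from an article's images, can be approximated by prompt construction with an off-the-shelf causal LM. The sketch below uses an assumed prompt format and base model and is a stand-in, not ENGIN itself.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_article(headline, captions, entities, model_name="gpt2"):
    """Condition a language model on image-derived information by
    prepending image captions and named entities to the prompt."""
    tok = AutoTokenizer.from_pretrained(model_name)
    lm = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = (f"Headline: {headline}\n"
              f"Entities: {', '.join(entities)}\n"
              f"Captions: {' '.join(captions)}\n"
              f"Article:")
    ids = tok(prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=200, do_sample=True, top_p=0.9,
                      pad_token_id=tok.eos_token_id)
    # Return only the newly generated continuation.
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
```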