138 research outputs found
Why do These Match? Explaining the Behavior of Image Similarity Models
Explaining a deep learning model can help users understand its behavior and
allow researchers to discern its shortcomings. Recent work has primarily
focused on explaining models for tasks like image classification or visual
question answering. In this paper, we introduce Salient Attributes for Network
Explanation (SANE) to explain image similarity models, where a model's output
is a score measuring the similarity of two inputs rather than a classification
score. In this task, an explanation depends on both of the input images, so
standard methods do not apply. Each SANE explanation pairs a saliency map
identifying important image regions with an attribute that best explains the
match. We find that our explanations provide additional information not
typically captured by saliency maps alone, and can also improve performance on
the classic task of attribute recognition. Our approach's ability to generalize
is demonstrated on two datasets from diverse domains, Polyvore Outfits and
Animals with Attributes 2. Code available at:
https://github.com/VisionLearningGroup/SANE
Comment: Accepted at ECCV 2020
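The abstract notes that classifier-oriented saliency methods do not directly apply when the output is a pairwise similarity score. As a rough illustration of that adaptation problem, below is a minimal occlusion-style saliency sketch for a two-input similarity model; the `model` interface and patch size are assumptions, and this is not the SANE procedure itself.

```python
import torch

def similarity_saliency(model, img_a, img_b, patch=16):
    """Occlusion-style saliency for a two-input similarity model: mask
    one patch of img_a at a time and record how much the similarity
    with img_b drops. Larger drops mark regions important to the match."""
    model.eval()
    with torch.no_grad():
        base = model(img_a.unsqueeze(0), img_b.unsqueeze(0)).item()
        _, h, w = img_a.shape  # assumes a CHW image tensor
        sal = torch.zeros((h + patch - 1) // patch, (w + patch - 1) // patch)
        for i in range(0, h, patch):
            for j in range(0, w, patch):
                occluded = img_a.clone()
                occluded[:, i:i + patch, j:j + patch] = 0.0  # zero occluder
                score = model(occluded.unsqueeze(0), img_b.unsqueeze(0)).item()
                sal[i // patch, j // patch] = base - score  # drop = importance
    return sal
```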
Multilevel Language and Vision Integration for Text-to-Clip Retrieval
We address the problem of text-based activity retrieval in video. Given a
sentence describing an activity, our task is to retrieve matching clips from an
untrimmed video. To capture the inherent structures present in both text and
video, we introduce a multilevel model that integrates vision and language
features earlier and more tightly than prior work. First, we inject text
features early on when generating clip proposals, to help eliminate unlikely
clips and thus speed up processing and boost performance. Second, to learn a
fine-grained similarity metric for retrieval, we use visual features to
modulate the processing of query sentences at the word level in a recurrent
neural network. We also employ a multi-task loss with query re-generation as
an auxiliary task. Our approach significantly outperforms
prior work on two challenging benchmarks: Charades-STA and ActivityNet
Captions.
Comment: AAAI 2019
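One concrete reading of "visual features modulate the processing of query sentences at the word level" is gating each word embedding with the clip feature before the recurrent encoder. The sketch below is a FiLM-style stand-in under assumed dimensions, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn

class VisuallyModulatedEncoder(nn.Module):
    """Word-level modulation of a query by clip features: each word
    embedding is gated by a projection of the visual feature before the
    GRU step. Dimensions and the sigmoid-gate form are assumptions."""
    def __init__(self, vocab_size, word_dim=300, vis_dim=512, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gate = nn.Linear(vis_dim, word_dim)   # visual -> per-dim gate
        self.gru = nn.GRU(word_dim, hidden, batch_first=True)

    def forward(self, words, clip_feat):
        # words: (B, T) token ids; clip_feat: (B, vis_dim)
        e = self.embed(words)                      # (B, T, word_dim)
        g = torch.sigmoid(self.gate(clip_feat))    # (B, word_dim)
        e = e * g.unsqueeze(1)                     # modulate every word
        _, h = self.gru(e)                         # h: (1, B, hidden)
        return h.squeeze(0)                        # query embedding
```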
Solving Visual Madlibs with Multiple Cues
This paper focuses on answering fill-in-the-blank style multiple choice
questions from the Visual Madlibs dataset. Previous approaches to Visual
Question Answering (VQA) have mainly used generic image features from networks
trained on the ImageNet dataset, despite the wide scope of questions. In
contrast, our approach employs features derived from networks trained for
specialized tasks of scene classification, person activity prediction, and
person and object attribute prediction. We also present a method for selecting
sub-regions of an image that are relevant for evaluating the appropriateness of
a putative answer. Visual features are computed both from the whole image and
from local regions, while sentences are mapped to a common space using a simple
normalized canonical correlation analysis (CCA) model. Our results show a
significant improvement over the previous state of the art, and indicate that
answering different question types benefits from examining a variety of image
cues and carefully choosing informative image sub-regions.
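The "simple normalized CCA model" can be sketched as regularized CCA whose canonical dimensions are rescaled by their correlations raised to a power, as in common normalized-CCA practice; the regularizer and power below are assumed defaults, not values from the paper.

```python
import numpy as np

def normalized_cca(X, Y, dim=128, p=4, reg=1e-4):
    """Regularized CCA between visual features X (N, dx) and sentence
    features Y (N, dy); each canonical dimension is scaled by its
    correlation raised to the power p (the 'normalized' part)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Cxx = X.T @ X / len(X) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / len(Y) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)
    # Whiten each view, then take the SVD of the cross-covariance.
    inv_sqrt = lambda C: np.linalg.inv(np.linalg.cholesky(C)).T
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    A = Wx @ U[:, :dim] * (s[:dim] ** p)      # image-side projection
    B = Wy @ Vt.T[:, :dim] * (s[:dim] ** p)   # sentence-side projection
    return A, B
```

Candidate answers can then be ranked by cosine similarity between the projected views `X @ A` and `Y @ B`.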
Bias Mimicking: A Simple Sampling Approach for Bias Mitigation
Prior work has shown that Visual Recognition datasets frequently
underrepresent bias groups (e.g., Female) within class labels (e.g.,
Programmers). This dataset bias can lead to models that learn spurious
correlations between class labels and bias groups such as age, gender, or race.
Most recent methods that address this problem require significant architectural
changes or additional loss functions that demand extra hyper-parameter tuning.
Alternatively, data-sampling baselines from the class-imbalance literature
(e.g., Undersampling, Upweighting), which can often be implemented in a single
line of code and typically have no hyperparameters, offer a cheaper and more
efficient solution. However, these methods suffer from notable shortcomings.
For example, Undersampling drops a significant part of the input distribution per
epoch while Oversampling repeats samples, causing overfitting. To address these
shortcomings, we introduce a new class-conditioned sampling method: Bias
Mimicking (BM). The method is based on the observation that if a class $c$'s
bias distribution, i.e. $P_D(B \mid Y = c)$, is mimicked across every other
class $c' \neq c$, then the label $Y$ and the bias attribute $B$ are
statistically independent. Using this notion, BM, through
a novel training procedure, ensures that the model is exposed to the entire
distribution per epoch without repeating samples. Consequently, Bias Mimicking
improves the underrepresented-group accuracy of sampling methods by 3% across
four benchmarks while maintaining, and sometimes improving, performance
relative to nonsampling methods. Code: https://github.com/mqraitem/Bias-Mimicking
A Unified Framework for Connecting Noise Modeling to Boost Noise Detection
Noisy labels can impair model performance, making the study of learning with
noisy labels an important topic. Two conventional approaches are noise modeling
and noise detection. However, these two methods are typically studied
independently, and there has been limited work on their collaboration. In this
work, we explore the integration of these two approaches, proposing an
interconnected structure with three crucial blocks: noise modeling, source
knowledge identification, and enhanced noise detection using noise
source-knowledge-integration methods. This collaborative structure offers
advantages such as discriminating hard negatives and preserving genuinely clean
labels that might otherwise be flagged as suspiciously noisy. Our experiments on four datasets,
featuring three types of noise and different combinations of each block,
demonstrate the efficacy of these components' collaboration. Methods built on
our collaborative structure achieve up to a 10% increase in top-1
classification accuracy on datasets with synthesized noise and 3-5% on
real-world noisy datasets. The
results also suggest that these components make distinct contributions to
overall performance across various noise scenarios. These findings provide
valuable insights for designing noisy label learning methods customized for
specific noise scenarios in the future. Our code is publicly available.
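The abstract describes the three blocks only at a high level, so the following is a deliberately schematic illustration of how a modeled noise-transition matrix could confirm or veto a loss-based detector, e.g. preserving high-loss samples whose observed labels the noise model considers implausible corruptions. The threshold and decision rule are assumptions, not the paper's algorithm.

```python
import numpy as np

def detect_noisy(probs, labels, T, loss_quantile=0.7):
    """Toy collaboration of noise modeling and noise detection.

    probs:  (N, C) model softmax outputs; labels: (N,) observed labels;
    T:      (C, C) estimated noise transition matrix,
            T[i, j] = P(observed = j | true = i).
    Flag a sample as noisy only when its loss is high AND the noise model
    agrees the observed label is a plausible corruption of the predicted
    class; high-loss samples the model deems implausible corruptions are
    kept as hard-but-clean examples."""
    n = len(labels)
    loss = -np.log(probs[np.arange(n), labels] + 1e-12)  # cross-entropy
    high_loss = loss > np.quantile(loss, loss_quantile)
    pred = probs.argmax(1)
    plausible_noise = T[pred, labels] > 1.0 / T.shape[1]  # above uniform
    return high_loss & plausible_noise
```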
Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News
Large-scale dissemination of disinformation online intended to mislead or
deceive the general population is a major societal problem. Rapid progression
in image, video, and natural language generative models has only exacerbated
this situation and intensified our need for an effective defense mechanism.
While existing approaches have been proposed to defend against neural fake
news, they are generally constrained to the very limited setting where articles
only have text and metadata such as the title and authors. In this paper, we
introduce the more realistic and challenging task of defending against
machine-generated news that also includes images and captions. To identify the
possible weaknesses that adversaries can exploit, we create the NeuralNews
dataset, composed of four different types of generated articles, and conduct
a series of human user-study experiments based on this dataset. In addition to
the valuable insights gleaned from our user study experiments, we provide a
relatively effective approach based on detecting visual-semantic
inconsistencies, which will serve as an effective first line of defense and a
useful reference for future work in defending against machine-generated
disinformation.
Comment: Accepted at EMNLP 2020
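The abstract does not specify the detector's architecture, but the underlying signal, agreement between an article's images and its text, can be approximated with an off-the-shelf joint image-text encoder. A minimal sketch using the open_clip package (an assumption; any joint encoder would do), not the paper's detector:

```python
import torch
from PIL import Image
import open_clip  # assumed installed as open_clip_torch

def consistency_score(image_path, caption):
    """Score visual-semantic consistency of an article image and its
    caption; low scores hint at cross-modal mismatches of the kind a
    machine-generated article might exhibit."""
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="openai")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_f = model.encode_image(image)
        txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f * txt_f).sum().item()  # cosine similarity in [-1, 1]
```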
From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Bias
Visual recognition models are prone to learning spurious correlations induced
by a biased training set where certain conditions (e.g., Indoors) are
over-represented in certain classes (e.g., Big Dogs). Synthetic data from
generative models offers a promising direction to mitigate this issue by
augmenting underrepresented conditions in the real dataset. However, this
introduces another potential source of bias from generative model artifacts in
the synthetic data. Indeed, as we will show, prior work uses synthetic data to
resolve the model's bias toward the bias attribute $B$, but it does not
correct the model's bias toward the pair $(B, G)$, where $G$ denotes whether
the sample is real or synthetic. Thus, the model could simply learn signals
based on the pair $(B, G)$ (e.g., Synthetic Indoors) to make predictions about
the class label $Y$ (e.g., Big Dogs). To
address this issue, we propose a two-step training pipeline that we call From
Fake to Real (FFR). The first step of FFR pre-trains a model on balanced
synthetic data to learn robust representations across subgroups. In the second
step, FFR fine-tunes the model on real data using ERM or common loss-based bias
mitigation methods. By training on real and synthetic data separately, FFR
avoids the issue of bias toward signals from the pair $(B, G)$. In other words,
synthetic data in the first step provides effective unbiased representations
that boost performance in the second step. Indeed, our analysis of a high-bias
setting (99.9%) shows that FFR improves performance over the state of the art
by 7-14% across three datasets (CelebA, UTK-Face, and SpuCO Animals).
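The two-step pipeline itself is simple to express: pretrain on subgroup-balanced synthetic data, then fine-tune on real data, never mixing the two sources in one batch so the model cannot exploit real-vs-synthetic signals. A minimal sketch assuming standard PyTorch data loaders and plain ERM fine-tuning; epoch counts and the optimizer are illustrative, and the paper also allows loss-based bias-mitigation objectives in the second step.

```python
import torch

def ffr_train(model, synthetic_loader, real_loader, epochs=(5, 10), lr=1e-4):
    """Two-step pipeline in the spirit of FFR: (1) pretrain on synthetic
    data balanced across (class, bias-group) subgroups, (2) fine-tune on
    real data with ERM. The two sources are trained on separately, never
    mixed within a step."""
    criterion = torch.nn.CrossEntropyLoss()
    for loader, n_epochs in [(synthetic_loader, epochs[0]),
                             (real_loader, epochs[1])]:
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        model.train()
        for _ in range(n_epochs):
            for images, labels in loader:
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()
    return model
```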
Show and Write: Entity-aware Article Generation with Image Information
Many vision-language applications contain long articles of text paired with
images (e.g., news or Wikipedia articles). Prior work learning to encode and/or
generate these articles has primarily focused on understanding the article
itself and some related metadata like the title or date it was written.
However, the images and their captions or alt-text often contain crucial
information, such as named entities, that is difficult for language models to
recognize and predict correctly. To address this shortcoming, this
paper introduces an ENtity-aware article Generation method with Image
iNformation, ENGIN, to incorporate an article's image information into language
models. ENGIN represents articles conditioned both on the metadata used by
prior work and on information such as captions and named entities extracted
from images. Our key contribution is a novel Entity-aware mechanism to help our
model better recognize and predict the entity names in articles. We perform
experiments on three public datasets, GoodNews, VisualNews, and WikiText.
Quantitative results show that our approach improves generated article
perplexity by 4-5 points over the base models. Qualitative results demonstrate
the text generated by ENGIN is more consistent with embedded article images. We
also perform article quality annotation experiments on the generated articles
to validate that our model produces higher-quality articles. Finally, we
investigate the effect ENGIN has on methods that automatically detect
machine-generated articles.
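ENGIN's Entity-aware mechanism lives inside the model, but the input-side idea, conditioning generation on captions and named entities extracted from an article's images, can be approximated by prompt construction with an off-the-shelf causal LM. The sketch below uses an assumed prompt format and base model and is a stand-in, not ENGIN itself.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_article(headline, captions, entities, model_name="gpt2"):
    """Condition a language model on image-derived information by
    prepending image captions and named entities to the prompt."""
    tok = AutoTokenizer.from_pretrained(model_name)
    lm = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = (f"Headline: {headline}\n"
              f"Entities: {', '.join(entities)}\n"
              f"Captions: {' '.join(captions)}\n"
              f"Article:")
    ids = tok(prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=200, do_sample=True, top_p=0.9,
                      pad_token_id=tok.eos_token_id)
    # Return only the newly generated continuation.
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
```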