Scene Graph Embeddings Using Relative Similarity Supervision
Scene graphs are a powerful structured representation of the underlying
content of images, and embeddings derived from them have been shown to be
useful in multiple downstream tasks. In this work, we employ a graph
convolutional network to exploit structure in scene graphs and produce image
embeddings useful for semantic image retrieval. Different from
classification-centric supervision traditionally available for learning image
representations, we address the task of learning from relative similarity
labels in a ranking context. Rooted within the contrastive learning paradigm,
we propose a novel loss function that operates on pairs of similar and
dissimilar images and imposes relative ordering between them in embedding
space. We demonstrate that this Ranking loss, coupled with an intuitive triple
sampling strategy, leads to robust representations that outperform well-known
contrastive losses on the retrieval task. In addition, we provide qualitative
evidence of how retrieved results that utilize structured scene information
capture the global context of the scene, different from visual similarity
search.
Comment: Accepted to AAAI 202
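The abstract describes the loss only at a high level; as a minimal sketch of the general idea (assuming PyTorch and cosine similarity; this is not the paper's exact formulation, and the triple sampling strategy is omitted), a margin-based ranking loss over (anchor, similar, dissimilar) embedding triples could look like:

    import torch
    import torch.nn.functional as F

    def ranking_loss(anchor, positive, negative, margin=0.2):
        # anchor/positive/negative: (batch, dim) image embeddings, e.g. produced
        # by a graph convolutional network over scene graphs.
        anchor, positive, negative = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
        sim_pos = (anchor * positive).sum(dim=-1)   # similarity to the more-similar image
        sim_neg = (anchor * negative).sum(dim=-1)   # similarity to the less-similar image
        # Penalize triples where the similar image is not ranked above the
        # dissimilar one by at least `margin` in embedding space.
        return F.relu(margin - (sim_pos - sim_neg)).mean()

The margin enforces the relative ordering between the similar and dissimilar image rather than an absolute similarity target, which is the sense in which the supervision is "relative" rather than classification-centric.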
Robustness of Fusion-based Multimodal Classifiers to Cross-Modal Content Dilutions
As multimodal learning finds applications in a wide variety of high-stakes
societal tasks, investigating the robustness of these models becomes important. Existing work
has focused on understanding the robustness of vision-and-language models to
imperceptible variations on benchmark tasks. In this work, we investigate the
robustness of multimodal classifiers to cross-modal dilutions - a plausible
variation. We develop a model that, given a multimodal (image + text) input,
generates additional dilution text that (a) maintains relevance and topical
coherence with the image and existing text, and (b) when added to the original
text, leads to misclassification of the multimodal input. Via experiments on
Crisis Humanitarianism and Sentiment Detection tasks, we find that the
performance of task-specific fusion-based multimodal classifiers drops by 23.3%
and 22.5%, respectively, in the presence of dilutions generated by our model.
Metric-based comparisons with several baselines and human evaluations indicate
that our dilutions show higher relevance and topical coherence, while
simultaneously being more effective at demonstrating the brittleness of the
multimodal classifiers. Our work aims to highlight and encourage further
research on the robustness of deep multimodal models to realistic variations,
especially in human-facing societal applications. The code and other resources
are available at https://claws-lab.github.io/multimodal-robustness/.
Comment: Accepted at the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP); Full Paper (Oral)
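A minimal sketch of the evaluation protocol this implies (the names classifier.predict, generate_dilution, and dataset are illustrative stand-ins, not the released code) could be:

    def accuracy_under_dilution(classifier, dataset, generate_dilution):
        # dataset yields (image, text, label) triples for a multimodal task.
        clean, diluted = 0, 0
        for image, text, label in dataset:
            clean += int(classifier.predict(image, text) == label)
            # Append topically coherent dilution text to the original text.
            diluted_text = text + " " + generate_dilution(image, text)
            diluted += int(classifier.predict(image, diluted_text) == label)
        n = len(dataset)
        # The gap between the two accuracies is the robustness drop under dilution.
        return clean / n, diluted / n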
CyCLIP: Cyclic Contrastive Language-Image Pretraining
Recent advances in contrastive representation learning over paired image-text
data have led to models such as CLIP that achieve state-of-the-art performance
for zero-shot classification and distributional robustness. Such models
typically require joint reasoning in the image and text representation spaces
for downstream inference tasks. Contrary to prior beliefs, we demonstrate that
the image and text representations learned via a standard contrastive objective
are not interchangeable and can lead to inconsistent downstream predictions. To
mitigate this issue, we formalize consistency and propose CyCLIP, a framework
for contrastive representation learning that explicitly optimizes for the
learned representations to be geometrically consistent in the image and text
space. In particular, we show that consistent representations can be learned by
explicitly symmetrizing (a) the similarity between the two mismatched
image-text pairs (cross-modal consistency); and (b) the similarity between the
image-image pair and the text-text pair (in-modal consistency). Empirically, we
show that the improved consistency in CyCLIP translates to significant gains
over CLIP, with gains ranging from 10%-24% for zero-shot classification
accuracy on standard benchmarks (CIFAR-10, CIFAR-100, ImageNet1K) and 10%-27%
for robustness to various natural distribution shifts. The code is available at
https://github.com/goel-shashank/CyCLIP.
Comment: 19 pages, 13 tables, 6 figures, Oral at NeurIPS 202
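As a rough sketch of the two consistency terms (a paraphrase of the idea, assuming L2-normalized batch embeddings; see the repository above for the actual implementation), the regularizers could be written as:

    import torch
    import torch.nn.functional as F

    def cyclic_consistency(image_emb, text_emb):
        # image_emb, text_emb: (batch, dim) embeddings of paired images and texts.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        sim_it = image_emb @ text_emb.t()    # image_j . text_k
        sim_ii = image_emb @ image_emb.t()   # image_j . image_k
        sim_tt = text_emb @ text_emb.t()     # text_j . text_k
        # (a) cross-modal consistency: image_j.text_k should match image_k.text_j
        cross_modal = ((sim_it - sim_it.t()) ** 2).mean()
        # (b) in-modal consistency: image-image similarities should match text-text
        in_modal = ((sim_ii - sim_tt) ** 2).mean()
        return cross_modal, in_modal   # weighted and added to the standard CLIP loss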
On Aggregating Labels from Multiple Crowd Workers to Infer Relevance of Documents
We consider the problem of acquiring relevance judgements for information retrieval (IR) test collections through crowdsourcing when no true relevance labels are available. We collect multiple, possibly noisy relevance labels per document from workers of unknown labelling accuracy. We use these labels to infer the document relevance based on two methods. The first method is the commonly used majority voting (MV), which determines the document relevance based on the label that received the most votes, treating all the workers equally. The second is a probabilistic model that concurrently estimates the document relevance and the workers' accuracy using expectation maximization (EM). We run simulations and conduct experiments with crowdsourced relevance labels from the INEX 2010 Book Search track to investigate the accuracy and robustness of the relevance assessments to the noisy labels. We observe the effect of the derived relevance judgments on the ranking of the search systems. Our experimental results show that the EM method outperforms the MV method in the accuracy of relevance assessments and in the ranking of IR systems. The performance improvements are especially noticeable when the number of labels per document is small and the labels are of varied quality.
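As a minimal sketch of the two aggregation methods (a simplified, symmetric-accuracy EM in the spirit of Dawid-Skene, not the exact model used in the paper), with labels mapping each document id to a list of (worker_id, 0/1 relevance label) pairs:

    def majority_vote(labels):
        # Label each document with the majority label; ties count as relevant.
        return {d: int(2 * sum(l for _, l in v) >= len(v)) for d, v in labels.items()}

    def em_aggregate(labels, n_iter=50):
        docs = list(labels)
        workers = {w for v in labels.values() for w, _ in v}
        acc = {w: 0.8 for w in workers}   # initial guess of each worker's accuracy
        rel = {d: 0.5 for d in docs}      # P(document is relevant), uniform prior
        for _ in range(n_iter):
            # E-step: posterior relevance of each document given worker accuracies.
            for d in docs:
                p1 = p0 = 1.0
                for w, l in labels[d]:
                    p1 *= acc[w] if l == 1 else 1 - acc[w]
                    p0 *= acc[w] if l == 0 else 1 - acc[w]
                rel[d] = p1 / (p1 + p0)
            # M-step: worker accuracy = expected agreement with the inferred relevance.
            for w in workers:
                agree = total = 0.0
                for d in docs:
                    for w2, l in labels[d]:
                        if w2 == w:
                            agree += rel[d] if l == 1 else 1 - rel[d]
                            total += 1
                acc[w] = agree / total
        return {d: int(rel[d] >= 0.5) for d in docs}

Unlike majority voting, the EM variant down-weights workers whose labels rarely agree with the inferred relevance, which is why its advantage grows when labels are few and of varied quality.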