19 research outputs found
Annotating Object Instances with a Polygon-RNN
We propose an approach for semi-automatic annotation of object instances.
While most current methods treat object segmentation as a pixel-labeling
problem, we here cast it as a polygon prediction task, mimicking how most
current datasets have been annotated. In particular, our approach takes as
input an image crop and sequentially produces vertices of the polygon outlining
the object. This allows a human annotator to interfere at any time and correct
a vertex if needed, producing as accurate segmentation as desired by the
annotator. We show that our approach speeds up the annotation process by a
factor of 4.7 across all classes in Cityscapes, while achieving 78.4% agreement
in IoU with original ground-truth, matching the typical agreement between human
annotators. For cars, our speed-up factor is 7.3 for an agreement of 82.2%. We
further show generalization capabilities of our approach to unseen datasets.
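As a rough illustration of the sequential, human-correctable annotation loop described in this abstract, the sketch below assumes hypothetical predict_next_vertex and ask_annotator callbacks; it is not the authors' Polygon-RNN implementation.

# Illustrative sketch of the human-in-the-loop polygon annotation loop
# described above. The model and annotator interfaces are assumed
# placeholders, not the authors' actual implementation.
from typing import Callable, List, Optional, Tuple

Vertex = Tuple[int, int]

def annotate_instance(
    image_crop,
    predict_next_vertex: Callable[[object, List[Vertex]], Optional[Vertex]],
    ask_annotator: Callable[[Vertex], Optional[Vertex]],
    max_vertices: int = 60,
) -> List[Vertex]:
    """Predict polygon vertices one at a time, letting a human override any of them."""
    polygon: List[Vertex] = []
    for _ in range(max_vertices):
        proposal = predict_next_vertex(image_crop, polygon)
        if proposal is None:  # model signals that the polygon is closed
            break
        correction = ask_annotator(proposal)  # None means "accept the proposal"
        polygon.append(correction if correction is not None else proposal)
    return polygon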
Learning Aligned Cross-Modal Representations from Weakly Aligned Data
People can recognize scenes across many different modalities beyond natural
images. In this paper, we investigate how to learn cross-modal scene
representations that transfer across modalities. To study this problem, we
introduce a new cross-modal scene dataset. While convolutional neural networks
can categorize cross-modal scenes well, they also learn an intermediate
representation not aligned across modalities, which is undesirable for
cross-modal transfer applications. We present methods to regularize cross-modal
convolutional neural networks so that they have a shared representation that is
agnostic of the modality. Our experiments suggest that our scene representation
can help transfer representations across modalities for retrieval. Moreover,
our visualizations suggest that units emerge in the shared representation that
tend to activate on consistent concepts independently of the modality.
Comment: Conference paper at CVPR 2016
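One simple way to picture "regularize cross-modal convolutional neural networks so that they have a shared representation that is agnostic of the modality" is a penalty that aligns per-modality feature statistics. The PyTorch-style sketch below is an assumed illustration of that idea, not the paper's exact regularizer.

# Illustrative alignment penalty that pulls per-modality feature statistics
# toward a common target. This is an assumption for illustration only, not
# the regularization method proposed in the paper.
import torch

def modality_alignment_penalty(features_by_modality):
    """features_by_modality: list of [batch_i, dim] tensors, one per modality encoder."""
    means = [f.mean(dim=0) for f in features_by_modality]
    stds = [f.std(dim=0) for f in features_by_modality]
    target_mean = torch.stack(means).mean(dim=0)
    target_std = torch.stack(stds).mean(dim=0)
    penalty = sum(
        (m - target_mean).pow(2).sum() + (s - target_std).pow(2).sum()
        for m, s in zip(means, stds)
    )
    return penalty / len(features_by_modality)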
How (not) to ensemble LVLMs for VQA
This paper studies ensembling in the era of Large Vision-Language Models
(LVLMs). Ensembling is a classical method to combine different models to get
increased performance. In the recent work on Encyclopedic-VQA the authors
examine a wide variety of models to solve their task: from vanilla LVLMs, to
models including the caption as extra context, to models augmented with
Lens-based retrieval of Wikipedia pages. Intuitively these models are highly
complementary, which should make them ideal for ensembling. Indeed, an oracle
experiment shows potential gains from 48.8% accuracy (the best single model)
all the way up to 67% (best possible ensemble). So it is a trivial exercise to
create an ensemble with substantial real gains. Or is it?
Comment: 4th I Can't Believe It's Not Better Workshop (co-located with NeurIPS 2023)
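The oracle experiment mentioned above corresponds to counting a question as solved if any model in the pool answers it correctly; a minimal sketch, with an assumed is_correct predicate standing in for the benchmark's answer-matching metric, is:

# Oracle-ensemble upper bound: a question counts as correct if at least one
# model in the pool answers it correctly. The is_correct predicate is an
# assumed placeholder, not the benchmark's official scorer.
def oracle_ensemble_accuracy(predictions_per_model, ground_truth, is_correct):
    """predictions_per_model: one list of predicted answers per model."""
    total = len(ground_truth)
    hits = sum(
        any(is_correct(preds[i], ground_truth[i]) for preds in predictions_per_model)
        for i in range(total)
    )
    return hits / total

# Example with a toy exact-match metric:
# acc = oracle_ensemble_accuracy(model_answers, answers,
#                                lambda p, g: p.strip().lower() == g.strip().lower())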
Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories
We propose Encyclopedic-VQA, a large-scale visual question answering (VQA)
dataset featuring visual questions about detailed properties of fine-grained
categories and instances. It contains 221k unique question+answer pairs each
matched with (up to) 5 images, resulting in a total of 1M VQA samples.
Moreover, our dataset comes with a controlled knowledge base derived from
Wikipedia, marking the evidence to support each answer. Empirically, we show
that our dataset poses a hard challenge for large vision+language models as
they perform poorly on our dataset: PaLI [14] is state-of-the-art on OK-VQA
[37], yet it only achieves 13.0% accuracy on our dataset. Moreover, we
experimentally show that progress on answering our encyclopedic questions can
be achieved by augmenting large models with a mechanism that retrieves relevant
information from the knowledge base. An oracle experiment with perfect
retrieval achieves 87.0% accuracy on the single-hop portion of our dataset, and
an automatic retrieval-augmented prototype yields 48.8%. We believe that our
dataset enables future research on retrieval-augmented vision+language models.
It is available at
https://github.com/google-research/google-research/tree/master/encyclopedic_vqa.
Comment: ICCV'23
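The retrieval-augmented prototype mentioned above can be pictured roughly as follows; the retriever and vision+language model interfaces here are assumed placeholders, not the paper's actual components.

# Rough sketch of retrieval-augmented VQA: retrieve knowledge-base passages
# for the (image, question) pair and pass them as extra context to a
# vision+language model. All interfaces are assumed placeholders.
def answer_with_retrieval(image, question, retriever, vlm, top_k=1):
    passages = retriever.search(image, question, top_k=top_k)  # e.g., Wikipedia snippets
    context = "\n".join(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return vlm.generate(image, prompt)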
Cross-Modal Scene Networks
People can recognize scenes across many different modalities beyond natural images. In this paper, we investigate how to learn cross-modal scene representations that transfer across modalities. To study this problem, we introduce a new cross-modal scene dataset. While convolutional neural networks can categorize scenes well, they also learn an intermediate representation not aligned across modalities, which is undesirable for cross-modal transfer applications. We present methods to regularize cross-modal convolutional neural networks so that they have a shared representation that is agnostic of the modality. Our experiments suggest that our scene representation can help transfer representations across modalities for retrieval. Moreover, our visualizations suggest that units emerge in the shared representation that tend to activate on consistent concepts independently of the modality.
This work was supported by NSF grant IIS-1524817, by a Google faculty research award to A.T. and by a Google Ph.D. fellowship to C.V.
https://ieeexplore.ieee.org/abstract/document/803921