Multimodal One-Shot Learning of Speech and Images
Imagine a robot is shown new concepts visually together with spoken tags,
e.g. "milk", "eggs", "butter". After seeing one paired audio-visual example per
class, it is shown a new set of unseen instances of these objects, and asked to
pick the "milk". Without receiving any hard labels, could it learn to match the
new continuous speech input to the correct visual instance? Although unimodal
one-shot learning has been studied, where one labelled example in a single
modality is given per class, this example motivates multimodal one-shot
learning. Our main contribution is to formally define this task, and to propose
several baseline and advanced models. We use a dataset of paired spoken and
visual digits to specifically investigate recent advances in Siamese
convolutional neural networks. Our best Siamese model achieves twice the
accuracy of a nearest neighbour model using pixel-distance over images and
dynamic time warping over speech in 11-way cross-modal matching. Comment: 5 pages, 1 figure, 3 tables; accepted to ICASSP 2019
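As a rough illustration of the nearest-neighbour baseline described in the abstract, the sketch below matches a spoken query to a test image via dynamic time warping over speech features and pixel distance over images. The feature shapes, helper names, and the two-step matching procedure are illustrative assumptions, not the authors' code.

```python
# A minimal sketch (assumed interfaces) of the nearest-neighbour baseline:
# DTW over speech features finds the closest support class, then pixel
# distance picks the matching test image for that class.
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping cost between two feature sequences of shape (T1, D) and (T2, D)."""
    T1, T2 = len(a), len(b)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[T1, T2]

def cross_modal_match(query_speech, support, test_images):
    """support: list of (speech_seq, image, class_id), one paired example per class.
    Returns the index of the test image predicted to match the spoken query."""
    # 1) Speech -> class: nearest support speech example under DTW.
    dists = [dtw_distance(query_speech, s) for s, _, _ in support]
    _, support_image, _ = support[int(np.argmin(dists))]
    # 2) Class -> image: test image closest to that class's support image in pixel space.
    pix = [np.linalg.norm(img.ravel() - support_image.ravel()) for img in test_images]
    return int(np.argmin(pix))
```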
Weakly supervised training of universal visual concepts for multi-domain semantic segmentation
Deep supervised models have an unprecedented capacity to absorb large
quantities of training data. Hence, training on multiple datasets becomes a
method of choice towards strong generalization in usual scenes and graceful
performance degradation in edge cases. Unfortunately, different datasets often
have incompatible labels. For instance, the Cityscapes road class subsumes all
driving surfaces, while Vistas defines separate classes for road markings,
manholes etc. Furthermore, many datasets have overlapping labels. For instance,
pickups are labeled as trucks in VIPER, cars in Vistas, and vans in ADE20k. We
address this challenge by considering labels as unions of universal visual
concepts. This allows seamless and principled learning on multi-domain dataset
collections without requiring any relabeling effort. Our method achieves
competitive within-dataset and cross-dataset generalization, as well as the ability
to learn visual concepts which are not separately labeled in any of the
training datasets. Experiments reveal competitive or state-of-the-art
performance on two multi-domain dataset collections and on the WildDash 2
benchmark. Comment: 27 pages, 16 figures, 10 tables
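A minimal sketch of the core idea of treating dataset labels as unions of universal visual concepts: the model predicts universal concepts, and each dataset class's probability is the sum of the probabilities of the concepts it subsumes. The concept names, the mapping, and the loss below are illustrative assumptions, not the authors' implementation.

```python
# Training against coarse dataset labels expressed as unions of universal concepts.
import torch
import torch.nn.functional as F

UNIVERSAL = ["road_surface", "road_marking", "manhole", "car", "van", "pickup"]
# Hypothetical mapping for a Cityscapes-like dataset whose "road" subsumes several concepts.
DATASET_CLASSES = {
    "road": ["road_surface", "road_marking", "manhole"],
    "car": ["car", "van", "pickup"],
}

def dataset_log_probs(universal_logits):
    """universal_logits: (N, C_universal). Returns (N, C_dataset) log-probabilities."""
    p_universal = F.softmax(universal_logits, dim=-1)
    cols = []
    for names in DATASET_CLASSES.values():
        idx = [UNIVERSAL.index(n) for n in names]
        cols.append(p_universal[:, idx].sum(dim=-1))  # union = sum of concept probabilities
    return torch.log(torch.stack(cols, dim=-1) + 1e-12)

# Per-pixel negative log-likelihood against the dataset's coarse labels.
logits = torch.randn(4, len(UNIVERSAL), requires_grad=True)  # 4 pixels
labels = torch.tensor([0, 0, 1, 1])                          # dataset class ids
loss = F.nll_loss(dataset_log_probs(logits), labels)
loss.backward()
```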
Weakly Supervised Open-Vocabulary Object Detection
Although weakly supervised object detection (WSOD) is a promising step
toward avoiding the need for instance-level annotations, its capability is confined to
closed-set categories within a single training dataset. In this paper, we
propose a novel weakly supervised open-vocabulary object detection framework,
namely WSOVOD, to extend traditional WSOD to detect novel concepts and utilize
diverse datasets with only image-level annotations. To achieve this, we explore
three vital strategies, including dataset-level feature adaptation, image-level
salient object localization, and region-level vision-language alignment. First,
we perform data-aware feature extraction to produce an input-conditional
coefficient, which is combined with dataset attribute prototypes to identify
dataset bias and help achieve cross-dataset generalization. Second, a
customized location-oriented weakly supervised region proposal network is
proposed to utilize high-level semantic layouts from the category-agnostic
Segment Anything Model to distinguish object boundaries. Lastly, we introduce a
proposal-concept synchronized multiple-instance network, i.e., object mining
and refinement with visual-semantic alignment, to discover objects matched to
the text embeddings of concepts. Extensive experiments on Pascal VOC and MS
COCO demonstrate that the proposed WSOVOD achieves new state-of-the-art
compared with previous WSOD methods in both closed-set object localization and
detection tasks. Meanwhile, WSOVOD enables cross-dataset and open-vocabulary
learning, achieving performance on par with or even better than well-established
fully-supervised open-vocabulary object detection (FSOVOD). Comment: Accepted by AAAI 2024
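The following is a hedged sketch of region-level vision-language alignment with only image-level supervision, in the spirit of the proposal-concept multiple-instance idea above: region features are scored against text embeddings of concept names and aggregated into image-level scores. The aggregation scheme and all tensor shapes are assumptions for illustration, not the WSOVOD code.

```python
# Image-level multiple-instance loss over region-concept similarity scores.
import torch
import torch.nn.functional as F

def mil_image_loss(region_feats, text_embeds, image_labels):
    """region_feats: (R, D) features of R proposals for one image.
    text_embeds:  (C, D) text embeddings of C concept names (e.g. from a CLIP-style encoder).
    image_labels: (C,) multi-hot image-level labels."""
    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    sim = region_feats @ text_embeds.t()                 # (R, C) region-concept scores
    # Softmax over regions lets each concept focus on its most likely proposals;
    # summing gives image-level evidence (a standard MIL aggregation).
    image_scores = (sim.softmax(dim=0) * sim.sigmoid()).sum(dim=0)  # (C,)
    return F.binary_cross_entropy(image_scores.clamp(0, 1), image_labels)

# Toy usage with random tensors standing in for proposal and text features.
region_feats = torch.randn(20, 512, requires_grad=True)
loss = mil_image_loss(region_feats, torch.randn(5, 512),
                      torch.tensor([1., 0., 1., 0., 0.]))
loss.backward()
```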
Learning Aligned Cross-Modal Representations from Weakly Aligned Data
People can recognize scenes across many different modalities beyond natural
images. In this paper, we investigate how to learn cross-modal scene
representations that transfer across modalities. To study this problem, we
introduce a new cross-modal scene dataset. While convolutional neural networks
can categorize cross-modal scenes well, they also learn an intermediate
representation not aligned across modalities, which is undesirable for
cross-modal transfer applications. We present methods to regularize cross-modal
convolutional neural networks so that they have a shared representation that is
agnostic of the modality. Our experiments suggest that our scene representation
can help transfer representations across modalities for retrieval. Moreover,
our visualizations suggest that units emerge in the shared representation that
tend to activate on consistent concepts independently of the modality. Comment: Conference paper at CVPR 2016
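As a hedged sketch of how such regularization might look, the snippet below pairs modality-specific encoders with a shared classifier and penalizes the gap between the feature statistics of the two modalities. The architecture and the moment-matching penalty are assumptions for illustration, not the paper's exact method.

```python
# Modality-specific encoders, a shared classifier, and a statistics-alignment penalty.
import torch
import torch.nn as nn

class CrossModalNet(nn.Module):
    def __init__(self, dim_a, dim_b, hidden=256, n_classes=10):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())  # e.g. natural images
        self.enc_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())  # e.g. sketches or clip art
        self.classifier = nn.Linear(hidden, n_classes)                   # shared across modalities

    def forward(self, x_a, x_b):
        return self.enc_a(x_a), self.enc_b(x_b)

def alignment_penalty(feat_a, feat_b):
    """Match first and second moments of the two modalities' features."""
    mean_gap = (feat_a.mean(0) - feat_b.mean(0)).pow(2).sum()
    var_gap = (feat_a.var(0) - feat_b.var(0)).pow(2).sum()
    return mean_gap + var_gap

model = CrossModalNet(dim_a=4096, dim_b=4096)
ce = nn.CrossEntropyLoss()
x_a, y_a = torch.randn(8, 4096), torch.randint(0, 10, (8,))
x_b, y_b = torch.randn(8, 4096), torch.randint(0, 10, (8,))
f_a, f_b = model(x_a, x_b)
loss = ce(model.classifier(f_a), y_a) + ce(model.classifier(f_b), y_b) \
       + 0.1 * alignment_penalty(f_a, f_b)
loss.backward()
```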
Multimodal Grounding for Language Processing
This survey discusses how recent developments in multimodal processing
facilitate conceptual grounding of language. We categorize the information flow
in multimodal processing with respect to cognitive models of human information
processing and analyze different methods for combining multimodal
representations. Based on this methodological inventory, we discuss the benefit
of multimodal grounding for a variety of language processing tasks and the
challenges that arise. We particularly focus on multimodal grounding of verbs
which play a crucial role in the compositional power of language. Comment: The paper has been published in the Proceedings of the 27th International Conference on Computational Linguistics. Please refer to this version for citations:
https://www.aclweb.org/anthology/papers/C/C18/C18-1197
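For readers unfamiliar with how multimodal representations are typically combined, the sketch below contrasts two common strategies discussed in surveys of this kind: plain concatenation and a learned gate that decides how much visual information to mix into a word embedding. The dimensions and module names are illustrative assumptions, not any specific method from the paper.

```python
# Two simple ways to fuse a textual and a visual representation of the same concept.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, text_dim, vis_dim, out_dim):
        super().__init__()
        self.proj_text = nn.Linear(text_dim, out_dim)
        self.proj_vis = nn.Linear(vis_dim, out_dim)
        self.gate = nn.Linear(text_dim + vis_dim, out_dim)

    def forward(self, t, v):
        g = torch.sigmoid(self.gate(torch.cat([t, v], dim=-1)))  # per-dimension mixing weight
        return g * self.proj_text(t) + (1 - g) * self.proj_vis(v)

text_vec, vis_vec = torch.randn(1, 300), torch.randn(1, 2048)
concat = torch.cat([text_vec, vis_vec], dim=-1)          # simplest combination: concatenation
fused = GatedFusion(300, 2048, 300)(text_vec, vis_vec)   # learned, gated combination
```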