An Empirical Evaluation of Visual Question Answering for Novel Objects
We study the problem of answering questions about images in the harder
setting, where the test questions and corresponding images contain novel
objects, which were not queried about in the training data. Such a setting is
inevitable in the real world: owing to the heavy-tailed distribution of visual
categories, some objects will not be annotated in the
training set. We show that the performance of two popular existing methods drops
significantly (up to 28%) when evaluated on novel objects compared with known objects. We
propose methods which use large existing external corpora of (i) unlabeled
text, i.e. books, and (ii) images tagged with classes, to achieve novel object
based visual question answering. We do systematic empirical studies, for both
an oracle case where the novel objects are known textually, as well as a fully
automatic case without any explicit knowledge of the novel objects, but with
the minimal assumption that the novel objects are semantically related to the
existing objects in training. The proposed methods for novel object based
visual question answering are modular and can potentially be used with many
visual question answering architectures. We show consistent improvements with
the two popular architectures and give qualitative analysis of the cases where
the model does well and of those where it fails to bring improvements.
Comment: 11 pages, 4 figures, accepted in CVPR 2017 (poster)
Detection-by-Localization: Maintenance-Free Change Object Detector
Recent research demonstrates that self-localization performance is a very
useful measure of likelihood-of-change (LoC) for change detection. In this
paper, this "detection-by-localization" scheme is studied in a novel
generalized task of object-level change detection. In our framework, a given
query image is segmented into object-level subimages (termed "scene parts"),
which are then converted to subimage-level pixel-wise LoC maps via the
detection-by-localization scheme. Our approach models a self-localization
system as a ranking function, outputting a ranked list of reference images,
without requiring relevance scores. Thanks to this new setting, we can
generalize our approach to a broad class of self-localization systems. Our
ranking-based self-localization model allows us to fuse self-localization results
from different modalities via an unsupervised rank fusion derived from the field
of multi-modal information retrieval (MMR).
Comment: 7 pages, 3 figures, Technical report
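
The abstract above describes fusing ranked lists of reference images from several modalities using only rank positions, with no relevance scores. It does not name the fusion rule, so the following is only a minimal sketch using reciprocal rank fusion, one common unsupervised choice. The function name, the constant `k=60`, and the reference-image IDs are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists (best item first) using rank positions only.

    Assumption: reciprocal rank fusion stands in for the unspecified
    "unsupervised rank fusion"; k is a smoothing constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for position, item in enumerate(ranking, start=1):
            # An item near the top of any list contributes a larger score.
            scores[item] += 1.0 / (k + position)
    # Highest fused score first; ties are broken by item name.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical reference-image IDs ranked by two self-localization modalities.
visual_ranking = ["ref_07", "ref_02", "ref_11"]
depth_ranking = ["ref_07", "ref_05", "ref_02"]
fused = reciprocal_rank_fusion([visual_ranking, depth_ranking])
```

Because only positions are used, any self-localization system that returns an ordered list can take part in the fusion, which is the property the abstract relies on.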
ViTac: Feature Sharing between Vision and Tactile Sensing for Cloth Texture Recognition
Vision and touch are two of the important sensing modalities for humans and they offer complementary information for sensing the environment. Robots could also benefit from such multi-modal sensing ability. In this paper, addressing for the first time (to the best of our knowledge) texture recognition from tactile images and vision, we propose a new fusion method named Deep Maximum Covariance Analysis (DMCA) to learn a joint latent space for sharing features through vision and tactile sensing. The features of camera images and tactile data acquired from a GelSight sensor are learned by deep neural networks. However, the learned features are high-dimensional and redundant due to the differences between the two sensing modalities, which degrades perception performance. To address this, the learned features are paired using maximum covariance analysis. Results of the algorithm on a newly collected dataset of paired visual and tactile data relating to cloth textures show that recognition performance above 90% can be achieved by using the proposed DMCA framework. In addition, we find that the perception performance of either vision or tactile sensing can be improved by employing the shared representation space, compared to learning from unimodal data.
Learnable PINs: Cross-Modal Embeddings for Person Identity
We propose and investigate an identity sensitive joint embedding of face and
voice. Such an embedding enables cross-modal retrieval from voice to face and
from face to voice. We make the following four contributions: first, we show
that the embedding can be learnt from videos of talking faces, without
requiring any identity labels, using a form of cross-modal self-supervision;
second, we develop a curriculum learning schedule for hard negative mining
targeted to this task, that is essential for learning to proceed successfully;
third, we demonstrate and evaluate cross-modal retrieval for identities unseen
and unheard during training over a number of scenarios and establish a
benchmark for this novel task; finally, we show an application of using the
joint embedding for automatically retrieving and labelling characters in TV
dramas.
Comment: To appear in ECCV 201
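
The abstract above mentions a curriculum learning schedule for hard negative mining but gives no details. As a rough illustration of the general idea only, the sketch below picks a negative whose difficulty grows as training proceeds. The linear ramp, the `start` and `end` values, and both function names are assumptions, not the schedule used in the paper.

```python
def hardness_schedule(epoch, total_epochs, start=0.3, end=0.8):
    """Linearly ramp the target hardness from start to end (assumed schedule)."""
    if total_epochs <= 1:
        return end
    fraction = epoch / (total_epochs - 1)
    return start + (end - start) * fraction


def pick_negative(distances, hardness):
    """Return the index of a negative sample at the requested hardness.

    distances[i] is the embedding distance from the anchor to negative i.
    hardness 0.0 selects the easiest (farthest) negative,
    hardness 1.0 selects the hardest (closest) negative.
    """
    if not 0.0 <= hardness <= 1.0:
        raise ValueError("hardness must lie in [0, 1]")
    # Order candidate indices from easiest (largest distance) to hardest.
    order = sorted(range(len(distances)), key=lambda i: -distances[i])
    return order[int(hardness * (len(order) - 1))]


# Hypothetical anchor-to-negative distances for four candidate negatives.
distances = [0.9, 0.1, 0.5, 0.7]
early_choice = pick_negative(distances, hardness_schedule(0, 10))
late_choice = pick_negative(distances, 1.0)
```

Starting with easier negatives and moving toward harder ones is the usual motivation for such a curriculum: very hard negatives early in training can prevent the embedding from learning at all.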
FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning
Multimodal tasks in the fashion domain have significant potential for
e-commerce, but involve challenging vision-and-language learning problems -
e.g., retrieving a fashion item given a reference image plus text feedback from
a user. Prior works on multimodal fashion tasks have either been limited by the
data in individual benchmarks, or have leveraged generic vision-and-language
pre-training but have not taken advantage of the characteristics of fashion
data. Additionally, these works have mainly been restricted to multimodal
understanding tasks. To address these gaps, we make two key contributions.
First, we propose a novel fashion-specific pre-training framework based on
weakly-supervised triplets constructed from fashion image-text pairs. We show
the triplet-based tasks are an effective addition to standard multimodal
pre-training tasks. Second, we propose a flexible decoder-based model
architecture capable of both fashion retrieval and captioning tasks. Together,
our model design and pre-training approach are competitive on a diverse set of
fashion tasks, including cross-modal retrieval, image retrieval with text
feedback, image captioning, relative image captioning, and multimodal
categorization.
Comment: 14 pages, 4 figures. To appear at Conference on Empirical Methods in
Natural Language Processing (EMNLP) 202
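
The abstract above builds pre-training tasks on weakly-supervised triplets but does not state the training objective. One common way to learn from (anchor, positive, negative) triplets is a margin loss, sketched below purely as an assumption. The function names, the margin value, and the toy 2-D embeddings are illustrative and not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by margin."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D embeddings (hypothetical values).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
easy_loss = triplet_margin_loss(anchor, positive, [3.0, 4.0])  # negative far away
hard_loss = triplet_margin_loss(anchor, positive, [1.0, 0.0])  # negative as close as positive
```

The loss vanishes for the far negative and stays positive for the close one, so only informative triplets contribute a gradient.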