12,151 research outputs found
Unified Embedding and Metric Learning for Zero-Exemplar Event Detection
Event detection in unconstrained videos is conceived as a content-based video
retrieval with two modalities: textual and visual. Given a text describing a
novel event, the goal is to rank related videos accordingly. This task is
zero-exemplar, no video examples are given to the novel event.
Related works train a bank of concept detectors on external data sources.
These detectors predict confidence scores for test videos, which are ranked and
retrieved accordingly. In contrast, we learn a joint space in which the visual
and textual representations are embedded. The space casts a novel event as a
probability of pre-defined events. Also, it learns to measure the distance
between an event and its related videos.
Our model is trained end-to-end on publicly available EventNet. When applied
to TRECVID Multimedia Event Detection dataset, it outperforms the
state-of-the-art by a considerable margin.Comment: IEEE CVPR 201
Cycle-Consistent Deep Generative Hashing for Cross-Modal Retrieval
In this paper, we propose a novel deep generative approach to cross-modal
retrieval to learn hash functions in the absence of paired training samples
through the cycle consistency loss. Our proposed approach employs adversarial
training scheme to lean a couple of hash functions enabling translation between
modalities while assuming the underlying semantic relationship. To induce the
hash codes with semantics to the input-output pair, cycle consistency loss is
further proposed upon the adversarial training to strengthen the correlations
between inputs and corresponding outputs. Our approach is generative to learn
hash functions such that the learned hash codes can maximally correlate each
input-output correspondence, meanwhile can also regenerate the inputs so as to
minimize the information loss. The learning to hash embedding is thus performed
to jointly optimize the parameters of the hash functions across modalities as
well as the associated generative models. Extensive experiments on a variety of
large-scale cross-modal data sets demonstrate that our proposed method achieves
better retrieval results than the state-of-the-arts.Comment: To appeared on IEEE Trans. Image Processing. arXiv admin note: text
overlap with arXiv:1703.10593 by other author
Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval
In this paper we tackle the cross-modal video retrieval problem and, more
specifically, we focus on text-to-video retrieval. We investigate how to
optimally combine multiple diverse textual and visual features into feature
pairs that lead to generating multiple joint feature spaces, which encode
text-video pairs into comparable representations. To learn these
representations our proposed network architecture is trained by following a
multiple space learning procedure. Moreover, at the retrieval stage, we
introduce additional softmax operations for revising the inferred query-video
similarities. Extensive experiments in several setups based on three
large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to
best combine text-visual features and document the performance of the proposed
network. Source code is made publicly available at:
https://github.com/bmezaris/TextToVideoRetrieval-TtimesVComment: Accepted for publication; to be included in Proc. ECCV Workshops
2022. The version posted here is the "submitted manuscript" versio
- …