382 research outputs found
Fine-grained Incident Video Retrieval with Video Similarity Learning.
PhD ThesesIn this thesis, we address the problem of Fine-grained Incident Video Retrieval (FIVR)
using video similarity learning methods. FIVR is a video retrieval task that aims to
retrieve all videos that depict the same incident given a query video { related video
retrieval tasks adopt either very narrow or very broad scopes, considering only nearduplicate
or same event videos. To formulate the case of same incident videos, we
de ne three video associations taking into account the spatio-temporal spans captured
by video pairs. To cover the benchmarking needs of FIVR, we construct a large-scale
dataset, called FIVR-200K, consisting of 225,960 YouTube videos from major news
events crawled from Wikipedia. The dataset contains four annotation labels according
to FIVR de nitions; hence, it can simulate several retrieval scenarios with the same
video corpus. To address FIVR, we propose two video-level approaches leveraging
features extracted from intermediate layers of Convolutional Neural Networks (CNN).
The rst is an unsupervised method that relies on a modi ed Bag-of-Word scheme,
which generates video representations from the aggregation of the frame descriptors
based on learned visual codebooks. The second is a supervised method based on Deep
Metric Learning, which learns an embedding function that maps videos in a feature
space where relevant video pairs are closer than the irrelevant ones. However, videolevel
approaches generate global video representations, losing all spatial and temporal
relations between compared videos. Therefore, we propose a video similarity learning
approach that captures ne-grained relations between videos for accurate similarity
calculation. We train a CNN architecture to compute video-to-video similarity from
re ned frame-to-frame similarity matrices derived from a pairwise region-level similarity
function. The proposed approaches have been extensively evaluated on FIVR-
200K and other large-scale datasets, demonstrating their superiority over other video
retrieval methods and highlighting the challenging aspect of the FIVR problem
FIVR: Fine-Grained Incident Video Retrieval
This paper introduces the problem of Fine-grained Incident Video Retrieval
(FIVR). Given a query video, the objective is to retrieve all associated
videos, considering several types of associations that range from duplicate
videos to videos from the same incident. FIVR offers a single framework that
contains several retrieval tasks as special cases. To address the benchmarking
needs of all such tasks, we construct and present a large-scale annotated video
dataset, which we call FIVR-200K, and it comprises 225,960 videos. To create
the dataset, we devise a process for the collection of YouTube videos based on
major news events from recent years crawled from Wikipedia and deploy a
retrieval pipeline for the automatic selection of query videos based on their
estimated suitability as benchmarks. We also devise a protocol for the
annotation of the dataset with respect to the four types of video associations
defined by FIVR. Finally, we report the results of an experimental study on the
dataset comparing five state-of-the-art methods developed based on a variety of
visual descriptors, highlighting the challenges of the current problem
DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval
In this paper, we address the problem of high performance and computationally
efficient content-based video retrieval in large-scale datasets. Current
methods typically propose either: (i) fine-grained approaches employing
spatio-temporal representations and similarity calculations, achieving high
performance at a high computational cost or (ii) coarse-grained approaches
representing/indexing videos as global vectors, where the spatio-temporal
structure is lost, providing low performance but also having low computational
cost. In this work, we propose a Knowledge Distillation framework, which we
call Distill-and-Select (DnS), that starting from a well-performing
fine-grained Teacher Network learns: a) Student Networks at different retrieval
performance and computational efficiency trade-offs and b) a Selection Network
that at test time rapidly directs samples to the appropriate student to
maintain both high retrieval performance and high computational efficiency. We
train several students with different architectures and arrive at different
trade-offs of performance and efficiency, i.e., speed and storage requirements,
including fine-grained students that store index videos using binary
representations. Importantly, the proposed scheme allows Knowledge Distillation
in large, unlabelled datasets -- this leads to good students. We evaluate DnS
on five public datasets on three different video retrieval tasks and
demonstrate a) that our students achieve state-of-the-art performance in
several cases and b) that our DnS framework provides an excellent trade-off
between retrieval performance, computational speed, and storage space. In
specific configurations, our method achieves similar mAP with the teacher but
is 20 times faster and requires 240 times less storage space. Our collected
dataset and implementation are publicly available:
https://github.com/mever-team/distill-and-select
Video Infringement Detection via Feature Disentanglement and Mutual Information Maximization
The self-media era provides us tremendous high quality videos. Unfortunately,
frequent video copyright infringements are now seriously damaging the interests
and enthusiasm of video creators. Identifying infringing videos is therefore a
compelling task. Current state-of-the-art methods tend to simply feed
high-dimensional mixed video features into deep neural networks and count on
the networks to extract useful representations. Despite its simplicity, this
paradigm heavily relies on the original entangled features and lacks
constraints guaranteeing that useful task-relevant semantics are extracted from
the features.
In this paper, we seek to tackle the above challenges from two aspects: (1)
We propose to disentangle an original high-dimensional feature into multiple
sub-features, explicitly disentangling the feature into exclusive
lower-dimensional components. We expect the sub-features to encode
non-overlapping semantics of the original feature and remove redundant
information.
(2) On top of the disentangled sub-features, we further learn an auxiliary
feature to enhance the sub-features. We theoretically analyzed the mutual
information between the label and the disentangled features, arriving at a loss
that maximizes the extraction of task-relevant information from the original
feature.
Extensive experiments on two large-scale benchmark datasets (i.e., SVD and
VCSL) demonstrate that our method achieves 90.1% TOP-100 mAP on the large-scale
SVD dataset and also sets the new state-of-the-art on the VCSL benchmark
dataset. Our code and model have been released at
https://github.com/yyyooooo/DMI/, hoping to contribute to the community.Comment: This paper is accepted by ACM MM 202
Let All be Whitened: Multi-teacher Distillation for Efficient Visual Retrieval
Visual retrieval aims to search for the most relevant visual items, e.g.,
images and videos, from a candidate gallery with a given query item. Accuracy
and efficiency are two competing objectives in retrieval tasks. Instead of
crafting a new method pursuing further improvement on accuracy, in this paper
we propose a multi-teacher distillation framework Whiten-MTD, which is able to
transfer knowledge from off-the-shelf pre-trained retrieval models to a
lightweight student model for efficient visual retrieval. Furthermore, we
discover that the similarities obtained by different retrieval models are
diversified and incommensurable, which makes it challenging to jointly distill
knowledge from multiple models. Therefore, we propose to whiten the output of
teacher models before fusion, which enables effective multi-teacher
distillation for retrieval models. Whiten-MTD is conceptually simple and
practically effective. Extensive experiments on two landmark image retrieval
datasets and one video retrieval dataset demonstrate the effectiveness of our
proposed method, and its good balance of retrieval performance and efficiency.
Our source code is released at https://github.com/Maryeon/whiten_mtd.Comment: Accepted by AAAI 202
- …