An overview on the evaluated video retrieval tasks at TRECVID 2022
The TREC Video Retrieval Evaluation (TRECVID) is a TREC-style video analysis
and retrieval evaluation with the goal of promoting progress in research and
development of content-based exploitation and retrieval of information from
digital video via open, tasks-based evaluation supported by metrology. Over the
last twenty-one years this effort has yielded a better understanding of how
systems can effectively accomplish such processing and how one can reliably
benchmark their performance. TRECVID has been funded by NIST (National
Institute of Standards and Technology) and other US government agencies. In
addition, many organizations and individuals worldwide contribute significant
time and effort. TRECVID 2022 planned for the following six tasks: Ad-hoc video
search, Video to text captioning, Disaster scene description and indexing,
Activity in extended videos, Deep video understanding, and Movie summarization.
In total, 35 teams from various research organizations worldwide signed up to
join the evaluation campaign this year. This paper introduces the tasks,
datasets used, evaluation frameworks and metrics, as well as a high-level
results overview.
Comment: arXiv admin note: substantial text overlap with arXiv:2104.13473, arXiv:2009.0998
Long-term Leap Attention, Short-term Periodic Shift for Video Classification
A video transformer naturally incurs a heavier computation burden than a static
vision transformer, as the former processes a sequence $T$ times longer than the
latter's (for $T$ frames of $N$ tokens each) under the current attention of
quadratic complexity $\mathcal{O}(T^2N^2)$. Existing works treat the temporal
axis as a simple extension of the spatial axes, focusing on shortening the
spatio-temporal sequence by either generic pooling or local windowing, without
utilizing temporal redundancy.
However, videos naturally contain redundant information between neighboring
frames; we could therefore potentially suppress attention on visually similar
frames in a dilated manner. Based on this hypothesis, we propose LAPS, a
long-term "Leap Attention" (LA), short-term "Periodic Shift" (P-Shift) module
for video transformers, with $\mathcal{O}(2TN^2)$ complexity. Specifically, the
LA groups long-term frames into pairs, then refactors each discrete pair via
attention. The P-Shift exchanges features between temporal neighbors to counter
the loss of short-term dynamics. By replacing a vanilla 2D attention with LAPS,
we can adapt a static transformer into a video one, with zero extra parameters
and negligible computation overhead (about 2.6%).
Experiments on the standard Kinetics-400 benchmark demonstrate that our LAPS
transformer could achieve competitive performances in terms of accuracy, FLOPs,
and Params among CNN and transformer SOTAs. We open-source our project at
https://github.com/VideoNetworks/LAPS-transformer.
Comment: Accepted by ACM Multimedia 2022; 10 pages, 4 figures
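To make the two operations concrete, here is a minimal sketch of how a leap-attention pairing and a periodic channel shift could be written. This is an illustration under simplifying assumptions, not the authors' implementation: the (batch, frames, tokens, channels) layout, the shifted channel fraction, the pairing offset, and the omission of query/key/value projections and multi-head attention are all choices made for brevity; the linked repository contains the real code.

import torch

def periodic_shift(x, shift_frac=0.125):
    """Short-term P-Shift (sketch): swap a small fraction of channels between
    temporally adjacent frames to recover short-term dynamics.
    x: (B, T, N, C) = batch, frames, tokens, channels."""
    c = int(x.shape[-1] * shift_frac)
    out = x.clone()
    out[:, 1:, :, :c] = x[:, :-1, :, :c]              # channels pulled from the previous frame
    out[:, :-1, :, c:2 * c] = x[:, 1:, :, c:2 * c]    # channels pulled from the next frame
    return out

def leap_attention(x, leap=None):
    """Long-term Leap Attention (sketch): group temporally distant frames into
    pairs and attend within each pair only, instead of over all T*N tokens.
    Projections and heads are omitted; T is assumed even."""
    B, T, N, C = x.shape
    leap = leap or T // 2
    pairs = torch.stack([x[:, :leap], x[:, leap:]], dim=2)   # (B, leap, 2, N, C): frame t paired with frame t+leap
    tokens = pairs.reshape(B * leap, 2 * N, C)                # each pair forms one short sequence
    attn = torch.softmax(tokens @ tokens.transpose(1, 2) / C ** 0.5, dim=-1)
    out = (attn @ tokens).reshape(B, leap, 2, N, C)
    return torch.cat([out[:, :, 0], out[:, :, 1]], dim=1)    # back to (B, T, N, C)

# Toy usage: 8 frames of 196 tokens with 64 channels.
x = torch.randn(2, 8, 196, 64)
y = leap_attention(periodic_shift(x))
print(y.shape)  # torch.Size([2, 8, 196, 64])

Grouping frames into pairs means each attention runs over only 2N tokens for T/2 pairs, which is where a cost on the order of 2TN^2, rather than T^2N^2, comes from.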
KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration
In the realm of facial analysis, accurate landmark detection is crucial for
various applications, ranging from face recognition and expression analysis to
animation. Conventional heatmap or coordinate regression-based techniques,
however, often face challenges in terms of computational burden and
quantization errors. To address these issues, we present the KeyPoint
Positioning System (KeyPosS) - a groundbreaking facial landmark detection
framework that stands out from existing methods. The framework utilizes a fully
convolutional network to predict a distance map, which computes the distance
between a Point of Interest (POI) and multiple anchor points. These anchor
points are ingeniously harnessed to triangulate the POI's position through the
True-range Multilateration algorithm. Notably, the plug-and-play nature of
KeyPosS enables seamless integration into any decoding stage, ensuring a
versatile and adaptable solution. We conducted a thorough evaluation of
KeyPosS's performance by benchmarking it against state-of-the-art models on
four different datasets. The results show that KeyPosS substantially
outperforms leading methods in low-resolution settings while requiring a
minimal time overhead. The code is available at
https://github.com/zhiqic/KeyPosS.
Comment: Accepted to ACM Multimedia 2023; 10 pages, 7 figures, 6 tables; the code is at https://github.com/zhiqic/KeyPosS
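To illustrate the decoding idea, the following is a rough sketch of plain true-range multilateration: given a few anchor points and a predicted distance from the landmark to each anchor, the landmark is recovered by linearizing the circle equations and solving a small least-squares system. The anchor layout, the noise-free distances, and the NumPy-only setting are assumptions made for the example; the paper's distance-map prediction and its exact decoding are in the linked repository.

import numpy as np

def multilaterate(anchors, dists):
    """Estimate a 2-D point from its distances to known anchors
    (true-range multilateration, linearized and solved by least squares).
    anchors: (K, 2) anchor coordinates; dists: (K,) predicted distances."""
    anchors = np.asarray(anchors, dtype=float)
    dists = np.asarray(dists, dtype=float)
    x0, y0 = anchors[0]
    d0 = dists[0]
    # Subtract the first circle equation from the others to cancel the
    # quadratic terms, leaving a linear system A p = b in the unknown p = (x, y).
    A = 2.0 * (anchors[1:] - anchors[0])
    b = (d0 ** 2 - dists[1:] ** 2
         + np.sum(anchors[1:] ** 2, axis=1) - (x0 ** 2 + y0 ** 2))
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

# Toy usage: a landmark at (3.0, 4.0) observed from four anchors.
anchors = [(0, 0), (10, 0), (0, 10), (10, 10)]
target = np.array([3.0, 4.0])
dists = [np.linalg.norm(target - np.array(a)) for a in anchors]
print(multilaterate(anchors, dists))  # approximately [3. 4.]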
Machine Learning Architectures for Video Annotation and Retrieval
PhD
In this thesis we design machine learning methodologies for solving the problem
of video annotation and retrieval using either pre-defined semantic concepts or ad-hoc
queries. Concept-based video annotation refers to the annotation of video fragments
with one or more semantic concepts (e.g. hand, sky, running) chosen from a pre-defined
concept list. Ad-hoc queries refer to textual descriptions that may contain
objects, activities, locations, etc., and combinations thereof. Our contributions
are: i) A thorough analysis of extending and using different local descriptors towards
improved concept-based video annotation, together with a stacking architecture that uses,
in its first layer, concept classifiers trained on local descriptors and improves their
prediction accuracy in its last layer by implicitly capturing concept relations. ii)
A cascade architecture that orders and combines many classifiers, trained on different
visual descriptors, for the same concept. iii) A deep learning architecture that exploits
concept relations at two different levels. At the first level, we build on ideas from
multi-task learning, and propose an approach to learn concept-specific representations
that are sparse, linear combinations of representations of latent concepts. At a second
level, we build on ideas from structured output learning, and propose the introduction,
at training time, of a new cost term that explicitly models the correlations between
the concepts. By doing so, we explicitly model the structure in the output space
(i.e., the concept labels). iv) A fully-automatic ad-hoc video search architecture that
combines concept-based video annotation and textual query analysis, and transforms
concept-based keyframe and query representations into a common semantic embedding
space. Our architectures have been extensively evaluated on the TRECVID SIN 2013,
the TRECVID AVS 2016, and other large-scale datasets, demonstrating their effectiveness
compared to similar approaches.
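As a hypothetical illustration of the structured-output idea in contribution iii), the sketch below adds a correlation cost term to a standard multi-label concept loss, nudging the model's pairwise prediction statistics toward the concept correlations observed in the training labels. It is not the formulation used in the thesis: the loss shape, the weighting factor alpha, and the co-occurrence estimate are illustrative assumptions.

import torch
import torch.nn.functional as F

def correlation_aware_loss(logits, labels, corr, alpha=0.1):
    """Sketch: multi-label concept loss plus a cost term that pushes
    pairwise prediction co-activations toward the label correlations.
    logits: (B, C) concept scores; labels: (B, C) binary targets;
    corr:   (C, C) empirical concept co-occurrence matrix."""
    bce = F.binary_cross_entropy_with_logits(logits, labels.float())
    probs = torch.sigmoid(logits)
    pred_corr = (probs.T @ probs) / probs.shape[0]   # batch estimate of co-activation
    corr_cost = F.mse_loss(pred_corr, corr)
    return bce + alpha * corr_cost

# Toy usage: 4 keyframes, 3 concepts (e.g. hand, sky, running).
logits = torch.randn(4, 3, requires_grad=True)
labels = torch.tensor([[1, 0, 1], [0, 1, 0], [1, 0, 1], [0, 1, 0]])
corr = (labels.float().T @ labels.float()) / labels.shape[0]
loss = correlation_aware_loss(logits, labels, corr)
loss.backward()
print(float(loss))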