Search Tracker: Human-derived object tracking in-the-wild through large-scale search and retrieval
Humans use context and scene knowledge to easily localize moving objects in
conditions of complex illumination changes, scene clutter and occlusions. In
this paper, we present a method to leverage human knowledge in the form of
annotated video libraries in a novel search and retrieval based setting to
track objects in unseen video sequences. For every video sequence, a document
that represents motion information is generated. Documents of the unseen video
are queried against the library at multiple scales to find videos with similar
motion characteristics. This provides us with coarse localization of objects in
the unseen video. We further adapt these retrieved object locations to the new
video using an efficient warping scheme. The proposed method is validated on
in-the-wild video surveillance datasets where we outperform state-of-the-art
appearance-based trackers. We also introduce a new challenging dataset with
complex object appearance changes.
Comment: Under review with the IEEE Transactions on Circuits and Systems for
Video Technology
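The retrieval step the abstract describes can be sketched as bag-of-motion-words matching: each video is reduced to a document of quantized motion "words", and unseen videos are queried against the library by document similarity. The direction-binning quantizer and cosine scoring below are illustrative simplifications, not the paper's actual document representation.

```python
from collections import Counter
import math

def motion_document(flows, n_bins=8):
    """Quantize per-frame motion vectors (dx, dy) into direction 'words'
    and count them, forming a bag-of-motion-words document.
    A hypothetical stand-in for the paper's motion documents."""
    words = []
    for dx, dy in flows:
        angle = math.atan2(dy, dx) % (2 * math.pi)
        words.append(int(angle / (2 * math.pi) * n_bins) % n_bins)
    return Counter(words)

def cosine(doc_a, doc_b):
    """Cosine similarity between two sparse word-count documents."""
    keys = set(doc_a) | set(doc_b)
    dot = sum(doc_a[k] * doc_b[k] for k in keys)
    na = math.sqrt(sum(v * v for v in doc_a.values()))
    nb = math.sqrt(sum(v * v for v in doc_b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_doc, library):
    """Rank (name, document) library entries by motion similarity
    to the query document, most similar first."""
    return sorted(library, key=lambda item: cosine(query_doc, item[1]),
                  reverse=True)
```

The top-ranked library videos then supply the coarse object locations that the paper warps onto the unseen video.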
A fuzzy measure approach to motion frame analysis for scene detection
This paper addresses the problem of scene estimation for motion video data in a fuzzy set-theoretic framework. Using fuzzy image feature extractors, a new algorithm is developed to compute the change of information between each pair of successive frames in order to classify scenes. This classification of raw visual input data can be used to establish structure for correlation. The algorithm aims to fulfill the need for nonlinear, frame-accurate access to video data in applications such as video editing and visual document archival/retrieval systems in multimedia environments.
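A minimal sketch of the frame-to-frame "change of information" idea: compare intensity histograms of successive frames, map the graded difference to [0, 1], and threshold it into a scene-boundary decision. The histogram distance and the fixed cut threshold are illustrative stand-ins for the paper's fuzzy feature extractors and classifier.

```python
def gray_histogram(frame, n_bins=16):
    """Normalized intensity histogram of a frame given as a flat
    list of 0-255 pixel values."""
    hist = [0] * n_bins
    for p in frame:
        hist[min(p * n_bins // 256, n_bins - 1)] += 1
    total = len(frame)
    return [h / total for h in hist]

def change_measure(f1, f2):
    """Graded change of information between two successive frames:
    half the L1 distance of their histograms, which lies in [0, 1]
    and behaves like a fuzzy membership degree for 'scene changed'."""
    h1, h2 = gray_histogram(f1), gray_histogram(f2)
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))

def classify(change, cut_threshold=0.5):
    """Harden the graded change into a scene-boundary decision;
    the threshold is an illustrative assumption."""
    return "cut" if change >= cut_threshold else "same scene"
```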
Tri-Modal Motion Retrieval by Learning a Joint Embedding Space
Information retrieval is an ever-evolving and crucial research domain. The
substantial demand for high-quality human motion data, especially for online
acquisition, has led to a surge in human motion research. Prior works have
mainly concentrated on dual-modality learning, such as text and motion tasks,
but three-modality learning has been rarely explored. Intuitively, an extra
introduced modality can enrich a model's application scenario, and more
importantly, an adequate choice of the extra modality can also act as an
intermediary and enhance the alignment between the other two disparate
modalities. In this work, we introduce LAVIMO (LAnguage-VIdeo-MOtion
alignment), a novel framework for three-modality learning integrating
human-centric videos as an additional modality, thereby effectively bridging
the gap between text and motion. Moreover, our approach leverages a specially
designed attention mechanism to foster enhanced alignment and synergistic
effects among text, video, and motion modalities. Empirically, our results on
the HumanML3D and KIT-ML datasets show that LAVIMO achieves state-of-the-art
performance in various motion-related cross-modal retrieval tasks, including
text-to-motion, motion-to-text, video-to-motion and motion-to-video.
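The three-modality alignment described above can be sketched as three pairwise contrastive terms over a shared embedding space, with video acting as the bridge between text and motion. The InfoNCE-style loss and equal loss weights below are illustrative assumptions, not LAVIMO's actual objective or attention mechanism.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def infonce(anchors, positives, temperature=0.1):
    """InfoNCE over a batch: anchor i should match positive i
    against every other positive in the batch."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(anchors)

def tri_modal_loss(text, video, motion):
    """Three pairwise alignment terms; video bridges text and motion.
    Equal weighting is an illustrative choice."""
    return (infonce(text, motion) + infonce(text, video)
            + infonce(video, motion))
```

When the three modalities' embeddings of the same sample coincide, the loss is near zero; misaligned batches are penalized.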
TVPR: Text-to-Video Person Retrieval and a New Benchmark
Most existing methods for text-based person retrieval focus on text-to-image
person retrieval. Nevertheless, due to the lack of dynamic information provided
by isolated frames, the performance is hampered when the person is obscured in
isolated frames or variable motion details are given in the textual
description. In this paper, we propose a new task called Text-to-Video Person
Retrieval (TVPR), which aims to overcome the limitations of isolated
frames. Since there is no dataset or benchmark that describes person videos
with natural language, we construct a large-scale cross-modal person video
dataset with detailed natural language annotations covering the person's
appearance, actions, and interactions with the environment, termed the
Text-to-Video Person Re-identification (TVPReid) dataset, which will be made
publicly available. Building on this dataset, a Text-to-Video Person Retrieval
(TVPRN) is proposed. Specifically, TVPRN acquires video representations by
fusing visual and motion representations of person videos, which can deal with
temporal occlusion and the absence of variable motion details in isolated
frames. Meanwhile, we employ the pre-trained BERT to obtain caption
representations and the relationship between caption and video representations
to reveal the most relevant person videos. To evaluate the effectiveness of the
proposed TVPRN, extensive experiments have been conducted on TVPReid dataset.
To the best of our knowledge, TVPRN is the first successful attempt to use
video for the text-based person retrieval task, and it achieves
state-of-the-art performance on the TVPReid dataset.
Improving Bag-of-visual-Words model with spatial-temporal correlation for video retrieval
Most of the state-of-the-art approaches to Query-by-Example (QBE) video retrieval are based on the Bag-of-visual-Words (BovW) representation of visual content. It, however, ignores the spatial-temporal information, which is important for similarity measurement between videos. Direct incorporation of such information into the video data representation for a large-scale data set is computationally expensive in terms of storage and similarity measurement. It is also static regardless of the change of discriminative power of visual words with respect to different queries. To tackle these limitations, in this paper, we propose to discover Spatial-Temporal Correlations (STC) imposed by the query example to improve the BovW model for video retrieval. The STC, in terms of spatial proximity and relative motion coherence between different visual words, is crucial to identify the discriminative power of the visual words. We develop a novel technique to emphasize the most discriminative visual words for similarity measurement, and incorporate this STC-based approach into the standard inverted index architecture. Our approach is evaluated on the TRECVID2002 and CC_WEB_VIDEO datasets for two typical QBE video retrieval tasks respectively. The experimental results demonstrate that it substantially improves the BovW model as well as a state-of-the-art method that also utilizes spatial-temporal information for QBE video retrieval.
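The inverted-index retrieval that the STC weighting plugs into can be sketched as follows: each visual word maps to the videos containing it, and query-time scores accumulate per-word contributions. The optional `weights` argument is an illustrative stand-in for the paper's query-adaptive STC discriminative weights, not their actual formulation.

```python
from collections import defaultdict

def build_inverted_index(videos):
    """Map each visual word to the videos (with occurrence counts)
    in which it appears -- the standard BovW inverted-index layout.
    `videos` maps a video id to its list of visual-word ids."""
    index = defaultdict(dict)
    for vid, words in videos.items():
        for w in words:
            index[w][vid] = index[w].get(vid, 0) + 1
    return index

def query(index, query_words, weights=None):
    """Score videos by (optionally weighted) shared visual words.
    `weights` stands in for query-dependent discriminative weights;
    unweighted words count 1.0 per occurrence."""
    weights = weights or {}
    scores = defaultdict(float)
    for w in query_words:
        for vid, count in index.get(w, {}).items():
            scores[vid] += weights.get(w, 1.0) * count
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Boosting a discriminative word's weight can reorder the ranking, which is exactly the effect the STC weighting exploits.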
Objects that Sound
In this paper our objectives are, first, networks that can embed audio and
visual inputs into a common space that is suitable for cross-modal retrieval;
and second, a network that can localize the object that sounds in an image,
given the audio signal. We achieve both these objectives by training from
unlabelled video using only audio-visual correspondence (AVC) as the objective
function. This is a form of cross-modal self-supervision from video.
To this end, we design new network architectures that can be trained for
cross-modal retrieval and localizing the sound source in an image, by using the
AVC task. We make the following contributions: (i) show that audio and visual
embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and
between-mode retrieval; (ii) explore various architectures for the AVC task,
including those for the visual stream that ingest a single image, or multiple
images, or a single image and multi-frame optical flow; (iii) show that the
semantic object that sounds within an image can be localized (using only the
sound, no motion or flow information); and (iv) give a cautionary tale on how
to avoid undesirable shortcuts in the data preparation.
Comment: Appears in: European Conference on Computer Vision (ECCV) 2018
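The AVC objective's training data can be sketched as pair sampling: a frame and an audio snippet from the same clip get label 1, and a frame paired with audio from a different clip gets label 0, so correspondence itself supervises the network. The sampling scheme below is a simplified sketch of this self-supervision, not the paper's exact pipeline.

```python
import random

def avc_pairs(clips, n_pairs, seed=0):
    """Build training pairs for the audio-visual correspondence (AVC)
    task. `clips` is a list of (frame, audio) tuples from distinct
    videos (at least two). Each emitted pair is (frame, audio, label),
    label 1 for corresponding, 0 for mismatched."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        i = rng.randrange(len(clips))
        frame, audio = clips[i]
        if rng.random() < 0.5:
            pairs.append((frame, audio, 1))        # same clip: correspond
        else:
            j = rng.choice([k for k in range(len(clips)) if k != i])
            pairs.append((frame, clips[j][1], 0))  # other clip: mismatch
    return pairs
```

A binary classifier trained on these pairs never sees a manual label; alignment in the source video is the only supervisory signal.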
Localized Temporal Profile of Surveillance Video
Surveillance videos are recorded pervasively, and their retrieval still relies on human operators. As an intermediate representation, this work develops a new temporal profile of video that conveys accurate temporal information while keeping certain spatial characteristics of targets of interest for recognition. The profile is obtained at critical positions where major target flow appears. We set a sampling line crossing the motion direction to profile passing targets in the temporal domain. To add spatial information to the temporal profile to a certain extent, we integrate multiple profiles from a set of lines with a blending method to reflect the target motion direction and position in the temporal profile. Different from mosaicing/montage methods for video synopsis in the spatial domain, our temporal profile has no limit on the time length, and the created profile significantly reduces the data size for brief indexing and fast search of video.
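The sampling-line idea can be sketched directly: take the pixel column at the line's position from every frame and stack the columns over time, so the profile's width is the number of frames rather than the frame width. Averaging is used below as an illustrative stand-in for the paper's blending of multiple lines.

```python
def temporal_profile(frames, line_x):
    """Build a temporal profile from a vertical sampling line: column
    x = line_x of every frame, stacked over time. `frames` is a list
    of 2-D pixel grids (rows of values); the result has one row per
    image row and one column per frame."""
    return [[frame[y][line_x] for frame in frames]
            for y in range(len(frames[0]))]

def blended_profile(frames, line_xs):
    """Blend profiles from several sampling lines by averaging,
    adding some spatial context (a simplified blending scheme)."""
    profiles = [temporal_profile(frames, x) for x in line_xs]
    height, duration = len(profiles[0]), len(profiles[0][0])
    return [[sum(p[y][i] for p in profiles) / len(profiles)
             for i in range(duration)]
            for y in range(height)]
```

Because the profile grows only with time, not with frame width, indexing a long video needs far less data than storing the frames themselves.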
Use What You Have: Video Retrieval Using Representations From Collaborative Experts
The rapid growth of video on the internet has made searching for video
content using natural language queries a significant challenge. Human-generated
queries for video datasets 'in the wild' vary considerably in degree of
specificity, with some queries describing specific details such as the names of
famous identities, content from speech, or text available on the screen. Our
goal is to condense the multi-modal, extremely high dimensional information
from videos into a single, compact video representation for the task of video
retrieval using free-form text queries, where the degree of specificity is
open-ended.
For this we exploit existing knowledge in the form of pre-trained semantic
embeddings which include 'general' features such as motion, appearance, and
scene features from visual content. We also explore the use of more 'specific'
cues from ASR and OCR which are intermittently available for videos and find
that these signals remain challenging to use effectively for retrieval. We
propose a collaborative experts model to aggregate information from these
different pre-trained experts and assess our approach empirically on five
retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet. Code and
data can be found at www.robots.ox.ac.uk/~vgg/research/collaborative-experts/.
This paper contains a correction to results reported in the previous version.
Comment: This update contains a correction to previously reported results
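The aggregation step can be sketched as combining per-expert video embeddings (motion, appearance, scene, ASR, OCR, ...) while tolerating the intermittent availability of ASR and OCR. Skipping missing experts and averaging the rest, as below, is an illustrative stand-in for the paper's learned collaborative gating.

```python
def aggregate_experts(expert_embs):
    """Combine per-expert embeddings of one video into a single compact
    representation. `expert_embs` maps expert name -> embedding vector,
    or None when that expert is unavailable for this video (as ASR/OCR
    often are). The plain mean is an illustrative aggregator."""
    available = [e for e in expert_embs.values() if e is not None]
    if not available:
        raise ValueError("no expert available for this video")
    dim = len(available[0])
    return [sum(e[i] for e in available) / len(available)
            for i in range(dim)]
```

The resulting single vector is what gets compared against the free-form text query's embedding at retrieval time.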
Semantic Sketch-Based Video Retrieval with Autocompletion
The IMOTION system is a content-based video search engine that provides fast and intuitive known-item search in large video collections. User interaction consists mainly of sketching, which the system recognizes in real time, making suggestions based on both the visual appearance of the sketch (what the sketch looks like in terms of colors, edge distribution, etc.) and its semantic content (what object the user is sketching). The latter is enabled by a predictive sketch-based UI that identifies likely candidates for the sketched object via state-of-the-art sketch recognition techniques and offers on-screen completion suggestions. In this demo, we show how the sketch-based video retrieval of the IMOTION system is used in a collection of roughly 30,000 video shots. The system indexes collection data with over 30 visual features describing color, edge, motion, and semantic information. The resulting feature data is stored in ADAM, an efficient database system optimized for fast retrieval.