103 research outputs found
Circulant temporal encoding for video retrieval and temporal alignment
We address the problem of specific video event retrieval. Given a query video
of a specific event, e.g., a concert of Madonna, the goal is to retrieve other
videos of the same event that temporally overlap with the query. Our approach
encodes the frame descriptors of a video to jointly represent their appearance
and temporal order. It exploits the properties of circulant matrices to
efficiently compare the videos in the frequency domain. This offers a
significant gain in complexity and accurately localizes the matching parts of
videos. The descriptors can be compressed in the frequency domain with a
product quantizer adapted to complex numbers. In this case, video retrieval is
performed without decompressing the descriptors. We also consider the temporal
alignment of a set of videos. We exploit the matching confidence and an
estimate of the temporal offset computed for all pairs of videos by our
retrieval approach. Our robust algorithm aligns the videos on a global timeline
by maximizing the set of temporally consistent matches. The global temporal
alignment enables synchronous playback of the videos of a given scene.
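The frequency-domain comparison described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows how a score for every temporal offset between two sequences of frame descriptors can be obtained with one FFT-based cross-correlation instead of an explicit loop over offsets. The function names (`temporal_offset_scores`, `best_temporal_offset`), the zero-padding choice, and the use of NumPy are assumptions made for illustration. The regularization and the product quantizer mentioned in the abstract are left out.

```python
import numpy as np


def temporal_offset_scores(query, base):
    """Score every temporal offset between two descriptor sequences.

    query: array of shape (Tq, D), one D-dimensional descriptor per frame.
    base:  array of shape (Tb, D).
    Returns an array of length n = Tq + Tb where entry k holds
    sum_t <query[t], base[t + k]> with indices taken modulo n.
    Zero-padding to n keeps the circular correlation from wrapping
    real frames onto each other (an assumption of this sketch).
    """
    tq, tb = len(query), len(base)
    n = tq + tb
    # Per-dimension FFT along the time axis, zero-padded to length n.
    fq = np.fft.rfft(query, n=n, axis=0)
    fb = np.fft.rfft(base, n=n, axis=0)
    # Cross-correlation theorem: conj(F(q)) * F(b) is the spectrum of the
    # correlation. Summing over dimensions gives dot-product scores.
    return np.fft.irfft(np.conj(fq) * fb, n=n, axis=0).sum(axis=1)


def best_temporal_offset(query, base):
    """Return (offset, score) for the best-scoring alignment.

    A positive offset means query frame 0 matches base frame `offset`;
    a negative offset means the query starts before the base video.
    """
    scores = temporal_offset_scores(query, base)
    k = int(np.argmax(scores))
    score = float(scores[k])
    # Indices at or beyond len(base) encode negative offsets.
    offset = k - len(scores) if k >= len(base) else k
    return offset, score
```

For example, if `query` is a contiguous excerpt of `base`, `best_temporal_offset(query, base)` should return the frame index at which the excerpt starts. The cost is dominated by the FFTs, so it grows as O(D n log n) rather than O(D Tq Tb) for a direct comparison of all offsets.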
LAMV: Learning to align and match videos with kernelized temporal layers
This paper considers a learnable approach for comparing and aligning videos. Our architecture builds upon and revisits temporal match kernels within neural networks: we propose a new temporal layer that finds temporal alignments by maximizing the scores between two sequences of vectors, according to a time-sensitive similarity metric parametrized in the Fourier domain. We learn this layer with a temporal proposal strategy, in which we minimize a triplet loss that takes into account both the localization accuracy and the recognition rate. We evaluate our approach on video alignment, copy detection and event retrieval. Our approach outperforms the state of the art on temporal video alignment and video copy detection datasets in comparable setups. It also attains the best reported results for particular event search, while precisely aligning videos.
Text-based Localization of Moments in a Video Corpus
Prior works on text-based video moment localization focus on temporally
grounding the textual query in an untrimmed video. These works assume that the
relevant video is already known and attempt to localize the moment on that
relevant video only. Different from such works, we relax this assumption and
address the task of localizing moments in a corpus of videos for a given
sentence query. This task poses a unique challenge as the system is required to
perform: (i) retrieval of the relevant video where only a segment of the video
corresponds to the queried sentence, and (ii) temporal localization of the moment
in the relevant video based on the sentence query. Towards overcoming this
challenge, we propose Hierarchical Moment Alignment Network (HMAN) which learns
an effective joint embedding space for moments and sentences. In addition to
learning subtle differences between intra-video moments, HMAN focuses on
distinguishing inter-video global semantic concepts based on sentence queries.
Qualitative and quantitative results on three benchmark text-based video moment
retrieval datasets - Charades-STA, DiDeMo, and ActivityNet Captions -
demonstrate that our method achieves promising performance on the proposed task
of temporal localization of moments in a corpus of videos.
VADER: Video Alignment Differencing and Retrieval
We propose VADER, a spatio-temporal matching, alignment, and change
summarization method to help fight misinformation spread via manipulated
videos. VADER matches and coarsely aligns partial video fragments to candidate
videos using a robust visual descriptor and scalable search over adaptively
chunked video content. A transformer-based alignment module then refines the
temporal localization of the query fragment within the matched video. A
space-time comparator module identifies regions of manipulation between aligned
content, invariant to any changes due to any residual temporal misalignments or
artifacts arising from non-editorial changes of the content. Robustly matching
video to a trusted source enables conclusions to be drawn on video provenance,
enabling informed trust decisions on content encountered.
Surgical video retrieval using deep neural networks
Although the amount of raw surgical videos, namely videos
captured during surgical interventions, is growing fast, automatic retrieval
and search remains a challenge. This is mainly due to the nature
of the content, i.e. visually non-consistent tissue, diversity of internal organs,
abrupt viewpoint changes and illumination variation. We propose
a framework for retrieving surgical videos and a protocol for evaluating
the results. The method is composed of temporal shot segmentation and
representation based on deep features, and the protocol introduces novel
criteria to the field. The experimental results prove the superiority of
the proposed method and highlight the path towards a more effective
protocol for evaluating surgical videos.