DiffusionVMR: Diffusion Model for Video Moment Retrieval
Video moment retrieval is a fundamental visual-language task that aims to
retrieve target moments from an untrimmed video based on a language query.
Existing methods typically generate numerous proposals manually or via
generative networks in advance as the support set for retrieval, which is not
only inflexible but also time-consuming. Inspired by the success of diffusion
models in object detection, this work reformulates video moment retrieval as a
denoising generation process that removes the need for proposal generation
altogether. To this end, we propose a novel
proposal-free framework, namely DiffusionVMR, which directly samples random
spans from noise as candidates and introduces denoising learning to ground
target moments. During training, Gaussian noise is added to the real moments,
and the model is trained to reverse this process. During inference, a set of
time spans is progressively refined from initial noise to the final output.
Notably, training and inference in DiffusionVMR are decoupled: an arbitrary
number of random spans can be used at inference, independent of the number
used during training. Extensive experiments conducted on three
widely-used benchmarks (i.e., QVHighlight, Charades-STA, and TACoS) demonstrate
the effectiveness of the proposed DiffusionVMR in comparison with
state-of-the-art methods.
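To make the denoising formulation concrete, below is a minimal PyTorch sketch
of the training and inference loops described above. It is not the authors'
implementation: `SpanDenoiser` is a hypothetical stand-in for the paper's
video- and query-conditioned model, spans are assumed to be normalized
(center, width) pairs, and the noise schedule and step counts are illustrative.

```python
import torch
import torch.nn as nn

T = 100  # number of diffusion steps (assumed hyperparameter)
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class SpanDenoiser(nn.Module):
    """Hypothetical stand-in: the real model also conditions on video and
    query features; this toy MLP sees only noisy spans and a timestep."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, noisy_spans, t):
        # noisy_spans: (B, N, 2) normalized (center, width) pairs; t: (B,)
        t_emb = (t.float() / T).view(-1, 1, 1).expand(-1, noisy_spans.size(1), 1)
        return self.net(torch.cat([noisy_spans, t_emb], dim=-1))  # clean-span estimate

def train_step(model, gt_spans):
    """Forward process: corrupt the real moments, learn to recover them."""
    t = torch.randint(0, T, (gt_spans.size(0),))
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1)
    s = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1)
    noisy = a * gt_spans + s * torch.randn_like(gt_spans)
    return nn.functional.mse_loss(model(noisy, t), gt_spans)

@torch.no_grad()
def infer(model, num_spans=30, steps=4):
    """Reverse process: refine random spans into candidates; num_spans is
    free to differ from whatever was used during training."""
    spans = torch.randn(1, num_spans, 2)
    for t in torch.linspace(T - 1, 0, steps).long():
        x0 = model(spans, t.view(1))  # current estimate of the clean spans
        t_prev = torch.clamp(t - T // steps, min=0)
        a_prev = alphas_cumprod[t_prev]
        spans = a_prev.sqrt() * x0 + (1.0 - a_prev).sqrt() * torch.randn_like(x0)
    return x0.clamp(0.0, 1.0)  # final (center, width) candidates
```

The decoupling claim corresponds to the `num_spans` argument: a model trained
with one candidate count can be queried with any other at inference time.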
Boundary Proposal Network for Two-Stage Natural Language Video Localization
We aim to address the problem of Natural Language Video Localization
(NLVL): localizing the video segment corresponding to a natural language
description in a long and untrimmed video. State-of-the-art NLVL methods are
almost all one-stage, and can typically be grouped into two categories: 1) the
anchor-based approach, which first pre-defines a series of video segment
candidates (e.g., by sliding window) and then classifies each candidate; and
2) the anchor-free approach, which directly predicts, for each video frame,
the probability of being a boundary or intermediate frame of the positive
segment. However, both kinds of one-stage approaches have inherent drawbacks:
the anchor-based approach is susceptible to heuristic rules, which limits its
ability to handle videos of varying length, while the anchor-free approach
fails to exploit segment-level interactions and thus achieves inferior
results. In this paper, we propose a novel Boundary Proposal
Network (BPNet), a universal two-stage framework that gets rid of the issues
mentioned above. Specifically, in the first stage, BPNet utilizes an
anchor-free model to generate a group of high-quality candidate video segments
with their boundaries. In the second stage, a visual-language fusion layer is
proposed to jointly model the multi-modal interaction between the candidate and
the language query, followed by a matching score rating layer that outputs the
alignment score for each candidate. We evaluate our BPNet on three challenging
NLVL benchmarks (i.e., Charades-STA, TACoS and ActivityNet-Captions). Extensive
experiments and ablative studies on these datasets demonstrate that the BPNet
outperforms the state-of-the-art methods.
Comment: AAAI 202
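As a rough illustration of the second stage, the sketch below scores
stage-one candidates against the query. It is only a sketch under assumed
feature shapes: the cross-attention fusion and the two-layer scorer are
illustrative placeholders, not BPNet's exact visual-language fusion and
matching score rating layers.

```python
import torch
import torch.nn as nn

class CandidateScorer(nn.Module):
    """Toy second stage: fuse each candidate with the query, then rate it."""
    def __init__(self, d=256):
        super().__init__()
        # visual-language fusion: candidates attend over the query tokens
        self.fusion = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.scorer = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, cand_feats, query_feats):
        # cand_feats: (B, K, d) pooled features of K stage-one candidates
        # query_feats: (B, L, d) token features of the language query
        fused, _ = self.fusion(cand_feats, query_feats, query_feats)
        return self.scorer(fused).squeeze(-1)  # (B, K) alignment scores

scores = CandidateScorer()(torch.randn(2, 16, 256), torch.randn(2, 10, 256))
best = scores.argmax(dim=-1)  # index of the best-matching candidate per video
```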
MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding
Given an untrimmed video and natural language query, video sentence grounding
aims to localize the target temporal moment in the video. Existing methods
mainly tackle this task by matching and aligning semantics of the descriptive
sentence and video segments at a single temporal resolution, neglecting the
temporal consistency of video content across different resolutions. In this
work, we propose a novel multi-resolution temporal video sentence grounding
network: MRTNet, which consists of a multi-modal feature encoder, a
Multi-Resolution Temporal (MRT) module, and a predictor module. The MRT module
is an encoder-decoder network whose decoder output features are combined with
Transformers to predict the final start and end timestamps. Notably, the MRT
module is hot-pluggable: it can be seamlessly incorporated into any
anchor-free model. In addition, we utilize a hybrid loss to supervise the
cross-modal features in the MRT module for more accurate grounding at three
scales: frame level, clip level, and sequence level. Extensive experiments on
three prevalent datasets demonstrate the effectiveness of MRTNet.
Comment: work in progress
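The sketch below illustrates the multi-resolution idea in isolation: features
are pooled to coarser temporal resolutions, processed, and upsampled back to
frame rate so the scales can be fused. It is a toy block under assumed shapes,
not the released MRT module (the paper's version is an encoder-decoder
combined with Transformers).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResBlock(nn.Module):
    """Toy multi-resolution temporal block (illustrative, not MRTNet's MRT)."""
    def __init__(self, d=256, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.convs = nn.ModuleList(nn.Conv1d(d, d, 3, padding=1) for _ in scales)

    def forward(self, x):
        # x: (B, d, T) frame-level video features
        T = x.size(-1)
        outs = []
        for s, conv in zip(self.scales, self.convs):
            h = F.avg_pool1d(x, kernel_size=s) if s > 1 else x  # coarser scale
            h = conv(h)  # temporal modeling at this resolution
            outs.append(F.interpolate(h, size=T, mode="linear"))  # back to T
        return torch.stack(outs).mean(0)  # fuse the resolutions

feats = MultiResBlock()(torch.randn(2, 256, 64))  # (2, 256, 64), drop-in shape
```

Because input and output shapes match, a block like this can be inserted into
an existing anchor-free pipeline without changing the surrounding layers,
which is the sense in which the module is hot-pluggable.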
Exploring Domain Incremental Video Highlights Detection with the LiveFood Benchmark
Video highlights detection (VHD) is an active research field in computer
vision, aiming to locate the most user-appealing clips given raw video inputs.
However, most VHD methods are based on the closed world assumption, i.e., a
fixed number of highlight categories is defined in advance and all training
data are available beforehand. Consequently, existing methods have poor
scalability as the number of highlight domains and the amount of training
data grow. To address these issues, we propose a novel video highlights
detection method
named Global Prototype Encoding (GPE) to learn incrementally for adapting to
new domains via parameterized prototypes. To facilitate this new research
direction, we collect a finely annotated dataset termed LiveFood, including
over 5,100 live gourmet videos that consist of four domains: ingredients,
cooking, presentation, and eating. To the best of our knowledge, this is the
first work to explore video highlights detection in the incremental learning
setting, opening up new ground for applying VHD in practical scenarios where both
the concerned highlight domains and training data increase over time. We
demonstrate the effectiveness of GPE through extensive experiments. Notably,
GPE surpasses popular domain incremental learning methods on LiveFood,
achieving significant mAP improvements on all domains. On the classic
datasets, GPE also yields performance comparable to prior art. The code is
available at: https://github.com/ForeverPs/IncrementalVHD_GPE.
Comment: AAAI 202
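As a sketch of what parameterized prototypes can look like, the snippet below
keeps one learnable prototype per highlight domain and scores clips by cosine
similarity; adapting to a new domain simply adds a new prototype. This is an
assumed reading for illustration; the actual GPE method is in the linked
repository and differs in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeBank(nn.Module):
    """Toy per-domain prototype bank (illustrative, not the actual GPE code)."""
    def __init__(self, d=256):
        super().__init__()
        self.d = d
        self.prototypes = nn.ParameterList()  # one learnable prototype per domain

    def add_domain(self):
        # called when a new highlight domain (e.g., "cooking") arrives
        self.prototypes.append(nn.Parameter(torch.randn(self.d)))

    def forward(self, clip_feats):
        # clip_feats: (B, d); score each clip against every domain prototype
        protos = F.normalize(torch.stack(list(self.prototypes)), dim=-1)
        return F.normalize(clip_feats, dim=-1) @ protos.t()  # (B, num_domains)

bank = PrototypeBank()
for _ in ("ingredients", "cooking", "presentation", "eating"):
    bank.add_domain()
scores = bank(torch.randn(8, 256))  # (8, 4) highlight score per clip per domain
```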
Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos
Query-based moment retrieval aims to localize the most relevant moment in an
untrimmed video according to a given natural language query. Existing works
often focus on only one aspect of this emerging task, such as query
representation learning, video context modeling, or multi-modal fusion, and
thus fail to develop a comprehensive system for further performance
improvement. In this paper, we introduce a novel Cross-Modal Interaction
Network (CMIN) that considers multiple crucial factors for this challenging
task, including (1) the syntactic structure of natural language queries; (2)
long-range semantic dependencies in the video context; and (3) sufficient
cross-modal interaction.
Specifically, we devise a syntactic GCN to leverage the syntactic structure of
queries for fine-grained representation learning, propose a multi-head
self-attention mechanism to capture long-range semantic dependencies in the
video context,
and finally employ a multi-stage cross-modal interaction to explore the
potential relations between video and query contents. Extensive experiments
demonstrate the effectiveness of our proposed method.
Comment: Accepted by SIGIR 2019 as a full paper
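The snippet below sketches the first ingredient, a syntactic GCN: word
features are propagated along dependency-parse edges so each token's
representation reflects its syntactic neighbours. The adjacency construction
and shapes are assumed for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

class SyntacticGCN(nn.Module):
    """Toy one-layer GCN over a dependency-parse graph of the query."""
    def __init__(self, d=300):
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, words, adj):
        # words: (B, L, d) word embeddings
        # adj: (B, L, L) dependency edges incl. self-loops, 0/1-valued
        deg = adj.sum(-1, keepdim=True).clamp(min=1)  # for row-normalization
        return torch.relu(self.proj((adj / deg) @ words))  # average neighbours

words = torch.randn(2, 12, 300)
adj = torch.eye(12).expand(2, -1, -1)  # toy parse graph: self-loops only
syntax_aware = SyntacticGCN()(words, adj)  # (2, 12, 300) syntax-aware features
```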