Boundary Proposal Network for Two-Stage Natural Language Video Localization
We aim to address the problem of Natural Language Video Localization
(NLVL): localizing the video segment that corresponds to a natural language
description in a long, untrimmed video. State-of-the-art NLVL methods are
almost all one-stage and can typically be grouped into two categories:
1) anchor-based approaches, which first pre-define a series of candidate
video segments (e.g., by sliding window) and then classify each candidate;
2) anchor-free approaches, which directly predict, for each video frame, the
probability of being a boundary or interior frame of the positive segment.
However, both kinds of one-stage approach have inherent drawbacks: the
anchor-based approach is sensitive to heuristic rules, which limits its
ability to handle videos of varying length, while the anchor-free approach
fails to exploit segment-level interaction and thus achieves inferior
results. In this paper, we propose a novel Boundary Proposal
Network (BPNet), a universal two-stage framework that gets rid of the issues
mentioned above. Specifically, in the first stage, BPNet utilizes an
anchor-free model to generate a group of high-quality candidate video segments
with their boundaries. In the second stage, a visual-language fusion layer is
proposed to jointly model the multi-modal interaction between the candidate and
the language query, followed by a matching score rating layer that outputs the
alignment score for each candidate. We evaluate our BPNet on three challenging
NLVL benchmarks (i.e., Charades-STA, TACoS and ActivityNet-Captions). Extensive
experiments and ablative studies on these datasets demonstrate that the BPNet
outperforms the state-of-the-art methods.
Comment: AAAI 202
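To make the two-stage idea concrete, here is a minimal sketch in plain Python. The function names, the product-of-boundary-probabilities proposal rule, and the dot-product "fusion" score are illustrative assumptions for exposition only, not BPNet's actual model:

```python
def propose_segments(start_probs, end_probs, top_k=2):
    """Stage 1 (illustrative): turn per-frame boundary probabilities from
    an anchor-free model into a small set of candidate segments."""
    T = len(start_probs)
    # Score every (start, end) pair by the product of its boundary probs.
    scored = [(start_probs[s] * end_probs[e], s, e)
              for s in range(T) for e in range(s + 1, T)]
    scored.sort(reverse=True)
    return [(s, e) for _, s, e in scored[:top_k]]

def rank_candidates(candidates, frame_feats, query_feat):
    """Stage 2 (illustrative): score each candidate segment against the
    query via segment-level pooling and a simple similarity."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = []
    for s, e in candidates:
        n = e - s + 1
        # Mean-pool the frame features inside the candidate segment.
        pooled = [sum(f[d] for f in frame_feats[s:e + 1]) / n
                  for d in range(len(query_feat))]
        scores.append(dot(pooled, query_feat))
    best = max(zip(scores, candidates))[1]
    return best, scores
```

The point of the sketch is the division of labor: stage 1 only needs frame-level boundary scores, while stage 2 sees each candidate as a whole and can therefore model segment-level interaction with the query.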
Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning
Video captioning is widely regarded as a fundamental but challenging task
in both the computer vision and artificial intelligence fields. The
prevalent approach is to map an input video to a variable-length output
sentence in a sequence-to-sequence manner via a Recurrent Neural Network
(RNN). Nevertheless, RNN training still suffers to some degree from the
vanishing/exploding gradient problem, which makes optimization difficult.
Moreover, the inherently recurrent dependency in an RNN prevents
parallelization within a sequence during training and therefore limits
computational efficiency. In this paper, we present a novel design ---
Temporal Deformable Convolutional Encoder-Decoder Networks (dubbed
TDConvED) --- that fully employs convolutions in both encoder and decoder
networks for video captioning. Technically, we exploit convolutional block
structures that compute intermediate states over a fixed number of inputs,
and we stack several blocks to capture long-term relationships. The encoder
structure is further equipped with temporal deformable convolution to
enable free-form deformation of the temporal sampling. Our model also
capitalizes on a temporal attention mechanism for sentence generation. Extensive experiments are
conducted on both the MSVD and MSR-VTT video captioning datasets, and
superior results are reported when compared to conventional RNN-based
encoder-decoder techniques. More remarkably, TDConvED increases CIDEr-D
performance from 58.8% to 67.2% on MSVD.
Comment: AAAI 201
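The core idea of temporal deformable convolution, namely letting each kernel tap sample the sequence at a learned fractional offset rather than at a fixed grid position, can be sketched as follows. This is a minimal 1-D, single-channel illustration with linear interpolation, not the paper's exact formulation, and all names are hypothetical:

```python
def temporal_deform_conv(feats, weights, offsets):
    """Illustrative deformable temporal convolution over a 1-D sequence.

    feats:   list of scalar frame features, length T
    weights: kernel weights, length K
    offsets: offsets[t][k] shifts tap k of the kernel centered at frame t
    """
    T, K = len(feats), len(weights)
    out = []
    for t in range(T):
        acc = 0.0
        for k in range(K):
            # Free-form sampling position: grid position plus learned offset.
            pos = t + (k - K // 2) + offsets[t][k]
            pos = min(max(pos, 0.0), T - 1.0)  # clamp to the sequence
            lo = int(pos)
            hi = min(lo + 1, T - 1)
            frac = pos - lo
            # Linear interpolation between the two nearest frames.
            acc += weights[k] * ((1 - frac) * feats[lo] + frac * feats[hi])
        out.append(acc)
    return out
```

With all offsets at zero this reduces to an ordinary temporal convolution; nonzero offsets let the kernel deform its receptive field per frame, which is what enables the free-form temporal sampling described above.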