SEM-POS: Grammatically and Semantically Correct Video Captioning
Generating grammatically and semantically correct captions for video
captioning is a challenging task. Captions produced by existing methods are
either generated word by word without regard for grammatical structure, or
they miss key information from the input videos. To address these issues, we
introduce a novel global-local fusion network, with a Global-Local Fusion Block
(GLFB) that encodes and fuses features from different parts of speech (POS)
components with visual-spatial features. We use novel combinations of different
POS components, 'determiner + subject', 'auxiliary verb', 'verb', and
'determiner + object', to supervise the corresponding POS blocks: Det + Subject,
Aux Verb, Verb, and Det + Object, respectively. The novel global-local fusion
network, together with the POS blocks, helps align visual features with the
language description to generate grammatically and semantically correct captions.
Extensive qualitative and quantitative experiments on the benchmark MSVD and
MSR-VTT datasets demonstrate that the proposed approach generates more
grammatically and semantically correct captions than existing methods,
achieving a new state of the art. Ablation studies on the POS blocks and the
GLFB demonstrate the impact of each contribution on the proposed method.
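The abstract does not describe how the GLFB fuses POS-component features with visual-spatial features, so the following is only a minimal PyTorch sketch of one plausible cross-attention-based fusion; the class name, dimensions, and the four POS slots are assumptions made for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GlobalLocalFusionBlock(nn.Module):
    """Hypothetical fusion block: POS-component (local) features attend to
    visual (global) features via cross-attention. Names/dims are assumptions."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, pos_feats, visual_feats):
        # pos_feats:    (B, 4, dim) one slot per POS component
        #               (Det + Subject, Aux Verb, Verb, Det + Object)
        # visual_feats: (B, T, dim) frame-level visual features
        attended, _ = self.cross_attn(pos_feats, visual_feats, visual_feats)
        x = self.norm1(pos_feats + attended)
        return self.norm2(x + self.ffn(x))

# Toy usage with random tensors.
glfb = GlobalLocalFusionBlock()
pos = torch.randn(2, 4, 512)
vis = torch.randn(2, 20, 512)
fused = glfb(pos, vis)   # (2, 4, 512)
```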
Non-Autoregressive Coarse-to-Fine Video Captioning
It is encouraging to see the progress made in bridging videos and natural
language. However, mainstream video captioning methods suffer from slow
inference due to the sequential nature of autoregressive decoding, and they
tend to generate generic descriptions because of insufficient training of
visual words (e.g., nouns and verbs) and an inadequate decoding paradigm. In this
paper, we propose a non-autoregressive decoding based model with a
coarse-to-fine captioning procedure to alleviate these defects. In our
implementation, we employ a bidirectional self-attention-based network as the
language model to speed up inference, and on top of it we decompose the
captioning procedure into two stages with different focuses.
Specifically, given that visual words determine the semantic correctness of
captions, we design a mechanism for generating visual words that not only
promotes the training of scene-related words but also captures relevant details from
videos to construct a coarse-grained sentence "template". Thereafter, we devise
dedicated decoding algorithms that fill in the "template" with suitable words
and modify inappropriate phrasing via iterative refinement to obtain a
fine-grained description. Extensive experiments on two mainstream video
captioning benchmarks, i.e., MSVD and MSR-VTT, demonstrate that our approach
achieves state-of-the-art performance, generates diverse descriptions, and
obtains high inference efficiency. Our code is available at
https://github.com/yangbang18/Non-Autoregressive-Video-Captioning.
Comment: 9 pages, 6 figures, to be published in AAAI 2021.
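The abstract gives only a high-level view of the two-stage decoding, so the sketch below illustrates the general idea of building a coarse "template" of confident visual words and then filling and iteratively refining it, in the spirit of mask-predict refinement; the `model(tokens, visual_feats)` interface, the `MASK_ID` value, and the masking schedule are all assumptions, not the algorithm in the linked repository.

```python
import torch

MASK_ID = 103  # assumed [MASK] token id; purely illustrative

def coarse_stage(model, visual_feats, max_len=20, keep_ratio=0.3):
    # Coarse stage: decode from an all-[MASK] sequence and keep only the most
    # confident predictions (intended to capture visual words such as nouns
    # and verbs), yielding a partially filled sentence "template".
    tokens = torch.full((1, max_len), MASK_ID, dtype=torch.long)
    logits = model(tokens, visual_feats)               # (1, max_len, vocab)
    conf, preds = logits.softmax(-1).max(-1)
    keep = conf.topk(max(1, int(max_len * keep_ratio))).indices
    tokens[0, keep[0]] = preds[0, keep[0]]
    return tokens

def fine_stage(model, visual_feats, tokens, iterations=4):
    # Fine stage: fill the remaining [MASK] positions, then repeatedly re-mask
    # the least confident words and re-predict them, so that inappropriate
    # phrasing can be revised over a few refinement iterations.
    max_len = tokens.size(1)
    for t in range(iterations):
        logits = model(tokens, visual_feats)
        conf, preds = logits.softmax(-1).max(-1)
        tokens = torch.where(tokens.eq(MASK_ID), preds, tokens)
        n_mask = int(max_len * (1 - (t + 1) / iterations))
        if n_mask == 0:
            break
        remask = conf.topk(n_mask, largest=False).indices
        tokens[0, remask[0]] = MASK_ID
    return tokens
```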