Search CORE

2,294 research outputs found

Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

Author: He Dongliang
Huang Jizhou
Li Fu
Liu Xiao
Wen Shilei
Zhao Xiang
Publication venue
Publication date: 21/01/2019
Field of study

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies have adopted strategies of sliding window over the entire video or exhaustively ranking all possible clip-sentence pairs in a pre-segmented video, which inevitably suffer from exhaustively enumerated candidates. To alleviate this problem, we formulate this task as a problem of sequential decision making by learning an agent which regulates the temporal grounding boundaries progressively based on its policy. Specifically, we propose a reinforcement learning based framework improved by multi-task learning and it shows steady performance gains by considering additional supervised boundary information during training. Our proposed framework achieves state-of-the-art performance on ActivityNet'18 DenseCaption dataset and Charades-STA dataset while observing only 10 or less clips per video.Comment: AAAI 201

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Conditional Image-Text Embedding Networks

Author: A Gordo
A Rohrbach
BA Plummer
CL Zitnick
F Radenović
L Yu
M Wang
R Krishna
Publication venue
Publication date: 28/07/2018
Field of study

This paper presents an approach for grounding phrases in images which jointly learns multiple text-conditioned embeddings in a single end-to-end model. In order to differentiate text phrases into semantically distinct subspaces, we propose a concept weight branch that automatically assigns phrases to embeddings, whereas prior works predefine such assignments. Our proposed solution simplifies the representation requirements for individual embeddings and allows the underrepresented concepts to take advantage of the shared representations before feeding them into concept-specific layers. Comprehensive experiments verify the effectiveness of our approach across three phrase grounding datasets, Flickr30K Entities, ReferIt Game, and Visual Genome, where we obtain a (resp.) 4%, 3%, and 4% improvement in grounding performance over a strong region-phrase embedding baseline.Comment: ECCV 2018 accepted pape

arXiv.org e-Print Archive

Crossref