Video moment retrieval is a fundamental vision-language task that aims to
retrieve target moments from an untrimmed video based on a language query.
Existing methods typically generate numerous proposals in advance, either
manually or via generative networks, as the support set for retrieval, which
is both inflexible and time-consuming. Inspired by the success of diffusion
models in object detection, this work reformulates video moment retrieval as
a denoising generation process that eliminates this inflexible and
time-consuming proposal generation. To this end, we propose a novel
proposal-free framework, namely DiffusionVMR, which directly samples random
spans from noise as candidates and introduces denoising learning to ground
target moments. During training, Gaussian noise is added to the ground-truth
moment spans, and the model learns to reverse this noising process, as
sketched below.
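As a rough illustration (not the authors' implementation), the following
sketch applies the standard DDPM forward corruption to ground-truth spans
parameterized as normalized (center, width) pairs; the schedule values,
parameterization, and all names are assumptions:

```python
import torch

# Linear beta schedule; these values are illustrative, not the paper's setting.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def noise_spans(gt_spans: torch.Tensor, t: int):
    """Forward process q(x_t | x_0): corrupt ground-truth (center, width)
    spans with Gaussian noise at timestep t, using the DDPM closed form."""
    noise = torch.randn_like(gt_spans)
    x_t = alphas_cumprod[t].sqrt() * gt_spans \
        + (1.0 - alphas_cumprod[t]).sqrt() * noise
    return x_t, noise

# Example: corrupt two ground-truth moments, given as normalized
# (center, width) pairs, at a randomly drawn timestep.
gt = torch.tensor([[0.30, 0.10], [0.65, 0.20]])
t = int(torch.randint(0, T, (1,)))
x_t, eps = noise_spans(gt, t)
```

The model is then trained to recover the clean spans from the corrupted
ones, conditioned on the video and the language query.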
During inference, a set of time spans is progressively refined from the
initial noise into the final output. Notably, the training and inference of
DiffusionVMR are decoupled, and
an arbitrary number of random spans can be drawn at inference, independent
of the number used during training; a minimal sketch of this refinement loop
follows.
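Under the same assumptions as above, a DDIM-style (eta = 0) sketch of the
iterative refinement might look as follows; the five-step schedule, span
count, and model interface are all illustrative, not the authors' API:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
ac = torch.cumprod(1.0 - betas, dim=0)  # cumulative alpha products

@torch.no_grad()
def retrieve_moments(model, video_feats, query_feats, num_spans=100):
    """Refine num_spans random (center, width) spans from pure noise.
    num_spans is free to differ from the span count used in training,
    since training and inference are decoupled."""
    steps = [999, 749, 499, 249, 0]
    spans = torch.randn(num_spans, 2)  # initial candidates drawn from noise
    x0 = spans
    for t, t_next in zip(steps[:-1], steps[1:]):
        # The (assumed) denoiser predicts clean spans conditioned on
        # video and query features at the current timestep.
        x0 = model(video_feats, query_feats, spans, t)
        # Recover the implied noise and step toward the next timestep.
        eps = (spans - ac[t].sqrt() * x0) / (1.0 - ac[t]).sqrt()
        spans = ac[t_next].sqrt() * x0 + (1.0 - ac[t_next]).sqrt() * eps
    return x0.clamp(0.0, 1.0)  # final spans in normalized coordinates
```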
Extensive experiments on three widely-used benchmarks (i.e., QVHighlights,
Charades-STA, and TACoS) demonstrate the effectiveness of the proposed
DiffusionVMR in comparison with state-of-the-art methods.