766 research outputs found
DiffusionVMR: Diffusion Model for Video Moment Retrieval
Video moment retrieval is a fundamental visual-language task that aims to
retrieve target moments from an untrimmed video based on a language query.
Existing methods typically generate numerous proposals manually or via
generative networks in advance as the support set for retrieval, which is not
only inflexible but also time-consuming. Inspired by the success of diffusion
models on object detection, this work aims at reformulating video moment
retrieval as a denoising generation process to get rid of the inflexible and
time-consuming proposal generation. To this end, we propose a novel
proposal-free framework, namely DiffusionVMR, which directly samples random
spans from noise as candidates and introduces denoising learning to ground
target moments. During training, Gaussian noise is added to the real moments,
and the model is trained to learn how to reverse this process. In inference, a
set of time spans is progressively refined from the initial noise to the final
output. Notably, the training and inference of DiffusionVMR are decoupled, and
an arbitrary number of random spans can be used in inference without being
consistent with the training phase. Extensive experiments conducted on three
widely-used benchmarks (i.e., QVHighlight, Charades-STA, and TACoS) demonstrate
the effectiveness of the proposed DiffusionVMR by comparing it with
state-of-the-art methods
DDRF: Denoising Diffusion Model for Remote Sensing Image Fusion
Denosing diffusion model, as a generative model, has received a lot of
attention in the field of image generation recently, thanks to its powerful
generation capability. However, diffusion models have not yet received
sufficient research in the field of image fusion. In this article, we introduce
diffusion model to the image fusion field, treating the image fusion task as
image-to-image translation and designing two different conditional injection
modulation modules (i.e., style transfer modulation and wavelet modulation) to
inject coarse-grained style information and fine-grained high-frequency and
low-frequency information into the diffusion UNet, thereby generating fused
images. In addition, we also discussed the residual learning and the selection
of training objectives of the diffusion model in the image fusion task.
Extensive experimental results based on quantitative and qualitative
assessments compared with benchmarks demonstrates state-of-the-art results and
good generalization performance in image fusion tasks. Finally, it is hoped
that our method can inspire other works and gain insight into this field to
better apply the diffusion model to image fusion tasks. Code shall be released
for better reproducibility
- …