214 research outputs found
Real-time Controllable Denoising for Image and Video
Controllable image denoising aims to generate clean samples with human
perceptual priors and balance sharpness and smoothness. In traditional
filter-based denoising methods, this can be easily achieved by adjusting the
filtering strength. However, for NN (Neural Network)-based models, adjusting
the final denoising strength requires performing network inference each time,
making it almost impossible for real-time user interaction. In this paper, we
introduce Real-time Controllable Denoising (RCD), the first deep image and
video denoising pipeline that provides a fully controllable user interface to
edit arbitrary denoising levels in real time with a single network
inference. Unlike existing controllable denoising methods that require multiple
denoisers and training stages, RCD replaces the last output layer (which
usually outputs a single noise map) of an existing CNN-based model with a
lightweight module that outputs multiple noise maps. We propose a novel Noise
Decorrelation process to enforce the orthogonality of the noise feature maps,
allowing arbitrary noise-level control through noise map interpolation. This
interpolation is network-free, requiring no additional inference. Our experiments
show that RCD can enable real-time editable image and video denoising for
various existing heavy-weight models without sacrificing their original
performance.
Comment: CVPR 202
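The control mechanism in this abstract can be sketched in a few lines: once the K noise maps output by the modified last layer are made mutually orthogonal, any user-chosen weighting of them yields an intermediate denoising level by simple subtraction, with no further network inference. The following NumPy illustration is a minimal sketch, not the paper's implementation; in particular, the QR-based orthogonalization (which also renormalizes map magnitudes) and all function names are assumptions.

```python
import numpy as np

def decorrelate(noise_maps):
    """Orthogonalize a stack of K noise maps of shape (K, H, W).
    A sketch of the Noise Decorrelation idea: QR decomposition gives an
    orthonormal basis spanning the same maps (note: this also rescales
    them, which the paper's actual procedure may handle differently)."""
    k = noise_maps.shape[0]
    flat = noise_maps.reshape(k, -1).astype(np.float64)
    q, _ = np.linalg.qr(flat.T)          # columns of q are orthonormal
    return q.T.reshape(noise_maps.shape)

def denoise_at_level(noisy, ortho_maps, weights):
    """Network-free control: blend the orthogonal noise maps with
    user-chosen weights and subtract the blend from the noisy input."""
    blended = np.tensordot(weights, ortho_maps, axes=1)  # (H, W)
    return noisy - blended
```

Because the maps are orthogonal, sliding the weights interpolates smoothly between noise levels without re-running the denoiser, which is what makes the editing real-time.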
DiffusionMat: Alpha Matting as Sequential Refinement Learning
In this paper, we introduce DiffusionMat, a novel image matting framework
that employs a diffusion model for the transition from coarse to refined alpha
mattes. Diverging from conventional methods that utilize trimaps merely as
loose guidance for alpha matte prediction, our approach treats image matting as
a sequential refinement learning process. This process begins with the addition
of noise to trimaps and iteratively denoises them using a pre-trained diffusion
model, which incrementally guides the prediction towards a clean alpha matte.
The key innovation of our framework is a correction module that adjusts the
output at each denoising step, ensuring that the final result is consistent
with the input image's structures. We also introduce the Alpha Reliability
Propagation, a novel technique designed to maximize the utility of available
guidance by selectively enhancing the trimap regions with confident alpha
information, thus simplifying the correction task. To train the correction
module, we devise specialized loss functions that target the accuracy of the
alpha matte's edges and the consistency of its opaque and transparent regions.
We evaluate our model across several image matting benchmarks, and the results
indicate that DiffusionMat consistently outperforms existing methods. Project
page at https://cnnlstm.github.io/DiffusionMa
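The sequential refinement process described above can be summarized as a short loop: noise the trimap, then iteratively denoise it, correcting each intermediate alpha matte against the input image. Everything below is a hedged sketch of that loop, assuming placeholder interfaces: `denoise_step` stands in for one step of the pre-trained diffusion model and `correct` for the paper's correction module; the actual noise schedule and module signatures are not given in the abstract.

```python
import numpy as np

def diffusionmat_refine(trimap, image, denoise_step, correct,
                        num_steps=10, rng=None):
    """Sketch of a DiffusionMat-style refinement loop (names assumed).
    Starts from a noised trimap and walks the diffusion steps backwards,
    applying a correction after each step so the result stays consistent
    with the input image's structures."""
    rng = rng or np.random.default_rng(0)
    alpha = trimap + rng.normal(scale=1.0, size=trimap.shape)  # noised start
    for t in reversed(range(num_steps)):
        alpha = denoise_step(alpha, image, t)  # one pre-trained diffusion step
        alpha = correct(alpha, image, trimap)  # correction module (assumed API)
    return np.clip(alpha, 0.0, 1.0)            # valid alpha values in [0, 1]
```

The correction-after-each-step structure is the abstract's key point: the diffusion model proposes, and a learned module anchors the proposal to the image before the next denoising step.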
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Instruction tuning large language models (LLMs) on image-text pairs has
achieved unprecedented vision-language multimodal abilities. However, their
vision-language alignment is built only at the image level; the lack of
region-level alignment limits their progress toward fine-grained multimodal
understanding. In this paper, we propose instruction tuning on
region-of-interest. The key design is to reformulate bounding boxes as spatial
instructions. The interleaved sequence of visual features extracted by the
spatial instruction and the language embeddings is fed into the LLM, which is
trained on the transformed region-text data in instruction-tuning format. Our
region-level vision-language model, termed GPT4RoI, brings a brand-new
conversational and interactive experience beyond image-level understanding.
(1) Controllability: users can interact with our model through both language and
spatial instructions to flexibly adjust the level of detail in a question. (2)
Capacity: our model supports not only single-region but also multi-region
spatial instructions, unlocking region-level multimodal capabilities such as
detailed region captioning and complex region reasoning. (3) Composition: any
off-the-shelf object detector can serve as a spatial instruction provider,
letting users mine informative object attributes from our model, such as color,
shape, material, action, and relation to other objects. The code, data, and demo
can be found
at https://github.com/jshilong/GPT4RoI.
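The "bounding box as spatial instruction" design can be illustrated schematically: a placeholder token in the tokenized prompt is replaced by an RoI-pooled visual feature, producing the interleaved text/vision sequence the abstract describes as input to the LLM. All names here (`<region>`, mean pooling over the box, the embedding callback) are illustrative assumptions, not GPT4RoI's actual API.

```python
import numpy as np

def roi_feature(feat_map, box):
    """Mean-pool a (C, H, W) feature map inside a box (x0, y0, x1, y1).
    A crude stand-in for the model's RoI feature extractor."""
    x0, y0, x1, y1 = box
    return feat_map[:, y0:y1, x0:x1].mean(axis=(1, 2))  # shape (C,)

def build_spatial_instruction(text_tokens, region_features, embed_text):
    """Assemble the interleaved input sequence (sketch, names assumed):
    each "<region>" placeholder is replaced by the visual feature for the
    corresponding bounding box; other tokens go through text embedding."""
    seq, r = [], 0
    for tok in text_tokens:
        if tok == "<region>":
            seq.append(region_features[r])  # visual feature for box r
            r += 1
        else:
            seq.append(embed_text(tok))     # assumed text-embedding lookup
    return np.stack(seq)                    # (seq_len, C) ready for the LLM
```

This also shows why any off-the-shelf detector composes with the model: a detector only needs to supply boxes, from which `roi_feature` produces the spatial-instruction embeddings.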