Cached Transformers: Improving Transformers with Differentiable Memory Cache
This work introduces a new Transformer model called Cached Transformer, which
uses Gated Recurrent Cached (GRC) attention to extend the self-attention
mechanism with a differentiable memory cache of tokens. GRC attention enables
attending to both past and current tokens, increasing the receptive field of
attention and enabling long-range dependencies to be captured. By utilizing a
recurrent gating unit to continuously update the cache, our model achieves
significant advancements in six language and vision tasks, including
language modeling, machine translation, ListOPs, image classification, object
detection, and instance segmentation. Furthermore, our approach surpasses
previous memory-based techniques in tasks such as language modeling and
can be applied to a broader range of scenarios.
Comment: AAAI 202
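The gated cache mechanism described above can be sketched compactly. The following PyTorch snippet is a minimal single-head illustration, assuming mean-pooled cache slots, a sigmoid gate, and a detached cross-batch cache; the class name and every design detail below are assumptions for exposition, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRCAttention(nn.Module):
    """Single-head sketch: attention over current tokens plus a token cache
    that is updated by a recurrent gate (illustrative, not the paper's code)."""

    def __init__(self, dim, cache_len=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)  # decides how much old cache to keep
        self.cache_len = cache_len
        self.register_buffer("cache", torch.zeros(1, cache_len, dim))

    def _update_cache(self, x):
        # Compress the current tokens to the cache length (mean pooling here).
        pooled = F.adaptive_avg_pool1d(x.transpose(1, 2), self.cache_len).transpose(1, 2)
        cache = self.cache.expand(x.size(0), -1, -1)
        g = torch.sigmoid(self.gate(torch.cat([cache, pooled], dim=-1)))
        new_cache = g * cache + (1 - g) * pooled  # gated recurrent update
        self.cache = new_cache.mean(0, keepdim=True).detach()
        return new_cache

    def forward(self, x):
        cache = self._update_cache(x)
        kv = torch.cat([cache, x], dim=1)  # attend to cached and current tokens
        scores = self.q(x) @ self.k(kv).transpose(-2, -1) / x.size(-1) ** 0.5
        return torch.softmax(scores, dim=-1) @ self.v(kv)
```

Detaching the stored cache in this sketch simply keeps memory bounded across batches; per the abstract, the actual GRC cache is a differentiable memory continuously updated by the recurrent gating unit.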
Real-time Controllable Denoising for Image and Video
Controllable image denoising aims to generate clean samples with human
perceptual priors and balance sharpness and smoothness. In traditional
filter-based denoising methods, this can be easily achieved by adjusting the
filtering strength. However, for NN (Neural Network)-based models, adjusting
the final denoising strength requires performing network inference each time,
making it almost impossible for real-time user interaction. In this paper, we
introduce Real-time Controllable Denoising (RCD), the first deep image and
video denoising pipeline that provides a fully controllable user interface to
edit arbitrary denoising levels in real-time with only one-time network
inference. Unlike existing controllable denoising methods that require multiple
denoisers and training stages, RCD replaces the last output layer (which
usually outputs a single noise map) of an existing CNN-based model with a
lightweight module that outputs multiple noise maps. We propose a novel Noise
Decorrelation process to enforce the orthogonality of the noise feature maps,
allowing arbitrary noise level control through noise map interpolation. This
process is network-free and requires no additional inference. Our experiments
show that RCD can enable real-time editable image and video denoising for
various existing heavy-weight models without sacrificing their original
performance.
Comment: CVPR 202
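As a rough illustration of how one-shot inference plus network-free control could fit together, here is a hedged sketch: a set of noise maps produced by the modified output head is orthogonalized (a simple Gram-Schmidt stands in for the paper's Noise Decorrelation process) and then blended with user-chosen weights, so any denoising strength can be previewed without running the network again. All function and variable names are illustrative:

```python
import torch

def decorrelate(noise_maps):
    """Orthogonalize the noise maps (Gram-Schmidt stand-in for the paper's
    Noise Decorrelation) so interpolation weights act independently."""
    b, k, h, w = noise_maps.shape
    flat = noise_maps.reshape(b, k, -1)
    basis = []
    for i in range(k):
        v = flat[:, i]
        for u in basis:
            coef = (v * u).sum(-1, keepdim=True) / (u * u).sum(-1, keepdim=True).clamp_min(1e-8)
            v = v - coef * u
        basis.append(v)
    return torch.stack(basis, dim=1).reshape(b, k, h, w)

def denoise_at_level(image, noise_maps, weights):
    """Network-free control: blend decorrelated noise maps with user weights
    and subtract the blend from the input."""
    ortho = decorrelate(noise_maps)
    w = torch.as_tensor(weights, dtype=ortho.dtype).view(1, -1, 1, 1)
    return image - (w * ortho).sum(dim=1)

# One network inference produces the noise maps; after that, any strength
# can be previewed in real time by re-weighting them.
image = torch.rand(1, 256, 256)            # toy grayscale input
noise_maps = torch.randn(1, 4, 256, 256)   # e.g. 4 maps from the modified head
preview = denoise_at_level(image, noise_maps, [0.8, 0.5, 0.2, 0.0])
```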
DiffusionMat: Alpha Matting as Sequential Refinement Learning
In this paper, we introduce DiffusionMat, a novel image matting framework
that employs a diffusion model for the transition from coarse to refined alpha
mattes. Diverging from conventional methods that utilize trimaps merely as
loose guidance for alpha matte prediction, our approach treats image matting as
a sequential refinement learning process. This process begins with the addition
of noise to trimaps and iteratively denoises them using a pre-trained diffusion
model, which incrementally guides the prediction towards a clean alpha matte.
The key innovation of our framework is a correction module that adjusts the
output at each denoising step, ensuring that the final result is consistent
with the input image's structures. We also introduce the Alpha Reliability
Propagation, a novel technique designed to maximize the utility of available
guidance by selectively enhancing the trimap regions with confident alpha
information, thus simplifying the correction task. To train the correction
module, we devise specialized loss functions that target the accuracy of the
alpha matte's edges and the consistency of its opaque and transparent regions.
We evaluate our model across several image matting benchmarks, and the results
indicate that DiffusionMat consistently outperforms existing methods. Project
page at https://cnnlstm.github.io/DiffusionMa
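A minimal sketch of the sequential refinement loop described above, assuming a pre-trained reverse-diffusion step and a learned correction module passed in as callables; both signatures, the noise schedule, and the initialization scale are assumptions for illustration:

```python
import torch

def refine_alpha(trimap, image, denoiser, correction, timesteps, noise_scale=1.0):
    """Sequential refinement sketch: start from a noised trimap and alternate
    a pre-trained denoising step with a correction step (signatures assumed)."""
    # Noise the trimap rather than starting from pure noise, so the coarse
    # guidance is preserved at the start of the trajectory.
    x = trimap + noise_scale * torch.randn_like(trimap)
    for t in timesteps:
        x = denoiser(x, t, image)   # one reverse-diffusion step toward a clean alpha
        x = correction(x, image)    # keep the intermediate alpha consistent with
                                    # the input image's structures
    return x.clamp(0, 1)            # final alpha matte in [0, 1]
```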
ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning
Charts play a vital role in data visualization, understanding data patterns,
and informed decision-making. However, their unique combination of graphical
elements (e.g., bars, lines) and textual components (e.g., labels, legends)
poses challenges for general-purpose multimodal models. While vision-language
models trained on chart data excel in comprehension, they struggle with
generalization. To address these challenges, we propose ChartAssistant, a
chart-based vision-language model for universal chart comprehension and
reasoning. ChartAssistant leverages ChartSFT, a comprehensive dataset covering
diverse chart-related tasks with basic (e.g., bars and pies) and specialized
(e.g., radars and bubbles) chart types. It undergoes a two-stage training
process, starting with pre-training on chart-to-table parsing to align chart
and text, followed by multitask instruction-following fine-tuning. This
approach enables ChartAssistant to achieve competitive performance across
various chart tasks. Experimental results demonstrate significant performance
gains over the state-of-the-art UniChart and ChartLlama methods, especially
outperforming them on real-world chart data in the zero-shot setting. The code
and data are available at https://github.com/OpenGVLab/ChartAst.
Comment: Updated and corrected experimental results, removal of inappropriate experiments, and a more comprehensive experimental setup
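The two-stage recipe can be summarized in a short, hedged sketch; the loaders, the model interface, and the loss convention below are placeholders rather than the released ChartAssistant training code:

```python
def two_stage_training(model, chart_to_table_loader, instruction_loader, optimizer):
    """Two-stage recipe sketch; every argument here is a placeholder."""
    def run(loader):
        for inputs, labels in loader:
            loss = model(**inputs, labels=labels).loss  # standard LM loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    run(chart_to_table_loader)   # Stage 1: chart-to-table parsing aligns chart and text
    run(instruction_loader)      # Stage 2: multitask instruction-following fine-tuning
```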
AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions
Large Vision-Language Models (LVLMs) have made significant progress in
responding to visual-instructions from users. However, these instructions,
encompassing images and text, are susceptible to both intentional and
inadvertent attacks. Despite the critical importance of LVLMs' robustness
against such threats, current research in this area remains limited. To bridge
this gap, we introduce AVIBench, a framework designed to analyze the robustness
of LVLMs when facing various adversarial visual-instructions (AVIs), including
four types of image-based AVIs, ten types of text-based AVIs, and nine types of
content bias AVIs (such as gender, violence, cultural, and racial biases, among
others). We generate 260K AVIs encompassing five categories of multimodal
capabilities (nine tasks) and content bias. We then conduct a comprehensive
evaluation involving 14 open-source LVLMs to assess their performance. AVIBench
also serves as a convenient tool for practitioners to evaluate the robustness
of LVLMs against AVIs. Our findings and extensive experimental results shed
light on the vulnerabilities of LVLMs, and highlight that inherent biases exist
even in advanced closed-source LVLMs like GeminiProVision and GPT-4V. This
underscores the importance of enhancing the robustness, security, and fairness
of LVLMs. The source code and benchmark will be made publicly available.
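As an illustration of how such a benchmark might be consumed, the sketch below loops a model over AVIs and aggregates per-task scores; the data layout and both callables are hypothetical, not the AVIBench API:

```python
def evaluate_robustness(model_answer, avis, judge):
    """Score an LVLM per task over adversarial visual-instructions.
    `model_answer(image, text)` and `judge(task, answer)` are hypothetical."""
    scores = {}
    for avi in avis:  # each item: {"image": ..., "text": ..., "task": ...}
        answer = model_answer(avi["image"], avi["text"])
        scores.setdefault(avi["task"], []).append(judge(avi["task"], answer))
    return {task: sum(vals) / len(vals) for task, vals in scores.items()}
```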