PixelFolder: An Efficient Progressive Pixel Synthesis Network for Image Generation
Pixel synthesis is a promising paradigm for image generation, as it can directly exploit pixel-wise prior knowledge. However, existing methods still suffer from excessive memory footprint and computational overhead. In this paper, we propose a progressive pixel synthesis network for efficient image generation, coined PixelFolder. Specifically, PixelFolder formulates image generation as a progressive pixel regression problem and synthesizes images in a multi-stage paradigm, which greatly reduces the overhead caused by large tensor transformations. In addition, we introduce novel pixel folding operations that further improve model efficiency while preserving pixel-wise prior knowledge for end-to-end regression. With these designs, we greatly reduce the cost of pixel synthesis, e.g., cutting computation by 90% and parameters by 57% compared with CIPS, the latest pixel synthesis method. To validate our approach, we conduct extensive experiments on two benchmark datasets, FFHQ and LSUN Church. The results show that, with much less expenditure, PixelFolder obtains new state-of-the-art (SOTA) performance on both benchmarks, i.e., 3.77 FID on FFHQ and 2.45 FID on LSUN Church. PixelFolder is also more efficient than SOTA methods such as StyleGAN2, reducing computation by about 74% and parameters by about 36%. These results strongly validate the effectiveness of the proposed PixelFolder.
Comment: 11 pages, 7 figures
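The pixel folding operation can be pictured as a space-to-depth rearrangement: spatial blocks are folded into channels so each stage processes a smaller map without discarding pixel-wise information. The following is a minimal numpy sketch of that idea, not the paper's implementation; the function names and the fold ratio are illustrative.

```python
import numpy as np

def fold(x, r=2):
    # Space-to-depth "fold": move each r x r spatial block into channels,
    # shrinking the feature map while keeping every pixel's information.
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)          # split spatial dims into blocks
    return x.transpose(0, 2, 4, 1, 3).reshape(c * r * r, h // r, w // r)

def unfold(x, r=2):
    # Depth-to-space "unfold": exact inverse of fold, restoring resolution.
    c, h, w = x.shape
    x = x.reshape(c // (r * r), r, r, h, w)
    return x.transpose(0, 3, 1, 4, 2).reshape(c // (r * r), h * r, w * r)

feat = np.random.rand(4, 8, 8)
folded = fold(feat)        # (16, 4, 4): cheaper to transform per stage
restored = unfold(folded)  # (4, 8, 8): pixel information fully recovered
```

Because fold/unfold is lossless, stages can operate on the compact folded tensor and still regress every output pixel end to end.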
Shadow-Aware Dynamic Convolution for Shadow Removal
With shadows present in many collected images, shadow removal has attracted increasing attention, since uncontaminated images are vital for many downstream multimedia tasks. Current methods apply the same convolution operations to both shadow and non-shadow regions, ignoring the large gap between the color mappings of the two regions; this leads to poor reconstruction quality and a heavy computation burden. To solve this problem, this paper introduces a novel plug-and-play Shadow-Aware Dynamic Convolution (SADC) module that decouples the interdependence between the shadow and non-shadow regions. Motivated by the fact that the color mapping of the non-shadow region is easier to learn, our SADC processes the non-shadow region with a lightweight, computationally cheap convolution module and recovers the shadow region with a more complex convolution module to ensure reconstruction quality. Given that the non-shadow region often contains more background color information, we further develop a novel intra-convolution distillation loss to strengthen the information flow from the non-shadow region to the shadow region. Extensive experiments on the ISTD and SRD datasets show that our method outperforms many state-of-the-art approaches in shadow removal. Our code is available at https://github.com/xuyimin0926/SADC
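The region decoupling can be sketched as mask-gated routing: non-shadow pixels go through a cheap branch, shadow pixels through a deeper one. This numpy toy uses fixed 1x1 (channel-mixing) weights in place of SADC's learned dynamic kernels, so all weights and function names here are hypothetical.

```python
import numpy as np

def sadc_forward(img, shadow_mask, w_light, w_heavy):
    # Decouple the two regions: a single cheap 1x1 conv (a channel-wise
    # matrix multiply) for non-shadow pixels, a deeper stack for shadow
    # pixels; the shadow mask fuses the two outputs per pixel.
    c, h, w = img.shape
    flat = img.reshape(c, -1)                       # (C, H*W)
    light = (w_light @ flat).reshape(c, h, w)       # cheap non-shadow branch
    heavy = flat
    for wk in w_heavy:                              # deeper shadow branch
        heavy = np.maximum(wk @ heavy, 0.0)         # ReLU between layers
    heavy = heavy.reshape(c, h, w)
    m = shadow_mask[None]                           # (1, H, W), 1 = shadow
    return m * heavy + (1 - m) * light              # fuse by region

rng = np.random.default_rng(0)
img = rng.random((3, 8, 8))
mask = np.zeros((8, 8)); mask[2:6, 2:6] = 1.0       # toy shadow region
w_light = rng.random((3, 3))
w_heavy = [rng.random((3, 3)) for _ in range(3)]
out = sadc_forward(img, mask, w_light, w_heavy)
```

Only shadow pixels pay for the heavy branch's extra depth, which is how the design trades computation for quality exactly where it is needed.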
Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization
Open-vocabulary object detection (OVD) aims to scale up the vocabulary to detect objects of novel categories beyond the training vocabulary. Recent work resorts to the rich knowledge in pre-trained vision-language models; however, existing methods are ineffective at proposal-level vision-language alignment. Meanwhile, the models usually suffer from a confidence bias toward base categories and perform worse on novel ones. To overcome these challenges, we present MEDet, a novel and effective OVD framework with proposal mining and prediction equalization. First, we design an online proposal mining scheme to refine the inherited vision-semantic knowledge from coarse to fine, allowing proposal-level, detection-oriented feature alignment. Second, based on causal inference theory, we introduce a class-wise backdoor adjustment that reinforces predictions on novel categories and improves overall OVD performance. Extensive experiments on the COCO and LVIS benchmarks verify the superiority of MEDet over competing approaches in detecting objects of novel categories, e.g., 32.6% AP50 on COCO and 22.4% mask mAP on LVIS.
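One common way to realize a class-wise equalization of this kind is log-prior logit adjustment: subtracting the log of a base-heavy class prior from the logits so that novel categories are no longer systematically under-scored. MEDet's actual adjustment is derived from causal backdoor formulas; the sketch below is a simplified stand-in with hypothetical numbers.

```python
import numpy as np

def equalize_logits(logits, class_prior, tau=1.0):
    # Subtract the log of the (base-biased) class prior from the logits;
    # tau controls the strength of the adjustment.
    return logits - tau * np.log(class_prior)

logits = np.array([2.0, 1.9, 1.5])      # two base classes, one novel class
prior = np.array([0.48, 0.48, 0.04])    # training frequency, base-heavy
adj = equalize_logits(logits, prior)
```

Before adjustment the biased base class wins; after adjustment the rarely-seen novel class, whose raw score was depressed by its small prior, comes out on top.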
Fine-grained Data Distribution Alignment for Post-Training Quantization
While post-training quantization is popular largely because it avoids access to the full original training dataset, its performance also suffers from having only scarce calibration images. To alleviate this limitation, we leverage the synthetic data introduced by zero-shot quantization together with the calibration dataset, and propose a fine-grained data distribution alignment (FDDA) method to boost post-training quantization. The method builds on two important properties of batch normalization statistics (BNS) that we observed in the deep layers of trained networks: inter-class separation and intra-class incohesion. To preserve this fine-grained distribution information: 1) we compute the per-class BNS of the calibration dataset as the BNS center of each class, and propose a BNS-centralized loss that forces the synthetic data distribution of each class toward its own center; 2) we add Gaussian noise to the centers to imitate the incohesion, and propose a BNS-distorted loss that forces the synthetic data distribution of each class toward the distorted centers. With these two fine-grained losses, our method achieves state-of-the-art performance on ImageNet, especially when both the first and last layers are quantized to low bit-widths. Code is at \url{https://github.com/zysxmu/FDDA}.
Comment: ECCV202
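The two losses can be sketched directly from their definitions: both measure the distance between a synthetic batch's batch-norm statistics and a per-class center, with the distorted variant noising the center first. This is a minimal numpy illustration under assumed shapes, not the paper's code.

```python
import numpy as np

def bns(x):
    # Batch-normalization statistics of a feature batch: (mean, variance).
    return x.mean(axis=0), x.var(axis=0)

def bns_centralized_loss(synth, center):
    # Pull the synthetic batch's BNS toward its class's BNS center
    # (per-class statistics measured on the small calibration set).
    mu, var = bns(synth)
    c_mu, c_var = center
    return np.sum((mu - c_mu) ** 2) + np.sum((var - c_var) ** 2)

def bns_distorted_loss(synth, center, sigma=0.1, seed=0):
    # Same pull, but toward a Gaussian-noised copy of the center,
    # imitating the intra-class incohesion seen in deep layers.
    rng = np.random.default_rng(seed)
    c_mu, c_var = center
    noisy = (c_mu + rng.normal(0, sigma, c_mu.shape),
             c_var + rng.normal(0, sigma, c_var.shape))
    return bns_centralized_loss(synth, noisy)

calib = np.random.default_rng(1).random((32, 8))  # one class's calibration features
center = bns(calib)                               # that class's BNS center
synth = np.random.default_rng(2).random((32, 8))  # synthetic batch to align
l_c = bns_centralized_loss(synth, center)
l_d = bns_distorted_loss(synth, center)
```

Minimizing the two losses jointly keeps synthetic classes separated (centralized term) while spreading samples within each class (distorted term).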
Woodpecker: Hallucination Correction for Multimodal Large Language Models
Hallucination is a big shadow hanging over the rapidly evolving Multimodal Large Language Models (MLLMs): the generated text can be inconsistent with the image content. To mitigate hallucinations, existing studies mainly resort to instruction tuning, which requires retraining the models on specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Like a woodpecker healing trees, it picks out and corrects hallucinations in the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented as a post-hoc remedy, Woodpecker can easily serve different MLLMs, while remaining interpretable through the intermediate outputs of the five stages. We evaluate Woodpecker both quantitatively and qualitatively and show the large potential of this new paradigm. On the POPE benchmark, our method improves accuracy by 30.66%/24.33% over the baselines MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker
Comment: 16 pages, 7 figures.
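The five-stage flow can be pictured as a pipeline that records every intermediate output, which is what makes the correction interpretable. The sketch below is a toy skeleton only: the stage functions are hypothetical stand-ins for the LLM/VLM calls the real system makes, and the "image" is reduced to a set of objects actually present.

```python
def woodpecker_correct(text, image, stages):
    # Run the five stages in order, keeping each intermediate output in a
    # trace so the final correction can be inspected step by step.
    trace = {}
    trace["concepts"] = stages["extract_concepts"](text)
    trace["questions"] = stages["formulate_questions"](trace["concepts"])
    trace["knowledge"] = stages["validate_visually"](trace["questions"], image)
    trace["claims"] = stages["generate_claims"](trace["knowledge"])
    trace["corrected"] = stages["correct_text"](text, trace["claims"])
    return trace

# Toy stages (hypothetical): validation just checks set membership, and
# correction drops words not supported by any validated claim.
toy = {
    "extract_concepts": lambda t: [w for w in t.split() if w.isalpha()],
    "formulate_questions": lambda cs: [f"Is there a {c}?" for c in cs],
    "validate_visually": lambda qs, img: {q: any(o in q for o in img) for q in qs},
    "generate_claims": lambda kn: [q for q, ok in kn.items() if ok],
    "correct_text": lambda t, claims: " ".join(
        w for w in t.split() if any(w in c for c in claims)),
}
trace = woodpecker_correct("dog chasing frisbee", {"dog", "frisbee"}, toy)
```

Because the trace keeps concepts, questions, validated knowledge, and claims, a user can see exactly which stage flagged the unsupported word.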
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) rely on powerful LLMs to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems about an image. However, such case studies can hardly reflect the full performance of an MLLM, and a comprehensive evaluation has been lacking. In this paper, we fill this gap by presenting MME, the first comprehensive MLLM evaluation benchmark. It measures both perception and cognition abilities across a total of 14 subtasks. To avoid the data leakage that may arise from directly using public datasets for evaluation, all instruction-answer pairs are manually designed. The concise instruction design allows us to compare MLLMs fairly, rather than struggling with prompt engineering, and also makes quantitative statistics easy to compute. A total of 10 advanced MLLMs are comprehensively evaluated on MME; the results not only show that existing MLLMs still have considerable room for improvement, but also reveal promising directions for subsequent model optimization.
Comment: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Model
Super Vision Transformer
We attempt to reduce the computational cost of vision transformers (ViTs), which grows quadratically with the number of tokens. We present a novel training paradigm that trains only one ViT model at a time, yet provides improved image recognition performance at various computational budgets. The trained model, termed super vision transformer (SuperViT), can process incoming patches of multiple sizes and preserve informative tokens at multiple keeping rates (the ratio of tokens kept), achieving good hardware efficiency for inference, given that available hardware resources often change over time. Experimental results on ImageNet demonstrate that SuperViT considerably reduces the computational cost of ViT models while even improving performance. For example, we reduce the FLOPs of DeiT-S by 2x while increasing Top-1 accuracy by 0.2%, and by 0.7% at a 1.5x reduction. SuperViT also significantly outperforms existing work on efficient vision transformers; for example, at the same FLOPs, SuperViT surpasses the recent state-of-the-art (SoTA) EViT by 1.1% with DeiT-S as the backbone. The project is publicly available at https://github.com/lmbxmu/SuperViT
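The token keeping rate can be sketched as a top-k selection over an importance score (e.g., attention to the class token), with the same trained model serving different budgets simply by varying the rate at inference time. A minimal numpy sketch, with hypothetical names and random scores standing in for learned ones:

```python
import numpy as np

def keep_tokens(tokens, scores, keep_rate):
    # Preserve only the most informative tokens: rank by an importance
    # score and keep the top `keep_rate` fraction, in original order.
    n = tokens.shape[0]
    k = max(1, int(round(n * keep_rate)))
    idx = np.argsort(scores)[::-1][:k]       # indices of the top-k tokens
    return tokens[np.sort(idx)]              # restore original token order

rng = np.random.default_rng(0)
tokens = rng.random((16, 8))                 # 16 patch tokens, dim 8
scores = rng.random(16)                      # stand-in importance scores
# One model, several budgets: pick the keeping rate per deployment.
small = keep_tokens(tokens, scores, 0.25)    # 4 tokens, cheapest
large = keep_tokens(tokens, scores, 0.75)    # 12 tokens, most accurate
```

Shorter token sequences shrink the quadratic attention cost, which is where the FLOPs savings quoted above come from.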
FoPro: Few-Shot Guided Robust Webly-Supervised Prototypical Learning
Recently, webly supervised learning (WSL) has been studied as a way to leverage the abundant, accessible data on the Internet. Most existing methods focus on learning noise-robust models from web images while neglecting the performance drop caused by the gap between the web domain and the real-world domain; yet only by closing this gap can the practical value of web datasets be fully exploited. To this end, we propose a Few-shot guided Prototypical (FoPro) representation learning method, which needs only a few labeled examples from reality and significantly improves performance in the real-world domain. Specifically, we initialize each class center with few-shot real-world data as the ``realistic'' prototype. Then, the intra-class distance between web instances and ``realistic'' prototypes is narrowed by contrastive learning. Finally, we measure the image-prototype distance with a learnable metric; prototypes are polished by adjacent high-quality web images and used to remove distant out-of-distribution samples. In experiments, FoPro is trained on web datasets guided by a few real-world examples and evaluated on real-world datasets. Our method achieves state-of-the-art performance on three fine-grained datasets and two large-scale datasets. Compared with existing WSL methods under the same few-shot settings, FoPro still excels in real-world generalization. Code is available at https://github.com/yuleiqin/fopro
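The prototype initialization and the out-of-distribution filtering can be sketched in a few lines: class centers come from the few real-world examples, and web images far from their class center are dropped. A simplified numpy illustration; FoPro learns its distance metric, whereas plain Euclidean distance and all the names below are assumptions.

```python
import numpy as np

def realistic_prototypes(few_shot_feats, labels, n_classes):
    # Initialize each class center from the few labeled real-world examples.
    return np.stack([few_shot_feats[labels == c].mean(axis=0)
                     for c in range(n_classes)])

def filter_ood(web_feats, web_labels, protos, thresh):
    # Drop web samples that sit far from their class prototype.
    d = np.linalg.norm(web_feats - protos[web_labels], axis=1)
    keep = d < thresh
    return web_feats[keep], web_labels[keep]

rng = np.random.default_rng(0)
real = rng.normal(0.0, 0.1, (10, 4)); real[5:] += 2.0   # two toy classes
labels = np.array([0] * 5 + [1] * 5)
protos = realistic_prototypes(real, labels, 2)

web = np.vstack([rng.normal(0.0, 0.1, (20, 4)),         # in-distribution
                 rng.normal(5.0, 0.1, (3, 4))])         # 3 OOD outliers
web_lab = np.zeros(23, dtype=int)
clean, clean_lab = filter_ood(web, web_lab, protos, thresh=2.0)
```

Anchoring the centers on real data is what keeps the web-trained representation from drifting toward the web domain's biases.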
End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation
Most existing Human-Object Interaction (HOI) detection methods rely heavily on full annotations with predefined HOI categories, which are limited in diversity and costly to scale. We aim to advance zero-shot HOI detection, detecting both seen and unseen HOIs simultaneously. The fundamental challenges are discovering potential human-object pairs and identifying novel HOI categories. To overcome these challenges, we propose a novel End-to-end zero-shot HOI Detection (EoID) framework via vision-language knowledge distillation. We first design an Interactive Score module combined with a Two-stage Bipartite Matching algorithm to distinguish interactive human-object pairs in an action-agnostic manner. Then we transfer the action probability distribution from a pretrained vision-language teacher, together with the seen ground truth, to the HOI model to attain zero-shot HOI classification. Extensive experiments on the HICO-Det dataset demonstrate that our model discovers potential interactive pairs and recognizes unseen HOIs. Our method outperforms the previous SOTA under various zero-shot settings, and is generalizable to large-scale object detection data to further scale up the action set. The source code is available at: https://github.com/mrwu-mac/EoID
CF-ViT: A General Coarse-to-Fine Method for Vision Transformer
Vision Transformers (ViTs) have achieved many breakthroughs in computer vision tasks. However, considerable redundancy arises in the spatial dimension of an input image, leading to massive computational costs. We therefore propose a coarse-to-fine vision transformer (CF-ViT) that relieves the computational burden while retaining performance. CF-ViT is motivated by two important observations in modern ViT models: (1) coarse-grained patch splitting can locate the informative regions of an input image; (2) most images can be well recognized by a ViT model from a short token sequence. CF-ViT therefore performs inference in two stages. In the coarse stage, an input image is split into a short patch sequence for a computationally economical classification. If the image is not well recognized, its informative patches are identified and re-split at a finer granularity. Extensive experiments demonstrate the efficacy of CF-ViT. For example, without any compromise in performance, CF-ViT reduces the FLOPs of LV-ViT by 53% and achieves a 2.01x throughput improvement. Code for this project is at https://github.com/ChenMnZ/CF-
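The two-stage inference above can be sketched as confidence-gated early exit: classify from a coarse patch grid, and only fall back to a finer grid when the coarse prediction is unsure. The sketch below is a toy with stand-in patchifier and classifier (all names hypothetical); unlike CF-ViT, which re-splits only the informative patches, it re-splits the whole image for simplicity.

```python
import numpy as np

def coarse_to_fine_infer(image, classify, split, threshold=0.7):
    # Stage 1: classify from a short, coarse patch sequence.
    coarse_patches = split(image, grid=4)            # e.g., 4 coarse tokens
    probs = classify(coarse_patches)
    if probs.max() >= threshold:                     # confident -> exit early
        return probs.argmax(), "coarse"
    # Stage 2: re-split at a finer granularity and classify again.
    fine_patches = split(image, grid=8)              # 8 finer tokens
    return classify(fine_patches).argmax(), "fine"

# Toy stand-ins for the patchifier and the ViT head.
def split(image, grid):
    h = image.shape[0] // grid
    return image[:grid * h].reshape(grid, -1)

def classify(patches):
    logits = np.array([patches.mean(), patches.std(), 1.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()                               # softmax over 3 classes

img = np.linspace(0, 1, 64)
pred, stage = coarse_to_fine_infer(img, classify, split)
```

Easy images pay only the coarse-stage cost, which is why the average FLOPs drop without hurting accuracy on the hard images that trigger the fine stage.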