One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls
It is well known that many open-released foundational diffusion models have
difficulty in generating images that substantially depart from average
brightness, despite such images being present in the training data. This is due
to an inconsistency: while denoising starts from pure Gaussian noise during
inference, the training noise schedule retains residual data even in the final
timestep distribution, owing to numerical conditioning difficulties in the
mainstream formulation, leading to unintended bias during inference. To
mitigate this issue, certain ε-prediction models are combined with an
ad-hoc offset-noise methodology. In parallel, some contemporary models have
adopted zero-terminal SNR noise schedules together with
v-prediction, which necessitate major alterations to pre-trained
models. However, such changes risk destabilizing the large body of
community-driven applications anchored on these pre-trained models. In light of
this, our investigation revisits the fundamental causes, leading to our
proposal of an innovative and principled remedy, called One More Step (OMS). By
integrating a compact network and incorporating an additional simple yet
effective step during inference, OMS elevates image fidelity and harmonizes the
dichotomy between training and inference, while preserving original model
parameters. Once trained, various pre-trained diffusion models with the same
latent domain can share the same OMS module.
Comment: Project Page: https://jabir-zheng.github.io/OneMoreStep/, Demo Page: https://huggingface.co/spaces/h1t/oms_sdxl_lc
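The abstract describes the mechanism only at a high level; the sketch below is a minimal, hypothetical illustration of the core idea of prepending one extra learned step so that inference can start from pure Gaussian noise while the frozen base model still receives inputs consistent with its training-time terminal distribution. The `CompactOMSModule` class, its architecture, and the use of flat latent vectors are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CompactOMSModule(nn.Module):
    """Hypothetical small network that maps pure Gaussian noise (plus the
    condition embedding) to a latent consistent with the base model's
    training-time terminal distribution. The real OMS architecture may differ."""
    def __init__(self, latent_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, noise: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([noise, cond], dim=-1))

@torch.no_grad()
def sample_with_oms(base_sampler, oms: CompactOMSModule,
                    cond: torch.Tensor, latent_dim: int) -> torch.Tensor:
    """One extra step before the unchanged base sampler (a sketch only)."""
    x_T = torch.randn(cond.shape[0], latent_dim)   # inference starts from pure noise
    x_T_adjusted = oms(x_T, cond)                   # the "one more step"
    return base_sampler(x_T_adjusted, cond)         # frozen pre-trained model, untouched
```

Because the base model's parameters are never modified, any pre-trained diffusion model sharing the same latent space could, in principle, reuse the same small module, which is the compatibility property the abstract emphasises.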
MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models
The advent of open-source AI communities has produced a cornucopia of
powerful text-guided diffusion models that are trained on various datasets.
However, few explorations have been conducted on ensembling such models to
combine their strengths. In this work, we propose a simple yet effective method called
Saliency-aware Noise Blending (SNB) that can empower the fused text-guided
diffusion models to achieve more controllable generation. Specifically, we
experimentally find that the responses of classifier-free guidance are highly
related to the saliency of generated images. Thus we propose to trust different
models in their areas of expertise by blending the predicted noises of two
diffusion models in a saliency-aware manner. SNB is training-free and can be
completed within a DDIM sampling process. Additionally, it can automatically
align the semantics of two noise spaces without requiring additional
annotations such as masks. Extensive experiments show the impressive
effectiveness of SNB in various applications. Project page is available at
https://magicfusion.github.io/
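As a rough, hypothetical illustration of saliency-aware noise blending inside a sampling loop: the snippet below derives a per-location mask from the magnitude of each model's classifier-free-guidance response and mixes the two predicted noises accordingly. The function names, the soft-mask rule, and the guidance scale are assumptions; the paper's actual SNB masking may differ.

```python
import torch

def cfg_noise(model, x_t, t, cond, uncond, scale: float = 7.5):
    """Standard classifier-free guidance for one diffusion model."""
    eps_c = model(x_t, t, cond)
    eps_u = model(x_t, t, uncond)
    return eps_u + scale * (eps_c - eps_u), (eps_c - eps_u)

def saliency_blend(model_a, model_b, x_t, t, cond, uncond):
    """Sketch: trust each model more where its guidance response is stronger.
    x_t is assumed to be a (B, C, H, W) latent; the blended noise would feed
    the usual DDIM update. This is an illustrative rule, not the paper's."""
    eps_a, resp_a = cfg_noise(model_a, x_t, t, cond, uncond)
    eps_b, resp_b = cfg_noise(model_b, x_t, t, cond, uncond)
    # Hypothetical saliency: per-location magnitude of the guidance response.
    sal_a = resp_a.abs().mean(dim=1, keepdim=True)
    sal_b = resp_b.abs().mean(dim=1, keepdim=True)
    mask = sal_a / (sal_a + sal_b + 1e-8)            # soft, training-free mask
    return mask * eps_a + (1.0 - mask) * eps_b       # blended noise prediction
```

Since the blending happens purely at the noise-prediction level during sampling, no additional training or mask annotation is required, matching the training-free claim in the abstract.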
FakeCLR: Exploring Contrastive Learning for Solving Latent Discontinuity in Data-Efficient GANs
Data-Efficient GANs (DE-GANs), which aim to learn generative models with a
limited amount of training data, encounter several challenges for generating
high-quality samples. Since data augmentation strategies have largely
alleviated the training instability, how to further improve the generative
performance of DE-GANs has become a central question. Recently, contrastive learning has
shown great potential for increasing the synthesis quality of DE-GANs, yet
related principles are not well explored. In this paper, we revisit and compare
different contrastive learning strategies in DE-GANs, and identify that (i) the
current bottleneck of generative performance is the discontinuity of the latent
space; and (ii) compared to other contrastive learning strategies,
Instance-perturbation works towards latent-space continuity, which brings the
major improvement to DE-GANs. Based on these observations, we propose FakeCLR,
which only applies contrastive learning on perturbed fake samples, and devises
three related training techniques: Noise-related Latent Augmentation,
Diversity-aware Queue, and Forgetting Factor of Queue. Our experimental results
establish a new state of the art on both few-shot generation and limited-data
generation. On multiple datasets, FakeCLR achieves more than 15% FID
improvement compared to existing DE-GANs. Code is available at
https://github.com/iceli1007/FakeCLR.
Comment: Accepted by ECCV202
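To make the idea of contrastive learning on perturbed fake samples concrete, here is a minimal sketch: two fakes generated from a latent code and its noise-perturbed copy form a positive pair, while queued fake embeddings act as negatives. The InfoNCE layout, the `sigma`/`tau` values, and the crude queue down-weighting are assumptions; the paper's Diversity-aware Queue and Forgetting Factor are more elaborate than shown here.

```python
import torch
import torch.nn.functional as F

def fakeclr_loss(generator, encoder, queue: torch.Tensor,
                 batch: int = 8, z_dim: int = 128,
                 sigma: float = 0.1, tau: float = 0.07,
                 forgetting: float = 0.99):
    """Sketch of contrastive learning applied only to perturbed fake samples.
    queue: (Q, D) tensor of previously generated, L2-normalised fake embeddings."""
    z = torch.randn(batch, z_dim)
    z_pert = z + sigma * torch.randn_like(z)         # noise-related latent augmentation
    feat_a = F.normalize(encoder(generator(z)), dim=-1)
    feat_b = F.normalize(encoder(generator(z_pert)), dim=-1)

    pos = (feat_a * feat_b).sum(-1, keepdim=True) / tau   # positive logits
    neg = feat_a @ queue.t() / tau                         # negatives from the queue
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(batch, dtype=torch.long)          # positives sit at index 0
    loss = F.cross_entropy(logits, labels)

    # Crude stand-in for a forgetting factor: geometrically down-weight old entries.
    new_queue = torch.cat([feat_b.detach(), forgetting * queue], dim=0)[: queue.shape[0]]
    return loss, new_queue
```

The key point the sketch tries to convey is that no real images enter the contrastive term at all; only generated samples and their latent perturbations are contrasted, which is what pushes the latent space toward continuity.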
Cross-Modal Contrastive Learning for Robust Reasoning in VQA
Multi-modal reasoning in visual question answering (VQA) has witnessed rapid
progress recently. However, most reasoning models heavily rely on shortcuts
learned from training data, which prevents their usage in challenging
real-world scenarios. In this paper, we propose a simple but effective
cross-modal contrastive learning strategy to get rid of the shortcut reasoning
caused by imbalanced annotations and improve the overall performance. Different
from existing contrastive learning with complex negative categories at the coarse
(Image, Question, Answer) triplet level, we leverage the correspondences
between the language and image modalities to perform finer-grained cross-modal
contrastive learning. We treat each Question-Answer (QA) pair as a whole, and
differentiate between images that conform with it and those against it. To
alleviate the issue of sampling bias, we further build connected graphs among
images. For each positive pair, we regard the images from different graphs as
negative samples and derive a multi-positive version of contrastive learning.
To the best of our knowledge, this is the first paper to show that a general
contrastive learning strategy, without delicate hand-crafted rules, can
contribute to robust VQA reasoning. Experiments on several mainstream VQA
datasets demonstrate our superiority over the state of the art. Code is available at
https://github.com/qizhust/cmcl_vqa_pl
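A short sketch of what a multi-positive contrastive objective of this kind can look like: each QA-pair embedding serves as the anchor, every image that conforms with it counts as a positive, and the remaining images in the batch (drawn from other connected graphs) serve as negatives. The tensor shapes, the masking convention, and the temperature are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive(qa_emb: torch.Tensor,
                               img_emb: torch.Tensor,
                               pos_mask: torch.Tensor,
                               tau: float = 0.07) -> torch.Tensor:
    """qa_emb: (B, D) QA-pair embeddings (anchors).
    img_emb: (N, D) image embeddings in the batch.
    pos_mask: (B, N) with 1 where the image conforms with the QA pair, else 0."""
    qa = F.normalize(qa_emb, dim=-1)
    im = F.normalize(img_emb, dim=-1)
    logits = qa @ im.t() / tau                                    # (B, N) similarities
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average the log-likelihood over all positives of each anchor.
    per_anchor = (pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -per_anchor.mean()
```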
Unified Discrete Diffusion for Simultaneous Vision-Language Generation
The recently developed discrete diffusion models perform extraordinarily well
in the text-to-image task, showing significant promise for handling
multi-modality signals. In this work, we harness these traits and present a
unified multimodal generation model that can conduct both the "modality
translation" and "multi-modality generation" tasks using a single model,
performing text-based, image-based, and even vision-language simultaneous
generation. Specifically, we unify the discrete diffusion process for
multimodal signals by proposing a unified transition matrix. Moreover, we
design a mutual attention module with a fused embedding layer and a unified
objective function to emphasise the inter-modal linkages, which are vital for
multi-modality generation. Extensive experiments indicate that our proposed
method can perform comparably to the state-of-the-art solutions in various
generation tasks.
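The abstract does not spell out how a single transition matrix can cover both modalities; the snippet below is a speculative sketch of one natural construction: the text and image token vocabularies are concatenated, a shared absorbing [MASK] state is appended, and each token either stays, moves uniformly within its own modality, or jumps to [MASK]. The parameterisation (`alpha`, `gamma`) and the block layout are assumptions and may differ from the paper's unified matrix.

```python
import torch

def unified_transition_matrix(text_vocab: int, image_vocab: int,
                              alpha: float, gamma: float) -> torch.Tensor:
    """One transition matrix over the concatenated vocabularies plus [MASK].
    A token keeps its value with prob. alpha, jumps to [MASK] with prob. gamma,
    and otherwise moves uniformly to another token of the *same* modality."""
    V = text_vocab + image_vocab + 1                 # +1 for the shared [MASK] state
    Q = torch.zeros(V, V)
    for start, size in [(0, text_vocab), (text_vocab, image_vocab)]:
        beta = (1.0 - alpha - gamma) / (size - 1)    # uniform over the other in-modality tokens
        block = torch.full((size, size), beta)
        block.fill_diagonal_(alpha)
        Q[start:start + size, start:start + size] = block
        Q[start:start + size, -1] = gamma            # jump to [MASK]
    Q[-1, -1] = 1.0                                  # [MASK] is absorbing
    return Q
```

Keeping corruption confined to each modality's own vocabulary while sharing one [MASK] state is what would allow a single diffusion process, and hence a single model, to denoise text tokens, image tokens, or both simultaneously.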
PartSeg: Few-shot Part Segmentation via Part-aware Prompt Learning
In this work, we address the task of few-shot part segmentation, which aims
to segment the different parts of an unseen object using very few labeled
examples. It is found that leveraging the textual space of a powerful
pre-trained image-language model (such as CLIP) can be beneficial in learning
visual features. Therefore, we develop a novel method termed PartSeg for
few-shot part segmentation based on multimodal learning. Specifically, we
design a part-aware prompt learning method to generate part-specific prompts
that enable the CLIP model to better understand the concept of ``part'' and
fully utilize its textual space. Furthermore, since the concept of the same
part is shared across different object categories, we establish relationships
between these parts during the prompt learning process. We conduct extensive
experiments on the PartImageNet and PascalPart datasets, and the
experimental results demonstrate that our proposed method achieves
state-of-the-art performance.
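A minimal sketch of what part-aware prompt learning can look like in practice: each part gets learnable context vectors prepended to its (frozen) name embedding, the prompts go through CLIP's text encoder, and the resulting part features are matched against dense patch features from the frozen image encoder to produce per-part segmentation logits. The class and function names, shapes, and the cosine-similarity decoding are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartAwarePrompts(nn.Module):
    """Learnable context tokens per part, prepended to frozen part-name embeddings."""
    def __init__(self, num_parts: int, ctx_len: int = 8, dim: int = 512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(num_parts, ctx_len, dim) * 0.02)

    def forward(self, part_name_emb: torch.Tensor, text_encoder) -> torch.Tensor:
        # part_name_emb: (P, L, D) frozen token embeddings of the part names.
        prompts = torch.cat([self.ctx, part_name_emb], dim=1)  # context + part name
        return text_encoder(prompts)                           # (P, D) part text features

def segment(dense_visual: torch.Tensor, part_text: torch.Tensor,
            tau: float = 0.07) -> torch.Tensor:
    """dense_visual: (H*W, D) patch features; returns (H*W, P) part logits."""
    v = F.normalize(dense_visual, dim=-1)
    t = F.normalize(part_text, dim=-1)
    return v @ t.t() / tau
```

Only the small `ctx` tensor is trained, which is consistent with the few-shot setting described in the abstract: the CLIP encoders stay frozen and supply the shared textual space.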
Toward Understanding the Influence of Individual Clients in Federated Learning
Federated learning allows mobile clients to jointly train a global model
without sending their private data to a central server. Extensive works have
studied the performance guarantee of the global model; however, it is still
unclear how each individual client influences the collaborative training
process. In this work, we define a new notion, called Fed-Influence, to
quantify this influence over the model parameters, and propose an effective
and efficient algorithm to estimate this metric. In particular, our design
satisfies several desirable properties: (1) it requires neither retraining nor
retracing, adding only linear computational overhead to clients and the server;
(2) it strictly maintains the tenets of federated learning, without revealing
any client's local private data; and (3) it works well on both convex and
non-convex loss functions, and does not require the final model to be optimal.
Empirical results on a synthetic dataset and the FEMNIST dataset demonstrate
that our estimation method can approximate Fed-Influence with small bias.
Further, we show an application of Fed-Influence in model debugging.
Comment: Accepted at AAAI 202
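The abstract states the properties of the estimator rather than its form, so the snippet below only makes the underlying notion concrete: a client's influence on the parameters can be read as the shift in the aggregated update attributable to that client within a FedAvg round. This toy, single-round computation is not the paper's Fed-Influence estimator, which tracks the effect across training without retraining or retracing; all names here are hypothetical.

```python
import torch

def round_influence(client_updates: list[torch.Tensor],
                    client_weights: list[float],
                    target: int) -> torch.Tensor:
    """Toy illustration: difference between the weighted FedAvg aggregate with
    and without one client's update in a single round."""
    total_w = sum(client_weights)
    agg_all = sum(w * u for w, u in zip(client_weights, client_updates)) / total_w
    rest_w = total_w - client_weights[target]
    agg_rest = sum(w * u for i, (w, u) in enumerate(zip(client_weights, client_updates))
                   if i != target) / rest_w
    return agg_all - agg_rest   # parameter shift attributable to the target client
```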