Efficient Storage of Fine-Tuned Models via Low-Rank Approximation of Weight Residuals
In this paper, we present an efficient method for storing fine-tuned models
by leveraging the low-rank properties of weight residuals. Our key observation
is that weight residuals in large overparameterized models exhibit even
stronger low-rank characteristics than the weights themselves. Based on this insight, we propose Efficient
Residual Encoding (ERE), a novel approach that achieves efficient storage of
fine-tuned model weights by approximating the low-rank weight residuals.
Furthermore, we analyze the robustness of weight residuals and push the limit
of storage efficiency by utilizing additional quantization and layer-wise rank
allocation. Our experimental results demonstrate that our method significantly
reduces memory footprint while preserving performance in various tasks and
modalities. We release our code.
Comment: 16 pages, 8 figures
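The core idea of storing only a low-rank approximation of the weight residual can be sketched as follows. This is a minimal illustration using a truncated SVD of the residual; the function names are hypothetical, and the paper's ERE method additionally applies quantization and layer-wise rank allocation, which are omitted here.

```python
import numpy as np

def compress_residual(w_base, w_finetuned, rank):
    """Store rank-r factors of the weight residual instead of the full
    fine-tuned matrix (a sketch of the low-rank storage idea)."""
    residual = w_finetuned - w_base
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    # Keep only the top-r singular triplets: factors a (m x r) and b (r x n).
    return u[:, :rank] * s[:rank], vt[:rank]

def reconstruct(w_base, a, b):
    """Recover an approximation of the fine-tuned weights from the base
    weights plus the low-rank residual factors."""
    return w_base + a @ b

# Toy check: a residual that is exactly rank 2 is stored (near-)losslessly.
rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 64))
low_rank_delta = rng.standard_normal((64, 2)) @ rng.standard_normal((2, 64))
w_ft = w_base + low_rank_delta
a, b = compress_residual(w_base, w_ft, rank=2)
err = np.linalg.norm(reconstruct(w_base, a, b) - w_ft)
print(err < 1e-8)
```

With rank r, the per-layer storage cost drops from m·n values to r·(m + n), which is where the memory savings come from when residuals are approximately low-rank.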
TopP&R: Robust Support Estimation Approach for Evaluating Fidelity and Diversity in Generative Models
We propose a robust and reliable evaluation metric for generative models by
introducing topological and statistical treatments for rigorous support
estimation. Existing metrics, such as Inception Score (IS), Frechet Inception
Distance (FID), and the variants of Precision and Recall (P&R), heavily rely on
supports that are estimated from sample features. However, the reliability of
this estimation has been largely overlooked and rarely discussed, even though
the quality of the evaluation depends entirely on it. In this paper, we propose
Topological Precision and Recall (TopP&R, pronounced 'topper'), which provides
a systematic approach to estimating supports, retaining only topologically and
statistically important features with a certain level of confidence. This not
only makes TopP&R strong for noisy features, but also provides statistical
consistency. Our theoretical and experimental results show that TopP&R is
robust to outliers and non-independent and identically distributed (Non-IID)
perturbations, while accurately capturing the true trend of change in samples.
To the best of our knowledge, this is the first evaluation metric focused on
the robust estimation of the support and provides its statistical consistency
under noise.
Comment: Accepted to NeurIPS 2023
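The idea of computing precision and recall only over a confidently estimated support can be sketched as below. This is a simplified illustration, not the paper's estimator: it thresholds a Gaussian KDE at a quantile of the data's own densities to decide support membership, whereas TopP&R uses a topological and statistical treatment to set the confidence level.

```python
import numpy as np

def kde_scores(points, queries, bandwidth):
    # Gaussian kernel density of each query under the point set.
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2)).mean(axis=1)

def support_membership(points, queries, bandwidth, confidence=0.1):
    """A query lies on the estimated support if its density exceeds a
    threshold taken from the low-density tail of the points themselves."""
    self_density = kde_scores(points, points, bandwidth)
    threshold = np.quantile(self_density, confidence)
    return kde_scores(points, queries, bandwidth) >= threshold

def support_precision_recall(real, fake, bandwidth=0.5):
    precision = support_membership(real, fake, bandwidth).mean()
    recall = support_membership(fake, real, bandwidth).mean()
    return precision, recall

# Toy check: off-support outliers in the fake set lower precision.
rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=(500, 2))
fake = rng.normal(0.0, 1.0, size=(500, 2))                 # same distribution
fake_noisy = np.vstack([fake, rng.normal(10.0, 1.0, (50, 2))])  # add outliers
p_clean, _ = support_precision_recall(real, fake)
p_noisy, _ = support_precision_recall(real, fake_noisy)
print(p_clean > p_noisy)
```

The density threshold is what makes the estimated support robust: isolated outliers contribute negligible density and simply fall outside it, instead of inflating the support as a raw nearest-neighbor estimate would.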
RADIO: Reference-Agnostic Dubbing Video Synthesis
One of the most challenging problems in audio-driven talking head generation
is achieving high-fidelity detail while ensuring precise synchronization. Given
only a single reference image, extracting meaningful identity attributes
becomes even more challenging, often causing the network to mirror the facial
and lip structures too closely. To address these issues, we introduce RADIO, a
framework engineered to yield high-quality dubbed videos regardless of the pose
or expression in reference images. The key is to modulate the decoder layers
using latent space composed of audio and reference features. Additionally, we
incorporate ViT blocks into the decoder to emphasize high-fidelity details,
especially in the lip region. Our experimental results demonstrate that RADIO
displays high synchronization without the loss of fidelity. Especially in harsh
scenarios where the reference frame deviates significantly from the ground
truth, our method outperforms state-of-the-art methods, highlighting its
robustness. The pre-trained model and code will be made public after the review.
Comment: Under review
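The described conditioning, where a latent composed of audio and reference features modulates decoder layers, can be sketched as a FiLM-style per-channel scale and shift. This is a hypothetical illustration with placeholder weights (`w_scale`, `w_shift`); RADIO's actual modulation mechanism and ViT-based decoder blocks are not reproduced here.

```python
import numpy as np

def modulate(decoder_feat, audio_feat, ref_feat, w_scale, w_shift):
    """FiLM-style sketch: a latent fused from audio and reference
    features predicts per-channel scale and shift for a decoder layer."""
    latent = np.concatenate([audio_feat, ref_feat])   # fused condition
    scale = latent @ w_scale                          # (channels,)
    shift = latent @ w_shift                          # (channels,)
    # Re-weight channels while preserving the spatial layout.
    return decoder_feat * (1 + scale)[:, None, None] + shift[:, None, None]

rng = np.random.default_rng(2)
channels, h, w = 8, 4, 4
audio = rng.standard_normal(16)   # stand-in for audio encoder output
ref = rng.standard_normal(16)     # stand-in for reference identity features
feat = rng.standard_normal((channels, h, w))
w_scale = rng.standard_normal((32, channels)) * 0.1
w_shift = rng.standard_normal((32, channels)) * 0.1
out = modulate(feat, audio, ref, w_scale, w_shift)
print(out.shape)
```

Conditioning through modulation, rather than feeding the reference image directly into the decoder, is one way to discourage the network from copying the reference's pose and lip structure verbatim.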
LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data
Existing techniques for image-to-image translation have commonly suffered
from two critical problems: heavy reliance on per-sample domain annotations
and/or an inability to handle multiple attributes per image. Recent
truly-unsupervised methods adopt clustering approaches to easily provide
per-sample one-hot domain labels. However, they cannot account for the
real-world setting in which one sample may have multiple attributes. In
addition, the semantics of the clusters are not easily aligned with human
understanding. To overcome these limitations, we present a LANguage-driven Image-to-image Translation model,
dubbed LANIT. We leverage easy-to-obtain candidate attributes given in texts
for a dataset: the similarity between images and attributes indicates
per-sample domain labels. This formulation naturally enables multi-hot labels,
so users can specify the target domain with a set of attributes in language.
To account for the case where the initial prompts are inaccurate, we also
present prompt learning. We further present a domain regularization loss that
encourages translated images to be mapped to the corresponding domain. Experiments
on several standard benchmarks demonstrate that LANIT achieves comparable or
superior performance to existing models.
Comment: Accepted to CVPR 2023. Project Page:
https://ku-cvlab.github.io/LANIT
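The step of turning image-attribute similarity into per-sample multi-hot domain labels can be sketched as below. This is a toy illustration: in LANIT the embeddings come from a pretrained vision-language model, while here they are hand-crafted placeholders, and the threshold is an assumed hyperparameter.

```python
import numpy as np

def multi_hot_domains(image_emb, text_embs, threshold=0.5):
    """Assign a multi-hot domain label from cosine similarity between an
    image embedding and candidate attribute text embeddings (sketch)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                       # cosine similarity per attribute
    return (sims >= threshold).astype(int), sims

# Toy embeddings: the image is close to attributes 0 and 2.
attrs = ["smiling", "young", "blond hair"]   # hypothetical candidate texts
text_embs = np.eye(3)
image_emb = np.array([0.7, 0.05, 0.7])
labels, sims = multi_hot_domains(image_emb, text_embs)
print(labels)  # [1 0 1]: multiple attributes active at once
```

Because the label is a thresholded similarity vector rather than a cluster assignment, a single image can activate several attributes simultaneously, which is exactly the multi-attribute setting that one-hot clustering approaches cannot express.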