90 research outputs found
Improved Techniques for Maximum Likelihood Estimation for Diffusion ODEs
Diffusion models have exhibited excellent performance in various domains. The
probability flow ordinary differential equation (ODE) of diffusion models
(i.e., diffusion ODEs) is a particular case of continuous normalizing flows
(CNFs), which enables deterministic inference and exact likelihood evaluation.
However, the likelihood estimation results by diffusion ODEs are still far from
those of the state-of-the-art likelihood-based generative models. In this work,
we propose several improved techniques for maximum likelihood estimation for
diffusion ODEs, including both training and evaluation perspectives. For
training, we propose velocity parameterization and explore variance reduction
techniques for faster convergence. We also derive an error-bounded high-order
flow matching objective for finetuning, which improves the ODE likelihood and
smooths its trajectory. For evaluation, we propose a novel training-free
truncated-normal dequantization to fill the training-evaluation gap commonly
existing in diffusion ODEs. Building upon these techniques, we achieve
state-of-the-art likelihood estimation results on image datasets (2.56 on
CIFAR-10, 3.43/3.69 on ImageNet-32) without variational dequantization or data
augmentation.Comment: Accepted in ICML202
DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model Statistics
Diffusion probabilistic models (DPMs) have exhibited excellent performance
for high-fidelity image generation while suffering from inefficient sampling.
Recent works accelerate the sampling procedure by proposing fast ODE solvers
that leverage the specific ODE form of DPMs. However, they highly rely on
specific parameterization during inference (such as noise/data prediction),
which might not be the optimal choice. In this work, we propose a novel
formulation towards the optimal parameterization during sampling that minimizes
the first-order discretization error of the ODE solution. Based on such
formulation, we propose DPM-Solver-v3, a new fast ODE solver for DPMs by
introducing several coefficients efficiently computed on the pretrained model,
which we call empirical model statistics. We further incorporate multistep
methods and a predictor-corrector framework, and propose some techniques for
improving sample quality at small numbers of function evaluations (NFE) or
large guidance scales. Experiments show that DPM-Solver-v3 achieves
consistently better or comparable performance in both unconditional and
conditional sampling with both pixel-space and latent-space DPMs, especially in
510 NFEs. We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE) on
unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable
Diffusion, bringing a speed-up of 15%30% compared to previous
state-of-the-art training-free methods. Code is available at
https://github.com/thu-ml/DPM-Solver-v3.Comment: Accepted at NeurIPS 202
An Adaptive Incremental Gradient Method With Support for Non-Euclidean Norms
Stochastic variance reduced methods have shown strong performance in solving
finite-sum problems. However, these methods usually require the users to
manually tune the step-size, which is time-consuming or even infeasible for
some large-scale optimization tasks. To overcome the problem, we propose and
analyze several novel adaptive variants of the popular SAGA algorithm.
Eventually, we design a variant of Barzilai-Borwein step-size which is tailored
for the incremental gradient method to ensure memory efficiency and fast
convergence. We establish its convergence guarantees under general settings
that allow non-Euclidean norms in the definition of smoothness and the
composite objectives, which cover a broad range of applications in machine
learning. We improve the analysis of SAGA to support non-Euclidean norms, which
fills the void of existing work. Numerical experiments on standard datasets
demonstrate a competitive performance of the proposed algorithm compared with
existing variance-reduced methods and their adaptive variants
Efficient Private SCO for Heavy-Tailed Data via Clipping
We consider stochastic convex optimization for heavy-tailed data with the
guarantee of being differentially private (DP). Prior work on this problem is
restricted to the gradient descent (GD) method, which is inefficient for
large-scale problems. In this paper, we resolve this issue and derive the first
high-probability bounds for the private stochastic method with clipping. For
general convex problems, we derive excess population risks
\Tilde{O}\left(\frac{d^{1/7}\sqrt{\ln\frac{(n \epsilon)^2}{\beta
d}}}{(n\epsilon)^{2/7}}\right) and
\Tilde{O}\left(\frac{d^{1/7}\ln\frac{(n\epsilon)^2}{\beta
d}}{(n\epsilon)^{2/7}}\right) under bounded or unbounded domain assumption,
respectively (here is the sample size, is the dimension of the data,
is the confidence level and is the private level). Then, we
extend our analysis to the strongly convex case and non-smooth case (which
works for generalized smooth objectives with Hlder-continuous
gradients). We establish new excess risk bounds without bounded domain
assumption. The results above achieve lower excess risks and gradient
complexities than existing methods in their corresponding cases. Numerical
experiments are conducted to justify the theoretical improvement
Convolutional Embedding for Edit Distance
Edit-distance-based string similarity search has many applications such as
spell correction, data de-duplication, and sequence alignment. However,
computing edit distance is known to have high complexity, which makes string
similarity search challenging for large datasets. In this paper, we propose a
deep learning pipeline (called CNN-ED) that embeds edit distance into Euclidean
distance for fast approximate similarity search. A convolutional neural network
(CNN) is used to generate fixed-length vector embeddings for a dataset of
strings and the loss function is a combination of the triplet loss and the
approximation error. To justify our choice of using CNN instead of other
structures (e.g., RNN) as the model, theoretical analysis is conducted to show
that some basic operations in our CNN model preserve edit distance.
Experimental results show that CNN-ED outperforms data-independent CGK
embedding and RNN-based GRU embedding in terms of both accuracy and efficiency
by a large margin. We also show that string similarity search can be
significantly accelerated using CNN-based embeddings, sometimes by orders of
magnitude.Comment: Accepted by the 43rd International ACM SIGIR Conference on Research
and Development in Information Retrieval, 202
Positional Information Matters for Invariant In-Context Learning: A Case Study of Simple Function Classes
In-context learning (ICL) refers to the ability of a model to condition on a
few in-context demonstrations (input-output examples of the underlying task) to
generate the answer for a new query input, without updating parameters. Despite
the impressive ICL ability of LLMs, it has also been found that ICL in LLMs is
sensitive to input demonstrations and limited to short context lengths. To
understand the limitations and principles for successful ICL, we conduct an
investigation with ICL linear regression of transformers. We characterize
several Out-of-Distribution (OOD) cases for ICL inspired by realistic LLM ICL
failures and compare transformers with DeepSet, a simple yet powerful
architecture for ICL. Surprisingly, DeepSet outperforms transformers across a
variety of distribution shifts, implying that preserving permutation invariance
symmetry to input demonstrations is crucial for OOD ICL. The phenomenon
specifies a fundamental requirement by ICL, which we termed as ICL invariance.
Nevertheless, the positional encodings in LLMs will break ICL invariance. To
this end, we further evaluate transformers with identical positional encodings
and find preserving ICL invariance in transformers achieves state-of-the-art
performance across various ICL distribution shiftsComment: Ongoing work; preliminary versio
Does Invariant Graph Learning via Environment Augmentation Learn Invariance?
Invariant graph representation learning aims to learn the invariance among
data from different environments for out-of-distribution generalization on
graphs. As the graph environment partitions are usually expensive to obtain,
augmenting the environment information has become the de facto approach.
However, the usefulness of the augmented environment information has never been
verified. In this work, we find that it is fundamentally impossible to learn
invariant graph representations via environment augmentation without additional
assumptions. Therefore, we develop a set of minimal assumptions, including
variation sufficiency and variation consistency, for feasible invariant graph
learning. We then propose a new framework Graph invAriant Learning Assistant
(GALA). GALA incorporates an assistant model that needs to be sensitive to
graph environment changes or distribution shifts. The correctness of the proxy
predictions by the assistant model hence can differentiate the variations in
spurious subgraphs. We show that extracting the maximally invariant subgraph to
the proxy predictions provably identifies the underlying invariant subgraph for
successful OOD generalization under the established minimal assumptions.
Extensive experiments on datasets including DrugOOD with various graph
distribution shifts confirm the effectiveness of GALA.Comment: NeurIPS 2023, 34 pages, 35 figure
- …