Ambient Diffusion: Learning Clean Distributions from Corrupted Data
We present the first diffusion-based framework that can learn an unknown
distribution using only highly-corrupted samples. This problem arises in
scientific applications where uncorrupted samples are impossible or expensive
to acquire. Another benefit of our approach is the ability to train
generative models that are less likely to memorize individual training samples
since they never observe clean training data. Our main idea is to introduce
additional measurement distortion during the diffusion process and require the
model to predict the original corrupted image from the further corrupted image.
We prove that our method leads to models that learn the conditional expectation
of the full uncorrupted image given this additional measurement corruption.
This holds for any corruption process that satisfies some technical conditions
(and in particular includes inpainting and compressed sensing). We train models
on standard benchmarks (CelebA, CIFAR-10 and AFHQ) and show that we can learn
the distribution even when every training sample has a large fraction of its
pixels missing. We also show that we can fine-tune foundation models on small corrupted
datasets (e.g. MRI scans with block corruptions) and learn the clean
distribution without memorizing the training set.
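A minimal PyTorch sketch of the training idea described above, assuming an inpainting-style corruption with binary masks. The denoiser interface, the extra-masking probability, and all names are illustrative placeholders, not the authors' released code.

```python
# Sketch of one Ambient Diffusion training step on corrupted data y = mask * x0.
# Assumptions: `denoiser` is a torch.nn.Module taking (image, mask, noise level);
# corruption is pixel-wise masking; hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def ambient_training_step(denoiser, y, mask, sigma, extra_drop_prob=0.1):
    # Further corruption: randomly hide an extra fraction of the observed pixels.
    extra = (torch.rand_like(mask) > extra_drop_prob).float()
    mask_tilde = mask * extra

    # Further-corrupted, noisy input: mask the noisy image with the stricter mask.
    noise = torch.randn_like(y)
    x_t = mask_tilde * (y + sigma * noise)

    # The model sees the further-corrupted noisy image and its mask,
    # and predicts the full clean image.
    x0_hat = denoiser(x_t, mask_tilde, sigma)

    # Supervise only on pixels that were observed in the original corruption;
    # the model never sees clean training data.
    loss = F.mse_loss(mask * x0_hat, mask * y)
    return loss
```

Under the conditions stated in the abstract, supervising only the originally observed pixels is enough for the model to learn the conditional expectation of the full uncorrupted image.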
Ambient Diffusion Posterior Sampling: Solving Inverse Problems with Diffusion Models trained on Corrupted Data
We provide a framework for solving inverse problems with diffusion models
learned from linearly corrupted data. Our method, Ambient Diffusion Posterior
Sampling (A-DPS), leverages a generative model pre-trained on one type of
corruption (e.g. image inpainting) to perform posterior sampling conditioned on
measurements from a potentially different forward process (e.g. image
blurring). We test the efficacy of our approach on standard natural image
datasets (CelebA, FFHQ, and AFHQ) and we show that A-DPS can sometimes
outperform models trained on clean data for several image restoration tasks in
both speed and performance. We further extend the Ambient Diffusion framework
to train MRI models with access only to Fourier subsampled multi-coil MRI
measurements at various acceleration factors (R=2, 4, 6, 8). We again observe
that models trained on highly subsampled data are better priors for solving
inverse problems in the high acceleration regime than models trained on fully
sampled data. We open-source our code and the trained Ambient Diffusion MRI
models: https://github.com/utcsilab/ambient-diffusion-mri
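A rough PyTorch sketch of a DPS-style guided sampling loop with an ambient-trained prior, under the assumption of a differentiable linear measurement operator. `denoiser`, `forward_op`, the noise schedule, and the guidance scale are placeholders rather than the released implementation.

```python
# Sketch of Ambient Diffusion Posterior Sampling (A-DPS)-style guidance.
# The prior was trained on one corruption (e.g. inpainting masks, A_train_mask);
# the measurements y_meas come from a possibly different operator forward_op.
import torch

def adps_sample(denoiser, forward_op, y_meas, sigmas, shape,
                A_train_mask, guidance_scale=1.0):
    x = torch.randn(shape) * sigmas[0]
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        x = x.detach().requires_grad_(True)

        # Ambient prior: predict the clean image from the noisy state,
        # conditioned on the corruption pattern it was trained with.
        x0_hat = denoiser(x, A_train_mask, sigma)

        # Data-fidelity gradient w.r.t. the current iterate (DPS-style).
        residual = y_meas - forward_op(x0_hat)
        grad = torch.autograd.grad(residual.pow(2).sum(), x)[0]

        with torch.no_grad():
            # Denoising step toward x0_hat, then a measurement-consistency step.
            x = x0_hat + (sigma_next / sigma) * (x - x0_hat)
            x = x - guidance_scale * grad
    return x.detach()
```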
DataComp: In search of the next generation of multimodal datasets
Multimodal datasets are a critical component in recent breakthroughs such as
Stable Diffusion and GPT-4, yet their design does not receive the same research
attention as model architectures or training algorithms. To address this
shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset
experiments centered around a new candidate pool of 12.8 billion image-text
pairs from Common Crawl. Participants in our benchmark design new filtering
techniques or curate new data sources and then evaluate their new dataset by
running our standardized CLIP training code and testing the resulting model on
38 downstream test sets. Our benchmark consists of multiple compute scales
spanning four orders of magnitude, which enables the study of scaling trends
and makes the benchmark accessible to researchers with varying resources. Our
baseline experiments show that the DataComp workflow leads to better training
sets. In particular, our best baseline, DataComp-1B, enables training a CLIP
ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming
OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training
procedure and compute. We release DataComp and all accompanying code at
www.datacomp.ai.
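As an illustration of the kind of filtering baseline participants submit, here is a minimal CLIP-score filtering sketch: keep the candidate image-text pairs whose image/text embedding similarity is highest. The embeddings are assumed to be precomputed, and the keep fraction is illustrative, not the DataComp-1B recipe.

```python
# Sketch of a CLIP-score filtering baseline over a candidate pool.
# Assumes image_embs and text_embs are precomputed (N, D) arrays for the pool.
import numpy as np

def clip_score_filter(image_embs, text_embs, keep_fraction=0.3):
    """Return indices of pairs to keep, ranked by image-text cosine similarity."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = np.sum(image_embs * text_embs, axis=1)   # cosine similarity per pair
    n_keep = int(len(scores) * keep_fraction)
    return np.argsort(-scores)[:n_keep]

# Usage (hypothetical): subset = pool_index[clip_score_filter(img_embs, txt_embs, 0.3)]
# The resulting subset is then passed to the standardized CLIP training code
# and evaluated on the 38 downstream test sets.
```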