Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models

Author: Geiping Jonas
Goldblum Micah
Goldstein Tom
Singla Vasu
Somepalli Gowthami
Publication date: 12/12/2022

Cutting-edge diffusion models produce images with high quality and customizability, enabling them to be used for commercial art and graphic design purposes. But do diffusion models create unique works of art, or are they replicating content directly from their training sets? In this work, we study image retrieval frameworks that enable us to compare generated images with training samples and detect when content has been replicated. Applying our frameworks to diffusion models trained on multiple datasets including Oxford flowers, Celeb-A, ImageNet, and LAION, we discuss how factors such as training set size impact rates of content replication. We also identify cases where diffusion models, including the popular Stable Diffusion model, blatantly copy from their training data.

Comment: Updated draft with the following changes: (1) Clarified the LAION Aesthetics versions everywhere; (2) Correction on which LAION Aesthetics version SD-1.4 is finetuned on, and updated figure 12 based on this; (3) A section on possible causes of replicatio

arXiv.org e-Print Archive

Understanding and Mitigating Copying in Diffusion Models

Author: Geiping Jonas
Goldblum Micah
Goldstein Tom
Singla Vasu
Somepalli Gowthami
Publication date: 31/05/2023

Images generated by diffusion models like Stable Diffusion are increasingly widespread. Recent works and even lawsuits have shown that these models are prone to replicating their training data, unbeknownst to the user. In this paper, we first analyze this memorization problem in text-to-image diffusion models. While it is widely believed that duplicated images in the training set are responsible for content replication at inference time, we observe that the text conditioning of the model plays a similarly important role. In fact, we see in our experiments that data replication often does not happen for unconditional models, while it is common in the text-conditional case. Motivated by our findings, we then propose several techniques for reducing data replication at both training and inference time by randomizing and augmenting image captions in the training set.

Comment: 17 pages, preprint. Code is available at https://github.com/somepago/DC

arXiv.org e-Print Archive

How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization

Author: Geiping Jonas
Goldblum Micah
Goldstein Tom
Shwartz-Ziv Ravid
Somepalli Gowthami
Wilson Andrew Gordon
Publication date: 30/03/2023

Despite the clear performance benefits of data augmentations, little is known about why they are so effective. In this paper, we disentangle several key mechanisms through which data augmentations operate. Establishing an exchange rate between augmented and additional real data, we find that in out-of-distribution testing scenarios, augmentations which yield samples that are diverse but inconsistent with the data distribution can be even more valuable than additional training data. Moreover, we find that data augmentations which encourage invariances can be more valuable than invariance alone, especially on small and medium-sized training sets. Following this observation, we show that augmentations induce additional stochasticity during training, effectively flattening the loss landscape.

Comment: 31 pages, 29 figures. To be presented at ICLR 2023. Code at https://github.com/JonasGeiping/dataaug

arXiv.org e-Print Archive

A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning

Author: Bruss C. Bayan
Cherepanova Valeriia
Geiping Jonas
Goldblum Micah
Goldstein Tom
Levin Roman
Somepalli Gowthami
Wilson Andrew Gordon
Publication date: 10/11/2023

Academic tabular benchmarks often contain small sets of curated features. In contrast, data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones. To prevent overfitting in subsequent downstream modeling, practitioners commonly use automated feature selection methods that identify a reduced subset of informative features. Existing benchmarks for tabular feature selection consider classical downstream models, toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance. Motivated by the increasing popularity of tabular deep learning, we construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers, using real datasets and multiple methods for generating extraneous features. We also propose an input-gradient-based analogue of Lasso for neural networks that outperforms classical feature selection methods on challenging problems such as selecting from corrupted or second-order features

arXiv.org e-Print Archive

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Author: Chiang Ping-yeh
Geiping Jonas
Goldblum Micah
Goldstein Tom
Jain Neel
Kirchenbauer John
Saha Aniruddha
Schwarzschild Avi
Somepalli Gowthami
Wen Yuxin
Publication date: 04/09/2023

As Large Language Models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision? We evaluate several baseline defense strategies against leading adversarial attacks on LLMs, discussing the various settings in which each is feasible and effective. In particular, we look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We discuss white-box and gray-box settings and discuss the robustness-performance trade-off for each of the defenses considered. We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs. Future research will be needed to uncover whether more powerful optimizers can be developed, or whether the strength of filtering and preprocessing defenses is greater in the LLM domain than it has been in computer vision.

Comment: 12 page

arXiv.org e-Print Archive

Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks

Author: Bardes Adrien
Chattopadhyay Prithvijit
Chellappa Rama
Goldblum Micah
Goldstein Tom
Hoffman Judy
Ibrahim Mark
Ni Renkun
Prabhu Viraj
Shu Manli
Somepalli Gowthami
Souri Hossein
Wilson Andrew Gordon
Publication date: 19/11/2023

Neural network based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performance increases for a range of systems, it is difficult for practitioners to make informed decisions about which backbone to choose. Battle of the Backbones (BoB) makes this choice easier by benchmarking a diverse suite of pretrained models, including vision-language models, those trained via self-supervised learning, and the Stable Diffusion backbone, across a diverse set of computer vision tasks ranging from classification to object detection to OOD generalization and more. Furthermore, BoB sheds light on promising directions for the research community to advance computer vision by illuminating strengths and weaknesses of existing approaches through a comprehensive analysis conducted on more than 1500 training runs. While vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular, we find that convolutional neural networks pretrained in a supervised fashion on large training sets still perform best on most tasks among the models we consider. Moreover, in apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive, indicating that future works should perform SSL pretraining with advanced architectures and larger pretraining datasets. We release the raw results of our experiments along with code that allows researchers to put their own backbones through the gauntlet here: https://github.com/hsouri/Battle-of-the-Backbones

Comment: Accepted to NeurIPS 202

arXiv.org e-Print Archive

NEFTune: Noisy Embeddings Improve Instruction Finetuning

Author: Bartoldson Brian R.
Chiang Ping-yeh
Chu Hong-Min
Geiping Jonas
Goldblum Micah
Goldstein Tom
Jain Neel
Kailkhura Bhavya
Kirchenbauer John
Saha Aniruddha
Schwarzschild Avi
Somepalli Gowthami
Wen Yuxin
Publication date: 10/10/2023

We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training. Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune.

Comment: 25 pages, Code is available on GitHub: https://github.com/neelsjain/NEFTun

arXiv.org e-Print Archive