9 research outputs found
Understanding and Mitigating Copying in Diffusion Models
Images generated by diffusion models like Stable Diffusion are increasingly
widespread. Recent works and even lawsuits have shown that these models are
prone to replicating their training data, unbeknownst to the user. In this
paper, we first analyze this memorization problem in text-to-image diffusion
models. While it is widely believed that duplicated images in the training set
are responsible for content replication at inference time, we observe that the
text conditioning of the model plays a similarly important role. In fact, we
see in our experiments that data replication often does not happen for
unconditional models, while it is common in the text-conditional case.
Motivated by our findings, we then propose several techniques for reducing data
replication at both training and inference time by randomizing and augmenting
image captions in the training set.
Comment: 17 pages, preprint. Code is available at
https://github.com/somepago/DC
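The mitigation described above, randomizing and augmenting captions so the model cannot latch onto an exact text-image pairing, can be sketched as a data-loading step. This is an illustrative sketch, not the paper's exact recipe; the strategy mix and probabilities are assumptions.

```python
import random

def augment_caption(caption: str, rng: random.Random) -> str:
    """Randomly perturb a training caption to weaken the tight
    text-image association that drives memorization.
    The strategies and probabilities here are illustrative."""
    r = rng.random()
    if r < 0.2:
        return ""                 # drop the caption: train this step unconditionally
    if r < 0.4:
        words = caption.split()
        rng.shuffle(words)        # scramble word order
        return " ".join(words)
    return caption                # keep the original caption most of the time

rng = random.Random(0)
print(augment_caption("a photo of a cat on a sofa", rng))
```

Applied per training step, the same image is seen under varying (or absent) conditioning, which reduces the chance that one fixed caption becomes a retrieval key for a memorized image.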
Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models
Cutting-edge diffusion models produce images with high quality and
customizability, enabling them to be used for commercial art and graphic design
purposes. But do diffusion models create unique works of art, or are they
replicating content directly from their training sets? In this work, we study
image retrieval frameworks that enable us to compare generated images with
training samples and detect when content has been replicated. Applying our
frameworks to diffusion models trained on multiple datasets including Oxford
flowers, Celeb-A, ImageNet, and LAION, we discuss how factors such as training
set size impact rates of content replication. We also identify cases where
diffusion models, including the popular Stable Diffusion model, blatantly copy
from their training data.
Comment: Updated draft with the following changes: (1) Clarified the LAION
Aesthetics versions everywhere; (2) Correction on which LAION Aesthetics
version SD-1.4 is finetuned on, and updated Figure 12 based on this; (3) A
section on possible causes of replication.
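The retrieval-based detection described above can be sketched as a nearest-neighbor search in a feature space: embed generated and training images with some feature extractor (the extractor and the similarity threshold below are assumptions, not the paper's exact choices) and flag generations whose closest training neighbor is suspiciously similar.

```python
import numpy as np

def find_replications(gen_feats, train_feats, threshold=0.95):
    """Flag generated images whose nearest training neighbor exceeds a
    cosine-similarity threshold. Assumes rows are image feature vectors
    from some pretrained extractor; threshold is illustrative."""
    g = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = g @ t.T                    # pairwise cosine similarities
    nn_idx = sims.argmax(axis=1)      # closest training sample per generation
    nn_sim = sims.max(axis=1)
    return [(i, int(j), float(s))
            for i, (j, s) in enumerate(zip(nn_idx, nn_sim))
            if s >= threshold]
```

Each returned triple is (generated index, matched training index, similarity), so flagged pairs can be inspected visually to confirm replication.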
What Can We Learn from Unlearnable Datasets?
In an era of widespread web scraping, unlearnable dataset methods have the
potential to protect data privacy by preventing deep neural networks from
generalizing. But in addition to a number of practical limitations that make
their use unlikely, we make a number of findings that call into question their
ability to safeguard data. First, it is widely believed that neural networks
trained on unlearnable datasets only learn shortcuts, simpler rules that are
not useful for generalization. In contrast, we find that networks actually can
learn useful features that can be reweighted for high test performance,
suggesting that image privacy is not preserved. Unlearnable datasets are also
believed to induce learning shortcuts through linear separability of added
perturbations. We provide a counterexample, demonstrating that linear
separability of perturbations is not a necessary condition. To emphasize why
linearly separable perturbations should not be relied upon, we propose an
orthogonal projection attack which allows learning from unlearnable datasets
published in ICML 2021 and ICLR 2023. Our proposed attack is significantly less
complex than recently proposed techniques.
Comment: 17 pages, 9 figures.
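The orthogonal projection attack named above can be sketched in a few lines: fit a linear map on the poisoned data to estimate the linearly separable perturbation directions, then project the images onto the orthogonal complement of those directions before normal training. This is a schematic reconstruction under stated assumptions (a closed-form least-squares fit stands in for whatever classifier the paper uses).

```python
import numpy as np

def orthogonal_projection(X_poisoned, y, num_classes):
    """Remove linearly separable perturbation directions from poisoned data.
    X_poisoned: (n, d) flattened images; y: (n,) integer labels.
    The least-squares linear fit is an illustrative stand-in."""
    n, d = X_poisoned.shape
    Y = np.eye(num_classes)[y]                          # one-hot labels
    W, *_ = np.linalg.lstsq(X_poisoned, Y, rcond=None)  # (d, C) weight matrix
    Q, _ = np.linalg.qr(W)                              # orthonormal basis of weights
    P = np.eye(d) - Q @ Q.T                             # projector onto complement
    return X_poisoned @ P                               # perturbation directions removed
```

Because the projection annihilates exactly the subspace the linear probe found discriminative, any shortcut signal living in those directions is gone, while features outside that low-dimensional subspace survive for ordinary training.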
Autoregressive Perturbations for Data Poisoning
The prevalence of data scraping from social media as a means to obtain
datasets has led to growing concerns regarding unauthorized use of data. Data
poisoning attacks have been proposed as a bulwark against scraping, as they
make data "unlearnable" by adding small, imperceptible perturbations.
Unfortunately, existing methods require knowledge of both the target
architecture and the complete dataset so that a surrogate network can be
trained, the parameters of which are used to generate the attack. In this work,
we introduce autoregressive (AR) poisoning, a method that can generate poisoned
data without access to the broader dataset. The proposed AR perturbations are
generic, can be applied across different datasets, and can poison different
architectures. Compared to existing unlearnable methods, our AR poisons are
more resistant against common defenses such as adversarial training and strong
data augmentations. Our analysis further provides insight into what makes an
effective data poison.
Comment: 22 pages, 13 figures. Code available at
https://github.com/psandovalsegura/autoregressive-poisonin
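The core idea of an autoregressive perturbation, noise where each pixel is a fixed linear combination of already-generated neighbors plus fresh randomness, scaled to a small budget, can be sketched as follows. The AR coefficients and the L-inf budget below are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

def ar_perturbation(shape, coeffs, eps, seed=0):
    """Generate an autoregressive noise image: each pixel depends on its
    left and top neighbors plus Gaussian noise, then the whole field is
    scaled to an L-inf budget eps. Coefficients are illustrative."""
    rng = np.random.default_rng(seed)
    h, w = shape
    noise = rng.normal(size=(h, w))
    out = np.zeros((h, w))
    a_left, a_up = coeffs
    for i in range(h):
        for j in range(w):
            left = out[i, j - 1] if j > 0 else 0.0
            up = out[i - 1, j] if i > 0 else 0.0
            out[i, j] = a_left * left + a_up * up + noise[i, j]
    return eps * out / np.abs(out).max()   # rescale to the perturbation budget

delta = ar_perturbation((32, 32), coeffs=(0.5, 0.4), eps=8 / 255)
```

Note that the generator needs no surrogate network and no access to the rest of the dataset: the perturbation is defined entirely by the AR coefficients, which is what makes it generic across datasets and architectures.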
Real world data on clinical profile, management and outcomes of venous thromboembolism from a tertiary care centre in India
Objectives: Venous thromboembolism (VTE) is a major cause of mortality and morbidity worldwide. This study describes a real-world scenario of VTE presenting to a tertiary care hospital in India.
Methods: All patients presenting with acute VTE or associated complications from January 2017 to January 2020 were included in the study.
Results: A total of 330 patient admissions related to VTE were included over 3 years, of which 303 had an acute episode of VTE. The median age was 50 years (IQR 38–64); 30% of patients were younger than 40 years of age. Only 24% of patients had provoked VTE, with recent surgery (56%) and malignancy (16%) being the commonest risk factors. VTE manifested as isolated DVT (56%), isolated pulmonary embolism (PE; 19.1%), combined DVT/PE (22.4%), and upper limb DVT (2.3%). Patients with PE (n = 126) were classified as low-risk (15%), intermediate-risk (55%), and high-risk (29%). Reperfusion therapy was performed for 15.7% of patients with intermediate-risk and 75.6% with high-risk PE. In-hospital mortality for the entire cohort was 8.9%; 35% for high-risk PE and 11% for intermediate-risk PE. On multivariate analysis, the presence of active malignancy (OR = 5.8; 95% CI: 1.1–30.8, p = 0.038) and high-risk PE (OR = 4.8; 95% CI: 1.6–14.9, p = 0.006) were found to be independent predictors of mortality.
Conclusion: Our data provide real-world perspectives on the demographics and management of patients presenting with acute VTE in a referral hospital setting. We observed relatively high mortality for intermediate-risk PE, necessitating better subclassification of this group to identify candidates for more aggressive approaches.
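The odds ratios with 95% confidence intervals reported in the multivariate analysis above derive from logistic-regression coefficients in the standard way: OR = exp(beta), with a Wald interval exp(beta ± 1.96·SE). A minimal sketch (the coefficient and standard error below are illustrative inputs, not the study's fitted values):

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald 95% CI from a logistic-regression
    coefficient and its standard error (illustrative values only)."""
    or_point = math.exp(beta)
    lo = math.exp(beta - z * se)
    hi = math.exp(beta + z * se)
    return or_point, lo, hi

# A coefficient near 1.76 with a large standard error yields an OR near 5.8
# with a wide interval, the same shape as the estimates reported above.
```

The asymmetry of such intervals (e.g. 1.1 to 30.8 around 5.8) follows from exponentiating a symmetric interval on the log-odds scale, which is why wide CIs are typical when event counts are small.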