
    Understanding and Mitigating Copying in Diffusion Models

    Images generated by diffusion models like Stable Diffusion are increasingly widespread. Recent works and even lawsuits have shown that these models are prone to replicating their training data, unbeknownst to the user. In this paper, we first analyze this memorization problem in text-to-image diffusion models. While it is widely believed that duplicated images in the training set are responsible for content replication at inference time, we observe that the text conditioning of the model plays a similarly important role. In fact, we see in our experiments that data replication often does not happen for unconditional models, while it is common in the text-conditional case. Motivated by our findings, we then propose several techniques for reducing data replication at both training and inference time by randomizing and augmenting image captions in the training set. Comment: 17 pages, preprint. Code is available at https://github.com/somepago/DC
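    The training-time mitigation described above can be sketched as a caption-augmentation step applied before each example is fed to the model. This is a minimal illustration, not the authors' code: the function name, parameters, and the word-level replacement scheme are assumptions standing in for the paper's family of caption-randomization strategies.

    ```python
    import random

    def randomize_caption(caption, vocab, p_replace=0.1, p_drop_caption=0.05, rng=None):
        """Randomize/augment a training caption to weaken the text-conditioning
        signal that drives memorization.

        With probability p_drop_caption the whole caption is replaced by random
        vocabulary tokens; otherwise each word is independently swapped for a
        random token with probability p_replace. All parameter values here are
        illustrative defaults, not the paper's tuned settings.
        """
        rng = rng or random.Random()
        words = caption.split()
        if rng.random() < p_drop_caption:
            return " ".join(rng.choice(vocab) for _ in words)
        return " ".join(
            rng.choice(vocab) if rng.random() < p_replace else w
            for w in words
        )
    ```

    In a training loop this would be applied per example per epoch, so a duplicated image rarely appears twice with an identical caption.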

    Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models

    Cutting-edge diffusion models produce images with high quality and customizability, enabling them to be used for commercial art and graphic design purposes. But do diffusion models create unique works of art, or are they replicating content directly from their training sets? In this work, we study image retrieval frameworks that enable us to compare generated images with training samples and detect when content has been replicated. Applying our frameworks to diffusion models trained on multiple datasets including Oxford flowers, Celeb-A, ImageNet, and LAION, we discuss how factors such as training set size impact rates of content replication. We also identify cases where diffusion models, including the popular Stable Diffusion model, blatantly copy from their training data. Comment: Updated draft with the following changes: (1) Clarified the LAION Aesthetics versions everywhere (2) Correction on which LAION Aesthetics version SD-1.4 is finetuned on and updated figure 12 based on this (3) A section on possible causes of replication.
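    The retrieval framework the abstract describes reduces to a nearest-neighbor search in a feature space. The sketch below assumes features have already been extracted by some image encoder upstream; the function name and the similarity threshold are illustrative, not the paper's exact pipeline.

    ```python
    import numpy as np

    def find_replications(gen_feats, train_feats, threshold=0.95):
        """Flag generated images whose nearest training image exceeds a cosine
        similarity threshold, i.e. likely replicated content.

        gen_feats:   (n_gen, d) array of generated-image features
        train_feats: (n_train, d) array of training-image features
        Returns a list of (gen_index, train_index, similarity) tuples.
        """
        # L2-normalize so the dot product equals cosine similarity
        g = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
        t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
        sims = g @ t.T                  # (n_gen, n_train) similarity matrix
        nearest = sims.argmax(axis=1)   # index of the closest training image
        best = sims.max(axis=1)         # its similarity score
        return [(i, int(nearest[i]), float(best[i]))
                for i in range(len(g)) if best[i] >= threshold]
    ```

    In practice the choice of feature extractor and threshold determines what counts as "replication", which is exactly the design space the paper explores.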

    What Can We Learn from Unlearnable Datasets?

    In an era of widespread web scraping, unlearnable dataset methods have the potential to protect data privacy by preventing deep neural networks from generalizing. But in addition to a number of practical limitations that make their use unlikely, we make a number of findings that call into question their ability to safeguard data. First, it is widely believed that neural networks trained on unlearnable datasets only learn shortcuts, simpler rules that are not useful for generalization. In contrast, we find that networks actually can learn useful features that can be reweighted for high test performance, suggesting that image privacy is not preserved. Unlearnable datasets are also believed to induce learning shortcuts through linear separability of added perturbations. We provide a counterexample, demonstrating that linear separability of perturbations is not a necessary condition. To emphasize why linearly separable perturbations should not be relied upon, we propose an orthogonal projection attack which allows learning from unlearnable datasets published in ICML 2021 and ICLR 2023. Our proposed attack is significantly less complex than recently proposed techniques. Comment: 17 pages, 9 figures
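    The core linear-algebra step of an orthogonal projection attack can be sketched as follows. This is an assumption-laden illustration: it takes as given a weight matrix W from a linear probe fit on the unlearnable data (that fitting step is omitted), and projects each flattened image onto the orthogonal complement of span(W), removing the linearly separable shortcut component.

    ```python
    import numpy as np

    def orthogonal_projection(X, W):
        """Remove the linearly separable component from data X.

        X: (n, d) flattened images
        W: (k, d) weight rows of a linear classifier fit on the poisoned data
           (hypothetical input here; the real attack would learn it first)
        Returns X with its component inside span(W) subtracted.
        """
        # Orthonormal basis Q for the row space of W
        Q, _ = np.linalg.qr(W.T)        # (d, k)
        return X - (X @ Q) @ Q.T        # subtract the projection onto span(W)
    ```

    Training on the projected data then sidesteps perturbations whose effect lives in that low-dimensional subspace, which is why the counterexample in the paper (perturbations that are not linearly separable) matters.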

    Autoregressive Perturbations for Data Poisoning

    The prevalence of data scraping from social media as a means to obtain datasets has led to growing concerns regarding unauthorized use of data. Data poisoning attacks have been proposed as a bulwark against scraping, as they make data "unlearnable" by adding small, imperceptible perturbations. Unfortunately, existing methods require knowledge of both the target architecture and the complete dataset so that a surrogate network can be trained, the parameters of which are used to generate the attack. In this work, we introduce autoregressive (AR) poisoning, a method that can generate poisoned data without access to the broader dataset. The proposed AR perturbations are generic, can be applied across different datasets, and can poison different architectures. Compared to existing unlearnable methods, our AR poisons are more resistant against common defenses such as adversarial training and strong data augmentations. Our analysis further provides insight into what makes an effective data poison. Comment: 22 pages, 13 figures. Code available at https://github.com/psandovalsegura/autoregressive-poisonin
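    The key property of an AR poison is that the perturbation is drawn from an autoregressive noise process rather than optimized against a surrogate network, so no dataset or architecture access is needed. The sketch below generates a 1-D AR signal and reshapes it into an image-sized perturbation clipped to an epsilon ball; the paper's actual method uses 2-D AR filters with per-class coefficients, so the coefficients, scale, and function shape here are illustrative assumptions.

    ```python
    import numpy as np

    def ar_perturbation(shape, coeffs, scale=8 / 255, rng=None):
        """Generate an image-sized perturbation from an AR(p) noise process.

        shape:  target perturbation shape, e.g. (32, 32)
        coeffs: AR coefficients [a_1, ..., a_p] (illustrative values)
        scale:  L-inf bound, as in standard unlearnable-example setups
        """
        rng = rng or np.random.default_rng()
        n = int(np.prod(shape))
        p = len(coeffs)
        x = np.zeros(n + p)
        noise = rng.standard_normal(n + p)
        # x[t] = a_1*x[t-1] + ... + a_p*x[t-p] + white noise
        for t in range(p, n + p):
            x[t] = np.dot(coeffs, x[t - p:t][::-1]) + noise[t]
        delta = x[p:].reshape(shape)
        # Normalize and clip to the epsilon ball
        return np.clip(delta / np.abs(delta).max(), -1, 1) * scale
    ```

    Because the perturbation is defined by the AR process alone, the same generator can poison any dataset or architecture, which is the portability claim the abstract makes.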

    Real world data on clinical profile, management and outcomes of venous thromboembolism from a tertiary care centre in India

    Objectives: Venous thromboembolism (VTE) is a major cause of mortality and morbidity worldwide. This study describes a real-world scenario of VTE presenting to a tertiary care hospital in India. Methods: All patients presenting with acute VTE or associated complications from January 2017 to January 2020 were included in the study. Results: A total of 330 patient admissions related to VTE were included over 3 years, of which 303 had an acute episode of VTE. The median age was 50 years (IQR 38–64); 30% of patients were younger than 40 years of age. Only 24% of patients had provoked VTE, with recent surgery (56%) and malignancy (16%) being the commonest risk factors. VTE manifested as isolated DVT (56%), isolated pulmonary embolism (PE; 19.1%), combined DVT/PE (22.4%), and upper limb DVT (2.3%). Patients with PE (n = 126) were classified as low-risk (15%), intermediate-risk (55%) and high-risk (29%). Reperfusion therapy was performed for 15.7% of patients with intermediate-risk and 75.6% with high-risk PE. In-hospital mortality for the entire cohort was 8.9%; 35% for high-risk PE and 11% for intermediate-risk PE. On multivariate analysis, the presence of active malignancy (OR = 5.8; 95% CI: 1.1–30.8, p = 0.038) and high-risk PE (OR = 4.8; 95% CI: 1.6–14.9, p = 0.006) were found to be independent predictors of mortality. Conclusion: Our data provide real-world perspectives on the demographics and management of patients presenting with acute VTE in a referral hospital setting. We observed relatively high mortality for intermediate-risk PE, necessitating better subclassification of this group to identify candidates for more aggressive approaches.