19 research outputs found

    DP-Mix: Mixup-based Data Augmentation for Differentially Private Learning

    Full text link
    Data augmentation techniques, such as simple image transformations and combinations, are highly effective at improving the generalization of computer vision models, especially when training data is limited. However, such techniques are fundamentally incompatible with differentially private learning approaches, due to the latter's built-in assumption that each training image's contribution to the learned model is bounded. In this paper, we investigate why naive applications of multi-sample data augmentation techniques, such as mixup, fail to achieve good performance and propose two novel data augmentation techniques specifically designed for the constraints of differentially private learning. Our first technique, DP-Mix_Self, achieves SoTA classification performance across a range of datasets and settings by performing mixup on self-augmented data. Our second technique, DP-Mix_Diff, further improves performance by incorporating synthetic data from a pre-trained diffusion model into the mixup process. We open-source the code at https://github.com/wenxuan-Bao/DP-Mix.Comment: 17 pages, 2 figures, to be published in Neural Information Processing Systems 202

    SoK: Memorization in General-Purpose Large Language Models

    Full text link
    Large Language Models (LLMs) are advancing at a remarkable pace, with myriad applications under development. Unlike most earlier machine learning models, they are no longer built for one specific application but are designed to excel in a wide range of tasks. A major part of this success is due to their huge training datasets and the unprecedented number of model parameters, which allow them to memorize large amounts of information contained in the training data. This memorization goes beyond mere language, and encompasses information only present in a few documents. This is often desirable since it is necessary for performing tasks such as question answering, and therefore an important part of learning, but also brings a whole array of issues, from privacy and security to copyright and beyond. LLMs can memorize short secrets in the training data, but can also memorize concepts like facts or writing styles that can be expressed in text in many different ways. We propose a taxonomy for memorization in LLMs that covers verbatim text, facts, ideas and algorithms, writing styles, distributional properties, and alignment goals. We describe the implications of each type of memorization - both positive and negative - for model performance, privacy, security and confidentiality, copyright, and auditing, and ways to detect and prevent memorization. We further highlight the challenges that arise from the predominant way of defining memorization with respect to model behavior instead of model weights, due to LLM-specific phenomena such as reasoning capabilities or differences between decoding algorithms. Throughout the paper, we describe potential risks and opportunities arising from memorization in LLMs that we hope will motivate new research directions

    Leakage-Abuse Attacks against Order-Revealing Encryption

    Get PDF
    Order-preserving encryption and its generalization order-revealing encryption (OPE/ORE) are used in a variety of settings in practice in order to allow sorting, performing range queries, and filtering data — all while only having access to ciphertexts. But OPE and ORE ciphertexts necessarily leak information about plaintexts, and what level of security they provide has been unclear. In this work, we introduce new leakage-abuse attacks that show how to recover plaintexts from OPE/ORE-encrypted databases. Underlying our new attacks against practically-used schemes is a framework in which we cast the adversary’s challenge as a non- crossing bipartite matching problem. This allows easy tailoring of attacks to a specific scheme’s leakage profile. In a case study of customer records, we show attacks that recover 99% of first names, 97% of last names, and 90% of birthdates held in a database, despite all values being encrypted with the OPE scheme most widely used in practice. We also show the first attack against the recent frequency- hiding Kerschbaum scheme, to which no prior attacks have been demonstrated. Our attack recovers frequently occurring plaintexts most of the time

    The Inconvenient Truth about Web Certificates

    Get PDF
    HTTPS is the de facto standard for securing Internet communications. Although it is widely deployed, the security provided with HTTPS in practice is dubious. HTTPS may fail to provide security for multiple reasons, mostly due to certificate-based authentication failures. Given the importance of HTTPS, we investigate the current scale and practices of HTTPS and certificate-based deployment. We provide a large-scale empirical analysis that considers the top one million most popular websites. Our results show that very few websites implement certificate-based authentication properly. In most cases, domain mismatches between certificates and websites are observed. We study the economic, legal and social aspects of the problem. We identify causes and implications of the profit-oriented attitude of CAs and show how the current economic model leads to the distribution of cheap certificates for cheap security. Finally, we suggest possible changes to improve certificate-based authentication

    The Tao of Inference in Privacy-Protected Databases

    Get PDF
    To protect database confidentiality even in the face of full compromise while supporting standard functionality, recent academic proposals and commercial products rely on a mix of encryption schemes. The common recommendation is to apply strong, semantically secure encryption to the “sensitive” columns and protect other columns with property-revealing encryption (PRE) that supports operations such as sorting. We design, implement, and evaluate a new methodology for inferring data stored in such encrypted databases. The cornerstone is the multinomial attack, a new inference technique that is analytically optimal and empirically outperforms prior heuristic attacks against PRE-encrypted data. We also extend the multinomial attack to take advantage of correlations across multiple columns. These improvements recover PRE-encrypted data with sufficient accuracy to then apply machine learning and record linkage methods to infer the values of columns protected by semantically secure encryption or redaction. We evaluate our methodology on medical, census, and union-membership datasets, showing for the first time how to infer full database records. For PRE-encrypted attributes such as demographics and ZIP codes, our attack outperforms the best prior heuristic by a factor of 16. Unlike any prior technique, we also infer attributes, such as incomes and medical diagnoses, protected by strong encryption. For example, when we infer that a patient in a hospital-discharge dataset has a mental health or substance abuse condition, this prediction is 97% accurate

    Synthesizing Plausible Privacy-Preserving Location Traces

    Full text link
    10.1109/SP.2016.39IEEE Symposium on Security and Privacy (SP)546-56

    Plausible deniability for privacy-preserving data synthesis

    Full text link
    10.14778/3055540.3055542Proceedings of the VLDB Endowment105481-49
    corecore