DP-Mix: Mixup-based Data Augmentation for Differentially Private Learning
Data augmentation techniques, such as simple image transformations and
combinations, are highly effective at improving the generalization of computer
vision models, especially when training data is limited. However, such
techniques are fundamentally incompatible with differentially private learning
approaches, due to the latter's built-in assumption that each training image's
contribution to the learned model is bounded. In this paper, we investigate why
naive applications of multi-sample data augmentation techniques, such as mixup,
fail to achieve good performance and propose two novel data augmentation
techniques specifically designed for the constraints of differentially private
learning. Our first technique, DP-Mix_Self, achieves SoTA classification
performance across a range of datasets and settings by performing mixup on
self-augmented data. Our second technique, DP-Mix_Diff, further improves
performance by incorporating synthetic data from a pre-trained diffusion model
into the mixup process. We open-source the code at
https://github.com/wenxuan-Bao/DP-Mix.
Comment: 17 pages, 2 figures, to be published in Neural Information Processing Systems 2023.
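The abstract includes no code, but the constraint it describes, that differentially private learning bounds each training image's contribution, is easy to illustrate. Below is a minimal, hypothetical sketch of "self-mixup" in the spirit of DP-Mix_Self: mixing augmented views of a single example, so the mixed sample still derives from one training image and per-example gradient clipping under DP-SGD remains valid. The flip augmentation, Dirichlet weighting, and all names are illustrative assumptions, not the authors' implementation (see their repository for that).

```python
# Hypothetical sketch of "self-mixup": mix augmented views of ONE example,
# so each training image still contributes a single bounded gradient, as
# required by DP-SGD's per-example clipping. Not the authors' code.
import numpy as np

rng = np.random.default_rng(0)

def random_flip(img: np.ndarray) -> np.ndarray:
    """A simple single-sample augmentation (horizontal flip with prob. 0.5)."""
    return img[:, ::-1] if rng.random() < 0.5 else img

def self_mixup(img: np.ndarray, label: np.ndarray, k: int = 2, alpha: float = 0.2):
    """Mix k augmented views of the same example with Dirichlet weights.

    Because every view derives from one training image, the mixed sample's
    influence on the model can still be clipped per example under DP-SGD.
    """
    views = [random_flip(img) for _ in range(k)]
    weights = rng.dirichlet([alpha] * k)
    mixed = sum(w * v for w, v in zip(weights, views))
    return mixed, label  # label unchanged: all views share the same label

# Toy usage: a fake 8x8 grayscale "image".
img = rng.random((8, 8))
label = np.array([1.0])
x_mix, y_mix = self_mixup(img, label)
print(x_mix.shape, y_mix)
```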
SoK: Memorization in General-Purpose Large Language Models
Large Language Models (LLMs) are advancing at a remarkable pace, with myriad
applications under development. Unlike most earlier machine learning models,
they are no longer built for one specific application but are designed to excel
in a wide range of tasks. A major part of this success is due to their huge
training datasets and the unprecedented number of model parameters, which allow
them to memorize large amounts of information contained in the training data.
This memorization goes beyond mere language, and encompasses information only
present in a few documents. This is often desirable since it is necessary for
performing tasks such as question answering, and therefore an important part of
learning, but also brings a whole array of issues, from privacy and security to
copyright and beyond. LLMs can memorize short secrets in the training data, but
can also memorize concepts like facts or writing styles that can be expressed
in text in many different ways. We propose a taxonomy for memorization in LLMs
that covers verbatim text, facts, ideas and algorithms, writing styles,
distributional properties, and alignment goals. We describe the implications of
each type of memorization - both positive and negative - for model performance,
privacy, security and confidentiality, copyright, and auditing, and ways to
detect and prevent memorization. We further highlight the challenges that arise
from the predominant way of defining memorization with respect to model
behavior instead of model weights, due to LLM-specific phenomena such as
reasoning capabilities or differences between decoding algorithms. Throughout
the paper, we describe potential risks and opportunities arising from
memorization in LLMs that we hope will motivate new research directions.
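As a concrete illustration of the behavior-based definition of memorization that the SoK critiques, here is a minimal, hypothetical verbatim-extraction check: prompt a model with a prefix drawn from a training document and test whether its greedy continuation reproduces the true suffix. The `generate` callable, split point, and match threshold are illustrative assumptions; a real audit would also have to account for decoding-algorithm differences, which is exactly the limitation the paper highlights.

```python
# Hypothetical sketch of a behavior-based check for *verbatim* memorization.
# `generate` stands in for any LLM text-completion API; the prefix length
# and match threshold are illustrative.
def is_verbatim_memorized(generate, document: str, prefix_len: int = 200,
                          min_match: int = 50) -> bool:
    prefix, suffix = document[:prefix_len], document[prefix_len:]
    continuation = generate(prefix, max_new_chars=len(suffix))
    # Length of the leading exact match between model output and ground truth.
    match = 0
    for a, b in zip(continuation, suffix):
        if a != b:
            break
        match += 1
    return match >= min_match

# Toy stand-in "model" that has perfectly memorized one string:
CORPUS = "The quick brown fox jumps over the lazy dog. " * 20
fake_llm = lambda prompt, max_new_chars: CORPUS[len(prompt):len(prompt) + max_new_chars]
print(is_verbatim_memorized(fake_llm, CORPUS, prefix_len=100))  # True
```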
Leakage-Abuse Attacks against Order-Revealing Encryption
Order-preserving encryption and its generalization order-revealing encryption (OPE/ORE) are used in a variety of settings in practice in order to allow sorting, performing range queries, and filtering data, all while only having access to ciphertexts. But OPE and ORE ciphertexts necessarily leak information about plaintexts, and what level of security they provide has been unclear.

In this work, we introduce new leakage-abuse attacks that show how to recover plaintexts from OPE/ORE-encrypted databases. Underlying our new attacks against practically-used schemes is a framework in which we cast the adversary's challenge as a non-crossing bipartite matching problem. This allows easy tailoring of attacks to a specific scheme's leakage profile. In a case study of customer records, we show attacks that recover 99% of first names, 97% of last names, and 90% of birthdates held in a database, despite all values being encrypted with the OPE scheme most widely used in practice.

We also show the first attack against the recent frequency-hiding Kerschbaum scheme, to which no prior attacks have been demonstrated. Our attack recovers frequently occurring plaintexts most of the time.
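The abstract names the key abstraction, non-crossing bipartite matching, without spelling it out. As a rough sketch of how such a matching can be computed, the following alignment-style dynamic program matches sorted ciphertexts to sorted candidate plaintexts while preserving order, scoring pairs by how well their relative frequencies agree. The squared-error cost and toy data are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of the non-crossing matching framing: OPE preserves
# order, so an order-preserving (non-crossing) assignment of sorted
# ciphertexts to sorted candidate plaintexts can be found with textbook
# alignment-style dynamic programming.
def noncrossing_match(ct_freqs, pt_freqs):
    """ct_freqs: frequencies of sorted distinct ciphertexts.
    pt_freqs: frequencies of sorted candidate plaintexts (auxiliary data).
    Returns an order-preserving map ciphertext_index -> plaintext_index
    minimizing total squared frequency disagreement."""
    m, n = len(ct_freqs), len(pt_freqs)
    INF = float("inf")
    # best[i][j]: min cost of matching first i ciphertexts into first j plaintexts
    best = [[INF] * (n + 1) for _ in range(m + 1)]
    best[0] = [0.0] * (n + 1)
    choice = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(i, n + 1):  # need at least i plaintexts for i ciphertexts
            skip = best[i][j - 1]  # leave plaintext j unused
            take = best[i - 1][j - 1] + (ct_freqs[i - 1] - pt_freqs[j - 1]) ** 2
            best[i][j], choice[i][j] = min((skip, "skip"), (take, "take"))
    # Backtrack to read off the assignment.
    i, j, assign = m, n, {}
    while i > 0:
        if choice[i][j] == "take":
            assign[i - 1] = j - 1
            i, j = i - 1, j - 1
        else:
            j -= 1
    return assign

# Toy example: two ciphertext frequencies vs. four auxiliary candidates.
print(noncrossing_match([0.50, 0.30], [0.48, 0.12, 0.29, 0.11]))
# Matches ciphertext 0 -> plaintext 0 and ciphertext 1 -> plaintext 2.
```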
The Inconvenient Truth about Web Certificates
HTTPS is the de facto standard for securing Internet communications. Although it is widely deployed, the security provided by HTTPS in practice is dubious. HTTPS may fail to provide security for multiple reasons, mostly due to certificate-based authentication failures. Given the importance of HTTPS, we investigate the current scale and practices of HTTPS and certificate deployment. We provide a large-scale empirical analysis that considers the top one million most popular websites. Our results show that very few websites implement certificate-based authentication properly. In most cases, domain mismatches between certificates and websites are observed. We study the economic, legal, and social aspects of the problem. We identify causes and implications of the profit-oriented attitude of CAs and show how the current economic model leads to the distribution of cheap certificates for cheap security. Finally, we suggest possible changes to improve certificate-based authentication.
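One failure mode the study measures, a mismatch between a certificate and the website presenting it, can be checked mechanically. Below is a minimal sketch using only Python's standard library, where `ssl.create_default_context` enables both chain and hostname verification; the target host is illustrative, and this single-host check is a modern simplification of the paper's large-scale methodology.

```python
# Minimal sketch: does the certificate a site presents pass chain AND
# hostname validation for that site? Target host is illustrative.
import socket
import ssl

def cert_matches_host(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    ctx = ssl.create_default_context()  # verifies chain and hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return True             # handshake passed full validation
    except ssl.SSLCertVerificationError:
        return False                    # chain failure or domain mismatch
    except (ssl.SSLError, OSError):
        return False                    # other TLS / network failure

print(cert_matches_host("example.com"))
```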
The Tao of Inference in Privacy-Protected Databases
To protect database confidentiality even in the face of full compromise while supporting standard functionality, recent academic proposals and commercial products rely on a mix of encryption schemes. The common recommendation is to apply strong, semantically secure encryption to the "sensitive" columns and protect other columns with property-revealing encryption (PRE) that supports operations such as sorting.

We design, implement, and evaluate a new methodology for inferring data stored in such encrypted databases. The cornerstone is the multinomial attack, a new inference technique that is analytically optimal and empirically outperforms prior heuristic attacks against PRE-encrypted data. We also extend the multinomial attack to take advantage of correlations across multiple columns. These improvements recover PRE-encrypted data with sufficient accuracy to then apply machine learning and record linkage methods to infer the values of columns protected by semantically secure encryption or redaction.

We evaluate our methodology on medical, census, and union-membership datasets, showing for the first time how to infer full database records. For PRE-encrypted attributes such as demographics and ZIP codes, our attack outperforms the best prior heuristic by a factor of 16. Unlike any prior technique, we also infer attributes, such as incomes and medical diagnoses, protected by strong encryption. For example, when we infer that a patient in a hospital-discharge dataset has a mental health or substance abuse condition, this prediction is 97% accurate.
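The abstract does not spell out the multinomial attack itself, so the sketch below substitutes the simplest related technique: classic frequency analysis against a deterministically encrypted (equality-revealing) column, matching ciphertexts to auxiliary plaintexts by frequency rank. This is a deliberately simplified stand-in; the paper's attack is analytically optimal and additionally exploits order leakage and cross-column correlations.

```python
# Simplified stand-in for inference against PRE: frequency analysis of a
# deterministically encrypted column using an auxiliary public dataset.
from collections import Counter

def frequency_decode(ciphertexts, aux_counts):
    """Map each distinct ciphertext to a plaintext guess by frequency rank.

    ciphertexts: list of opaque ciphertext values from one column.
    aux_counts: dict plaintext -> count from auxiliary data. Any extra
    auxiliary plaintexts beyond the number of distinct ciphertexts are unused.
    """
    ct_ranked = [c for c, _ in Counter(ciphertexts).most_common()]
    pt_ranked = [p for p, _ in sorted(aux_counts.items(),
                                      key=lambda kv: -kv[1])]
    return dict(zip(ct_ranked, pt_ranked))

# Toy example: a ZIP-code column encrypted deterministically.
col = ["x9", "x9", "x9", "a1", "a1", "q7"]
aux = {"10001": 900, "94105": 600, "60601": 200}
print(frequency_decode(col, aux))
# {'x9': '10001', 'a1': '94105', 'q7': '60601'}
```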
Synthesizing Plausible Privacy-Preserving Location Traces
DOI: 10.1109/SP.2016.39. IEEE Symposium on Security and Privacy (SP), pp. 546-563.
Plausible deniability for privacy-preserving data synthesis
DOI: 10.14778/3055540.3055542. Proceedings of the VLDB Endowment 10(5), pp. 481-492.