How Much Does Each Datapoint Leak Your Privacy? Quantifying the Per-datum Membership Leakage
We study the per-datum Membership Inference Attacks (MIAs), where an attacker
aims to infer whether a fixed target datum has been included in the input
dataset of an algorithm and thus violates privacy. First, we define the membership leakage of a datum as the advantage of the optimal adversary attempting to identify it. Then, we quantify the per-datum membership leakage
for the empirical mean, and show that it depends on the Mahalanobis distance
between the target datum and the data-generating distribution. We further
assess the effect of two privacy defences, i.e., adding Gaussian noise and
sub-sampling. We quantify exactly how both of them decrease the per-datum
membership leakage. Our analysis builds on a novel proof technique that
combines an Edgeworth expansion of the likelihood ratio test and a
Lindeberg-Feller central limit theorem. Our analysis connects the existing
likelihood ratio and scalar product attacks, and also justifies different
canary selection strategies used in the privacy auditing literature. Finally,
our experiments demonstrate the impact of the leakage score, the sub-sampling ratio, and the noise scale on the per-datum membership leakage, as indicated by the theory.
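To make the leakage score concrete, here is a minimal sketch, assuming (per the abstract) that the algorithm releases the empirical mean and that per-datum leakage grows with the Mahalanobis distance; the function name and test data are illustrative, not the paper's code.

```python
import numpy as np

def mahalanobis_leakage_score(z, data):
    """Illustrative per-datum leakage proxy: the Mahalanobis distance
    between a target datum z and the empirical distribution of `data`.
    Per the abstract, a larger distance means the optimal adversary has a
    larger advantage at identifying z."""
    mu = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    diff = z - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Hypothetical usage: atypical data points score higher.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))
print(mahalanobis_leakage_score(data[0], data))          # typical datum
print(mahalanobis_leakage_score(np.full(5, 4.0), data))  # outlier, higher score
```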
Automated Big Text Security Classification
In recent years, traditional cybersecurity safeguards have proven ineffective
against insider threats. Famous cases of sensitive information leaks caused by
insiders, including the WikiLeaks release of diplomatic cables and the Edward
Snowden incident, have greatly harmed the U.S. government's relationship with
other governments and with its own citizens. Data Leak Prevention (DLP) is a
solution for detecting and preventing information leaks from within an
organization's network. However, state-of-the-art DLP detection models are only able to detect very limited types of sensitive information, and research in the field has been hindered by the lack of available sensitive texts. Many
researchers have focused on document-based detection with artificially labeled
"confidential documents" for which security labels are assigned to the entire
document, when in reality only a portion of the document is sensitive. This
type of whole-document based security labeling increases the chances of
preventing authorized users from accessing non-sensitive information within
sensitive documents. In this paper, we introduce Automated Classification
Enabled by Security Similarity (ACESS), a new detection model that addresses the complexity of big-text security classification and detection.
To analyze the ACESS system, we constructed a novel dataset, containing
formerly classified paragraphs from diplomatic cables made public by the
WikiLeaks organization. To our knowledge, this paper is the first to analyze a dataset that contains actual formerly sensitive information annotated at paragraph granularity.
Comment: Pre-print of Best Paper Award, IEEE Intelligence and Security Informatics (ISI) 2016 manuscript.
Auditing and Generating Synthetic Data with Controllable Trust Trade-offs
Data collected from the real world tends to be biased, unbalanced, and at
risk of exposing sensitive and private information. This reality has given rise
to the idea of creating synthetic datasets to alleviate risk, bias, harm, and
privacy concerns inherent in the real data. This concept relies on Generative
AI models to produce unbiased, privacy-preserving synthetic data while being
true to the real data. In this new paradigm, how can we tell if this approach
delivers on its promises? We present an auditing framework that offers a
holistic assessment of synthetic datasets and AI models trained on them,
centered around bias and discrimination prevention, fidelity to the real data,
utility, robustness, and privacy preservation. We showcase our framework by
auditing multiple generative models on diverse use cases, including education,
healthcare, banking, human resources, and across different modalities, from
tabular, to time-series, to natural language. Our use cases demonstrate the
importance of a holistic assessment in order to ensure compliance with
socio-technical safeguards that regulators and policymakers are increasingly
enforcing. For this purpose, we introduce the trust index that ranks multiple
synthetic datasets based on their prescribed safeguards and their desired
trade-offs. Moreover, we devise a trust-index-driven model selection and
cross-validation procedure via auditing in the training loop that we showcase
on a class of transformer models that we dub TrustFormers, across different
modalities. This trust-driven model selection allows for controllable trust
trade-offs in the resulting synthetic data. We instrument our auditing
framework with workflows that connect different stakeholders from model
development to audit and certification via a synthetic data auditing report.Comment: 49 pages; submitte
On the privacy risks of machine learning models
Machine learning (ML) has made huge progress in the last decade and has been applied to a wide range of critical applications. However, driven by the increasing adoption of machine learning models, the significance of privacy risks has become greater than ever. These risks can be classified into two categories depending on the role played by ML models: one in which the models themselves are vulnerable to leaking sensitive information, and the other in which the models are abused to violate privacy. In this dissertation, we investigate the privacy risks of machine learning models from two perspectives, i.e., the vulnerability of ML models and the abuse of ML models. To study the vulnerability of ML models to privacy risks, we conduct two studies on one of the most severe privacy attacks against ML models, namely the membership inference attack (MIA). First, we explore membership leakage in label-only exposure of ML models. We present the first label-only membership inference attack and reveal that membership leakage is more severe than previously shown. Second, we perform the first privacy analysis of multi-exit networks through the lens of membership leakage. We leverage existing attack methodologies to quantify the vulnerability of multi-exit networks to membership inference attacks and propose a hybrid attack that exploits the exit information to improve the attack performance. From the perspective of abusing ML models to violate privacy, we focus on deepfake face manipulation that can create visual misinformation. We propose the first defense system \system against GAN-based face manipulation by jeopardizing the process of GAN inversion, which is an essential step for subsequent face manipulation. All findings contribute to the community's insight into the privacy risks of machine learning models. We call on the community to consider in-depth investigations of privacy risks, like ours, into rapidly evolving machine learning techniques.
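For intuition on the label-only attack, here is a minimal sketch of a perturbation-robustness membership signal: it assumes members' predicted labels are more stable under small input perturbations, as the label-only literature argues; the function and its parameters are illustrative, not the dissertation's implementation.

```python
import numpy as np

def label_only_membership_score(predict_label, x, y, n_perturb=50, sigma=0.05):
    """Hypothetical label-only membership signal. `predict_label` is a
    black-box returning only hard labels (no confidences). Members tend to
    lie farther from the decision boundary, so their label survives more
    random perturbations; thresholding this score decides membership."""
    rng = np.random.default_rng(0)
    stable = sum(predict_label(x + rng.normal(scale=sigma, size=x.shape)) == y
                 for _ in range(n_perturb))
    return stable / n_perturb

# Toy usage with a constant classifier: the label is perfectly stable.
print(label_only_membership_score(lambda x: 0, np.zeros(10), 0))  # 1.0
```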
Development and Analysis of Deterministic Privacy-Preserving Policies Using Non-Stochastic Information Theory
A deterministic privacy metric using non-stochastic information theory is
developed. Particularly, minimax information is used to construct a measure of
information leakage, which is inversely proportional to the measure of privacy.
Anyone can submit a query to a trusted agent with access to a non-stochastic
uncertain private dataset. Optimal deterministic privacy-preserving policies
for responding to the submitted query are computed by maximizing the measure of
privacy subject to a constraint on the worst-case quality of the response
(i.e., the worst-case difference between the response by the agent and the
output of the query computed on the private dataset). The optimal
privacy-preserving policy is proved to be a piecewise constant function in the
form of a quantization operator applied on the output of the submitted query.
The measure of privacy is also used to analyze the performance of the k-anonymity methodology (a popular deterministic mechanism for privacy-preserving release of datasets using suppression and generalization techniques), proving that it is in fact not privacy-preserving.
Comment: improved introduction and numerical example.
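The piecewise-constant form of the optimal policy can be pictured as a simple quantizer over the query output; the bin-centre reporting rule below is one plausible reading of the abstract, not the paper's exact operator.

```python
import numpy as np

def quantized_response(query_output, bin_width):
    """Sketch of a piecewise-constant privacy-preserving policy: report the
    centre of the bin containing the true query output. The worst-case
    response error is bin_width / 2, and coarser bins reveal less
    (non-stochastic) information about the private dataset."""
    return (np.floor(query_output / bin_width) + 0.5) * bin_width

print(quantized_response(3.7, 1.0))  # 3.5: every output in [3, 4) maps here
```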
Assessing database and network threats in traditional and cloud computing
Cloud Computing is currently one of the most widely discussed terms in IT. While it offers a range of technological and financial benefits, its acceptance by organizations is not yet widespread. Security concerns are a main reason for this, and this paper studies the data and network threats posed in both traditional and cloud paradigms in an effort to ascertain in which areas cloud computing addresses security issues and where it introduces new ones. This evaluation is based on Microsoft's STRIDE threat model and discusses the stakeholders, the impact, and recommendations for tackling each threat.
XRay: Enhancing the Web's Transparency with Differential Correlation
Today's Web services - such as Google, Amazon, and Facebook - leverage user
data for varied purposes, including personalizing recommendations, targeting
advertisements, and adjusting prices. At present, users have little insight
into how their data is being used. Hence, they cannot make informed choices about the services they use. To increase transparency, we developed XRay,
the first fine-grained, robust, and scalable personal data tracking system for
the Web. XRay predicts which data in an arbitrary Web account (such as emails,
searches, or viewed products) is being used to target which outputs (such as
ads, recommended products, or prices). XRay's core functions are service
agnostic and easy to instantiate for new services, and they can track data
within and across services. To make predictions independent of the audited
service, XRay relies on the following insight: by comparing outputs from
different accounts with similar, but not identical, subsets of data, one can
pinpoint targeting through correlation. We show both theoretically, and through
experiments on Gmail, Amazon, and YouTube, that XRay achieves high precision
and recall by correlating data from a surprisingly small number of extra
accounts.
Comment: Extended version of a paper presented at the 23rd USENIX Security Symposium (USENIX Security 14).
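A toy rendering of the differential-correlation insight might look like the following: score each candidate input by how much its presence raises the rate at which the output appears across accounts. The accounts data structure and scoring rule are hypothetical simplifications, not XRay's API.

```python
def differential_correlation(accounts, target_ad):
    """Pick the input whose presence best separates accounts where the
    target output appears from those where it does not. `accounts` maps an
    account id to (set_of_inputs, set_of_observed_ads)."""
    inputs = set().union(*(ins for ins, _ in accounts.values()))
    def score(inp):
        with_inp = [ads for ins, ads in accounts.values() if inp in ins]
        without = [ads for ins, ads in accounts.values() if inp not in ins]
        rate = lambda group: (sum(target_ad in ads for ads in group)
                              / max(len(group), 1))
        return rate(with_inp) - rate(without)
    return max(inputs, key=score)

# Hypothetical accounts seeded with overlapping subsets of emails.
accounts = {
    1: ({"email_flights", "email_bank"}, {"ad_travel"}),
    2: ({"email_flights"}, {"ad_travel"}),
    3: ({"email_bank"}, set()),
}
print(differential_correlation(accounts, "ad_travel"))  # 'email_flights'
```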
Stealing Links from Graph Neural Networks
Graph data, such as chemical networks and social networks, may be deemed
confidential/private because the data owner often spends lots of resources
collecting the data or the data contains sensitive information, e.g., social
relationships. Recently, neural networks were extended to graph data, which are
known as graph neural networks (GNNs). Due to their superior performance, GNNs
have many applications, such as healthcare analytics, recommender systems, and
fraud detection. In this work, we propose the first attacks to steal a graph
from the outputs of a GNN model that is trained on the graph. Specifically,
given black-box access to a GNN model, our attacks can infer whether there
exists a link between any pair of nodes in the graph used to train the model.
We call our attacks link stealing attacks. We propose a threat model to
systematically characterize an adversary's background knowledge along three
dimensions, which together lead to a comprehensive taxonomy of 8 different link
stealing attacks. We propose multiple novel methods to realize these 8 attacks.
Extensive experiments on 8 real-world datasets show that our attacks are
effective at stealing links, e.g., AUC (area under the ROC curve) is above 0.95
in multiple cases. Our results indicate that the outputs of a GNN model reveal
rich information about the structure of the graph used to train the model.
Comment: To appear in the 30th USENIX Security Symposium, August 2021, Vancouver, B.C., Canada.
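The simplest such attack can be sketched as a posterior-similarity test on pairs of nodes; the distance choice below is illustrative, and a real attack would calibrate a decision threshold, e.g., on shadow data.

```python
import numpy as np

def link_stealing_score(posterior_u, posterior_v):
    """Sketch of a link-stealing signal: GNNs tend to make connected nodes'
    outputs similar, so a small distance between the posteriors returned by
    the black-box model for nodes u and v hints at an edge (u, v)."""
    return -float(np.linalg.norm(posterior_u - posterior_v))

# Hypothetical posteriors over 3 classes for three nodes.
p_u = np.array([0.80, 0.10, 0.10])
p_v = np.array([0.75, 0.15, 0.10])  # similar to u -> higher score
p_w = np.array([0.10, 0.20, 0.70])  # dissimilar -> lower score
print(link_stealing_score(p_u, p_v) > link_stealing_score(p_u, p_w))  # True
```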
Light Auditor: Power Measurement Can Tell Private Data Leakage Through IoT Covert Channels
Despite the many conveniences of using IoT devices, they have suffered from various attacks due to their weak security. Besides well-known botnet attacks, IoT devices are vulnerable to recent covert-channel attacks, yet no study to date has considered defending against these IoT covert-channel attacks. Among these attacks, researchers have demonstrated exfiltrating users' private data by exploiting the smart bulb's capability of infrared emission.
In this paper, we propose a power-auditing-based system that defends against the data exfiltration attack on the smart bulb as a case study. We first implement this infrared-based attack in a lab environment. With a newly collected power consumption dataset, we pre-process the data and transform them into two-dimensional images through the Continuous Wavelet Transform (CWT). Next, we design a two-dimensional convolutional neural network (2D-CNN) model to identify the CWT images generated by malicious behavior. Our experimental results show that the proposed design is efficient in identifying infrared-based anomalies: 1) with far fewer parameters than transfer-learning classifiers, it achieves an accuracy of 88% in identifying the attacks, including unseen patterns, and is similarly accurate to sophisticated transfer-learning CNNs such as AlexNet and GoogLeNet; 2) we validate that our system can classify the CWT images in real time.
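A minimal sketch of the described pipeline, assuming PyWavelets for the CWT and PyTorch for the 2D-CNN; the 'morl' wavelet, trace length, and layer sizes are assumptions, not the paper's configuration.

```python
import numpy as np
import pywt
import torch
import torch.nn as nn

def power_trace_to_cwt_image(trace, n_scales=64):
    """Turn a 1-D power-consumption trace into a 2-D scalogram 'image'."""
    coeffs, _ = pywt.cwt(trace, scales=np.arange(1, n_scales + 1),
                         wavelet="morl")
    return torch.tensor(np.abs(coeffs), dtype=torch.float32).unsqueeze(0)

class SmallCWTClassifier(nn.Module):
    """Lightweight 2D-CNN separating benign from covert-channel signatures."""
    def __init__(self, n_scales=64, trace_len=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
        )
        self.head = nn.Linear(16 * (n_scales // 16) * (trace_len // 16), 2)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

trace = np.random.rand(256)                      # stand-in power measurement
image = power_trace_to_cwt_image(trace)          # shape: (1, 64, 256)
logits = SmallCWTClassifier()(image.unsqueeze(0))
print(logits.shape)                              # torch.Size([1, 2])
```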