Towards Practical Federated Causal Structure Learning
Understanding causal relations is vital in scientific discovery. The process
of causal structure learning involves identifying causal graphs from
observational data to understand such relations. Usually, a central server
performs this task, but sharing data with the server poses privacy risks.
Federated learning can solve this problem, but existing solutions for federated
causal structure learning make unrealistic assumptions about data and lack
convergence guarantees. FedC2SL is a federated constraint-based causal
structure learning scheme that learns causal graphs using a federated
conditional independence test, which examines conditional independence between
two variables under a condition set without collecting raw data from clients.
FedC2SL requires weaker and more realistic assumptions about data and offers
stronger resistance to data variability among clients. FedPC and FedFCI are the
two variants of FedC2SL for causal structure learning in causal sufficiency and
causal insufficiency, respectively. The study evaluates FedC2SL using both
synthetic datasets and real-world data against existing solutions and finds it
demonstrates encouraging performance and strong resilience to data
heterogeneity among clients.
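The core idea of a federated conditional independence test — aggregating client statistics instead of pooling raw data — can be sketched with a stratified chi-square over client-supplied count tables. This is only an illustration of the principle, not FedC2SL's actual protocol (which the abstract does not specify); the function names and table layout are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def local_counts(data, x, y, z, card):
    # Each client tabulates joint counts of (X, Y) within each stratum
    # of the conditioning variable Z; only these counts leave the client.
    counts = np.zeros(card)  # shape: (|X|, |Y|, |Z|)
    for row in data:
        counts[row[x], row[y], row[z]] += 1
    return counts

def federated_ci_test(client_tables, alpha=0.05):
    # The server sums the clients' count tables (no raw rows are shared)
    # and runs a stratified chi-square test of X independent of Y given Z.
    t = sum(client_tables)
    stat, dof = 0.0, 0
    for k in range(t.shape[2]):
        slice_ = t[:, :, k]
        n = slice_.sum()
        if n == 0:
            continue
        expected = np.outer(slice_.sum(1), slice_.sum(0)) / n
        mask = expected > 0
        stat += ((slice_[mask] - expected[mask]) ** 2 / expected[mask]).sum()
        dof += (slice_.shape[0] - 1) * (slice_.shape[1] - 1)
    p_value = 1 - chi2.cdf(stat, dof)
    return bool(p_value >= alpha)  # True: accept the independence hypothesis
```

Note that the real scheme must also protect the count tables themselves (e.g., via secure aggregation), which this sketch omits.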
Achieving Differential Privacy and Fairness in Machine Learning
Machine learning algorithms are used to make decisions in applications such as recruiting, lending, and policing. These algorithms rely on large amounts of sensitive individual information to work properly. Hence, there are societal concerns about machine learning algorithms regarding privacy and fairness. Currently, many studies focus on protecting individual privacy or on ensuring algorithmic fairness separately, without taking their connection into account. However, new challenges arise in privacy-preserving and fairness-aware machine learning. On the one hand, there is fairness within the private model, i.e., how to meet both privacy and fairness requirements simultaneously in machine learning algorithms. On the other hand, there is fairness between the private model and the non-private model, i.e., how to ensure that the utility loss due to differential privacy is the same for each group.
The goal of this dissertation is to address challenging issues in privacy-preserving and fairness-aware machine learning: achieving differential privacy with satisfactory utility and efficiency in complex and emerging tasks; using generative models to generate fair data and to assist fair classification; achieving both differential privacy and fairness simultaneously within the same model; and achieving equal utility loss w.r.t. each group between the private model and the non-private model.
In this dissertation, we develop the following algorithms to address the above challenges.
(1) We develop PrivPC and DPNE algorithms to achieve differential privacy in complex and emerging tasks of causal graph discovery and network embedding, respectively.
(2) We develop the fair generative adversarial neural networks framework and three algorithms (FairGAN, FairGAN+ and CFGAN) to achieve fair data generation and classification through generative models based on different association-based and causation-based fairness notions.
(3) We develop PFLR and PFLR* algorithms to simultaneously achieve both differential privacy and fairness in logistic regression.
(4) We develop the DPSGD-F algorithm to remove the disparate impact of differential privacy on model accuracy w.r.t. each group.
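The tension the dissertation studies — calibrated noise degrading a learned model — can be made concrete with output perturbation for regularized logistic regression (Chaudhuri & Monteleoni, 2009). This is a generic epsilon-DP baseline for illustration, not the PFLR or DPSGD-F algorithms themselves; it assumes feature vectors with L2 norm at most 1 and binary labels.

```python
import numpy as np

def dp_logistic_regression(X, y, eps, lam=0.1, iters=500, lr=0.5, rng=None):
    # Output perturbation: train an L2-regularized logistic regression,
    # then add noise whose magnitude is calibrated to the L2 sensitivity
    # 2 / (n * lam) of the regularized minimizer (assumes ||x_i|| <= 1).
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ w))
        grad = X.T @ (p - y) / n + lam * w      # mean log-loss gradient + ridge
        w -= lr * grad
    sensitivity = 2 / (n * lam)
    # Sample a noise vector with Gamma-distributed norm and uniform
    # direction, the standard mechanism for L2 sensitivity.
    norm = rng.gamma(d, sensitivity / eps)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    return w + norm * direction
```

Larger eps (weaker privacy) shrinks the expected noise norm d * sensitivity / eps, which is exactly the privacy/utility trade-off the group-wise fairness analysis quantifies.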
Causal Discovery Under Local Privacy
Differential privacy is a widely adopted framework designed to safeguard the
sensitive information of data providers within a data set. It is based on the
application of controlled noise at the interface between the server that stores
and processes the data, and the data consumers. Local differential privacy is a
variant that allows data providers to apply the privatization mechanism
themselves to their data individually. It therefore provides protection even in
contexts in which the server, or even the data collector, cannot be trusted.
The introduction of noise, however, inevitably affects the utility of the data,
particularly by distorting the correlations between individual data components.
This distortion can prove detrimental to tasks such as causal discovery. In
this paper, we consider various well-known locally differentially private
mechanisms and compare the trade-off between the privacy they provide, and the
accuracy of the causal structure produced by algorithms for causal learning
when applied to data obfuscated by these mechanisms. Our analysis yields
valuable insights for selecting appropriate local differentially private
protocols for causal discovery tasks. We foresee that our findings will aid
researchers and practitioners in conducting locally private causal discovery.
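A concrete instance of such a mechanism is Warner's randomized response, which satisfies eps-local-DP for a single binary attribute. It shows both how each provider privatizes locally, before the data leaves their hands, and why the server must debias the aggregate — the same per-record noise is what distorts the correlations causal discovery depends on. A minimal sketch:

```python
import numpy as np

def randomize(bit, eps, rng):
    # Warner's randomized response: report the true bit with probability
    # e^eps / (e^eps + 1), the flipped bit otherwise. This satisfies
    # eps-local-DP because either report is plausible for either input.
    p = np.exp(eps) / (np.exp(eps) + 1)
    return bit if rng.random() < p else 1 - bit

def debias(reports, eps):
    # Unbiased estimate of the true mean from the privatized reports:
    # invert E[report] = (1 - p) + (2p - 1) * true_mean.
    p = np.exp(eps) / (np.exp(eps) + 1)
    return (np.mean(reports) - (1 - p)) / (2 * p - 1)
```

Smaller eps drives p toward 1/2, blowing up the 1 / (2p - 1) correction factor and hence the variance — the accuracy cost the paper's comparison measures on causal-structure output.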
SoK: Differential Privacies
Shortly after it was first introduced in 2006, differential privacy became
the flagship data privacy definition. Since then, numerous variants and
extensions have been proposed to adapt it to different scenarios and attacker
models. In this work, we propose a systematic taxonomy of these variants and
extensions. We list all data privacy definitions based on differential privacy,
and partition them into seven categories, depending on which aspect of the
original definition is modified.
These categories act like dimensions: variants from the same category cannot
be combined, but variants from different categories can be combined to form new
definitions. We also establish a partial ordering of relative strength between
these notions by summarizing existing results. Furthermore, we list which of
these definitions satisfy some desirable properties, like composition,
post-processing, and convexity by either providing a novel proof or collecting
existing ones. Comment: This is the full version of the SoK paper with the same
title, accepted at PETS (Privacy Enhancing Technologies Symposium) 202
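The original 2006 definition that all these variants modify can be made concrete with the Laplace mechanism: a counting query has sensitivity 1 (adding or removing one record changes the count by at most 1), so Laplace noise of scale 1/eps guarantees that neighboring datasets yield outputs whose probabilities differ by at most a factor of e^eps. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def laplace_count(data, predicate, eps, rng=None):
    # Classic eps-DP counting query (Dwork et al., 2006): the count has
    # sensitivity 1, so Laplace noise with scale 1/eps suffices for
    # eps-differential privacy in the original, central-model definition.
    rng = rng if rng is not None else np.random.default_rng(0)
    true_count = sum(1 for row in data if predicate(row))
    return true_count + rng.laplace(scale=1.0 / eps)
```

The variants the SoK catalogues change exactly the ingredients visible here: the neighboring relation behind the sensitivity, the quantification over outputs, or the shape of the e^eps bound.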
Enabling Runtime Verification of Causal Discovery Algorithms with Automated Conditional Independence Reasoning (Extended Version)
Causal discovery is a powerful technique for identifying causal relationships
among variables in data. It has been widely used in various applications in
software engineering. Causal discovery extensively involves conditional
independence (CI) tests. Hence, its output quality highly depends on the
performance of CI tests, which can often be unreliable in practice. Moreover,
privacy concerns arise when excessive CI tests are performed.
Despite the distinct nature between unreliable and excessive CI tests, this
paper identifies a unified and principled approach to addressing both of them.
Generally, CI statements, the outputs of CI tests, adhere to Pearl's axioms,
which are a set of well-established integrity constraints on conditional
independence. Hence, we can either detect erroneous CI statements if they
violate Pearl's axioms or prune excessive CI statements if they are logically
entailed by Pearl's axioms. Holistically, both problems boil down to reasoning
about the consistency of CI statements under Pearl's axioms (referred to as CIR
problem).
We propose a runtime verification tool called CICheck, designed to harden
causal discovery algorithms from reliability and privacy perspectives. CICheck
employs a sound and decidable encoding scheme that translates CIR into SMT
problems. To solve the CIR problem efficiently, CICheck introduces a four-stage
decision procedure with three lightweight optimizations that actively prove or
refute consistency, and only resort to costly SMT-based reasoning when
necessary. Based on the decision procedure to CIR, CICheck includes two
variants: ED-CICheck and EP-CICheck, which detect erroneous CI tests (to
enhance reliability) and prune excessive CI tests (to enhance privacy),
respectively. [abridged due to length limit]
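To see what "reasoning about the consistency of CI statements under Pearl's axioms" means concretely, consider just one axiom, decomposition: independence of X from a set of variables entails independence of X from any subset of it. A toy checker for this single axiom is sketched below; CICheck itself encodes the full axiom system into SMT, and the statement representation here is an assumption made for illustration.

```python
from itertools import combinations

def subsets(s):
    # All non-empty subsets of a frozenset.
    s = list(s)
    return [frozenset(c) for r in range(1, len(s) + 1)
            for c in combinations(s, r)]

def violates_decomposition(statements):
    # statements: dict mapping (X, Y, Z) triples of frozensets to a bool
    # (the CI test's verdict). Decomposition says I(X; Y u W | Z) entails
    # I(X; Y | Z), so a statement marked independent whose decomposed
    # sub-statement is marked dependent is a contradiction -- the kind of
    # erroneous-CI-test evidence ED-CICheck is designed to surface.
    for (x, y, z), indep in statements.items():
        if not indep:
            continue
        for y_sub in subsets(y):
            if statements.get((x, y_sub, z)) is False:
                return True  # inconsistent under Pearl's axioms
    return False
```

The dual use follows the same logic: a statement entailed by already-verified ones need not be tested at all, which is how pruning excessive CI tests saves privacy budget.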
The Fast and the Private: Task-based Dataset Search
Modern dataset search platforms employ ML task-based utility metrics instead
of relying on metadata-based keywords to comb through extensive dataset
repositories. In this setup, requesters provide an initial dataset, and the
platform identifies complementary datasets to augment (join or union) the
requester's dataset such that the ML model (e.g., linear regression)
performance is improved most. Although effective, current task-based data
searches are stymied by (1) high latency which deters users, (2) privacy
concerns for regulatory standards, and (3) low data quality which provides low
utility. We introduce Mileena, a fast, private, and high-quality task-based
dataset search platform. At its heart, Mileena is built on pre-computed
semi-ring sketches for efficient ML training and evaluation. Based on these
sketches, we develop a novel Factorized Privacy Mechanism that makes the
search differentially private and scales to arbitrary corpus sizes and numbers
of requests without major quality degradation. We also demonstrate the early
promise in using LLM-based agents for automatic data transformation and
applying semi-rings to support causal discovery and treatment effect
estimation.
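The key property of a semi-ring sketch is additivity: aggregates computed per partition (or per join key) can be merged by simple addition and remain sufficient to fit the model, so candidate augmentations can be evaluated without re-scanning rows. A minimal one-dimensional least-squares sketch illustrating the idea — not Mileena's actual data structure:

```python
import numpy as np

class RegressionSketch:
    # Additive aggregate (n, Sx, Sy, Sxx, Sxy): sufficient statistics for
    # 1-D ordinary least squares over any union of data partitions.
    def __init__(self, x=(), y=()):
        x, y = np.asarray(x, float), np.asarray(y, float)
        self.n = len(x)
        self.sx, self.sy = x.sum(), y.sum()
        self.sxx, self.sxy = (x * x).sum(), (x * y).sum()

    def __add__(self, other):
        # Merging two sketches is component-wise addition -- the semi-ring
        # operation that makes pre-computation and reuse possible.
        out = RegressionSketch()
        out.n = self.n + other.n
        out.sx, out.sy = self.sx + other.sx, self.sy + other.sy
        out.sxx, out.sxy = self.sxx + other.sxx, self.sxy + other.sxy
        return out

    def slope(self):
        # OLS slope recovered from the aggregates alone, no raw rows needed.
        return (self.n * self.sxy - self.sx * self.sy) / \
               (self.n * self.sxx - self.sx ** 2)
```

Because only these few numbers need to be released per candidate dataset, calibrating noise to them (rather than to the raw rows) is what lets a factorized mechanism scale to large corpora.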
A Comprehensive Survey on Trustworthy Graph Neural Networks: Privacy, Robustness, Fairness, and Explainability
Graph Neural Networks (GNNs) have made rapid developments in the recent
years. Due to their great ability in modeling graph-structured data, GNNs are
vastly used in various applications, including high-stakes scenarios such as
financial analysis, traffic predictions, and drug discovery. Despite their
great potential in benefiting humans in the real world, recent studies show that
GNNs can leak private information, are vulnerable to adversarial attacks, can
inherit and magnify societal bias from training data, and lack interpretability,
all of which risk causing unintentional harm to users and society. For
example, existing works demonstrate that attackers can fool GNNs into giving
the outcome they desire with unnoticeable perturbations on the training graph.
GNNs trained on social networks may embed discrimination in their decision
process, strengthening undesirable societal bias. Consequently, trustworthy
GNNs in various aspects are emerging to prevent the harm from GNN models and
increase the users' trust in GNNs. In this paper, we give a comprehensive
survey of GNNs in the computational aspects of privacy, robustness, fairness,
and explainability. For each aspect, we give the taxonomy of the related
methods and formulate the general frameworks for the multiple categories of
trustworthy GNNs. We also discuss the future research directions of each aspect
and the connections between these aspects to help achieve trustworthiness.
Differentially Private Conditional Independence Testing
Conditional independence (CI) tests are widely used in statistical data
analysis, e.g., they are the building block of many algorithms for causal graph
discovery. The goal of a CI test is to accept or reject the null hypothesis
that , where . In this work, we investigate conditional
independence testing under the constraint of differential privacy. We design
two private CI testing procedures: one based on the generalized covariance
measure of Shah and Peters (2020) and another based on the conditional
randomization test of Candès et al. (2016) (under the model-X assumption). We
provide theoretical guarantees on the performance of our tests and validate
them empirically. These are the first private CI tests with rigorous
theoretical guarantees that work for the general case when Z is continuous.
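The simplest way to see the privacy/power trade-off in this setting is a naive baseline for discrete data: perturb the contingency counts before computing a chi-square statistic. Each record touches exactly one cell, so releasing all noisy cells is eps-DP with per-cell Laplace noise of scale 1/eps. This is only a baseline for intuition, not the GCM- or CRT-based tests the paper develops (those handle continuous Z, which count perturbation cannot).

```python
import numpy as np

def private_chi2_stat(table, eps, rng=None):
    # Naive private independence statistic: add Laplace(1/eps) noise to
    # each cell of a contingency table (sensitivity 1 per cell, since one
    # record changes one count by one), clamp away non-positive cells,
    # then compute the chi-square statistic on the noisy table.
    rng = rng if rng is not None else np.random.default_rng(0)
    noisy = np.maximum(table + rng.laplace(scale=1.0 / eps, size=table.shape),
                       1e-9)
    n = noisy.sum()
    expected = np.outer(noisy.sum(1), noisy.sum(0)) / n
    return ((noisy - expected) ** 2 / expected).sum()
```

With strong dependence the signal dwarfs the noise, but near the null the added noise inflates the statistic's variance, which is why private tests need the corrected thresholds and guarantees the paper provides.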
PreFair: Privately Generating Justifiably Fair Synthetic Data
When a database is protected by Differential Privacy (DP), its usability is
limited in scope. In this scenario, generating a synthetic version of the data
that mimics the properties of the private data allows users to perform any
operation on the synthetic data, while maintaining the privacy of the original
data. Therefore, multiple works have been devoted to devising systems for DP
synthetic data generation. However, such systems may preserve or even magnify
properties of the data that make it unfair, rendering the synthetic data unfit
for use. In this work, we present PreFair, a system that allows for DP fair
synthetic data generation. PreFair extends the state-of-the-art DP data
generation mechanisms by incorporating a causal fairness criterion that ensures
fair synthetic data. We adapt the notion of justifiable fairness to fit the
synthetic data generation scenario. We further study the problem of generating
DP fair synthetic data, showing its intractability and designing algorithms
that are optimal under certain assumptions. We also provide an extensive
experimental evaluation, showing that PreFair generates synthetic data that is
significantly fairer than the data generated by leading DP data generation
mechanisms, while remaining faithful to the private data.