Differentially-Private Decision Trees with Probabilistic Robustness to Data Poisoning
Decision trees are interpretable models that are well-suited to non-linear
learning problems. Much work has been done on extending decision tree learning
algorithms with differential privacy, a framework that guarantees the privacy of
samples within the training data. However, current state-of-the-art algorithms
for this purpose sacrifice much utility for a small privacy benefit. These
solutions create random decision nodes that reduce decision tree accuracy or
spend an excessive share of the privacy budget on labeling leaves. Moreover,
many works either do not support continuous data or leak information about its
feature values. We propose a new method called PrivaTree based on private
histograms that chooses good splits while consuming a small privacy budget. The
resulting trees provide a significantly better privacy-utility trade-off and
accept mixed numerical and categorical data without leaking additional
information. Finally, while it is notoriously hard to give robustness
guarantees against data poisoning attacks, we prove bounds for the expected
success rates of backdoor attacks against differentially-private learners. Our
experimental results show that PrivaTree consistently outperforms previous
works on predictive accuracy and significantly improves robustness against
backdoor attacks compared to regular decision trees.
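As an illustration of the private-histogram idea (not the PrivaTree algorithm itself, whose details are not given in this abstract), the sketch below chooses a numerical split threshold from Laplace-noised per-bin class counts; the function and parameter names are hypothetical:

```python
import numpy as np

def dp_histogram_split(x, y, feature_range, n_bins=10, epsilon=1.0, rng=None):
    """Sketch: choose a split threshold from a differentially-private
    histogram of per-bin class counts (binary labels 0/1 assumed).

    Each sample changes exactly one bin count by 1, so adding
    Laplace(1/epsilon) noise to every count yields an epsilon-DP
    histogram; the split is then chosen from the noisy counts alone.
    """
    rng = np.random.default_rng() if rng is None else rng
    lo, hi = feature_range  # public, data-independent bounds (no leakage)
    edges = np.linspace(lo, hi, n_bins + 1)
    bins = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    counts = np.zeros((n_bins, 2))
    for b, label in zip(bins, y):
        counts[b, label] += 1
    noisy = np.clip(counts + rng.laplace(scale=1.0 / epsilon,
                                         size=counts.shape), 0, None)
    # Score each interior edge by weighted Gini impurity of the two
    # sides, computed only from the noisy counts.
    best_edge, best_score = None, np.inf
    for i in range(1, n_bins):
        score = 0.0
        for side in (noisy[:i].sum(axis=0), noisy[i:].sum(axis=0)):
            n = max(side.sum(), 1e-9)
            score += n * (1.0 - ((side / n) ** 2).sum())
        if score < best_score:
            best_edge, best_score = edges[i], score
    return best_edge
```

Because only the noised histogram is consulted, the chosen threshold reveals nothing about individual samples beyond the DP guarantee of the histogram release.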
Learning in the Real World: Constraints on Cost, Space, and Privacy
The sheer demand for machine learning in fields as varied as healthcare, web-search ranking, factory automation, collision prediction, spam filtering, and many others frequently outpaces the intended use cases of existing machine learning models. In fact, a growing number of companies hire machine learning researchers to rectify this very problem: to tailor and/or design new state-of-the-art models for the setting at hand.
However, a large share of the machine learning problems encountered in practical settings can be generalized into three categories: cost, space, and privacy. The first category (cost) covers problems that must balance the accuracy of a machine learning model against the cost required to evaluate it; these include problems in web search, where results need to be delivered to a user in under a second while being as accurate as possible. The second category (space) collects problems that require running machine learning algorithms on low-memory computing devices. For instance, in search-and-rescue operations we may opt to use many small unmanned aerial vehicles (UAVs) equipped with machine learning algorithms for object detection to find a desired search target; these algorithms must fit within the physical memory limits of the UAV (and be energy-efficient) while reliably detecting objects. The third category (privacy) covers problems where one wishes to run machine learning algorithms on sensitive data. It has been shown that seemingly innocuous analyses of such data can be exploited to reveal information that individuals would prefer to keep private; thus, nearly any algorithm that runs on patient or economic data falls under this set of problems.
We devise solutions for each of these problem categories, including (i) a fast tree-based model for explicitly trading off accuracy and model evaluation time, (ii) a compression method for the k-nearest-neighbor classifier, and (iii) a private causal inference algorithm that protects sensitive data.
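The abstract does not detail the thesis's k-nearest-neighbor compression method itself; purely as an illustration of the memory/accuracy tension it addresses, here is Hart's classic condensed nearest-neighbor rule, which keeps only a prototype subset that still classifies the training set correctly under 1-NN:

```python
import numpy as np

def condensed_1nn(X, y):
    """Hart's condensed nearest-neighbor rule: retain a subset of the
    training set such that 1-NN over the retained prototypes still
    classifies every training point correctly. An illustration of k-NN
    memory compression, not the specific method proposed in the thesis.
    """
    keep = [0]  # seed the prototype set with the first sample
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            # 1-NN prediction using the current prototypes only
            d = np.linalg.norm(X[keep] - X[i], axis=1)
            if y[keep][int(np.argmin(d))] != y[i]:
                keep.append(i)  # misclassified -> promote to prototype
                changed = True
    return np.array(keep)
```

On well-separated data this retains only a handful of points near the class boundaries, while the 1-NN decision on the training set is preserved by construction.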
Privacy-Preserving Federated Learning over Vertically and Horizontally Partitioned Data for Financial Anomaly Detection
Effectively detecting evidence of financial anomalies requires
collaboration among multiple entities who own a diverse set of data, such as a
payment network system (PNS) and its partner banks. Trust among these financial
institutions is limited by regulation and competition. Federated learning (FL)
enables entities to collaboratively train a model when data is either
vertically or horizontally partitioned across the entities. However, in
real-world financial anomaly detection scenarios, the data is partitioned both
vertically and horizontally and hence it is not possible to use existing FL
approaches in a plug-and-play manner.
Our novel solution, PV4FAD, combines fully homomorphic encryption (HE),
secure multi-party computation (SMPC), differential privacy (DP), and
randomization techniques to balance privacy and accuracy during training and to
prevent inference threats at model deployment time. Our solution provides input
privacy through HE and SMPC, and output privacy against inference time attacks
through DP. Specifically, we show that, in the honest-but-curious threat model,
banks do not learn any sensitive features about PNS transactions, and the PNS
does not learn any information about the banks' dataset but only learns
prediction labels. We also develop and analyze a DP mechanism to protect output
privacy during inference. Our solution generates high-utility models by
significantly reducing the per-bank noise level while satisfying distributed
DP. To ensure high accuracy, our approach produces an ensemble model, in
particular, a random forest. This enables us to take advantage of the
well-known properties of ensembles to reduce variance and increase accuracy.
Our solution won second prize in the first phase of the U.S. Privacy Enhancing
Technologies (PETs) Prize Challenge.
Comment: Prize Winner in the U.S. Privacy Enhancing Technologies (PETs) Prize Challenge
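The claim about "significantly reducing the per-bank noise level while satisfying distributed DP" can be illustrated with a simple noise-splitting sketch (an assumption-laden illustration, not the paper's protocol; in PV4FAD the shares would travel under HE/SMPC, whereas here they are summed in the clear):

```python
import numpy as np

def distributed_gaussian_release(local_values, sigma_total, rng=None):
    """Sketch of distributed DP noise splitting: if the aggregate needs
    Gaussian noise of std sigma_total, each of the n parties adds
    independent N(0, sigma_total**2 / n) noise locally. Since variances
    of independent Gaussians add, the shares sum to the true aggregate
    plus N(0, sigma_total**2), so no single party must add the
    full-scale noise on its own.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(local_values)
    per_party_std = sigma_total / np.sqrt(n)  # n * (sigma^2 / n) = sigma^2
    noisy_shares = [v + rng.normal(0.0, per_party_std) for v in local_values]
    return sum(noisy_shares)
```

Each bank's local noise thus shrinks by a factor of sqrt(n) relative to adding the full noise independently, while the released aggregate still carries the variance the DP analysis requires.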
Ensembling Neural Networks for Improved Prediction and Privacy in Early Diagnosis of Sepsis
Ensembling neural networks is a long-standing technique for improving the
generalization error of neural networks by combining networks with orthogonal
properties via a committee decision. We show that this technique is an ideal
fit for machine learning on medical data: First, ensembles are amenable to
parallel and asynchronous learning, thus enabling efficient training of
patient-specific component neural networks. Second, building on the idea of
minimizing generalization error by selecting uncorrelated patient-specific
networks, we show that one can build an ensemble of a few selected
patient-specific models that outperforms a single model trained on much larger
pooled datasets. Third, the non-iterative ensemble combination step is an
optimal low-dimensional entry point to apply output perturbation to guarantee
the privacy of the patient-specific networks. We exemplify our framework of
differentially private ensembles on the task of early prediction of sepsis,
using real-life intensive care unit data labeled by clinical experts.
Comment: Accepted at MLHC 202
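To illustrate why the non-iterative combination step is a convenient entry point for output perturbation (a hedged sketch under stated assumptions, not the paper's exact mechanism; the sensitivity accounting and names below are illustrative):

```python
import numpy as np

def private_ensemble_score(member_probs, epsilon, rng=None):
    """Output perturbation at the ensemble-combination step.

    member_probs: length-m array of each patient-specific model's
    predicted sepsis probability in [0, 1]. Replacing any one member's
    prediction changes the mean by at most 1/m, so Laplace noise with
    scale 1/(m * epsilon) makes the released score epsilon-DP with
    respect to any individual member network's contribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = len(member_probs)
    mean = float(np.mean(member_probs))
    noisy = mean + rng.laplace(scale=1.0 / (m * epsilon))
    return float(np.clip(noisy, 0.0, 1.0))
```

Because the committee decision is a single low-dimensional average rather than a high-dimensional weight vector, the sensitivity is small and the added noise barely perturbs the released score.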