Privacy-Preserving Gradient Boosting Decision Trees
The Gradient Boosting Decision Tree (GBDT) is a popular machine learning
model for a wide range of tasks. In this paper, we study how to improve the
model accuracy of GBDT while preserving the strong guarantee of differential
privacy. Sensitivity and privacy budget are two key design aspects for the
effectiveness of differentially private models. Existing solutions for GBDT with
differential privacy suffer from significant accuracy loss due to overly loose
sensitivity bounds and ineffective privacy budget allocations (especially
across different trees in the GBDT model). Loose sensitivity bounds force
more noise to be added to reach a fixed privacy level, and ineffective privacy
budget allocations worsen the accuracy loss, especially when the number of
trees is large. We therefore propose a new GBDT training algorithm that
achieves tighter sensitivity bounds and more effective noise allocations.
Specifically, by investigating the properties of the gradients and the
contribution of each tree in GBDTs, we propose to adaptively control the
gradients of the training data in each iteration and to clip leaf nodes in
order to tighten the sensitivity bounds. Furthermore, we design a novel
boosting framework that allocates the privacy budget across trees so that the
accuracy loss is further reduced. Our experiments show that our approach
achieves much better model accuracy than other baselines.
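The two levers the abstract describes, bounding sensitivity by clipping gradients and then adding noise calibrated to that bound, can be illustrated with a minimal plain-Python sketch of a single noisy leaf value. This is not the paper's algorithm; the function names, the regularization term, and all default parameters are invented for illustration.

```python
import math
import random

def laplace_noise(scale, rng):
    # Sample Laplace(0, scale) via inverse-CDF transform of a uniform draw.
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def dp_leaf_value(gradients, clip=1.0, reg_lambda=0.1, epsilon=1.0, rng=None):
    """Illustrative differentially private leaf value: clip each per-example
    gradient to [-clip, clip], compute the usual regularized GBDT leaf value,
    then add Laplace noise scaled to the sensitivity the clipping implies."""
    rng = rng or random.Random(0)
    clipped = [max(-clip, min(clip, g)) for g in gradients]
    raw = -sum(clipped) / (len(clipped) + reg_lambda)
    # Changing one clipped gradient moves the sum by at most 2*clip, so the
    # leaf value moves by at most 2*clip / (n + lambda) -- the L1 sensitivity
    # under this simplified neighboring-dataset model.
    sensitivity = 2.0 * clip / (len(clipped) + reg_lambda)
    return raw + laplace_noise(sensitivity / epsilon, rng)
```

Tighter clipping bounds shrink the sensitivity and hence the noise, which is the intuition behind the paper's gradient-control idea.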
OpBoost: A Vertical Federated Tree Boosting Framework Based on Order-Preserving Desensitization
Vertical Federated Learning (FL) is a new paradigm that enables users holding
non-overlapping attributes of the same data samples to jointly train a model
without directly sharing the raw data. Nevertheless, recent works show that
this is still not sufficient to prevent privacy leakage from the training
process or the trained model. This paper focuses on privacy-preserving tree
boosting algorithms under vertical FL. Existing solutions based on
cryptography involve heavy computation and communication overhead and are
vulnerable to inference attacks, while the solution based on Local
Differential Privacy (LDP) avoids these problems but yields low accuracy of
the trained model.
This paper explores how to improve the accuracy of the widely deployed tree
boosting algorithms while satisfying differential privacy under vertical FL.
Specifically, we introduce a framework called OpBoost. Three order-preserving
desensitization algorithms satisfying a variant of LDP called distance-based
LDP (dLDP) are designed to desensitize the training data. In particular, we
optimize the dLDP definition and study efficient sampling distributions to
further improve the accuracy and efficiency of the proposed algorithms. The
proposed algorithms provide a trade-off between the privacy of pairs of values
with large distance and the utility of the desensitized values. Comprehensive
evaluations show that OpBoost achieves better prediction accuracy of trained
models than existing LDP approaches under reasonable settings. Our code is
open source.
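The general shape of order-preserving desensitization can be sketched as follows: bucketize a value, perturb the bucket index with a distribution whose mass decays with distance (so nearby inputs tend to map to nearby outputs and order is roughly preserved), and release the bucket midpoint. This is a minimal sketch of the distance-based-LDP idea, not OpBoost's actual mechanism; all names and parameters are hypothetical.

```python
import math
import random

def geometric_perturb(index, num_buckets, epsilon, rng):
    """Perturb a bucket index with a truncated geometric distribution:
    Pr[output j] is proportional to exp(-epsilon * |j - index|), so the
    probability mass decays with distance from the true bucket."""
    weights = [math.exp(-epsilon * abs(j - index)) for j in range(num_buckets)]
    r = rng.random() * sum(weights)
    for j, w in enumerate(weights):
        r -= w
        if r <= 0:
            return j
    return num_buckets - 1

def desensitize(values, lo, hi, num_buckets=64, epsilon=2.0, seed=0):
    """Map each value in [lo, hi] to a bucket, perturb the bucket index,
    and release the perturbed bucket's midpoint."""
    rng = random.Random(seed)
    width = (hi - lo) / num_buckets
    out = []
    for v in values:
        idx = min(num_buckets - 1, int((v - lo) / width))
        j = geometric_perturb(idx, num_buckets, epsilon, rng)
        out.append(lo + (j + 0.5) * width)
    return out
```

The decay rate trades off exactly what the abstract describes: pairs of buckets far apart are easier to distinguish (less privacy for distant pairs) but the released values stay more useful for order-sensitive learners like tree boosting.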
Privet: A Privacy-Preserving Vertical Federated Learning Service for Gradient Boosted Decision Tables
Vertical federated learning (VFL) has recently emerged as an appealing
distributed paradigm that empowers multi-party collaboration in training
high-quality models over vertically partitioned datasets. Gradient boosting
has been widely adopted in VFL; it builds an ensemble of weak learners
(typically decision trees) to achieve promising prediction performance.
Recently there has been growing interest in using the decision table as an
intriguing alternative weak learner in gradient boosting, due to its simpler
structure, good interpretability, and promising performance. In the
literature, there have been works on privacy-preserving VFL for gradient
boosted decision trees, but no prior work has been devoted to the emerging
case of decision tables. Training and inference on decision tables differ
from the case of generic decision trees, not to mention gradient boosting with
decision tables in VFL. In light of this, we design, implement, and evaluate
Privet, the first system framework enabling a privacy-preserving VFL service
for gradient boosted decision tables. Privet builds on lightweight
cryptography and allows an arbitrary number of participants holding vertically
partitioned datasets to securely train gradient boosted decision tables.
Extensive experiments over several real-world and synthetic datasets
demonstrate that Privet achieves promising performance, with utility
comparable to plaintext centralized learning.
Comment: Accepted in IEEE Transactions on Services Computing (TSC).
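A decision table in the gradient-boosting sense is commonly an oblivious tree: the same d (feature, threshold) tests are applied to every example, and the d binary outcomes index one of 2^d leaf values. The plaintext sketch below shows this weak learner and its use in an ensemble; Privet itself runs the equivalent steps under cryptographic protection, and all names here are invented.

```python
class DecisionTable:
    """A depth-d decision table (oblivious tree): d shared (feature,
    threshold) tests whose binary outcomes index one of 2**d leaves."""
    def __init__(self, tests, leaves):
        assert len(leaves) == 2 ** len(tests)
        self.tests = tests      # list of (feature_index, threshold) pairs
        self.leaves = leaves    # list of 2**d predicted values

    def predict(self, x):
        idx = 0
        for feature, threshold in self.tests:
            # Each test contributes one bit to the leaf index.
            idx = (idx << 1) | (1 if x[feature] > threshold else 0)
        return self.leaves[idx]

def boost_predict(tables, x, learning_rate=0.1):
    # Gradient-boosted ensemble prediction: sum of scaled table outputs.
    return sum(learning_rate * t.predict(x) for t in tables)
```

The fixed, data-independent access pattern (every example answers the same d tests) is part of what makes decision tables attractive for secure multi-party settings.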
Confidential Boosting with Random Linear Classifiers for Outsourced User-generated Data
User-generated data is crucial to predictive modeling in many applications.
With a web, mobile, or wearable interface, a data owner can continuously
record data generated by distributed users and build various predictive models
from it to improve their operations, services, and revenue. Due to the large
size and evolving nature of user data, data owners may rely on public cloud
service providers (the Cloud) for storage and computation scalability.
However, exposing sensitive user-generated data and advanced analytic models
to the Cloud raises privacy concerns. We present a confidential learning
framework, SecureBoost, for data owners who want to learn predictive models
from aggregated user-generated data while offloading the storage and
computational burden to the Cloud, without having to worry about protecting
the sensitive data. SecureBoost allows users to submit encrypted or randomly
masked data directly to a designated Cloud. Our framework uses random linear
classifiers (RLCs) as the base classifiers in the boosting framework, which
dramatically simplifies the design of the proposed confidential boosting
protocols while preserving model quality. A Cryptographic Service Provider
(CSP) assists the Cloud's processing, reducing the complexity of the protocol
constructions. We present two constructions of SecureBoost, HE+GC and
SecSh+GC, which combine homomorphic encryption, garbled circuits, and random
masking to achieve both security and efficiency. For a boosted model, the
Cloud learns only the RLCs and the CSP learns only the weights of the RLCs;
finally, the data owner collects the two parts to obtain the complete model.
We conduct extensive experiments to understand the quality of RLC-based
boosting and the cost distribution of the constructions. Our results show
that SecureBoost can efficiently learn high-quality boosting models from
protected user-generated data.
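The key simplification the abstract relies on is that a random linear classifier needs no data-dependent training: sampling a hyperplane requires no access to the protected data, so only the error evaluation and reweighting must run under encryption. A plaintext sketch of boosting over such classifiers (AdaBoost-style; not SecureBoost's protocol, and all names are invented):

```python
import math
import random

def random_linear_classifier(dim, rng):
    """Sample a random hyperplane classifier sign(w . x + b).  No training
    on the data is needed, which is what simplifies the secure protocol."""
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.gauss(0.0, 1.0)
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

def boost_rlc(X, y, rounds=20, candidates=50, seed=0):
    """AdaBoost over random linear classifiers: each round keeps the best
    of a few random hyperplanes under the current example weights."""
    rng = random.Random(seed)
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best, best_err = None, 2.0   # any candidate beats the sentinel
        for _ in range(candidates):
            h = random_linear_classifier(len(X[0]), rng)
            err = sum(w for w, x, t in zip(weights, X, y) if h(x) != t)
            if err < best_err:
                best, best_err = h, err
        best_err = min(max(best_err, 1e-9), 1 - 1e-9)  # avoid log(0)
        alpha = 0.5 * math.log((1 - best_err) / best_err)
        ensemble.append((alpha, best))
        # Reweight: misclassified examples gain weight, then renormalize.
        weights = [w * math.exp(-alpha * t * best(x))
                   for w, x, t in zip(weights, X, y)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble

def predict(ensemble, x):
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1
```

Individually weak, the random hyperplanes become accurate once boosting assigns them weights, which matches the abstract's split of knowledge: the classifiers themselves versus their weights.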
Privacy Preserving Text Recognition with Gradient-Boosting for Federated Learning
Typical machine learning approaches require centralized data for model
training, which may not be possible where restrictions on data sharing are in
place due to, for instance, privacy protection. The recently proposed
Federated Learning (FL) framework allows a shared model to be learned
collaboratively without centralizing data or sharing it among data owners.
However, we show in this paper that the generalization ability of the joint
model is poor on Non-Independent and Non-Identically Distributed (Non-IID)
data, particularly when the Federated Averaging (FedAvg) strategy is used in
this collaborative learning framework, due to the weight divergence
phenomenon. We propose a novel boosting algorithm for FL to address this
generalization issue, as well as to achieve much faster convergence in
gradient-based optimization. We demonstrate our Federated Boosting (FedBoost)
method on privacy-preserved text recognition, which shows significant
improvements in both performance and efficiency. The text images are based on
publicly available datasets for fair comparison, and we intend to make our
implementation public to ensure reproducibility.
Comment: The paper was submitted to BMVC2020 on April 30th.
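The FedAvg baseline the abstract criticizes, and the weight divergence it attributes the failure to, can be made concrete with a small sketch: aggregate client parameters weighted by local dataset size, and measure how far each client has drifted from the global model. This is a generic illustration, not FedBoost's aggregation rule; the function names are invented.

```python
def fedavg(client_weights, client_sizes):
    """Federated Averaging: aggregate client parameter vectors weighted by
    local dataset size.  On Non-IID data the clients' parameters drift
    apart between aggregation rounds -- the weight divergence phenomenon."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(dim)]

def weight_divergence(client_weights, global_weights):
    """Mean L2 distance from each client's parameters to the global model,
    a simple proxy for how severely the clients have diverged."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return (sum(dist(w, global_weights) for w in client_weights)
            / len(client_weights))
```

Large divergence values indicate that the plain average is far from every client's local optimum, which is the regime where the joint model generalizes poorly.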
Efficient Algorithms for Privately Releasing Marginals via Convex Relaxations
Consider a database of people, each represented by a bit-string of length d
corresponding to the settings of d binary attributes. A k-way marginal query
is specified by a subset S of k attributes and a k-dimensional binary vector v
specifying their values. The result for this query is a count of the number
of people in the database whose attribute vector restricted to S agrees
with v.
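The k-way marginal query defined above is straightforward to evaluate exactly; the private-release problem is about answering many such queries with noise. A minimal plain-Python sketch of the exact (non-private) count:

```python
def marginal_count(database, S, v):
    """Answer a k-way marginal query: count the rows whose attributes at
    the positions in S equal the corresponding bits of v."""
    return sum(1 for row in database
               if all(row[i] == bit for i, bit in zip(S, v)))
```

Here a database is a list of 0/1 rows, S is a tuple of attribute indices, and v is the matching tuple of target bits.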
Privately releasing approximate answers to a set of k-way marginal queries
is one of the most important and well-motivated problems in differential
privacy. Information-theoretically, the error complexity of marginal queries
is well understood, with known lower and upper bounds on the per-query
additive error. However, no polynomial-time algorithm with error complexity
as low as the information-theoretic upper bound is known for small k. In this
work we present a polynomial-time algorithm that, for any distribution on
marginal queries, achieves average error as good as the best known
information-theoretic upper bounds. This is an improvement over previous work
on efficiently releasing marginals when k is small and when low error is
desirable. Using private boosting, we are also able to give nearly matching
worst-case error bounds.
Our algorithms are based on the geometric techniques of Nikolov, Talwar, and
Zhang. The main new ingredients are convex relaxations and careful use of the
Frank-Wolfe algorithm for constrained convex minimization. To design our
relaxations, we rely on the Grothendieck inequality from functional analysis.