Privacy-preserving learning of gradient boosting decision trees (GBDT) has
the potential for strong utility-privacy tradeoffs on tabular data, such as
census data or medical metadata: classical GBDT learners can extract
non-linear patterns from small datasets. The state-of-the-art notion of
provable privacy guarantees is differential privacy, which requires that the
impact of any single data point is limited and deniable. We introduce a novel
differentially private GBDT learner and utilize four main techniques to improve
the utility-privacy tradeoff. (1) We use an improved noise scaling approach
with tighter accounting of the privacy leakage of a decision tree leaf than
prior work, resulting in noise that in expectation scales as O(1/n), for
n data points. (2) We integrate individual Rényi filters into our method to
learn from data points that have been underutilized during iterative
training; potentially of independent interest, this yields a natural yet
effective approach to learning on streams of non-i.i.d. data. (3) We
incorporate random decision tree splits to concentrate the privacy budget on
learning the leaves. (4) We deploy subsampling for privacy amplification (a
sketch of how these techniques combine is given below).
Our evaluation shows for the Abalone dataset (<4k training data points) an
R²-score of 0.39 for ε=0.15, which the closest prior work only
achieved for ε=10.0. On the Adult dataset (50k training data
points) we achieve a test error of 18.7% for ε=0.07, which the
closest prior work only achieved for ε=1.0. For the Abalone dataset
at ε=0.54 we achieve an R²-score of 0.47, which is very close to
the R²-score of 0.54 of the non-private version of GBDT. For the Adult
dataset at ε=0.54 we achieve a test error of 17.1%, which is very
close to the test error of 13.7% of the non-private version of GBDT.

Comment: The first two authors contributed equally to this work.