Finding Influential Training Samples for Gradient Boosted Decision Trees
We address the problem of finding influential training samples for a
particular case of tree ensemble-based models, e.g., Random Forest (RF) or
Gradient Boosted Decision Trees (GBDT). A natural way of formalizing this
problem is studying how the model's predictions change upon leave-one-out
retraining, leaving out each individual training sample. Recent work has shown
that, for parametric models, this analysis can be conducted in a
computationally efficient way. We propose several ways of extending this
framework to non-parametric GBDT ensembles under the assumption that tree
structures remain fixed. Furthermore, we introduce a general scheme of
obtaining further approximations to our method that balance the trade-off
between performance and computational complexity. We evaluate our approaches on
various experimental setups and use-case scenarios, demonstrating both the
quality of our approach to finding influential training samples relative to
the baselines and its computational efficiency.
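For concreteness, here is a minimal, hedged sketch of the leave-one-out formalization computed by brute-force retraining with scikit-learn's GradientBoostingRegressor; all names are illustrative, and the paper's point is precisely to approximate these scores without paying this O(n) retraining cost.

    # Brute-force leave-one-out influence for a GBDT (illustrative baseline only).
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def loo_influence(X_train, y_train, x_test, y_test):
        """Score each training sample by how its removal changes a test point's loss."""
        base = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
        base_loss = ((base.predict(x_test) - y_test) ** 2).item()
        scores = np.zeros(len(X_train))
        for i in range(len(X_train)):
            mask = np.arange(len(X_train)) != i  # leave sample i out
            model = GradientBoostingRegressor(random_state=0).fit(X_train[mask], y_train[mask])
            loo_loss = ((model.predict(x_test) - y_test) ** 2).item()
            scores[i] = loo_loss - base_loss  # > 0: removing i hurts, so i was helpful
        return scores

Each score is the change in squared error at a single test point (x_test given as a one-row array) after retraining without that sample.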
Towards Efficient Data Valuation Based on the Shapley Value
"How much is my data worth?" is an increasingly common question posed by
organizations and individuals alike. An answer to this question could allow,
for instance, fairly distributing profits among multiple data contributors and
determining prospective compensation when data breaches happen. In this paper,
we study the problem of data valuation by utilizing the Shapley value, a
popular notion of value which originated in cooperative game theory. The
Shapley value defines a unique payoff scheme that satisfies many desiderata for
the notion of data value. However, the Shapley value often requires exponential
time to compute. To meet this challenge, we propose a repertoire of efficient
algorithms for approximating the Shapley value. We also demonstrate the value
of each training instance on various benchmark datasets.
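To make both the valuation scheme and its cost concrete, a standard permutation-sampling Monte Carlo approximation of the data Shapley value might look like the following sketch; the logistic-regression utility and every name here are illustrative assumptions, not the paper's specific algorithms.

    # Monte Carlo (permutation-sampling) approximation of data Shapley values.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def utility(idx, X_tr, y_tr, X_val, y_val):
        """Validation accuracy of a model trained on the subset idx."""
        if len(idx) == 0 or len(np.unique(y_tr[idx])) < 2:
            return 0.0  # cannot fit a classifier on an empty or one-class subset
        clf = LogisticRegression(max_iter=200).fit(X_tr[idx], y_tr[idx])
        return clf.score(X_val, y_val)

    def shapley_mc(X_tr, y_tr, X_val, y_val, n_perms=100, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X_tr)
        phi = np.zeros(n)
        for _ in range(n_perms):
            perm = rng.permutation(n)
            prev_u = 0.0  # utility of the empty set
            for k, i in enumerate(perm):
                u = utility(perm[:k + 1], X_tr, y_tr, X_val, y_val)
                phi[i] += u - prev_u  # marginal contribution of sample i
                prev_u = u
        return phi / n_perms

Averaging marginal contributions over random permutations converges to the exact Shapley value, but each permutation requires n retrainings, which is the cost that efficient approximation algorithms aim to reduce.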
Less Is Better: Unweighted Data Subsampling via Influence Function
In the era of Big Data, training complex models on large-scale data sets is
challenging, making it appealing to reduce data volume by subsampling to save
computational resources. Most previous work on subsampling uses weighted
methods designed to make the performance of the subset model approach that of
the full-set model; consequently, such weighted methods can never yield a
subset model that is better than the full-set model. This raises the question:
can we achieve a better model with less data? In this work, we propose a novel
Unweighted Influence Data Subsampling (UIDS) method and prove that the subset
model acquired through our method can outperform the full-set model. We also
show that influence-based subsampling methods commonly overfit to the test set
used for sampling, which can ultimately cause the subset model to fail on
out-of-sample tests. To mitigate this, we develop a probabilistic sampling
scheme that controls the worst-case risk over all distributions close to the
empirical distribution. Experimental results demonstrate the superiority of
our methods over existing subsampling methods on diverse tasks, such as text
classification, image classification, and click-through prediction.
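As a hedged illustration of the influence machinery underlying such methods (not UIDS itself, which adds the probabilistic scheme described above), the sketch below scores training points for L2-regularized logistic regression via the classical influence-function formula and keeps only points that do not harm validation loss; all names are placeholders.

    # Influence-function scores for L2-regularized logistic regression (y in {0, 1}).
    # score_i ~ effect on validation loss of upweighting training sample i.
    import numpy as np

    def influence_scores(X, y, X_val, y_val, w, lam=1e-2):
        p = 1.0 / (1.0 + np.exp(-X @ w))                # training predictions at fitted w
        H = (X * (p * (1 - p))[:, None]).T @ X / len(X) + lam * np.eye(X.shape[1])
        p_val = 1.0 / (1.0 + np.exp(-X_val @ w))
        g_val = X_val.T @ (p_val - y_val) / len(X_val)  # mean validation-loss gradient
        g_i = X * (p - y)[:, None]                      # per-sample training gradients
        return -g_i @ np.linalg.solve(H, g_val)         # -g_i^T H^{-1} g_val

    # Unweighted subsampling: drop points whose upweighting increases validation loss.
    # mask = influence_scores(X, y, X_val, y_val, w) <= 0
    # X_sub, y_sub = X[mask], y[mask]

Here w is the parameter vector of the already-fitted model; the hard 0/1 keep-or-drop decision, rather than a reweighting, is what makes the subsampling unweighted.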
HYDRA: Hypergradient Data Relevance Analysis for Interpreting Deep Neural Networks
The behavior of deep neural networks (DNNs) is notoriously resistant to
human interpretation. In this paper, we propose Hypergradient Data Relevance
Analysis, or HYDRA, which interprets the predictions made by DNNs as effects of
their training data. Existing approaches generally estimate data contributions
around the final model parameters and ignore how the training data shape the
optimization trajectory. By unrolling the hypergradient of test loss w.r.t. the
weights of training data, HYDRA assesses the contribution of training data
toward test data points throughout the training trajectory. In order to
accelerate computation, we remove the Hessian from the calculation and prove
that, under moderate conditions, the approximation error is bounded.
Corroborating this theoretical claim, empirical results indicate the error is
indeed small. In addition, we quantitatively demonstrate that HYDRA outperforms
influence functions in accurately estimating data contribution and detecting
noisy data labels. The source code is available at
https://github.com/cyyever/aaai_hydra_8686
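As a hedged sketch of a Hessian-free, unrolled contribution score in this spirit, the toy below trains plain-numpy logistic regression with SGD and credits each training example with the first-order reduction in test loss attributable to its gradient steps, using the test gradient at the final weights; the names are illustrative and the exact HYDRA estimator differs in its details.

    # Hessian-free, unrolled data-contribution sketch (toy logistic regression).
    import numpy as np

    def grad(w, x, y):
        """Per-sample logistic-loss gradient for a single example (y in {0, 1})."""
        return (1.0 / (1.0 + np.exp(-x @ w)) - y) * x

    def contributions(X, y, x_test, y_test, lr=0.1, epochs=5, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        trace = []                        # (sample index, gradient at that step)
        for _ in range(epochs):
            for i in rng.permutation(len(X)):
                g = grad(w, X[i], y[i])
                trace.append((i, g))
                w -= lr * g               # SGD step on sample i
        g_test = grad(w, x_test, y_test)  # test-loss gradient at final weights
        scores = np.zeros(len(X))
        for i, g in trace:
            scores[i] += lr * g @ g_test  # > 0: this step tended to reduce test loss
        return scores

Accumulating credit along the whole optimization trajectory, rather than only at the final parameters, is what distinguishes this family of estimators from classical influence functions.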