We address the problem of finding influential training samples for the
particular case of tree ensemble-based models, e.g., Random Forest (RF) or
Gradient Boosted Decision Trees (GBDT). A natural way of formalizing this
problem is to study how the model's predictions change under leave-one-out
retraining, in which each training sample is left out in turn. Recent work has shown
that, for parametric models, this analysis can be conducted in a
computationally efficient way. We propose several ways of extending this
framework to non-parametric GBDT ensembles under the assumption that tree
structures remain fixed. Furthermore, we introduce a general scheme for
obtaining further approximations to our method that trade off
performance against computational complexity. We evaluate our approaches on
various experimental setups and use-case scenarios and demonstrate both the
quality of our approach to finding influential training samples in comparison
to the baselines and its computational efficiency.
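
As a toy illustration of the leave-one-out formulation under a fixed tree structure, the sketch below uses a single decision stump whose split threshold is held constant, so removing a training sample only changes the leaf values (here, the mean target in each leaf). This is not the method proposed in the paper; the function names, the stump model, and the mean-based leaf values are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def leaf_index(x: float, threshold: float) -> int:
    """Route a sample to the left (0) or right (1) leaf of a fixed stump."""
    return 0 if x <= threshold else 1


def fit_leaf_values(
    xs: Sequence[float],
    ys: Sequence[float],
    threshold: float,
    skip: Optional[int] = None,
) -> List[float]:
    """Leaf values are the mean target per leaf; `skip` leaves one sample out.

    The split threshold is never re-learned, which mirrors the
    fixed-structure assumption in a very simplified form.
    """
    sums = [0.0, 0.0]
    counts = [0, 0]
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        leaf = leaf_index(x, threshold)
        sums[leaf] += y
        counts[leaf] += 1
    # An empty leaf falls back to 0.0 (an arbitrary choice for this sketch).
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def loo_influence(
    xs: Sequence[float],
    ys: Sequence[float],
    threshold: float,
    x_test: float,
) -> List[float]:
    """Change in the prediction at x_test when each training sample is removed."""
    leaf = leaf_index(x_test, threshold)
    base = fit_leaf_values(xs, ys, threshold)[leaf]
    return [
        fit_leaf_values(xs, ys, threshold, skip=i)[leaf] - base
        for i in range(len(xs))
    ]


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 10.0, 11.0]
    ys = [1.0, 2.0, 3.0, 10.0, 20.0]
    print(loo_influence(xs, ys, threshold=5.0, x_test=10.5))
```

In this toy setting only samples that share a leaf with the test point have non-zero influence, which is why holding the structure fixed makes the leave-one-out effect cheap to compute compared with full retraining.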