2,247 research outputs found
A Note on "Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms"
Data valuation is a growing research field that studies the influence of
individual data points for machine learning (ML) models. Data Shapley, inspired
by cooperative game theory and economics, is an effective method for data
valuation. However, it is well-known that the Shapley value (SV) can be
computationally expensive. Fortunately, Jia et al. (2019) showed that for
K-Nearest Neighbors (KNN) models, the computation of Data Shapley is
surprisingly simple and efficient.
In this note, we revisit the work of Jia et al. (2019) and propose a more
natural and interpretable utility function that better reflects the performance
of KNN models. We derive the corresponding calculation procedure for the Data
Shapley of KNN classifiers/regressors with the new utility functions. Our new
approach, dubbed soft-label KNN-SV, achieves the same time complexity as the
original method. We further provide an efficient approximation algorithm for
soft-label KNN-SV based on locality sensitive hashing (LSH). Our experimental
results demonstrate that Soft-label KNN-SV outperforms the original method on
most datasets in the task of mislabeled data detection, making it a better
baseline for future work on data valuation
Towards a Scalable Dynamic Spatial Database System
With the rise of GPS-enabled smartphones and other similar mobile devices,
massive amounts of location data are available. However, no scalable solutions
for soft real-time spatial queries on large sets of moving objects have yet
emerged. In this paper we explore and measure the limits of actual algorithms
and implementations regarding different application scenarios. And finally we
propose a novel distributed architecture to solve the scalability issues.Comment: (2012
Threshold KNN-Shapley: A Linear-Time and Privacy-Friendly Approach to Data Valuation
Data valuation aims to quantify the usefulness of individual data sources in
training machine learning (ML) models, and is a critical aspect of data-centric
ML research. However, data valuation faces significant yet frequently
overlooked privacy challenges despite its importance. This paper studies these
challenges with a focus on KNN-Shapley, one of the most practical data
valuation methods nowadays. We first emphasize the inherent privacy risks of
KNN-Shapley, and demonstrate the significant technical difficulties in adapting
KNN-Shapley to accommodate differential privacy (DP). To overcome these
challenges, we introduce TKNN-Shapley, a refined variant of KNN-Shapley that is
privacy-friendly, allowing for straightforward modifications to incorporate DP
guarantee (DP-TKNN-Shapley). We show that DP-TKNN-Shapley has several
advantages and offers a superior privacy-utility tradeoff compared to naively
privatized KNN-Shapley in discerning data quality. Moreover, even non-private
TKNN-Shapley achieves comparable performance as KNN-Shapley. Overall, our
findings suggest that TKNN-Shapley is a promising alternative to KNN-Shapley,
particularly for real-world applications involving sensitive data.Comment: NeurIPS 2023 Spotligh
- …