14 research outputs found
Towards Efficient Data Valuation Based on the Shapley Value
"How much is my data worth?" is an increasingly common question posed by
organizations and individuals alike. An answer to this question could allow,
for instance, fairly distributing profits among multiple data contributors and
determining prospective compensation when data breaches happen. In this paper,
we study the problem of data valuation by utilizing the Shapley value, a
popular notion of value which originated in coopoerative game theory. The
Shapley value defines a unique payoff scheme that satisfies many desiderata for
the notion of data value. However, the Shapley value often requires exponential
time to compute. To meet this challenge, we propose a repertoire of efficient
algorithms for approximating the Shapley value. We also demonstrate the value
of each training instance for various benchmark datasets
High-Dimensional Geometric Streaming in Polynomial Space
Many existing algorithms for streaming geometric data analysis have been
plagued by exponential dependencies in the space complexity, which are
undesirable for processing high-dimensional data sets. In particular, once
, there are no known non-trivial streaming algorithms for problems
such as maintaining convex hulls and L\"owner-John ellipsoids of points,
despite a long line of work in streaming computational geometry since [AHV04].
We simultaneously improve these results to bits of
space by trading off with a factor distortion. We
achieve these results in a unified manner, by designing the first streaming
algorithm for maintaining a coreset for subspace embeddings with
space and distortion. Our
algorithm also gives similar guarantees in the \emph{online coreset} model.
Along the way, we sharpen results for online numerical linear algebra by
replacing a log condition number dependence with a dependence,
answering a question of [BDM+20]. Our techniques provide a novel connection
between leverage scores, a fundamental object in numerical linear algebra, and
computational geometry.
For subspace embeddings, we give nearly optimal trade-offs between
space and distortion for one-pass streaming algorithms. For instance, we give a
deterministic coreset using space and
distortion for , whereas previous deterministic algorithms incurred a
factor in the space or the distortion [CDW18].
Our techniques have implications in the offline setting, where we give
optimal trade-offs between the space complexity and distortion of subspace
sketch data structures. To do this, we give an elementary proof of a "change of
density" theorem of [LT80] and make it algorithmic.Comment: Abstract shortened to meet arXiv limits; v2 fix statements concerning
online condition numbe
Streaming Low-Rank Matrix Approximation with an Application to Scientific Simulation
This paper argues that randomized linear sketching is a natural tool for on-the-fly compression of data matrices that arise from large-scale scientific simulations and data collection. The technical contribution consists in a new algorithm for constructing an accurate low-rank approximation of a matrix from streaming data. This method is accompanied by an a priori analysis that allows the user to set algorithm parameters with confidence and an a posteriori error estimator that allows the user to validate the quality of the reconstructed matrix. In comparison to previous techniques, the new method achieves smaller relative approximation errors and is less sensitive to parameter choices. As concrete applications, the paper outlines how the algorithm can be used to compress a Navier--Stokes simulation and a sea surface temperature dataset