Experimental Evaluation of Fully Dynamic k-Means via Coresets
For a set of points in $\mathbb{R}^d$, the Euclidean k-means problem
consists of finding k centers such that the sum of squared distances from
each data point to its closest center is minimized. Coresets are one of the main
tools developed recently to solve this problem in a big data context. They
make it possible to compress the initial dataset while preserving its structure: running
any algorithm on the coreset provides a guarantee almost equivalent to running
it on the full data.
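
As a rough illustration of the two notions above, the following Python sketch computes the k-means cost (sum of squared distances to the closest center) on a full dataset and on a small weighted sample. The helper names (`kmeans_cost`, `uniform_sample_coreset`) and the toy data are assumptions made for this example. Uniform sampling is used only as a naive stand-in: it does not carry the worst-case guarantee of a true coreset construction, and it is not the algorithm studied in the paper.

```python
import random

def sq_dist(p, q):
    # Squared Euclidean distance between two points given as tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_cost(points, centers, weights=None):
    # Weighted sum of squared distances from each point to its closest center.
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def uniform_sample_coreset(points, m, rng):
    # Naive stand-in for a coreset (assumption): m uniform samples,
    # each weighted n/m so that the total weight equals n.
    n = len(points)
    sample = [points[rng.randrange(n)] for _ in range(m)]
    return sample, [n / m] * m

rng = random.Random(0)
# Toy data: two Gaussian blobs of 500 points each.
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in [(0, 0), (10, 10)] for _ in range(500)]
centers = [(0.0, 0.0), (10.0, 10.0)]

coreset, weights = uniform_sample_coreset(points, 100, rng)
full = kmeans_cost(points, centers)
approx = kmeans_cost(coreset, centers, weights)
print(f"full cost: {full:.1f}, weighted sample cost: {approx:.1f}")
```

On well-separated toy data like this, the weighted cost on 100 points tends to be close to the cost on all 1000 points. A proper coreset would provide such closeness for every choice of centers, typically by sampling points according to their importance rather than uniformly.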
In this work, we study coresets in a fully-dynamic setting: points are added
and deleted with the goal to efficiently maintain a coreset with which a
k-means solution can be computed. Based on an algorithm from Henzinger and Kale
[ESA'20], we present an efficient and practical implementation of a fully
dynamic coreset algorithm that improves the running time by up to a factor of
20 compared to our non-optimized implementation of the algorithm by Henzinger
and Kale, without sacrificing more than 7% on the quality of the k-means
solution.
Comment: Accepted at ALENEX 2
Initializing Services in Interactive ML Systems for Diverse Users
This paper studies ML systems that interactively learn from users across
multiple subpopulations with heterogeneous data distributions. The primary
objective is to provide specialized services for different user groups while
also predicting user preferences. Once the users select a service based on how
well the service anticipated their preference, the services subsequently adapt
and refine themselves based on the user data they accumulate, resulting in an
iterative, alternating minimization process between users and services
(learning dynamics). Employing such tailored approaches has two main
challenges: (i) Unknown user preferences: Typically, data on user preferences
are unavailable without interaction, and uniform data collection across a large
and diverse user base can be prohibitively expensive. (ii) Suboptimal Local
Solutions: The total loss (sum of loss functions across all users and all
services) landscape is not convex even if the individual losses on a single
service are convex, making it likely for the learning dynamics to get stuck in
local minima. The final outcome of the aforementioned learning dynamics is thus
strongly influenced by the initial set of services offered to users, and is not
guaranteed to be close to the globally optimal outcome. In this work, we
propose a randomized algorithm to adaptively select very few users to collect
preference data from, while simultaneously initializing a set of services. We
prove that under mild assumptions on the loss functions, the expected total
loss achieved by the algorithm right after initialization is within a factor of
the globally optimal total loss with complete user preference data, and this
factor scales only logarithmically in the number of services. Our theory is
complemented by experiments on real as well as semi-synthetic datasets.
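
The alternating process described in this abstract can be sketched in a few lines of Python under strong simplifying assumptions: each user's preference is a point, each service is a point, and the loss is the squared distance between them. With that choice the dynamics reduce to Lloyd-style k-means iterations. The seeding step below uses k-means++-style sampling (probability proportional to current loss) purely as a stand-in; it is not claimed to be the paper's algorithm, and unlike the paper's setting this toy code can see every user's preference. All function names are hypothetical.

```python
import random

def sq_loss(pref, service):
    # Assumed loss: squared distance between a user's preference and a service.
    return sum((a - b) ** 2 for a, b in zip(pref, service))

def total_loss(prefs, services):
    # Each user is charged the loss of the best available service.
    return sum(min(sq_loss(p, s) for s in services) for p in prefs)

def seed_services(prefs, k, rng):
    # k-means++-style seeding (stand-in): each new service copies a user
    # sampled with probability proportional to that user's current loss.
    services = [prefs[rng.randrange(len(prefs))]]
    while len(services) < k:
        losses = [min(sq_loss(p, s) for s in services) for p in prefs]
        if sum(losses) == 0:  # every user is already perfectly served
            services.append(prefs[rng.randrange(len(prefs))])
        else:
            services.append(rng.choices(prefs, weights=losses, k=1)[0])
    return services

def dynamics_step(prefs, services):
    # One round: users pick their best service, then each service moves
    # to the mean preference of the users who picked it.
    groups = [[] for _ in services]
    for p in prefs:
        best = min(range(len(services)), key=lambda j: sq_loss(p, services[j]))
        groups[best].append(p)
    updated = []
    for s, g in zip(services, groups):
        if g:
            updated.append(tuple(sum(x) / len(g) for x in zip(*g)))
        else:
            updated.append(s)  # a service nobody chose stays where it is
    return updated

rng = random.Random(1)
# Toy population: three subpopulations with different preference centers.
prefs = [(rng.gauss(c, 0.5), rng.gauss(c, 0.5))
         for c in (0, 5, 10) for _ in range(50)]
services = seed_services(prefs, 3, rng)
history = [total_loss(prefs, services)]
for _ in range(10):
    services = dynamics_step(prefs, services)
    history.append(total_loss(prefs, services))
print([round(h, 2) for h in history])
```

In this simplified model the total loss never increases from one round to the next, but the value it settles at depends on the initial services, which is why the initialization step matters.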