
    Streaming Coreset Constructions for M-Estimators

    We introduce a new method of maintaining a (k, epsilon)-coreset for clustering M-estimators over insertion-only streams. Let (P, w) be a weighted set of points (where w : P -> [0, infty) is the weight function) in a rho-metric space, meaning a set X equipped with a positive-semidefinite symmetric function D such that D(x, z) <= rho (D(x, y) + D(y, z)) for all x, y, z in X. For any set of points C, we define COST(P, w, C) = sum_{p in P} w(p) min_{c in C} D(p, c). A (k, epsilon)-coreset for (P, w) is a weighted set (Q, v) such that for every set C of k points, (1 - epsilon) COST(P, w, C) <= COST(Q, v, C) <= (1 + epsilon) COST(P, w, C). Essentially, the coreset (Q, v) can be used in place of (P, w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning on streaming and distributed data. M-estimators are functions D(x, y) that can be written as psi(d(x, y)), where (X, d) is a true metric (i.e., 1-metric) space. Special cases of M-estimators include the well-known k-median (psi(x) = x) and k-means (psi(x) = x^2) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it into the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses O(epsilon^{-2} k log k log n) points of storage, whereas the previous state of the art required storing at least O(epsilon^{-2} k log k log^4 n) points.
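
    To make these definitions concrete, the following small Python sketch (ours, not the paper's construction) implements the COST function and an empirical spot-check of the (k, epsilon)-coreset guarantee, using the k-means M-estimator psi(x) = x^2 as the example D; all names and parameters are illustrative.

        import math
        import random

        def cost(points, weights, centers, D):
            """COST(P, w, C) = sum_{p in P} w(p) * min_{c in C} D(p, c)."""
            return sum(w * min(D(p, c) for c in centers)
                       for p, w in zip(points, weights))

        def D_kmeans(p, q):
            # k-means M-estimator: psi(x) = x^2 applied to the Euclidean metric,
            # a rho-metric with rho = 2.
            return math.dist(p, q) ** 2

        def looks_like_coreset(P, w, Q, v, k, eps, D, trials=1000):
            """Empirically test the (k, eps)-coreset condition on random center
            sets C; a true coreset must satisfy it for *every* C, so sampling
            only provides evidence, not a proof."""
            for _ in range(trials):
                C = random.sample(P, k)
                full, small = cost(P, w, C, D), cost(Q, v, C, D)
                if not (1 - eps) * full <= small <= (1 + eps) * full:
                    return False
            return True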

    Improved Algorithms for Time Decay Streams

    In the time-decay model for data streams, elements of an underlying data set arrive sequentially, with more recently arrived elements being more important. A common approach for handling large data sets is to maintain a coreset, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline coreset construction and gives a time-decay coreset for polynomial time-decay functions. We also consider the exponential time-decay model for k-median clustering, where we provide a constant-factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm stores O(k log(h Delta) + h) points, where h is the half-life of the decay function and Delta is the aspect ratio of the dataset. Our techniques extend to k-means clustering and M-estimators as well.
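
    As a concrete reading of the exponential decay model with half-life h, the short Python snippet below (our own illustration, not the paper's algorithm) evaluates a time-decayed k-median cost in which a point that arrived a steps ago contributes with weight 2^{-a/h}; the data and centers are placeholders.

        import math

        def decayed_weight(age, half_life):
            # Exponential time decay: an element that is `age` steps old
            # has weight 2^(-age / half_life).
            return 2.0 ** (-age / half_life)

        def decayed_kmedian_cost(stream, centers, half_life):
            """Time-decayed k-median cost at the current time: each point's
            distance to its nearest center is scaled by how recently it arrived."""
            n = len(stream)
            return sum(decayed_weight(n - 1 - i, half_life) *
                       min(math.dist(p, c) for c in centers)
                       for i, p in enumerate(stream))

        # Illustrative usage: old points near (0, 0) matter less than recent ones.
        stream = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.1, 4.9)]
        print(decayed_kmedian_cost(stream, centers=[(5.0, 5.0)], half_life=2.0))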

    New Frameworks for Offline and Streaming Coreset Constructions

    A coreset for a set of points is a small subset of weighted points that approximately preserves important properties of the original set. Specifically, if $P$ is a set of points, $Q$ is a set of queries, and $f : P \times Q \to \mathbb{R}$ is a cost function, then a set $S \subseteq P$ with weights $w : P \to [0,\infty)$ is an $\epsilon$-coreset for some parameter $\epsilon > 0$ if $\sum_{s \in S} w(s) f(s,q)$ is a $(1+\epsilon)$ multiplicative approximation to $\sum_{p \in P} f(p,q)$ for all $q \in Q$. Coresets are used to solve fundamental problems in machine learning under various big data models of computation. Many of the coresets suggested in the recent decade used, or could have used, a general framework for constructing coresets whose size depends quadratically on what is known as the total sensitivity $t$. In this paper we improve this bound from $O(t^2)$ to $O(t \log t)$. Thus our results imply more space-efficient solutions to a number of problems, including projective clustering, $k$-line clustering, and subspace approximation. Moreover, we generalize the notion of sensitivity sampling to sup-sampling, which supports non-multiplicative approximations, negative cost functions, and more. The main technical result is a generic reduction to the sample complexity of learning a class of functions with bounded VC dimension. We show that obtaining a $(\nu,\alpha)$-sample for this class of functions, with appropriate parameters $\nu$ and $\alpha$, suffices to achieve space-efficient $\epsilon$-coresets. Our result implies more efficient coreset constructions for a number of interesting problems in machine learning; we show applications to $k$-median/$k$-means, $k$-line clustering, $j$-subspace approximation, and the integer $(j,k)$-projective clustering problem.
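
    For context, the sensitivity-sampling framework the abstract refers to can be sketched as below (a schematic Python illustration, not the paper's construction); the sensitivity upper bounds and the sample size m are taken as inputs, whereas the paper's contribution is precisely in how large m must be, improving the dependence on the total sensitivity t from O(t^2) to O(t log t).

        import random

        def sensitivity_sample(points, sensitivities, m):
            """Importance-sample m points with probability proportional to an
            upper bound s(p) on each point's sensitivity, reweighting so the
            weighted sample is an unbiased estimator of sum_p f(p, q)."""
            t = sum(sensitivities)                    # total sensitivity (assumed > 0)
            probs = [s / t for s in sensitivities]
            idx = random.choices(range(len(points)), weights=probs, k=m)
            coreset = [points[i] for i in idx]
            weights = [1.0 / (m * probs[i]) for i in idx]   # inverse-probability weights
            return coreset, weights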

    The Degradation of Work in the Twentieth Century

    TEXTOS. Article published in Monthly Review, Volume 34, No. 1 (May 1982). Note from Monthly Review: “This article is the corrected version of a lecture that Harry Braverman delivered in the spring of 1975 at the West Virginia Institute of Technology (USA). As far as we know, this was his last recorded public appearance.” The article was translated by the English Translation Residency team of the Instituto de Enseñanza Superior en Lenguas Vivas "Juan Ramón Fernández" (Buenos Aires) for Taller, Revista de Sociedad, Cultura y Política. We thank our colleagues and friends on its Editorial Committee for allowing us to publish it.

    Clustering on Sliding Windows in Polylogarithmic Space

    In PODS 2003, Babcock, Datar, Motwani and O'Callaghan gave the first streaming solution for the k-median problem on sliding windows, using O((k/tau^4) W^{2 tau} log^2 W) space with a 2^{O(1/tau)} approximation factor, where W is the window size and tau in (0, 1/2) is a user-specified parameter. They left as an open question whether it is possible to improve this to polylogarithmic space. Despite much progress on clustering and sliding windows, this question has remained open for more than a decade. In this paper, we partially answer the main open question posed by Babcock, Datar, Motwani and O'Callaghan. We present an algorithm yielding an exponential improvement in space compared to the previous result of Babcock et al. In particular, we give the first polylogarithmic-space (alpha, beta)-approximation for metric k-median clustering in the sliding window model, where alpha and beta are constants, under the assumption, also made by Babcock et al., that the optimal k-median cost on any given window is bounded by a polynomial in the window size. We justify this assumption by showing that when the cost is exponential in the window size, no sublinear-space approximation is possible. Our main technical contribution is a simple but elegant extension of smooth functions as introduced by Braverman and Ostrovsky, which allows us to apply well-known techniques for solving problems in the sliding window model to functions that are not smooth, such as the k-median cost.
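
    For background, the smooth-function property of Braverman and Ostrovsky that this extension builds on can be stated roughly as follows (our paraphrase; their definition also requires the function f to be non-negative, monotone, and polynomially bounded). For buckets of the stream where B is a suffix of A and C is any block of later arrivals,

        f(B) >= (1 - beta) f(A)   implies   f(B ∪ C) >= (1 - alpha) f(A ∪ C),

    which is what lets a sliding-window algorithm discard an old bucket once a newer one approximates it and still answer queries later. The k-median cost does not satisfy this property, which is why the paper's extension is needed.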

    Approximate Convex Hull of Data Streams

    Given a finite set of points P subseteq R^d, we would like to find a small subset S subseteq P such that the convex hull of S approximately contains P. More formally, every point in P is within distance epsilon of the convex hull of S. Such a subset S is called an epsilon-hull. Computing an epsilon-hull is an important problem in computational geometry, machine learning, and approximation algorithms. In many applications, the set P is too large to fit in memory. We consider the streaming model, where the algorithm receives the points of P sequentially and strives to use a minimal amount of memory. Existing streaming algorithms for computing an epsilon-hull require O(epsilon^{(1-d)/2}) space, which is optimal for a worst-case input. However, this ignores the structure of the data. The minimal size of an epsilon-hull of P, which we denote by OPT, can be much smaller. A natural question is whether a streaming algorithm can compute an epsilon-hull using only O(OPT) space. We begin with lower bounds showing, under a reasonable streaming model, that it is not possible to have a single-pass streaming algorithm that computes an epsilon-hull with O(OPT) space. We instead propose three relaxations of the problem for which we can compute epsilon-hulls using space near-linear in the optimal size. Our first algorithm, for points in R^2 arriving in random order, uses O(log n * OPT) space. Our second algorithm, for points in R^2, makes O(log(epsilon^{-1})) passes before outputting the epsilon-hull and requires O(OPT) space. Our third algorithm, for points in R^d for any fixed dimension d, outputs, with high probability, an epsilon-hull for all but a delta-fraction of directions and requires O(OPT * log OPT) space.
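
    To make the epsilon-hull notion concrete, here is a small self-contained Python sketch (ours, offline rather than streaming, and only for the planar case) that checks whether a candidate subset S is an epsilon-hull of P; all function names are our own.

        import math

        def _cross(o, a, b):
            return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

        def convex_hull(points):
            """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
            pts = sorted(set(points))
            if len(pts) <= 2:
                return pts
            def build(seq):
                chain = []
                for p in seq:
                    while len(chain) >= 2 and _cross(chain[-2], chain[-1], p) <= 0:
                        chain.pop()
                    chain.append(p)
                return chain
            lower, upper = build(pts), build(list(reversed(pts)))
            return lower[:-1] + upper[:-1]

        def _dist_to_segment(p, a, b):
            (ax, ay), (bx, by), (px, py) = a, b, p
            dx, dy = bx - ax, by - ay
            if dx == dy == 0:
                return math.dist(p, a)
            t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
            return math.dist(p, (ax + t * dx, ay + t * dy))

        def dist_to_hull(p, hull):
            """Distance from p to the convex hull of S; 0 if p lies inside."""
            if len(hull) == 1:
                return math.dist(p, hull[0])
            edges = [(hull[i], hull[(i + 1) % len(hull)]) for i in range(len(hull))]
            if len(hull) >= 3 and all(_cross(a, b, p) >= 0 for a, b in edges):
                return 0.0
            return min(_dist_to_segment(p, a, b) for a, b in edges)

        def is_eps_hull(S, P, eps):
            """Every point of P must be within distance eps of the convex hull of S."""
            hull = convex_hull(S)
            return all(dist_to_hull(p, hull) <= eps for p in P)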