242 research outputs found

New Frameworks for Offline and Streaming Coreset Constructions

Author: Braverman Vladimir
Feldman Dan
Lang Harry
Statman Adiel
Zhou Samson
Publication date: 18/09/2022

A coreset for a set of points is a small subset of weighted points that approximately preserves important properties of the original set. Specifically, if $P$ is a set of points, $Q$ is a set of queries, and $f:P\times Q\to\mathbb{R}$ is a cost function, then a set $S\subseteq P$ with weights $w:P\to[0,\infty)$ is an $\epsilon$-coreset for some parameter $\epsilon>0$ if $\sum_{s\in S}w(s)f(s,q)$ is a $(1+\epsilon)$ multiplicative approximation to $\sum_{p\in P}f(p,q)$ for all $q\in Q$. Coresets are used to solve fundamental problems in machine learning under various big data models of computation. Many of the coresets suggested in the recent decade used, or could have used, a general framework for constructing coresets whose size depends quadratically on what is known as the total sensitivity $t$. In this paper we improve this bound from $O(t^2)$ to $O(t\log t)$. Thus our results imply more space-efficient solutions to a number of problems, including projective clustering, $k$-line clustering, and subspace approximation. Moreover, we generalize the notion of sensitivity sampling for sup-sampling that supports non-multiplicative approximations, negative cost functions and more. The main technical result is a generic reduction to the sample complexity of learning a class of functions with bounded VC dimension. We show that obtaining a $(\nu,\alpha)$-sample for this class of functions with appropriate parameters $\nu$ and $\alpha$ suffices to achieve space-efficient $\epsilon$-coresets. Our result implies more efficient coreset constructions for a number of interesting problems in machine learning; we show applications to $k$-median/$k$-means, $k$-line clustering, $j$-subspace approximation, and the integer $(j,k)$-projective clustering problem.
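The coreset definition in this abstract can be checked numerically on a finite query set. The following is a minimal sketch, not the paper's construction: it assumes a nonnegative cost function, reads "(1+\epsilon) multiplicative approximation" as a two-sided (1 ± \epsilon) bound, and uses hypothetical names (`is_eps_coreset`, and a weight list `w` aligned with `S`).

```python
def is_eps_coreset(P, S, w, Q, f, eps):
    """Return True if the weighted set (S, w) approximates P on every query in Q.

    Assumptions (not from the abstract): f is nonnegative, Q is a finite
    list of queries, and w[i] is the weight of S[i].
    """
    for q in Q:
        full = sum(f(p, q) for p in P)                      # sum over P of f(p, q)
        approx = sum(ws * f(s, q) for s, ws in zip(S, w))   # sum over S of w(s) f(s, q)
        if not ((1 - eps) * full <= approx <= (1 + eps) * full):
            return False
    return True


# Toy 1-D example: duplicated points collapse into weighted representatives.
P = [0.0, 0.0, 1.0, 1.0]
Q = [-1.0, 0.5, 3.0]
f = lambda p, q: abs(p - q)

good = is_eps_coreset(P, [0.0, 1.0], [2.0, 2.0], Q, f, eps=0.1)
bad = is_eps_coreset(P, [0.0], [4.0], Q, f, eps=0.1)
```

Here `good` holds because each representative carries the weight of its two duplicates, while `bad` fails since a single point cannot reproduce the cost for queries near 0.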

arXiv.org e-Print Archive

Streaming Coreset Constructions for M-Estimators

Author: Braverman Vladimir
Feldman Dan
Lang Harry
Rus Daniela
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)
Publication date: 01/01/2019

We introduce a new method of maintaining a (k,epsilon)-coreset for clustering M-estimators over insertion-only streams. Let (P,w) be a weighted set (where w : P -> [0,infty) is the weight function) of points in a rho-metric space (meaning a set X equipped with a positive-semidefinite symmetric function D such that D(x,z) <= rho(D(x,y) + D(y,z)) for all x,y,z in X). For any set of points C, we define COST(P,w,C) = sum_{p in P} w(p) min_{c in C} D(p,c). A (k,epsilon)-coreset for (P,w) is a weighted set (Q,v) such that for every set C of k points, (1-epsilon)COST(P,w,C) <= COST(Q,v,C) <= (1+epsilon)COST(P,w,C). Essentially, the coreset (Q,v) can be used in place of (P,w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning of streaming and distributed data. M-estimators are functions D(x,y) that can be written as psi(d(x,y)) where (X, d) is a true metric (i.e. 1-metric) space. Special cases of M-estimators include the well-known k-median (psi(x) = x) and k-means (psi(x) = x^2) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it into the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses O(epsilon^{-2} k log k log n) points of storage. The previous state-of-the-art required storing at least O(epsilon^{-2} k log k log^{4} n) points.
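The COST function and the two special cases named in this abstract can be written out directly. This is an illustrative sketch only: it assumes points on the real line with d(x,y) = |x - y|, and the names `cost`, `m_estimator`, `k_median_D`, and `k_means_D` are hypothetical.

```python
def m_estimator(psi):
    """Build D(x, y) = psi(d(x, y)), assuming d(x, y) = |x - y| on the real line."""
    return lambda x, y: psi(abs(x - y))


def cost(P, w, C, D):
    """COST(P, w, C) = sum over p in P of w(p) * min over c in C of D(p, c)."""
    return sum(wp * min(D(p, c) for c in C) for p, wp in zip(P, w))


k_median_D = m_estimator(lambda x: x)        # psi(x) = x
k_means_D = m_estimator(lambda x: x ** 2)    # psi(x) = x^2

# Toy weighted set and a candidate set C of k = 2 centers.
P = [0.0, 2.0, 5.0]
w = [1.0, 1.0, 2.0]
C = [0.0, 4.0]

median_cost = cost(P, w, C, k_median_D)  # 1*0 + 1*2 + 2*1
means_cost = cost(P, w, C, k_means_D)    # 1*0 + 1*4 + 2*1
```

A weighted set (Q,v) would then qualify as a (k,epsilon)-coreset if `cost(Q, v, C, D)` stays within a factor (1 ± epsilon) of `cost(P, w, C, D)` for every choice of k centers C.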

Dagstuhl Research Online Publication Server

Effective and efficient algorithm for multiobjective optimization of hydrologic models

Author: Bastidas LA
Bouten W
Gupta HV
Sorooshian S
Vrugt JA
Publication venue: eScholarship, University of California
Publication date: 01/01/2003

Practical experience with the calibration of hydrologic models suggests that any single-objective function, no matter how carefully chosen, is often inadequate to properly measure all of the characteristics of the observed data deemed to be important. One strategy to circumvent this problem is to define several optimization criteria (objective functions) that measure different (complementary) aspects of the system behavior and to use multicriteria optimization to identify the set of nondominated, efficient, or Pareto optimal solutions. In this paper, we present an efficient and effective Markov Chain Monte Carlo sampler, entitled the Multiobjective Shuffled Complex Evolution Metropolis (MOSCEM) algorithm, which is capable of solving the multiobjective optimization problem for hydrologic models. MOSCEM is an improvement over the Shuffled Complex Evolution Metropolis (SCEM-UA) global optimization algorithm, using the concept of Pareto dominance (rather than direct single-objective function evaluation) to evolve the initial population of points toward a set of solutions stemming from a stable distribution (Pareto set). The efficacy of the MOSCEM-UA algorithm is compared with the original MOCOM-UA algorithm for three hydrologic modeling case studies of increasing complexity
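The notion of Pareto dominance that this abstract relies on can be illustrated with a small filter. This is a generic sketch of nondominated filtering, not the MOSCEM sampler itself; it assumes every objective is minimized, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and better somewhere.

    Assumes all objectives are to be minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (the nondominated set)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Four candidate parameter sets scored on two error criteria.
scores = [(1, 5), (2, 2), (3, 3), (5, 1)]
front = pareto_front(scores)
```

In this toy data, (3, 3) drops out because (2, 2) is better on both criteria, while the other three represent different trade-offs.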

eScholarship - University of California

International Migration, Integration and Social Cohesion online publications

Spatially Lagged Choropleth Display

Author: Alan T. Murray

Choropleth display of spatial information is a fundamental feature of mapping and geographic information system technologies. There has long been a desire to impart some spatial influence in the class selection and delineation process of choropleth display. This paper presents an approach for representing the spatial influence of neighboring areas in the creation of choropleth classes. The usefulness of this approach is explored using suburb level crime statistics for Brisbane, Australia.

Research Papers in Economics

Parallel Algorithms for Multicriteria Shortest Path Problems

Author: Sonnier David L.
Publication venue: ScholarWorks@UARK
Publication date: 01/01/2006

This paper presents two strategies for solving multicriteria shortest path problems with more than two criteria. Given an undirected graph with n vertices, m edges, and a set of K weights associated with each edge, we define a path as a sequence of edges from vertex s to vertex t. We want to find the Pareto-optimal set of paths from s to t. The solutions proposed herein are based on cluster computing using the Message-Passing Interface (MPI) extensions to the C programming language. We solve problems with 3 and 4 criteria, using up to 8 processors in parallel and using solutions based on two strategies. The first strategy obtains an approximation of the Pareto-optimal set by solving for supported solutions in bi-criteria sub-problems using a weighted-sum approach, then merging the solutions. The second strategy applies the weighted-sum algorithm directly to the tri-criteria and quad-criteria problems to find the Pareto-optimal set of supported solutions, with each processor using a range of weights.
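The weighted-sum idea mentioned in this abstract can be sketched for a single weight vector: collapse the K edge weights into one scalar and run an ordinary shortest-path search. This is a serial Python illustration, not the paper's MPI/C implementation; it assumes nonnegative edge weights and nonnegative multipliers, and `weighted_sum_path` and `lam` are hypothetical names.

```python
import heapq


def weighted_sum_path(n, edges, lam, s, t):
    """Dijkstra on scalarized weights sum_k lam[k] * w_k; returns a vertex list or None.

    edges: list of (u, v, weights) for an undirected graph on vertices 0..n-1.
    Assumes all weights and all entries of lam are nonnegative.
    """
    adj = [[] for _ in range(n)]
    for u, v, ws in edges:
        scalar = sum(l * x for l, x in zip(lam, ws))
        adj[u].append((v, scalar))
        adj[v].append((u, scalar))

    dist = [float("inf")] * n
    prev = [None] * n
    dist[s] = 0.0
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, c in adj[u]:
            if d + c < dist[v]:
                dist[v] = d + c
                prev[v] = u
                heapq.heappush(heap, (dist[v], v))

    if dist[t] == float("inf"):
        return None
    path = [t]
    while path[-1] != s:
        path.append(prev[path[-1]])
    return path[::-1]


# Tri-criteria toy graph: the detour is cheap on criterion 1, the direct edge on criterion 2.
edges = [(0, 1, (1, 4, 1)), (1, 2, (1, 4, 1)), (0, 2, (5, 1, 1))]
path_a = weighted_sum_path(3, edges, (1, 0, 0), 0, 2)
path_b = weighted_sum_path(3, edges, (0, 1, 0), 0, 2)
```

Sweeping `lam` over a range of weight vectors, as the second strategy assigns to each processor, yields only supported solutions; Pareto-optimal paths that no weight vector selects are not found this way.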

ScholarWorks@UARK

Genome-wide analysis of the emigrant family of MITEs: amplification dynamics and evolution of genes in Arabidopsis thaliana

Author: Casacuberta Josep Maria
Goñi Ramon
Herráiz Cristina
Messeguer Peypoch Xavier
Santiago Néstor
Publication date: 01/01/2002

MITEs are structurally similar to defective class II elements but their high copy number and the size and sequence conservation of most MITE families suggest that they can be amplified by a replicative mechanism. Here we present a genome-wide analysis of the Emigrant family of MITEs from Arabidopsis thaliana. In order to be able to detect divergent ancient copies and low copy number subfamilies with a different internal sequence we have developed a computer program (http://www.lsi.upc.es/~alggen) that allows looking for Emigrant elements based solely on their TIR sequence. Our results show that different bursts of amplification of one or very few active, or master, elements have occurred at different times during Arabidopsis evolution, with insertion dynamics similar to those of some SINEs. The analysis of the insertion sites of the Emigrant elements shows that, although Emigrant elements tend to integrate far from ORFs, the elements inserted within or close to genes are preferentially maintained during evolution.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

An interior point algorithm for minimum sum-of-squares clustering

Author: Du Merle O
Hansen P
Jaumard B
Mladenović N
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 01/01/1999

Copyright © 2000 SIAM Publications. An exact algorithm is proposed for minimum sum-of-squares nonhierarchical clustering, i.e., for partitioning a given set of points from a Euclidean m-space into a given number of clusters in order to minimize the sum of squared distances from all points to the centroid of the cluster to which they belong. This problem is expressed as a constrained hyperbolic program in 0-1 variables. The resolution method combines an interior point algorithm, i.e., a weighted analytic center column generation method, with branch-and-bound. The auxiliary problem of determining the entering column (i.e., the oracle) is an unconstrained hyperbolic program in 0-1 variables with a quadratic numerator and linear denominator. It is solved through a sequence of unconstrained quadratic programs in 0-1 variables. To accelerate resolution, variable neighborhood search heuristics are used both to get a good initial solution and to solve quickly the auxiliary problem as long as global optimality is not reached. Estimated bounds for the dual variables are deduced from the heuristic solution and used in the resolution process as a trust region. Proved minimum sum-of-squares partitions are determined for the first time for several fairly large data sets from the literature, including Fisher's 150 iris. This research was supported by the Fonds National de la Recherche Scientifique Suisse, NSERC-Canada, and FCAR-Quebec.
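The objective being minimized in this abstract, the sum of squared distances from each point to the centroid of its cluster, can be evaluated for any given partition. The sketch below only scores a partition; it does not implement the interior point or branch-and-bound method, and the name `sum_of_squares` and the label encoding are assumptions.

```python
def sum_of_squares(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid.

    points: list of equal-length coordinate tuples; labels[i] is the cluster of points[i].
    """
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[i] for p in members) / len(members) for i in range(dim)]
        total += sum((p[i] - centroid[i]) ** 2 for p in members for i in range(dim))
    return total


# Two well-separated pairs: grouping neighbors is far cheaper than mixing them.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
tight = sum_of_squares(points, [0, 0, 1, 1])
mixed = sum_of_squares(points, [0, 1, 0, 1])
```

An exact method such as the one described must certify that no other partition into the given number of clusters scores lower than the one it returns.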

CiteSeerX

PolyPublie

Brunel University Research Archive