New Frameworks for Offline and Streaming Coreset Constructions
A coreset for a set of points is a small subset of weighted points that
approximately preserves important properties of the original set. Specifically,
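if $P$ is a set of points, $Q$ is a set of queries, and $f: P \times Q \to \mathbb{R}$ is a cost function, then a set $S \subseteq P$ with weights
$w: S \to [0, \infty)$ is an $\epsilon$-coreset for some parameter $\epsilon > 0$ if
$\sum_{s \in S} w(s) f(s, q)$ is a $(1 + \epsilon)$ multiplicative approximation to
$\sum_{p \in P} f(p, q)$ for all $q \in Q$. Coresets are used to solve fundamental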
problems in machine learning under various big data models of computation. Many
of the suggested coresets in the recent decade used, or could have used a
general framework for constructing coresets whose size depends quadratically on
what is known as total sensitivity $t$.
In this paper we improve this bound from $O(t^2)$ to $O(t \log t)$. Thus our
results imply more space efficient solutions to a number of problems, including
projective clustering, $k$-line clustering, and subspace approximation.
Moreover, we generalize the notion of sensitivity sampling for sup-sampling
that supports non-multiplicative approximations, negative cost functions and
more. The main technical result is a generic reduction to the sample complexity
of learning a class of functions with bounded VC dimension. We show that
obtaining a $(\nu, \alpha)$-sample for this class of functions with appropriate
parameters $\nu$ and $\alpha$ suffices to achieve space efficient
$\epsilon$-coresets.
Our result implies more efficient coreset constructions for a number of
interesting problems in machine learning; we show applications to
$k$-median/$k$-means, $k$-line clustering, $j$-subspace approximation, and the
integer $(j,k)$-projective clustering problem.
Streaming Coreset Constructions for M-Estimators
We introduce a new method of maintaining a (k,epsilon)-coreset for clustering M-estimators over insertion-only streams. Let (P,w) be a weighted set (where w : P -> [0,infty) is the weight function) of points in a rho-metric space (meaning a set X equipped with a positive-semidefinite symmetric function D such that D(x,z) <= rho(D(x,y) + D(y,z)) for all x,y,z in X). For any set of points C, we define COST(P,w,C) = sum_{p in P} w(p) min_{c in C} D(p,c). A (k,epsilon)-coreset for (P,w) is a weighted set (Q,v) such that for every set C of k points, (1-epsilon)COST(P,w,C) <= COST(Q,v,C) <= (1+epsilon)COST(P,w,C). Essentially, the coreset (Q,v) can be used in place of (P,w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning of streaming and distributed data.
M-estimators are functions D(x,y) that can be written as psi(d(x,y)) where (X, d) is a true metric (i.e. 1-metric) space. Special cases of M-estimators include the well-known k-median (psi(x) = x) and k-means (psi(x) = x^2) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it into the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses O(epsilon^{-2} k log k log n) points of storage. The previous state-of-the-art required storing at least O(epsilon^{-2} k log k log^{4} n) points.
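The COST function and the (k,epsilon)-coreset condition defined above can be illustrated with a small sketch. It assumes Euclidean distance for d and checks the two-sided bound only on a finite, hand-picked list of candidate center sets, whereas the definition quantifies over every set C of k points. The names `cost` and `is_coreset_on` and the toy data are hypothetical; this is not the streaming construction described in the abstract.

```python
import math

def cost(points, weights, centers, psi=lambda r: r):
    """COST(P, w, C) = sum_p w(p) * min_c psi(d(p, c)), with d Euclidean (an assumption)."""
    return sum(
        w * min(psi(math.dist(p, c)) for c in centers)
        for p, w in zip(points, weights)
    )

def is_coreset_on(queries, P, w, Q, v, eps, psi=lambda r: r):
    """Check (1-eps)*COST(P) <= COST(Q) <= (1+eps)*COST(P) on a finite list of center sets."""
    for C in queries:
        full = cost(P, w, C, psi)
        small = cost(Q, v, C, psi)
        if not ((1 - eps) * full <= small <= (1 + eps) * full):
            return False
    return True

# Toy data: duplicated points can be merged into one point with doubled weight.
P = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
w = [1.0, 1.0, 1.0, 1.0]
Q = [(0.0, 0.0), (1.0, 0.0)]
v = [2.0, 2.0]

queries = [[(0.5, 0.0)], [(0.0, 0.0)], [(3.0, 4.0)]]
k_median = lambda r: r        # psi(x) = x
k_means = lambda r: r * r     # psi(x) = x^2

ok_median = is_coreset_on(queries, P, w, Q, v, 0.1, k_median)
ok_means = is_coreset_on(queries, P, w, Q, v, 0.1, k_means)
# Keeping only one location with all the weight fails the check.
bad = is_coreset_on(queries, P, w, [(0.0, 0.0)], [4.0], 0.1, k_median)
```

Swapping `psi` is the only change needed to move between the k-median and k-means special cases named in the abstract.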
Effective and efficient algorithm for multiobjective optimization of hydrologic models
Practical experience with the calibration of hydrologic models suggests that any single-objective function, no matter how carefully chosen, is often inadequate to properly measure all of the characteristics of the observed data deemed to be important. One strategy to circumvent this problem is to define several optimization criteria (objective functions) that measure different (complementary) aspects of the system behavior and to use multicriteria optimization to identify the set of nondominated, efficient, or Pareto optimal solutions. In this paper, we present an efficient and effective Markov Chain Monte Carlo sampler, entitled the Multiobjective Shuffled Complex Evolution Metropolis (MOSCEM) algorithm, which is capable of solving the multiobjective optimization problem for hydrologic models. MOSCEM is an improvement over the Shuffled Complex Evolution Metropolis (SCEM-UA) global optimization algorithm, using the concept of Pareto dominance (rather than direct single-objective function evaluation) to evolve the initial population of points toward a set of solutions stemming from a stable distribution (Pareto set). The efficacy of the MOSCEM-UA algorithm is compared with the original MOCOM-UA algorithm for three hydrologic modeling case studies of increasing complexity.
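The notion of Pareto dominance used above can be made concrete with a short sketch. It assumes all objectives are minimized and simply filters a finite list of objective vectors down to its nondominated subset. The function names and sample values are hypothetical, and this is not the MOSCEM sampler itself.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated(points):
    """Return the points that no other point dominates (the Pareto set of the list)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical objective vectors, e.g. two error measures for candidate parameter sets.
candidates = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
pareto = nondominated(candidates)
```

Here `(3, 3)` and `(4, 4)` are dropped because `(2, 2)` is at least as good in both objectives and strictly better in one.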
Spatially Lagged Choropleth Display
Choropleth display of spatial information is a fundamental feature of mapping and geographic information system technologies. There has long been a desire to impart some spatial influence in the class selection and delineation process of choropleth display. This paper presents an approach for representing the spatial influence of neighboring areas in the creation of choropleth classes. The usefulness of this approach is explored using suburb level crime statistics for Brisbane, Australia.
Parallel Algorithms for Multicriteria Shortest Path Problems
This paper presents two strategies for solving multicriteria shortest path problems with more than two criteria. Given an undirected graph with n vertices, m edges, and a set of K weights associated with each edge, we define a path as a sequence of edges from vertex s to vertex t. We want to find the Pareto-optimal set of paths from s to t. The solutions proposed herein are based on cluster computing using the Message-Passing Interface (MPI) extensions to the C programming language. We solve problems with 3 and 4 criteria, using up to 8 processors in parallel and using solutions based on two strategies. The first strategy obtains an approximation of the Pareto-optimal set by solving for supported solutions in bi-criteria sub-problems using a weighted-sum approach, then merging the solutions. The second strategy applies the weighted-sum algorithm directly to the tri-criteria and quad-criteria problems to find the Pareto-optimal set of supported solutions, with each processor using a range of weights.
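A weighted-sum approach of the kind mentioned above can be sketched as follows: each edge's K weights are collapsed into one scalar using a weight vector, and an ordinary shortest-path search is run on the scalarized graph. This is a single-process Python illustration using Dijkstra's algorithm, assuming non-negative edge weights and non-negative scalarization weights; it is not the MPI/C implementation from the paper, and the helper names and the toy graph are hypothetical.

```python
import heapq

def undirected(edges):
    """Build an adjacency map from (u, v, weight_vector) triples, adding both directions."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    return adj

def weighted_sum_path(adj, s, t, lam):
    """Dijkstra on scalarized costs sum_i lam[i] * w[i]; returns (cost_vector, path) or None."""
    dist = {s: 0.0}
    best = {s: (tuple([0] * len(lam)), [s])}
    heap = [(0.0, s)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == t:
            break
        vec, path = best[u]
        for v, w in adj.get(u, []):
            nd = d + sum(l * x for l, x in zip(lam, w))
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                best[v] = (tuple(a + b for a, b in zip(vec, w)), path + [v])
                heapq.heappush(heap, (nd, v))
    return best.get(t)

# Toy bi-criteria graph with three s-t routes: via a, via b, and a direct edge.
adj = undirected([
    ("s", "a", (1, 4)), ("a", "t", (1, 4)),
    ("s", "b", (4, 1)), ("b", "t", (4, 1)),
    ("s", "t", (3, 3)),
])

# Sweeping the weight vector yields different supported solutions.
supported = {weighted_sum_path(adj, "s", "t", lam)[0]
             for lam in [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]}
```

Because each weight vector is processed independently, a sweep like this is naturally split across processors, each handling its own range of weights.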
Genome-wide analysis of the emigrant family of MITEs: amplification dynamics and evolution of genes in Arabidopsis thaliana
MITEs are structurally similar to defective class II elements but
their high copy number and the size and sequence conservation of most
MITE families suggest that they can be amplified by a replicative
mechanism. Here we present a genome-wide analysis of the Emigrant
family of MITEs from Arabidopsis thaliana. In order to be able to
detect divergent ancient copies and low copy number subfamilies with a
different internal sequence we have developed a computer program
(http://www.lsi.upc.es/~alggen) that searches for Emigrant
elements based solely on their TIR sequence. Our results show that
different bursts of amplification of one or very few active, or
master, elements have occurred at different times during Arabidopsis
evolution, with an insertion dynamics similar to that of some
SINEs. The analysis of the insertion sites of the Emigrant elements
shows that, although Emigrant elements tend to integrate far from ORFs,
the elements inserted within or close to genes are preferentially
maintained during evolution.
An interior point algorithm for minimum sum-of-squares clustering
Copyright © 2000 SIAM Publications. An exact algorithm is proposed for minimum sum-of-squares nonhierarchical clustering, i.e., for partitioning a given set of points from a Euclidean m-space into a given number of clusters in order to minimize the sum of squared distances from all points to the centroid of the cluster to which they belong. This problem is expressed as a constrained hyperbolic program in 0-1 variables. The resolution method combines an interior point algorithm, i.e., a weighted analytic center column generation method, with branch-and-bound. The auxiliary problem of determining the entering column (i.e., the oracle) is an unconstrained hyperbolic program in 0-1 variables with a quadratic numerator and linear denominator. It is solved through a sequence of unconstrained quadratic programs in 0-1 variables. To accelerate resolution, variable neighborhood search heuristics are used both to get a good initial solution and to solve quickly the auxiliary problem as long as global optimality is not reached. Estimated bounds for the dual variables are deduced from the heuristic solution and used in the resolution process as a trust region. Proved minimum sum-of-squares partitions are determined for the first time for several fairly large data sets from the literature, including Fisher's 150 iris. This research was supported by the Fonds National de la Recherche Scientifique Suisse, NSERC-Canada, and FCAR-Quebec.
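The objective described above, the sum of squared distances from each point to the centroid of its cluster, can be evaluated for a given partition with a few lines of code. The sketch below only computes that objective for hand-made partitions; it does not implement the interior point, column generation, or branch-and-bound machinery, and the function names and data are hypothetical.

```python
def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def sum_of_squares(partition):
    """Sum over clusters of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for cluster in partition:
        c = centroid(cluster)
        for p in cluster:
            total += sum((x - y) ** 2 for x, y in zip(p, c))
    return total

# Two ways of splitting three collinear points into two clusters.
good = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0)]]
poor = [[(0.0, 0.0)], [(2.0, 0.0), (10.0, 0.0)]]

good_value = sum_of_squares(good)
poor_value = sum_of_squares(poor)
```

Grouping the two nearby points gives the smaller objective value, which is the quantity an exact method would certify as minimal over all partitions.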