
    New Frameworks for Offline and Streaming Coreset Constructions

    A coreset for a set of points is a small subset of weighted points that approximately preserves important properties of the original set. Specifically, if P is a set of points, Q is a set of queries, and f : P × Q → ℝ is a cost function, then a set S ⊆ P with weights w : P → [0, ∞) is an ε-coreset for some parameter ε > 0 if Σ_{s ∈ S} w(s) f(s, q) is a (1 + ε) multiplicative approximation to Σ_{p ∈ P} f(p, q) for all q ∈ Q. Coresets are used to solve fundamental problems in machine learning under various big data models of computation. Many of the coresets suggested in the last decade used, or could have used, a general framework for constructing coresets whose size depends quadratically on what is known as the total sensitivity t. In this paper we improve this bound from O(t²) to O(t log t). Our results thus imply more space-efficient solutions to a number of problems, including projective clustering, k-line clustering, and subspace approximation. Moreover, we generalize the notion of sensitivity sampling to sup-sampling, which supports non-multiplicative approximations, negative cost functions, and more. The main technical result is a generic reduction to the sample complexity of learning a class of functions with bounded VC dimension. We show that obtaining a (ν, α)-sample for this class of functions, with appropriate parameters ν and α, suffices to achieve space-efficient ε-coresets. Our result implies more efficient coreset constructions for a number of interesting problems in machine learning; we show applications to k-median/k-means, k-line clustering, j-subspace approximation, and the integer (j, k)-projective clustering problem.
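The sampling scheme underlying such constructions can be illustrated in a few lines: sample points with probability proportional to an upper bound on their sensitivity (a point's worst-case share of the total cost) and reweight each sample by its inverse sampling probability. The Python sketch below is illustrative only, using the crude quadratic-size bound rather than the paper's O(t log t) construction, and all names are ours:

```python
import random

def coreset_by_sensitivity(P, cost, queries, m):
    """Sample m points with probability proportional to a sensitivity
    upper bound; weight each sample by 1/(m * prob) so the weighted
    cost is an unbiased estimate of the full cost."""
    totals = [sum(cost(x, q) for x in P) for q in queries]
    # crude sensitivity bound: max fraction of total cost over all queries
    sens = [max(cost(p, q) / T for q, T in zip(queries, totals)) for p in P]
    t = sum(sens)                       # total sensitivity
    probs = [s / t for s in sens]
    idx = random.choices(range(len(P)), weights=probs, k=m)
    return [(P[i], 1.0 / (m * probs[i])) for i in idx]

# toy instance: 1-D points, squared-distance cost (k-means with k = 1)
random.seed(0)
P = [random.gauss(0, 1) for _ in range(1000)]
queries = [-1.0, 0.0, 1.0]              # candidate centers
cost = lambda p, q: (p - q) ** 2

S = coreset_by_sensitivity(P, cost, queries, m=400)
for q in queries:
    exact = sum(cost(p, q) for p in P)
    approx = sum(w * cost(s, q) for s, w in S)
    print(f"q={q:+.1f}  exact={exact:8.1f}  coreset={approx:8.1f}")
```

The weighted sum over the 400 sampled points tracks the exact sum over all 1000 points for every query, which is the coreset guarantee in miniature.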

    Streaming Coreset Constructions for M-Estimators

    We introduce a new method of maintaining a (k,epsilon)-coreset for clustering M-estimators over insertion-only streams. Let (P,w) be a weighted set (where w : P -> [0,infty) is the weight function) of points in a rho-metric space (meaning a set X equipped with a positive-semidefinite symmetric function D such that D(x,z) <= rho(D(x,y) + D(y,z)) for all x,y,z in X). For any set of points C, we define COST(P,w,C) = sum_{p in P} w(p) min_{c in C} D(p,c). A (k,epsilon)-coreset for (P,w) is a weighted set (Q,v) such that for every set C of k points, (1-epsilon)COST(P,w,C) <= COST(Q,v,C) <= (1+epsilon)COST(P,w,C). Essentially, the coreset (Q,v) can be used in place of (P,w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning of streaming and distributed data. M-estimators are functions D(x,y) that can be written as psi(d(x,y)) where (X, d) is a true metric (i.e. 1-metric) space. Special cases of M-estimators include the well-known k-median (psi(x) = x) and k-means (psi(x) = x^2) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it into the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses O(epsilon^{-2} k log k log n) points of storage. The previous state-of-the-art required storing at least O(epsilon^{-2} k log k log^{4} n) points.
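The COST function is direct to state in code. A minimal sketch (ours, not the paper's construction), where psi(x) = x recovers k-median and psi(x) = x^2 recovers k-means:

```python
def cost(P, w, C, psi, d=lambda x, y: abs(x - y)):
    """COST(P,w,C) = sum_p w(p) * min_c psi(d(p,c)): each point pays its
    weighted M-estimator distance to the nearest center in C."""
    return sum(w[p] * min(psi(d(p, c)) for c in C) for p in P)

# 1-D toy data with unit weights and two candidate centers
P = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
w = {p: 1.0 for p in P}
C = [1.0, 11.0]

k_median = cost(P, w, C, psi=lambda x: x)        # psi(x) = x
k_means  = cost(P, w, C, psi=lambda x: x * x)    # psi(x) = x^2
print(k_median, k_means)   # 4.0 4.0 (four points at distance 1, two at 0)
```

A (k,epsilon)-coreset (Q,v) is then any weighted set whose `cost(Q, v, C, psi)` stays within a (1 ± epsilon) factor of this value for every choice of k centers C.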

    Effective and efficient algorithm for multiobjective optimization of hydrologic models

    Practical experience with the calibration of hydrologic models suggests that any single-objective function, no matter how carefully chosen, is often inadequate to properly measure all of the characteristics of the observed data deemed to be important. One strategy to circumvent this problem is to define several optimization criteria (objective functions) that measure different (complementary) aspects of the system behavior and to use multicriteria optimization to identify the set of nondominated, efficient, or Pareto optimal solutions. In this paper, we present an efficient and effective Markov Chain Monte Carlo sampler, entitled the Multiobjective Shuffled Complex Evolution Metropolis (MOSCEM) algorithm, which is capable of solving the multiobjective optimization problem for hydrologic models. MOSCEM is an improvement over the Shuffled Complex Evolution Metropolis (SCEM-UA) global optimization algorithm, using the concept of Pareto dominance (rather than direct single-objective function evaluation) to evolve the initial population of points toward a set of solutions stemming from a stable distribution (Pareto set). The efficacy of the MOSCEM-UA algorithm is compared with that of the original MOCOM-UA algorithm for three hydrologic modeling case studies of increasing complexity.
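The Pareto-dominance relation used to evolve the population can be sketched generically. The following is a plain nondominated filter in Python (ours, for illustration), not the MOSCEM-UA sampler itself:

```python
def dominates(a, b):
    """a dominates b (minimization): a is no worse in every objective
    and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the nondominated subset of a list of objective vectors."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# toy population scored on two complementary calibration criteria
pts = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (5.0, 5.0)]
print(pareto_front(pts))   # [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0)]
```

Here (3.0, 4.0) is dominated by (2.0, 3.0) and (5.0, 5.0) by (1.0, 5.0); the survivors are the trade-off set no single-objective calibration would expose.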

    Spatially Lagged Choropleth Display

    Choropleth display of spatial information is a fundamental feature of mapping and geographic information system technologies. There has long been a desire to impart some spatial influence in the class selection and delineation process of choropleth display. This paper presents an approach for representing the spatial influence of neighboring areas in the creation of choropleth classes. The usefulness of this approach is explored using suburb level crime statistics for Brisbane, Australia.
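One simple way to impart spatial influence before class selection is to blend each area's value with the mean of its neighbors' values and class the blended (spatially lagged) values instead. The sketch below is a generic spatial-lag computation of our own; the blend weight `alpha` and the equal-weight neighbor mean are assumptions, not the paper's exact scheme:

```python
def spatial_lag(values, neighbors, alpha=0.5):
    """Blend each area's value with the mean of its neighbors' values.
    alpha is the strength of the spatial influence (illustrative choice)."""
    lagged = {}
    for area, v in values.items():
        nbrs = neighbors.get(area, [])
        if nbrs:
            nbr_mean = sum(values[n] for n in nbrs) / len(nbrs)
            lagged[area] = (1 - alpha) * v + alpha * nbr_mean
        else:
            lagged[area] = v          # isolated area: value unchanged
    return lagged

# toy suburb layout: A - B - C in a line, with per-suburb crime rates
values = {"A": 10.0, "B": 50.0, "C": 30.0}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
print(spatial_lag(values, neighbors))   # {'A': 30.0, 'B': 35.0, 'C': 40.0}
```

Classing the lagged values pulls each suburb's class toward those of its neighbors, which is the spatial smoothing effect the display is after.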

    Parallel Algorithms for Multicriteria Shortest Path Problems

    This paper presents two strategies for solving multicriteria shortest path problems with more than two criteria. Given an undirected graph with n vertices, m edges, and a set of K weights associated with each edge, we define a path as a sequence of edges from vertex s to vertex t. We want to find the Pareto-optimal set of paths from s to t. The solutions proposed herein are based on cluster computing using the Message-Passing Interface (MPI) extensions to the C programming language. We solve problems with 3 and 4 criteria, using up to 8 processors in parallel, with solutions based on two strategies. The first strategy obtains an approximation of the Pareto-optimal set by solving for supported solutions in bi-criteria sub-problems using a weighted-sum approach, then merging the solutions. The second strategy applies the weighted-sum algorithm directly to the tri-criteria and quad-criteria problems to find the Pareto-optimal set of supported solutions, with each processor using a range of weights.
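The weighted-sum approach reduces each run to a single-criterion problem: a weight vector lambda collapses the K edge-cost criteria into one scalar cost, and an ordinary shortest-path solver then yields one supported solution per lambda. The sketch below is sequential Python rather than the paper's MPI/C implementation, and it returns the optimal path costs rather than the paths themselves:

```python
import heapq

def dijkstra(graph, s, t):
    """Shortest s-t distance under scalar edge weights; graph[u] = [(v, w), ...]."""
    dist = {s: 0.0}
    pq = [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == t:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")

def supported_solutions(multi_graph, s, t, weight_vectors):
    """Weighted-sum scalarization: each lambda gives one single-criterion
    problem whose optimum is a supported Pareto solution's scalar cost."""
    costs = set()
    for lam in weight_vectors:
        scalar = {u: [(v, sum(l * c for l, c in zip(lam, cs)))
                      for v, cs in edges]
                  for u, edges in multi_graph.items()}
        costs.add(dijkstra(scalar, s, t))
    return costs

# toy bi-criteria graph: each edge cost is a (time, toll) pair
G = {"s": [("a", (1, 4)), ("b", (4, 1))],
     "a": [("t", (1, 4))],
     "b": [("t", (4, 1))],
     "t": []}
lams = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
print(supported_solutions(G, "s", "t", lams))   # {2.0, 5.0}
```

Distributing `weight_vectors` across processors, as the paper does with MPI, parallelizes this loop trivially since each lambda is an independent shortest-path run.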

    Genome-wide analysis of the emigrant family of MITEs: amplification dynamics and evolution of genes in Arabidopsis thaliana

    MITEs are structurally similar to defective class II elements, but their high copy number and the size and sequence conservation of most MITE families suggest that they can be amplified by a replicative mechanism. Here we present a genome-wide analysis of the Emigrant family of MITEs from Arabidopsis thaliana. In order to detect divergent ancient copies and low-copy-number subfamilies with a different internal sequence, we have developed a computer program (http://www.lsi.upc.es/~alggen) that searches for Emigrant elements based solely on their TIR sequences. Our results show that different bursts of amplification of one or very few active, or master, elements have occurred at different times during Arabidopsis evolution, with an insertion dynamics similar to that of some SINEs. The analysis of the insertion sites of the Emigrant elements shows that, although Emigrant elements tend to integrate far from ORFs, the elements inserted within or close to genes are preferentially maintained during evolution.
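A TIR-only search of the kind described reduces to scanning for the terminal inverted repeat followed, within a plausible element length, by its reverse complement. The sketch below is illustrative only (exact matching, arbitrary length bounds, synthetic data) and is not the authors' program:

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def find_tir_elements(genome, tir, min_len=20, max_len=60):
    """Report (start, end) spans that begin with the TIR and end with its
    reverse complement: candidate MITE-like elements. The exact-match
    criterion and length bounds are illustrative assumptions."""
    rc = revcomp(tir)
    hits = []
    i = genome.find(tir)
    while i != -1:
        for j in range(i + min_len, min(i + max_len, len(genome)) + 1):
            if genome[j - len(rc):j] == rc:
                hits.append((i, j))
        i = genome.find(tir, i + 1)
    return hits

tir = "GGCCAT"                                  # hypothetical TIR
element = tir + "A" * 20 + revcomp(tir)         # 32 bp synthetic element
genome = "T" * 10 + element + "A" * 10
print(find_tir_elements(genome, tir))           # [(10, 42)]
```

A real TIR search would tolerate mismatches in the internal sequence and the TIR itself, which is exactly what lets such a program recover divergent ancient copies.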

    An interior point algorithm for minimum sum-of-squares clustering

    Copyright @ 2000 SIAM Publications. An exact algorithm is proposed for minimum sum-of-squares nonhierarchical clustering, i.e., for partitioning a given set of points from a Euclidean m-space into a given number of clusters in order to minimize the sum of squared distances from all points to the centroid of the cluster to which they belong. This problem is expressed as a constrained hyperbolic program in 0-1 variables. The resolution method combines an interior point algorithm, i.e., a weighted analytic center column generation method, with branch-and-bound. The auxiliary problem of determining the entering column (i.e., the oracle) is an unconstrained hyperbolic program in 0-1 variables with a quadratic numerator and linear denominator. It is solved through a sequence of unconstrained quadratic programs in 0-1 variables. To accelerate resolution, variable neighborhood search heuristics are used both to get a good initial solution and to solve quickly the auxiliary problem as long as global optimality is not reached. Estimated bounds for the dual variables are deduced from the heuristic solution and used in the resolution process as a trust region. Proved minimum sum-of-squares partitions are determined for the first time for several fairly large data sets from the literature, including Fisher's 150 iris. This research was supported by the Fonds National de la Recherche Scientifique Suisse, NSERC-Canada, and FCAR-Quebec.
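The objective being minimized, the sum of squared Euclidean distances from each point to the centroid of its assigned cluster, is simple to evaluate for any candidate partition. A small sketch of ours, for illustration only (the paper's contribution is proving a partition optimal, not evaluating it):

```python
def ssq_cost(points, labels, k):
    """Sum over clusters of squared distances to the cluster centroid."""
    clusters = {j: [] for j in range(k)}
    for p, j in zip(points, labels):
        clusters[j].append(p)
    total = 0.0
    for members in clusters.values():
        if not members:
            continue
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim))
                     for p in members)
    return total

pts = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
print(ssq_cost(pts, [0, 0, 1, 1], k=2))   # 4.0: each cluster pays 2 * 1^2
print(ssq_cost(pts, [0, 1, 1, 0], k=2))   # 104.0: a far worse partition
```

Branch-and-bound searches the space of such 0-1 assignments, pruning with lower bounds from the column-generation relaxation until a partition like the first one above is proved minimal.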