44 research outputs found

    ε-Kernel Coresets for Stochastic Points

    With the dramatic growth in the number of application domains that generate probabilistic, noisy, and uncertain data, there has been increasing interest in designing algorithms for geometric or combinatorial optimization problems over such data. In this paper, we initiate the study of constructing ε-kernel coresets for uncertain points. We consider uncertainty in the existential model, where each point's location is fixed but the point is present only with a certain probability, and in the locational model, where each point has a probability distribution describing its location. An ε-kernel coreset approximates the width of a point set in any direction. We consider approximating the expected width (an ε-EXP-KERNEL), as well as the probability distribution of the width (an (ε, τ)-QUANT-KERNEL), for any direction. We show that there exists a set of O(ε^{-(d-1)/2}) deterministic points which approximates the expected width under both the existential and locational models, and we provide efficient algorithms for constructing such coresets. We show, however, that it is not always possible to find a subset of the original uncertain points which provides such an approximation. Nevertheless, if the existential probability of each point is bounded below by a constant, an ε-EXP-KERNEL is still possible. We also provide efficient algorithms for constructing an (ε, τ)-QUANT-KERNEL coreset in nearly linear time. Our techniques utilize or connect to several important notions in probability and geometry, such as Kolmogorov distances, VC uniform convergence, and Tukey depth, and may be useful in other geometric optimization problems in stochastic settings. Finally, combining with known techniques, we show a few applications to approximating the extent of uncertain functions, maintaining extent measures for stochastic moving points, and some shape fitting problems under uncertainty.
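
    To illustrate the quantities involved (this is not the paper's construction), the sketch below estimates the expected directional width of an uncertain point set in the existential model by Monte Carlo over realizations and compares it against the width of a crude deterministic candidate set built from the support only. The point set, probabilities, sampled directions, and candidate set are all illustrative assumptions; since the candidate ignores the probabilities, it overestimates the expected width.

```python
import numpy as np

def expected_width(points, probs, directions, n_trials=2000, rng=None):
    # Monte Carlo estimate of the expected directional width under the
    # existential model: each point appears independently with its probability.
    rng = rng or np.random.default_rng(0)
    widths = np.zeros((n_trials, len(directions)))
    for t in range(n_trials):
        mask = rng.random(len(points)) < probs          # sample one realization
        if mask.sum() < 2:
            continue                                    # width of fewer than 2 points is 0
        proj = points[mask] @ directions.T              # project onto each direction
        widths[t] = proj.max(axis=0) - proj.min(axis=0)
    return widths.mean(axis=0)

def directional_width(points, directions):
    proj = points @ directions.T
    return proj.max(axis=0) - proj.min(axis=0)

rng = np.random.default_rng(1)
P = rng.normal(size=(200, 2))                           # uncertain point locations
pi = rng.uniform(0.5, 1.0, size=200)                    # existential probabilities
theta = np.linspace(0.0, np.pi, 60, endpoint=False)
U = np.stack([np.cos(theta), np.sin(theta)], axis=1)    # sampled unit directions

exp_w = expected_width(P, pi, U)
# Crude candidate: the extreme points of the support in the sampled directions.
Q = P[np.unique(np.concatenate([np.argmax(P @ U.T, axis=0),
                                np.argmin(P @ U.T, axis=0)]))]
rel_err = np.abs(directional_width(Q, U) - exp_w) / exp_w
print("max relative error over the sampled directions:", rel_err.max())
```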

    Minimum Coresets for Maxima Representation of Multidimensional Data

    Coresets are succinct summaries of large datasets such that, for a given problem, the solution obtained from a coreset is provably competitive with the solution obtained from the full dataset. As such, coreset-based data summarization techniques have been successfully applied to various problems, e.g., geometric optimization, clustering, and approximate query processing, for scaling them up to massive data. In this paper, we study coresets for the maxima representation of multidimensional data: given a set P of points in R^d, where d is a small constant, and an error parameter ε ∈ (0, 1), a subset Q ⊆ P is an ε-coreset for the maxima representation of P iff the maximum of Q is an ε-approximation of the maximum of P for any vector u ∈ R^d, where the maximum is taken over the inner products between the set of points (P or Q) and u. We define a novel minimum ε-coreset problem that asks for an ε-coreset of the smallest size for the maxima representation of a point set. For the two-dimensional case, we develop an optimal polynomial-time algorithm for the minimum ε-coreset problem by transforming it into the shortest-cycle problem in a directed graph. Then, we prove that this problem is NP-hard in three or higher dimensions and present polynomial-time approximation algorithms in an arbitrary fixed dimension. Finally, we provide extensive experimental results on both real and synthetic datasets to demonstrate the superior performance of our proposed algorithms.
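
    The coreset property above can be checked empirically. The sketch below tests, on randomly sampled directions, whether the best inner product achieved by Q stays within a (1 − ε) factor of the best achieved by P; it assumes nonnegative data and directions with relative error, a common convention in this line of work (the paper's exact definition may differ), and uses 2-D convex-hull vertices as a candidate subset rather than the paper's minimum-coreset algorithms.

```python
import numpy as np
from scipy.spatial import ConvexHull

def is_eps_coreset(P, Q, eps, n_dirs=5000, rng=None):
    # Empirical check of the maxima-representation property on randomly
    # sampled nonnegative directions: max over Q of <q, u> must be within a
    # (1 - eps) factor of max over P of <p, u>.  Passing on sampled
    # directions is necessary, not sufficient.
    rng = rng or np.random.default_rng(0)
    U = np.abs(rng.normal(size=(n_dirs, P.shape[1])))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    max_P = (P @ U.T).max(axis=0)
    max_Q = (Q @ U.T).max(axis=0)
    return bool(np.all(max_Q >= (1.0 - eps) * max_P))

# In 2-D, the convex hull vertices preserve every directional maximum exactly,
# so they always pass; a minimum eps-coreset would typically be much smaller.
P = np.random.default_rng(1).uniform(size=(1000, 2))
Q = P[ConvexHull(P).vertices]
print(is_eps_coreset(P, Q, eps=0.05), len(Q), "of", len(P), "points")
```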

    Determinantal Point Processes for Coresets

    When one is faced with a dataset too large to be used all at once, an obvious solution is to retain only part of it. In practice this takes a wide variety of different forms, but among them "coresets" are especially appealing. A coreset is a (small) weighted sample of the original data that comes with a guarantee: a cost function can be evaluated on the smaller set instead of the larger one, with low relative error. For some classes of problems, and via a careful choice of sampling distribution, i.i.d. random sampling has turned out to be one of the most successful methods to build coresets efficiently. However, independent samples are sometimes overly redundant, and one could hope that enforcing diversity would lead to better performance. The difficulty lies in proving coreset properties for non-i.i.d. samples. We show that the coreset property holds for samples formed with determinantal point processes (DPPs). DPPs are interesting because they are a rare example of repulsive point processes with tractable theoretical properties, enabling us to construct general coreset theorems. We apply our results to the k-means problem, and give empirical evidence of the superior performance of DPP samples over state-of-the-art methods.
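
    To make the DPP-sampling idea concrete, here is a minimal numpy sketch: an exact spectral DPP sampler (in the style of Hough et al. / Kulesza & Taskar) applied to an RBF L-ensemble kernel, with the sampled points reweighted by inverse marginal inclusion probabilities before a k-means cost is evaluated. The kernel choice, bandwidth, and weighting scheme are illustrative assumptions, not the construction or guarantees analyzed in the paper.

```python
import numpy as np

def sample_dpp(L, rng):
    # Exact spectral sampler for a finite DPP with L-ensemble kernel L.
    vals, vecs = np.linalg.eigh(L)
    vals = np.clip(vals, 0.0, None)
    keep = rng.random(len(vals)) < vals / (1.0 + vals)   # keep eigenvector i w.p. lambda_i/(1+lambda_i)
    V = vecs[:, keep]
    selected = []
    while V.shape[1] > 0:
        p = (V ** 2).sum(axis=1)
        p /= p.sum()
        i = int(rng.choice(len(p), p=p))                 # sample an item index
        selected.append(i)
        j = int(np.argmax(np.abs(V[i, :])))              # a column with V[i, j] != 0
        V = V - np.outer(V[:, j] / V[i, j], V[i, :])     # project out coordinate i
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)                       # re-orthonormalize the basis
    return selected

def kmeans_cost(X, centers, w=None):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    return float((d2 if w is None else w * d2).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
L = np.exp(-D2 / 2.0)                                    # RBF L-ensemble kernel (ad hoc bandwidth)
S = sample_dpp(L, rng)                                   # expected size is sum_i lambda_i/(1+lambda_i)
K = L @ np.linalg.inv(L + np.eye(len(X)))                # marginal kernel; pi_i = K[i, i]
w = 1.0 / np.diag(K)[S]                                  # inverse-inclusion-probability weights
centers = X[rng.choice(len(X), size=3, replace=False)]
print("full cost:", kmeans_cost(X, centers))
print("weighted DPP-sample cost:", kmeans_cost(X[S], centers, w))
```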

    On the expected diameter, width, and complexity of a stochastic convex-hull

    We investigate several computational problems related to the stochastic convex hull (SCH). Given a stochastic dataset consisting of n points in R^d, each of which has an existence probability, an SCH refers to the convex hull of a realization of the dataset, i.e., a random sample that includes each point with its existence probability. We are interested in computing certain expected statistics of an SCH, including the diameter, the width, and the combinatorial complexity. For the diameter, we establish the first deterministic 1.633-approximation algorithm with a time complexity polynomial in both n and d. For the width, two approximation algorithms are provided: a deterministic O(1)-approximation running in O(n^{d+1} log n) time, and a fully polynomial-time randomized approximation scheme (FPRAS). For the combinatorial complexity, we propose an exact O(n^d)-time algorithm. Our solutions exploit many geometric insights in Euclidean space, some of which might be of independent interest.
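
    A brute-force baseline helps fix the definitions: the sketch below estimates the expected diameter and the expected number of hull vertices of an SCH by plain Monte Carlo over realizations in the existential model. This is not the deterministic approximation algorithms or the FPRAS described above; the point set and probabilities are made up for illustration.

```python
import numpy as np
from scipy.spatial import ConvexHull, distance_matrix

def sch_monte_carlo(points, probs, n_trials=2000, rng=None):
    # Monte Carlo estimates of the expected diameter and the expected number
    # of hull vertices of a stochastic convex hull (existential model).
    rng = rng or np.random.default_rng(0)
    diam, nverts = [], []
    for _ in range(n_trials):
        R = points[rng.random(len(points)) < probs]      # one realization
        if len(R) < 3:
            diam.append(0.0 if len(R) < 2 else float(np.linalg.norm(R[0] - R[1])))
            nverts.append(len(R))
            continue
        H = R[ConvexHull(R).vertices]
        diam.append(float(distance_matrix(H, H).max()))  # diameter is attained on the hull
        nverts.append(len(H))
    return np.mean(diam), np.mean(nverts)

P = np.random.default_rng(1).normal(size=(100, 2))
pi = np.full(100, 0.3)                                   # existence probabilities
print(sch_monte_carlo(P, pi))
```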

    Doctor of Philosophy

    Kernel smoothing provides a simple way of finding structure in data sets without imposing a parametric model, for example through nonparametric regression and density estimates. However, in many data-intensive applications the data set can be large, so evaluating a kernel density estimate or a kernel regression directly over the data set can be prohibitively expensive. This dissertation studies how to efficiently find a smaller data set that approximates the original data set with a theoretical guarantee in the kernel smoothing setting, and how to extend this to more general smooth range spaces. For kernel density estimates, we propose randomized and deterministic algorithms with quality guarantees that are orders of magnitude more efficient than previous algorithms; they do not require knowledge of the kernel or its bandwidth parameter, are easily parallelizable, and are applicable to any large-scale data processing framework. We then further investigate how to measure the error between two kernel density estimates, which is usually measured in L1 or L2 error. In this dissertation, we investigate the challenges of using a stronger error, the L∞ (or worst-case) error. We present efficient solutions for estimating the L∞ error and for choosing the bandwidth parameter of a kernel density estimate built on a subsample of a large data set. We next extend smoothed versions of geometric range spaces from kernel range spaces to more general types of ranges, so that an element of the ground set can be contained in a range with a non-binary value in [0,1]. We investigate the approximation of these range spaces through ε-nets and ε-samples. Finally, we study coreset algorithms for kernel regression. The sizes of the coresets are independent of the size of the data set; they depend only on the error guarantee and, in some cases, the size of the domain and the amount of smoothing. We evaluate our methods on very large time series and spatial data, and demonstrate that they can be constructed extremely efficiently and allow for great computational gains.
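
    As a minimal illustration of the coreset-for-KDE setting (using plain uniform subsampling, not the dissertation's algorithms), the sketch below builds a Gaussian kernel density estimate on a full data set and on a small subsample and reports the L∞ error over a query grid; the data, bandwidth, and grid are arbitrary choices.

```python
import numpy as np

def kde(queries, data, bandwidth):
    # Gaussian kernel density estimate of `data`, evaluated at `queries`.
    d2 = ((queries[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2)).mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))                          # "full" data set
S = X[rng.choice(len(X), size=400, replace=False)]      # uniform-sample stand-in for a coreset
g = np.linspace(-3.0, 3.0, 30)
grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
err = np.abs(kde(grid, X, 0.5) - kde(grid, S, 0.5))
print("L-infinity error over the query grid:", err.max())
```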

    Fast SVM training using approximate extreme points

    The application of non-linear kernel Support Vector Machines (SVMs) to large datasets is seriously hampered by excessive training time. We propose a modification, called the approximate extreme points support vector machine (AESVM), that is aimed at overcoming this burden. Our approach relies on conducting the SVM optimization over a carefully selected subset, called the representative set, of the training dataset. We present analytical results that indicate the similarity of AESVM and SVM solutions. A linear-time algorithm based on convex hulls and extreme points is used to compute the representative set in kernel space. Extensive computational experiments on nine datasets compared AESVM to LIBSVM \citep{LIBSVM}, CVM \citep{Tsang05}, BVM \citep{Tsang07}, LASVM \citep{Bordes05}, SVM^perf \citep{Joachims09}, and the random features method \citep{rahimi07}. Our AESVM implementation was found to train much faster than the other methods, while its classification accuracy was similar to that of LIBSVM in all cases. In particular, for a seizure detection dataset, AESVM training was almost 10^3 times faster than LIBSVM and LASVM, and more than forty times faster than CVM and BVM. Additionally, AESVM also gave competitively fast classification times.
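
    The representative-set idea can be illustrated crudely with exact convex hulls in input space: train on the hull vertices of each class and compare against training on everything. This is only a toy stand-in under assumed synthetic data; AESVM itself computes approximate extreme points in kernel feature space with a linear-time procedure, which exact input-space hulls do not capture.

```python
import numpy as np
from scipy.spatial import ConvexHull
from sklearn.svm import SVC

# Toy stand-in for the representative-set idea: fit an RBF SVM on the
# convex-hull vertices of each class and compare with fitting on all points.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[-1.0, -1.0], size=(2000, 2))
X1 = rng.normal(loc=[1.0, 1.0], size=(2000, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 2000 + [1] * 2000)

rep = np.concatenate([ConvexHull(X0).vertices,
                      ConvexHull(X1).vertices + len(X0)])   # representative subset
svm_full = SVC(kernel="rbf").fit(X, y)
svm_rep = SVC(kernel="rbf").fit(X[rep], y[rep])

Xt = np.vstack([rng.normal(loc=[-1.0, -1.0], size=(500, 2)),
                rng.normal(loc=[1.0, 1.0], size=(500, 2))])
print("prediction agreement:", np.mean(svm_full.predict(Xt) == svm_rep.predict(Xt)))
```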

    Optimization Algorithms for Faster Computational Geometry

    We study two fundamental problems in computational geometry: finding the maximum inscribed ball (MaxIB) inside a bounded polyhedron defined by m hyperplanes, and the minimum enclosing ball (MinEB) of a set of n points, both in d-dimensional space. We improve the running time of iterative algorithms for MaxIB from Õ(mdα^3/ε^3) to Õ(md + m√d·α/ε), a speed-up of up to Õ(√d·α^2/ε^2), and for MinEB from Õ(nd/√ε) to Õ(nd + n√d/√ε), a speed-up of up to Õ(√d). Our improvements are based on a novel saddle-point optimization framework. We propose a new algorithm, L1L2SPSolver, for solving a class of regularized saddle-point problems, and apply a randomized Hadamard space rotation, a technique borrowed from compressive sensing. Interestingly, the motivation for using the Hadamard rotation comes solely from our optimization view and not from the original geometry problem: indeed, it is not immediately clear why MaxIB or MinEB, as a geometric problem, should be easier to solve if we rotate the space by a unitary matrix. We hope that our optimization perspective sheds light on solving other geometric problems as well.
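
    For context only, the classical Badoiu-Clarkson iteration is a much simpler (1 + ε)-approximation for MinEB that needs O(1/ε²) passes over the points; the sketch below implements that baseline and is not the saddle-point solver L1L2SPSolver proposed in the paper.

```python
import numpy as np

def min_enclosing_ball(P, eps):
    # Badoiu-Clarkson iteration: a classical (1 + eps)-approximate MinEB.
    # Start at a point of conv(P) and, for O(1/eps^2) rounds, step toward the
    # current farthest point with step size 1/(t + 1).
    c = P.mean(axis=0)
    for t in range(1, int(np.ceil(1.0 / eps ** 2)) + 1):
        far = P[np.argmax(np.linalg.norm(P - c, axis=1))]
        c = c + (far - c) / (t + 1)
    return c, float(np.linalg.norm(P - c, axis=1).max())

P = np.random.default_rng(0).normal(size=(5000, 10))
center, radius = min_enclosing_ball(P, eps=0.05)
print("approximate MinEB radius:", radius)
```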

    Applied Randomized Algorithms for Efficient Genomic Analysis

    The scope and scale of biological data continue to grow at an exponential clip, driven by advances in genetic sequencing, annotation, and the widespread adoption of surveillance efforts. For instance, the Sequence Read Archive (SRA) now contains more than 25 petabases of public data, while RefSeq, a collection of reference genomes, recently surpassed 100,000 complete genomes. In the process, it has outgrown the practical reach of many traditional algorithmic approaches in both time and space. Motivated by this extreme scale, this thesis details efficient methods for clustering and summarizing large collections of sequence data. While our primary area of interest is biological sequences, these approaches largely apply to sequence collections of any type, including natural language, software source code, and graph-structured data. We applied recent advances in randomized algorithms to practical problems. We used MinHash and HyperLogLog, both examples of Locality-Sensitive Hashing, as well as coresets, which are approximate representations for finite-sum problems, to build methods capable of scaling to billions of items. Ultimately, these are all derived from variations on sampling. We combined these advances with hardware-based optimizations and incorporated them into free and open-source software libraries (sketch, frp, libsimdsampling) and practical software tools built on these libraries (Dashing, Minicore, Dashing 2), empowering users to interact practically with colossal datasets on commodity hardware.
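
    The sketching idea is easy to demonstrate on k-mer sets: a bottom-k MinHash sketch supports estimating the Jaccard similarity between two sequences' k-mer sets from a few hundred hashes. The sketch below is a minimal pure-Python illustration, not the optimized implementations in the sketch/Dashing libraries; the sequences, k, and sketch size are arbitrary.

```python
import hashlib
import random

def kmers(seq, k=21):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def h64(s):
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

def bottom_k_sketch(items, size=128):
    # Bottom-k MinHash sketch: keep the `size` smallest 64-bit hash values.
    return set(sorted(h64(s) for s in items)[:size])

def jaccard_estimate(sk_a, sk_b, size=128):
    # The smallest hashes of the union form a uniform sample of the union;
    # the fraction present in both sketches estimates the Jaccard index.
    union_bottom = sorted(sk_a | sk_b)[:size]
    return sum(h in sk_a and h in sk_b for h in union_bottom) / len(union_bottom)

random.seed(0)
a = "".join(random.choice("ACGT") for _ in range(10000))
b = a[:6000] + "".join(random.choice("ACGT") for _ in range(4000))
print("estimated Jaccard of 21-mer sets:",
      jaccard_estimate(bottom_k_sketch(kmers(a)), bottom_k_sketch(kmers(b))))
```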