10 research outputs found

    Diverse near neighbor problem

    Motivated by the recent research on diversity-aware search, we investigate the k-diverse near neighbor reporting problem. The problem is defined as follows: given a query point q, report the maximum-diversity set S of k points in the ball of radius r around q. The diversity of a set S is measured by the minimum distance between any pair of points in S (the higher, the better). We present two approximation algorithms for the case where the points live in a d-dimensional Hamming space. Our algorithms guarantee query times that are sub-linear in n and only polynomial in the diversity parameter k, as well as the dimension d. For low values of k, our algorithms achieve sub-linear query times even if the number of points within distance r from a query q is linear in n. To the best of our knowledge, these are the first known algorithms of this type that offer provable guarantees. (Supported by the Charles Stark Draper Laboratory, the National Science Foundation (Award NSF CCF-1012042), and the David & Lucile Packard Foundation.)
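To make the problem statement concrete, here is a brute-force reference implementation of the k-diverse near neighbor objective in Hamming space. This is only an exhaustive baseline for small inputs (exponential in k), not the sub-linear approximation algorithms the abstract describes; all function names are illustrative.

```python
from itertools import combinations

def hamming(p, q):
    """Hamming distance between two equal-length binary tuples."""
    return sum(a != b for a, b in zip(p, q))

def diversity(S):
    """Diversity of a set: minimum pairwise Hamming distance (higher is better)."""
    return min(hamming(a, b) for a, b in combinations(S, 2))

def k_diverse_near_neighbors(points, q, r, k):
    """Exhaustive baseline: among all k-subsets of the points within
    Hamming distance r of q, return one of maximum diversity."""
    ball = [p for p in points if hamming(p, q) <= r]
    if len(ball) <= k:
        return ball
    return list(max(combinations(ball, k), key=diversity))
```

For example, with k = 2 the baseline returns the farthest pair inside the ball, which is exactly the "maximum diversity" criterion for two points.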

    Width of Points in the Streaming Model


    Learning Big (Image) Data via Coresets for Dictionaries

    Signal and image processing have seen an explosion of interest in the last few years in a new form of signal/image characterization via the concept of sparsity with respect to a dictionary. An active field of research is dictionary learning: the representation of a given large set of vectors (e.g. signals or images) as linear combinations of only a few vectors (patterns). To further reduce the size of the representation, the combinations are usually required to be sparse, i.e., each signal is a linear combination of only a small number of patterns. This paper suggests a new computational approach to the problem of dictionary learning, known in computational geometry as coresets. A coreset for dictionary learning is a small, smart, non-uniform sample from the input signals such that the quality of any given dictionary with respect to the input can be approximated via the coreset. In particular, the optimal dictionary for the input can be approximated by learning the coreset. Since the coreset is small, the learning is faster. Moreover, using merge-and-reduce, the coreset can be constructed for streaming signals that do not fit in memory and can also be computed in parallel. We apply our coresets for dictionary learning of images using the K-SVD algorithm and bound their size and approximation error analytically. Our simulations demonstrate gain factors of up to 60 in computational time with the same, and even better, performance. We also demonstrate our ability to perform computations on larger patches and high-definition images, where the traditional approach breaks down.
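The core mechanism behind such coresets is non-uniform importance sampling with inverse-probability reweighting, so that weighted costs on the sample estimate costs on the full input. The sketch below illustrates that mechanism only; the squared-norm score is a placeholder for the paper's actual sensitivity bounds, and the function names are assumptions.

```python
import numpy as np

def sampling_coreset(X, weights, m, seed=None):
    """Generic non-uniform sampling coreset sketch: draw m rows of X with
    probability proportional to an importance score, then reweight each
    sampled row by 1/(m * p_i) so weighted sums remain unbiased.
    The squared row norm stands in for a real sensitivity score."""
    rng = np.random.default_rng(seed)
    scores = weights * (np.linalg.norm(X, axis=1) ** 2 + 1e-12)
    probs = scores / scores.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=probs)
    new_weights = weights[idx] / (m * probs[idx])
    return X[idx], new_weights
```

With uniform scores this degenerates to uniform sampling with weight n/m per sample, which shows why the reweighting preserves total mass; the value of the real construction lies in choosing scores that control the approximation error for every candidate dictionary.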

    Approximate nearest neighbor and its many variants

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013. Cataloged from PDF version of thesis. Includes bibliographical references (p. 53-55). This thesis investigates two variants of the approximate nearest neighbor problem. First, motivated by the recent research on diversity-aware search, we investigate the k-diverse near neighbor reporting problem. The problem is defined as follows: given a query point q, report the maximum-diversity set S of k points in the ball of radius r around q. The diversity of a set S is measured by the minimum distance between any pair of points in S (the higher, the better). We present two approximation algorithms for the case where the points live in a d-dimensional Hamming space. Our algorithms guarantee query times that are sub-linear in n and only polynomial in the diversity parameter k, as well as the dimension d. For low values of k, our algorithms achieve sub-linear query times even if the number of points within distance r from a query q is linear in n. To the best of our knowledge, these are the first known algorithms of this type that offer provable guarantees. In the other variant, we consider the approximate line near neighbor (LNN) problem. Here, the database consists of a set of lines instead of points, but the query is still a point. Let L be a set of n lines in the d-dimensional Euclidean space R^d. The goal is to preprocess the set of lines so that we can answer Line Near Neighbor (LNN) queries in sub-linear time. That is, given the query point ... we want to report a line ... (if there is any), such that ... for some threshold value r, where ... is the Euclidean distance between them. We start by illustrating the solution to the problem in the case where there are only two lines in the database and present a data structure for this case. Then we show a recursive algorithm that merges these data structures and solves the problem for the general case of n lines. The algorithm has polynomial space and performs only a logarithmic number of calls to the approximate nearest neighbor subproblem. By Sepideh Mahabadi. S.M.
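The LNN query predicate reduces to a point-to-line distance test. The sketch below shows that geometric primitive and a naive O(n) linear scan; the recursive sub-linear data structure of the thesis is not reproduced here, and the function names are illustrative.

```python
import numpy as np

def point_line_distance(x, a, d):
    """Euclidean distance from point x to the line {a + t*d : t in R},
    where d is a nonzero direction vector."""
    d = d / np.linalg.norm(d)
    v = x - a
    # Remove the component of v along the line; what remains is orthogonal.
    return np.linalg.norm(v - np.dot(v, d) * d)

def linear_scan_lnn(lines, x, r):
    """Naive reference query: report some line (a, d) within distance r
    of the query point x, or None if no such line exists."""
    for a, d in lines:
        if point_line_distance(x, a, d) <= r:
            return (a, d)
    return None
```

The linear scan is the trivial baseline that the thesis's data structure improves upon: it inspects every line, whereas the recursive construction answers the same predicate with only a logarithmic number of approximate nearest neighbor calls.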

    A Near-Linear Algorithm for Projective Clustering Integer Points

    We consider the problem of projective clustering in Euclidean spaces of non-fixed dimension. Here, we are given a set P of n points in R^m and integers j ≄ 1, k ≄ 0, and the goal is to find j k-dimensional subspaces so that the sum of the distances of each point in P to the nearest subspace is minimized. Observe that this is a shape-fitting problem where we wish to find the best fit in the L1 sense. Here we treat the number j of subspaces we want to fit and the dimension k of each of them as constants. We consider instances of projective clustering where the point coordinates are integers of magnitude polynomial in m and n. Our main result is a randomized algorithm that for any Δ > 0 runs in time O(mn polylog(mn)) and outputs a solution that with high probability is within (1 + Δ) of the optimal solution. To obtain this result, we show that the fixed-dimensional version of the above projective clustering problem has a small coreset. We do that by observing that, in a fairly general sense, shape-fitting problems that have small coresets in the L∞ setting also have small coresets in the L1 setting, and then exploiting an existing construction for the L∞ setting. This observation seems to be quite useful for other shape-fitting problems as well, as we demonstrate by constructing the first “regular” coreset for the circle-fitting problem in the plane.
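To pin down the L1 objective, the sketch below evaluates the projective clustering cost of a candidate solution: for each point, the distance to the nearest of the j subspaces (each given by an orthonormal basis), summed over P. This is just the cost function, not the paper's near-linear algorithm, and the names are assumptions.

```python
import numpy as np

def dist_to_subspace(p, B):
    """Euclidean distance from point p to the column span of B,
    where B has orthonormal columns (a basis of a k-subspace)."""
    return np.linalg.norm(p - B @ (B.T @ p))

def projective_clustering_cost(P, subspaces):
    """L1 shape-fitting cost: sum over points of the distance
    to the nearest of the given k-subspaces."""
    return sum(min(dist_to_subspace(p, B) for B in subspaces) for p in P)
```

With j = k = 1 in the plane this reduces to summing point-to-line distances for lines through the origin, which makes the "best fit in the L1 sense" phrasing concrete.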

    Coresets and streaming algorithms for the k-means problem and related clustering objectives

    The k-means problem seeks a clustering that minimizes the sum of squared errors cost function: for input points P from the Euclidean space R^d and any solution consisting of k centers from R^d, the cost is the sum of the squared distances of each point to its closest center. This thesis studies concepts used for large input point sets. For inputs with many points, the term coreset refers to a reduced version with fewer, but weighted, points. For inputs with high-dimensional points, dimensionality reduction is used to reduce the number of dimensions. In both cases, the reduced version has to maintain the cost function up to an epsilon-fraction for all choices of k centers. We study coreset constructions and dimensionality reductions for the k-means problem. Further, we develop coreset constructions in the data stream model. Here, the data is so large that it should only be read once and cannot be stored in main memory. The input might even arrive as a stream of points in an arbitrary order. Thus, a data stream algorithm has to continuously process the input while it arrives and can only store small summaries. In the second part of the thesis, the obtained results are extended to related clustering objectives. Projective clustering minimizes the squared distances to k subspaces instead of k points. Kernel k-means is an extension of k-means to scenarios where the target clustering is not linearly separable. In addition to these extensions, we study coreset constructions for a probabilistic clustering problem where input points are given as distributions over a finite set of locations.
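The streaming constructions mentioned above rest on the merge-and-reduce technique: buffer the stream into blocks, summarize each block, and repeatedly merge summaries of equal "level" like carries in binary addition, so that only logarithmically many summaries are held at once. Below is a minimal generic skeleton of that scheme; the `halve` reducer is a toy weight-preserving stand-in for a real coreset construction, and all names are illustrative.

```python
def merge_and_reduce(stream, block_size, reduce_fn):
    """Streaming merge-and-reduce skeleton. reduce_fn maps a list of
    (point, weight) pairs to a smaller such list. Summaries are kept
    per level; merging two same-level summaries produces one summary
    one level higher, like carry propagation in binary addition."""
    levels = {}  # level -> summary held at that level
    buf = []

    def push(summary, level):
        while level in levels:
            summary = reduce_fn(levels.pop(level) + summary)
            level += 1
        levels[level] = summary

    for p in stream:
        buf.append((p, 1.0))
        if len(buf) == block_size:
            push(reduce_fn(buf), 0)
            buf = []
    if buf:
        push(reduce_fn(buf), 0)
    # Final summary: union of the remaining per-level summaries.
    return [pw for s in levels.values() for pw in s]

def halve(pairs):
    """Toy reducer: keep every other point with doubled weight, so the
    total weight (and hence any weighted cost estimate) is preserved
    for uniform-weight, even-length inputs."""
    if len(pairs) <= 4:
        return pairs
    return [(p, 2 * w) for (p, w) in pairs[::2]]
```

In a real instantiation, `reduce_fn` would be an epsilon-coreset construction; because errors compound once per level, the per-level accuracy is set to epsilon divided by the logarithmic number of levels.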