7 research outputs found

    Diverse near neighbor problem

    Get PDF
    Motivated by the recent research on diversity-aware search, we investigate the k-diverse near neighbor reporting problem. The problem is defined as follows: given a query point q, report the maximum diversity set S of k points in the ball of radius r around q. The diversity of a set S is measured by the minimum distance between any pair of points in SS (the higher, the better). We present two approximation algorithms for the case where the points live in a d-dimensional Hamming space. Our algorithms guarantee query times that are sub-linear in n and only polynomial in the diversity parameter k, as well as the dimension d. For low values of k, our algorithms achieve sub-linear query times even if the number of points within distance r from a query qq is linear in nn. To the best of our knowledge, these are the first known algorithms of this type that offer provable guarantees.Charles Stark Draper LaboratoryNational Science Foundation (U.S.) (Award NSF CCF-1012042)David & Lucile Packard Foundatio

    Computing Optimal Kernels in Two Dimensions

    Full text link
    Let PP be a set of nn points in R2\mathbb{R}^2. A subset C⊆PC\subseteq P is an Δ\varepsilon-kernel of PP if the projection of the convex hull of CC approximates that of PP within (1−Δ)(1-\varepsilon)-factor in every direction. The set CC is a weak Δ\varepsilon-kernel if its directional width approximates that of PP in every direction. We present fast algorithms for computing a minimum-size Δ\varepsilon-kernel as well as a weak Δ\varepsilon-kernel. We also present a fast algorithm for the Hausdorff variant of this problem. In addition, we introduce the notion of Δ\varepsilon-core, a convex polygon lying inside CH(P)CH(P), prove that it is a good approximation of the optimal Δ\varepsilon-kernel, present an efficient algorithm for computing it, and use it to compute an Δ\varepsilon-kernel of small size

    Approximate nearest neighbor and its many variants

    Get PDF
    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013.Cataloged from PDF version of thesis.Includes bibliographical references (p. 53-55).This thesis investigates two variants of the approximate nearest neighbor problem. First, motivated by the recent research on diversity-aware search, we investigate the k-diverse near neighbor reporting problem. The problem is defined as follows: given a query point q, report the maximum diversity set S of k points in the ball of radius r around q. The diversity of a set S is measured by the minimum distance between any pair of points in S (the higher, the better). We present two approximation algorithms for the case where the points live in a d-dimensional Hamming space. Our algorithms guarantee query times that are sub-linear in n and only polynomial in the diversity parameter k, as well as the dimension d. For low values of k, our algorithms achieve sub-linear query times even if the number of points within distance r from a query q is linear in n. To the best of our knowledge, these are the first known algorithms of this type that offer provable guarantees. In the other variant, we consider the approximate line near neighbor (LNN) problem. Here, the database consists of a set of lines instead of points but the query is still a point. Let L be a set of n lines in the d dimensional euclidean space Rd. The goal is to preprocess the set of lines so that we can answer the Line Near Neighbor (LNN) queries in sub-linear time. That is, given the query point ... we want to report a line ... (if there is any), such that ... for some threshold value r, where ... is the euclidean distance between them. We start by illustrating the solution to the problem in the case where there are only two lines in the database and present a data structure in this case. Then we show a recursive algorithm that merges these data structures and solve the problem for the general case of n lines. The algorithm has polynomial space and performs only a logarithmic number of calls to the approximate nearest neighbor subproblem.by Sepideh Mahabadi.S.M

    Robust shape fitting via peeling and grating coresets

    No full text
    Let P be a set of n points in R d. A subset S of P is called a (k, Δ)-kernel if for every direction, the direction width of S Δ-approximates that of P, when k “outliers ” can be ignored in that direction. We show that a (k, Δ)-kernel of P of size O(k/Δ (d−1)/2) can be computed in time O(n+k 2 /Δ d−1). The new algorithm works by repeatedly “peeling” away (0, Δ)-kernels from the point set. We also present a simple Δ-approximation algorithm for fitting various shapes through a set of points with at most k outliers. The algorithm is incremental and works by repeatedly “grating ” critical points into a working set, till the working set provides the required approximation. We prove that the size of the working set is independent of n, and thus results in a simple and practical, near-linear Δ-approximation algorithm for shape fitting with outliers in low dimensions. We demonstrate the practicality of our algorithms by showing their empirical performance on various inputs and problems.

    Algorithms for continuous queries: A geometric approach

    Get PDF
    <p>There has been an unprecedented growth in both the amount of data and the number of users interested in different types of data. Users often want to keep track of the data that match their interests over a period of time. A continuous query, once issued by a user, maintains the matching results for the user as new data (as well as updates to the existing data) continue to arrive in a stream. However, supporting potentially millions of continuous queries is a huge challenge. This dissertation addresses the problem of scalably processing a large number of continuous queries over a wide-area network. </p><p>Conceptually, the task of supporting distributed continuous queries can be divided into two components--event processing (computing the set of affected users for each data update) and notification dissemination (notifying the set of affected users). The first part of this dissertation focuses on event processing. Since interacting with large-scale data can easily frustrate and overwhelm the users, top-k queries have attracted considerable interest from the database community as they allow users to focus on the top-ranked results only. However, it is nearly impossible to find a set of common top-ranked data that everyone is interested in, therefore, users are allowed to specify their interest in different forms of preferences, such as personalized ranking function and range selection. This dissertation presents geometric frameworks, data structures, and algorithms for answering several types of preference queries efficiently. Experimental evaluations show that our approaches outperform the previous ones by orders of magnitude.</p><p>The second part of the dissertation presents comprehensive solutions to the problem of processing and notifying a large number of continuous range top-k queries across a wide-area network. Simple solutions include using a content-driven network to notify all continuous queries whose ranges contain the update (ignoring top-k), or using a server to compute only the affected continuous queries and notifying them individually. The former solution generates too much network traffic, while the latter overwhelms the server. This dissertation presents a geometric framework which allows the set of affected continuous queries to be described succinctly with messages that can be efficiently disseminated using content-driven networks. Fast algorithms are also developed to reformulate each update into a set of messages whose number is provably optimal, with or without knowing all continuous queries. </p><p>The final component of this dissertation is the design of a wide-area dissemination network for continuous range queries. In particular, this dissertation addresses the problem of assigning users to servers in a wide-area content-based publish/subscribe system. A good assignment should consider both users' interests and locations, and balance multiple performance criteria including bandwidth, delay, and load balance. This dissertation presents a Monte Carlo approximation algorithm as well as a simple greedy algorithm. The Monte Carlo algorithm jointly considers multiple performance criteria to find a broker-subscriber assignment and provides theoretical performance guarantees. Using this algorithm as a yardstick, the greedy algorithm is also concluded to work well across a wide range of workloads.</p>Dissertatio
    corecore