381 research outputs found

    Providing Diversity in K-Nearest Neighbor Query Results

    Given a point query Q in multi-dimensional space, K-Nearest Neighbor (KNN) queries return the K closest answers in the database with respect to Q, according to a given distance metric. In this scenario, it is possible that a majority of the answers may be very similar to one another, especially when the data has clusters. For a variety of applications, such homogeneous result sets may not add value to the user. In this paper, we consider the problem of providing diversity in the results of KNN queries, that is, to produce the closest result set such that each answer is sufficiently different from the rest. We first propose a user-tunable definition of diversity, and then present an algorithm, called MOTLEY, for producing a diverse result set as per this definition. Through a detailed experimental evaluation on real and synthetic data, we show that MOTLEY can produce diverse result sets by reading only a small fraction of the tuples in the database. Further, it imposes no additional overhead on the evaluation of traditional KNN queries, thereby providing a seamless interface between diversity and distance. Comment: 20 pages, 11 figures
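    The idea of trading distance for diversity can be illustrated with a naive greedy filter. This is a minimal sketch, not the MOTLEY algorithm: it assumes Euclidean distance for both closeness and diversity, and `diverse_knn` and its `min_div` threshold are hypothetical names introduced here for illustration only.

```python
import math

def diverse_knn(points, query, k, min_div):
    """Greedy sketch (an assumption, not MOTLEY): scan points by increasing
    distance to the query and keep a point only if it lies at least
    min_div away from every point kept so far."""
    ordered = sorted(points, key=lambda p: math.dist(p, query))
    result = []
    for p in ordered:
        if all(math.dist(p, r) >= min_div for r in result):
            result.append(p)
            if len(result) == k:
                break
    return result

# Three near-duplicates close to the query, plus two distinct points.
points = [(1.0, 0.0), (1.1, 0.0), (1.2, 0.0), (0.0, 3.0), (5.0, 5.0)]
plain = diverse_knn(points, (0.0, 0.0), 3, 0.0)    # min_div = 0: ordinary KNN
diverse = diverse_knn(points, (0.0, 0.0), 3, 1.0)  # near-duplicates are skipped
```

    With `min_div` set to 0 the filter reduces to an ordinary KNN scan, which mirrors the abstract's point that diversity should not add overhead to traditional queries; this sketch sorts the whole dataset, however, and makes no attempt to read only a fraction of the tuples.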

    Textually Relevant Spatial Skylines


    Querying Spatial Data by Dominators in Neighborhood


    Enhancing SpatialHadoop with Closest Pair Queries

    Given two datasets P and Q, the K Closest Pair Query (KCPQ) finds the K closest pairs of objects from P × Q. It is an operation widely adopted by many spatial and GIS applications. As a combination of the K Nearest Neighbor (KNN) and the spatial join queries, KCPQ is an expensive operation. Given the increasing volume of spatial data, it is difficult to perform a KCPQ on a centralized machine efficiently. For this reason, this paper addresses the problem of computing the KCPQ on big spatial datasets in SpatialHadoop, an extension of Hadoop that supports spatial operations efficiently, and proposes a novel algorithm in SpatialHadoop to perform efficient parallel KCPQ on large-scale spatial datasets. We have evaluated the performance of the algorithm in several situations with big synthetic and real-world datasets. The experiments have demonstrated the efficiency and scalability of our proposal.
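    The KCPQ definition above can be made concrete with a brute-force, single-machine baseline. This is only a sketch of the query semantics, assuming Euclidean distance and points given as tuples; it is not the parallel SpatialHadoop algorithm, and `k_closest_pairs` is a hypothetical name.

```python
import heapq
import itertools
import math

def k_closest_pairs(P, Q, k):
    """Brute-force KCPQ: examine every pair in P x Q and keep the k pairs
    with the smallest Euclidean distance (O(|P| * |Q|) distance evaluations)."""
    pairs = itertools.product(P, Q)
    return heapq.nsmallest(k, pairs, key=lambda pq: math.dist(pq[0], pq[1]))

P = [(0.0, 0.0), (10.0, 10.0)]
Q = [(1.0, 0.0), (10.0, 12.0), (50.0, 50.0)]
closest = k_closest_pairs(P, Q, 2)
```

    The quadratic number of pairs is what makes the query expensive on large inputs and motivates partitioned, parallel evaluation.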

    Spatial skyline query problem in Euclidean and road-network spaces

    With the growth of data-intensive applications, along with the increase of both size and dimensionality of data, queries with advanced semantics have recently drawn researchers’ attention. The skyline query problem is one of them, which produces optimal results based on user preferences. In this thesis, we study the problem of spatial skyline queries in the Euclidean and road-network spaces. For a given data set P, we are required to compute the spatial skyline points of P with respect to an arbitrary query set Q. A point p ∈ P is a spatial skyline point if and only if, for any other data point r ∈ P, p is closer than r to at least one query point q ∈ Q and is at most as far as r from the rest of the query points. We propose several efficient algorithms that outperform the existing algorithms.
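    A quadratic-time sketch of a spatial skyline in Euclidean space follows. It assumes the commonly used non-dominance formulation (a point is kept if no other point is at least as close to every query point and strictly closer to one), which may differ in detail from the thesis's own wording; `dominates` and `spatial_skyline` are hypothetical names, and this is not one of the thesis's algorithms.

```python
import math

def dominates(r, p, Q):
    """r spatially dominates p if r is no farther than p from every query
    point and strictly closer to at least one of them."""
    dr = [math.dist(r, q) for q in Q]
    dp = [math.dist(p, q) for q in Q]
    return all(a <= b for a, b in zip(dr, dp)) and any(a < b for a, b in zip(dr, dp))

def spatial_skyline(P, Q):
    """Naive O(|P|^2 * |Q|) scan: keep every point that no other point dominates."""
    return [p for p in P if not any(dominates(r, p, Q) for r in P if r != p)]

Q = [(0.0, 0.0), (4.0, 0.0)]
P = [(1.0, 0.0), (3.0, 0.0), (2.0, 5.0)]
skyline = spatial_skyline(P, Q)  # (2.0, 5.0) is farther from both query points
```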

    Intelligent search in social communities of smartphone users

    Social communities of smartphone users have recently gained significant interest due to their wide social penetration. The applications in this domain, however, currently rely on centralized or cloud-like architectures for data sharing and searching tasks, introducing both data-disclosure and performance concerns. In this paper, we present a distributed search architecture for intelligent search of objects in a mobile social community. Our framework, coined SmartOpt, is founded on an in-situ data storage model, where captured objects remain local on smartphones and searches then take place over an intelligent multi-objective lookup structure we compute dynamically. Our MO-QRT structure optimizes several conflicting objectives, using a multi-objective evolutionary algorithm that calculates a diverse set of high-quality non-dominated solutions in a single run. Then a decision-making subsystem is utilized to tune the retrieval preferences of the query user. We assess our ideas using both trace-driven experiments with mobility and social patterns derived from Microsoft’s GeoLife project, DBLP and Pics ‘n’ Trails, and our real Android SmartP2P system deployed over our SmartLab testbed of 40+ smartphones. Our study reveals that SmartOpt yields high query recall rates of 95%, with one order of magnitude less time and two orders of magnitude less energy than its competitors.

    Design and analysis of algorithms for similarity search based on intrinsic dimension

    One of the most fundamental operations employed in data mining tasks such as classification, cluster analysis, and anomaly detection is that of similarity search. It has been used in numerous fields of application such as multimedia, information retrieval, recommender systems and pattern recognition. Specifically, a similarity query aims to retrieve from the database the objects most similar to a query object, where the underlying similarity measure is usually expressed as a distance function. The cost of processing similarity queries has typically been assessed in terms of the representational dimension of the data involved, that is, the number of features used to represent individual data objects. It is generally the case that high representational dimension results in a significant increase in the processing cost of similarity queries. This relation is often attributed to an amalgamation of phenomena, collectively referred to as the curse of dimensionality. However, the observed effects of dimensionality in practice may not be as severe as expected. This has led to the development of models quantifying the complexity of data in terms of some measure of the intrinsic dimensionality. The generalized expansion dimension (GED) is one such model, which estimates the intrinsic dimension in the vicinity of a query point q through the observation of the ranks and distances of pairs of neighbors with respect to q. This dissertation is mainly concerned with the design and analysis of search algorithms based on the GED model. In particular, three variants of the similarity search problem are considered: adaptive similarity search, flexible aggregate similarity search, and subspace similarity search. The good practical performance of the proposed algorithms demonstrates the effectiveness of dimensionality-driven design of search algorithms.
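    As a rough illustration of estimating intrinsic dimension from neighbor ranks and distances, the sketch below uses the ratio log(k2/k1) / log(d2/d1), where d1 and d2 are the distances from q to its k1-th and k2-th nearest neighbors. This formula is an assumption here (the abstract does not state it), and `ged_estimate` is a hypothetical name rather than the dissertation's code.

```python
import math

def ged_estimate(points, q, k1, k2):
    """Assumed estimator: log(k2 / k1) / log(d2 / d1), with d1 and d2 the
    distances from q to its k1-th and k2-th nearest neighbors (k1 < k2)."""
    dists = sorted(math.dist(p, q) for p in points)
    d1, d2 = dists[k1 - 1], dists[k2 - 1]
    return math.log(k2 / k1) / math.log(d2 / d1)

# Evenly spaced points on a line: the k-th neighbor of q lies at distance k,
# so the estimate should come out as dimension 1.
line = [(float(i),) for i in range(1, 11)]
estimate = ged_estimate(line, (0.0,), 2, 8)
```

    The estimate depends on the pair of ranks chosen and is undefined when the two neighbor distances coincide, so a practical use would need to handle ties.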