21 research outputs found

    Assessing metric structures on GPGPU environments

    Similarity search consists of retrieving objects from a database that are similar or relevant to a particular query. It is a topic of great interest to the scientific community because of its many fields of application, such as searching for words and images on the World Wide Web, pattern recognition, plagiarism detection, and multimedia databases, among others. It is modeled through metric spaces, in which objects are treated as black boxes about which only the pairwise distances are known; calculating the distance function is costly, and search systems operate at a high query rate. Metric structures have been developed to optimize this process; such structures work as indexes and preprocess the data to reduce the number of distance evaluations during the search. Processing large volumes of data makes the use of such structures infeasible without parallel processing environments. Technologies based on multi-CPU and GPU architectures are among the most widely used because of their cost and performance.
    XV Workshop de Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI).
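    As a rough illustration of how an index can reduce distance evaluations, the sketch below uses a generic pivot table and the triangle inequality to discard objects without computing their distance to the query. It is not the structure evaluated in the paper above; the names `build_pivot_table` and `range_search` are hypothetical, and the code runs sequentially on the CPU, with no GPU parallelism.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Distance = Callable[[T, T], float]


def build_pivot_table(objects: Sequence[T], pivots: Sequence[T],
                      dist: Distance) -> List[List[float]]:
    """Preprocessing step: store d(object, pivot) for every object and pivot."""
    return [[dist(obj, p) for p in pivots] for obj in objects]


def range_search(query: T, radius: float, objects: Sequence[T],
                 pivots: Sequence[T], table: List[List[float]],
                 dist: Distance) -> Tuple[List[T], int]:
    """Return objects within `radius` of `query` and the number of
    distance evaluations performed at query time."""
    query_to_pivot = [dist(query, p) for p in pivots]
    evaluations = len(pivots)
    hits = []
    for obj, row in zip(objects, table):
        # Triangle inequality: |d(q, p) - d(o, p)| <= d(q, o) for any pivot p.
        lower = max((abs(qp - op) for qp, op in zip(query_to_pivot, row)),
                    default=0.0)
        if lower > radius:
            continue  # discarded without evaluating d(q, o)
        evaluations += 1
        if dist(query, obj) <= radius:
            hits.append(obj)
    return hits, evaluations


# Toy metric space (an assumption for the example): points on a line.
objects = [0.0, 1.0, 5.0, 9.0, 10.0]
pivots = [0.0, 10.0]
line_distance = lambda a, b: abs(a - b)
table = build_pivot_table(objects, pivots, line_distance)
hits, evaluations = range_search(4.5, 1.0, objects, pivots, table, line_distance)
```

    In this toy run the query needs three distance evaluations (two to the pivots, one to the surviving candidate) instead of the five a sequential scan would perform.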

    Pivot Selection for Median String Problem

    The Median String Problem is W[1]-hard under the Levenshtein distance; thus, approximation heuristics are used. Perturbation-based heuristics have proved very competitive in terms of the trade-off between approximation accuracy and convergence speed. However, the computational burden increases with the size of the set. In this paper, we explore the idea of reducing the size of the problem by selecting a subset of representative elements, i.e. pivots, that are used to compute the approximate median instead of the whole set. We aim to reduce the computation time through a reduction of the problem size while achieving similar approximation accuracy. We explain how we find those pivots and how to compute the median string from them. Results on commonly used test data suggest that our approach can reduce the computational requirements (measured in computed edit distances) by 88% with approximation accuracy as good as the state-of-the-art heuristic. This work has been supported in part by CONICYT-PCHA/Doctorado Nacional/2014-63140074 through a Ph.D. Scholarship; Universidad Católica de la Santísima Concepción through the research project DIN-01/2016; European Union's Horizon 2020 under the Marie Skłodowska-Curie grant agreement 690941; Millennium Institute for Foundational Research on Data (IMFD); FONDECYT-CONICYT grant number 1170497; and for O. Pedreira, Xunta de Galicia/FEDER-UE refs. CSI ED431G/01 and GRC: ED431C 2017/58.
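    To make the cost measure concrete, the sketch below implements the Levenshtein distance and the sum-of-distances objective that a median string minimizes. The abstract does not describe how pivots are chosen or how the perturbation step works, so this sketch substitutes a simple set median (the best string among given candidates) scored against a caller-supplied reference set; `total_distance` and `set_median` are hypothetical names.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def total_distance(candidate: str, reference: Sequence[str]) -> int:
    """Objective minimized by a median string: sum of edit distances."""
    return sum(levenshtein(candidate, s) for s in reference)


def set_median(candidates: Sequence[str], reference: Sequence[str]) -> str:
    """Candidate with the smallest total distance to the reference strings.
    This is a set median, not the perturbation heuristic from the paper."""
    return min(candidates, key=lambda c: total_distance(c, reference))


strings = ["cat", "cart", "bat", "cab"]
median = set_median(strings, strings)
```

    Scoring each candidate against a subset of pivots rather than the whole set lowers the number of edit-distance computations from (number of candidates × set size) to (number of candidates × number of pivots), which is the kind of saving the abstract reports.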

    Ptolemaic Indexing

    This paper discusses a new family of bounds for use in similarity search, related to those used in metric indexing but based on Ptolemy's inequality rather than the metric axioms. Ptolemy's inequality holds for the well-known Euclidean distance, but is also shown here to hold for quadratic form metrics in general, with the Mahalanobis distance as an important special case. The inequality is examined empirically on both synthetic and real-world data sets and is also found to hold approximately, with a very low degree of error, for important distances such as the angular pseudometric and several Lp norms. Indexing experiments demonstrate substantially greater filtering power than existing triangular methods. It is also shown that combining Ptolemaic and triangular filtering can lead to better results than using either approach on its own.
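    For readers unfamiliar with the bound, Ptolemy's inequality applied to a query q, an object o and two pivots p, s gives d(q,p)·d(o,s) ≤ d(q,o)·d(p,s) + d(q,s)·d(o,p), which rearranges to the lower bound d(q,o) ≥ |d(q,p)·d(o,s) − d(q,s)·d(o,p)| / d(p,s). The sketch below computes this bound next to the usual triangular bound for Euclidean points. The function names and the sample points are assumptions made for illustration, not taken from the paper.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def triangle_bound(q: Point, o: Point, pivots: Sequence[Point]) -> float:
    """Largest lower bound on d(q, o) from |d(q, p) - d(o, p)| over pivots p."""
    return max((abs(math.dist(q, p) - math.dist(o, p)) for p in pivots),
               default=0.0)


def ptolemaic_bound(q: Point, o: Point, pivots: Sequence[Point]) -> float:
    """Largest lower bound on d(q, o) from Ptolemy's inequality over pivot pairs."""
    best = 0.0
    for p, s in combinations(pivots, 2):
        d_ps = math.dist(p, s)
        if d_ps == 0.0:
            continue  # coincident pivots give no usable bound
        bound = abs(math.dist(q, p) * math.dist(o, s)
                    - math.dist(q, s) * math.dist(o, p)) / d_ps
        best = max(best, bound)
    return best


def combined_bound(q: Point, o: Point, pivots: Sequence[Point]) -> float:
    """Both bounds are valid for Euclidean distance, so their maximum is too."""
    return max(triangle_bound(q, o, pivots), ptolemaic_bound(q, o, pivots))


q, o = (0.0, 0.0), (3.0, 4.0)
pivots = [(1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
true_distance = math.dist(q, o)
lower = combined_bound(q, o, pivots)
```

    An index can discard `o` for a range query of radius r whenever `lower > r`, without evaluating `math.dist(q, o)` itself. Which of the two bounds is tighter depends on the data and the pivots.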

    Exploiting subspace distance equalities in high-dimensional data for knn queries

    Efficient k-nearest neighbor computation for high-dimensional data is an important yet challenging task. The response times of state-of-the-art indexing approaches depend heavily on factors such as the distribution of the data. For clustered data, such approaches are several times faster than a sequential scan. However, if several dimensions contain uniform or Gaussian data, they tend to be clearly outperformed by a simple sequential scan. Hence, we require an approach that generally delivers good response times, independent of the data distribution. As a solution, we propose to exploit a novel concept to efficiently compute nearest neighbors. We name it sub-space distance equality, which aims at reducing the number of distance computations independent of the data distribution. We integrate knn computing algorithms into the Elf index structure, which allows us to study the sub-space distance equality concept in isolation and in combination with a main-memory optimized storage layout. In a large comparative study with twelve data sets, our results indicate that indexes based on sub-space distance equalities compute the fewest distances. For clustered data, our Elf knn algorithm delivers a performance increase of at least a factor of two and up to two orders of magnitude, without losing the performance gain compared to sequential scans for uniform or Gaussian data.