2,001 research outputs found

    Distal Dynamic Spatial Approximation Forest

    Querying large datasets by proximity, using a distance under the metric space model, has many applications in multimedia, pattern recognition, statistics, and other fields. The number of indexes and algorithms for proximity querying keeps growing, yet only a handful of indexes perform well without user intervention to select parameters. One such index is the Distal Spatial Approximation Tree (DiSAT), which is parameter-free and has proven very efficient, outperforming other approaches. The main drawback of the DiSAT is its static nature: once built, it is difficult to add or remove elements. This drawback prevents the use of the DiSAT in many interesting applications. In this paper we overcome this weakness. We use a standard technique, the Bentley and Saxe algorithm, to produce a new index, the DiSAF, which is dynamic while retaining the simplicity and appeal for practitioners of the DiSAT. To improve the performance of the DiSAF, we do not apply the Bentley and Saxe technique directly; instead, we enhance its application by taking advantage of our deep knowledge of DiSAT behavior. XIII Workshop Bases de datos y Minería de Datos (WBDMD). Red de Universidades con Carreras en Informática (RedUNCI)
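
    The Bentley and Saxe technique mentioned in this abstract can be illustrated with a minimal sketch. It shows only the plain logarithmic method, not the enhanced variant the authors describe. `StaticIndex` is a hypothetical stand-in for a DiSAT (it just scans its points linearly), and all class and method names are assumptions made for this illustration.

```python
from typing import Callable, List, Optional


class StaticIndex:
    """Hypothetical stand-in for a static metric index such as a DiSAT.

    It only stores its points and answers range queries by linear scan;
    a real DiSAT would build a tree here.
    """

    def __init__(self, points: list, dist: Callable):
        self.points = list(points)
        self.dist = dist

    def range_query(self, q, r: float) -> list:
        return [p for p in self.points if self.dist(p, q) <= r]


class LogarithmicForest:
    """Plain Bentley-Saxe dynamization: slot i is empty or holds 2**i points."""

    def __init__(self, dist: Callable):
        self.dist = dist
        self.slots: List[Optional[StaticIndex]] = []

    def insert(self, x) -> None:
        # Gather the new element plus every full slot below the first empty
        # one, like a carry in binary addition, then rebuild one static index.
        carry = [x]
        i = 0
        while i < len(self.slots) and self.slots[i] is not None:
            carry.extend(self.slots[i].points)
            self.slots[i] = None
            i += 1
        if i == len(self.slots):
            self.slots.append(None)
        self.slots[i] = StaticIndex(carry, self.dist)

    def range_query(self, q, r: float) -> list:
        # Range search is decomposable: the answer over the whole set is the
        # union of the answers over the individual static indexes.
        out = []
        for slot in self.slots:
            if slot is not None:
                out.extend(slot.range_query(q, r))
        return out

    def sizes(self) -> List[int]:
        return [len(s.points) if s is not None else 0 for s in self.slots]
```

    With this scheme a collection of n elements is spread over at most about log2(n) + 1 static indexes, and each element takes part in a logarithmic number of rebuilds, which is why the technique suits indexes that are cheap to build but hard to update.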

    All Near Neighbor Graph Without Searching

    Given a collection of n objects equipped with a distance function d(·, ·), computing the Nearest Neighbor Graph (NNG) consists of finding the nearest neighbor of each object in the collection. Without an index, the total cost of building the NNG is quadratic in n. With an index, the cost would be sub-quadratic if the search for each individual item were sublinear. Unfortunately, due to the so-called curse of dimensionality, the indexed and brute-force methods are almost equally inefficient. In this paper we present an efficient algorithm to build the Near Neighbor Graph (nNG), an approximation of the NNG, using only the index construction, without actually searching for objects. Facultad de Informática
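
    The quadratic baseline mentioned in this abstract can be made concrete with a short sketch. It shows only the brute-force NNG, not the authors' index-based nNG construction, whose details the abstract does not give. The function name and the evaluation counter are assumptions made for this illustration.

```python
from typing import Callable, Dict, Sequence, Tuple


def brute_force_nng(objects: Sequence, dist: Callable) -> Tuple[Dict[int, int], int]:
    """Return (nng, evaluations).

    nng maps the position of each object to the position of its nearest
    neighbor; evaluations counts the calls made to the distance function.
    """
    n = len(objects)
    nng: Dict[int, int] = {}
    evaluations = 0
    for i in range(n):
        best_j, best_d = None, float("inf")
        for j in range(n):
            if i == j:
                continue  # an object is not its own neighbor
            d = dist(objects[i], objects[j])
            evaluations += 1
            if d < best_d:  # ties keep the first neighbor found
                best_j, best_d = j, d
        nng[i] = best_j
    return nng, evaluations
```

    For n objects this version evaluates the distance n(n − 1) times. Exploiting symmetry would halve that figure, but the cost would remain quadratic.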

    Approximate Nearest Neighbor Graph via Index Construction

    Given a collection of objects in a metric space, the Nearest Neighbor Graph (NNG) associates each node with its closest neighbor under the given metric. It can be obtained trivially by computing the nearest neighbor of every object. To avoid computing the distance for every pair, an index could be constructed. Unfortunately, due to the curse of dimensionality, the indexed and brute-force methods are almost equally inefficient. This draws attention to algorithms that compute approximate versions of the NNG. The DiSAT is a hierarchical proximity searching tree. The root computes the distances to all objects, and each child of the root recursively computes the distances to all objects in its subtree. The top levels therefore yield an accurate computation of the nearest neighbor, and this information becomes less accurate as we descend the tree. If we perform a few rebuilds of the index, taking deep nodes in each iteration and keeping track of the closest known neighbor, it is possible to compute an Approximate NNG (ANNG). Accordingly, in this work we propose to obtain the ANNG by this approach, without performing any search. We tested this proposal on both synthetic and real-world databases, with good results in both cost and response quality. XIII Workshop Bases de datos y Minería de Datos (WBDMD). Red de Universidades con Carreras en Informática (RedUNCI)

    Indexing Metric Spaces for Exact Similarity Search

    With the continued digitalization of societal processes, we are seeing an explosion in available data, referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity and variety. Many studies address volume or velocity, while far fewer concern variety. Metric space is ideal for addressing variety because it can accommodate any type of data as long as its associated distance notion satisfies the triangle inequality. To accelerate search in metric space, a collection of indexing techniques for metric data has been proposed. However, existing surveys each offer only narrow coverage, and no comprehensive empirical study of those techniques exists. We offer a survey of all the existing metric indexes that can support exact similarity search, by i) summarizing all the existing partitioning, pruning and validation techniques used for metric indexes, ii) providing the time and storage complexity analysis of index construction, and iii) reporting on a comprehensive empirical comparison of their similarity query processing performance. Empirical comparisons are used to evaluate index performance during search because the complexity analyses differ little for similarity query processing, and because query performance depends on pruning and validation abilities that are tied to the data distribution. This article aims to reveal the strengths and weaknesses of different indexing techniques in order to offer guidance on selecting an appropriate indexing technique for a given setting, and to direct future research on metric indexes
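
    As a small illustration of how the triangle inequality supports pruning in exact search, the sketch below uses a single pivot: an object o can be discarded without computing d(q, o) whenever |d(q, p) − d(p, o)| > r. This is a generic textbook filter, not a technique attributed to this survey, and the function names are assumptions made for this illustration.

```python
from typing import Callable, List, Sequence, Tuple


def build_pivot_table(objects: Sequence, pivot, dist: Callable) -> List[float]:
    """Precompute d(pivot, o) for every object (done once, at indexing time)."""
    return [dist(pivot, o) for o in objects]


def pivot_range_query(objects: Sequence, table: Sequence[float], pivot,
                      dist: Callable, q, r: float) -> Tuple[list, int]:
    """Exact range query; returns (results, query-time distance computations)."""
    d_qp = dist(q, pivot)
    computed = 1  # the distance from the query to the pivot
    results = []
    for o, d_po in zip(objects, table):
        # Triangle inequality: d(q, o) >= |d(q, p) - d(p, o)|, so if this
        # lower bound already exceeds r the object cannot be an answer.
        if abs(d_qp - d_po) > r:
            continue
        computed += 1
        if dist(q, o) <= r:
            results.append(o)
    return results, computed
```

    The filter never discards a true answer, so the search stays exact. How many objects it prunes depends on the pivot and on the data distribution.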

    Decomposability of DiSAT for Index Dynamization

    The Distal Spatial Approximation Tree (DiSAT) is one of the most competitive indexes for exact proximity searching. Its most salient feature, the absence of parameters, makes the index a suitable choice for practitioners. Its most serious drawback is its static nature: it does not allow further insertions once built. On the other hand, there is an old approach by Bentley and Saxe (BS) that allows the dynamization of decomposable data structures. The only requirement is to provide a decomposition operation. This is precisely our contribution: we define a decomposition operation that allows the BS technique to be applied. The resulting data structure is competitive with its static counterparts. Facultad de Informática

    Reference point hyperplane trees

    Our context of interest is tree-structured exact search in metric spaces. We make the simple observation that the deeper a data item is within the tree, the higher the probability of that item being excluded from a search. Assuming a fixed and independent probability p of any subtree being excluded at query time, the probability of an individual data item being accessed is (1−p)^d for a node at depth d. In a balanced binary tree half of the data will be at the maximum depth of the tree, so this effect should be significant and observable. We test this hypothesis with two experiments on partition trees. First, we force a balance by adjusting the partition/exclusion criteria, and compare this with unbalanced trees where the mean data depth is greater. Second, we compare a generic hyperplane tree with a monotone hyperplane tree, where the mean depth is also greater. In both cases the tree with the greater mean data depth performs better in high-dimensional spaces. We then experiment with increasing the mean depth of nodes by using a small, fixed set of reference points to make exclusion decisions over the whole tree, so that almost all of the data resides at the maximum depth. Again this can be seen to reduce the overall cost of indexing. Furthermore, we observe that, having already calculated reference point distances for all data, a final filtering can be applied if the distance table is retained. This further reduces the number of distance calculations required, whilst retaining scalability. The final structure can in fact be viewed as a hybrid between a generic hyperplane tree and a LAESA search structure
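
    The access probability stated in this abstract can be explored numerically. The sketch below evaluates (1 − p)^d and, under the extra assumption (not made in the abstract) of a complete binary tree that stores one item per node, the expected fraction of items accessed. The function names and the value p = 0.3 are chosen for illustration only.

```python
def access_probability(p: float, depth: int) -> float:
    """Probability that an item at the given depth is reached, i.e. that none
    of the `depth` subtrees on its root path was excluded."""
    return (1.0 - p) ** depth


def expected_accessed_fraction(p: float, height: int) -> float:
    """Expected fraction of items accessed in a complete binary tree of the
    given height storing one item per node (an illustrative assumption)."""
    total = sum(2 ** d for d in range(height + 1))
    accessed = sum((2 ** d) * access_probability(p, d) for d in range(height + 1))
    return accessed / total


if __name__ == "__main__":
    # Deeper items are accessed with geometrically decreasing probability.
    for depth in range(0, 11, 2):
        print(depth, access_probability(0.3, depth))
    print(expected_accessed_fraction(0.3, 10))
```

    Because the probability decays geometrically with depth, moving data deeper in the tree lowers the expected number of items accessed, which is the effect the abstract sets out to test.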

    Supermetric search

    Metric search is concerned with the efficient evaluation of queries in metric spaces. In general, a large space of objects is arranged in such a way that, when a further object is presented as a query, those objects most similar to the query can be efficiently found. Most mechanisms rely upon the triangle inequality property of the metric governing the space. The triangle inequality property is equivalent to a finite embedding property, which states that any three points of the space can be isometrically embedded in two-dimensional Euclidean space. In this paper, we examine a class of semimetric space that is finitely four-embeddable in three-dimensional Euclidean space. In mathematics this property has been extensively studied and is generally known as the four-point property. All spaces with the four-point property are metric spaces, but they also have some stronger geometric guarantees. We coin the term supermetric space as, in terms of metric search, they are significantly more tractable. Supermetric spaces include all those governed by Euclidean, Cosine, Jensen–Shannon and Triangular distances, and are thus commonly used within many domains. In previous work we have given a generic mathematical basis for the supermetric property and shown how it can improve indexing performance for a given exact search structure. Here we present a full investigation into its use within a variety of different hyperplane partition indexing structures, and go on to show some more of its flexibility by examining a search structure whose partition and exclusion conditions are tailored, at each node, to suit the individual reference points and data set present there. Among the results given, we show a new best performance for exact search using a well-known benchmark