22 research outputs found

    Comparison of AESA and LAESA search algorithms using string and tree-edit-distances

    Get PDF
    Although the success rate of handwritten character recognition using a nearest neighbour technique together with edit distance is satisfactory, the exhaustive search is expensive. Some fast methods as AESA and LAESA have been proposed to find nearest neighbours in metric spaces. The average number of distances computed by these algorithms is very low and does not depend on the number of prototypes in the training set. In this paper, we compare the behaviour of these algorithms when string and tree-edit-distances are used.Work partially supported by the spanish CICYT TIC2000-1599-C02 and TIC2000-1703-CO3-02

    Speeding up the cyclic edit distance using LAESA with early abandon

    Get PDF
    The cyclic edit distance between two strings is the minimum edit distance between one of this strings and every possible cyclic shift of the other. This can be useful, for example, in image analysis where strings describe the contour of shapes or in computational biology for classifying circular permuted proteins or circular DNA/RNA molecules. The cyclic edit distance can be computed in O(mnlog m) time, however, in real recognition tasks this is a high computational cost because of the size of databases. A method to reduce the number of comparisons and avoid an exhaustive search is convenient. In this work, we present a new algorithm based on a modification of LAESA (linear approximating and eliminating search algorithm) for applying pruning in the computation of distances. It is an efficient procedure for classification and retrieval of cyclic strings. Experimental results show that our proposal considerably outperforms LAESAWork partially supported by the Spanish Government (TIN2010-18958), and the Generalitat Valenciana (PROMETEOII/2014/062)

    Indexing Metric Spaces for Exact Similarity Search

    Full text link
    With the continued digitalization of societal processes, we are seeing an explosion in available data. This is referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to enable value creation from big data: volume, velocity and variety. Many studies address volume or velocity, while much fewer studies concern the variety. Metric space is ideal for addressing variety because it can accommodate any type of data as long as its associated distance notion satisfies the triangle inequality. To accelerate search in metric space, a collection of indexing techniques for metric data have been proposed. However, existing surveys each offers only a narrow coverage, and no comprehensive empirical study of those techniques exists. We offer a survey of all the existing metric indexes that can support exact similarity search, by i) summarizing all the existing partitioning, pruning and validation techniques used for metric indexes, ii) providing the time and storage complexity analysis on the index construction, and iii) report on a comprehensive empirical comparison of their similarity query processing performance. Here, empirical comparisons are used to evaluate the index performance during search as it is hard to see the complexity analysis differences on the similarity query processing and the query performance depends on the pruning and validation abilities related to the data distribution. This article aims at revealing different strengths and weaknesses of different indexing techniques in order to offer guidance on selecting an appropriate indexing technique for a given setting, and directing the future research for metric indexes

    A new dynamic and adaptive scheme for indexing in metric spaces

    Get PDF
    Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2007.Thesis (Master's) -- Bilkent University, 2007.Includes bibliographical references leaves 67-70Computer Science applications are often concerned with efficient storage and retrieval of data. Well defined structure of traditional databases help to access required query objects effectively using the Relational Database paradigm. However, in recent times, we are faced with the challenges of dealing with unstructured and complex data such as images, video, sound clips and text documents. Multimedia Information Retrieval, Data Mining, Pattern Recognition, Machine Learning, Computer Vision and Biomedical Databases are examples of the fields that require efficient management of complex data. Complex, unstructured type of data often cannot be broken down into well-defined components, and exact matching cannot be applied for defining queries. Instead, the notion of similarity search is used where a query or prototype object is provided by the user and the database retrieves the objects that are similar. One popular approach for similarity searching is to approximate the relationship between database objects by mapping them into a vector space. There are well-known indexing methods in literature that support similarity queries in vector spaces, however, it has been shown that these methods are ineffective for high dimensional data. Another approach is to use Metric Spaces model for indexing. Metric spaces are defined by a distance function that has the triangular inequality property. Since there are no assumptions about the structure of the data itself, they constitute a higher level abstraction and thus have more applicability. They have also been shown to perform better in higher dimensions. A lot of the previous work in metric spaces have concentrated on static methods that do not allow new insertions once the index structure has been initialized. M-Tree, Slim-Tree, DF-Tree, Omni are some of the popular dynamic structures. These methods can grow incrementally by splitting overflowed nodes and adding new levels to the tree very much like the B-tree variants. Unfortunately, they have been shown to perform very poorly compared to flat structures such as AESA, LAESA, Spaghettis and Kvp that use a fixed set of global pivots. The distances between the query object and the pivots are computed to eliminate some portion of the database from consideration. The number of pivots can be easily increased to provide more selectivity, thus better query performance. However, there is an optimum number of pivots for a given query radius, and using too many pivots increases the costs of queries and the initialization of the index. Recently, Sparse Spatial Selection(SSS) was introduced as a LAESA variant that allows insertions of new database objects and dynamically promotes some of the new objects as pivots. In this thesis, we argue that SSS has fundamental problems that results in poor query performance for clustered or otherwise skewed distributions. Real datasets have often been observed to show such characteristics. We show that SSS has been optimized to work for a symmetrical, balanced distribution and for a specific radius value. Our first main contribution is offering a new pivot promotion scheme that can perform robustly for clustered or skewed distributions. Our second contribution is proposing new methods that solve the problem of determining the right number of pivots for different query radius values. We show that our new indexing scheme performs significantly better than tree-based dynamic structures while having lower insertion costs. We also show that our structure adapts to changes in the database population in a superior way.Tosun, UmutM.S

    Ptolemaic Indexing

    Full text link
    This paper discusses a new family of bounds for use in similarity search, related to those used in metric indexing, but based on Ptolemy's inequality, rather than the metric axioms. Ptolemy's inequality holds for the well-known Euclidean distance, but is also shown here to hold for quadratic form metrics in general, with Mahalanobis distance as an important special case. The inequality is examined empirically on both synthetic and real-world data sets and is also found to hold approximately, with a very low degree of error, for important distances such as the angular pseudometric and several Lp norms. Indexing experiments demonstrate a highly increased filtering power compared to existing, triangular methods. It is also shown that combining the Ptolemaic and triangular filtering can lead to better results than using either approach on its own

    Data Reduction in the String Space for Efficient kNN Classification through Space Partitioning

    Get PDF
    Within the Pattern Recognition field, two representations are generally considered for encoding the data: statistical codifications, which describe elements as feature vectors, and structural representations, which encode elements as high-level symbolic data structures such as strings, trees or graphs. While the vast majority of classifiers are capable of addressing statistical spaces, only some particular methods are suitable for structural representations. The kNN classifier constitutes one of the scarce examples of algorithms capable of tackling both statistical and structural spaces. This method is based on the computation of the dissimilarity between all the samples of the set, which is the main reason for its high versatility, but in turn, for its low efficiency as well. Prototype Generation is one of the possibilities for palliating this issue. These mechanisms generate a reduced version of the initial dataset by performing data transformation and aggregation processes on the initial collection. Nevertheless, these generation processes are quite dependent on the data representation considered, being not generally well defined for structural data. In this work we present the adaptation of the generation-based reduction algorithm Reduction through Homogeneous Clusters to the case of string data. This algorithm performs the reduction by partitioning the space into class-homogeneous clusters for then generating a representative prototype as the median value of each group. Thus, the main issue to tackle is the retrieval of the median element of a set of strings. Our comprehensive experimentation comparatively assesses the performance of this algorithm in both the statistical and the string-based spaces. Results prove the relevance of our approach by showing a competitive compromise between classification rate and data reduction.This research work was partially funded by “Programa I+D+i de la Generalitat Valenciana” through grant ACIF/2019/ 042 and the Spanish Ministry through HISPAMUS project TIN2017-86576-R, partially funded by the EU

    Edit Distance between Unrooted Trees in Cubic Time

    Get PDF
    Edit distance between trees is a natural generalization of the classical edit distance between strings, in which the allowed elementary operations are contraction, uncontraction and relabeling of an edge. Demaine et al. [ACM Trans. on Algorithms, 6(1), 2009] showed how to compute the edit distance between rooted trees on n nodes in O(n^3) time. However, generalizing their method to unrooted trees seems quite problematic, and the most efficient known solution remains to be the previous O(n^3 log n) time algorithm by Klein [ESA 1998]. Given the lack of progress on improving this complexity, it might appear that unrooted trees are simply more difficult than rooted trees. We show that this is, in fact, not the case, and edit distance between unrooted trees on n nodes can be computed in O(n^3) time. A significantly faster solution is unlikely to exist, as Bringmann et al. [SODA 2018] proved that the complexity of computing the edit distance between rooted trees cannot be decreased to O(n^{3-epsilon}) unless some popular conjecture fails, and the lower bound easily extends to unrooted trees. We also show that for two unrooted trees of size m and n, where m <=n, our algorithm can be modified to run in O(nm^2(1+log(n/m))). This, again, matches the complexity achieved by Demaine et al. for rooted trees, who also showed that this is optimal if we restrict ourselves to the so-called decomposition algorithms

    A Content-Addressable Network for Similarity Search in Metric Spaces

    Get PDF
    Because of the ongoing digital data explosion, more advanced search paradigms than the traditional exact match are needed for contentbased retrieval in huge and ever growing collections of data produced in application areas such as multimedia, molecular biology, marketing, computer-aided design and purchasing assistance. As the variety of data types is fast going towards creating a database utilized by people, the computer systems must be able to model human fundamental reasoning paradigms, which are naturally based on similarity. The ability to perceive similarities is crucial for recognition, classification, and learning, and it plays an important role in scientific discovery and creativity. Recently, the mathematical notion of metric space has become a useful abstraction of similarity and many similarity search indexes have been developed. In this thesis, we accept the metric space similarity paradigm and concentrate on the scalability issues. By exploiting computer networks and applying the Peer-to-Peer communication paradigms, we build a structured network of computers able to process similarity queries in parallel. Since no centralized entities are used, such architectures are fully scalable. Specifically, we propose a Peer-to-Peer system for similarity search in metric spaces called Metric Content-Addressable Network (MCAN) which is an extension of the well known Content-Addressable Network (CAN) used for hash lookup. A prototype implementation of MCAN was tested on real-life datasets of image features, protein symbols, and text — observed results are reported. We also compared the performance of MCAN with three other, recently proposed, distributed data structures for similarity search in metric spaces

    Aspects of Metric Spaces in Computation

    Get PDF
    Metric spaces, which generalise the properties of commonly-encountered physical and abstract spaces into a mathematical framework, frequently occur in computer science applications. Three major kinds of questions about metric spaces are considered here: the intrinsic dimensionality of a distribution, the maximum number of distance permutations, and the difficulty of reverse similarity search. Intrinsic dimensionality measures the tendency for points to be equidistant, which is diagnostic of high-dimensional spaces. Distance permutations describe the order in which a set of fixed sites appears while moving away from a chosen point; the number of distinct permutations determines the amount of storage space required by some kinds of indexing data structure. Reverse similarity search problems are constraint satisfaction problems derived from distance-based index structures. Their difficulty reveals details of the structure of the space. Theoretical and experimental results are given for these three questions in a wide range of metric spaces, with commentary on the consequences for computer science applications and additional related results where appropriate
    corecore