537 research outputs found

    Bayesian graph edit distance

    This paper describes a novel framework for comparing and matching corrupted relational graphs. The paper develops the idea of edit-distance originally introduced for graph-matching by Sanfeliu and Fu [1]. We show how the Levenshtein distance can be used to model the probability distribution for structural errors in the graph-matching problem. This probability distribution is used to locate matches using MAP label updates. We compare the resulting graph-matching algorithm with that recently reported by Wilson and Hancock. The use of edit-distance offers an elegant alternative to the exhaustive compilation of label dictionaries. Moreover, the method is polynomial rather than exponential in its worst-case complexity. We support our approach with an experimental study on synthetic data and illustrate its effectiveness on an uncalibrated stereo correspondence problem. This demonstrates experimentally that the gain in efficiency is not at the expense of the quality of the match.
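
    The paper's error model builds on the classic Levenshtein distance. As a point of reference, here is a minimal sketch of the standard dynamic-programming recurrence that the model extends to graph structure; the graph-specific probability machinery and MAP updates are not shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance: the minimum number of
    single-character insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # delete ca
                            curr[j - 1] + 1,              # insert cb
                            prev[j - 1] + (ca != cb)))    # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```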

    Approximate string matching methods for duplicate detection and clustering tasks

    Approximate string matching methods are utilized by a vast number of duplicate detection and clustering applications in various knowledge domains. The application area is expected to grow due to the recent significant increase in the amount of digital data and knowledge sources. Despite the large number of existing string similarity metrics, there is a need for more precise approximate string matching methods to improve the efficiency of computer-driven data processing, thus decreasing labor-intensive human involvement. This work introduces a family of novel string similarity methods that outperform a number of effective, well-known, and widely used string similarity functions. The new algorithms are designed to overcome the most common problem of the existing methods, which is the lack of context sensitivity. In this evaluation, the Longest Approximately Common Prefix (LACP) method achieved the highest values of average precision and maximum F1 on three out of four medical informatics datasets used. The LACP also demonstrated the lowest execution time among the evaluated algorithms, ensured by its linear computational complexity. An online interactive spell checker of biomedical terms was developed based on the LACP method; its main goal was to evaluate the LACP method and to make it possible to estimate the similarity of result sets at a glance. The Shortest Path Edit Distance (SPED) outperformed all evaluated similarity functions and gained the highest possible values of the average precision and maximum F1 measures on the bioinformatics datasets. The SPED design was inspired by the preceding work on the Markov Random Field Edit Distance (MRFED). The SPED eradicates two shortcomings of the MRFED: prolonged execution time and moderate performance. Four modifications of the Histogram Difference (HD) method demonstrated the best performance on the majority of the life and social sciences data sources used in the experiments. The modifications of the HD algorithm were achieved using several re-scorers: HD with the Normalized Smith-Waterman re-scorer, HD with TFIDF and Jaccard re-scorers, HD with the Longest Common Prefix and TFIDF re-scorers, and HD with the Unweighted Longest Common Prefix re-scorer. Another contribution of this dissertation is an extensive evaluation of string similarity methods for duplicate detection and clustering tasks in the life and social sciences, bioinformatics, and medical informatics domains. The experimental results are illustrated with precision-recall charts and a number of tables presenting the average precision, maximum F1, and execution time.
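
    The abstract does not spell out the LACP formulation, so the following is only an illustrative sketch of an approximate-common-prefix similarity in that spirit (not the published algorithm): scan both strings from the start, tolerate a small mismatch budget (the max_mismatches parameter is an assumption), and normalize by the longer length. Like LACP, it runs in linear time.

```python
def approx_common_prefix_similarity(a: str, b: str, max_mismatches: int = 1) -> float:
    """Illustrative approximate-common-prefix similarity (NOT the published
    LACP algorithm): length of the prefix shared by a and b while allowing
    up to max_mismatches character mismatches, normalized to [0, 1]."""
    mismatches = 0
    prefix_len = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            mismatches += 1
            if mismatches > max_mismatches:
                break
        prefix_len += 1
    return prefix_len / max(len(a), len(b), 1)

# Near-duplicate medical terms share a long approximate prefix.
print(approx_common_prefix_similarity("hyperlipidemia", "hyperlipidaemia"))
```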

    GASP: Geometric Association with Surface Patches

    A fundamental challenge to sensory processing tasks in perception and robotics is the problem of obtaining data associations across views. We present a robust solution for ascertaining potentially dense surface patch (superpixel) associations, requiring just range information. Our approach involves decomposition of a view into regularized surface patches. We represent them as sequences expressing geometry invariantly over their superpixel neighborhoods, as uniquely consistent partial orderings. We match these representations through an optimal sequence comparison metric based on the Damerau-Levenshtein distance, enabling robust association with quadratic complexity (in contrast to hitherto employed joint matching formulations, which are NP-complete). The approach is able to perform under wide baselines, heavy rotations, partial overlaps, significant occlusions, and sensor noise. The technique does not require any priors, motion or otherwise, and does not make restrictive assumptions about scene structure and sensor movement. Because it does not require appearance, it is more widely applicable than appearance-reliant methods and invulnerable to related ambiguities such as textureless or aliased content. We present promising qualitative and quantitative results under diverse settings, along with comparisons with popular approaches based on range as well as RGB-D data.
    Comment: International Conference on 3D Vision, 201
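
    The matching metric builds on the Damerau-Levenshtein distance. Below is a minimal sketch of its standard restricted variant (optimal string alignment), which adds adjacent transpositions to the usual insert/delete/substitute edits; the paper's geometric sequence construction is not reproduced here, and the function works on arbitrary sequences.

```python
def damerau_levenshtein(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    edit distance with insertions, deletions, substitutions, and
    transpositions of adjacent elements. Works on any pair of sequences."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = int(a[i - 1] != b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[n][m]

print(damerau_levenshtein("cat", "act"))          # 1: one adjacent swap
print(damerau_levenshtein([3, 1, 2], [1, 3, 2]))  # 1: works on lists too
```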

    A Survey on Metric Learning for Feature Vectors and Structured Data

    The need for appropriate ways to measure the distance or similarity between data is ubiquitous in machine learning, pattern recognition, and data mining, but handcrafting good metrics for specific problems is generally difficult. This has led to the emergence of metric learning, which aims at automatically learning a metric from data and has attracted a lot of interest in machine learning and related fields over the past ten years. This survey paper proposes a systematic review of the metric learning literature, highlighting the pros and cons of each approach. We pay particular attention to Mahalanobis distance metric learning, a well-studied and successful framework, but additionally present a wide range of methods that have recently emerged as powerful alternatives, including nonlinear metric learning, similarity learning, and local metric learning. Recent trends and extensions, such as semi-supervised metric learning, metric learning for histogram data, and the derivation of generalization guarantees, are also covered. Finally, this survey addresses metric learning for structured data, in particular edit distance learning, and attempts to give an overview of the remaining challenges in metric learning for the years to come.
    Comment: Technical report, 59 pages.
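
    The survey's central object is the Mahalanobis distance d_M(x, y) = sqrt((x - y)^T M (x - y)), where the positive semi-definite matrix M is learned from data. Here is a minimal sketch of that parameterization, with M factored as L^T L so positive semi-definiteness holds by construction; the matrix L below is an assumed placeholder, not the output of any particular learning algorithm.

```python
import numpy as np

def mahalanobis_distance(x, y, L):
    """d_M(x, y) = sqrt((x - y)^T M (x - y)) with M = L^T L, which is
    positive semi-definite for any real matrix L. Equivalently, this is
    the Euclidean distance after applying the learned linear map L."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.linalg.norm(L @ diff))

# Placeholder "learned" transform: stretches the first feature, shrinks the second.
L = np.array([[2.0, 0.0],
              [0.0, 0.5]])
print(mahalanobis_distance([1.0, 2.0], [3.0, 1.0], L))  # ~4.031
```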

    Proactive search: Using outcome-based dynamic nearest-neighbor recommendation algorithms to improve search engine efficacy

    The explosion of readily available electronic information has changed the focus of data processing from data generation to data discovery. The prevalent use of search engines has generated extensive research into improving the speed and accuracy of searches. The goal of this research is to accurately predict user behavior as a means to proactively improve the speed, accuracy, and predictability of search engines. The proactive approach eliminates query entry time and hence reduces the overall processing time, improving speed. When the prediction succeeds, the user locates an electronic resource of interest, improving accuracy. Algorithms that predict many vastly different aspects of user behavior exist in the literature; two common approaches are statistical techniques and collaborative actions. This research extends the scope of proactive search by using the search histories of users to build a predictive model. The proposed approach was compared to statistical and collaborative behavior models. The test results verified that search engine prediction is a viable approach and support the intuitive notion that prediction is more successful when user behavior exhibits less entropy. The benefits of the proposed approach go beyond improvements in performance and accuracy: because search histories are treated as sequences of resources, it is possible to predict a series of resources that a user will likely select in the immediate future. This makes it possible for search engines to return resource sequences instead of single resources, allowing the user to locate information of interest more effectively. In the end, a proactive search engine improves speed and accuracy through prediction and sequencing of electronic resources.
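
    The abstract does not name the dissertation's predictive model, so the sketch below is only a hypothetical baseline in the same spirit: a first-order Markov model over resource sequences that counts transitions in past search histories and proposes the most frequent successors of the current resource.

```python
from collections import Counter, defaultdict

class MarkovPredictor:
    """Illustrative first-order Markov predictor over resource-access
    sequences (an assumption; the dissertation's actual model may differ)."""
    def __init__(self):
        self.transitions = defaultdict(Counter)

    def fit(self, histories):
        # Count resource-to-resource transitions in past search histories.
        for history in histories:
            for prev, nxt in zip(history, history[1:]):
                self.transitions[prev][nxt] += 1

    def predict(self, current, k=3):
        # Return up to k most frequent successors of the current resource.
        return [r for r, _ in self.transitions[current].most_common(k)]

model = MarkovPredictor()
model.fit([["home", "search", "paper_A"],
           ["home", "search", "paper_B"],
           ["home", "search", "paper_A", "paper_C"]])
print(model.predict("search"))  # ['paper_A', 'paper_B']
```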

    Geometric Graphs: Matching, Similarity, and Indexing

    For many applications, such as drug discovery, road network analysis, and image processing, it is critical to study the spatial properties of objects in addition to object relationships. Geometric graphs, whose vertices are located in some 2D space, provide a suitable modeling framework for such applications, and searching for similar objects is tackled by estimating the structural similarity of different graphs. In this case, inexact graph matching approaches are typically employed. However, computing the optimal solution to the graph matching problem is known to be computationally hard, and approximate approaches face problems such as poor scalability with respect to graph size and limited tolerance to changes in graph structure or labels. In this thesis, we propose a framework to tackle the inexact graph matching problem for geometric graphs in 2D space. It consists of a pipeline of three components that we design to cope with the requirements of several application domains. The first component of our framework is an approach to estimate the similarity of vertices. It is based on the string edit distance and handles any labeling information assigned to the vertices and edges. Based on this, we build the second component of our framework, which consists of two algorithms to tackle the inexact graph matching problem. The first algorithm adopts a probabilistic scheme, where we propose a density function that estimates the probability of the correspondences between vertices of different graphs; a match between the two graphs is then computed using the expectation maximization technique. The second graph matching algorithm follows a continuous optimization scheme to iteratively improve the match between two graphs. For this, we propose a vertex embedding approach so that the similarity of different vertices can be easily estimated by the Euclidean distance. The third component of our framework is a graph indexing structure, which helps to efficiently search a graph database for similar graphs. We propose several lower bound graph distances that are used to prune non-similar graphs and reduce the response time. Using representative geometric graphs extracted from a variety of application domains, such as chemoinformatics, character recognition, road network analysis, and image processing, we show that our approach outperforms existing graph matching approaches in terms of matching quality, classification accuracy, and runtime.
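
    The thesis's specific lower bound distances are not given in the abstract; the sketch below only illustrates the general filter-and-verify pattern behind such indexing, using a generic size-based lower bound on graph edit distance (any edit script must at least account for differences in vertex and edge counts). The dictionary graph representation is an assumption made for the example.

```python
def size_lower_bound(g1, g2):
    """Generic lower bound on graph edit distance from vertex/edge counts
    (a standard cheap bound, not the thesis's specific ones): any sequence
    of edits must at least cover the size differences between the graphs."""
    return (abs(len(g1["vertices"]) - len(g2["vertices"]))
            + abs(len(g1["edges"]) - len(g2["edges"])))

def filter_candidates(query, database, threshold):
    """Prune graphs whose lower bound already exceeds the threshold;
    only the survivors need the expensive inexact matching step."""
    return [g for g in database if size_lower_bound(query, g) <= threshold]

q = {"vertices": [1, 2, 3], "edges": [(1, 2), (2, 3)]}
db = [{"vertices": [1, 2, 3, 4], "edges": [(1, 2)]},          # bound 2: pruned
      {"vertices": [1, 2, 3], "edges": [(1, 2), (1, 3)]}]     # bound 0: kept
print(len(filter_candidates(q, db, threshold=1)))  # 1
```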