2,359 research outputs found

    Tree Edit Distance Learning via Adaptive Symbol Embeddings

    Full text link
    Metric learning has the aim to improve classification accuracy by learning a distance measure which brings data points from the same class closer together and pushes data points from different classes further apart. Recent research has demonstrated that metric learning approaches can also be applied to trees, such as molecular structures, abstract syntax trees of computer programs, or syntax trees of natural language, by learning the cost function of an edit distance, i.e. the costs of replacing, deleting, or inserting nodes in a tree. However, learning such costs directly may yield an edit distance which violates metric axioms, is challenging to interpret, and may not generalize well. In this contribution, we propose a novel metric learning approach for trees which we call embedding edit distance learning (BEDL) and which learns an edit distance indirectly by embedding the tree nodes as vectors, such that the Euclidean distance between those vectors supports class discrimination. We learn such embeddings by reducing the distance to prototypical trees from the same class and increasing the distance to prototypical trees from different classes. In our experiments, we show that BEDL improves upon the state-of-the-art in metric learning for trees on six benchmark data sets, ranging from computer science over biomedical data to a natural-language processing data set containing over 300,000 nodes.Comment: Paper at the International Conference of Machine Learning (2018), 2018-07-10 to 2018-07-15 in Stockholm, Swede

    Generalised median of graph correspondences.

    Get PDF
    A graph correspondence is defined as a function that maps the elements of two attributed graphs. Due to the increasing availability of methods to perform graph matching, numerous graph correspondences can be deducted for a pair of attributed graphs. To obtain a representative prototype for a set of data structures, the concept of the median has been largely employed, as it has proven to deliver a robust sample. Nonetheless, the calculation of the exact (or generalised) median is known to be an NP-complete problem for most domains. In this paper, we present a method based on an optimisation function to calculate the generalised median graph correspondence. This method makes use of the Correspondence Edit Distance, which is a metric that considers the attributes and the local structures of the graphs to obtain more interesting and meaningful results. Experimental validation shows that this approach is capable of obtaining the generalised median in a comparable runtime with respect to state-of-the-art methods on artificial data, while maintaining the success rate for a real-application case

    A generic framework for median graph computation based on a recursive embedding approach

    Get PDF
    The median graph has been shown to be a good choice to obtain a representative of a set of graphs. However, its computation is a complex problem. Recently, graph embedding into vector spaces has been proposed to obtain approximations of the median graph. The problem with such an approach is how to go from a point in the vector space back to a graph in the graph space. The main contribution of this paper is the generalization of this previous method, proposing a generic recursive procedure that permits to recover the graph corresponding to a point in the vector space, introducing only the amount of approximation inherent to the use of graph matching algorithms. In order to evaluate the proposed method, we compare it with the set median and with the other state-of-the-art embedding-based methods for the median graph computation. The experiments are carried out using four different databases (one semi-artificial and three containing real-world data). Results show that with the proposed approach we can obtain better medians, in terms of the sum of distances to the training graphs, than with the previous existing methods. © 2011 Elsevier Inc. All rights reserved.This work has been supported by the Spanish research programmes Consolider Ingenio 2010 CSD2007-00018, TIN2006-15694-C02-02 and TIN2008-04998 and the fellowship RYC-2009-05031.Peer Reviewe

    Generalised median of a set of correspondences based on the hamming distance.

    Get PDF
    A correspondence is a set of mappings that establishes a relation between the elements of two data structures (i.e. sets of points, strings, trees or graphs). If we consider several correspondences between the same two structures, one option to define a representative of them is through the generalised median correspondence. In general, the computation of the generalised median is an NP-complete task. In this paper, we present two methods to calculate the generalised median correspondence of multiple correspondences. The first one obtains the optimal solution in cubic time, but it is restricted to the Hamming distance. The second one obtains a sub-optimal solution through an iterative approach, but does not have any restrictions with respect to the used distance. We compare both proposals in terms of the distance to the true generalised median and runtime

    Pivot Selection for Median String Problem

    Full text link
    The Median String Problem is W[1]-Hard under the Levenshtein distance, thus, approximation heuristics are used. Perturbation-based heuristics have been proved to be very competitive as regards the ratio approximation accuracy/convergence speed. However, the computational burden increase with the size of the set. In this paper, we explore the idea of reducing the size of the problem by selecting a subset of representative elements, i.e. pivots, that are used to compute the approximate median instead of the whole set. We aim to reduce the computation time through a reduction of the problem size while achieving similar approximation accuracy. We explain how we find those pivots and how to compute the median string from them. Results on commonly used test data suggest that our approach can reduce the computational requirements (measured in computed edit distances) by 88\% with approximation accuracy as good as the state of the art heuristic. This work has been supported in part by CONICYT-PCHA/Doctorado Nacional/2014631400742014-63140074 through a Ph.D. Scholarship; Universidad Cat\'{o}lica de la Sant\'{i}sima Concepci\'{o}n through the research project DIN-01/2016; European Union's Horizon 2020 under the Marie Sk\l odowska-Curie grant agreement 690941690941; Millennium Institute for Foundational Research on Data (IMFD); FONDECYT-CONICYT grant number 11704971170497; and for O. Pedreira, Xunta de Galicia/FEDER-UE refs. CSI ED431G/01 and GRC: ED431C 2017/58
    corecore