28,823 research outputs found

    Approximating the Graph Edit Distance with Compact Neighborhood Representations

    Full text link
    The graph edit distance is used for comparing graphs in various domains. Due to its high computational complexity it is primarily approximated. Widely-used heuristics search for an optimal assignment of vertices based on the distance between local substructures. While faster ones only consider vertices and their incident edges, leading to poor accuracy, other approaches require computationally intense exact distance computations between subgraphs. Our new method abstracts local substructures to neighborhood trees and compares them using efficient tree matching techniques. This results in a ground distance for mapping vertices that yields high quality approximations of the graph edit distance. By limiting the maximum tree height, our method supports steering between more accurate results and faster execution. We thoroughly analyze the running time of the tree matching method and propose several techniques to accelerate computation in practice. We use compressed tree representations, recognize redundancies by tree canonization and exploit them via caching. Experimentally we show that our method provides a significantly improved trade-off between running time and approximation quality compared to existing state-of-the-art approaches

    Malware Classification based on Call Graph Clustering

    Full text link
    Each day, anti-virus companies receive tens of thousands samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations away, and enable the detection of structural similarities between samples. The ability to cluster similar samples together will make more generic detection techniques possible, thereby targeting the commonalities of the samples within a cluster. To compare call graphs mutually, we compute pairwise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and DBSCAN. Clustering experiments are conducted on a collection of real malware samples, and the results are evaluated against manual classifications provided by human malware analysts. Experiments show that it is indeed possible to accurately detect malware families via call graph clustering. We anticipate that in the future, call graphs can be used to analyse the emergence of new malware families, and ultimately to automate implementation of generic detection schemes.Comment: This research has been supported by TEKES - the Finnish Funding Agency for Technology and Innovation as part of its ICT SHOK Future Internet research programme, grant 40212/0

    Structural Data Recognition with Graph Model Boosting

    Get PDF
    This paper presents a novel method for structural data recognition using a large number of graph models. In general, prevalent methods for structural data recognition have two shortcomings: 1) Only a single model is used to capture structural variation. 2) Naive recognition methods are used, such as the nearest neighbor method. In this paper, we propose strengthening the recognition performance of these models as well as their ability to capture structural variation. The proposed method constructs a large number of graph models and trains decision trees using the models. This paper makes two main contributions. The first is a novel graph model that can quickly perform calculations, which allows us to construct several models in a feasible amount of time. The second contribution is a novel approach to structural data recognition: graph model boosting. Comprehensive structural variations can be captured with a large number of graph models constructed in a boosting framework, and a sophisticated classifier can be formed by aggregating the decision trees. Consequently, we can carry out structural data recognition with powerful recognition capability in the face of comprehensive structural variation. The experiments shows that the proposed method achieves impressive results and outperforms existing methods on datasets of IAM graph database repository.Comment: 8 page

    Approximating Edit Distance Within Constant Factor in Truly Sub-Quadratic Time

    Full text link
    Edit distance is a measure of similarity of two strings based on the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. The edit distance can be computed exactly using a dynamic programming algorithm that runs in quadratic time. Andoni, Krauthgamer and Onak (2010) gave a nearly linear time algorithm that approximates edit distance within approximation factor poly(logn)\text{poly}(\log n). In this paper, we provide an algorithm with running time O~(n22/7)\tilde{O}(n^{2-2/7}) that approximates the edit distance within a constant factor

    Approximating Weighted Duo-Preservation in Comparative Genomics

    Full text link
    Motivated by comparative genomics, Chen et al. [9] introduced the Maximum Duo-preservation String Mapping (MDSM) problem in which we are given two strings s1s_1 and s2s_2 from the same alphabet and the goal is to find a mapping π\pi between them so as to maximize the number of duos preserved. A duo is any two consecutive characters in a string and it is preserved in the mapping if its two consecutive characters in s1s_1 are mapped to same two consecutive characters in s2s_2. The MDSM problem is known to be NP-hard and there are approximation algorithms for this problem [3, 5, 13], but all of them consider only the "unweighted" version of the problem in the sense that a duo from s1s_1 is preserved by mapping to any same duo in s2s_2 regardless of their positions in the respective strings. However, it is well-desired in comparative genomics to find mappings that consider preserving duos that are "closer" to each other under some distance measure [19]. In this paper, we introduce a generalized version of the problem, called the Maximum-Weight Duo-preservation String Mapping (MWDSM) problem that captures both duos-preservation and duos-distance measures in the sense that mapping a duo from s1s_1 to each preserved duo in s2s_2 has a weight, indicating the "closeness" of the two duos. The objective of the MWDSM problem is to find a mapping so as to maximize the total weight of preserved duos. In this paper, we give a polynomial-time 6-approximation algorithm for this problem.Comment: Appeared in proceedings of the 23rd International Computing and Combinatorics Conference (COCOON 2017
    corecore