Search CORE

28,823 research outputs found

Approximating the Graph Edit Distance with Compact Neighborhood Representations

Author: Bause Franka
Kriege Nils M.
Permann Christian
Publication venue
Publication date: 07/12/2023
Field of study

The graph edit distance is used for comparing graphs in various domains. Due to its high computational complexity it is primarily approximated. Widely-used heuristics search for an optimal assignment of vertices based on the distance between local substructures. While faster ones only consider vertices and their incident edges, leading to poor accuracy, other approaches require computationally intense exact distance computations between subgraphs. Our new method abstracts local substructures to neighborhood trees and compares them using efficient tree matching techniques. This results in a ground distance for mapping vertices that yields high quality approximations of the graph edit distance. By limiting the maximum tree height, our method supports steering between more accurate results and faster execution. We thoroughly analyze the running time of the tree matching method and propose several techniques to accelerate computation in practice. We use compressed tree representations, recognize redundancies by tree canonization and exploit them via caching. Experimentally we show that our method provides a significantly improved trade-off between running time and approximation quality compared to existing state-of-the-art approaches

arXiv.org e-Print Archive

Malware Classification based on Call Graph Clustering

Author: Kinable Joris
Kostakis Orestis
Publication venue
Publication date: 25/08/2010
Field of study

Each day, anti-virus companies receive tens of thousands samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations away, and enable the detection of structural similarities between samples. The ability to cluster similar samples together will make more generic detection techniques possible, thereby targeting the commonalities of the samples within a cluster. To compare call graphs mutually, we compute pairwise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and DBSCAN. Clustering experiments are conducted on a collection of real malware samples, and the results are evaluated against manual classifications provided by human malware analysts. Experiments show that it is indeed possible to accurately detect malware families via call graph clustering. We anticipate that in the future, call graphs can be used to analyse the emergence of new malware families, and ultimately to automate implementation of generic detection schemes.Comment: This research has been supported by TEKES - the Finnish Funding Agency for Technology and Innovation as part of its ICT SHOK Future Internet research programme, grant 40212/0

arXiv.org e-Print Archive

CiteSeerX

Repository TU/e

Structural Data Recognition with Graph Model Boosting

Author: Miyazaki Tomo
Omachi Shinichiro
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 07/03/2017
Field of study

This paper presents a novel method for structural data recognition using a large number of graph models. In general, prevalent methods for structural data recognition have two shortcomings: 1) Only a single model is used to capture structural variation. 2) Naive recognition methods are used, such as the nearest neighbor method. In this paper, we propose strengthening the recognition performance of these models as well as their ability to capture structural variation. The proposed method constructs a large number of graph models and trains decision trees using the models. This paper makes two main contributions. The first is a novel graph model that can quickly perform calculations, which allows us to construct several models in a feasible amount of time. The second contribution is a novel approach to structural data recognition: graph model boosting. Comprehensive structural variations can be captured with a large number of graph models constructed in a boosting framework, and a sophisticated classifier can be formed by aggregating the decision trees. Consequently, we can carry out structural data recognition with powerful recognition capability in the face of comprehensive structural variation. The experiments shows that the proposed method achieves impressive results and outperforms existing methods on datasets of IAM graph database repository.Comment: 8 page

arXiv.org e-Print Archive

Tohoku University Repository (TOUR) / 東北大学機関リポジトリ

Approximating Edit Distance Within Constant Factor in Truly Sub-Quadratic Time

Author: Chakraborty Diptarka
Das Debarati
Goldenberg Elazar
Koucky Michal
Saks Michael
Publication venue
Publication date: 08/10/2018
Field of study

Edit distance is a measure of similarity of two strings based on the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. The edit distance can be computed exactly using a dynamic programming algorithm that runs in quadratic time. Andoni, Krauthgamer and Onak (2010) gave a nearly linear time algorithm that approximates edit distance within approximation factor

\text{poly}(\log n)

. In this paper, we provide an algorithm with running time

\tilde{O}(n^{2-2/7})

that approximates the edit distance within a constant factor

arXiv.org e-Print Archive

Copenhagen University Research Information System

Approximating Weighted Duo-Preservation in Comparative Genomics

Author: AR Mushegian
B Brubach
G Cormode
H Jiang
KM Swenson
L Bulteau
M Chrobak
N Boria
NH Mustafa
RC Hardison
S Beretta
TM Chan
W Chen
X Chen
Publication venue
Publication date: 30/08/2017
Field of study

Motivated by comparative genomics, Chen et al. [9] introduced the Maximum Duo-preservation String Mapping (MDSM) problem in which we are given two strings

s_1

and

s_2

from the same alphabet and the goal is to find a mapping

\pi

between them so as to maximize the number of duos preserved. A duo is any two consecutive characters in a string and it is preserved in the mapping if its two consecutive characters in

s_1

are mapped to same two consecutive characters in

s_2

. The MDSM problem is known to be NP-hard and there are approximation algorithms for this problem [3, 5, 13], but all of them consider only the "unweighted" version of the problem in the sense that a duo from

s_1

is preserved by mapping to any same duo in

s_2

regardless of their positions in the respective strings. However, it is well-desired in comparative genomics to find mappings that consider preserving duos that are "closer" to each other under some distance measure [19]. In this paper, we introduce a generalized version of the problem, called the Maximum-Weight Duo-preservation String Mapping (MWDSM) problem that captures both duos-preservation and duos-distance measures in the sense that mapping a duo from

s_1

to each preserved duo in

s_2

has a weight, indicating the "closeness" of the two duos. The objective of the MWDSM problem is to find a mapping so as to maximize the total weight of preserved duos. In this paper, we give a polynomial-time 6-approximation algorithm for this problem.Comment: Appeared in proceedings of the 23rd International Computing and Combinatorics Conference (COCOON 2017

arXiv.org e-Print Archive

Crossref