28,823 research outputs found
Approximating the Graph Edit Distance with Compact Neighborhood Representations
The graph edit distance is used for comparing graphs in various domains. Due
to its high computational complexity it is primarily approximated. Widely-used
heuristics search for an optimal assignment of vertices based on the distance
between local substructures. While faster ones only consider vertices and their
incident edges, leading to poor accuracy, other approaches require
computationally intense exact distance computations between subgraphs. Our new
method abstracts local substructures to neighborhood trees and compares them
using efficient tree matching techniques. This results in a ground distance for
mapping vertices that yields high quality approximations of the graph edit
distance. By limiting the maximum tree height, our method supports steering
between more accurate results and faster execution. We thoroughly analyze the
running time of the tree matching method and propose several techniques to
accelerate computation in practice. We use compressed tree representations,
recognize redundancies by tree canonization and exploit them via caching.
Experimentally we show that our method provides a significantly improved
trade-off between running time and approximation quality compared to existing
state-of-the-art approaches
Malware Classification based on Call Graph Clustering
Each day, anti-virus companies receive tens of thousands samples of
potentially harmful executables. Many of the malicious samples are variations
of previously encountered malware, created by their authors to evade
pattern-based detection. Dealing with these large amounts of data requires
robust, automatic detection approaches. This paper studies malware
classification based on call graph clustering. By representing malware samples
as call graphs, it is possible to abstract certain variations away, and enable
the detection of structural similarities between samples. The ability to
cluster similar samples together will make more generic detection techniques
possible, thereby targeting the commonalities of the samples within a cluster.
To compare call graphs mutually, we compute pairwise graph similarity scores
via graph matchings which approximately minimize the graph edit distance. Next,
to facilitate the discovery of similar malware samples, we employ several
clustering algorithms, including k-medoids and DBSCAN. Clustering experiments
are conducted on a collection of real malware samples, and the results are
evaluated against manual classifications provided by human malware analysts.
Experiments show that it is indeed possible to accurately detect malware
families via call graph clustering. We anticipate that in the future, call
graphs can be used to analyse the emergence of new malware families, and
ultimately to automate implementation of generic detection schemes.Comment: This research has been supported by TEKES - the Finnish Funding
Agency for Technology and Innovation as part of its ICT SHOK Future Internet
research programme, grant 40212/0
Structural Data Recognition with Graph Model Boosting
This paper presents a novel method for structural data recognition using a
large number of graph models. In general, prevalent methods for structural data
recognition have two shortcomings: 1) Only a single model is used to capture
structural variation. 2) Naive recognition methods are used, such as the
nearest neighbor method. In this paper, we propose strengthening the
recognition performance of these models as well as their ability to capture
structural variation. The proposed method constructs a large number of graph
models and trains decision trees using the models. This paper makes two main
contributions. The first is a novel graph model that can quickly perform
calculations, which allows us to construct several models in a feasible amount
of time. The second contribution is a novel approach to structural data
recognition: graph model boosting. Comprehensive structural variations can be
captured with a large number of graph models constructed in a boosting
framework, and a sophisticated classifier can be formed by aggregating the
decision trees. Consequently, we can carry out structural data recognition with
powerful recognition capability in the face of comprehensive structural
variation. The experiments shows that the proposed method achieves impressive
results and outperforms existing methods on datasets of IAM graph database
repository.Comment: 8 page
Approximating Edit Distance Within Constant Factor in Truly Sub-Quadratic Time
Edit distance is a measure of similarity of two strings based on the minimum
number of character insertions, deletions, and substitutions required to
transform one string into the other. The edit distance can be computed exactly
using a dynamic programming algorithm that runs in quadratic time. Andoni,
Krauthgamer and Onak (2010) gave a nearly linear time algorithm that
approximates edit distance within approximation factor .
In this paper, we provide an algorithm with running time
that approximates the edit distance within a constant
factor
Approximating Weighted Duo-Preservation in Comparative Genomics
Motivated by comparative genomics, Chen et al. [9] introduced the Maximum
Duo-preservation String Mapping (MDSM) problem in which we are given two
strings and from the same alphabet and the goal is to find a
mapping between them so as to maximize the number of duos preserved. A
duo is any two consecutive characters in a string and it is preserved in the
mapping if its two consecutive characters in are mapped to same two
consecutive characters in . The MDSM problem is known to be NP-hard and
there are approximation algorithms for this problem [3, 5, 13], but all of them
consider only the "unweighted" version of the problem in the sense that a duo
from is preserved by mapping to any same duo in regardless of their
positions in the respective strings. However, it is well-desired in comparative
genomics to find mappings that consider preserving duos that are "closer" to
each other under some distance measure [19]. In this paper, we introduce a
generalized version of the problem, called the Maximum-Weight Duo-preservation
String Mapping (MWDSM) problem that captures both duos-preservation and
duos-distance measures in the sense that mapping a duo from to each
preserved duo in has a weight, indicating the "closeness" of the two
duos. The objective of the MWDSM problem is to find a mapping so as to maximize
the total weight of preserved duos. In this paper, we give a polynomial-time
6-approximation algorithm for this problem.Comment: Appeared in proceedings of the 23rd International Computing and
Combinatorics Conference (COCOON 2017
- …