Aligning graphs and finding substructures by a cavity approach
We introduce a new distributed algorithm for aligning graphs or finding
substructures within a given graph. It is based on the cavity method and is
used to study the maximum-clique and the graph-alignment problems in random
graphs. The algorithm allows large graphs to be analyzed and may find applications
in fields such as computational biology. As a proof of concept we use our
algorithm to align the similarity graphs of two interacting protein families
involved in bacterial signal transduction, and to predict actually interacting
protein partners between these families.
Comment: 5 pages, 4 figures
Faithful and Consistent Graph Neural Network Explanations with Rationale Alignment
Uncovering rationales behind predictions of graph neural networks (GNNs) has
received increasing attention over recent years. Instance-level GNN explanation
aims to discover critical input elements, like nodes or edges, that the target
GNN relies upon for making predictions. These identified sub-structures can
provide interpretations of the GNN's behavior. Although various algorithms have
been proposed, most of them formalize this task as a search for the minimal
subgraph that preserves the original predictions. However, an inductive bias is
deep-rooted in this framework: several subgraphs can result in the same or
similar outputs as the original graphs. Consequently, these methods risk
providing spurious explanations and failing to provide consistent ones.
Applying them to explain weakly performing GNNs would further amplify these
issues. To address this problem, we theoretically examine the predictions of
GNNs from the causality perspective. Two typical reasons for spurious
explanations are identified: the confounding effect of latent variables such as
distribution shift, and causal factors distinct from the original input.
Observing that both confounding effects and diverse causal rationales are
encoded in internal representations, we propose a new explanation
framework with an auxiliary alignment loss, which is theoretically proven to be
optimizing a more faithful explanation objective intrinsically. Concretely, for
this alignment loss, a set of different perspectives is explored: anchor-based
alignment, distributional alignment based on Gaussian mixture models,
mutual-information-based alignment, etc. A comprehensive study is conducted
both on the effectiveness of this new framework in terms of explanation
faithfulness/consistency and on the advantages of these variants.
Comment: TIST2023. arXiv admin note: substantial text overlap with
arXiv:2205.1373
Evaluating Self-Supervised Learning for Molecular Graph Embeddings
Graph Self-Supervised Learning (GSSL) provides a robust pathway for acquiring
embeddings without expert labelling, a capability that carries profound
implications for molecular graphs due to the staggering number of potential
molecules and the high cost of obtaining labels. However, GSSL methods are
designed not for optimisation within a specific domain but rather for
transferability across a variety of downstream tasks. This broad applicability
complicates their evaluation. Addressing this challenge, we present "Molecular
Graph Representation Evaluation" (MOLGRAPHEVAL), generating detailed profiles
of molecular graph embeddings with interpretable and diversified attributes.
MOLGRAPHEVAL offers a suite of probing tasks grouped into three categories: (i)
generic graph, (ii) molecular substructure, and (iii) embedding space
properties. By leveraging MOLGRAPHEVAL to benchmark existing GSSL methods
against both current downstream datasets and our suite of tasks, we uncover
significant inconsistencies between inferences drawn solely from existing
datasets and those derived from more nuanced probing. These findings suggest
that current evaluation methodologies fail to capture the entirety of the
landscape.
Comment: update result
Malware Classification based on Call Graph Clustering
Each day, anti-virus companies receive tens of thousands of samples of
potentially harmful executables. Many of the malicious samples are variations
of previously encountered malware, created by their authors to evade
pattern-based detection. Dealing with these large amounts of data requires
robust, automatic detection approaches. This paper studies malware
classification based on call graph clustering. By representing malware samples
as call graphs, it is possible to abstract certain variations away, and enable
the detection of structural similarities between samples. The ability to
cluster similar samples together will make more generic detection techniques
possible, thereby targeting the commonalities of the samples within a cluster.
To compare call graphs with one another, we compute pairwise graph similarity scores
via graph matchings which approximately minimize the graph edit distance. Next,
to facilitate the discovery of similar malware samples, we employ several
clustering algorithms, including k-medoids and DBSCAN. Clustering experiments
are conducted on a collection of real malware samples, and the results are
evaluated against manual classifications provided by human malware analysts.
Experiments show that it is indeed possible to accurately detect malware
families via call graph clustering. We anticipate that in the future, call
graphs can be used to analyse the emergence of new malware families, and
ultimately to automate implementation of generic detection schemes.
Comment: This research has been supported by TEKES - the Finnish Funding
Agency for Technology and Innovation as part of its ICT SHOK Future Internet
research programme, grant 40212/0
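The clustering step described in this abstract (pairwise graph scores fed to k-medoids) can be illustrated with a small sketch. It assumes the pairwise distances have already been computed and are stored in a symmetric matrix; the function name `k_medoids`, the farthest-first initialisation, and the toy matrix `toy` are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Sequence


def k_medoids(dist: Sequence[Sequence[float]], k: int, max_iter: int = 100) -> List[int]:
    """Cluster items from a symmetric pairwise distance matrix.

    Returns one cluster label (0..k-1) per item. Initialisation is
    farthest-first starting from item 0, which is an assumption of
    this sketch rather than a detail from the abstract.
    """
    n = len(dist)
    medoids = [0]
    while len(medoids) < k:
        # Pick the item that is farthest from its nearest current medoid.
        nxt = max(
            (i for i in range(n) if i not in medoids),
            key=lambda i: min(dist[i][m] for m in medoids),
        )
        medoids.append(nxt)

    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each item joins the cluster of its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # Update step: the new medoid of a cluster is the member with the
        # smallest total distance to all members of that cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Keep the old medoid if a cluster ends up empty.
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


# Hypothetical distances between five call graphs: items 0-2 are close to
# one another, and items 3-4 are close to one another.
toy = [
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.1, 0.8, 0.9],
    [0.2, 0.1, 0.0, 0.9, 0.9],
    [0.9, 0.8, 0.9, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.1, 0.0],
]
print(k_medoids(toy, k=2))  # prints [0, 0, 0, 1, 1]
```

Working from a precomputed matrix keeps the clustering independent of how the graph distances were obtained, so an approximate graph edit distance could be swapped in without changing this code.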
Effective Identification of Conserved Pathways in Biological Networks Using Hidden Markov Models
The advent of various high-throughput experimental techniques for measuring molecular interactions has enabled the systematic study of biological interactions on a global scale. Since biological processes are carried out by elaborate collaborations of numerous molecules that give rise to a complex network of molecular interactions, comparative analysis of these biological networks can bring important insights into the functional organization and regulatory mechanisms of biological systems. In this paper, we present an effective framework for identifying common interaction patterns in the biological networks of different organisms based on hidden Markov models (HMMs). Given two or more networks, our method efficiently finds the top matching paths in the respective networks, where the matching paths may contain a flexible number of consecutive insertions and deletions. Based on several protein-protein interaction (PPI) networks obtained from the Database of Interacting Proteins (DIP) and other public databases, we demonstrate that our method is able to detect biologically significant pathways that are conserved across different organisms. Our algorithm has a polynomial complexity that grows linearly with the size of the aligned paths. This enables the search for very long paths with more than 10 nodes within a few minutes on a desktop computer. The software program that implements this algorithm is available upon request from the authors.
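To make the idea of matching paths with insertions and deletions concrete, the sketch below scores two already-chosen paths with a standard global-alignment dynamic program (Needleman-Wunsch style). This is a simpler, swapped-in technique and not the HMM-based search of the paper; the names `align_paths` and `identity_sim`, the gap penalty, and the node labels are all hypothetical.

```python
from typing import Callable, List, Sequence


def identity_sim(x: str, y: str) -> float:
    """Toy node similarity: reward identical labels, penalise mismatches."""
    return 2.0 if x == y else -1.0


def align_paths(
    a: Sequence[str],
    b: Sequence[str],
    sim: Callable[[str, str], float],
    gap: float = -1.0,
) -> float:
    """Best global alignment score of two paths, allowing gaps.

    A gap in one path corresponds to an insertion or deletion
    relative to the other path.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score: List[List[float]] = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against an empty path costs one gap per node.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # match nodes
                score[i - 1][j] + gap,                          # skip a node of a
                score[i][j - 1] + gap,                          # skip a node of b
            )
    return score[-1][-1]


# Path b lacks node "B", so the best alignment uses one gap:
# three matches (3 * 2.0) plus one gap (-1.0).
print(align_paths(["A", "B", "C", "D"], ["A", "C", "D"], identity_sim))  # prints 5.0
```

The table has one cell per pair of path prefixes, so the cost is proportional to the product of the two path lengths. In a real setting the similarity function would come from sequence similarity between proteins rather than label identity.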