Algorithms for Extracting Frequent Episodes in the Process of Temporal Data Mining
An important aspect of the data mining process is the discovery of patterns that strongly influence the studied problem. The purpose of this paper is to study frequent episode mining through the use of parallel pattern discovery algorithms. Parallel pattern discovery algorithms offer better performance and scalability, so they are of great interest to the data mining research community. In the following, we highlight several parallel and distributed frequent pattern mining algorithms on various platforms and present a comparative study of their main features. The study takes into account the new possibilities that arise with the emerging Compute Unified Device Architecture (CUDA) in the latest generation of graphics processing units. Given their high performance, low cost and growing feature set, GPUs are viable solutions for an optimal implementation of frequent pattern mining algorithms.
Keywords: Frequent Pattern Mining, Parallel Computing, Dynamic Load Balancing, Temporal Data Mining, CUDA, GPU, Fermi, Thread
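As a concrete baseline for what these algorithms compute, the windowed support of a frequent episode can be sketched serially (a minimal illustration under assumed definitions; the surveyed GPU algorithms parallelize this counting across the event stream):

```python
def episode_frequency(events, episode, window):
    """Count sliding windows of the event sequence that contain the
    episode's symbols in order (a serial sketch of episode support)."""
    hits = 0
    for start in range(len(events) - window + 1):
        segment = events[start:start + window]
        i = 0  # how much of the episode has been matched so far
        for e in segment:
            if i < len(episode) and e == episode[i]:
                i += 1
        hits += i == len(episode)
    return hits

# Toy temporal stream; the episode ('A', 'B') must occur in order.
events = list("ABCABBA")
freq = episode_frequency(events, ("A", "B"), 3)
# windows ABC, CAB, ABB contain A before B, so freq == 3
```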
Parallelizing Maximal Clique Enumeration on GPUs
We present a GPU solution for exact maximal clique enumeration (MCE) that
performs a search tree traversal following the Bron-Kerbosch algorithm. Prior
works on parallelizing MCE on GPUs perform a breadth-first traversal of the
tree, which has limited scalability because of the explosion in the number of
tree nodes at deep levels. We propose to parallelize MCE on GPUs by performing
depth-first traversal of independent subtrees in parallel. Since MCE suffers
from high load imbalance and memory capacity requirements, we propose a worker
list for dynamic load balancing, as well as partial induced subgraphs and a
compact representation of excluded vertex sets to regulate memory consumption.
Our evaluation shows that our GPU implementation on a single GPU outperforms
the state-of-the-art parallel CPU implementation by a geometric mean of 4.9x
(up to 16.7x), and scales efficiently to multiple GPUs. Our code has been
open-sourced to enable further research on accelerating MCE.
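The underlying Bron-Kerbosch recursion can be sketched as follows (a minimal serial sketch with pivoting; the paper's contribution is mapping independent subtrees of this recursion to GPU workers with dynamic load balancing):

```python
def bron_kerbosch(R, P, X, adj, cliques):
    """Enumerate maximal cliques by depth-first search.

    R: current clique, P: candidate vertices, X: excluded vertices.
    Each recursive call is an independent subtree; the GPU scheme
    hands such subtrees to workers via a worker list.
    """
    if not P and not X:
        cliques.append(set(R))
        return
    # Pivot on a vertex with many neighbors in P to prune the tree.
    pivot = max(P | X, key=lambda u: len(adj[u] & P))
    for v in P - adj[pivot]:
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, cliques)
        P = P - {v}
        X = X | {v}

# Example: a triangle {0, 1, 2} plus a pendant edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
# cliques contains exactly {0, 1, 2} and {2, 3}
```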
DeSCo: Towards Generalizable and Scalable Deep Subgraph Counting
Subgraph counting is the problem of counting the occurrences of a given query
graph in a large target graph. Large-scale subgraph counting is useful in
various domains, such as motif counting for social network analysis and loop
counting for money laundering detection on transaction networks. Recently, to
address the exponential runtime complexity of scalable subgraph counting,
neural methods have been proposed. However, existing neural counting approaches fall
short in three aspects. Firstly, the counts of the same query can vary from
zero to millions on different target graphs, posing a much larger challenge
than most graph regression tasks. Secondly, current scalable graph neural
networks have limited expressive power and fail to efficiently distinguish
graphs in count prediction. Furthermore, existing neural approaches cannot
predict the occurrence position of queries in the target graph.
Here we design DeSCo, a scalable neural deep subgraph counting pipeline,
which aims to accurately predict the query count and occurrence position on any
target graph after one-time training. Firstly, DeSCo uses a novel canonical
partition and divides the large target graph into small neighborhood graphs.
The technique greatly reduces the count variation while guaranteeing that no
occurrence is missed or double-counted. Secondly, neighborhood counting uses an expressive
subgraph-based heterogeneous graph neural network to accurately perform
counting in each neighborhood. Finally, gossip propagation propagates
neighborhood counts with learnable gates to harness the inductive biases of
motif counts. DeSCo is evaluated on eight real-world datasets from various
domains. It outperforms state-of-the-art neural methods with 137x improvement
in the mean squared error of count prediction, while maintaining polynomial
runtime complexity.
Comment: 8 pages main text, 10 pages appendix
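The no-miss/no-double-count property of a canonical partition can be illustrated on a toy triangle count (a hedged sketch: the index ordering and one-hop neighborhoods below are illustrative assumptions, and DeSCo replaces the exact count in each neighborhood with a GNN prediction):

```python
from itertools import combinations

def neighborhood(adj, v):
    """Induced subgraph on v and its lower-index neighbors: the
    'canonical' cut that assigns each occurrence to one node only."""
    nodes = {u for u in adj[v] if u < v} | {v}
    return {n: adj[n] & nodes for n in nodes}

def count_triangles(adj_sub):
    """Exact triangle count in a small neighborhood graph."""
    return sum(
        1
        for a, b, c in combinations(sorted(adj_sub), 3)
        if b in adj_sub[a] and c in adj_sub[a] and c in adj_sub[b]
    )

# Toy target: two triangles sharing the edge (1, 2).
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
per_node = {v: count_triangles(neighborhood(adj, v)) for v in adj}
total = sum(per_node.values())
# Each triangle is credited once, to its highest-index node's
# neighborhood, so the per-node counts sum to the global count of 2.
```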
Flashlight: Scalable Link Prediction with Effective Decoders
Link prediction (LP) has been recognized as an important task in graph
learning with its broad practical applications. A typical application of LP is
to retrieve the top scoring neighbors for a given source node, such as the
friend recommendation. Such services require high inference scalability to
find the top scoring neighbors from many candidate nodes at low latencies.
There are two popular decoders that the recent LP models mainly use to compute
the edge scores from node embeddings: the HadamardMLP and Dot Product decoders.
After theoretical and empirical analysis, we find that the HadamardMLP decoders
are generally more effective for LP. However, HadamardMLP lacks the scalability
for retrieving top scoring neighbors on large graphs, since to the best of our
knowledge, there does not exist an algorithm to retrieve the top scoring
neighbors for HadamardMLP decoders in sublinear complexity. To make HadamardMLP
scalable, we propose the Flashlight algorithm to accelerate the top scoring
neighbor retrievals for HadamardMLP: a sublinear algorithm that progressively
applies approximate maximum inner product search (MIPS) techniques with
adaptively adjusted query embeddings. Empirical results show that Flashlight
improves the inference speed of LP by more than 100 times on the large
OGBL-CITATION2 dataset without sacrificing effectiveness. Our work paves the
way for large-scale LP applications with the effective HadamardMLP decoders by
greatly accelerating their inference.
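A toy HadamardMLP decoder, together with the brute-force linear scan that Flashlight's sublinear MIPS-based retrieval replaces, can be sketched as follows (the weights, dimensions, and all-ones readout here are illustrative assumptions, not the paper's trained model):

```python
import heapq

def hadamard_mlp_score(src, dst, w, b):
    """Edge score: a one-hidden-layer MLP over the element-wise
    (Hadamard) product of the two node embeddings."""
    h = [a * c for a, c in zip(src, dst)]              # Hadamard product
    hidden = [max(0.0, sum(wi * x for wi, x in zip(row, h)) + bi)
              for row, bi in zip(w, b)]                # ReLU layer
    return sum(hidden)                                 # all-ones readout

def top_k_neighbors(src, candidates, w, b, k=2):
    """Brute-force O(n) scan over all candidates; this linear cost is
    exactly what Flashlight's adaptive MIPS queries avoid."""
    scored = [(hadamard_mlp_score(src, emb, w, b), i)
              for i, emb in enumerate(candidates)]
    return [i for _, i in heapq.nlargest(k, scored)]

# Toy weights and candidate embeddings.
w, b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
cands = [[2.0, 0.0], [0.0, 3.0], [1.0, 0.5], [-1.0, -1.0]]
top = top_k_neighbors([1.0, 1.0], cands, w, b, k=2)
# top == [1, 0]: candidates 1 and 0 score 3.0 and 2.0 under this toy MLP
```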
ToDD: Topological Compound Fingerprinting in Computer-Aided Drug Discovery
In computer-aided drug discovery (CADD), virtual screening (VS) is used for
identifying the drug candidates that are most likely to bind to a molecular
target in a large library of compounds. Most VS methods to date have focused on
using canonical compound representations (e.g., SMILES strings, Morgan
fingerprints) or generating alternative fingerprints of the compounds by
training progressively more complex variational autoencoders (VAEs) and graph
neural networks (GNNs). Although VAEs and GNNs led to significant improvements
in VS performance, these methods suffer from reduced performance when scaling
to large virtual compound datasets. The performance of these methods has shown
only incremental improvements in the past few years. To address this problem,
we developed a novel method using multiparameter persistence (MP) homology that
produces topological fingerprints of the compounds as multidimensional vectors.
Our primary contribution is framing the VS process as a new topology-based
graph ranking problem by partitioning a compound into chemical substructures
informed by the periodic properties of its atoms and extracting their
persistent homology features at multiple resolution levels. We show that the
margin loss fine-tuning of pretrained Triplet networks attains highly
competitive results in differentiating between compounds in the embedding space
and ranking their likelihood of becoming effective drug candidates. We further
establish theoretical guarantees for the stability properties of our proposed
MP signatures, and demonstrate that our models, enhanced by the MP signatures,
outperform state-of-the-art methods on benchmark datasets by a wide and highly
statistically significant margin (e.g., 93% gain for Cleves-Jain and 54% gain
for the DUD-E Diverse dataset).
Comment: NeurIPS 2022 (36th Conference on Neural Information Processing Systems)
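The margin loss used to fine-tune the pretrained Triplet networks can be sketched as follows (a minimal sketch with toy two-dimensional fingerprints; the actual MP signatures are high-dimensional multiparameter persistence vectors):

```python
def triplet_margin_loss(anchor, pos, neg, margin=1.0):
    """Pull the anchor fingerprint toward a same-class compound (pos)
    and push it away from a different one (neg), up to a margin."""
    def dist(a, b):  # Euclidean distance between fingerprint vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return max(0.0, dist(anchor, pos) - dist(anchor, neg) + margin)

# Toy fingerprints: the positive is already much closer than the
# negative, so the margin is satisfied and the loss is zero.
anchor, pos, neg = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]
loss = triplet_margin_loss(anchor, pos, neg)
# d(anchor, pos) = 1, d(anchor, neg) = 5, so loss = max(0, 1 - 5 + 1) = 0
```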
High performance graph analysis on parallel architectures
PhD Thesis
Over the last decade pharmacology has been developing computational methods to enhance drug development and testing. A computational method called network pharmacology uses graph analysis tools to determine protein target sets that can lead to better targeted drugs for diseases such as cancer. One promising area of network-based pharmacology is the detection of protein groups that can produce better effects if they are targeted together by drugs. However, the efficient prediction of such protein combinations is still a bottleneck in the area of computational biology.
The computational burden of the algorithms used by such protein prediction strategies to characterise the importance of such proteins constitutes an additional challenge for the field of network pharmacology. Computationally expensive graph algorithms such as the all pairs shortest path (APSP) computation can affect the overall drug discovery process, as needed network analysis results cannot be delivered on time. An ideal solution for these highly intensive computations could be the use of super-computing. However, graph algorithms have data-driven computation dictated by the structure of the graph, and this can lead to low compute capacity utilisation with execution times dominated by memory latency.
Therefore, this thesis seeks optimised solutions for the real-world graph problems of critical node detection and effectiveness characterisation that emerged from the collaboration with a pioneer company in the field of network pharmacology as part of a Knowledge Transfer Partnership (KTP) / Secondment (KTS). In particular, we examine how genetic algorithms could benefit the prediction of protein complexes whose removal could produce a more effective 'druggable' impact. Furthermore, we investigate how the problem of all pairs shortest path (APSP) computation can benefit from the use of emerging parallel hardware architectures such as GPU- and FPGA-based desktop accelerators.
In particular, we address the problem of critical node detection with the development of a heuristic search method. It is based on a genetic algorithm that computes optimised node combinations whose removal causes greater impact than common impact analysis strategies. Furthermore, we design a general pattern for parallel network analysis on multi-core architectures that considers a graph's embedded properties. It is a divide and conquer approach that decomposes a graph into smaller subgraphs based on its strongly connected components and computes the all pairs shortest paths concurrently on the GPU. Furthermore, we use linear algebra to design an APSP approach based on the BFS algorithm. We use algebraic expressions to transform the problem of path computation into multiple independent matrix-vector multiplications that are executed concurrently on an FPGA. Finally, we analyse how the optimised solutions of perturbation analysis and parallel graph processing provided in this thesis will impact the drug discovery process.
This research was part of a Knowledge Transfer Partnership (KTP) and Knowledge Transfer Secondment (KTS) between e-therapeutics PLC and Newcastle University. It was supported as a collaborative project by e-therapeutics PLC and the Technology Strategy Board.
Graph networks for molecular design
Deep learning methods applied to chemistry can be used to accelerate the discovery of new molecules. This work introduces GraphINVENT, a platform developed for graph-based molecular design using graph neural networks (GNNs). GraphINVENT uses a tiered deep neural network architecture to probabilistically generate new molecules a single bond at a time. All models implemented in GraphINVENT can quickly learn to build molecules resembling the training set molecules without any explicit programming of chemical rules. The models have been benchmarked using the MOSES distribution-based metrics, showing how GraphINVENT models compare well with state-of-the-art generative models. This work compares six different GNN-based generative models in GraphINVENT, and shows that ultimately the gated-graph neural network performs best against the metrics considered here.
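The bond-at-a-time generation loop can be caricatured as follows (a toy sketch in which uniform random action probabilities stand in for GraphINVENT's learned GNN outputs; valence rules and atom types are omitted):

```python
import random

def generate_graph(num_nodes, add_prob, seed=0):
    """Grow a graph one bond at a time: each new node is attached to an
    earlier one, then further bonds are sampled with probability
    add_prob (a stand-in for a learned per-action probability)."""
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    edges = []
    for v in range(1, num_nodes):
        # Always attach the new node so the graph stays connected.
        edges.append((rng.randrange(v), v))
        for u in range(v):
            if (u, v) not in edges and rng.random() < add_prob:
                edges.append((u, v))
    return edges

# Sample one small random "molecule" skeleton with 6 nodes.
edges = generate_graph(6, add_prob=0.3)
# Every node appears in at least one bond, and edges are stored (u, v)
# with u < v.
```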