Neural Graph Databases
Graph databases (GDBs) enable processing and analysis of unstructured,
complex, rich, and usually vast graph datasets. Despite the great significance
of GDBs in both academia and industry, little effort has gone into
integrating them with the predictive power of graph neural networks (GNNs). In
this work, we show how to seamlessly combine nearly any GNN model with the
computational capabilities of GDBs. For this, we observe that the majority of
these systems are based on, or support, a graph data model called the Labeled
Property Graph (LPG), where vertices and edges can have arbitrarily complex
sets of labels and properties. We then develop LPG2vec, an encoder that
transforms an arbitrary LPG dataset into a representation that can be directly
used with a broad class of GNNs, including convolutional, attentional,
message-passing, and even higher-order or spectral models. In our evaluation,
we show that the rich information represented as LPG labels and properties is
properly preserved by LPG2vec, and it increases the accuracy of predictions
regardless of the targeted learning task or the GNN model used, by up to 34%
compared to graphs with no LPG labels/properties. In general, LPG2vec enables
combining the predictive power of the most powerful GNNs with the full scope of
information encoded in the LPG model, paving the way for neural graph
databases, a class of systems where the vast complexity of maintained data will
benefit from modern and future graph machine learning methods.
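
The abstract does not spell out how LPG2vec builds its representation, so the following is only a minimal sketch of the general idea of turning LPG labels and properties into per-vertex feature vectors that a GNN could consume. The function name `encode_vertices`, the input layout, and the choice of one-hot encoding for labels and string-valued properties are all assumptions made for illustration, not the authors' encoder.

```python
from typing import Dict, List, Set, Tuple, Union

PropertyValue = Union[int, float, str]
Vertex = Tuple[Set[str], Dict[str, PropertyValue]]  # (labels, properties)


def encode_vertices(vertices: Dict[str, Vertex]) -> Dict[str, List[float]]:
    """Hypothetical encoder: labels -> one-hot, numeric properties -> raw
    value (0.0 if absent), string properties -> one-hot per (key, value)."""
    labels = sorted({l for labs, _ in vertices.values() for l in labs})
    numeric_keys = sorted({
        k for _, props in vertices.values()
        for k, v in props.items() if isinstance(v, (int, float))
    })
    categorical = sorted({
        (k, v) for _, props in vertices.values()
        for k, v in props.items() if isinstance(v, str)
    })

    features: Dict[str, List[float]] = {}
    for vid, (labs, props) in vertices.items():
        row = [1.0 if l in labs else 0.0 for l in labels]
        row += [
            float(props[k]) if isinstance(props.get(k), (int, float)) else 0.0
            for k in numeric_keys
        ]
        row += [1.0 if props.get(k) == v else 0.0 for k, v in categorical]
        features[vid] = row
    return features


# Toy LPG with three vertices (made-up data for illustration only).
example = {
    "v1": ({"Person"}, {"age": 30, "city": "Zurich"}),
    "v2": ({"Person", "Student"}, {"age": 22}),
    "v3": ({"Car"}, {"color": "red"}),
}
encoded = encode_vertices(example)
```

Each resulting row has one slot per label, one per numeric property key, and one per observed (key, value) pair of a string property, so every vertex gets a vector of the same length regardless of which labels and properties it carries.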
Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons
The Jaccard similarity index is an important measure of the overlap of two
sets, widely used in machine learning, computational genomics, information
retrieval, and many other areas. We design and implement SimilarityAtScale, the
first communication-efficient distributed algorithm for computing the Jaccard
similarity among pairs of large datasets. Our algorithm provides an efficient
encoding of this problem into a multiplication of sparse matrices. Both the
encoding and sparse matrix product are performed in a way that minimizes data
movement in terms of communication and synchronization costs. We apply our
algorithm to obtain similarity among all pairs of a set of large samples of
genomes. This task is a key part of modern metagenomics analysis and an
ever-growing need due to the increasing availability of high-throughput DNA
sequencing data. The resulting scheme is the first to enable accurate Jaccard
distance derivations for massive datasets, using large-scale distributed-memory
systems. We package our routines in a tool, called GenomeAtScale, that combines
the proposed algorithm with tools for processing input sequences. Our
evaluation on real data illustrates that one can use GenomeAtScale to
effectively employ tens of thousands of processors to reach new frontiers in
large-scale genomic and metagenomic analysis. While GenomeAtScale can be used
to foster DNA research, the more general underlying SimilarityAtScale algorithm
may be used for high-performance distributed similarity computations in other
data analytics application domains.
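
To make the encoding of Jaccard similarity into a sparse matrix product concrete, here is a small single-machine sketch. It assumes an indicator matrix A with one row per attribute (for example a k-mer) and one column per sample, so that B = AᵀA holds the pairwise intersection sizes and J(i, j) = B[i][j] / (|S_i| + |S_j| - B[i][j]). The function name `jaccard_all_pairs`, the dictionary-based sparse storage, and the convention for two empty sets are illustrative assumptions; none of the communication-avoiding distribution described in the abstract is modeled here.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Hashable, Iterable, Tuple


def jaccard_all_pairs(
    samples: Dict[str, Iterable[Hashable]]
) -> Dict[Tuple[str, str], float]:
    names = sorted(samples)
    sets = [set(samples[n]) for n in names]

    # Sparse indicator matrix A stored by rows:
    # attribute -> sorted list of sample indices that contain it.
    rows = defaultdict(list)
    for j, s in enumerate(sets):
        for item in s:
            rows[item].append(j)

    # B = A^T A, accumulated as a sum of outer products of sparse rows.
    # Only the strict upper triangle (i < j) is stored.
    inter = defaultdict(int)
    for cols in rows.values():
        for i, j in combinations(cols, 2):
            inter[(i, j)] += 1

    result = {}
    for i, j in combinations(range(len(names)), 2):
        b = inter.get((i, j), 0)
        union = len(sets[i]) + len(sets[j]) - b
        # Assumed convention: two empty sets are treated as identical.
        result[(names[i], names[j])] = b / union if union else 1.0
    return result


# Toy k-mer sets (made-up data for illustration only).
toy = {
    "a": {"ACG", "CGT", "GTA"},
    "b": {"CGT", "GTA", "TAC"},
    "c": {"TTT"},
}
similarities = jaccard_all_pairs(toy)
```

Only attributes shared by at least two samples contribute work to the product, which is the reason the sparse formulation is attractive when most samples overlap in few attributes.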
GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra
We propose GraphMineSuite (GMS): the first benchmarking suite for graph
mining that facilitates evaluating and constructing high-performance graph
mining algorithms. First, GMS comes with a benchmark specification based on
extensive literature review, prescribing representative problems, algorithms,
and datasets. Second, GMS offers a carefully designed software platform for
seamless testing of different fine-grained elements of graph mining algorithms,
such as graph representations or algorithm subroutines. The platform includes
parallel implementations of more than 40 considered baselines, and it
facilitates developing complex and fast mining algorithms. High modularity is
possible by harnessing set algebra operations such as set intersection and
difference, which enables breaking complex graph mining algorithms into simple
building blocks that can be separately experimented with. GMS is supported with
a broad concurrency analysis for portability in performance insights, and a
novel performance metric to assess the throughput of graph mining algorithms,
enabling more insightful evaluation. As use cases, we harness GMS to rapidly
redesign and accelerate state-of-the-art baselines of core graph mining
problems: degeneracy reordering (by up to >2x), maximal clique listing (by up
to >9x), k-clique listing (by 1.1x), and subgraph isomorphism (by up to 2.5x),
also obtaining better theoretical performance bounds.
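
As an illustration of how set intersection and difference can serve as building blocks for a graph mining algorithm, the sketch below expresses maximal clique listing (the classic Bron-Kerbosch recursion with pivoting) purely in terms of set operations. It is a textbook formulation written with Python's built-in sets, not code from GMS; the function name `maximal_cliques` and the adjacency-dictionary input are assumptions.

```python
from typing import Dict, FrozenSet, List, Set


def maximal_cliques(adj: Dict[int, Set[int]]) -> List[FrozenSet[int]]:
    """Bron-Kerbosch with pivoting, written only with set algebra."""
    cliques: List[FrozenSet[int]] = []

    def expand(r: Set[int], p: Set[int], x: Set[int]) -> None:
        if not p and not x:
            cliques.append(frozenset(r))  # r cannot be extended further
            return
        # Pivot: the vertex of P or X with the most neighbors inside P.
        pivot = max(p | x, key=lambda u: len(p & adj[u]))
        # Set difference: only non-neighbors of the pivot need a branch.
        for v in list(p - adj[pivot]):
            # Set intersection: restrict candidates to neighbors of v.
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Toy graph: a triangle 0-1-2 with a pendant edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
found = maximal_cliques(graph)
```

Because every step is an intersection, difference, or union, the set representation (hash sets, sorted arrays, bitvectors) could be swapped without touching the algorithm's logic, which is the kind of modularity the abstract describes.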
Substream-Centric Maximum Matchings on FPGA
Developing high-performance and energy-efficient algorithms for maximum matchings is becoming increasingly important in social network analysis, computational sciences, scheduling, and others. In this work, we propose the first maximum matching algorithm designed for FPGAs; it is energy-efficient and has provable guarantees on accuracy, performance, and storage utilization. To achieve this, we forego popular graph processing paradigms, such as vertex-centric programming, that often entail large communication costs. Instead, we propose a substream-centric approach, in which the input stream of data is divided into substreams processed independently to enable more parallelism while lowering communication costs. We base our work on the theory of streaming graph algorithms and analyze 14 models and 28 algorithms. We use this analysis to provide theoretical underpinning that matches the physical constraints of FPGA platforms. Our algorithm delivers high performance (more than 4× speedup over tuned parallel CPU variants), low memory, high accuracy, and effective usage of FPGA resources. The substream-centric approach could easily be extended to other algorithms to offer low-power and high-performance graph processing on FPGAs.
ISSN: 1936-7406, ISSN: 1936-741
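
The abstract says only that the edge stream is divided into independently processed substreams. One known way to do this in the streaming literature, assumed here purely for illustration, is to give each substream a weight threshold, keep a greedy maximal matching per substream in a single pass, and then merge the per-substream matchings from the highest threshold down. The function name `substream_matching`, the weighted-edge input, and the parameters `num_substreams` and `eps` are hypothetical, and this sequential sketch says nothing about the FPGA design itself.

```python
from typing import Hashable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable, float]  # (u, v, weight)


def substream_matching(
    edges: List[Edge], num_substreams: int, eps: float
) -> List[Edge]:
    # Assumed scheme: substream i accepts edges with weight >= (1 + eps)^i.
    thresholds = [(1.0 + eps) ** i for i in range(num_substreams)]
    matched: List[Set[Hashable]] = [set() for _ in thresholds]
    matchings: List[List[Edge]] = [[] for _ in thresholds]

    # Single pass over the stream. The substreams never read each other's
    # state, so the inner loop could run in parallel.
    for u, v, w in edges:
        for i, t in enumerate(thresholds):
            if w >= t and u not in matched[i] and v not in matched[i]:
                matchings[i].append((u, v, w))
                matched[i].update((u, v))

    # Merge greedily, preferring edges from higher-threshold substreams.
    used: Set[Hashable] = set()
    result: List[Edge] = []
    for i in reversed(range(num_substreams)):
        for u, v, w in matchings[i]:
            if u not in used and v not in used:
                result.append((u, v, w))
                used.update((u, v))
    return result


# Toy stream: a path a-b-c-d whose middle edge is the heavy one.
stream = [("a", "b", 1.0), ("b", "c", 4.0), ("c", "d", 1.0)]
matching = substream_matching(stream, num_substreams=3, eps=1.0)
```

In the toy stream, a plain greedy pass would keep the two light outer edges, while the high-threshold substream captures the heavy middle edge and the merge step prefers it.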