6 research outputs found
Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons
The Jaccard similarity index is an important measure of the overlap of two
sets, widely used in machine learning, computational genomics, information
retrieval, and many other areas. We design and implement SimilarityAtScale, the
first communication-efficient distributed algorithm for computing the Jaccard
similarity among pairs of large datasets. Our algorithm provides an efficient
encoding of this problem into a multiplication of sparse matrices. Both the
encoding and sparse matrix product are performed in a way that minimizes data
movement in terms of communication and synchronization costs. We apply our
algorithm to obtain similarity among all pairs of a set of large samples of
genomes. This task is a key part of modern metagenomics analysis and an
evergrowing need due to the increasing availability of high-throughput DNA
sequencing data. The resulting scheme is the first to enable accurate Jaccard
distance derivations for massive datasets, using largescale distributed-memory
systems. We package our routines in a tool, called GenomeAtScale, that combines
the proposed algorithm with tools for processing input sequences. Our
evaluation on real data illustrates that one can use GenomeAtScale to
effectively employ tens of thousands of processors to reach new frontiers in
large-scale genomic and metagenomic analysis. While GenomeAtScale can be used
to foster DNA research, the more general underlying SimilarityAtScale algorithm
may be used for high-performance distributed similarity computations in other
data analytics application domains
GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra
We propose GraphMineSuite (GMS): the first benchmarking suite for graph
mining that facilitates evaluating and constructing high-performance graph
mining algorithms. First, GMS comes with a benchmark specification based on
extensive literature review, prescribing representative problems, algorithms,
and datasets. Second, GMS offers a carefully designed software platform for
seamless testing of different fine-grained elements of graph mining algorithms,
such as graph representations or algorithm subroutines. The platform includes
parallel implementations of more than 40 considered baselines, and it
facilitates developing complex and fast mining algorithms. High modularity is
possible by harnessing set algebra operations such as set intersection and
difference, which enables breaking complex graph mining algorithms into simple
building blocks that can be separately experimented with. GMS is supported with
a broad concurrency analysis for portability in performance insights, and a
novel performance metric to assess the throughput of graph mining algorithms,
enabling more insightful evaluation. As use cases, we harness GMS to rapidly
redesign and accelerate state-of-the-art baselines of core graph mining
problems: degeneracy reordering (by up to >2x), maximal clique listing (by up
to >9x), k-clique listing (by 1.1x), and subgraph isomorphism (by up to 2.5x),
also obtaining better theoretical performance bounds