GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra
We propose GraphMineSuite (GMS): the first benchmarking suite for graph
mining that facilitates evaluating and constructing high-performance graph
mining algorithms. First, GMS comes with a benchmark specification based on
extensive literature review, prescribing representative problems, algorithms,
and datasets. Second, GMS offers a carefully designed software platform for
seamless testing of different fine-grained elements of graph mining algorithms,
such as graph representations or algorithm subroutines. The platform includes
parallel implementations of more than 40 of the considered baselines, and it
facilitates the development of complex and fast mining algorithms. High modularity is
possible by harnessing set algebra operations such as set intersection and
difference, which enables breaking complex graph mining algorithms into simple
building blocks that can be separately experimented with. GMS is supported by a
broad concurrency analysis, for portability of its performance insights, and by
a novel performance metric that assesses the throughput of graph mining
algorithms, enabling more insightful evaluation. As use cases, we harness GMS
to rapidly redesign and accelerate state-of-the-art baselines for core graph
mining problems: degeneracy reordering (by up to >2x), maximal clique listing
(by up to >9x), k-clique listing (by 1.1x), and subgraph isomorphism (by up to
2.5x), while also obtaining better theoretical performance bounds.
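The set-algebra decomposition described above can be illustrated with a toy example: triangle counting expressed as per-edge neighborhood intersections. This is a plain-Python sketch of the idea, not GMS's actual C++ API:

```python
def triangle_count(adj):
    """Count triangles in an undirected graph; adj[v] is the set of
    neighbors of v."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                # the set-intersection building block: each common
                # neighbor w with w > v closes a triangle exactly once
                count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count

# Example: the complete graph K4 contains 4 triangles.
k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(triangle_count(k4))  # 4
```

Swapping the intersection for a different set operation (e.g., difference) yields other mining kernels, which is the modularity the suite exploits.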
Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable
There has been significant recent interest in parallel graph processing due
to the need to quickly analyze the large graphs available today. Many graph
codes have been designed for distributed memory or external memory. However,
today even the largest publicly-available real-world graph (the Hyperlink Web
graph with over 3.5 billion vertices and 128 billion edges) can fit in the
memory of a single commodity multicore server. Nevertheless, most experimental
work in the literature reports results on much smaller graphs, and the existing
results for the Hyperlink graph use distributed or external memory. Therefore, it is
natural to ask whether we can efficiently solve a broad class of graph problems
on this graph in memory.
This paper shows that theoretically-efficient parallel graph algorithms can
scale to the largest publicly-available graphs using a single machine with a
terabyte of RAM, processing them in minutes. We give implementations of
theoretically-efficient parallel algorithms for 20 important graph problems. We
also present the optimizations and techniques that we used in our
implementations, which were crucial in enabling us to process these large
graphs quickly. We show that the running times of our implementations
outperform existing state-of-the-art implementations on the largest real-world
graphs. For many of the problems that we consider, this is the first time they
have been solved on graphs at this scale. We have made the implementations
developed in this work publicly available as the Graph-Based Benchmark Suite
(GBBS).
Comment: This is the full version of the paper appearing in the ACM Symposium
on Parallelism in Algorithms and Architectures (SPAA), 201
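A flavor of the frontier-based abstraction used by such suites (GBBS builds on Ligra-style primitives) can be conveyed with a sequential, level-synchronous BFS sketch; adjacency sets stand in here for GBBS's compressed graph formats:

```python
def bfs_levels(adj, src):
    """Level-synchronous BFS: each round expands the whole frontier,
    mirroring the frontier-centric style of parallel graph suites.
    adj maps a vertex to an iterable of its neighbors."""
    dist = {src: 0}
    frontier = {src}
    level = 0
    while frontier:
        level += 1
        next_frontier = set()
        for u in frontier:
            for v in adj[u]:
                if v not in dist:  # first visit fixes the distance
                    dist[v] = level
                    next_frontier.add(v)
        frontier = next_frontier  # number of rounds <= graph diameter
    return dist

# Example: a path 0-1-2 gives distances 0, 1, 2.
print(bfs_levels({0: [1], 1: [0, 2], 2: [1]}, 0))  # {0: 0, 1: 1, 2: 2}
```

In the parallel setting, each round's frontier expansion is the unit that is processed concurrently with work-efficiency and span guarantees.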
Parallel Five-Cycle Counting Algorithms
Counting the frequency of subgraphs in large networks is a classic research question that reveals the underlying substructures of these networks for important applications. However, subgraph counting is a challenging problem, even for subgraph sizes as small as five, due to the combinatorial explosion in the number of possible occurrences. This paper focuses on the five-cycle, which is an important special case of five-vertex subgraph counting and one of the most difficult to count efficiently.
We design two new parallel five-cycle counting algorithms and prove that they are work-efficient and achieve polylogarithmic span. Both algorithms are based on computing low out-degree orientations, which enables the efficient computation of directed two-paths and three-paths, and the algorithms differ in the ways in which they use this orientation to eliminate double-counting. We develop fast multicore implementations of the algorithms and propose a work scheduling optimization to improve their performance. Our experiments on a variety of real-world graphs using a 36-core machine with two-way hyper-threading show that our algorithms achieve 10-46x self-relative speedups, outperform our serial benchmarks by 10-32x, and outperform the previous state-of-the-art serial algorithm by up to 818x.
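The role of a low out-degree orientation can be sketched as follows. This toy version ranks vertices by degree and counts directed two-paths sequentially; the paper's algorithms are parallel and additionally handle the double-counting of five-cycles:

```python
def orient_by_degree(adj):
    """Orient each undirected edge from its lower-ranked endpoint to its
    higher-ranked one, ranking vertices by (degree, id). Degree-based
    ranking is one common way to obtain low out-degree orientations on
    sparse graphs."""
    rank = {v: (len(adj[v]), v) for v in adj}
    return {v: {u for u in adj[v] if rank[v] < rank[u]} for v in adj}

def count_directed_two_paths(out):
    """Count directed paths u -> v -> w under the orientation; bounded
    out-degrees keep this cheap, which is what makes orientation-based
    subgraph counting efficient."""
    return sum(len(out[v]) for u in out for v in out[u])

# Example: a triangle has exactly one directed two-path after orienting.
tri = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(count_directed_two_paths(orient_by_degree(tri)))  # 1
```

Directed two- and three-paths enumerated this way are the intermediate objects from which the five-cycle counts are assembled.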
Efficient parallel implementation of the multiplicative weight update method for graph-based linear programs
Positive linear programs (LPs) model many graph and operations research
problems. One can solve for a (1+ε)-approximation for positive LPs,
for any selected ε > 0, in polylogarithmic depth and near-linear work via
variations of the multiplicative weight update (MWU) method. Despite extensive
theoretical work on these algorithms through the decades, their empirical
performance is not well understood.
In this work, we implement and test an efficient parallel algorithm for
solving positive LP relaxations, and apply it to graph problems such as densest
subgraph, bipartite matching, vertex cover and dominating set. We accelerate
the algorithm via a new step size search heuristic. Our implementation uses
sparse linear algebra optimization techniques such as fusion of vector
operations and use of sparse formats. Furthermore, we devise an implicit
representation for graph incidence constraints. We demonstrate parallel
scalability using OpenMP threading and MPI on the Stampede2
supercomputer. We compare this implementation with exact libraries and
specialized libraries for the above problems in order to evaluate MWU's
practical standing for both accuracy and performance among other methods. Our
results show this implementation is faster than general purpose LP solvers (IBM
CPLEX, Gurobi) in all of our experiments, and in some instances, outperforms
state-of-the-art specialized parallel graph algorithms.
Comment: Pre-print. 13 pages, comments welcome
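The MWU update rule underlying such solvers can be seen in its textbook form on a two-player zero-sum game. This sketch only illustrates the multiplicative weight update itself; the paper's positive-LP solver (step size search, stopping rules, the LP reduction) is far more involved:

```python
import math

def mwu_game_value(A, eta=0.1, rounds=2000):
    """Textbook multiplicative weight update: the row player maintains
    weights over rows, plays the induced mixed strategy, and the column
    player best-responds each round. The average payoff approaches the
    game value within an O(eta) additive error."""
    n, m = len(A), len(A[0])
    w = [1.0] * n
    avg_payoff = 0.0
    for _ in range(rounds):
        total = sum(w)
        p = [wi / total for wi in w]  # row player's mixed strategy
        # column player picks the column minimizing expected payoff
        payoffs = [sum(p[i] * A[i][j] for i in range(n)) for j in range(m)]
        j = min(range(m), key=payoffs.__getitem__)
        avg_payoff += payoffs[j] / rounds
        # multiplicatively reward rows that do well against column j
        for i in range(n):
            w[i] *= math.exp(eta * A[i][j])
    return avg_payoff

# Matching pennies has game value 0; MWU's average payoff approaches it
# from below.
print(mwu_game_value([[1, -1], [-1, 1]]))
```

The step size eta plays the role that the paper's step size search heuristic tunes adaptively.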
Towards Lightweight and Automated Representation Learning System for Networks
We propose LIGHTNE 2.0, a cost-effective, scalable, automated, and
high-quality network embedding system that scales to graphs with hundreds of
billions of edges on a single machine. In contrast to the mainstream belief
that distributed architecture and GPUs are needed for large-scale network
embedding with good quality, we prove that we can achieve higher quality,
better scalability, lower cost, and faster runtime with shared-memory, CPU-only
architecture. LIGHTNE 2.0 combines two theoretically grounded embedding
methods, NetSMF and ProNE. We introduce the following techniques to network embedding
for the first time: (1) a newly proposed downsampling method to reduce the
sample complexity of NetSMF while preserving its theoretical advantages; (2) a
high-performance parallel graph processing stack GBBS to achieve high memory
efficiency and scalability; (3) sparse parallel hash table to aggregate and
maintain the matrix sparsifier in memory; (4) a fast randomized singular value
decomposition (SVD) enhanced by power iteration and fast orthonormalization to
improve vanilla randomized SVD in terms of both efficiency and effectiveness;
(5) Intel MKL for proposed fast randomized SVD and spectral propagation; and
(6) a fast and lightweight AutoML library FLAML for automated hyperparameter
tuning. Experimental results show that LIGHTNE 2.0 can be up to 84X faster than
GraphVite, 30X faster than PBG and 9X faster than NetSMF while delivering
better performance. LIGHTNE 2.0 can embed a very large graph with 1.7 billion
nodes and 124 billion edges in half an hour on a CPU server, while other
baselines cannot handle graphs of this scale.
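The power-iteration enhancement in item (4) rests on a classical idea: repeatedly multiplying by the matrix and re-normalizing sharpens the leading spectral directions. A dense, pure-Python sketch of plain power iteration (randomized SVD applies the same principle to a random block, with re-orthonormalization):

```python
import math
import random

def power_iteration(A, iters=200, seed=0):
    """Estimate the dominant eigenpair of a symmetric matrix by power
    iteration: multiply a generic start vector by A and re-normalize
    until it aligns with the leading eigenvector."""
    random.seed(seed)
    n = len(A)
    v = [random.random() + 0.1 for _ in range(n)]  # generic start vector
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged vector estimates the eigenvalue
    lam = sum(v[i] * sum(A[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v

# Example: diag(2, 1) has dominant eigenvalue 2.
lam, v = power_iteration([[2.0, 0.0], [0.0, 1.0]])
print(round(lam, 6))  # 2.0
```

Convergence is geometric in the ratio of the top two eigenvalues, which is why a few power-iteration steps suffice to improve a randomized SVD's accuracy.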
Parallel Index-Based Structural Graph Clustering and Its Approximation
SCAN (Structural Clustering Algorithm for Networks) is a well-studied, widely
used graph clustering algorithm. For large graphs, however, sequential SCAN
variants are prohibitively slow, and parallel SCAN variants do not effectively
share work among queries with different SCAN parameter settings. Since users of
SCAN often explore many parameter settings to find good clusterings, it is
worthwhile to precompute an index that speeds up queries.
This paper presents a practical and provably efficient parallel index-based
SCAN algorithm based on GS*-Index, a recent sequential algorithm. Our parallel
algorithm improves upon the asymptotic work of the sequential algorithm by
using integer sorting. It is also highly parallel, achieving logarithmic span
(parallel time) for both index construction and clustering queries.
Furthermore, we apply locality-sensitive hashing (LSH) to design a novel
approximate SCAN algorithm and prove guarantees for its clustering behavior.
We present an experimental evaluation of our algorithms on large real-world
graphs. On a 48-core machine with two-way hyper-threading, our parallel index
construction achieves 50--151x speedup over the construction of
GS*-Index. In fact, even on a single thread, our index construction algorithm
is faster than GS*-Index. Our parallel index query implementation achieves
5--32x speedup over GS*-Index queries across a range of SCAN parameter
values, and our implementation is always faster than ppSCAN, a state-of-the-art
parallel SCAN algorithm. Moreover, our experiments show that applying LSH
results in faster index construction while maintaining good clustering quality.
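SCAN's clustering decisions are driven by a structural similarity between adjacent vertices: the cosine similarity of their closed neighborhoods. A minimal sketch (the SCAN parameters ε and μ then threshold this quantity):

```python
import math

def structural_similarity(adj, u, v):
    """SCAN's structural similarity of two adjacent vertices. adj[v] is
    the set of neighbors of v. SCAN keeps edges whose similarity is at
    least eps, and calls a vertex a core if it has at least mu such
    neighbors; cores and their similar neighbors form the clusters."""
    nu = adj[u] | {u}  # closed neighborhood of u
    nv = adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

# Example: in a triangle every pair has identical closed neighborhoods,
# so the similarity is 1.0.
tri = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(structural_similarity(tri, 0, 1))  # 1.0
```

An index precomputes enough of these similarities that queries over many (ε, μ) settings avoid recomputing them, which is what makes index construction the expensive step worth parallelizing.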
ConnectIt: A Framework for Static and Incremental Parallel Graph Connectivity Algorithms
Connected components is a fundamental kernel in graph applications due to its
usefulness in measuring how well-connected a graph is, as well as its use as
subroutines in many other graph algorithms. The fastest existing parallel
multicore algorithms for connectivity are based on some form of edge sampling
and/or linking and compressing trees. However, many combinations of these
design choices have been left unexplored. In this paper, we design the
ConnectIt framework, which provides different sampling strategies as well as
various tree linking and compression schemes. ConnectIt enables us to obtain
several hundred new variants of connectivity algorithms, most of which extend
to computing spanning forest. In addition to static graphs, we also extend
ConnectIt to support mixes of insertions and connectivity queries in the
concurrent setting.
We present an experimental evaluation of ConnectIt on a 72-core machine,
which we believe is the most comprehensive evaluation of parallel connectivity
algorithms to date. Compared to a collection of state-of-the-art static
multicore algorithms, we obtain an average speedup of 37.4x (2.36x average
speedup over the fastest existing implementation for each graph). Using
ConnectIt, we are able to compute connectivity on the largest
publicly-available graph (with over 3.5 billion vertices and 128 billion edges)
in under 10 seconds using a 72-core machine, providing a 3.1x speedup over the
fastest existing connectivity result for this graph, in any computational
setting. For our incremental algorithms, we show that our algorithms can ingest
graph updates at up to several billion edges per second. Finally, to guide the
user in selecting the best variants in ConnectIt for different situations, we
provide a detailed analysis of the different strategies in terms of their work
and locality.
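The "linking and compressing trees" design space refers to union-find style structures. A sequential sketch with full path compression and a link-by-minimum-id rule (ConnectIt's actual variants also add edge sampling and run concurrently):

```python
class UnionFind:
    """Sequential sketch of the link-and-compress core that frameworks
    like ConnectIt parameterize; real variants differ in the sampling
    strategy, the linking rule, and the compression scheme."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # path compression: point every visited node at the root
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            if rx > ry:  # linking rule: smaller id becomes the root
                rx, ry = ry, rx
            self.parent[ry] = rx

def num_components(n, edges):
    uf = UnionFind(n)
    for u, v in edges:
        uf.union(u, v)
    return len({uf.find(v) for v in range(n)})

# Example: {0, 1, 2} and {3, 4} form two components.
print(num_components(5, [(0, 1), (1, 2), (3, 4)]))  # 2
```

Each choice of linking rule and compression scheme trades off work against locality, which is the axis the framework's analysis explores.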
Compiler and system for resilient distributed heterogeneous graph analytics
Graph analytics systems are used in a wide variety of applications including health care, electronic circuit design, machine learning, and cybersecurity. Graph analytics systems must handle very large graphs such as the Facebook friends graph, which has more than a billion nodes and 200 billion edges. Since machines have limited main memory, distributed-memory clusters with sufficient memory and computation power are required for processing these graphs. In distributed graph analytics, the graph is partitioned among the machines in a cluster, and communication between partitions is implemented using a substrate like MPI. However, programming distributed-memory systems is not easy, and the recent trend towards processor heterogeneity has added to this complexity. To simplify the programming of graph applications on such platforms, this dissertation first presents a compiler called Abelian that translates shared-memory descriptions of graph algorithms written in the Galois programming model into efficient code for distributed-memory platforms with heterogeneous processors. An important runtime parameter to the compiler-generated distributed code is the partitioning policy. We present an experimental study of partitioning strategies for distributed work-efficient graph analytics applications on clusters with different CPU architectures at large scale (up to 256 machines). Based on this study, we present a simple rule of thumb to select among the myriad policies. Another challenge of distributed graph analytics that we address in this dissertation is dealing with machine fail-stop failures, which is an important concern, especially for long-running graph analytics applications on large clusters.
We present a novel communication and synchronization substrate called Phoenix that leverages the algorithmic properties of graph analytics applications to recover from faults with zero overhead during fault-free execution, and we show that Phoenix is 24x faster than previous state-of-the-art systems. In this dissertation, we also look at the new opportunities for graph analytics on massive datasets brought by a new kind of byte-addressable memory technology with higher density and lower cost than DRAM, such as Intel Optane DC Persistent Memory. This enables the design of affordable systems that support up to 6TB of randomly accessible memory. In this dissertation, we present key runtime and algorithmic principles to consider when performing graph analytics on massive datasets on Optane DC Persistent Memory, as well as ideas that apply to graph analytics on all large-memory platforms. Finally, we show that our distributed graph analytics infrastructure can be used for a new domain of applications, in particular, embedding algorithms such as Word2Vec. Word2Vec trains the vector representations of words (also known as word embeddings) on a large text corpus, and the resulting vector embeddings have been shown to capture semantic and syntactic relationships among words. Other examples include Node2Vec, Code2Vec, Sequence2Vec, etc. (collectively known as Any2Vec), with a wide variety of uses. We formulate the training of such applications as a graph problem and present GraphAny2Vec, a distributed Any2Vec training framework that leverages the state-of-the-art distributed heterogeneous graph analytics infrastructure developed in this dissertation to scale Any2Vec training to large distributed clusters. GraphAny2Vec also demonstrates a novel way of combining model gradients during training, which allows it to scale without losing accuracy.
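The partitioning-policy study can be grounded with a toy example of one policy: a contiguous vertex-block assignment in which each edge is placed on the host owning its source vertex. This is purely illustrative; the dissertation compares many richer policies and derives a rule of thumb for choosing among them:

```python
def partition_by_vertex_block(num_vertices, num_hosts, edges):
    """A toy 'outgoing edge-cut' style policy: vertices are split into
    contiguous blocks, one per host, and each edge goes to the host
    that owns its source vertex."""
    block = (num_vertices + num_hosts - 1) // num_hosts  # ceil division

    def owner(v):
        return v // block

    parts = [[] for _ in range(num_hosts)]
    for u, v in edges:
        parts[owner(u)].append((u, v))
    return parts

# Example: 4 vertices on 2 hosts -> host 0 owns {0, 1}, host 1 owns {2, 3}.
print(partition_by_vertex_block(4, 2, [(0, 1), (3, 2)]))  # [[(0, 1)], [(3, 2)]]
```

Different policies change how much communication crosses partition boundaries, which is why the choice of policy is such an important runtime parameter.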