A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures
Irregular computations on unstructured data are an important class of
problems for parallel programming. Graph coloring is often an important
preprocessing step, e.g. as a way to perform dependency analysis for safe
parallel execution. The total run time of a coloring algorithm adds to the
overall parallel overhead of the application whereas the number of colors used
determines the amount of exposed parallelism. A fast and scalable coloring
algorithm using as few colors as possible is vital for the overall parallel
performance and scalability of many irregular applications that depend upon
runtime dependency analysis.
Catalyurek et al. have proposed a graph coloring algorithm which relies on
speculative, local assignment of colors. In this paper we present an improved
version which runs even more optimistically with less thread synchronization
and reduced number of conflicts compared to Catalyurek et al.'s algorithm. We
show that the new technique scales better on multi-core and many-core systems
and performs up to 1.5x faster than its predecessor on graphs with high-degree
vertices, while keeping the number of colors at the same near-optimal levels.
Comment: To appear in the proceedings of Euro Par 201
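The speculative pattern this abstract describes — optimistically assign colors, then detect and resolve conflicts — can be sketched in a few lines. This is an illustrative Python sketch under my own naming, not Catalyurek et al.'s or the authors' implementation; run sequentially as here, the first pass already produces no conflicts — conflicts only arise when the tentative phase executes concurrently:

```python
def speculative_color(adj):
    """Speculative greedy coloring sketch.
    adj: dict mapping vertex -> set of neighbors (symmetric).
    Returns a dict mapping vertex -> color (non-negative int)."""
    color = {v: -1 for v in adj}   # -1 means "uncolored"
    worklist = list(adj)
    while worklist:
        # Phase 1: tentative assignment (the phase run concurrently per
        # vertex in the parallel algorithm; here, plain sequential code).
        for v in worklist:
            forbidden = {color[u] for u in adj[v]}
            c = 0
            while c in forbidden:
                c += 1
            color[v] = c
        # Phase 2: conflict detection; by convention the lower-id endpoint
        # keeps its color and the other is re-queued for re-coloring.
        worklist = [v for v in worklist
                    if any(color[u] == color[v] and u < v for u in adj[v])]
    return color
```

The two-phase structure is the point: deferring conflict detection lets the assignment phase run without locks, at the cost of occasionally redoing work.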
Graph Coloring Algorithms for Multi-core and Massively Multithreaded Architectures
We explore the interplay between architectures and algorithm design in the
context of shared-memory platforms and a specific graph problem of central
importance in scientific and high-performance computing, distance-1 graph
coloring. We introduce two different kinds of multithreaded heuristic
algorithms for the stated, NP-hard, problem. The first algorithm relies on
speculation and iteration, and is suitable for any shared-memory system. The
second algorithm uses dataflow principles, and is targeted at the
non-conventional, massively multithreaded Cray XMT system. We study the
performance of the algorithms on the Cray XMT and two multi-core systems, Sun
Niagara 2 and Intel Nehalem. Together, the three systems represent a spectrum
of multithreading capabilities and memory structure. As testbed, we use
synthetically generated large-scale graphs carefully chosen to cover a wide
range of input types. The results show that the algorithms have scalable
runtime performance and use nearly the same number of colors as the underlying
serial algorithm, which in turn is effective in practice. The study provides
insight into the design of high performance algorithms for irregular problems
on many-core architectures.
Comment: 25 pages, 11 figures, 4 tables
Efficient and High-quality Sparse Graph Coloring on the GPU
Graph coloring has been broadly used to discover concurrency in parallel
computing. To speedup graph coloring for large-scale datasets, parallel
algorithms have been proposed to leverage modern GPUs. Existing GPU
implementations either have limited performance or yield unsatisfactory
coloring quality (too many colors assigned). We present a work-efficient
parallel graph coloring implementation on GPUs with good coloring quality. Our
approach employs the speculative greedy scheme which inherently yields better
quality than methods based on finding a maximal independent set. In order to achieve
high performance on GPUs, we refine the algorithm to leverage efficient
operators and alleviate conflicts. We also incorporate common optimization
techniques to further improve performance. Our method is evaluated with both
synthetic and real-world sparse graphs on the NVIDIA GPU. Experimental results
show that our proposed implementation achieves an average 4.1x (up to 8.9x)
speedup over the serial implementation. It also outperforms the existing GPU
implementation from the NVIDIA CUSPARSE library (2.2x average speedup), while
yielding much better coloring quality than CUSPARSE.
Comment: arXiv admin note: text overlap with arXiv:1205.3809 by other authors
Coloring Big Graphs with AlphaGoZero
We show that recent innovations in deep reinforcement learning can
effectively color very large graphs -- a well-known NP-hard problem with clear
commercial applications. Because the Monte Carlo Tree Search with Upper
Confidence Bound algorithm used in AlphaGoZero can improve the performance of a
given heuristic, our approach allows deep neural networks trained using high
performance computing (HPC) technologies to transform computation into improved
heuristics with zero prior knowledge. Key to our approach is the introduction
of a novel deep neural network architecture (FastColorNet) that has access to
the full graph context and requires O(V) time and space to color a graph with
V vertices, which enables scaling to very large graphs that arise in real
applications like parallel computing, compilers, numerical solvers, and design
automation, among others. As a result, we are able to learn new
state-of-the-art heuristics for graph coloring.
Preparing sparse solvers for exascale computing.
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing Project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current successes and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.
Coloring Large Complex Networks
Given a large social or information network, how can we partition the
vertices into sets (i.e., colors) such that no two vertices linked by an edge
are in the same set, while minimizing the number of sets used? Despite the
obvious practical importance of graph coloring, existing works have not
systematically investigated or designed methods for large complex networks. In
this work, we develop a unified framework for coloring large complex networks
that consists of two main coloring variants that effectively balance the
tradeoff between accuracy and efficiency. Using this framework as a fundamental
basis, we propose coloring methods designed for the scale and structure of
complex networks. In particular, the methods leverage triangles,
triangle-cores, and other egonet properties and their combinations. We
systematically compare the proposed methods across a wide range of networks
(e.g., social, web, biological networks) and find a significant improvement
over previous approaches in nearly all cases. Additionally, the solutions
obtained are nearly optimal and sometimes provably optimal for certain classes
of graphs (e.g., collaboration networks). We also propose a parallel algorithm
for the problem of coloring neighborhood subgraphs and make several key
observations. Overall, the coloring methods are shown to be (i) accurate with
solutions close to optimal, (ii) fast and scalable for large networks, and
(iii) flexible for use in a variety of applications.
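As one concrete instance of the idea of ordering vertices by local structure before greedy coloring, the following sketch orders by triangle counts, one of the properties this abstract mentions. The function names and the degree tie-break are my own illustrative choices, not the paper's method:

```python
from itertools import combinations

def triangle_counts(adj):
    """Count triangles incident to each vertex (adj: vertex -> neighbor set)."""
    tri = {v: 0 for v in adj}
    for v in adj:
        for u, w in combinations(adj[v], 2):
            if w in adj[u]:          # closed wedge => triangle at v
                tri[v] += 1
    return tri

def greedy_color(adj, order):
    """Assign each vertex, in the given order, the smallest color unused by
    its already-colored neighbors."""
    color = {}
    for v in order:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def triangle_ordered_coloring(adj):
    # Color densest (most triangle-heavy) vertices first; break ties by degree.
    tri = triangle_counts(adj)
    order = sorted(adj, key=lambda v: (tri[v], len(adj[v])), reverse=True)
    return greedy_color(adj, order)
```

Processing structurally dense vertices first tends to assign the constrained colors early, which is the intuition behind such orderings on complex networks.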
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.
Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC), an extended version of the PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU".
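The frontier-centric, bulk-synchronous style this abstract describes can be illustrated with a minimal BFS: each step applies an advance-like operator that expands the current vertex frontier by one hop. This is a plain-Python sketch of the programming model only, not Gunrock's actual API:

```python
def advance(adj, frontier, visited):
    """One bulk-synchronous 'advance' step: gather the unvisited neighbors
    of every vertex in the current frontier."""
    nxt = set()
    for v in frontier:
        for u in adj[v]:
            if u not in visited:
                nxt.add(u)
    return nxt

def bfs_levels(adj, source):
    """Level-synchronous BFS expressed as repeated frontier advances."""
    visited = {source: 0}       # vertex -> BFS level
    frontier = {source}
    level = 0
    while frontier:
        level += 1
        frontier = advance(adj, frontier, visited)
        for u in frontier:
            visited[u] = level
    return visited
```

In a real GPU library the advance step is the data-parallel kernel; the surrounding loop is the bulk-synchronous iteration over frontiers.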
Relaxed Schedulers Can Efficiently Parallelize Iterative Algorithms
There has been significant progress in understanding the parallelism inherent
to iterative sequential algorithms: for many classic algorithms, the depth of
the dependence structure is now well understood, and scheduling techniques have
been developed to exploit this shallow dependence structure for efficient
parallel implementations. A related, applied research strand has studied
methods by which certain iterative task-based algorithms can be efficiently
parallelized via relaxed concurrent priority schedulers. These allow for high
concurrency when inserting and removing tasks, at the cost of executing
superfluous work due to the relaxed semantics of the scheduler.
In this work, we take a step towards unifying these two research directions,
by showing that there exists a family of relaxed priority schedulers that can
efficiently and deterministically execute classic iterative algorithms such as
greedy maximal independent set (MIS) and matching. Our primary result shows
that, given a randomized scheduler with an expected relaxation factor of k in
terms of the maximum allowed priority inversions on a task, and any graph on n
vertices, the scheduler is able to execute greedy MIS with only an additive
factor of poly(k) expected additional iterations compared to an exact (but
not scalable) scheduler. This counter-intuitive result demonstrates that the
overhead of relaxation when computing MIS is not dependent on the input size or
structure of the input graph. Experimental results show that this overhead can
be clearly offset by the gain in performance due to the highly scalable
scheduler. In sum, we present an efficient method to deterministically
parallelize iterative sequential algorithms, with provable runtime guarantees
in terms of the number of executed tasks to completion.
Comment: PODC 2018, pages 377-386 in the proceedings.
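A toy version of the setup in this abstract — greedy MIS driven by a relaxed priority queue that may pop any of the k smallest tasks — can be sketched as follows. The names and the relaxation model here are my own simplifications, not the paper's scheduler; the point is that correctness (an independent, maximal set) survives priority inversions:

```python
import heapq
import random

class RelaxedPriorityQueue:
    """Toy relaxed scheduler: pop() returns one of the k smallest entries,
    modeling bounded priority inversions."""
    def __init__(self, k, seed=0):
        self.k, self.heap = k, []
        self.rng = random.Random(seed)

    def push(self, prio, item):
        heapq.heappush(self.heap, (prio, item))

    def pop(self):
        top = [heapq.heappop(self.heap)
               for _ in range(min(self.k, len(self.heap)))]
        choice = self.rng.randrange(len(top))
        for i, entry in enumerate(top):
            if i != choice:
                heapq.heappush(self.heap, entry)   # put the rest back
        return top[choice]

def greedy_mis(adj, k=4):
    """Greedy MIS where the vertex order comes from a relaxed queue of
    random priorities; skipped (blocked) pops are the 'superfluous work'."""
    q = RelaxedPriorityQueue(k)
    rng = random.Random(1)
    for v in adj:
        q.push(rng.random(), v)
    mis, blocked = set(), set()
    while q.heap:
        _, v = q.pop()
        if v in blocked or v in mis:
            continue                # wasted iteration caused by relaxation
        mis.add(v)
        blocked |= adj[v]
    return mis
```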
Thread Parallelism for Highly Irregular Computation in Anisotropic Mesh Adaptation
Thread-level parallelism in irregular applications with mutable data
dependencies presents challenges because the underlying data is extensively
modified during execution of the algorithm and a high degree of parallelism
must be realized while keeping the code race-free. In this article we describe
a methodology for exploiting thread parallelism for a class of graph-mutating
worklist algorithms, which guarantees safe parallel execution via processing in
rounds of independent sets and using a deferred update strategy to commit
changes in the underlying data structures. Scalability is assisted by atomic
fetch-and-add operations to create worklists and work-stealing to balance the
shared-memory workload. This work is motivated by mesh adaptation algorithms,
for which we show a parallel efficiency of 60% and 50% on Intel(R) Xeon(R)
Sandy Bridge and AMD Opteron(tm) Magny-Cours systems, respectively, using these
techniques.
Comment: To appear in the proceedings of EASC 201
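The round-based strategy this abstract describes — process an independent set of work items per round, buffering writes and committing them only when the round ends — can be sketched as follows. This is an illustrative single-threaded Python sketch with my own names; the real method applies it to mesh-mutation kernels:

```python
def independent_rounds(adj, worklist, process):
    """Run a graph worklist in rounds of independent sets with deferred
    updates: within a round, every item reads the frozen pre-round state and
    returns its writes as a closure; all closures are committed only after
    the round, so concurrent execution of a round would be race-free."""
    remaining = set(worklist)
    while remaining:
        # Greedily select an independent subset of the remaining items
        # (deterministic order for reproducibility in this sketch).
        indep, excluded = [], set()
        for v in sorted(remaining):
            if v not in excluded:
                indep.append(v)
                excluded.add(v)
                excluded |= adj[v]
        # Each item produces a deferred update; conceptually parallel,
        # since no item writes shared state here.
        updates = [process(v) for v in indep]
        for commit in updates:       # sequential commit phase
            commit()
        remaining -= set(indep)
```

A usage example: each work item recomputes its value from its neighbors, and the deferred commit guarantees every item in a round sees the same snapshot.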