A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures
Irregular computations on unstructured data are an important class of
problems for parallel programming. Graph coloring is often an important
preprocessing step, e.g. as a way to perform dependency analysis for safe
parallel execution. The total run time of a coloring algorithm adds to the
overall parallel overhead of the application whereas the number of colors used
determines the amount of exposed parallelism. A fast and scalable coloring
algorithm using as few colors as possible is vital for the overall parallel
performance and scalability of many irregular applications that depend upon
runtime dependency analysis.
Catalyurek et al. have proposed a graph coloring algorithm which relies on
speculative, local assignment of colors. In this paper we present an improved
version which runs even more optimistically with less thread synchronization
and a reduced number of conflicts compared to Catalyurek et al.'s algorithm. We
show that the new technique scales better on multi-core and many-core systems
and performs up to 1.5x faster than its predecessor on graphs with high-degree
vertices, while keeping the number of colors at the same near-optimal levels.
Comment: To appear in the proceedings of Euro Par 201
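The speculate-then-repair idea described in this abstract can be illustrated with a minimal sequential sketch. It is not the authors' algorithm or Catalyurek et al.'s implementation: concurrency is only simulated by letting every vertex in a round read the same snapshot of colors, and the function name `speculative_coloring`, the tie-breaking rule (the lower-numbered endpoint keeps its color), and the toy graph are assumptions made for illustration.

```python
def speculative_coloring(adj):
    """Color an undirected graph given as {vertex: set_of_neighbors}.

    Hypothetical sketch: each round colors all pending vertices against a
    shared snapshot (simulating threads that do not see each other's
    writes), then detects conflicts and re-queues the losers.
    """
    color = {v: None for v in adj}
    worklist = sorted(adj)
    rounds = 0
    while worklist:
        rounds += 1
        # Phase 1: tentative (speculative) coloring against a snapshot.
        snapshot = dict(color)
        for v in worklist:
            forbidden = {snapshot[u] for u in adj[v] if snapshot[u] is not None}
            c = 0
            while c in forbidden:
                c += 1  # first-fit: smallest color not seen on a neighbor
            color[v] = c
        # Phase 2: conflict detection. Assumed rule: on a clash, the
        # higher-numbered endpoint gives up its color and is retried.
        conflicts = [
            v for v in worklist
            if any(color[u] == color[v] and u < v for u in adj[v])
        ]
        for v in conflicts:
            color[v] = None
        worklist = conflicts
    return color, rounds


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
colors, rounds = speculative_coloring(adj)
```

The loop terminates because the lowest-numbered pending vertex can never lose a conflict, so the worklist shrinks every round. Fewer conflicts per round means fewer repair rounds, which is the quantity the abstract says the improved version reduces.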
Fast and high quality topology-aware task mapping
Considering the large number of processors and the size of the interconnection networks on exascale-capable supercomputers, mapping concurrently executable and communicating tasks of an application is a complex problem that needs to be handled with care. For parallel applications, the communication overhead can be a significant bottleneck on scalability. Topology-aware task-mapping methods that map the tasks to the processors (i.e., cores) by exploiting the underlying network information are very effective at avoiding, or at worst bending, this limitation. We propose novel, efficient, and effective task mapping algorithms employing a graph model. The experiments show that the methods are faster than the existing approaches proposed for the same task, and on 4096 processors, the algorithms improve the communication hops and link contentions by 16% and 32%, respectively, on average. In addition, they improve the average execution time of a parallel SpMV kernel and a communication-only application by 9% and 14%, respectively.
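To make the "communication hops" metric concrete, the following sketch scores two candidate mappings of a small task graph. The abstract does not specify the network or the exact metric, so everything here is an assumption: a 2x2 mesh without wraparound, Manhattan distance as the hop count, and the hypothetical helper names `coords` and `weighted_hops`. It only evaluates mappings and does not reproduce the proposed mapping algorithms.

```python
def coords(proc, width):
    """Position of processor `proc` on an assumed 2D mesh of given width."""
    return divmod(proc, width)


def weighted_hops(comm, mapping, width):
    """Sum of message volume times Manhattan hop distance over all task pairs.

    comm:    {(task_a, task_b): volume}
    mapping: {task: processor}
    """
    total = 0
    for (a, b), volume in comm.items():
        (ra, ca), (rb, cb) = coords(mapping[a], width), coords(mapping[b], width)
        total += volume * (abs(ra - rb) + abs(ca - cb))
    return total


# Four tasks communicating in a ring with uneven volumes (made-up numbers).
comm = {(0, 1): 10, (1, 2): 5, (2, 3): 10, (0, 3): 1}

naive = {0: 0, 1: 1, 2: 2, 3: 3}   # task i on processor i
aware = {0: 0, 1: 1, 2: 3, 3: 2}   # ring laid out around the mesh

naive_cost = weighted_hops(comm, naive, width=2)
aware_cost = weighted_hops(comm, aware, width=2)
```

In this toy case the second mapping places every communicating pair on adjacent processors, so its weighted hop count is lower than that of the naive mapping.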
Cooperative Minibatching in Graph Neural Networks
Significant computational resources are required to train Graph Neural
Networks (GNNs) at a large scale, and the process is highly data-intensive. One
of the most effective ways to reduce resource requirements is minibatch
training coupled with graph sampling. GNNs have the unique property that items
in a minibatch have overlapping data. However, the commonly implemented
Independent Minibatching approach assigns each Processing Element (PE) its own
minibatch to process, leading to duplicated computations and input data access
across PEs. This amplifies the Neighborhood Explosion Phenomenon (NEP), which
is the main bottleneck limiting scaling. To reduce the effects of NEP in the
multi-PE setting, we propose a new approach called Cooperative Minibatching.
Our approach capitalizes on the fact that the size of the sampled subgraph is a
concave function of the batch size, leading to significant reductions in the
amount of work per seed vertex as batch sizes increase. Hence, it is favorable
for processors equipped with a fast interconnect to work on a large minibatch
together as a single larger processor, instead of working on separate smaller
minibatches, even though global batch size is identical. We also show how to
take advantage of the same phenomenon in serial execution by generating
dependent consecutive minibatches. Our experimental evaluations show up to 4x
bandwidth savings for fetching vertex embeddings, by simply increasing this
dependency without harming model convergence. Combining our proposed
approaches, we achieve up to 64% speedup over Independent Minibatching on
single-node multi-GPU systems.
Comment: Under submission
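The overlap effect behind this approach can be seen on a toy graph. The sketch below is only an illustration under stated assumptions: it expands full neighborhoods instead of sampling them, uses a 12-vertex ring as the graph, and the helper name `expand` is hypothetical. It compares the number of vertices touched by two processing elements working on separate batches with the number touched when the same seeds are processed as one batch.

```python
def expand(adj, seeds, hops):
    """Return every vertex reachable from `seeds` within `hops` steps."""
    visited = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {u for v in frontier for u in adj[v]} - visited
        visited |= frontier
    return visited


# Assumed toy graph: a ring of 12 vertices, each linked to its two neighbors.
n = 12
adj = {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}

batch_a, batch_b = [0, 1], [2, 3]

# Independent minibatching: each PE expands its own seeds, so vertices in
# the overlap are fetched and processed twice.
independent = len(expand(adj, batch_a, 2)) + len(expand(adj, batch_b, 2))

# Cooperative minibatching: one batch with the same global size, so each
# vertex in the combined neighborhood is counted once.
cooperative = len(expand(adj, batch_a + batch_b, 2))
```

Doubling the batch size here grows the expanded subgraph by less than a factor of two, which is the sublinear growth that the abstract attributes to the concavity of subgraph size in batch size.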
A constructive multi-way circuit partitioning algorithm based on minimum degree ordering
Ankara: The Department of Computer Engineering and Information Science and the Institute of Engineering and Science of Bilkent Univ., 1994. Thesis (Master's) -- Bilkent University, 1994. Includes bibliographical references leaves 52-54.
Circuit partitioning has many important applications in VLSI. The circuit partitioning
problem can be most properly modeled as hypergraph partitioning. In
this work, we propose a novel k-way hypergraph partitioning heuristic using
the Minimum Degree (MD) ordering which is a well-known heuristic for reducing
the amount of fills in the factorization of symmetric sparse matrices.
The proposed algorithm operates on the dual graph of the given hypergraph.
The algorithm grows node-clusters on the dual graph which induce cell-clusters
with locally minimum net-cut sizes. The quotient graph concept, widely used
in MD ordering, is exploited for the sake of efficient implementation. The
proposed algorithm outperforms well-known heuristics, such as Kernighan-Lin
(KL) based algorithms and Simulated Annealing, in terms of solution quality
on various VLSI benchmark circuits. A nice property of the proposed algorithm
is that its execution time decreases with increasing k, as opposed to the
existing iterative heuristics. It is even faster than the fast KL-based algorithms
on the partitioning of the benchmark circuits for k > 16.
Çatalyürek, Ümit V.
M.S.
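The terms used in this abstract can be grounded with a small sketch. It does not implement the Minimum Degree based heuristic. It only shows how a circuit is held as a hypergraph (nets over cells), how the cut nets of a k-way partition are counted, and one plausible reading of the "dual graph" as a graph whose nodes are nets, joined when they share a cell. That reading, the helper names `cut_nets` and `net_intersection_graph`, and the toy netlist are assumptions, not details taken from the thesis.

```python
from itertools import combinations


def cut_nets(nets, part):
    """Nets whose cells are spread over more than one part."""
    return [name for name, cells in nets.items()
            if len({part[c] for c in cells}) > 1]


def net_intersection_graph(nets):
    """Assumed dual graph: one node per net, edge if two nets share a cell."""
    graph = {name: set() for name in nets}
    for a, b in combinations(nets, 2):
        if set(nets[a]) & set(nets[b]):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Made-up netlist: four nets over six cells, and a 3-way partition of cells.
nets = {
    "n1": ["a", "b"],
    "n2": ["b", "c"],
    "n3": ["c", "d"],
    "n4": ["d", "e", "f"],
}
part = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "f": 2}

cut = cut_nets(nets, part)
dual = net_intersection_graph(nets)
```

Under this reading, a cluster of neighboring nodes in `dual` corresponds to a set of nets, and the cells those nets connect form the induced cell-cluster whose cut size the heuristic tries to keep small.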