
    Greedy algorithms for distance-2 graph coloring and bipartite graph partial coloring

    In parallel computing, a valid graph coloring enables lock-free processing of the colored tasks, data points, etc., without expensive synchronization mechanisms. However, the coloring stage itself is not free, and its overhead can be significant. In particular, for the distance-2 graph coloring (D2GC) and bipartite graph partial coloring (BGPC) problems, which have various use cases in scientific computing and numerical optimization, the coloring overhead can be on the order of minutes with a single thread for many real-life graphs with millions or billions of vertices and edges. In this thesis, we propose a novel greedy algorithm for the distance-2 graph coloring problem on shared-memory architectures. We then extend the algorithm to the bipartite graph partial coloring problem, which is structurally very similar to D2GC. The proposed algorithms yield better parallel coloring performance than existing shared-memory parallel coloring algorithms by employing greedier and more optimistic techniques. In particular, compared to the state of the art, the proposed algorithms obtain a 25x speedup with 16 cores without decreasing the coloring quality. Moreover, we extend the existing distance-2 graph coloring algorithm to manycore architectures. Due to architectural limitations, the multicore algorithm cannot easily be extended to manycore, so several optimizations and modifications are proposed to overcome these obstacles. In addition to the multicore and manycore implementations, we also offer novel optimizations for both D2GC and BGPC on social network graphs. Exploiting the structural properties of social graphs, we propose faster heuristics that increase performance without decreasing the coloring quality. Finally, we propose two costless balancing heuristics that can be applied to both BGPC and D2GC and that yield better color-based parallelization performance through improved load balancing, especially on manycore architectures.
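    For context, the algorithms above build on the classical greedy, first-fit coloring framework. The sketch below is a minimal sequential greedy distance-2 coloring in C++ (first-fit over the colors already used within two hops); it is an illustrative baseline with invented names, not the parallel algorithm proposed in the thesis.

```cpp
// Minimal sequential sketch of greedy distance-2 coloring (D2GC), assuming an
// undirected adjacency-list graph. Illustrative baseline only.
#include <cstdio>
#include <vector>

// Assign each vertex the smallest color not used by any vertex within
// distance two (its neighbors and their neighbors).
std::vector<int> greedy_d2_coloring(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> color(n, -1);
    std::vector<int> forbidden(n, -1);   // forbidden[c] == v means color c is taken near v

    for (int v = 0; v < n; ++v) {
        // Mark the colors of all distance-1 and distance-2 neighbors as forbidden.
        for (int u : adj[v]) {
            if (color[u] >= 0) forbidden[color[u]] = v;
            for (int w : adj[u])
                if (w != v && color[w] >= 0) forbidden[color[w]] = v;
        }
        // First-fit: pick the smallest color not forbidden for v.
        int c = 0;
        while (c < n && forbidden[c] == v) ++c;
        color[v] = c;
    }
    return color;
}

int main() {
    // Tiny example: the path 0-1-2-3; every pair within distance 2 must differ.
    std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1, 3}, {2}};
    for (int c : greedy_d2_coloring(adj)) std::printf("%d ", c);   // prints: 0 1 2 0
    std::printf("\n");
    return 0;
}
```

    Reusing a single `forbidden` array stamped with the current vertex index avoids clearing a per-vertex bitmap, which is the usual trick that keeps first-fit coloring linear in the number of distance-2 pairs.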

    A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures

    Irregular computations on unstructured data are an important class of problems for parallel programming. Graph coloring is often an important preprocessing step, e.g., as a way to perform dependency analysis for safe parallel execution. The total run time of a coloring algorithm adds to the overall parallel overhead of the application, whereas the number of colors used determines the amount of exposed parallelism. A fast and scalable coloring algorithm that uses as few colors as possible is therefore vital for the overall parallel performance and scalability of many irregular applications that depend on runtime dependency analysis. Catalyurek et al. have proposed a graph coloring algorithm which relies on speculative, local assignment of colors. In this paper we present an improved version which runs even more optimistically, with less thread synchronization and fewer conflicts than Catalyurek et al.'s algorithm. We show that the new technique scales better on multi-core and many-core systems and performs up to 1.5x faster than its predecessor on graphs with high-degree vertices, while keeping the number of colors at the same near-optimal levels. Comment: To appear in the proceedings of Euro Par 201
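    Speculative coloring of the kind Catalyurek et al. introduced proceeds in rounds: color all pending vertices in parallel while tolerating stale neighbor information, then detect conflicting edges and recolor one endpoint of each in the next round. The OpenMP sketch below is a simplified, distance-1 illustration of that general scheme with invented names; it is not the authors' implementation nor the improved variant described in the paper.

```cpp
// Hedged sketch of speculative (optimistic) parallel distance-1 coloring:
// tentative parallel first-fit coloring, conflict detection, and recoloring of
// the lower-indexed endpoint of each conflict. Adjacency must be symmetric.
#include <omp.h>
#include <vector>

std::vector<int> speculative_coloring(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> color(n, -1);
    std::vector<int> worklist(n);
    for (int v = 0; v < n; ++v) worklist[v] = v;

    while (!worklist.empty()) {
        const int m = static_cast<int>(worklist.size());

        // Phase 1: tentative first-fit coloring. Neighbor colors may be stale,
        // so two adjacent vertices can occasionally pick the same color.
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < m; ++i) {
            const int v = worklist[i];
            const int deg = static_cast<int>(adj[v].size());
            std::vector<char> used(deg + 1, 0);          // first-fit color <= deg
            for (int u : adj[v])
                if (color[u] >= 0 && color[u] <= deg) used[color[u]] = 1;
            int c = 0;
            while (used[c]) ++c;
            color[v] = c;
        }

        // Phase 2: detect conflicts; the lower-indexed endpoint is recolored
        // in the next round.
        std::vector<int> conflicts;
        #pragma omp parallel
        {
            std::vector<int> local;
            #pragma omp for schedule(dynamic, 64) nowait
            for (int i = 0; i < m; ++i) {
                const int v = worklist[i];
                for (int u : adj[v])
                    if (color[u] == color[v] && v < u) { local.push_back(v); break; }
            }
            #pragma omp critical
            conflicts.insert(conflicts.end(), local.begin(), local.end());
        }
        worklist.swap(conflicts);
    }
    return color;
}
```

    Because conflicts can only arise between vertices recolored in the same round and only the smaller endpoint is retried, the worklist shrinks every round and the loop terminates; the paper's contribution lies in reducing how much synchronization and conflict handling this kind of scheme needs.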

    Massive Data-Centric Parallelism in the Chiplet Era

    Traditionally, massively parallel applications are executed on distributed systems, where computing nodes are distant enough that the parallelization schemes must minimize communication and synchronization to achieve scalability. Mapping communication-intensive workloads to distributed systems requires complicated problem partitioning and dataset pre-processing. With the current AI-driven trend of having thousands of interconnected processors per chip, there is an opportunity to re-think these communication-bottlenecked workloads. This bottleneck often arises from data structure traversals, which cause irregular memory accesses and poor cache locality. Recent works have introduced task-based parallelization schemes to accelerate graph traversal and other sparse workloads: data structure traversals are split into tasks and pipelined across processing units (PUs). Dalorex demonstrated the highest scalability (up to thousands of PUs on a single chip) by keeping the entire dataset on-chip, scattered across PUs, and executing each task at the PU where its data is local. However, it also raised questions about how to scale to larger datasets when all the memory is on-chip, and at what cost. To address these challenges, we propose a scalable architecture composed of a grid of Data-Centric Reconfigurable Array (DCRA) chiplets. Package-time reconfiguration enables creating chip products that optimize for different target metrics, such as time-to-solution, energy, or cost, while software reconfigurations avoid network saturation when scaling to millions of PUs across many chip packages. We evaluate six applications and four datasets, with several configurations and memory technologies, to provide a detailed analysis of the performance, power, and cost of data-local execution at scale. Our parallelization of Breadth-First Search with RMAT-26 across a million PUs reaches 3323 GTEPS.
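    To make the data-local, task-based execution model concrete, the toy below simulates it in ordinary single-process C++: each "PU" owns a slice of the vertices and only runs tasks whose target vertex it owns, so BFS work is routed to where the data lives rather than data being pulled to the compute. This is a conceptual sketch with invented names, not the Dalorex runtime or the DCRA hardware.

```cpp
// Toy simulation of owner-computes, task-based BFS: tasks are enqueued at the
// "PU" that owns the target vertex; labels are corrected asynchronously.
#include <cstdio>
#include <deque>
#include <vector>

struct Task { int vertex; int level; };

int main() {
    // Tiny undirected graph and a block partition of its vertices over two "PUs".
    std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3}};
    const int num_pus = 2, n = static_cast<int>(adj.size());
    auto owner = [&](int v) { return v * num_pus / n; };     // which PU holds vertex v

    std::vector<std::deque<Task>> queue(num_pus);            // one task queue per PU
    std::vector<int> level(n, -1);
    queue[owner(0)].push_back({0, 0});                       // start BFS at vertex 0

    bool active = true;
    while (active) {                                         // round-robin the PUs
        active = false;
        for (int pu = 0; pu < num_pus; ++pu) {
            if (queue[pu].empty()) continue;
            active = true;
            Task t = queue[pu].front();
            queue[pu].pop_front();
            if (level[t.vertex] != -1 && level[t.vertex] <= t.level) continue;
            level[t.vertex] = t.level;                       // improve the label and
            for (int u : adj[t.vertex])                      // spawn follow-up tasks
                queue[owner(u)].push_back({u, t.level + 1}); // at their owner PUs
        }
    }
    for (int v = 0; v < n; ++v) std::printf("vertex %d: BFS level %d\n", v, level[v]);
    return 0;
}
```

    In hardware of the kind the paper targets, the per-PU queues and task hand-offs are carried by on-chip networks and local memories rather than simulated in software, which is what makes scaling this pattern to millions of PUs a network and packaging question.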