
    Greedy algorithms for distance-2 graph coloring and bipartite graph partial coloring

    In parallel computing, a valid graph coloring enables lock-free processing of the colored tasks, data points, etc., without expensive synchronization mechanisms. However, the coloring stage itself is not free, and its overhead can be significant. In particular, for the distance-2 graph coloring (D2GC) and bipartite graph partial coloring (BGPC) problems, which have various use cases in scientific computing and numerical optimization, the coloring overhead can be on the order of minutes with a single thread for many real-life graphs with millions or billions of vertices and edges. In this thesis, we propose a novel greedy algorithm for the distance-2 graph coloring problem on shared-memory architectures. We then extend the algorithm to the bipartite graph partial coloring problem, which is structurally very similar to D2GC. The proposed algorithms yield better parallel coloring performance than existing shared-memory parallel coloring algorithms by employing greedier and more optimistic techniques. In particular, compared to the state of the art, the proposed algorithms obtain a 25x speedup with 16 cores without decreasing the coloring quality. Moreover, we extend the existing distance-2 graph coloring algorithm to manycore architectures. Due to architectural limitations, the multicore algorithm cannot easily be extended to manycore, so several optimizations and modifications are proposed to overcome these obstacles. In addition to the multicore and manycore implementations, we also offer novel optimizations for both D2GC and BGPC on social network graphs. Exploiting the structural properties of social graphs, we propose faster heuristics that increase performance without decreasing the coloring quality. Finally, we propose two costless balancing heuristics that can be applied to both BGPC and D2GC and that yield better color-based parallelization performance through improved load balancing, especially on manycore architectures.
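    For context, the algorithms above build on the classical greedy, first-fit coloring framework. The sketch below is a minimal sequential greedy distance-2 coloring in C++ (first-fit over the colors already used within two hops); it is an illustrative baseline with invented names, not the parallel algorithm proposed in the thesis.

```cpp
// Minimal sequential sketch of greedy distance-2 coloring (D2GC), assuming an
// undirected adjacency-list graph. Illustrative baseline only.
#include <cstdio>
#include <vector>

// Assign each vertex the smallest color not used by any vertex within
// distance two (its neighbors and their neighbors).
std::vector<int> greedy_d2_coloring(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> color(n, -1);
    std::vector<int> forbidden(n, -1);   // forbidden[c] == v means color c is taken near v

    for (int v = 0; v < n; ++v) {
        // Mark the colors of all distance-1 and distance-2 neighbors as forbidden.
        for (int u : adj[v]) {
            if (color[u] >= 0) forbidden[color[u]] = v;
            for (int w : adj[u])
                if (w != v && color[w] >= 0) forbidden[color[w]] = v;
        }
        // First-fit: pick the smallest color not forbidden for v.
        int c = 0;
        while (c < n && forbidden[c] == v) ++c;
        color[v] = c;
    }
    return color;
}

int main() {
    // Tiny example: the path 0-1-2-3; every pair within distance 2 must differ.
    std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1, 3}, {2}};
    for (int c : greedy_d2_coloring(adj)) std::printf("%d ", c);   // prints: 0 1 2 0
    std::printf("\n");
    return 0;
}
```

    Reusing a single `forbidden` array stamped with the current vertex index avoids clearing a per-vertex bitmap, which is the usual trick that keeps first-fit coloring linear in the number of distance-2 pairs.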

    A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures

    Irregular computations on unstructured data are an important class of problems for parallel programming. Graph coloring is often an important preprocessing step, e.g., as a way to perform dependency analysis for safe parallel execution. The total run time of a coloring algorithm adds to the overall parallel overhead of the application, whereas the number of colors used determines the amount of exposed parallelism. A fast and scalable coloring algorithm that uses as few colors as possible is therefore vital for the overall parallel performance and scalability of many irregular applications that depend on runtime dependency analysis. Catalyurek et al. have proposed a graph coloring algorithm which relies on speculative, local assignment of colors. In this paper we present an improved version which runs even more optimistically, with less thread synchronization and fewer conflicts than Catalyurek et al.'s algorithm. We show that the new technique scales better on multi-core and many-core systems and performs up to 1.5x faster than its predecessor on graphs with high-degree vertices, while keeping the number of colors at the same near-optimal levels. Comment: To appear in the proceedings of Euro Par 201
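    Speculative coloring of the kind Catalyurek et al. introduced proceeds in rounds: color all pending vertices in parallel while tolerating stale neighbor information, then detect conflicting edges and recolor one endpoint of each in the next round. The OpenMP sketch below is a simplified, distance-1 illustration of that general scheme with invented names; it is not the authors' implementation nor the improved variant described in the paper.

```cpp
// Hedged sketch of speculative (optimistic) parallel distance-1 coloring:
// tentative parallel first-fit coloring, conflict detection, and recoloring of
// the lower-indexed endpoint of each conflict. Adjacency must be symmetric.
#include <omp.h>
#include <vector>

std::vector<int> speculative_coloring(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> color(n, -1);
    std::vector<int> worklist(n);
    for (int v = 0; v < n; ++v) worklist[v] = v;

    while (!worklist.empty()) {
        const int m = static_cast<int>(worklist.size());

        // Phase 1: tentative first-fit coloring. Neighbor colors may be stale,
        // so two adjacent vertices can occasionally pick the same color.
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < m; ++i) {
            const int v = worklist[i];
            const int deg = static_cast<int>(adj[v].size());
            std::vector<char> used(deg + 1, 0);          // first-fit color <= deg
            for (int u : adj[v])
                if (color[u] >= 0 && color[u] <= deg) used[color[u]] = 1;
            int c = 0;
            while (used[c]) ++c;
            color[v] = c;
        }

        // Phase 2: detect conflicts; the lower-indexed endpoint is recolored
        // in the next round.
        std::vector<int> conflicts;
        #pragma omp parallel
        {
            std::vector<int> local;
            #pragma omp for schedule(dynamic, 64) nowait
            for (int i = 0; i < m; ++i) {
                const int v = worklist[i];
                for (int u : adj[v])
                    if (color[u] == color[v] && v < u) { local.push_back(v); break; }
            }
            #pragma omp critical
            conflicts.insert(conflicts.end(), local.begin(), local.end());
        }
        worklist.swap(conflicts);
    }
    return color;
}
```

    Because conflicts can only arise between vertices recolored in the same round and only the smaller endpoint is retried, the worklist shrinks every round and the loop terminates; the paper's contribution lies in reducing how much synchronization and conflict handling this kind of scheme needs.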

    Massive Data-Centric Parallelism in the Chiplet Era

    Traditionally, massively parallel applications are executed on distributed systems, where computing nodes are distant enough that the parallelization schemes must minimize communication and synchronization to achieve scalability. Mapping communication-intensive workloads to distributed systems requires complicated problem partitioning and dataset pre-processing. With the current AI-driven trend of having thousands of interconnected processors per chip, there is an opportunity to re-think these communication-bottlenecked workloads. This bottleneck often arises from data structure traversals, which cause irregular memory accesses and poor cache locality. Recent works have introduced task-based parallelization schemes to accelerate graph traversal and other sparse workloads: data structure traversals are split into tasks and pipelined across processing units (PUs). Dalorex demonstrated the highest scalability (up to thousands of PUs on a single chip) by keeping the entire dataset on-chip, scattered across PUs, and executing each task at the PU where its data is local. However, it also raised questions about how to scale to larger datasets when all the memory is on-chip, and at what cost. To address these challenges, we propose a scalable architecture composed of a grid of Data-Centric Reconfigurable Array (DCRA) chiplets. Package-time reconfiguration enables creating chip products that optimize for different target metrics, such as time-to-solution, energy, or cost, while software reconfigurations avoid network saturation when scaling to millions of PUs across many chip packages. We evaluate six applications and four datasets, with several configurations and memory technologies, to provide a detailed analysis of the performance, power, and cost of data-local execution at scale. Our parallelization of Breadth-First Search with RMAT-26 across a million PUs reaches 3323 GTEPS.
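    To make the data-local, task-based execution model concrete, the toy below simulates it in ordinary single-process C++: each "PU" owns a slice of the vertices and only runs tasks whose target vertex it owns, so BFS work is routed to where the data lives rather than data being pulled to the compute. This is a conceptual sketch with invented names, not the Dalorex runtime or the DCRA hardware.

```cpp
// Toy simulation of owner-computes, task-based BFS: tasks are enqueued at the
// "PU" that owns the target vertex; labels are corrected asynchronously.
#include <cstdio>
#include <deque>
#include <vector>

struct Task { int vertex; int level; };

int main() {
    // Tiny undirected graph and a block partition of its vertices over two "PUs".
    std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3}};
    const int num_pus = 2, n = static_cast<int>(adj.size());
    auto owner = [&](int v) { return v * num_pus / n; };     // which PU holds vertex v

    std::vector<std::deque<Task>> queue(num_pus);            // one task queue per PU
    std::vector<int> level(n, -1);
    queue[owner(0)].push_back({0, 0});                       // start BFS at vertex 0

    bool active = true;
    while (active) {                                         // round-robin the PUs
        active = false;
        for (int pu = 0; pu < num_pus; ++pu) {
            if (queue[pu].empty()) continue;
            active = true;
            Task t = queue[pu].front();
            queue[pu].pop_front();
            if (level[t.vertex] != -1 && level[t.vertex] <= t.level) continue;
            level[t.vertex] = t.level;                       // improve the label and
            for (int u : adj[t.vertex])                      // spawn follow-up tasks
                queue[owner(u)].push_back({u, t.level + 1}); // at their owner PUs
        }
    }
    for (int v = 0; v < n; ++v) std::printf("vertex %d: BFS level %d\n", v, level[v]);
    return 0;
}
```

    In hardware of the kind the paper targets, the per-PU queues and task hand-offs are carried by on-chip networks and local memories rather than simulated in software, which is what makes scaling this pattern to millions of PUs a network and packaging question.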