468 research outputs found

    Towards Terabit Carrier Ethernet and Energy Efficient Optical Transport Networks

    Get PDF

    Engineering of Algorithms for Very Large k Partitioning

    Get PDF

    Parallel and Flow-Based High Quality Hypergraph Partitioning

    Get PDF
    Balanced hypergraph partitioning is a classic NP-hard optimization problem that is a fundamental tool in such diverse disciplines as VLSI circuit design, route planning, sharding distributed databases, optimizing communication volume in parallel computing, and accelerating the simulation of quantum circuits. Given a hypergraph and an integer kk, the task is to divide the vertices into kk disjoint blocks with bounded size, while minimizing an objective function on the hyperedges that span multiple blocks. In this dissertation we consider the most commonly used objective, the connectivity metric, where we aim to minimize the number of different blocks connected by each hyperedge. The most successful heuristic for balanced partitioning is the multilevel approach, which consists of three phases. In the coarsening phase, vertex clusters are contracted to obtain a sequence of structurally similar but successively smaller hypergraphs. Once sufficiently small, an initial partition is computed. Lastly, the contractions are successively undone in reverse order, and an iterative improvement algorithm is employed to refine the projected partition on each level. An important aspect in designing practical heuristics for optimization problems is the trade-off between solution quality and running time. The appropriate trade-off depends on the specific application, the size of the data sets, and the computational resources available to solve the problem. Existing algorithms are either slow, sequential and offer high solution quality, or are simple, fast, easy to parallelize, and offer low quality. While this trade-off cannot be avoided entirely, our goal is to close the gaps as much as possible. We achieve this by improving the state of the art in all non-trivial areas of the trade-off landscape with only a few techniques, but employed in two different ways. Furthermore, most research on parallelization has focused on distributed memory, which neglects the greater flexibility of shared-memory algorithms and the wide availability of commodity multi-core machines. In this thesis, we therefore design and revisit fundamental techniques for each phase of the multilevel approach, and develop highly efficient shared-memory parallel implementations thereof. We consider two iterative improvement algorithms, one based on the Fiduccia-Mattheyses (FM) heuristic, and one based on label propagation. For these, we propose a variety of techniques to improve the accuracy of gains when moving vertices in parallel, as well as low-level algorithmic improvements. For coarsening, we present a parallel variant of greedy agglomerative clustering with a novel method to resolve cluster join conflicts on-the-fly. Combined with a preprocessing phase for coarsening based on community detection, a portfolio of from-scratch partitioning algorithms, as well as recursive partitioning with work-stealing, we obtain our first parallel multilevel framework. It is the fastest partitioner known, and achieves medium-high quality, beating all parallel partitioners, and is close to the highest quality sequential partitioner. Our second contribution is a parallelization of an n-level approach, where only one vertex is contracted and uncontracted on each level. This extreme approach aims at high solution quality via very fine-grained, localized refinement, but seems inherently sequential. We devise an asynchronous n-level coarsening scheme based on a hierarchical decomposition of the contractions, as well as a batch-synchronous uncoarsening, and later fully asynchronous uncoarsening. In addition, we adapt our refinement algorithms, and also use the preprocessing and portfolio. This scheme is highly scalable, and achieves the same quality as the highest quality sequential partitioner (which is based on the same components), but is of course slower than our first framework due to fine-grained uncoarsening. The last ingredient for high quality is an iterative improvement algorithm based on maximum flows. In the sequential setting, we first improve an existing idea by solving incremental maximum flow problems, which leads to smaller cuts and is faster due to engineering efforts. Subsequently, we parallelize the maximum flow algorithm and schedule refinements in parallel. Beyond the strive for highest quality, we present a deterministically parallel partitioning framework. We develop deterministic versions of the preprocessing, coarsening, and label propagation refinement. Experimentally, we demonstrate that the penalties for determinism in terms of partition quality and running time are very small. All of our claims are validated through extensive experiments, comparing our algorithms with state-of-the-art solvers on large and diverse benchmark sets. To foster further research, we make our contributions available in our open-source framework Mt-KaHyPar. While it seems inevitable, that with ever increasing problem sizes, we must transition to distributed memory algorithms, the study of shared-memory techniques is not in vain. With the multilevel approach, even the inherently slow techniques have a role to play in fast systems, as they can be employed to boost quality on coarse levels at little expense. Similarly, techniques for shared-memory parallelism are important, both as soon as a coarse graph fits into memory, and as local building blocks in the distributed algorithm

    Parallelization of Reordering Algorithms for Bandwidth and Wavefront Reduction

    Full text link
    Abstract—Many sparse matrix computations can be speeded up if the matrix is first reordered. Reordering was originally developed for direct methods but it has recently become popular for improving the cache locality of parallel iterative solvers since reordering the matrix to reduce bandwidth and wavefront can improve the locality of reference of sparse matrix-vector multiplication (SpMV), the key kernel in iterative solvers. In this paper, we present the first parallel implementations of two widely used reordering algorithms: Reverse Cuthill-McKee (RCM) and Sloan. On 16 cores of the Stampede supercomputer, our parallel RCM is 5.56 times faster on the average than a state-of-the-art sequential implementation of RCM in the HSL library. Sloan is significantly more constrained than RCM, but our parallel implementation achieves a speedup of 2.88X on the average over sequential HSL-Sloan. Reordering the matrix using our parallel RCM and then performing 100 SpMV iterations is twice as fast as using HSL-RCM and then performing the SpMV iterations; it is also 1.5 times faster than performing the SpMV iterations without reordering the matrix. I

    Algorithm Engineering for fundamental Sorting and Graph Problems

    Get PDF
    Fundamental Algorithms build a basis knowledge for every computer science undergraduate or a professional programmer. It is a set of basic techniques one can find in any (good) coursebook on algorithms and data structures. In this thesis we try to close the gap between theoretically worst-case optimal classical algorithms and the real-world circumstances one face under the assumptions imposed by the data size, limited main memory or available parallelism

    Efficient Transactional-Memory-based Implementation of Morph Algorithms on GPU

    Get PDF
    General Purpose GPUs (GPGPUs) are ideal platforms for parallel execution of applications with regular shared memory access patterns. However, majority of real world multithreaded applications require access to shared memory with irregular patterns. The morph algorithms, which arise in many real world applications, change their graph data structures in unpredictable ways, thus, leading to irregular access patterns to shared data. Such irregularity makes morph algorithms more challenging to be implemented on GPUs which favor regularity. The Borouvka’s algorithm for calculating Minimum Spanning Forest (MSF), and multilevel graph partitioning are two examples of morph algorithms with varied levels of expressed parallelism. In this work we show that a transactional-memory-based design and implementation of the morph algorithms on GPUs can handle some of the challenges arising due to irregularities such as complexity of code and overhead of synchronization. First, we identify the major phases of the algorithm which requires synchronization of the shared data. If the algorithm exhibits certain algebraic properties (e.g., monotonicity, idempotency, associativity), we can use lock-free synchronizations for performance; otherwise we utilize a Software Transactional Memory (STM) based synchronization method. Experimental results show that our GPU-based implementation of Borouvka’s algorithm outperforms both the fastest sequential implementation and the existing STM-based implementation on multicore CPUs when tested on large-scale graphs with diverse densities. Moreover, to show the applicability of our approach to other morph algorithms, we do a pen-and-paper implementation and complexity analysis of multilevel graph partitioning

    Asynchronous n-Level Hypergraph Partitioning

    Get PDF
    • …
    corecore