403,317 research outputs found

    Engineering Massively Parallel MST Algorithms

    Full text link
    We develop and extensively evaluate highly scalable distributed-memory algorithms for computing minimum spanning trees (MSTs). At the heart of our solutions is a scalable variant of Boruvka's algorithm. For partitioned graphs with many local edges, we improve this with an effective form of contracting local parts of the graph during a preprocessing step. We also adapt the filtering concept of the best practical sequential algorithm to develop a massively parallel Filter-Boruvka algorithm that is very useful for graphs with poor locality and high average degree. Our experiments indicate that our algorithms scale well up to at least 65 536 cores and are up to 800 times faster than previous distributed MST algorithms.Comment: 12 pages, 6 figure

    Design and analysis of numerical algorithms for the solution of linear systems on parallel and distributed architectures

    Get PDF
    The increasing availability of parallel computers is having a very significant impact on all aspects of scientific computation, including algorithm research and software development in numerical linear algebra. In particular, the solution of linear systems, which lies at the heart of most calculations in scientific computing is an important computation found in many engineering and scientific applications. In this thesis, well-known parallel algorithms for the solution of linear systems are compared with implicit parallel algorithms or the Quadrant Interlocking (QI) class of algorithms to solve linear systems. These implicit algorithms are (2x2) block algorithms expressed in explicit point form notation. [Continues.

    Engineering Fast Parallel Matching Algorithms

    Get PDF
    The computation of matchings has applications in the solving process of a large variety of problems, e.g. as part of graph partitioners. We present and analyze three sequential and two parallel approximation algorithms for the cardinality and weighted matching problem. One of the sequential algorithms is based on an algorithm by Karp and Sipser to compute maximal matchings [21]. Another one is based on the idea of locally heaviest edges by Preis [30]. The third sequential algorithm is a new algorithm based on the computation of maximum weighted matchings of trees spanning the input graph. We show for two of these algorithms that the runtime for slight variations of them is expected to be linear. However the experimental results suggest that this is also the case for the unmodified versions. The comparison with other approximate matching algorithms show that the computed matchings have a similar quality or even the same quality. On the other hand two of our the algorithms are much faster. For two of the sequential algorithms we show how to turn them into parallel matching algorithms. We show that for a simple non optimal partitioning of the input graphs speedups can be observed using up to 1024 processors. For certain kinds of input graphs we see a good scaling behaviour

    Aspects of practical implementations of PRAM algorithms

    Get PDF
    The PRAM is a shared memory model of parallel computation which abstracts away from inessential engineering details. It provides a very simple architecture independent model and provides a good programming environment. Theoreticians of the computer science community have proved that it is possible to emulate the theoretical PRAM model using current technology. Solutions have been found for effectively interconnecting processing elements, for routing data on these networks and for distributing the data among memory modules without hotspots. This thesis reviews this emulation and the possibilities it provides for large scale general purpose parallel computation. The emulation employs a bridging model which acts as an interface between the actual hardware and the PRAM model. We review the evidence that such a scheme can achieve scalable parallel performance and portable parallel software and that PRAM algorithms can be optimally implemented on such practical models. In the course of this review we presented the following new results: 1. Concerning parallel approximation algorithms, we describe an NC algorithm for findings an approximation to a minimum weight perfect matching in a complete weighted graph. The algorithm is conceptually very simple and it is also the first NC-approximation algorithm for the task with a sub-linear performance ratio. 2. Concerning graph embedding, we describe dense edge-disjoint embeddings of the complete binary tree with n leaves in the following n-node communication networks: the hypercube, the dc Bruijn and shuffle-exchange networks and the 2-dimcnsional mesh. In the embeddings the maximum distance from a leaf to the root of the tree is asymptotically optimally short. The embeddings facilitate efficient implementation of many PRAM algorithms on networks employing these graphs as interconnection networks. 3. Concerning bulk synchronous algorithmic, we describe scalable transportable algorithms for the following three commonly required types of computation; balanced tree computations. Fast Fourier Transforms and matrix multiplications

    Fast matrix multiplication techniques based on the Adleman-Lipton model

    Full text link
    On distributed memory electronic computers, the implementation and association of fast parallel matrix multiplication algorithms has yielded astounding results and insights. In this discourse, we use the tools of molecular biology to demonstrate the theoretical encoding of Strassen's fast matrix multiplication algorithm with DNA based on an nn-moduli set in the residue number system, thereby demonstrating the viability of computational mathematics with DNA. As a result, a general scalable implementation of this model in the DNA computing paradigm is presented and can be generalized to the application of \emph{all} fast matrix multiplication algorithms on a DNA computer. We also discuss the practical capabilities and issues of this scalable implementation. Fast methods of matrix computations with DNA are important because they also allow for the efficient implementation of other algorithms (i.e. inversion, computing determinants, and graph theory) with DNA.Comment: To appear in the International Journal of Computer Engineering Research. Minor changes made to make the preprint as similar as possible to the published versio

    On the acceleration of wavefront applications using distributed many-core architectures

    Get PDF
    In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications—a ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed those of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures
    corecore