Engineering Massively Parallel MST Algorithms
We develop and extensively evaluate highly scalable distributed-memory
algorithms for computing minimum spanning trees (MSTs). At the heart of our
solutions is a scalable variant of Boruvka's algorithm. For partitioned graphs
with many local edges, we improve this with an effective form of contracting
local parts of the graph during a preprocessing step. We also adapt the
filtering concept of the best practical sequential algorithm to develop a
massively parallel Filter-Boruvka algorithm that is very useful for graphs with
poor locality and high average degree. Our experiments indicate that our
algorithms scale well up to at least 65 536 cores and are up to 800 times
faster than previous distributed MST algorithms.
Comment: 12 pages, 6 figures
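The core loop the abstract refers to can be illustrated with a minimal sequential sketch of Boruvka's algorithm using a union-find structure; the paper's contribution is a distributed-memory variant with graph contraction and filtering, which this single-threaded version only hints at.

```python
# Sequential sketch of Boruvka's MST algorithm: each round, every
# component picks its cheapest outgoing edge, and all picked edges
# are contracted at once. The per-component selections are exactly
# the independent work a distributed variant parallelizes.

def boruvka_mst(n, edges):
    """n vertices (0..n-1); edges as (weight, u, v). Returns MST edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst = []
    num_components = n
    while num_components > 1:
        # cheapest outgoing edge for each current component
        cheapest = {}
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][0]:
                    cheapest[r] = (w, u, v)
        if not cheapest:
            break  # graph is disconnected
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:           # re-check to avoid merging twice
                parent[ru] = rv
                mst.append((w, u, v))
                num_components -= 1
    return mst
```

Each `while` iteration at least halves the number of components, so there are O(log n) rounds, which is what makes the scheme attractive for massive parallelism.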
Design and analysis of numerical algorithms for the solution of linear systems on parallel and distributed architectures
The increasing availability of parallel computers is having a very significant impact on
all aspects of scientific computation, including algorithm research and software
development in numerical linear algebra. In particular, the solution of linear systems,
which lies at the heart of most calculations in scientific computing, is an important
computation found in many engineering and scientific applications.
In this thesis, well-known parallel algorithms for the solution of linear systems are
compared with implicit parallel algorithms of the Quadrant Interlocking (QI) class.
These implicit algorithms are (2x2) block algorithms expressed in explicit point form
notation. [Continues.]
Engineering Fast Parallel Matching Algorithms
The computation of matchings has applications in the solving process of a large variety of problems, e.g. as part of graph partitioners. We present and analyze three sequential and two parallel approximation algorithms for the cardinality and weighted matching problem. One of the sequential algorithms is based on an algorithm by Karp and Sipser to compute maximal matchings [21]. Another one is based on the idea of locally heaviest edges by Preis [30]. The third sequential algorithm is a new algorithm based on the computation of maximum weighted matchings of trees spanning the input graph. We show for two of these algorithms that the runtime of slight variations of them is expected to be linear. The experimental results suggest that this is also the case for the unmodified versions. The comparison with other approximate matching algorithms shows that the computed matchings have similar or even the same quality. On the other hand, two of our algorithms are much faster. For two of the sequential algorithms we show how to turn them into parallel matching algorithms. We show that, for a simple non-optimal partitioning of the input graphs, speedups can be observed using up to 1024 processors. For certain kinds of input graphs we see good scaling behaviour.
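As a baseline for the approximation algorithms the abstract mentions, here is the classic sorted greedy 1/2-approximation for maximum weight matching. This is not the paper's Karp-Sipser, Preis, or spanning-tree algorithm; it is the simpler standard greedy that those algorithms match in quality while avoiding the O(m log m) sort.

```python
# Greedy 1/2-approximation for maximum weight matching: scan edges
# in decreasing weight order, keeping each edge whose endpoints are
# both still free. The result is guaranteed to weigh at least half
# of an optimal matching.

def greedy_matching(edges):
    """edges: list of (weight, u, v). Returns (matching, total_weight)."""
    matched = set()
    matching = []
    for w, u, v in sorted(edges, reverse=True):
        if u not in matched and v not in matched:
            matched.update((u, v))
            matching.append((w, u, v))
    return matching, sum(w for w, _, _ in matching)
```

Preis's locally-heaviest-edge idea removes the global sort (and with it the sequential bottleneck) by only comparing each edge against its incident neighbours, which is what makes a parallel version possible.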
Aspects of practical implementations of PRAM algorithms
The PRAM is a shared memory model of parallel computation which abstracts away from inessential engineering details. It provides a very simple architecture-independent model and a good programming environment. Theoreticians of the computer science community have proved that it is possible to emulate the theoretical PRAM model using current technology. Solutions have been found for effectively interconnecting processing elements, for routing data on these networks and for distributing the data among memory modules without hotspots. This thesis reviews this emulation and the possibilities it provides for large scale general purpose parallel computation. The emulation employs a bridging model which acts as an interface between the actual hardware and the PRAM model. We review the evidence that such a scheme can achieve scalable parallel performance and portable parallel software and that PRAM algorithms can be optimally implemented on such practical models. In the course of this review we present the following new results:
1. Concerning parallel approximation algorithms, we describe an NC algorithm for finding an approximation to a minimum weight perfect matching in a complete weighted graph. The algorithm is conceptually very simple and it is also the first NC-approximation algorithm for the task with a sub-linear performance ratio.
2. Concerning graph embedding, we describe dense edge-disjoint embeddings of the complete binary tree with n leaves in the following n-node communication networks: the hypercube, the de Bruijn and shuffle-exchange networks and the 2-dimensional mesh. In the embeddings the maximum distance from a leaf to the root of the tree is asymptotically optimally short. The embeddings facilitate efficient implementation of many PRAM algorithms on networks employing these graphs as interconnection networks.
3. Concerning bulk synchronous algorithms, we describe scalable transportable algorithms for the following three commonly required types of computation: balanced tree computations, Fast Fourier Transforms and matrix multiplications.
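The balanced tree computations mentioned in point 3 can be illustrated by the textbook up-sweep/down-sweep exclusive prefix sum, a standard PRAM-style pattern (not taken from this thesis). Every operation on one tree level is independent and would run in a single PRAM step; the sketch simulates those rounds sequentially and assumes the input length is a power of two.

```python
# PRAM-style balanced-tree computation: exclusive prefix sum via an
# up-sweep (reduce) followed by a down-sweep. Each inner for-loop is
# a "parallel for": its iterations touch disjoint cells and would
# execute as one synchronous step on a PRAM.

def pram_prefix_sum(a):
    n = len(a)                     # assumed to be a power of two
    x = list(a)
    d = 1
    while d < n:                   # up-sweep: build partial sums
        for i in range(2 * d - 1, n, 2 * d):   # parallel for
            x[i] += x[i - d]
        d *= 2
    x[n - 1] = 0
    d = n // 2
    while d >= 1:                  # down-sweep: distribute prefixes
        for i in range(2 * d - 1, n, 2 * d):   # parallel for
            t = x[i - d]
            x[i - d] = x[i]
            x[i] += t
        d //= 2
    return x                       # exclusive prefix sums of a
```

Both sweeps take O(log n) rounds of independent operations, which is the property that lets the pattern transfer to bulk synchronous bridging models.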
A parallel genetic algorithm for the Steiner Problem in Networks
This paper presents a parallel genetic algorithm for the
Steiner Problem in Networks (SPN). Several previous papers
have proposed the adoption of GAs and other
metaheuristics to solve the SPN, demonstrating the
validity of their approaches. This work differs from them
in two main respects: the dimension and the
characteristics of the networks adopted in the experiments,
and the aim from which it originated. Namely, this work
was undertaken to build a comparison
term for validating deterministic and computationally
inexpensive algorithms which can be used in practical
engineering applications, such as multicast
transmission in the Internet. At the same time, the large
dimensions of our sample networks require the adoption
of a parallel implementation of the Steiner GA, which is
able to deal with such large problem instances.
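A GA for the SPN typically encodes a candidate as a bit vector over the non-terminal ("Steiner") nodes and scores it by the weight of a tree spanning the terminals plus the selected nodes. The skeleton below shows the generic evolutionary loop only; the fitness function is deliberately left as a parameter, and all names and parameter values are illustrative, not taken from the paper.

```python
# Generic GA skeleton (minimization): truncation selection, one-point
# crossover, per-bit mutation, with implicit elitism because the top
# half of each generation survives unchanged. For the SPN, `fitness`
# would evaluate the Steiner tree induced by the selected nodes.
import random

def evolve(fitness, genome_len, pop_size=20, generations=50,
           p_mut=0.05, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)                    # best (lowest) first
        survivors = pop[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, genome_len)
            child = a[:cut] + b[cut:]            # one-point crossover
            child = [g ^ (rng.random() < p_mut)  # bit-flip mutation
                     for g in child]
            children.append(child)
        pop = survivors + children
    return min(pop, key=fitness)
```

A parallel version, as in the paper, would distribute the fitness evaluations (or whole sub-populations) across processors, since those are independent within a generation.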
Fast matrix multiplication techniques based on the Adleman-Lipton model
On distributed memory electronic computers, the implementation and
association of fast parallel matrix multiplication algorithms have yielded
astounding results and insights. In this discourse, we use the tools of
molecular biology to demonstrate the theoretical encoding of Strassen's fast
matrix multiplication algorithm with DNA based on an -moduli set in the
residue number system, thereby demonstrating the viability of computational
mathematics with DNA. As a result, a general scalable implementation of this
model in the DNA computing paradigm is presented and can be generalized to the
application of \emph{all} fast matrix multiplication algorithms on a DNA
computer. We also discuss the practical capabilities and issues of this
scalable implementation. Fast methods of matrix computations with DNA are
important because they also allow for the efficient implementation of other
algorithms (i.e. inversion, computing determinants, and graph theory) with DNA.
Comment: To appear in the International Journal of Computer Engineering
Research. Minor changes made to make the preprint as similar as possible to
the published version.
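The seven-product recursion of Strassen's algorithm that the paper encodes with DNA can be sketched conventionally for 2^k x 2^k matrices represented as nested lists; this is the standard algorithm itself, not the paper's residue-number-system encoding.

```python
# Strassen's fast matrix multiplication: 7 recursive products instead
# of 8, giving O(n^log2(7)) ~ O(n^2.807) arithmetic operations.

def strassen(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M, r, c):
        return [row[c:c + h] for row in M[r:r + h]]
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def sub(X, Y):
        return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A,0,0), quad(A,0,h), quad(A,h,0), quad(A,h,h)
    B11, B12, B21, B22 = quad(B,0,0), quad(B,0,h), quad(B,h,0), quad(B,h,h)
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

Because the seven sub-products are independent, they parallelize naturally, whether across distributed-memory processors or, as the paper argues, across DNA strands.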
On the acceleration of wavefront applications using distributed many-core architectures
In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications—a ubiquitous class of parallel algorithms used in the solution of a number of scientific and engineering problems. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed that of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithms on GPU-based architectures.
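The dependency pattern behind pipelined wavefront codes such as the LU benchmark can be shown in miniature: each cell of a 2D grid depends on its north and west neighbours, so all cells on one anti-diagonal are independent and can be computed in the same parallel step. The helper below is a generic illustration of that sweep order, not code from the paper.

```python
# Wavefront sweep over a 2D grid: iterate anti-diagonals d = i + j in
# order; every cell on a given diagonal is independent of the others,
# so the inner loop is the parallelizable unit each sweep step.

def wavefront_sweep(nrows, ncols, combine, init=1):
    """Fill grid[i][j] = combine(north, west) in wavefront order."""
    grid = [[0] * ncols for _ in range(nrows)]
    for d in range(nrows + ncols - 1):           # sweep step d
        lo = max(0, d - ncols + 1)
        hi = min(nrows, d + 1)
        for i in range(lo, hi):                  # parallel within a step
            j = d - i
            north = grid[i - 1][j] if i > 0 else init
            west = grid[i][j - 1] if j > 0 else init
            grid[i][j] = combine(north, west)
    return grid
```

The parallelism per step grows then shrinks with the diagonal length, which is why fill/drain overheads dominate small problems and why blocking strategies (such as the k-blocking the paper proposes) matter on GPUs.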