2,484 research outputs found
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some application paradigms. Software languages and systems such as NVIDIA's
CUDA and Khronos consortium's open compute language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very many core GPUs for data parallelism is necessary. We describe our use of hybrid applications
using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the applications
level programmer and offer some suggested areas for language development and integration
between coarse-grained and fine-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas including: partial differential equations; graph cluster
metric calculations and random number generation. We report on programming experiences
and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs;
a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scientific applications developers.
Vectorization and Parallelization of the Adaptive Mesh Refinement N-body Code
In this paper, we describe our vectorized and parallelized adaptive mesh
refinement (AMR) N-body code with shared time steps, and report its performance
on a Fujitsu VPP5000 vector-parallel supercomputer. Our AMR N-body code puts
hierarchical meshes recursively where higher resolution is required, and the
time steps of all particles are the same. The parts which are the most difficult
to vectorize are loops that access the mesh data and particle data. We
vectorized such parts by changing the loop structure, so that the innermost
loop steps through the cells instead of the particles in each cell, in other
words, by changing the loop order from the depth-first order to the
breadth-first order. Mass assignment is also vectorizable using this loop order
exchange and splitting the loop into 2^N_dim loops, if the cloud-in-cell
scheme is adopted. Here, N_dim is the number of dimensions. These
vectorization schemes which eliminate the unvectorized loops are applicable to
parallelization of loops for shared-memory multiprocessors. We also
parallelized our code for distributed memory machines. The important part of
parallelization is data decomposition. We sorted the hierarchical mesh data by
the Morton order, or the recursive N-shaped order, level by level and split and
allocated the mesh data to the processors. Particles are allocated to the
processor to which the finest refined cells including the particles are also
assigned. Our timing analysis using the Λ-dominated cold dark matter
simulations shows that our parallel code speeds up almost ideally up to 32
processors, the largest number of processors in our test.
Comment: 21 pages, 16 figures, to be published in PASJ (Vol. 57, No. 5, Oct. 2005)
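The Morton-order decomposition described in this abstract can be illustrated with a minimal sketch. It assumes a single 2D mesh level with integer cell coordinates; the function names `morton_key` and `split_by_morton` are hypothetical and are not taken from the paper, which sorts a hierarchical mesh level by level.

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of x and y into one Morton key.

    Bit i of x goes to position 2*i, bit i of y to position 2*i + 1.
    Swapping the roles of x and y turns the Z-shaped curve into an N-shaped one.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def split_by_morton(cells, n_procs):
    """Sort cells by Morton key and cut the sorted list into contiguous chunks.

    Assumption: equal-sized chunks; a real code would balance by work load.
    """
    ordered = sorted(cells, key=lambda c: morton_key(c[0], c[1]))
    chunk = -(-len(ordered) // n_procs)  # ceiling division
    return [ordered[p * chunk:(p + 1) * chunk] for p in range(n_procs)]


# A 4x4 mesh split over 4 processors: each processor receives one 2x2 block.
cells = [(x, y) for y in range(4) for x in range(4)]
parts = split_by_morton(cells, 4)
```

Because cells that are close along the Morton curve are also close in space, each contiguous chunk tends to form a compact region, which is the usual reason for choosing this ordering for data decomposition.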
A taxonomy of parallel sorting
TR 84-601
In this paper, we propose a taxonomy of parallel sorting that includes a broad range of array
and file sorting algorithms. We analyze the evolution of research on parallel sorting, from the
earliest sorting networks to the shared memory algorithms and the VLSI sorters. In the context
of sorting networks, we describe two fundamental parallel merging schemes - the odd-even and
the bitonic merge. Sorting algorithms have been derived from these merging algorithms for parallel
computers where processors communicate through interconnection networks such as the perfect
shuffle, the mesh and a number of other sparse networks. After describing the network sorting
algorithms, we show that, with a shared memory model of parallel computation, faster algorithms
have been derived from parallel enumeration sorting schemes, where keys are first ranked and
then rearranged according to their rank.
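The enumeration scheme mentioned at the end of this abstract (rank the keys first, then rearrange by rank) can be sketched as follows. This is a sequential illustration only, with a hypothetical function name; in a shared memory setting each rank can be computed independently, which is where the parallelism would come from.

```python
def enumeration_sort(keys):
    """Sort by ranking: the rank of a key is the number of keys that precede it.

    Ties are broken by original index so that every key gets a distinct rank.
    Each rank depends only on the input, so all ranks could be computed in parallel.
    """
    n = len(keys)
    ranks = [
        sum(
            1
            for j in range(n)
            if keys[j] < keys[i] or (keys[j] == keys[i] and j < i)
        )
        for i in range(n)
    ]
    out = [None] * n
    for i, r in enumerate(ranks):
        out[r] = keys[i]  # place each key directly at its final position
    return out


result = enumeration_sort([3, 1, 2, 3, 0])
```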
Supporting shared data structures on distributed memory architectures
Programming nonshared memory systems is more difficult than programming shared memory systems, since there is no support for shared data structures. Current programming languages for distributed memory architectures force the user to decompose all data structures into separate pieces, with each piece owned by one of the processors in the machine, and with all communication explicitly specified by low-level message-passing primitives. A new programming environment is presented for distributed memory architectures, providing a global name space and allowing direct access to remote parts of data values. The analysis and program transformations required to implement this environment are described, and the efficiency of the resulting code on the NCUBE/7 and IPSC/2 hypercubes is reported.
k-k Routing, k-k Sorting, and Cut Through Routing on the Mesh
In this paper we present randomized algorithms for k-k routing, k-k sorting, and cut through routing. The stated resource bounds hold with high probability. The algorithm for k-k routing runs in [k/2]n+o(kn) steps. We also show that k-k sorting can be accomplished within [k/2]n+n+o(kn) steps, and cut through routing can be done in [3/4]kn+[3/2]n+o(kn) steps. The best known time bounds (prior to this paper) for all three of these problems were kn+o(kn). [kn/2] is a known lower bound for all three problems (the bisection bound), and hence our algorithms are very nearly optimal. All of the above algorithms have optimal queue length, namely k+o(k). These algorithms also extend to higher dimensional meshes.
Simulating the Bitonic Sort on a 2D-mesh with P Systems
This paper gives a version of the parallel bitonic sorting algorithm of
Batcher, which can sort N elements in time O(log^2 N). When applying it to the 2D
mesh architecture, two indexing functions are considered, row-major and shuffled row-
major. Some properties are proved for the latter, together with a correctness proof of
the proposed algorithm. Two simulations with P systems are proposed and discussed.
The first one uses dynamic communication graphs and follows the guidelines of the mesh
version of the algorithm. The second simulation requires only symbol rewriting rules in
one membrane
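For readers unfamiliar with Batcher's algorithm, the compare-exchange pattern it relies on can be sketched as below. This is a plain sequential rendering under the assumption that the input length is a power of two; it does not model the mesh indexing functions or the P system simulations, and the function name is hypothetical.

```python
def bitonic_sort(values):
    """Batcher's bitonic sort as a sequence of compare-exchange stages.

    All comparisons within one (k, j) stage touch disjoint pairs, so a parallel
    machine can run each stage in one step, giving O(log^2 N) stages in total.
    """
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


result = bitonic_sort([7, 3, 6, 1, 8, 2, 5, 4])
```

For N = 8 the outer loops run through 6 stages (1 + 2 + 3), which matches the log N (log N + 1) / 2 stage count behind the O(log^2 N) bound.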
Parallel algorithms for MIMD parallel computers
This thesis mainly covers the design and analysis of asynchronous
parallel algorithms that can be run on MIMD (Multiple Instruction
Multiple Data) parallel computers, in particular the NEPTUNE system at
Loughborough University. Initially the fundamentals of parallel computer
architectures are introduced with different parallel architectures being
described and compared. The principles of parallel programming and the
design of parallel algorithms are also outlined. The main
characteristics of the 4-processor MIMD NEPTUNE system are also presented,
and performance indicators, i.e. the speed-up and the efficiency factors
are defined for the measurement of parallelism in a given system.
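The abstract does not spell out how the two performance indicators are defined. A minimal sketch, assuming the conventional definitions (speed-up is serial time divided by parallel time, efficiency is speed-up divided by the number of processors); the function name and the timings used are hypothetical, not measurements from the thesis.

```python
def speedup_and_efficiency(t_serial, t_parallel, n_procs):
    """Return (speed-up, efficiency) under the conventional definitions."""
    speedup = t_serial / t_parallel
    efficiency = speedup / n_procs
    return speedup, efficiency


# Hypothetical timings for a 4-processor run.
s, e = speedup_and_efficiency(t_serial=100.0, t_parallel=30.0, n_procs=4)
```

An efficiency close to 1 means the processors are almost fully used; values well below 1 point to synchronisation or communication overhead.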
Both numerical and non-numerical algorithms are covered in the
thesis. In the numerical solution of partial differential equations,
a new parallel 9-point block iterative method is developed. Here, the
organization of the blocks is done in such a way that each process
contains its own group of 9 points on the network, so that the processes can
be run in parallel. The parallel implementations of both the 9-point and
4-point block iterative methods were programmed using natural and red-black
ordering with synchronous and asynchronous approaches. The
results obtained for these different implementations were compared and
analysed.
Next the parallel version of the A.G.E. (Alternating Group Explicit)
method is developed in which the explicit nature of the difference
equation is revealed and exploited when applied to derive the solution
of both linear and non-linear 2-point boundary value problems. Two
strategies have been used in the implementation of the parallel A.G.E.
method using the synchronous and asynchronous approaches. The results
from these implementations were compared. For comparison, the results
obtained from the parallel A.G.E. method were also set against the
corresponding results obtained from the parallel versions of the Jacobi,
Gauss-Seidel and S.O.R. methods. Finally, a computational complexity
analysis of the parallel A.G.E. algorithms is included.
In the area of non-numeric algorithms, the problems of sorting and
searching were studied. The sorting methods investigated
were the shell and the digit sort methods. With each method, different
parallel strategies and approaches were used and compared to find the
best results which can be obtained on the parallel machine.
In the searching methods, the sequential search algorithm in an
unordered table and the binary search algorithms were investigated and
implemented in parallel with a presentation of the results. Finally,
a complexity analysis of these methods is presented.
The thesis concludes with a chapter summarizing the main results
- …