Sorting Integers on the AP1000
Sorting is one of the classic problems of computer science. Whilst well
understood on sequential machines, the diversity of architectures amongst
parallel systems means that algorithms do not perform uniformly on all
platforms. This document describes the implementation of a radix-based
algorithm for sorting positive integers on a Fujitsu AP1000 Supercomputer,
which was constructed as an entry in the Joint Symposium on Parallel Processing
(JSPP) 1994 Parallel Software Contest (PSC94). Brief consideration is also
given to a full radix sort conducted in parallel across the machine.
Comment: 1994 Project Report, 23 page
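For readers unfamiliar with the technique named in this abstract, the following is a minimal sequential sketch of a least-significant-digit radix sort for non-negative integers. It is not the AP1000 implementation described in the report; the function name `radix_sort` and the `bits_per_pass` parameter are assumptions made for illustration only.

```python
def radix_sort(values, bits_per_pass=8):
    """Sort non-negative integers with an LSD radix sort (illustrative sketch)."""
    if not values:
        return []
    if min(values) < 0:
        raise ValueError("this sketch handles non-negative integers only")

    radix = 1 << bits_per_pass      # number of buckets per pass
    mask = radix - 1                # extracts one digit after shifting
    out = list(values)
    largest = max(out)
    shift = 0

    # One stable bucketing pass per digit, least significant digit first.
    while shift == 0 or (largest >> shift) > 0:
        buckets = [[] for _ in range(radix)]
        for v in out:
            buckets[(v >> shift) & mask].append(v)
        out = [v for bucket in buckets for v in bucket]
        shift += bits_per_pass
    return out
```

Each pass is stable, so the ordering established by lower digits is preserved while higher digits are processed. A parallel version would additionally have to distribute keys between processors, which this sketch does not attempt.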
Efficient Implementations of Molecular Dynamics Simulations for Lennard-Jones Systems
Efficient implementations of the classical molecular dynamics (MD) method for
Lennard-Jones particle systems are considered. Both general algorithms and
techniques that are efficient for specific CPU architectures are
explained. A simple spatial-decomposition-based strategy is adopted for
parallelization. By utilizing the developed code, benchmark simulations are
performed on a HITACHI SR16000/J2 system consisting of 4.7 GHz IBM POWER6
processors at the National Institute for Fusion Science (NIFS) and an
SGI Altix ICE 8400EX system consisting of 2.93 GHz Intel Xeon processors
at the Institute for Solid State Physics (ISSP), the University of Tokyo.
The parallelization efficiency of the largest run, consisting of 4.1 billion
particles with 8192 MPI processes, is about 73% relative to that of the
smallest run with 128 MPI processes at NIFS, and it is about 66% relative to
that of the smallest run with 4 MPI processes at ISSP. The factors causing the
parallel overhead are investigated. It is found that fluctuations of the
execution time of each process degrade the parallel efficiency. These
fluctuations may be due to the interference of the operating system, which is
known as OS Jitter.
Comment: 33 pages, 19 figures, add references and figures are revise
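As background for this abstract, the standard 12-6 Lennard-Jones pair interaction can be sketched as below. This is only the textbook formula, V(r) = 4ε[(σ/r)^12 − (σ/r)^6], not the optimized code the paper describes; the function names `lj_potential` and `lj_force` and the default reduced units (ε = σ = 1) are assumptions made for illustration.

```python
def lj_potential(r, epsilon=1.0, sigma=1.0):
    """12-6 Lennard-Jones pair potential at separation r (illustrative sketch)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_force(r, epsilon=1.0, sigma=1.0):
    """Radial force -dV/dr; positive values are repulsive."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r
```

The potential crosses zero at r = σ and has its minimum of depth ε at r = 2^(1/6) σ, where the force vanishes. Production MD codes typically truncate the interaction at a cutoff radius and use neighbor lists, neither of which is shown here.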
A portable platform for accelerated PIC codes and its application to GPUs using OpenACC
We present a portable platform, called PIC_ENGINE, for accelerating
Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as
Graphics Processing Units (GPUs). The aim of this development is efficient
simulations on future exascale systems by allowing different parallelization
strategies depending on the application problem and the specific architecture.
To this end, this platform contains the basic steps of the PIC algorithm and
has been designed as a test bed for different algorithmic options and data
structures. Among the architectures that this engine can explore, particular
attention is given here to systems equipped with GPUs. The study demonstrates
that our portable PIC implementation based on the OpenACC programming model can
achieve performance closely matching theoretical predictions. Using the Cray
XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we
show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the
same code running on an Intel Sandy Bridge 8-core CPU by a factor of 3.4.
Engineering Parallel String Sorting
We discuss how string sorting algorithms can be parallelized on modern
multi-core shared memory machines. As a synthesis of the best sequential string
sorting algorithms and successful parallel sorting algorithms for atomic
objects, we first propose string sample sort. The algorithm makes effective use
of the memory hierarchy, uses additional word level parallelism, and largely
avoids branch mispredictions. Then we focus on NUMA architectures, and develop
parallel multiway LCP-merge and -mergesort to reduce the number of random
memory accesses to remote nodes. Additionally, we parallelize variants of
multikey quicksort and radix sort that are also useful in certain situations.
Comprehensive experiments on five current multi-core platforms are then
reported and discussed. The experiments show that our implementations scale
very well on real-world inputs and modern machines.
Comment: 46 pages, extension of "Parallel String Sample Sort" arXiv:1305.115
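Multikey quicksort, one of the algorithms this abstract mentions, can be illustrated with the sequential sketch below. It partitions strings three ways on the character at the current depth and only advances the depth for the "equal" group. This is not the parallel variant developed in the paper; the function name `multikey_quicksort` and the list-based partitioning are assumptions chosen for readability rather than speed.

```python
def multikey_quicksort(strings, depth=0):
    """Sort strings by three-way partitioning on one character at a time."""
    if len(strings) <= 1:
        return list(strings)

    def char_at(s):
        # -1 marks "end of string" so shorter strings sort first.
        return ord(s[depth]) if depth < len(s) else -1

    pivot = char_at(strings[len(strings) // 2])
    less = [s for s in strings if char_at(s) < pivot]
    equal = [s for s in strings if char_at(s) == pivot]
    greater = [s for s in strings if char_at(s) > pivot]

    # Strings in `equal` share one more character, so compare the next one.
    # If the pivot is the end marker, those strings are identical already.
    if pivot >= 0:
        equal = multikey_quicksort(equal, depth + 1)

    return (multikey_quicksort(less, depth)
            + equal
            + multikey_quicksort(greater, depth))
```

Because characters already known to be equal are never re-compared, the approach avoids the repeated prefix scans that a comparison-based sort performs on strings with long common prefixes.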