Adapting the Phylogenetic Program FITCH for Distributed Processing
The ability to reconstruct optimal phylogenies (evolutionary trees) based on objective criteria bears directly on our understanding of the relationships among organisms, including human evolution, as well as the spread of infectious disease. Numerous tree construction methods have been implemented for execution on single processors; however, inferring large phylogenies using computationally intense algorithms can be beyond the practical capacity of a single processor. Distributed and parallel processing provides a means of overcoming this hurdle. FITCH is a freely available, single-processor implementation of a distance-based tree-building algorithm commonly used by the biological community. Through an alternating least squares approach to branch length optimization and tree comparison, FITCH iteratively builds up evolutionary trees through species addition and branch rearrangement. To extend the utility of this program, I describe the design, implementation, and performance of mpiFITCH, a parallel-processing version of FITCH developed using the Message Passing Interface for message exchange. Balanced load distribution required the conversion of tree generation from recursive linked-list traversal to iterative, array-based traversal. Execution of mpiFITCH on a Beowulf cluster running 64 processors revealed a maximum performance enhancement of up to ~28-fold with an efficiency of ~40%.
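The conversion the abstract mentions, from recursive traversal to an iterative, array-based one, can be sketched as follows. This is a minimal illustration of the general technique, not mpiFITCH's actual code; the function name and array layout are hypothetical.

```python
# Sketch (illustrative, not from mpiFITCH): replacing recursive tree
# traversal with an iterative walk over a tree stored in parallel arrays.
# An explicit work list replaces the call stack, so the enumerated nodes
# can be sliced into equal chunks and handed out to worker processes.

def preorder_iterative(left, right, root=0):
    """Preorder traversal of a binary tree stored in arrays.

    left[i] / right[i] hold the child indices of node i, or -1 at a tip.
    """
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        if right[node] != -1:
            stack.append(right[node])
        if left[node] != -1:
            stack.append(left[node])  # pushed last, so visited first
    return order

# Tiny example: root 0 with children 1 and 2; node 1 has tips 3 and 4.
left = [1, 3, -1, -1, -1]
right = [2, 4, -1, -1, -1]
print(preorder_iterative(left, right))  # -> [0, 1, 3, 4, 2]
```

Because the traversal produces a flat list of work items rather than a chain of recursive calls, distributing it evenly across MPI ranks becomes a simple matter of partitioning an array.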
Improved neighbor list algorithm in molecular simulations using cell decomposition and data sorting method
An improved neighbor list algorithm is proposed to reduce unnecessary
interatomic distance calculations in molecular simulations. It combines the
advantages of Verlet table and cell linked list algorithms by using cell
decomposition approach to accelerate the neighbor list construction speed, and
data sorting method to lower the CPU data cache miss rate, as well as partial
updating method to minimize the unnecessary reconstruction of the neighbor
list. Both serial and parallel performance of molecular dynamics simulation are
evaluated using the proposed algorithm and compared with those using
conventional Verlet table and cell linked list algorithms. Results show that
the new algorithm outperforms the conventional algorithms by a factor of 2-3 for
both small and large numbers of atoms.

Comment: 14 pages, 7 figures. Submitted to Computer Physics Communications
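The cell-decomposition idea at the core of this approach can be sketched in a few lines. This is a minimal serial illustration under assumed conventions (cubic periodic box, half list); the paper's data-sorting and partial-update refinements are omitted, and all names are illustrative.

```python
import math

def build_neighbor_list(positions, box, cutoff):
    """Half neighbor list via cell decomposition in a cubic periodic box.

    Particles are binned into cells with edge >= cutoff, so only the 27
    surrounding cells need to be searched instead of all O(N^2) pairs.
    """
    ncell = max(1, int(box // cutoff))   # cells per dimension
    size = box / ncell
    cells = {}
    for i, p in enumerate(positions):
        key = tuple(int(c / size) % ncell for c in p)
        cells.setdefault(key, []).append(i)

    pairs = set()
    for (cx, cy, cz), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    nbr = ((cx + dx) % ncell, (cy + dy) % ncell,
                           (cz + dz) % ncell)
                    for i in members:
                        for j in cells.get(nbr, ()):
                            if j <= i:
                                continue  # half list: count each pair once
                            # minimum-image separation per coordinate
                            d = [min(abs(a - b), box - abs(a - b))
                                 for a, b in zip(positions[i], positions[j])]
                            if math.dist((0.0, 0.0, 0.0), d) <= cutoff:
                                pairs.add((i, j))
    return sorted(pairs)
```

In the combined scheme the abstract describes, a list built this way would additionally be stored in sorted order (to improve cache behavior) and only partially rebuilt between steps.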
Efficiency of linked cell algorithms
The linked cell list algorithm is an essential part of molecular simulation
software, both molecular dynamics and Monte Carlo. Though it scales linearly
with the number of particles, there has been a constant interest in increasing
its efficiency, because a large part of CPU time is spent to identify the
interacting particles. Several recent publications proposed improvements to the
algorithm and investigated their efficiency by applying them to particular
setups. In this publication we develop a general method to evaluate the
efficiency of these algorithms, which is mostly independent of the parameters
of the simulation, and test it for a number of linked cell list algorithms. We
also propose a combination of linked cell reordering and interaction sorting
that shows good efficiency for a broad range of simulation setups.

Comment: Submitted to Computer Physics Communications on 22 December 2009, still awaiting a referee report
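For reference, the classic linked cell list the abstract builds on stores, per cell, only the index of one "head" particle, with the remaining members chained through a second array. A minimal sketch (function names are illustrative):

```python
def linked_cell_lists(cell_index, ncells):
    """Classic head/next arrays of the linked cell method.

    head[c] is the first particle in cell c (-1 if the cell is empty);
    nxt[i] chains to the next particle in the same cell. Construction is
    O(N) and needs only two integer arrays.
    """
    head = [-1] * ncells
    nxt = [-1] * len(cell_index)
    for i, c in enumerate(cell_index):
        nxt[i] = head[c]   # prepend particle i to cell c's chain
        head[c] = i
    return head, nxt

def particles_in_cell(head, nxt, c):
    """Walk cell c's chain and return its particle indices."""
    i, out = head[c], []
    while i != -1:
        out.append(i)
        i = nxt[i]
    return out
```

The pointer-chasing in `particles_in_cell` is exactly what makes the traversal cache-unfriendly, which is why reordering and interaction-sorting variants, such as those evaluated here, can pay off even though the basic algorithm is already linear in the number of particles.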
An Efficient Cell List Implementation for Monte Carlo Simulation on GPUs
Maximizing the performance potential of the modern day GPU architecture
requires judicious utilization of available parallel resources. Although
dramatic reductions can often be obtained through straightforward mappings,
further performance improvements often require algorithmic redesigns to more
closely exploit the target architecture. In this paper, we focus on efficient
molecular simulations for the GPU and propose a novel cell list algorithm that
better utilizes its parallel resources. Our goal is an efficient GPU
implementation of large-scale Monte Carlo simulations for the grand canonical
ensemble. This is a particularly challenging application because there is
inherently less computation and parallelism than in similar applications with
molecular dynamics. Consistent with the results of prior researchers, our
simulation results show traditional cell list implementations for Monte Carlo
simulations of molecular systems offer effectively no performance improvement
for small systems [5, 14], even when porting to the GPU. However for larger
systems, the cell list implementation offers significant gains in performance.
Furthermore, our novel cell list approach results in better performance for all
problem sizes when compared with other GPU implementations with or without cell
lists.

Comment: 30 pages
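One reason linked lists are a poor fit for GPUs is that pointer chains serialize memory access. A common GPU-friendly alternative, shown here as a plain-Python sketch rather than the paper's specific novel scheme, stores each cell's members in a fixed-width row of a dense table so that threads can read them in a coalesced fashion. Names and the capacity-overflow policy are assumptions.

```python
def padded_cell_list(cell_index, ncells, capacity):
    """GPU-style cell list: a dense (ncells x capacity) table plus counts.

    table[c][k] holds the k-th particle of cell c, padded with -1; on a
    GPU, one thread per slot can then scan a cell's row with regular,
    coalesced memory accesses instead of chasing list pointers.
    """
    table = [[-1] * capacity for _ in range(ncells)]
    count = [0] * ncells
    for i, c in enumerate(cell_index):
        if count[c] == capacity:
            raise OverflowError("cell full; rebuild with larger capacity")
        table[c][count[c]] = i
        count[c] += 1
    return table, count
```

The trade-off is wasted storage in sparsely occupied cells, which matters for grand canonical Monte Carlo, where particle numbers fluctuate and the capacity must accommodate the densest cell.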
A Parallel Adaptive P3M code with Hierarchical Particle Reordering
We discuss the design and implementation of HYDRA_OMP, a parallel
implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M)
code HYDRA. The code is designed primarily for conducting cosmological
hydrodynamic simulations and is written in Fortran77+OpenMP. A number of
optimizations for RISC processors and SMP-NUMA architectures have been
implemented, the most important optimization being hierarchical reordering of
particles within chaining cells, which greatly improves data locality thereby
removing the cache misses typically associated with linked lists. Parallel
scaling is good, with a minimum parallel scaling of 73% achieved on 32 nodes
for a variety of modern SMP architectures. We give performance data in terms of
the number of particle updates per second, which is a more useful performance
metric than raw MFlops. A basic version of the code will be made available to
the community in the near future.

Comment: 34 pages, 12 figures, accepted for publication in Computer Physics Communications
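The locality optimization described above, reordering particles so that members of the same chaining cell sit contiguously in memory, can be sketched as a stable sort by cell index. This is an illustrative reduction of the idea, not HYDRA_OMP's Fortran implementation; names are hypothetical.

```python
def reorder_by_cell(positions, cell_index, ncells):
    """Reorder particle data so each cell's members are contiguous.

    Traversing a linked list visits particles in essentially random
    memory order; after a stable sort by cell index, a cell's particles
    occupy one contiguous block, so walking a cell streams through
    memory and avoids the cache misses of pointer chasing.
    """
    order = sorted(range(len(positions)), key=lambda i: cell_index[i])
    new_positions = [positions[i] for i in order]
    new_cells = [cell_index[i] for i in order]
    return new_positions, new_cells, order  # order maps new index -> old
```

In a production code the reordering would be applied periodically to all per-particle arrays (positions, velocities, masses, ...) so that every force loop benefits from the restored locality.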
Efficient Parallelization of Short-Range Molecular Dynamics Simulations on Many-Core Systems
This article introduces a highly parallel algorithm for molecular dynamics
simulations with short-range forces on single node multi- and many-core
systems. The algorithm is designed to achieve high parallel speedups for
strongly inhomogeneous systems like nanodevices or nanostructured materials. In
the proposed scheme the calculation of the forces and the generation of
neighbor lists is divided into small tasks. The tasks are then executed by a
thread pool according to a dependent task schedule. This schedule is
constructed in such a way that a particle is never accessed by two threads at
the same time.

Benchmark simulations on a typical 12-core machine show that the
described algorithm achieves excellent parallel efficiencies above 80% for
different kinds of systems and all numbers of cores. For inhomogeneous systems
the speedups are strongly superior to those obtained with spatial
decomposition. Further benchmarks were performed on an Intel Xeon Phi
coprocessor. These simulations demonstrate that the algorithm scales well to
large numbers of cores.

Comment: 12 pages, 8 figures
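The constraint that no particle is ever accessed by two threads at once can be illustrated with a simple cell-coloring schedule. This is a basic stand-in for the paper's dependent task schedule, shown in 2D for brevity; all names are hypothetical, and a 3-colors-per-axis stride is used because each task may write to its own cell and its immediate neighbors.

```python
from concurrent.futures import ThreadPoolExecutor

def color_cells(nx, ny):
    """Assign the cells of an nx-by-ny grid to 9 colors (3 per axis).

    Same-color cells are at least three cells apart, so tasks that
    update a cell and its adjacent cells never touch the same particle
    when cells of one color are processed concurrently.
    """
    groups = {c: [] for c in range(9)}
    for ix in range(nx):
        for iy in range(ny):
            groups[3 * (ix % 3) + (iy % 3)].append((ix, iy))
    return groups

def run_schedule(groups, task):
    """Run one color group at a time; tasks within a group in parallel."""
    with ThreadPoolExecutor() as pool:
        for color in sorted(groups):
            list(pool.map(task, groups[color]))
```

A dependent task schedule generalizes this: instead of hard barriers between colors, each task becomes runnable as soon as the specific neighboring tasks it conflicts with have finished, which keeps threads busier on inhomogeneous systems.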