High Performance Direct Gravitational N-body Simulations on Graphics Processing Units
We present the results of gravitational direct N-body simulations using the
commercial graphics processing units (GPUs) NVIDIA Quadro FX1400 and GeForce
8800GTX, and compare the results with GRAPE-6Af special-purpose hardware. The
force evaluation of the N-body problem was implemented in Cg, using the GPU
directly to speed up the calculations. The integration of the equations of
motion, running on the host computer, was implemented in C using the 4th-order
predictor-corrector Hermite integrator with block time steps. We find
that for a large number of particles (N ≳ 10^4) modern graphics
processing units offer an attractive low-cost alternative to GRAPE special-purpose
hardware. A modern GPU continues to give a relatively flat scaling with
the number of particles, comparable to that of the GRAPE. Using the same time
step criterion, the total energy of the N-body system was conserved on the GPU
only about an order of magnitude worse than with GRAPE. For N ≳ 10^6 the
GeForce 8800GTX was about 20 times faster than the host computer. Though still
about an order of magnitude slower than GRAPE, modern GPUs outperform GRAPE in
their low cost, long mean time between failures, and much larger onboard memory;
the GRAPE-6Af holds at most 256k particles, whereas the GeForce 8800GTX can hold
9 million particles in memory.
Comment: Submitted to New Astronomy
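The abstract's core kernel is the direct-summation force evaluation, which is the O(N^2) part offloaded to the GPU. As an illustrative sketch only (the paper's actual kernel is written in Cg; the softening length `eps` and G = 1 N-body units are assumptions, not taken from the abstract):

```python
import math

def accelerations(pos, mass, eps=1e-4):
    """Direct-summation O(N^2) pairwise gravitational accelerations,
    the kind of kernel the abstract offloads to the GPU.
    pos: list of [x, y, z]; mass: list of masses; eps: softening
    length (assumed); G = 1 in N-body units (assumed)."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Separation vector from particle i to particle j.
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2 + eps ** 2
            # m_j / r^3 factor of the softened Newtonian force.
            inv_r3 = mass[j] / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += dx[k] * inv_r3
    return acc
```

On a GPU each particle's inner sum runs in its own thread, which is why the scaling with N stays relatively flat until the O(N^2) arithmetic dominates.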
Components and Interfaces of a Process Management System for Parallel Programs
Parallel jobs are different from sequential jobs and require a different type
of process management. We present here a process management system for parallel
programs such as those written using MPI. A primary goal of the system, which
we call MPD (for multipurpose daemon), is to be scalable. By this we mean that
startup of interactive parallel jobs comprising thousands of processes is
quick, that signals can be quickly delivered to processes, and that stdin,
stdout, and stderr are managed intuitively. Our primary target is parallel
machines made up of clusters of SMPs, but the system is also useful in more
tightly integrated environments. We describe how MPD enables much faster
startup and better runtime management of parallel jobs. We show how close
control of stdio can support the easy implementation of a number of convenient
system utilities, even a parallel debugger. We describe a simple but general
interface that can be used to separate any process manager from a parallel
library, which we use to keep MPD separate from MPICH.
Comment: 12 pages, Workshop on Clusters and Computational Grids for Scientific
Computing, Sept. 24-27, 2000, Le Chateau de Faverges de la Tour, France
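One piece of the "intuitive stdio management" the abstract describes is merging the output of many processes while keeping track of which rank produced each line. A toy sketch under stated assumptions (plain subprocesses rather than MPI ranks; the function name and rank-prefix format are illustrative, not MPD's API):

```python
import subprocess
import sys

def run_labeled(cmds):
    """Launch one process per command and return their stdout merged,
    each line prefixed with the process's rank. A toy stand-in for a
    process manager's stdio forwarding (names are illustrative)."""
    procs = [subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
             for cmd in cmds]
    merged = []
    for rank, proc in enumerate(procs):
        out, _ = proc.communicate()  # wait and collect stdout
        for line in out.splitlines():
            merged.append(f"{rank}: {line}")
    return merged
```

A real process manager multiplexes the streams as they arrive rather than waiting for each process in turn, but the labeling idea is the same.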
Performance Evaluation of Specialized Hardware for Fast Global Operations on Distributed Memory Multicomputers
Workstation cluster multicomputers are increasingly being applied to solving scientific problems that require massive computing power. Parallel Virtual Machine (PVM) is a popular message-passing model used to program these clusters. One of the major performance-limiting factors for cluster multicomputers is their inefficiency in performing parallel program operations involving collective communications. These operations include synchronization, global reduction, broadcast/multicast operations, and orderly access to shared global variables. Hall has demonstrated that a secondary network with a wide tree topology and centralized coordination processors (COPs) could improve the performance of global operations on a variety of distributed architectures [Hall94a]. My hypothesis was that the efficiency of many PVM applications on workstation clusters could be significantly improved by utilizing a COP system for collective communication operations. To test my hypothesis, I interfaced the COP system with PVM. The interface software includes a virtual memory-mapped secondary network interface driver and a function library that allows applications to use the COP system in place of PVM function calls. My implementation makes it possible to easily port any existing PVM application to perform fast global operations using the COP system. To evaluate the performance improvement of using a COP system, I measured the cost of various PVM global functions, derived the cost of the equivalent COP library global functions, and compared the results. To analyze the impact of the cost of global operations on the overall execution time of applications, I instrumented a complex molecular dynamics PVM application and performed measurements. The measurements were performed for a sample cluster size of 5 and for message sizes up to 16 kilobytes.
The comparison of PVM and COP system global operation performance clearly demonstrates that the COP system can speed up a variety of global operations involving small-to-medium-sized messages by factors of 5-25. Analysis of the example application for a sample cluster size of 5 shows that the speedup provided by my global function libraries and the COP system reduces the overall execution time of this and similar applications by more than 1.5 times. Additionally, the performance improvement seen by applications increases as the cluster size increases, thus providing a scalable solution for performing global operations.
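The global reduction the abstract benchmarks is typically organized as a tree so that it completes in O(log P) communication rounds instead of P-1 sequential sends. A minimal sketch over simulated process ranks (the function name and round counting are illustrative; the COP hardware performs the combining step in the secondary network rather than in software):

```python
def tree_reduce(values):
    """Binomial-tree global sum over P simulated processes.
    Each round, rank i receives from rank i + stride and combines,
    so the sum finishes in ceil(log2(P)) rounds instead of P - 1.
    Returns (total, rounds)."""
    vals = list(values)
    p = len(vals)
    rounds = 0
    stride = 1
    while stride < p:
        # One communication round: partners at distance `stride` combine.
        for i in range(0, p, 2 * stride):
            if i + stride < p:
                vals[i] += vals[i + stride]
        stride *= 2
        rounds += 1
    return vals[0], rounds
```

The logarithmic round count is what makes small-message collectives latency-bound, which is exactly where a dedicated secondary network can yield the 5-25x speedups reported above.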
Designing SSI clusters with hierarchical checkpointing and single I/O space
Adopting a new hierarchical checkpointing architecture, the authors develop a single I/O address space for building highly available clusters of computers. They propose a systematic approach to achieving a single system image by integrating existing middleware support with newly developed features.