106 research outputs found
On the acceleration of wavefront applications using distributed many-core architectures
In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications—a ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed those of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures
Molecular dynamics beyonds the limits: massive scaling on 72 racks of a BlueGene/P and supercooled glass transition of a 1 billion particles system
We report scaling results on the world's largest supercomputer of our
recently developed Billions-Body Molecular Dynamics (BBMD) package, which was
especially designed for massively parallel simulations of the atomic dynamics
in structural glasses and amorphous materials. The code was able to scale up to
72 racks of an IBM BlueGene/P, with a measured 89% efficiency for a system with
100 billion particles. The code speed, with less than 0.14 seconds per
iteration in the case of 1 billion particles, paves the way to the study of
billion-body structural glasses with a resolution increase of two orders of
magnitude with respect to the largest simulation ever reported. We demonstrate
the effectiveness of our code by studying the liquid-glass transition of an
exceptionally large system made by a binary mixture of 1 billion particles.Comment: 14 pages, 8 figures, submitted to Journal of Computational Physic
Computing for Perturbative QCD - A Snowmass White Paper
We present a study on high-performance computing and large-scale distributed
computing for perturbative QCD calculations.Comment: 21 pages, 5 table
Towards Loosely-Coupled Programming on Petascale Systems
We have extended the Falkon lightweight task execution framework to make
loosely coupled programming on petascale systems a practical and useful
programming model. This work studies and measures the performance factors
involved in applying this approach to enable the use of petascale systems by a
broader user community, and with greater ease. Our work enables the execution
of highly parallel computations composed of loosely coupled serial jobs with no
modifications to the respective applications. This approach allows a new-and
potentially far larger-class of applications to leverage petascale systems,
such as the IBM Blue Gene/P supercomputer. We present the challenges of I/O
performance encountered in making this model practical, and show results using
both microbenchmarks and real applications from two domains: economic energy
modeling and molecular dynamics. Our benchmarks show that we can scale up to
160K processor-cores with high efficiency, and can achieve sustained execution
rates of thousands of tasks per second.Comment: IEEE/ACM International Conference for High Performance Computing,
Networking, Storage and Analysis (SuperComputing/SC) 200
Progress Towards Petascale Applications in Biology: Status in 2006
Petascale computing is currently a common topic of discussion in the high performance computing community. Biological applications, particularly protein folding, are often given as examples of the need for petascale computing. There are at present biological applications that scale to execution rates of approximately 55 teraflops on a special-purpose supercomputer and 2.2 teraflops on a general-purpose supercomputer. In comparison, Qbox, a molecular dynamics code used to model metals, has an achieved performance of 207.3 teraflops. It may be useful to increase the extent to which operation rates and total calculations are reported in discussion of biological applications, and use total operations (integer and floating point combined) rather than (or in addition to) floating point operations as the unit of measure. Increased reporting of such metrics will enable better tracking of progress as the research community strives for the insights that will be enabled by petascale computing.This research was supported in part by the Indiana Genomics Initiative and the Indiana Metabolomics and Cytomics Initiative. The Indiana Genomics Initiative of Indiana University and the Indiana Metabolomics and Cytomics Initiative of Indiana University are supported in part by Lilly Endowment, Inc. The authors also wish to thank IBM, Inc. for support via Shared University Research Grants and partnerships via IU’s relationship as an IBM Life Sciences Institute of Innovation. Indiana University also thanks the TeraGrid partners; IU’s participation in the TeraGrid is funded by National Science Foundation grant numbers 0338618, 0504075, and 0451237. The early development of this paper was supported by a Fulbright Senior Scholars award from the Council for International Exchange of Scholars (CIES) and the United States Department of State to Dr. Craig A. Stewart; Matthias Mueller and the Technische Universität Dresden were hosts. Many reviewers contributed to the improvement of the ideas expressed in this paper and are gratefully appreciated; Thom Dunning, Robert Germain, Chris Mueller, Jim Phillips, Richard Repasky, Ralph Roskies, and Allan Snavely are thanked particularly for their insights
Lessons learned at 208K: Towards debugging millions of cores
Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application – already, debugging the full BlueGene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To scale to such counts and beyond, tools must employ a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become a tool bottleneck. In this paper, we present challenges to petascale tool development, using the Stack Trace Analysis Tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an Infiniband cluster and results up to 208K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petas-cale. We then present solutions to these challenges that have been implemented and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.
- …