    On the acceleration of wavefront applications using distributed many-core architectures

    In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications, a ubiquitous class of parallel algorithms used in the solution of a number of scientific and engineering problems. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C, and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions far exceeds that of many traditional technologies, sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithms on GPU-based architectures.
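
    To make the dependency pattern concrete: in a pipelined wavefront sweep, each cell of a grid depends on its already-computed neighbours, so computation advances as a diagonal front. Below is a minimal 2D sketch of the traversal (illustrative only; the update rule is a toy stand-in, not the LU solver's actual stencil). Cells on the same anti-diagonal d = i + j are mutually independent and could run in parallel, while the diagonals themselves must be processed in order; the proposed k-blocking would batch several such planes per device launch to amortise the PCIe transfer overheads the abstract identifies.

        import numpy as np

        def wavefront_sweep(grid):
            """Sweep a 2D grid along anti-diagonal hyperplanes.

            Cell (i, j) depends on (i-1, j) and (i, j-1), the classic
            pipelined-wavefront dependency; all cells on one diagonal
            d = i + j are independent of each other.
            """
            n, m = grid.shape
            for d in range(n + m - 1):                      # diagonals, in order
                for i in range(max(0, d - m + 1), min(n, d + 1)):
                    j = d - i
                    north = grid[i - 1, j] if i > 0 else 0.0
                    west = grid[i, j - 1] if j > 0 else 0.0
                    grid[i, j] += 0.5 * (north + west)      # toy update rule
            return grid

        wavefront_sweep(np.ones((4, 4)))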

    Molecular dynamics beyond the limits: massive scaling on 72 racks of a BlueGene/P and supercooled glass transition of a 1-billion-particle system

    We report scaling results on the world's largest supercomputer of our recently developed Billions-Body Molecular Dynamics (BBMD) package, which was especially designed for massively parallel simulations of the atomic dynamics in structural glasses and amorphous materials. The code was able to scale up to 72 racks of an IBM BlueGene/P, with a measured 89% efficiency for a system with 100 billion particles. The code speed, with less than 0.14 seconds per iteration in the case of 1 billion particles, paves the way to the study of billion-body structural glasses with a resolution increase of two orders of magnitude with respect to the largest simulation ever reported. We demonstrate the effectiveness of our code by studying the liquid-glass transition of an exceptionally large system made of a binary mixture of 1 billion particles. Comment: 14 pages, 8 figures, submitted to Journal of Computational Physics
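
    For scale, one iteration here is one molecular dynamics time step. A minimal sketch of a velocity-Verlet step, the standard integrator for such simulations, is given below (illustrative only; the force routine is assumed to be supplied by the caller, e.g. a short-range pair potential with neighbour lists, and this is not the BBMD implementation). At 0.14 s per step for 10^9 particles, the reported speed corresponds to roughly 7 × 10^9 particle-updates per second.

        import numpy as np

        def velocity_verlet_step(pos, vel, forces, compute_forces, dt, mass=1.0):
            """Advance an N-body system by one time step.

            compute_forces(pos) -> forces is the caller-supplied pair
            interaction; in a production MD code this call dominates
            the cost and is what gets parallelised across racks.
            """
            vel += 0.5 * dt * forces / mass      # half kick
            pos += dt * vel                      # drift
            forces = compute_forces(pos)         # recompute forces
            vel += 0.5 * dt * forces / mass      # second half kick
            return pos, vel, forces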

    Towards Loosely-Coupled Programming on Petascale Systems

    We have extended the Falkon lightweight task execution framework to make loosely coupled programming on petascale systems a practical and useful programming model. This work studies and measures the performance factors involved in applying this approach to enable the use of petascale systems by a broader user community, and with greater ease. Our work enables the execution of highly parallel computations composed of loosely coupled serial jobs with no modifications to the respective applications. This approach allows a new, and potentially far larger, class of applications to leverage petascale systems, such as the IBM Blue Gene/P supercomputer. We present the challenges of I/O performance encountered in making this model practical, and show results using both microbenchmarks and real applications from two domains: economic energy modeling and molecular dynamics. Our benchmarks show that we can scale up to 160K processor-cores with high efficiency, and can achieve sustained execution rates of thousands of tasks per second. Comment: IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SuperComputing/SC) 200
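
    The model being described is many-task computing: a large bag of unmodified serial jobs dispatched to workers. A single-node sketch of the pattern follows, using Python's standard library as a stand-in for Falkon's dispatcher (purely illustrative; Falkon itself distributes this role across the machine's compute and I/O hierarchy).

        from concurrent.futures import ProcessPoolExecutor
        import subprocess

        def run_task(cmd):
            """Run one unmodified serial job; return its exit code."""
            return subprocess.run(cmd, capture_output=True).returncode

        if __name__ == "__main__":
            # A bag of independent serial jobs. Falkon plays this
            # dispatcher role across 160K cores rather than one node.
            tasks = [["echo", f"job-{i}"] for i in range(100)]
            with ProcessPoolExecutor() as pool:
                results = list(pool.map(run_task, tasks))
            print(sum(rc == 0 for rc in results), "tasks succeeded")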

    2006 Computation Directorate Annual Report

    On the Potential of NoC Virtualization for Multicore Chips

    Progress Towards Petascale Applications in Biology: Status in 2006

    Petascale computing is currently a common topic of discussion in the high performance computing community. Biological applications, particularly protein folding, are often given as examples of the need for petascale computing. There are at present biological applications that scale to execution rates of approximately 55 teraflops on a special-purpose supercomputer and 2.2 teraflops on a general-purpose supercomputer. In comparison, Qbox, a molecular dynamics code used to model metals, has an achieved performance of 207.3 teraflops. It may be useful to increase the extent to which operation rates and total calculations are reported in discussions of biological applications, and to use total operations (integer and floating point combined) rather than (or in addition to) floating point operations as the unit of measure. Increased reporting of such metrics will enable better tracking of progress as the research community strives for the insights that will be enabled by petascale computing.

    This research was supported in part by the Indiana Genomics Initiative and the Indiana Metabolomics and Cytomics Initiative. The Indiana Genomics Initiative of Indiana University and the Indiana Metabolomics and Cytomics Initiative of Indiana University are supported in part by Lilly Endowment, Inc. The authors also wish to thank IBM, Inc. for support via Shared University Research Grants and partnerships via IU’s relationship as an IBM Life Sciences Institute of Innovation. Indiana University also thanks the TeraGrid partners; IU’s participation in the TeraGrid is funded by National Science Foundation grant numbers 0338618, 0504075, and 0451237. The early development of this paper was supported by a Fulbright Senior Scholars award from the Council for International Exchange of Scholars (CIES) and the United States Department of State to Dr. Craig A. Stewart; Matthias Mueller and the Technische Universität Dresden were hosts. Many reviewers contributed to the improvement of the ideas expressed in this paper; Thom Dunning, Robert Germain, Chris Mueller, Jim Phillips, Richard Repasky, Ralph Roskies, and Allan Snavely are thanked particularly for their insights.
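
    The metric change proposed here is simple to state in code. A minimal sketch follows (the counter values are made up; in practice they would come from hardware performance counters):

        def total_ops_rate(integer_ops, fp_ops, seconds):
            """Total-operations rate: integer and floating point
            operations combined, per second, as proposed above."""
            return (integer_ops + fp_ops) / seconds

        # Hypothetical run: 3.0e15 integer ops and 2.2e15 flop in 1000 s
        rate = total_ops_rate(3.0e15, 2.2e15, 1000.0)
        print(f"{rate / 1e12:.1f} tera-ops/s")   # -> 5.2 tera-ops/s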

    PROCEEDINGS OF RIKEN BNL RESEARCH CENTER WORKSHOP : HIGH PERFORMANCE COMPUTING WITH QCDOC AND BLUEGENE.

    Lessons learned at 208K: Towards debugging millions of cores

    Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application – already, debugging the full BlueGene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To scale to such counts and beyond, tools must employ a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become a tool bottleneck. In this paper, we present challenges to petascale tool development, using the Stack Trace Analysis Tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an InfiniBand cluster and results up to 208K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petascale. We then present solutions to these challenges that have been implemented and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.
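
    The core idea, grouping processes whose call stacks agree so a developer inspects one representative per group, can be sketched in a few lines (an illustration of the equivalence-class concept only, not STAT's actual tree-merging implementation).

        from collections import defaultdict

        def equivalence_classes(stack_traces):
            """Group process ranks by identical call stacks.

            stack_traces maps rank -> tuple of frames, outermost first.
            Ranks sharing a stack form one equivalence class, so only
            one representative per class needs to be debugged.
            """
            classes = defaultdict(list)
            for rank, trace in stack_traces.items():
                classes[trace].append(rank)
            return classes

        traces = {
            0: ("main", "solve", "MPI_Allreduce"),
            1: ("main", "solve", "MPI_Allreduce"),
            2: ("main", "io_write"),             # the outlier worth inspecting
        }
        for trace, ranks in equivalence_classes(traces).items():
            print(ranks, "->", " > ".join(trace))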

    ISCR Annual Report: Fiscal Year 2004
