    Project Final Report: HPC-Colony II

    Atomic detail visualization of photosynthetic membranes with GPU-accelerated ray tracing

    The cellular process responsible for providing energy for most life on Earth, namely, photosynthetic light-harvesting, requires the cooperation of hundreds of proteins across an organelle, involving length and time scales spanning several orders of magnitude over quantum and classical regimes. Simulation and visualization of this fundamental energy conversion process pose many unique methodological and computational challenges. We present, in two accompanying movies, light-harvesting in the photosynthetic apparatus found in purple bacteria, the so-called chromatophore. The movies are the culmination of three decades of modeling efforts, featuring the collaboration of theoretical, experimental, and computational scientists. We describe the techniques that were used to build, simulate, analyze, and visualize the structures shown in the movies, and we highlight cases where scientific needs spurred the development of new parallel algorithms that efficiently harness GPU accelerators and petascale computers.

    ALCC Allocation Final Report: HPC Colony II

    Reducing synchronization in distributed parallel programs

    Developers of scalable libraries and applications for distributed-memory parallel systems face many challenges to attaining high performance. These challenges include communication latency, critical path delay, suboptimal scheduling, load imbalance, and system noise. These challenges are often defined and measured relative to points of broad synchronization in the program’s execution. Given the way in which many algorithms are defined and systems are implemented, gauging the above challenges at synchronization points is not unreasonable. In this thesis, I attempt to demonstrate that in many cases, those synchronization points are themselves the core issue behind these challenges. In some cases, the synchronizing operations cause a program to incur the costs from these challenges. In other cases, the presence of synchronization potentially exacerbates these problems. Through a simple performance model, I demonstrate that making synchronization less frequent can greatly mitigate performance issues. My work and several results in the literature show that many motifs and whole applications can be successfully redesigned to operate with asymptotically less synchronization than their naïve starting points. In exploring these issues, I have identified recurrent patterns across many applications and multiple environments that can guide future efforts more directly toward synchronization-avoiding designs. Thus, I attempt to offer developers the beginnings of a high-level playbook to follow rather than having to rediscover application-specific instances of the patterns.
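The abstract does not spell out its performance model, so the following is only a minimal sketch of one common way to reason about synchronization frequency. It assumes a bulk-synchronous program in which every process must wait at each barrier for the slowest process. The names `run_time` and `sample_step_times`, and the uniform noise model, are hypothetical and not taken from the thesis.

```python
import random


def run_time(step_times, sync_interval):
    """Total run time when all processes synchronize every `sync_interval` steps.

    step_times[p][t] is the time process p spends on step t. Between two
    barriers each process runs its steps back to back, and the barrier
    completes only when the slowest process arrives.
    """
    n_steps = len(step_times[0])
    total = 0.0
    for start in range(0, n_steps, sync_interval):
        end = min(start + sync_interval, n_steps)
        # The slowest process in this interval sets the interval's cost.
        total += max(sum(row[start:end]) for row in step_times)
    return total


def sample_step_times(n_procs, n_steps, base=1.0, noise=0.5, seed=0):
    """Hypothetical workload: constant work plus uniform per-step noise."""
    rng = random.Random(seed)
    return [
        [base + rng.uniform(0.0, noise) for _ in range(n_steps)]
        for _ in range(n_procs)
    ]


if __name__ == "__main__":
    times = sample_step_times(n_procs=8, n_steps=64)
    for interval in (1, 8, 64):
        print(interval, round(run_time(times, interval), 2))
```

Under this model, synchronizing at every step charges the per-step maximum each time, whereas less frequent synchronization lets per-step delays on different processes average out before the next barrier.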

    High Performance Computing Facility Operational Assessment, 2012 Oak Ridge Leadership Computing Facility

    An Algorithm for Computing Short-Range Forces in Molecular Dynamics Simulations with Non-Uniform Particle Densities

    We present projection sorting, an algorithmic approach to determining pairwise short-range forces between particles in molecular dynamics simulations. We show it can be more effective than the standard approaches when particle density is non-uniform. We implement tuned versions of the algorithm in the context of a biophysical simulation of chromosome condensation, for the modern Intel Broadwell and Knights Landing architectures, across multiple nodes. We demonstrate up to 5x overall speedup and good scaling to large problem sizes and processor counts.
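The abstract gives no algorithmic details, so the sketch below is only one plausible reading of the general idea of sorting by projection: project every particle onto an axis, sort along it, and compare each particle only with neighbors whose projected gap is within the cutoff. The function name `pairs_within_cutoff`, the choice of axis, and the plain-Python data layout are assumptions for illustration, not the authors' implementation.

```python
import math


def pairs_within_cutoff(points, cutoff, axis=(1.0, 0.0, 0.0)):
    """Return sorted index pairs (i, j), i < j, with distance <= cutoff.

    points is a list of 3-tuples. The gap between two projections onto a
    unit axis never exceeds the true distance, so once the projected gap
    is larger than the cutoff, no later particle in sorted order can be
    in range and the inner loop can stop.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    unit = tuple(a / norm for a in axis)
    # Scalar position of each particle along the chosen axis.
    proj = [sum(p[k] * unit[k] for k in range(3)) for p in points]
    order = sorted(range(len(points)), key=proj.__getitem__)
    cutoff_sq = cutoff * cutoff
    pairs = []
    for a in range(len(order)):
        i = order[a]
        for b in range(a + 1, len(order)):
            j = order[b]
            if proj[j] - proj[i] > cutoff:
                break  # every remaining particle is even further along the axis
            d_sq = sum((points[i][k] - points[j][k]) ** 2 for k in range(3))
            if d_sq <= cutoff_sq:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (3.0, 0.0, 0.0), (0.4, 0.4, 0.0)]
    print(pairs_within_cutoff(pts, cutoff=1.0))
```

Unlike a fixed cell grid, this approach needs no cell size tuned to an average density, which is one reason a sort-based search might suit non-uniform particle distributions.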

    Optimisation of a Molecular Dynamics Simulation of Chromosome Condensation

    We present optimisations applied to a bespoke bio-physical molecular dynamics simulation designed to investigate chromosome condensation. Our primary focus is on domain-specific algorithmic improvements to determining short-range interaction forces between particles, as certain qualities of the simulation render traditional methods less effective. We implement tuned versions of the code for both traditional CPU architectures and the modern many-core architecture found in the Intel Xeon Phi coprocessor and compare their effectiveness. We achieve speed-ups starting at a factor of 10 over the original code, facilitating more detailed and larger-scale experiments.

    PICS - a Performance-analysis-based Introspective Control System to steer parallel applications

    Parallel programming has always been difficult due to the complexity of hardware and the diversity of applications. Although significant progress has been achieved over the years, attaining high parallel efficiency on large supercomputers for various applications is still quite challenging. As we go beyond the current scale of computers to those with peak capacities of an ExaFLOP/s, it is clear that an introspective and adaptive runtime system (RTS) will be critical to reduce programmers' tuning efforts by automatically handling the complexities of applications and machines. This is the motivation for my research on a Performance-analysis-based Introspective Control System - PICS. PICS intelligently steers parallel applications and runtime system configurations to achieve desired goals by utilizing expert knowledge to analyze performance data and adaptively reconfiguring applications. This thesis designs a holistic introspective control system for automatic performance tuning that combines real-time performance analysis and performance steering to effectively automate the optimization. A few techniques are explored to make the parallel runtime system and applications more adaptive and controllable. Control points are defined for applications to interact with PICS. Decision-tree-based automatic performance analysis is implemented to significantly reduce the search space of multiple configurations. Parallel evaluation and sampling techniques are exploited to reduce the overhead of the system and to improve its scalability. In addition, the result of automatic performance analysis can be visualized to help developers manually tune their applications. The utility of PICS is demonstrated with several benchmarks and real-world applications.