54 research outputs found

    A massively parallel semi-Lagrangian solver for the six-dimensional Vlasov-Poisson equation

    Full text link
    This paper presents an optimized and scalable semi-Lagrangian solver for the Vlasov-Poisson system in six-dimensional phase space. Grid-based solvers of the Vlasov equation are known to give accurate results. At the same time, these solvers are challenged by the curse of dimensionality resulting in very high memory requirements, and moreover, requiring highly efficient parallelization schemes. In this paper, we consider the 6d Vlasov-Poisson problem discretized by a split-step semi-Lagrangian scheme, using successive 1d interpolations on 1d stripes of the 6d domain. Two parallelization paradigms are compared, a remapping scheme and a classical domain decomposition approach applied to the full 6d problem. From numerical experiments, the latter approach is found to be superior in the massively parallel case in various respects. We address the challenge of artificial time step restrictions due to the decomposition of the domain by introducing a blocked one-sided communication scheme for the purely electrostatic case and a rotating mesh for the case with a constant magnetic field. In addition, we propose a pipelining scheme that enables to hide the costs for the halo communication between neighbor processes efficiently behind useful computation. Parallel scalability on up to 65k processes is demonstrated for benchmark problems on a supercomputer

    Adaptive Parallelism for Coupled, Multithreaded Message-Passing Programs

    Get PDF
    Hybrid parallel programming models that combine message passing (MP) and shared- memory multithreading (MT) are becoming more popular, especially with applications requiring higher degrees of parallelism and scalability. Consequently, coupled parallel programs, those built via the integration of independently developed and optimized software libraries linked into a single application, increasingly comprise message-passing libraries with differing preferred degrees of threading, resulting in thread-level heterogeneity. Retroactively matching threading levels between independently developed and maintained libraries is difficult, and the challenge is exacerbated because contemporary middleware services provide only static scheduling policies over entire program executions, necessitating suboptimal, over-subscribed or under-subscribed, configurations. In coupled applications, a poorly configured component can lead to overall poor application performance, suboptimal resource utilization, and increased time-to-solution. So it is critical that each library executes in a manner consistent with its design and tuning for a particular system architecture and workload. Therefore, there is a need for techniques that address dynamic, conflicting configurations in coupled multithreaded message-passing (MT-MP) programs. Our thesis is that we can achieve significant performance improvements over static under-subscribed approaches through reconfigurable execution environments that consider compute phase parallelization strategies along with both hardware and software characteristics. In this work, we present new ways to structure, execute, and analyze coupled MT- MP programs. Our study begins with an examination of contemporary approaches used to accommodate thread-level heterogeneity in coupled MT-MP programs. Here we identify potential inefficiencies in how these programs are structured and executed in the high-performance computing domain. We then present and evaluate a novel approach for accommodating thread-level heterogeneity. Our approach enables full utilization of all available compute resources throughout an application’s execution by providing programmable facilities with modest overheads to dynamically reconfigure runtime environments for compute phases with differing threading factors and affinities. Our performance results show that for a majority of the tested scientific workloads our approach and corresponding open-source reference implementation render speedups greater than 50 % over the static under-subscribed baseline. Motivated by our examination of reconfigurable execution environments and their memory overhead, we also study the memory attribution problem: the inability to predict or evaluate during runtime where the available memory is used across the software stack comprising the application, reusable software libraries, and supporting runtime infrastructure. Specifically, dynamic adaptation requires runtime intervention, which by its nature introduces additional runtime and memory overhead. To better understand the latter, we propose and evaluate a new way to quantify component-level memory usage from unmodified binaries dynamically linked to a message-passing communication library. Our experimental results show that our approach and corresponding implementation accurately measure memory resource usage as a function of time, scale, communication workload, and software or hardware system architecture, clearly distinguishing between application and communication library usage at a per-process level

    Support for adaptivity in ARMCI using migratable objects

    Full text link
    Many new paradigms of parallel programming have emerged that compete with and complement the standard and well-established MPI model. Most notable, and suc-cessful, among these are models that support some form of global address space. At the same time, approaches based on migratable objects (also called virtualized processes) have shown that resource management concerns can be sep-arated effectively from the overall parallel programming ef-fort. For example, Charm++ supports dynamic load bal-ancing via an intelligent adaptive run-time system. It is also becoming clear that a multi-paradigm approach that allows modules written in one or more paradigms to coexist and co-operate will be necessary to tame the parallel pro-gramming challenge. ARMCI is a remote memory copy library that serves as a foundation of many global address space languages and libraries. This paper presents our preliminary work on inte-grating and supporting ARMCI with the adaptive run-time system of Charm++ as a part of our overall effort in the multi-paradigm approach.

    ASCR/HEP Exascale Requirements Review Report

    Full text link
    This draft report summarizes and details the findings, results, and recommendations derived from the ASCR/HEP Exascale Requirements Review meeting held in June, 2015. The main conclusions are as follows. 1) Larger, more capable computing and data facilities are needed to support HEP science goals in all three frontiers: Energy, Intensity, and Cosmic. The expected scale of the demand at the 2025 timescale is at least two orders of magnitude -- and in some cases greater -- than that available currently. 2) The growth rate of data produced by simulations is overwhelming the current ability, of both facilities and researchers, to store and analyze it. Additional resources and new techniques for data analysis are urgently needed. 3) Data rates and volumes from HEP experimental facilities are also straining the ability to store and analyze large and complex data volumes. Appropriately configured leadership-class facilities can play a transformational role in enabling scientific discovery from these datasets. 4) A close integration of HPC simulation and data analysis will aid greatly in interpreting results from HEP experiments. Such an integration will minimize data movement and facilitate interdependent workflows. 5) Long-range planning between HEP and ASCR will be required to meet HEP's research needs. To best use ASCR HPC resources the experimental HEP program needs a) an established long-term plan for access to ASCR computational and data resources, b) an ability to map workflows onto HPC resources, c) the ability for ASCR facilities to accommodate workflows run by collaborations that can have thousands of individual members, d) to transition codes to the next-generation HPC platforms that will be available at ASCR facilities, e) to build up and train a workforce capable of developing and using simulations and analysis to support HEP scientific research on next-generation systems.Comment: 77 pages, 13 Figures; draft report, subject to further revisio

    A RECONFIGURABLE AND EXTENSIBLE EXPLORATION PLATFORM FOR FUTURE HETEROGENEOUS SYSTEMS

    Get PDF
    Accelerator-based -or heterogeneous- computing has become increasingly important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While most solutions use sometimes custom-made components, most of today’s systems rely on commodity highend CPUs and/or GPU devices, which deliver adequate performance while ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware is affected by inherently limited power-efficiency, that is, low GFLOPS-per-Watt, now considered as a primary metric. The many-core model and architectural customization can play here a key role, as they enable unprecedented levels of power-efficiency compared to CPUs/GPUs. However, such paradigms are still immature and deeper exploration is indispensable. This dissertation investigates customizability and proposes novel solutions for heterogeneous architectures, focusing on mechanisms related to coherence and network-on-chip (NoC). First, the work presents a non-coherent scratchpad memory with a configurable bank remapping system to reduce bank conflicts. The experimental results show the benefits of both using a customizable hardware bank remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed synchronization master better suits many-cores than standard centralized solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory transactions. The results collected for different NoC sizes provided indications about the area overheads incurred by our solution and demonstrated the benefits of using a dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based on the sparse directory approach, with a selective coherence maintenance system which allows coherence to be deactivated for blocks that do not require it. Experimental results show that the use of a hybrid coherent and non-coherent architectural mechanism along with an extended coherence protocol can enhance performance. The above results were all collected by means of a modular and customizable heterogeneous many-core system developed to support the exploration of power-efficient high-performance computing architectures. The system is based on a NoC and a customizable GPU-like accelerator core, as well as a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which comes along with the above methodological results as part of the contribution in this dissertation. In fact, as a key benefit, the experimental platform enables users to integrate novel hardware/software solutions on a full-system scale, whereas existing platforms do not always support a comprehensive heterogeneous architecture exploration

    3rd Many-core Applications Research Community (MARC) Symposium. (KIT Scientific Reports ; 7598)

    Get PDF
    This manuscript includes recent scientific work regarding the Intel Single Chip Cloud computer and describes approaches for novel approaches for programming and run-time organization

    ISCR Annual Report: Fical Year 2004

    Full text link

    Domain engineering and generic programming for parallel scientific computing

    Get PDF
    Die Entwicklung von Software für wissenschaftliche Anwendungen, die auf dynamischen oder irregulären Gittern beruhen, ist mit vielen Problemen verbunden, da hier so unterschiedliche Ziele wie hohe Leistung und Flexibilität miteinander vereinbart werden müssen. Die vorliegende Dissertation geht diese Probleme folgendermaßen an: Zunächst werden die Ideen des domain engineering auf das Gebiet daten-paralleler Anwendungen angewandt, um wiederverwendbare Softwareprodukte zu entwerfen, deren Benutzung die Entwicklung konkreter Softwaresysteme auf diesem Gebiet beschleunigt. Hierbei wird eine umfassende Analyse datenparalleler Anwendungen durchgeführt und es werden allgemeine Anforderungen an zu entwickelnde Komponenten formuliert. In einem zweiten Schritt wird auf der Grundlage der gewonnen Kenntnisse und unter Benutzung der Ideen des generischen Programmierens die Janus Softwarearchitektur entworfen und implementiert. Das sich daraus ergebende konzeptionelle Gerüst und die C++-template Bibliothek Janus stellt eine flexible und erweiterbare Sammlung effizienter Datenstrukturen und Algorithmen für eine umfassende Klasse datenparalleler Anwendungen dar. Insbesondere werden finite Differenz- und finite Elementverfahren sowie datenparallele Graphalgorithmen unterstützt. Ein herausragender Vorteil einer generischen C++ Bibliothek wie Janus ist, dass ihre anwendungsorientierten Abstraktionen eine hohe Leistung liefern und dabei weder von Spracherweiterungen noch von nicht allgemein verfügbaren Kompilationstechniken abhängen. Die Benutzung von C++-Templates bei der Implementierung von Janus macht es sehr einfach, nutzerdefinierte Datentypen in die Komponenten von Janus zu integrieren, ohne dass dabei die Effizienz leidet. Ein weiterer Vorteil von Janus ist, dass es sehr einfach mit bereits existierenden Softwarepaketen kooperieren kann. Diese Dissertation beschreibt eine portable Implementierung von Janus für Architekturen mit verteiltem Speicher, die auf der standardisierten Kommunikationsbibliothek MPI beruht. Die Ausdruckskraft von Janus wird an Hand der Implementierung typischer Anwendungen aus dem Bereich des datenparallelen wissenschaftlichen Rechnens nachgewiesen. Die Leistungsfähigkeit der Komponenten von Janus wird bewertet, indem Janus-Applikationen mit vergleichbaren Implementierungen, die auf anderen Ansätzen beruhen, verglichen werden. Die Untersuchungen zur Skalierbarkeit von Janus-Applikationen auf einem Linux Clustersystem zeigen, dass Janus auch in dieser Hinsicht hohe Anforderungen erfüllt
    • …
    corecore