    Parallel 3D Sweep Kernel with PARSEC

    High-fidelity nuclear power plant core simulations require solving the Boltzmann transport equation. In discrete ordinates methods, the most computationally demanding part of solving this equation is the sweep operation. Considering the evolution of computer architectures, we propose in this paper, as a first step toward heterogeneous distributed architectures, a hybrid parallel implementation of the sweep operation on top of the generic task-based runtime system PaRSEC. Such an implementation targets three nested levels of parallelism: message passing, multi-threading, and vectorization. The proposed parallel implementation of the sweep achieves a sustained performance of 6.1 Tflop/s, corresponding to 33.9% of the peak performance of the targeted supercomputer. This implementation compares favorably with state-of-the-art solvers such as PARTISN, and it can therefore serve as a building block for a massively parallel version of the neutron transport solver DOMINO developed at EDF.
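    The nested parallelism described above exploits the sweep's wavefront structure: for a fixed angular direction, cells on the same diagonal plane of the mesh have no mutual dependencies, so they can be processed concurrently while the planes themselves are traversed in order. The sketch below illustrates that structure for a single octant; it is not the DOMINO/PaRSEC implementation, and the mesh layout, the simplified cell update, and all names are assumptions made only for illustration.

    // Illustrative single-octant discrete-ordinates sweep over a 3D Cartesian mesh
    // (not the DOMINO/PaRSEC code). Cells with the same i+j+k form a wavefront whose
    // cells are mutually independent, so they can be threaded; the loop over angles
    // is a natural vectorization target. The cell update is a deliberately
    // simplified stand-in for a real diamond-difference balance equation.
    #include <vector>

    void sweep_octant(int nx, int ny, int nz, int nang,
                      const std::vector<double>& sigma_t,  // total cross section per cell
                      const std::vector<double>& source,   // emission + scattering source per cell
                      std::vector<double>& psi)            // angular flux, [cell][angle]
    {
        auto cell = [=](int i, int j, int k) { return (k * ny + j) * nx + i; };

        for (int d = 0; d <= nx + ny + nz - 3; ++d) {      // wavefronts, traversed in order
            #pragma omp parallel for collapse(2) schedule(dynamic)
            for (int k = 0; k < nz; ++k) {                 // cells of one wavefront, threaded
                for (int j = 0; j < ny; ++j) {
                    int i = d - k - j;
                    if (i < 0 || i >= nx) continue;        // not on this wavefront
                    int c = cell(i, j, k);
                    #pragma omp simd                       // angles, vectorized
                    for (int a = 0; a < nang; ++a) {
                        double up_x = (i > 0) ? psi[cell(i - 1, j, k) * nang + a] : 0.0;
                        double up_y = (j > 0) ? psi[cell(i, j - 1, k) * nang + a] : 0.0;
                        double up_z = (k > 0) ? psi[cell(i, j, k - 1) * nang + a] : 0.0;
                        psi[c * nang + a] =
                            (source[c] + up_x + up_y + up_z) / (3.0 + sigma_t[c]);
                    }
                }
            }
        }
    }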

    Efficient Parallel Solution of the 3D Stationary Boltzmann Transport Equation for Diffusive Problems

    This paper presents an efficient parallel method for the deterministic solution of the 3D stationary Boltzmann transport equation applied to diffusive problems such as nuclear core criticality computations. Based on standard MultiGroup-Sn-DD discretization schemes, our approach combines a highly efficient nested parallelization strategy [1] with the PDSA parallel acceleration technique [2], applied for the first time to 3D transport problems. These two key ingredients enable us to solve extremely large neutronic problems involving up to 10^12 degrees of freedom in less than an hour using 64 supercomputer nodes.
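    For context, the transport sweep is the inner kernel of an outer source iteration, and schemes such as PDSA accelerate that outer loop. A minimal sketch of the outer loop follows; the callable building blocks, the acceleration hook, and the convergence test are placeholders and do not reproduce the PDSA algorithm of the paper.

    // Skeleton of the outer source iteration that transport sweeps plug into; the
    // "accelerate" hook stands in for a scheme such as DSA/PDSA and is not the
    // PDSA algorithm from the paper. All names and the tolerance are placeholders.
    #include <algorithm>
    #include <cmath>
    #include <functional>
    #include <vector>

    using Field = std::vector<double>;

    Field source_iteration(Field phi,                                        // initial scalar flux guess
                           const std::function<Field(const Field&)>& sweep,  // one full sweep, all angles
                           const std::function<Field(const Field&, const Field&)>& accelerate,
                           double tol, int max_iter)
    {
        for (int it = 0; it < max_iter; ++it) {
            Field phi_new = sweep(phi);              // build source from phi, sweep every octant/angle
            phi_new = accelerate(phi_new, phi);      // low-order correction (e.g. diffusion-based)

            double err = 0.0;                        // infinity-norm of the update
            for (std::size_t i = 0; i < phi.size(); ++i)
                err = std::max(err, std::fabs(phi_new[i] - phi[i]));
            phi = std::move(phi_new);
            if (err < tol) break;
        }
        return phi;
    }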

    Evaluation of Distributed Programming Models and Extensions to Task-based Runtime Systems

    High Performance Computing (HPC) has always been a key foundation for scientific simulation and discovery, and more recently the training of deep learning models has further accelerated the demand for computational power and lower-precision arithmetic. In this era following the end of Dennard scaling, and with Moore's Law seemingly holding true only to a lesser extent, it is not a coincidence that HPC systems are equipped with multi-core CPUs and a variety of hardware accelerators that are all massively parallel. Coupled with interconnect speeds improving more slowly than computational power, the current state of HPC systems is heterogeneous and extremely complex. This has been heralded as a great challenge for software stacks and their ability to extract performance from these systems, but also as a great opportunity to innovate at the programming model level, exploring different approaches and proposing new solutions. With usability, portability, and performance as the main factors to consider, this dissertation first evaluates the ability of some widely used parallel programming models (MPI, MPI+OpenMP, and task-based runtime systems) to manage the load imbalance among the processes computing the LU factorization of a large dense matrix stored in the Block Low-Rank (BLR) format. Next, I propose a number of optimizations and implement them in PaRSEC's Dynamic Task Discovery (DTD) model, including user-level graph trimming and direct Application Programming Interface (API) calls to perform the data broadcast operation, to further extend the limits of the Sequential Task Flow (STF) model. The Parameterized Task Graph (PTG) approach in PaRSEC, on the other hand, is the most scalable approach for many applications; I therefore explore the possibility of combining the algorithmic benefits of Communication-Avoiding (CA) schemes with the communication-computation overlap provided by runtime systems, using a 2D five-point stencil as the test case. This broad evaluation and extension of programming models highlights the ability of task-based runtime systems to achieve scalable performance and portability on contemporary heterogeneous HPC systems. Finally, I summarize the profiling capabilities of the PaRSEC runtime system and demonstrate, with a use case, their important role in identifying the performance bottlenecks that lead to optimizations.
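    For readers unfamiliar with the sequential-task-flow style that DTD implements: tasks are inserted in program order with their data accesses declared, and the runtime derives the dependency graph and executes independent tasks concurrently. The sketch below conveys the idea using OpenMP task dependences rather than PaRSEC's actual DTD API; the tiled wavefront kernel, tile sizes, and names are invented for illustration.

    // Sketch of the sequential-task-flow idea using OpenMP task dependences rather
    // than PaRSEC's DTD API: tasks are created in program order, their data accesses
    // are declared, and the runtime derives the graph and runs ready tasks in
    // parallel. The tiled "wavefront" update is a stand-in kernel; sizes are arbitrary.
    #include <cstdio>
    #include <vector>

    int main() {
        const int NT = 4, TS = 256;                       // 4x4 tiles of 256x256 doubles (arbitrary)
        const int LD = NT + 1;                            // extra row/column of zero halo tiles
        std::vector<double> a(LD * LD * TS * TS, 0.0);
        auto tile = [&](int ti, int tj) { return &a[(ti * LD + tj) * TS * TS]; };

        #pragma omp parallel
        #pragma omp single
        {
            for (int ti = 1; ti <= NT; ++ti) {
                for (int tj = 1; tj <= NT; ++tj) {
                    double* t  = tile(ti, tj);            // tile this task updates
                    double* up = tile(ti - 1, tj);        // upper neighbour (read only)
                    double* lf = tile(ti, tj - 1);        // left neighbour (read only)
                    #pragma omp task depend(inout: t[0:TS*TS]) depend(in: up[0:TS*TS], lf[0:TS*TS])
                    {
                        double acc = up[0] + lf[0] + 1.0;
                        for (int e = 0; e < TS * TS; ++e)
                            t[e] += acc;                  // placeholder tile kernel
                    }
                }
            }
        }                                                 // all tasks complete at the implicit barrier
        std::printf("tile(%d,%d)[0] = %g\n", NT, NT, tile(NT, NT)[0]);
        return 0;
    }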

    Design trade-offs for emerging HPC processors based on mobile market technology

    This is a post-peer-review, pre-copyedit version of an article published in The Journal of Supercomputing; the final authenticated version is available online at http://dx.doi.org/10.1007/s11227-019-02819-4. High-performance computing (HPC) is at the crossroads of a potential transition toward mobile market processor technology. Unlike in prior transitions, numerous hardware vendors and integrators will have access to state-of-the-art processor designs due to Arm's licensing business model. This fact gives them greater flexibility to implement custom HPC-specific designs. In this paper, we undertake a study to quantify the different energy-performance trade-offs when architecting a processor based on mobile market technology. Through detailed simulations over a representative set of benchmarks, our results show that: (i) a modest amount of last-level cache per core is sufficient, leading to significant power and area savings; (ii) in-order cores offer favorable trade-offs when compared to out-of-order cores for a wide range of benchmarks; and (iii) heterogeneous configurations help to improve processor performance and energy efficiency.
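    Trade-offs of this kind are typically compared with aggregate metrics such as total energy and the energy-delay product (EDP). The snippet below shows that bookkeeping on invented numbers; none of the configurations or figures come from the paper.

    // Comparing two hypothetical core configurations by runtime, energy, and
    // energy-delay product (EDP = energy * time). Every number here is invented
    // for illustration and is not a result from the paper.
    #include <cstdio>

    struct Config {
        const char* name;
        double runtime_s;    // benchmark execution time [s]
        double avg_power_w;  // average power during the run [W]
    };

    int main() {
        const Config configs[] = {
            {"in-order core, small LLC slice", 12.0, 1.5},      // hypothetical
            {"out-of-order core, larger LLC slice", 8.0, 3.0},  // hypothetical
        };
        for (const Config& c : configs) {
            double energy_j = c.avg_power_w * c.runtime_s;      // E = P * t
            double edp      = energy_j * c.runtime_s;           // EDP = E * t
            std::printf("%-38s time=%5.1f s  energy=%5.1f J  EDP=%6.1f J*s\n",
                        c.name, c.runtime_s, energy_j, edp);
        }
        return 0;
    }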

    DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

    Data movement between the CPU and main memory is a first-order obstacle to improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques to more memory-centric techniques, thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.
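    To give a flavor of what such a classification looks like in practice, the toy sketch below buckets a profiled function by arithmetic intensity, cache miss rate, and DRAM bandwidth utilization. The metrics are in the spirit of the paper, but the thresholds, categories, and sample numbers are invented here and are not DAMOV's actual decision rules.

    // Toy classifier in the spirit of a data-movement-bottleneck methodology:
    // given a few profiled metrics for a function, pick a coarse bucket.
    // Thresholds and categories are invented for illustration; they are NOT the
    // decision rules used by DAMOV.
    #include <cstdio>
    #include <string>

    struct FunctionProfile {
        std::string name;
        double arithmetic_intensity;  // flops per byte moved to/from DRAM
        double llc_mpki;              // last-level cache misses per kilo-instruction
        double dram_bw_utilization;   // fraction of peak DRAM bandwidth used
    };

    const char* classify(const FunctionProfile& p) {
        if (p.arithmetic_intensity > 4.0 && p.llc_mpki < 1.0)
            return "compute-bound (caches work well)";
        if (p.dram_bw_utilization > 0.7)
            return "DRAM-bandwidth-bound (candidate for near-data processing)";
        if (p.llc_mpki > 10.0)
            return "DRAM-latency-bound (irregular access pattern)";
        return "cache-friendly / mixed";
    }

    int main() {
        FunctionProfile samples[] = {
            {"dense_matmul_kernel", 16.0, 0.4, 0.30},   // made-up numbers
            {"stream_copy",          0.1, 25.0, 0.85},
            {"pointer_chase",        0.2, 40.0, 0.15},
        };
        for (const auto& s : samples)
            std::printf("%-20s -> %s\n", s.name.c_str(), classify(s));
        return 0;
    }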

    Hydrodynamic Evolution of Sgr A East: The Imprint of A Supernova Remnant in the Galactic Center

    We perform three-dimensional numerical simulations to study the hydrodynamic evolution of Sgr A East, the only known supernova remnant (SNR) in the center of our Galaxy, to infer its debated progenitor SN type and its potential impact on the Galactic center environment. Three sets of simulations are performed, each of which represents a certain type of SN explosion (SN Iax, SN Ia, or core-collapse SN) expanding against a nuclear outflow of hot gas driven by massive stars, whose thermodynamical properties have been well established by previous work and are fixed in the simulations. All three simulations can simultaneously roughly reproduce the extent of Sgr A East and the position and morphology of an arc-shaped thermal X-ray feature known as the "ridge". Confirming previous work, our simulations show that the ridge is the manifestation of a strong collision between the expanding SN ejecta and the nuclear outflow. The simulation of the core-collapse SN, with an assumed explosion energy of 5x10^50 erg and an ejecta mass of 10 M_sun, can well match the X-ray flux of the ridge, whereas the simulations of the SN Iax and SN Ia explosions underpredict its X-ray emission due to a smaller ejecta mass. All three simulations constrain the age of Sgr A East to be <1500 yr and predict that the ridge should fade out over the next few hundred years. We address the implications of these results for our understanding of the Galactic center environment. Comment: 21 pages, 18 figures; accepted for publication in MNRAS.
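    For a rough sense of the scales behind the age constraint, a Sedov-Taylor estimate relates remnant radius to explosion energy, ambient density, and age. The snippet below evaluates it using the explosion energy quoted above and a purely assumed uniform ambient density; it is only an order-of-magnitude check, not a substitute for the paper's 3D simulations against a nuclear outflow.

    // Order-of-magnitude Sedov-Taylor estimate of remnant radius versus age:
    //   R(t) ~ 1.15 * (E * t^2 / rho)^(1/5)
    // The explosion energy matches the value quoted in the abstract (5e50 erg);
    // the uniform ambient density is an assumption made only for this sketch.
    #include <cmath>
    #include <cstdio>

    int main() {
        const double E   = 5.0e50;            // explosion energy [erg]
        const double n_H = 10.0;              // ASSUMED ambient hydrogen density [cm^-3]
        const double m_p = 1.67e-24;          // proton mass [g]
        const double rho = 1.4 * n_H * m_p;   // mass density, ~1.4 m_p per H atom [g cm^-3]
        const double yr  = 3.156e7;           // seconds per year
        const double pc  = 3.086e18;          // centimeters per parsec

        const double ages_yr[] = {500.0, 1000.0, 1500.0};
        for (double age : ages_yr) {
            double t = age * yr;
            double R = 1.15 * std::pow(E * t * t / rho, 0.2);  // Sedov-Taylor radius [cm]
            std::printf("age = %6.0f yr  ->  R ~ %4.1f pc\n", age, R / pc);
        }
        return 0;
    }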