Parallel 3D Sweep Kernel with PARSEC
High-fidelity nuclear power plant core simulations require solving the Boltzmann transport equation. In discrete ordinates methods, the most computationally demanding operation is the sweep. Considering the evolution of computer architectures, we propose in this paper, as a first step toward heterogeneous distributed architectures, a hybrid parallel implementation of the sweep operation on top of the generic task-based runtime system PaRSEC. This implementation targets three nested levels of parallelism: message passing, multi-threading, and vectorization. The proposed parallel sweep achieves a sustained performance of 6.1 Tflop/s, corresponding to 33.9% of the peak performance of the targeted supercomputer. It compares favorably with state-of-the-art solvers such as PARTISN, and it can therefore serve as a building block for a massively parallel version of the neutron transport solver DOMINO developed at EDF.
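The sweep's parallelism comes from a wavefront dependency structure: for a given angular direction, a cell can be processed only after its upstream neighbors. A minimal sketch of that ordering (illustrative only, not the DOMINO/PaRSEC implementation) for one octant with positive direction cosines:

```python
# Diagonal-wavefront ordering that exposes the sweep's parallelism: for an
# octant with positive direction cosines, cell (i, j, k) depends on
# (i-1, j, k), (i, j-1, k) and (i, j, k-1), so all cells sharing the same
# i + j + k value are mutually independent and can run concurrently.
def wavefronts(nx, ny, nz):
    levels = [[] for _ in range(nx + ny + nz - 2)]
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                levels[i + j + k].append((i, j, k))
    return levels

fronts = wavefronts(4, 4, 4)
# Each inner list is an independent task set that a task-based runtime such
# as PaRSEC can schedule in parallel; successive lists must run in order.
```

A task-based runtime expresses exactly these dependencies as a task graph, which also lets it overlap the message passing between distributed subdomains with computation.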
Efficient Parallel Solution of the 3D Stationary Boltzmann Transport Equation for Diffusive Problems
This paper presents an efficient parallel method for the deterministic solution of the 3D stationary Boltzmann transport equation applied to diffusive problems such as nuclear core criticality computations. Based on standard MultiGroup-Sn-DD discretization schemes, our approach combines a highly efficient nested parallelization strategy [1] with the PDSA parallel acceleration technique [2], applied for the first time to 3D transport problems. These two key ingredients enable us to solve extremely large neutronic problems involving up to 10^12 degrees of freedom in less than an hour using 64 supercomputer nodes.
Scalable electronic structure methods to solve the Kohn-Sham equation
From a single hydrogen atom to proteins of hundreds of thousands of kilodaltons, scientists can use the electronic structure of interacting atoms to predict material properties. Knowing material properties through solving the electronic structure problem would allow for the controlled prediction and corresponding design of materials. The Kohn-Sham equations, based on density functional theory, transform a many-body problem impossible to solve for anything but the smallest molecules into a practical one that can be used to predict material properties. Although Kohn-Sham density functional theory (KS-DFT) scales as the cube of the number of electrons in the system, there are additional well-documented approximations, such as the pseudopotential method, to further reduce the number of electrons.
The incoming exascale era will lead to unavoidable challenges in solving the Kohn-Sham equations, including communication and hardware considerations. Old paradigms, epitomized by repeated series of globally forced synchronization points, will give way to new breeds of algorithms that maximize scaling performance while maintaining portability.
This thesis focuses on the solution of Kohn-Sham DFT in real space at scale. Key to this effort is a parallel treatment of the numerical elements of the Rayleigh-Ritz method. At minimum, the Rayleigh-Ritz projection requires a number of distributed matrix-vector operations equal to the number of electrons solved for in a system, and on the order of half that number squared of dot products. The memory cost of such an algorithm also grows very large very quickly, making explicit, intelligent memory management essential. I demonstrate the computational requirements of the various steps in solving the electronic structure problem for both large and small molecular systems. This thesis also discusses opportunities in real-space Kohn-Sham DFT to further utilize floating-point-optimized hardware with higher-order stencils.
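The cost structure described above can be seen in a minimal dense sketch of a Rayleigh-Ritz step (illustrative NumPy code, not the thesis implementation): for N trial vectors it performs N matrix-vector products to form H·V and on the order of N²/2 independent dot products for the symmetric projected matrix.

```python
import numpy as np

# Minimal dense Rayleigh-Ritz sketch (hypothetical example, not the thesis
# code): project the Hamiltonian H onto a trial subspace V, solve the small
# projected eigenproblem, and rotate the basis onto the Ritz vectors.
def rayleigh_ritz(H, V):
    HV = H @ V               # one matvec per column (per electron state)
    S = V.T @ HV             # ~N^2/2 independent dot products (S is symmetric)
    w, U = np.linalg.eigh(S) # small N x N eigenproblem, eigenvalues ascending
    return w, V @ U          # Ritz values and Ritz vectors

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
H = (A + A.T) / 2                                  # symmetric "Hamiltonian"
Q, _ = np.linalg.qr(rng.standard_normal((50, 8)))  # orthonormal trial subspace
w, X = rayleigh_ritz(H, Q)
```

In a distributed real-space code, H @ V is applied matrix-free via the stencil, and both the matvecs and the dot products become distributed reductions, which is why their parallel treatment dominates the design.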
Evaluation of Distributed Programming Models and Extensions to Task-based Runtime Systems
High Performance Computing (HPC) has always been a key foundation for scientific simulation and discovery. More recently, the training of deep learning models has further accelerated the demand for computational power and lower-precision arithmetic. In this era following the end of Dennard scaling, when Moore's Law seemingly still holds true to a lesser extent, it is no coincidence that HPC systems are equipped with multi-core CPUs and a variety of hardware accelerators that are all massively parallel. Coupled with interconnect speed improvements lagging behind those of computational power, the current state of HPC systems is heterogeneous and extremely complex.
This was heralded as a great challenge to software stacks and their ability to extract performance from these systems, but also as a great opportunity to innovate at the programming model level, exploring different approaches and proposing new solutions. With usability, portability, and performance as the main factors to consider, this dissertation first evaluates the ability of some widely used parallel programming models (MPI, MPI+OpenMP, and task-based runtime systems) to manage the load imbalance among the processes computing the LU factorization of a large dense matrix stored in the Block Low-Rank (BLR) format.
Next, I proposed a number of optimizations and implemented them in PaRSEC's Dynamic Task Discovery (DTD) model, including user-level graph trimming and direct Application Programming Interface (API) calls to perform the data broadcast operation, further extending the limits of the Sequential Task Flow (STF) model. The Parameterized Task Graph (PTG) approach in PaRSEC, on the other hand, is the most scalable approach for many applications; I therefore explored the possibility of combining the algorithmic benefits of Communication-Avoiding (CA) methods with the communication-computation overlap provided by runtime systems, using a 2D five-point stencil as the test case. This broad evaluation and extension of programming models highlighted the ability of task-based runtime systems to achieve scalable performance and portability on contemporary heterogeneous HPC systems. Finally, I summarized the profiling capability of the PaRSEC runtime system and demonstrated with a use case its important role in identifying performance bottlenecks that lead to optimizations.
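For reference, the 2D five-point stencil used as the test case is the classic nearest-neighbor update; the serial sketch below (illustrative only, not the dissertation's distributed code) shows the computation whose halo exchanges a task-based runtime can overlap with work on interior cells.

```python
# One Jacobi-style sweep of the 2D five-point stencil: each interior point
# becomes the average of its four neighbors; boundary values are held fixed.
# In a distributed run, the first/last rows and columns of each subdomain
# form the halo that must be exchanged with neighboring ranks.
def five_point_step(u):
    n, m = len(u), len(u[0])
    v = [row[:] for row in u]          # copy so the update uses old values only
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            v[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1])
    return v
```

Because each sweep touches only nearest neighbors, a runtime can schedule interior-cell tasks while halo messages for the next sweep are still in flight, which is the communication-computation overlap the text refers to.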
Design trade-offs for emerging HPC processors based on mobile market technology
This is a post-peer-review, pre-copyedit version of an article published in The Journal of Supercomputing. The final authenticated version is available online at: http://dx.doi.org/10.1007/s11227-019-02819-4
High-performance computing (HPC) is at the crossroads of a potential transition toward mobile market processor technology. Unlike in prior transitions, numerous hardware vendors and integrators will have access to state-of-the-art processor designs due to Arm's licensing business model. This fact gives them greater flexibility to implement custom HPC-specific designs. In this paper, we undertake a study to quantify the different energy-performance trade-offs when architecting a processor based on mobile market technology. Through detailed simulations over a representative set of benchmarks, our results show that: (i) a modest amount of last-level cache per core is sufficient, leading to significant power and area savings; (ii) in-order cores offer favorable trade-offs when compared to out-of-order cores for a wide range of benchmarks; and (iii) heterogeneous configurations help to improve processor performance and energy efficiency.
DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks
Data movement between the CPU and main memory is a first-order obstacle
against improving performance, scalability, and energy efficiency in modern
systems. Computer systems employ a range of techniques to reduce overheads tied
to data movement, spanning from traditional mechanisms (e.g., deep multi-level
cache hierarchies, aggressive hardware prefetchers) to emerging techniques such
as Near-Data Processing (NDP), where some computation is moved close to memory.
Our goal is to methodically identify potential sources of data movement over a
broad set of applications and to comprehensively compare traditional
compute-centric data movement mitigation techniques to more memory-centric
techniques, thereby developing a rigorous understanding of the best techniques
to mitigate each source of data movement.
With this goal in mind, we perform the first large-scale characterization of
a wide variety of applications, across a wide range of application domains, to
identify fundamental program properties that lead to data movement to/from main
memory. We develop the first systematic methodology to classify applications
based on the sources contributing to data movement bottlenecks. From our
large-scale characterization of 77K functions across 345 applications, we
select 144 functions to form the first open-source benchmark suite (DAMOV) for
main memory data movement studies. We select a diverse range of functions that
(1) represent different types of data movement bottlenecks, and (2) come from a
wide range of application domains. Using NDP as a case study, we identify new
insights about the different data movement bottlenecks and use these insights
to determine the most suitable data movement mitigation mechanism for a
particular application. We open-source DAMOV and the complete source code for
our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.
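A basic ingredient of any such compute-centric vs. memory-centric comparison is arithmetic intensity. The toy classifier below (a hypothetical roofline-style sketch; DAMOV's actual methodology is far richer, combining multiple metrics) shows the idea of labeling a function by its flops per byte moved relative to the machine balance point.

```python
# Toy roofline-style classifier (hypothetical values, not DAMOV's method):
# a function whose arithmetic intensity (flops per byte moved to/from main
# memory) falls below the machine balance point is limited by memory
# bandwidth; above it, by compute throughput.
def classify(flops, bytes_moved, peak_flops=1e12, mem_bw=1e11):
    intensity = flops / bytes_moved
    balance = peak_flops / mem_bw  # flops/byte where the roofline bends (10 here)
    return "compute-bound" if intensity >= balance else "memory-bound"

classify(1e9, 1e9)   # 1 flop/byte, below the balance of 10 -> "memory-bound"
```

Memory-bound functions in this sense are the natural candidates for Near-Data Processing, since moving the computation to memory removes the bandwidth bottleneck that dominates their runtime.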
Hydrodynamic Evolution of Sgr A East: The Imprint of A Supernova Remnant in the Galactic Center
We perform three-dimensional numerical simulations to study the hydrodynamic
evolution of Sgr A East, the only known supernova remnant (SNR) in the center
of our Galaxy, to infer its debated progenitor SN type and its potential impact
on the Galactic center environment. Three sets of simulations are performed,
each of which represents a certain type of SN explosion (SN Iax, SN
Ia or core-collapse SN) expanding against a nuclear outflow of hot gas driven
by massive stars, whose thermodynamical properties have been well established
by previous work and fixed in the simulations. All three simulations can
simultaneously roughly reproduce the extent of Sgr A East and the position and
morphology of an arc-shaped thermal X-ray feature, known as the "ridge".
Confirming previous work, our simulations show that the ridge is the
manifestation of a strong collision between the expanding SN ejecta and the
nuclear outflow. The simulation of the core-collapse SN, with an assumed
explosion energy of 5x10^50 erg and an ejecta mass of 10 M_sun, can well match
the X-ray flux of the ridge, whereas the simulations of the SN Iax and SN Ia
explosions underpredict its X-ray emission, due to a smaller ejecta mass. All
three simulations constrain the age of Sgr A East to be <1500 yr and predict
that the ridge should fade out over the next few hundred years. We address the
implications of these results for our understanding of the Galactic center
environment. (21 pages, 18 figures. Accepted for publication in MNRAS.)