The HPCG benchmark: analysis, shared memory preliminary improvements and evaluation on an Arm-based platform
The High-Performance Conjugate Gradient (HPCG) benchmark complements the LINPACK benchmark in the performance evaluation coverage of large High-Performance Computing (HPC) systems. Due to its lower arithmetic intensity and higher memory pressure, HPCG is recognized as a more representative benchmark for data-center workloads and workloads with irregular memory access patterns, and its popularity and acceptance within the HPC community are therefore rising. Since only a small fraction of the reference version of the HPCG benchmark is parallelized with shared-memory techniques (OpenMP), in this report we introduce two OpenMP parallelization methods. Given the increasing importance of the Arm architecture in the HPC landscape, we evaluate our HPCG code at scale on a state-of-the-art HPC system based on the Cavium ThunderX2 SoC. We consider our work a contribution to the Arm ecosystem: along with this technical report, we plan to release our code to help the Arm community tune the HPCG benchmark.
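The report's two OpenMP methods are not reproduced in this listing; as a rough illustration of the kind of shared-memory parallelism involved, here is a hypothetical sketch (not the authors' code) of an OpenMP-parallelized CSR sparse matrix-vector product, the memory-bound kernel pattern at the heart of HPCG:

    #include <stddef.h>

    /* Hypothetical illustration (not the report's code): OpenMP
     * parallelization of a CSR sparse matrix-vector product y = A*x,
     * the kind of memory-bound kernel that dominates HPCG. */
    void spmv_csr(size_t nrows, const size_t *row_ptr,
                  const int *col_idx, const double *vals,
                  const double *x, double *y)
    {
        /* Each output row is independent, so rows can be distributed
         * across threads; a static schedule suits the roughly uniform
         * row lengths of HPCG's 27-point stencil. */
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < nrows; ++i) {
            double sum = 0.0;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += vals[k] * x[col_idx[k]];
            y[i] = sum;
        }
    }

SpMV parallelizes row by row without conflicts; HPCG's symmetric Gauss-Seidel smoother is harder, since its row-order dependencies rule out such a direct parallel loop, which is what makes dedicated parallelization methods necessary.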
Main memory in HPC: do we need more, or could we live with less?
An important aspect of High-Performance Computing (HPC) system design is the choice of main memory capacity. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional Dual In-line Memory Modules (DIMMs), 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore, the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now.
This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High-Performance Conjugate Gradients (HPCG) benchmark could be an important success story for 3D-stacked memories in HPC, but High-Performance Linpack (HPL) is likely to be constrained by 3D memory capacity. The study also emphasizes that the analysis of memory footprints of production HPC applications is complex and that it requires an understanding of application scalability and target category, i.e., whether the users target capability or capacity computing. The results show that most of the HPC applications under study have per-core memory footprints in the range of hundreds of megabytes, but we also detect applications and use cases that require gigabytes per core. Overall, the study identifies the HPC applications and use cases with memory footprints that could be provided by 3D-stacked memory chiplets, making a first step toward adoption of this novel technology in the HPC domain.

This work was supported by the Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC, by the Spanish Government through the Severo Ochoa programme (SEV-2015-0493), by the Spanish Ministry of Science and Technology through the TIN2015-65316-P project, and by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). This work has also received funding from the European Union's Horizon 2020 research and innovation programme under the ExaNoDe project (grant agreement No 671578). Darko Zivanovic holds the Severo Ochoa grant (SVP-2014-068501) of the Ministry of Economy and Competitiveness of Spain. The authors thank Harald Servat from BSC and Vladimir Marjanović from the High Performance Computing Center Stuttgart for their technical support.
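A back-of-the-envelope calculation shows why HPL is capacity-hungry: it factorizes a dense N-by-N double-precision matrix, and its efficiency improves with N, so N is normally chosen to fill available memory. The sketch below uses made-up machine parameters purely for illustration:

    #include <stdio.h>

    /* Back-of-the-envelope sketch (not from the paper): HPL's working
     * set is a dense N-by-N matrix of doubles, roughly 8*N*N bytes
     * spread across all cores. All numbers below are made up. */
    int main(void)
    {
        double n = 5.0e6;          /* hypothetical matrix dimension */
        double cores = 100000.0;   /* hypothetical core count       */
        double bytes = 8.0 * n * n;
        printf("HPL footprint: %.1f TB total, %.2f GB per core\n",
               bytes / 1e12, bytes / cores / 1e9);
        return 0;
    }

HPCG, by contrast, reaches its memory-bandwidth-bound performance on a much smaller working set, which is consistent with the study's finding that it suits the lower capacities of 3D-stacked memory.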
Random circuit block-encoded matrix and a proposal of quantum LINPACK benchmark
The LINPACK benchmark reports the performance of a computer for solving a system of linear equations with dense random matrices. Although this task was not designed with a real application directly in mind, the LINPACK benchmark has been used to define the list of TOP500 supercomputers since the debut of the list in 1993. We propose that a similar benchmark, called the quantum LINPACK benchmark, could be used to measure the whole-machine performance of quantum computers. Success on the quantum LINPACK benchmark should be viewed as a minimal requirement for a quantum computer to perform a useful task of solving linear algebra problems, such as linear systems of equations. We propose an input model called the RAndom Circuit Block-Encoded Matrix (RACBEM), which is a proper generalization of a dense random matrix in the quantum setting. The RACBEM model can be implemented efficiently on a quantum computer, and can be designed to adapt optimally to any given quantum architecture, relying only on a black-box quantum compiler. Besides solving linear systems, the RACBEM model can be used to perform a variety of linear algebra tasks relevant to many physical applications, such as computing spectral measures, time series generated by a Hamiltonian simulation, and thermal averages of the energy. We implement these linear algebra operations on IBM Q quantum devices as well as quantum virtual machines, and demonstrate their performance in solving scientific computing problems.
Quantum Monte Carlo for large chemical systems: Implementing efficient strategies for petascale platforms and beyond
Various strategies for efficiently implementing QMC simulations of large chemical systems are presented. These include: (i) an efficient algorithm for calculating the computationally expensive Slater matrices, a novel scheme based on the highly localized character of atomic Gaussian basis functions (rather than the molecular orbitals, as is usually done); (ii) the possibility of keeping the memory footprint minimal; (iii) a significant enhancement of single-core performance when efficient optimization tools are employed; and (iv) a universal, dynamic, fault-tolerant, and load-balanced computational framework adapted to all kinds of computational platforms (massively parallel machines, clusters, or distributed grids). These strategies have been implemented in the QMC=Chem code developed at Toulouse and are illustrated with numerical applications on small peptides of increasing size (158, 434, 1056, and 1731 electrons). Using 10k-80k computing cores of the Curie machine (GENCI-TGCC-CEA, France), QMC=Chem has been shown to be capable of running at the petascale level, demonstrating that a large part of this machine's peak performance can be achieved. Implementing large-scale QMC simulations on future exascale platforms with a comparable level of efficiency is expected to be feasible.
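Point (i) rests on the locality of atomic Gaussian basis functions: since exp(-alpha*r^2) decays rapidly, a basis function can be truncated to zero beyond a cutoff radius, making the Slater matrices effectively sparse. A hypothetical sketch (not QMC=Chem code) of such a truncated evaluation:

    #include <math.h>

    /* Hypothetical sketch (not QMC=Chem code): evaluating an s-type
     * atomic Gaussian basis function with a hard cutoff. Because
     * exp(-alpha*r^2) decays fast, the value is treated as exactly
     * zero beyond rcut, so distant electron/basis-function pairs
     * contribute nothing and most Slater-matrix entries are skipped. */
    double gaussian_s(double alpha, double rcut2,
                      const double e[3], const double c[3])
    {
        double dx = e[0] - c[0], dy = e[1] - c[1], dz = e[2] - c[2];
        double r2 = dx * dx + dy * dy + dz * dz;
        if (r2 > rcut2)        /* outside cutoff: skip the exponential */
            return 0.0;
        return exp(-alpha * r2);
    }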
Energy Efficiency of Personal Computers: A Comparative Analysis
The demand for electricity related to Information and Communications Technologies is constantly growing and contributes significantly to the increase in global greenhouse gas emissions. To curb this harmful growth, the problem must be addressed from different perspectives. Among these is changing the computing scale, for example by migrating algorithms and processes, where possible, to the most energy-efficient resources. In this context, this paper explores the possibility of running scientific and engineering programs on personal computers and compares the power efficiency obtained on these systems with that of mainframe computers and even supercomputers. Anecdotally, this paper also shows that the power efficiency obtained for the same workloads on personal computers is similar to that obtained on supercomputers included in the Green500 ranking.
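The power efficiency compared here is, in the Green500 sense, sustained performance per watt (GFLOPS/W). A trivial sketch of the metric, with made-up numbers:

    #include <stdio.h>

    /* Minimal sketch of the Green500-style metric: sustained
     * floating-point performance divided by average power draw.
     * All numbers below are made up for illustration. */
    int main(void)
    {
        double gflops = 250.0;  /* hypothetical sustained GFLOP/s on a PC */
        double watts  = 65.0;   /* hypothetical average power draw in W   */
        printf("Power efficiency: %.2f GFLOPS/W\n", gflops / watts);
        return 0;
    }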