215 research outputs found
Energy-efficiency evaluation of Intel KNL for HPC workloads
Energy consumption is increasingly becoming a limiting factor to the design
of faster large-scale parallel systems, and development of energy-efficient and
energy-aware applications is today a relevant issue for HPC code-developer
communities. In this work we focus on energy performance of the Knights Landing
(KNL) Xeon Phi, the latest many-core architecture processor introduced by Intel
into the HPC market. We consider the 64-core Xeon Phi 7230, and
analyze its energy performance using both the on-chip MCDRAM and the regular
DDR4 system memory as main storage for the application data-domain. As a
benchmark application we use a Lattice Boltzmann code heavily optimized for
this architecture and implemented using different memory data layouts to store
its lattice. We then assess the energy consumption using different memory data layouts, kinds of memory (DDR4 or MCDRAM), and numbers of threads per core.
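To make the MCDRAM/DDR4 choice concrete, here is a minimal sketch, assuming a flat-mode KNL and the memkind library's hbwmalloc API; the lattice sizes and the D2Q37 population count are illustrative placeholders, not the paper's actual code:

```c
/* Sketch (not the paper's code): placing the lattice in MCDRAM or DDR4
 * on a flat-mode KNL via the memkind hbwmalloc API. Link with -lmemkind.
 * Sizes and the D2Q37 layout below are hypothetical. */
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* hbw_malloc / hbw_free from the memkind library */

int main(void)
{
    size_t nsites = 1024 * 1024;   /* hypothetical lattice size */
    size_t npop   = 37;            /* e.g., a D2Q37 Lattice Boltzmann model */
    size_t bytes  = nsites * npop * sizeof(double);

    double *lattice;
    int have_hbm = (hbw_check_available() == 0);
    if (have_hbm) {
        lattice = hbw_malloc(bytes);   /* data domain in on-chip MCDRAM */
        puts("lattice allocated in MCDRAM");
    } else {
        lattice = malloc(bytes);       /* regular DDR4 system memory */
        puts("lattice allocated in DDR4");
    }
    if (!lattice) return 1;

    /* ... run the LB kernel and read energy counters here ... */

    if (have_hbm) hbw_free(lattice);
    else          free(lattice);
    return 0;
}
```

In cache mode the same binary would use MCDRAM transparently; in flat mode the explicit allocation call is exactly the choice such a benchmark compares.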
The Use of MPI and OpenMP Technologies for Subsequence Similarity Search in Very Large Time Series on Computer Cluster System with Nodes Based on the Intel Xeon Phi Knights Landing Many-core Processor
Nowadays, subsequence similarity search is required in a wide range of time
series mining applications: climate modeling, financial forecasts, medical
research, etc. In most of these applications, the Dynamic Time Warping (DTW) similarity measure is used, since DTW is empirically confirmed as one of the best similarity measures for most subject domains. Since the DTW measure has a quadratic computational complexity w.r.t. the length of the query subsequence, a
number of parallel algorithms for various many-core architectures have been
developed, namely FPGA, GPU, and Intel MIC. In this article, we propose a new
parallel algorithm for subsequence similarity search in very large time series
on computer cluster systems with nodes based on Intel Xeon Phi Knights Landing
(KNL) many-core processors. Computations are parallelized on two levels as
follows: through MPI at the level of all cluster nodes, and through OpenMP
within one cluster node. The algorithm involves additional data structures and
redundant computations, which make it possible to effectively use the
capabilities of vector computations on Phi KNL. Experimental evaluation of the
algorithm on real-world and synthetic datasets shows that it is highly
scalable.
Comment: Accepted for publication in the "Numerical Methods and Programming" journal (http://num-meth.srcc.msu.ru/english/; in Russian: "Vychislitelnye Metody i Programmirovanie").
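A minimal sketch of the two-level scheme described above, assuming a toy data distribution and the plain O(m^2) DTW recurrence; the names, sizes, and rank-local partitioning are illustrative, not the authors' algorithm, which adds extra data structures to exploit vectorization:

```c
/* Two-level parallel subsequence search: MPI across cluster nodes,
 * OpenMP within one node. Compile: mpicc -fopenmp dtw.c -lm */
#include <float.h>
#include <math.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Quadratic DTW between query q and subsequence s, both of length m,
 * using a two-row dynamic-programming table. */
static double dtw(const double *q, const double *s, int m)
{
    double *prev = malloc((m + 1) * sizeof(double));
    double *curr = malloc((m + 1) * sizeof(double));
    for (int j = 0; j <= m; j++) prev[j] = DBL_MAX;
    prev[0] = 0.0;
    for (int i = 1; i <= m; i++) {
        curr[0] = DBL_MAX;
        for (int j = 1; j <= m; j++) {
            double d = q[i - 1] - s[j - 1];
            double best = fmin(prev[j], fmin(curr[j - 1], prev[j - 1]));
            curr[j] = d * d + best;
        }
        double *t = prev; prev = curr; curr = t;
    }
    double res = prev[m];
    free(prev); free(curr);
    return res;
}

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int n = 1 << 20, m = 128;          /* toy series and query lengths */
    double *series = malloc(n * sizeof(double));
    double *query  = malloc(m * sizeof(double));
    srand(42);                          /* same toy data on every rank */
    for (int i = 0; i < n; i++) series[i] = rand() / (double)RAND_MAX;
    for (int j = 0; j < m; j++) query[j]  = rand() / (double)RAND_MAX;

    /* Level 1 (MPI): each rank scans its block of start positions. */
    int total = n - m + 1;
    int lo = (int)((long long)rank * total / nprocs);
    int hi = (int)((long long)(rank + 1) * total / nprocs);

    /* Level 2 (OpenMP): the node's cores share the block. */
    double local_best = DBL_MAX;
    #pragma omp parallel for reduction(min:local_best) schedule(static)
    for (int i = lo; i < hi; i++) {
        double d = dtw(query, series + i, m);
        if (d < local_best) local_best = d;
    }

    double global_best;
    MPI_Reduce(&local_best, &global_best, 1, MPI_DOUBLE, MPI_MIN,
               0, MPI_COMM_WORLD);
    if (rank == 0) printf("best DTW distance: %f\n", global_best);

    MPI_Finalize();
    return 0;
}
```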
Optimisation of computational fluid dynamics applications on multicore and manycore architectures
This thesis presents a number of optimisations used for mapping the underlying computational patterns of finite volume CFD applications onto the architectural features of modern multicore and manycore processors. Their effectiveness and impact are demonstrated in a block-structured and an unstructured code of a size representative of industrial applications, and across a variety of processor architectures that make up contemporary high-performance computing systems.
The importance of vectorization and the ways through which this can be achieved are demonstrated in both structured and unstructured solvers, together with the impact that the underlying data layout can have on performance. The utility of auto-tuning for ensuring performance portability across multiple architectures is demonstrated and used for selecting optimal parameters such as prefetch distances for software prefetching or tile sizes for strip mining/loop tiling. On the manycore architectures, running more than one thread per physical core is found to be crucial for good performance on processors with in-order core designs but not required on out-of-order architectures. For architectures with high-bandwidth memory packages, their exploitation, whether explicit or implicit, is shown to be imperative for best performance.
The implementation of all of these optimisations led to application speed-ups ranging between 2.7X and 3X on the multicore CPUs and between 5.7X and 24X on the manycore processors.
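As an illustration of two of these optimisations, the hypothetical kernel below strip-mines a gather-style loop typical of unstructured CFD and issues software prefetches; TILE and PF_DIST stand in for the parameters an auto-tuner would select per architecture (a sketch under those assumptions, not code from the thesis; __builtin_prefetch is the GCC/Clang intrinsic):

```c
#include <stddef.h>

#ifndef TILE
#define TILE 512        /* strip-mining tile size: auto-tuned in practice */
#endif
#ifndef PF_DIST
#define PF_DIST 16      /* prefetch distance in elements: auto-tuned */
#endif

/* Gather-style update typical of unstructured-mesh CFD kernels:
 * out[i] += w[i] * x[idx[i]], with an indirect access through idx. */
void gather_update(double *out, const double *w, const double *x,
                   const int *idx, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE) {             /* strip mining */
        size_t end = ii + TILE < n ? ii + TILE : n;
        for (size_t i = ii; i < end; i++) {
            if (i + PF_DIST < n)                          /* software prefetch */
                __builtin_prefetch(&x[idx[i + PF_DIST]]);
            out[i] += w[i] * x[idx[i]];
        }
    }
}
```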
Status and Future Perspectives for Lattice Gauge Theory Calculations to the Exascale and Beyond
In this and a set of companion whitepapers, the USQCD Collaboration lays out
a program of science and computing for lattice gauge theory. These whitepapers
describe how calculations using lattice QCD (and other gauge theories) can aid
the interpretation of ongoing and upcoming experiments in particle and nuclear
physics, as well as inspire new ones.
Comment: 44 pages; one of the USQCD whitepapers.
Parallelization of Plasma Physics Simulations on Massively Parallel Architectures
Graduation Project (Master's in Computer Engineering), Instituto Tecnológico de Costa Rica, Escuela de Ingeniería en Computación, 2017.
Clean energy sources have grown in importance in the last few years, and with them the search for more sustainable sources. This has turned the eyes of the scientific community to plasma physics, especially to controlled fusion. Developments in plasma physics rely on computer simulation before the implementation of the corresponding fusion devices can begin: simulation detects issues in the theoretical model of a device, saving time and money. To serve this purpose, the simulations must finish in a timely manner; otherwise they defeat their purpose. However, in recent years computer systems have moved from increasing clock speed to increasing parallelism, a change that has held these applications back. For these reasons, in this dissertation we took a plasma physics simulation application and sped it up by combining vectorization with shared- and distributed-memory programming in a hybrid model. We ran several experiments on the performance improvement and the scaling of the new implementation on supercomputers based on a recent architecture, the Intel Xeon Phi Knights Landing many-core processor. The claim of this thesis is that, under the right configuration and on the right architecture, a plasma physics application can be parallelized with a parallel efficiency of around 0.8.
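The hybrid model described here combines three levels of parallelism; a minimal skeleton, with a placeholder particle-push kernel standing in for the plasma simulation (an assumed shape, not the thesis code), might look like:

```c
/* Compile: mpicc -fopenmp hybrid.c */
#include <mpi.h>
#include <stdlib.h>

/* Placeholder kernel: a simple particle position update stands in for
 * the plasma physics computation. */
static void push_particles(double *x, const double *v, double dt, long n)
{
    /* Levels 2 + 3: OpenMP threads across the node's cores, and
     * AVX-512 vector lanes on KNL via the simd construct. */
    #pragma omp parallel for simd schedule(static)
    for (long i = 0; i < n; i++)
        x[i] += v[i] * dt;
}

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    /* Level 1: distributed memory across cluster nodes via MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    long n_total = 1L << 24;             /* toy particle count */
    long n_local = n_total / nprocs;     /* each rank's share of particles */
    double *x = malloc(n_local * sizeof(double));
    double *v = malloc(n_local * sizeof(double));
    for (long i = 0; i < n_local; i++) { x[i] = 0.0; v[i] = 1.0; }

    for (int step = 0; step < 100; step++)
        push_particles(x, v, 1e-3, n_local);

    free(x); free(v);
    MPI_Finalize();
    return 0;
}
```

With p cores, a parallel efficiency of 0.8 means the measured speed-up is about 0.8p.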
Main memory in HPC: do we need more, or could we live with less?
An important aspect of High-Performance Computing (HPC) system design is the choice of main memory capacity. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional Dual In-line Memory Modules (DIMMs), 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore, the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now.
This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High-Performance Conjugate Gradients (HPCG) benchmark could be an important success story for 3D-stacked memories in HPC, but High-Performance Linpack (HPL) is likely to be constrained by 3D memory capacity. The study also emphasizes that the analysis of memory footprints of production HPC applications is complex and that it requires an understanding of application scalability and target category, i.e., whether the users target capability or capacity computing. The results show that most of the HPC applications under study have per-core memory footprints in the range of hundreds of megabytes, but we also detect applications and use cases that require gigabytes per core. Overall, the study identifies the HPC applications and use cases with memory footprints that could be provided by 3D-stacked memory chiplets, making a first step toward adoption of this novel technology in the HPC domain.
This work was supported by the Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC, by the Spanish Government through the Severo Ochoa programme (SEV-2015-0493), by the Spanish Ministry of Science and Technology through the TIN2015-65316-P project, and by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). This work has also received funding from the European Union's Horizon 2020 research and innovation programme under the ExaNoDe project (grant agreement No 671578). Darko Zivanovic holds the Severo Ochoa grant (SVP-2014-068501) of the Ministry of Economy and Competitiveness of Spain. The authors thank Harald Servat from BSC and Vladimir Marjanović from High Performance Computing Center Stuttgart for their technical support.
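The HPL observation can be made concrete with back-of-the-envelope arithmetic: HPL factorizes a dense N x N double-precision matrix, so its working set is dominated by 8*N*N bytes, and its efficiency improves with larger N, which is why a smaller 3D-stacked capacity caps the achievable problem size. A sketch with hypothetical node configurations (the capacities below are illustrative, not from the study):

```c
/* Compile with -lm. */
#include <math.h>
#include <stdio.h>

/* Largest dense matrix order N whose 8*N*N-byte footprint fits in memory. */
static void report(const char *label, double capacity_gib)
{
    double bytes = capacity_gib * 1024.0 * 1024.0 * 1024.0;
    long n_max = (long)sqrt(bytes / 8.0);
    printf("%-36s -> max HPL matrix order N ~ %ld\n", label, n_max);
}

int main(void)
{
    /* Hypothetical node configurations, for illustration only. */
    report("16 GiB 3D-stacked (MCDRAM-sized)", 16.0);
    report("192 GiB conventional DIMMs", 192.0);
    return 0;
}
```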