41,033 research outputs found
Exploring the Performance Benefit of Hybrid Memory System on HPC Environments
Hardware accelerators have become a de-facto standard to achieve high
performance on current supercomputers and there are indications that this trend
will increase in the future. Modern accelerators feature high-bandwidth memory
next to the computing cores. For example, the Intel Knights Landing (KNL)
processor is equipped with 16 GB of high-bandwidth memory (HBM) that works
together with conventional DRAM memory. Theoretically, HBM can provide 5x
higher bandwidth than conventional DRAM. However, many factors impact the
effective performance achieved by applications, including the application
memory access pattern, the problem size, the threading level and the actual
memory configuration. In this paper, we analyze the Intel KNL system and
quantify the impact of the most important factors on the application
performance by using a set of applications that are representative of
scientific and data-analytics workloads. Our results show that applications
with regular memory access benefit from MCDRAM, achieving up to 3x performance
when compared to the performance obtained using only DRAM. On the contrary,
applications with random memory access pattern are latency-bound and may suffer
from performance degradation when using only MCDRAM. For those applications,
the use of additional hardware threads may help hide latency and achieve higher
aggregated bandwidth when using HBM
Shared Resource Management for Non-Volatile Asymmetric Memory
Non-volatile memory (NVM), such as Phase-Change Memory (PCM), is a promising energy-efficient candidate to replace DRAM. It is desirable because of its non-volatility, good scalability and low idle power. NVM, nevertheless, faces important challenges. The main problems are: writes are much slower and more power hungry than reads and write bandwidth is much lower than read bandwidth. Hybrid main memory architecture, which consists of a large NVM and a small DRAM, may become a solution for architecting NVM as main memory. Adding an extra layer of cache mitigates the drawbacks of NVM writes. However, writebacks from the last-level cache (LLC) might still (a) overwhelm the limited NVM write bandwidth and stall the application, (b) shorten lifetime and (c) increase energy consumption.
Effectively utilizing shared resources, such as the last-level cache and the memory bandwidth, is crucial to achieving high performance for multi-core systems. No existing cache and bandwidth allocation scheme exploits the read/write asymmetry property, which is fundamental in NVM. This thesis tries to consider the asymmetry property in partitioning the cache and memory bandwidth for NVM systems.
The thesis proposes three writeback-aware schemes to manage the resources in NVM systems. First, a runtime mechanism, Writeback-aware Cache Partitioning (WCP), is proposed to partition the shared LLC among multiple applications. Unlike past partitioning schemes, WCP considers the reduction in cache misses as well as writebacks. Second, a new runtime mechanism, Writeback-aware Bandwidth Partitioning (WBP), partitions NVM service cycles among applications. WBP uses a bandwidth partitioning weight to reflect the importance of writebacks (in addition to LLC misses) to bandwidth allocation. A companion Dynamic Weight Adjustment scheme dynamically selects the cache partitioning weight to maximize system performance. Third, Unified Writeback-aware Partitioning (UWP) partitions the last-level cache and the memory bandwidth cooperatively. UWP can further improve the system performance by considering the interaction of cache partitioning and bandwidth partitioning. The three proposed schemes improve system performance by considering the unique read/write asymmetry property of NVM
Parallel sparse matrix-vector multiplication as a test case for hybrid MPI+OpenMP programming
We evaluate optimized parallel sparse matrix-vector operations for two
representative application areas on widespread multicore-based cluster
configurations. First the single-socket baseline performance is analyzed and
modeled with respect to basic architectural properties of standard multicore
chips. Going beyond the single node, parallel sparse matrix-vector operations
often suffer from an unfavorable communication to computation ratio. Starting
from the observation that nonblocking MPI is not able to hide communication
cost using standard MPI implementations, we demonstrate that explicit overlap
of communication and computation can be achieved by using a dedicated
communication thread, which may run on a virtual core. We compare our approach
to pure MPI and the widely used "vector-like" hybrid programming strategy.Comment: 12 pages, 6 figure
Achieving Efficient Strong Scaling with PETSc using Hybrid MPI/OpenMP Optimisation
The increasing number of processing elements and decreas- ing memory to core
ratio in modern high-performance platforms makes efficient strong scaling a key
requirement for numerical algorithms. In order to achieve efficient scalability
on massively parallel systems scientific software must evolve across the entire
stack to exploit the multiple levels of parallelism exposed in modern
architectures. In this paper we demonstrate the use of hybrid MPI/OpenMP
parallelisation to optimise parallel sparse matrix-vector multiplication in
PETSc, a widely used scientific library for the scalable solution of partial
differential equations. Using large matrices generated by Fluidity, an open
source CFD application code which uses PETSc as its linear solver engine, we
evaluate the effect of explicit communication overlap using task-based
parallelism and show how to further improve performance by explicitly load
balancing threads within MPI processes. We demonstrate a significant speedup
over the pure-MPI mode and efficient strong scaling of sparse matrix-vector
multiplication on Fujitsu PRIMEHPC FX10 and Cray XE6 systems
Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems
We evaluate optimized parallel sparse matrix-vector operations for several
representative application areas on widespread multicore-based cluster
configurations. First the single-socket baseline performance is analyzed and
modeled with respect to basic architectural properties of standard multicore
chips. Beyond the single node, the performance of parallel sparse matrix-vector
operations is often limited by communication overhead. Starting from the
observation that nonblocking MPI is not able to hide communication cost using
standard MPI implementations, we demonstrate that explicit overlap of
communication and computation can be achieved by using a dedicated
communication thread, which may run on a virtual core. Moreover we identify
performance benefits of hybrid MPI/OpenMP programming due to improved load
balancing even without explicit communication overlap. We compare performance
results for pure MPI, the widely used "vector-like" hybrid programming
strategies, and explicit overlap on a modern multicore-based cluster and a Cray
XE6 system.Comment: 16 pages, 10 figure
- …