Demystifying the Performance of HPC Scientific Applications on NVM-based Memory Systems
The emergence of high-density byte-addressable non-volatile memory (NVM) is
promising to accelerate data- and compute-intensive applications. Current NVM
technologies have lower performance than DRAM and, thus, are often paired with
DRAM in a heterogeneous main memory. Recently, byte-addressable NVM hardware
has become available. This work provides a timely evaluation of representative HPC
applications from the "Seven Dwarfs" on NVM-based main memory. Our results
quantify the effectiveness of DRAM-cached-NVM for accelerating HPC applications
and enabling large problems beyond the DRAM capacity. On uncached-NVM, HPC
applications exhibit three tiers of performance sensitivity, i.e., insensitive,
scaled, and bottlenecked. We identify write throttling and concurrency control
as the priorities in optimizing applications. We highlight that concurrency
change may have a diverging effect on read and write accesses in applications.
Based on these findings, we explore two optimization approaches. First, we
provide a prediction model that uses datasets from a small set of
configurations to estimate performance at various concurrency and data sizes to
avoid exhaustive search in the configuration space. Second, we demonstrate that
write-aware data placement on uncached-NVM could achieve x performance
improvement with a 60% reduction in DRAM usage.
Comment: 34th IEEE International Parallel and Distributed Processing Symposium
(IPDPS 2020)
Building Distributed Systems with Non-Volatile Main Memories and RDMA Networks
High-performance, byte-addressable non-volatile main memories (NVMMs) allow application developers to combine storage and memory into a single layer. These high-performance storage systems would be especially useful in large-scale data center environments where data is distributed and replicated across multiple servers.
Unfortunately, existing approaches to providing remote storage access rest on the assumption that storage is slow, so the cost of the software and protocols is acceptable. This assumption no longer holds for fast NVMM. As a result, taking full advantage of NVMMs’ potential will require changes in system software and networking protocols. This thesis focuses on accessing remote NVMM efficiently over remote direct memory access (RDMA) networks. RDMA enables a client to directly access memory on a remote machine without involving the remote machine's CPU.
This thesis first presents Mojim, a system that provides replicated, reliable, and highly available NVMM as an operating system service. Applications can access data in Mojim using normal load and store instructions while controlling when and how updates propagate to replicas using system calls. Our evaluation shows Mojim adds little overhead to the un-replicated system and provides 0.4x to 2.7x the throughput of the un-replicated system.
This thesis then presents Orion, a distributed file system designed for NVMM and RDMA networks. Traditional distributed file systems are designed for slower hard drives. These slower media incentivize complex optimizations (e.g., queuing, striping, and batching) around disk accesses. Orion combines file system functions and network operations into a single layer. It provides low-latency metadata accesses and outperforms existing distributed file systems by a large margin.
Finally, an NVMM application can map files backed by an NVMM file system into its address space and access them using CPU instructions. In this case, RDMA and NVMM file systems introduce duplication of effort on permissions, naming, and address translation. We introduce two changes to the existing RDMA protocol: the file memory region (FileMR) and range-based address translation. By eliminating redundant translations, FileMR minimizes the number of translations done at the NIC, reducing the load on the NIC’s translation cache and improving application performance by 1.8x to 2.0x.
Extremely fast (a,b)-trees at all contention levels
Many concurrent dictionary implementations are designed and evaluated with only low-contention workloads in mind. This thesis presents several concurrent linearizable (a,b)-tree implementations with the overarching goal of performing well on both low- and high-contention workloads, and especially update-heavy workloads. The OCC-ABtree uses optimistic concurrency control to achieve state-of-the-art low-contention performance. However, under high contention, cache coherence traffic begins to affect its performance.
This is addressed by replacing its test-and-compare-and-swap locks with MCS queue locks. The resulting MCS-ABtree scales well under both low- and high-contention workloads. This thesis also introduces two coalescing-based trees, the CoMCS-ABtree and the CoPub-ABtree, that achieve substantially better performance under high contention by reordering and coalescing concurrent inserts and deletes. Comparing these algorithms against the state of the art in concurrent search trees, we find that the fastest algorithm, the CoPub-ABtree, outperforms the next fastest competitor by up to 2x.
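To illustrate why a queue lock reduces cache coherence traffic, here is a minimal sketch of a generic MCS queue lock in C++. It is the textbook algorithm, not the thesis's implementation: the names `McsNode`, `McsLock`, and `run_counter_demo` are hypothetical, and default sequentially consistent atomics are assumed for simplicity. Each waiting thread spins on a flag in its own node rather than on a shared lock word, so a release touches only the successor's cache line.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// One queue node per thread; a waiter spins only on its own `locked` flag.
struct McsNode {
    std::atomic<McsNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class McsLock {
public:
    void lock(McsNode& me) {
        me.next.store(nullptr);
        me.locked.store(true);
        // Atomically append this node to the tail of the waiter queue.
        McsNode* prev = tail_.exchange(&me);
        if (prev != nullptr) {
            // Queue was non-empty: link behind the predecessor, then spin
            // locally until the predecessor hands the lock over.
            prev->next.store(&me);
            while (me.locked.load()) {
                std::this_thread::yield();
            }
        }
    }

    void unlock(McsNode& me) {
        McsNode* succ = me.next.load();
        if (succ == nullptr) {
            // No visible successor: try to mark the queue as empty.
            McsNode* expected = &me;
            if (tail_.compare_exchange_strong(expected, nullptr)) {
                return;
            }
            // A successor swapped the tail but has not linked itself yet.
            while ((succ = me.next.load()) == nullptr) {
                std::this_thread::yield();
            }
        }
        // Hand the lock directly to the successor.
        succ->locked.store(false);
    }

private:
    std::atomic<McsNode*> tail_{nullptr};
};

// Hypothetical demo: several threads increment a shared counter under the lock.
long run_counter_demo(int threads, int iters) {
    McsLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters] {
            McsNode node;  // per-thread queue node, reused across acquisitions
            for (int i = 0; i < iters; ++i) {
                lock.lock(node);
                ++counter;
                lock.unlock(node);
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter;
}
```

Calling `run_counter_demo(4, 10000)` should return the full count of increments if mutual exclusion holds. A production lock would likely use weaker memory orderings and a pause instruction instead of `yield`; those choices are left out of this sketch.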
This thesis then describes persistent versions of the four trees, whose implementations use fewer sfence instructions than a leading competitor (the FPTree). The persistent trees are proved to be strictly linearizable. Experimentally, the persistent trees are only slightly slower than their volatile counterparts, suggesting that they are well suited to in-memory databases that must be able to recover after a crash.