Experience and performance of persistent memory for the DUNE data acquisition system
Emerging high-performance storage technologies are opening up the possibility
of designing new distributed data acquisition system architectures, in which
the live acquisition of data and their processing are decoupled through a
storage element. An example of these technologies is 3DXPoint, which promises
to fill the gap between memory and traditional storage and offers unprecedented
high throughput for data persistency.
In this paper, we characterize the performance of persistent memory devices,
which use the 3DXPoint technology, in the context of the data acquisition
system for one large Particle Physics experiment, DUNE. This experiment must be
capable of storing, upon a specific signal, incoming data for up to 100
seconds, with a throughput of 1.5 TB/s, for an aggregate size of 150 TB. The
modular nature of the apparatus allows splitting the problem into 150 identical
units operating in parallel, each at 10 GB/s. The target is to be able to
dedicate a single CPU to each of those units for data acquisition and storage.
Comment: Proceedings of the IEEE RT202
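The sizing figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration, not code from the paper: the function names are hypothetical, decimal units (1 TB = 1000 GB) are assumed, and the per-unit buffer size is a value derived here rather than one the abstract states.

```python
# Figures quoted in the abstract (assumption: decimal units, 1 TB = 1000 GB).
BUFFER_SECONDS = 100          # longest window of incoming data to store
TOTAL_THROUGHPUT_TB_S = 1.5   # aggregate incoming data rate
NUM_UNITS = 150               # identical units operating in parallel


def aggregate_size_tb(throughput_tb_s, seconds):
    """Total data volume for one storage window, in TB."""
    return throughput_tb_s * seconds


def per_unit_rate_gb_s(throughput_tb_s, units):
    """Data rate each unit must sustain, in GB/s."""
    return throughput_tb_s * 1000 / units


def per_unit_size_tb(throughput_tb_s, seconds, units):
    """Buffer each unit must hold, in TB (derived, not stated in the abstract)."""
    return aggregate_size_tb(throughput_tb_s, seconds) / units


if __name__ == "__main__":
    print(aggregate_size_tb(TOTAL_THROUGHPUT_TB_S, BUFFER_SECONDS), "TB in total")
    print(per_unit_rate_gb_s(TOTAL_THROUGHPUT_TB_S, NUM_UNITS), "GB/s per unit")
    print(per_unit_size_tb(TOTAL_THROUGHPUT_TB_S, BUFFER_SECONDS, NUM_UNITS), "TB per unit")
```

Under these assumptions the totals agree with the abstract (150 TB aggregate, 10 GB/s per unit), and each unit would need to buffer about 1 TB per window.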
CXL Memory as Persistent Memory for Disaggregated HPC: A Practical Approach
In the landscape of High-Performance Computing (HPC), the quest for efficient
and scalable memory solutions remains paramount. The advent of Compute Express
Link (CXL) introduces a promising avenue with its potential to function as a
Persistent Memory (PMem) solution in the context of disaggregated HPC systems.
This paper presents a comprehensive exploration of CXL memory's viability as a
candidate for PMem, supported by physical experiments conducted on cutting-edge
multi-NUMA nodes equipped with CXL-attached memory prototypes. Our study not
only benchmarks the performance of CXL memory but also illustrates the seamless
transition from traditional PMem programming models to CXL, reinforcing its
practicality.
To substantiate our claims, we establish a tangible CXL prototype using an
FPGA card embodying CXL 1.1/2.0 compliant endpoint designs (Intel FPGA CXL IP).
Performance evaluations, executed through the STREAM and STREAM-PMem
benchmarks, showcase CXL memory's ability to mirror PMem characteristics in
App-Direct and Memory Mode while achieving impressive bandwidth metrics with
Intel 4th generation Xeon (Sapphire Rapids) processors.
The results elucidate the feasibility of CXL memory as a persistent memory
solution, outperforming previously established benchmarks. In contrast to
published DCPMM results, our CXL-DDR4 memory module offers comparable bandwidth
to local DDR4 memory configurations, albeit with a moderate decrease in
performance. The modified STREAM-PMem application demonstrates the ease of
transitioning programming models from PMem to CXL, underscoring the
practicality of adopting CXL memory.
Comment: 12 pages, 9 figure
Demystifying the Performance of HPC Scientific Applications on NVM-based Memory Systems
The emergence of high-density byte-addressable non-volatile memory (NVM) is
promising to accelerate data- and compute-intensive applications. Current NVM
technologies have lower performance than DRAM and, thus, are often paired with
DRAM in a heterogeneous main memory. Recently, byte-addressable NVM hardware
has become available. This work provides a timely evaluation of representative HPC
applications from the "Seven Dwarfs" on NVM-based main memory. Our results
quantify the effectiveness of DRAM-cached-NVM for accelerating HPC applications
and enabling large problems beyond the DRAM capacity. On uncached-NVM, HPC
applications exhibit three tiers of performance sensitivity, i.e., insensitive,
scaled, and bottlenecked. We identify write throttling and concurrency control
as the priorities in optimizing applications. We highlight that concurrency
change may have a diverging effect on read and write accesses in applications.
Based on these findings, we explore two optimization approaches. First, we
provide a prediction model that uses datasets from a small set of
configurations to estimate performance at various concurrency and data sizes to
avoid exhaustive search in the configuration space. Second, we demonstrate that
write-aware data placement on uncached-NVM could achieve x performance
improvement with a 60% reduction in DRAM usage.
Comment: 34th IEEE International Parallel and Distributed Processing Symposium
(IPDPS2020)
Compiler and system for resilient distributed heterogeneous graph analytics
Graph analytics systems are used in a wide variety of applications including health care, electronic circuit design, machine learning, and cybersecurity. Graph analytics systems must handle very large graphs such as the Facebook friends graph, which has more than a billion nodes and 200 billion edges. Since machines have limited main memory, distributed-memory clusters with sufficient memory and computation power are required for processing these graphs. In distributed graph analytics, the graph is partitioned among the machines in a cluster, and communication between partitions is implemented using a substrate like MPI. However, programming distributed-memory systems is not easy, and the recent trend toward processor heterogeneity has added to this complexity. To simplify the programming of graph applications on such platforms, this dissertation first presents a compiler called Abelian that translates shared-memory descriptions of graph algorithms written in the Galois programming model into efficient code for distributed-memory platforms with heterogeneous processors. An important runtime parameter to the compiler-generated distributed code is the partitioning policy. We present an experimental study of partitioning strategies for distributed work-efficient graph analytics applications on different CPU architecture clusters at large scale (up to 256 machines). Based on the study, we present a simple rule of thumb to select among myriad policies. Another challenge of distributed graph analytics that we address in this dissertation is dealing with machine fail-stop failures, which are an important concern, especially for long-running graph analytics applications on large clusters.
We present a novel communication and synchronization substrate called Phoenix that leverages the algorithmic properties of graph analytics applications to recover from faults with zero overhead during fault-free execution, and show that Phoenix is 24x faster than previous state-of-the-art systems. In this dissertation, we also look at the new opportunities for graph analytics on massive datasets brought by a new kind of byte-addressable memory technology with higher density and lower cost than DRAM, such as Intel Optane DC Persistent Memory. This enables the design of affordable systems that support up to 6TB of randomly accessible memory. In this dissertation, we present key runtime and algorithmic principles to consider when performing graph analytics on massive datasets on Optane DC Persistent Memory, and we highlight ideas that apply to graph analytics on all large-memory platforms. Finally, we show that our distributed graph analytics infrastructure can be used for a new domain of applications, in particular, embedding algorithms such as Word2Vec. Word2Vec trains the vector representations of words (also known as word embeddings) on a large text corpus, and the resulting vector embeddings have been shown to capture semantic and syntactic relationships among words. Other examples include Node2Vec, Code2Vec, Sequence2Vec, etc. (collectively known as Any2Vec) with a wide variety of uses. We formulate the training of such applications as a graph problem and present GraphAny2Vec, a distributed Any2Vec training framework that leverages the state-of-the-art distributed heterogeneous graph analytics infrastructure developed in this dissertation to scale Any2Vec training to large distributed clusters. GraphAny2Vec also demonstrates a novel way of combining model gradients during training, which allows it to scale without losing accuracy.
Computer Science