9 research outputs found

    Experience and performance of persistent memory for the DUNE data acquisition system

    Emerging high-performance storage technologies are opening up the possibility of designing new distributed data acquisition system architectures, in which the live acquisition of data and their processing are decoupled through a storage element. An example of these technologies is 3DXPoint, which promises to fill the gap between memory and traditional storage and offers unprecedented throughput for data persistency. In this paper, we characterize the performance of persistent memory devices that use 3DXPoint technology in the context of the data acquisition system for a large particle physics experiment, DUNE. This experiment must be capable of storing, upon a specific signal, incoming data for up to 100 seconds at a throughput of 1.5 TB/s, for an aggregate size of 150 TB. The modular nature of the apparatus allows splitting the problem into 150 identical units operating in parallel, each at 10 GB/s. The target is to dedicate a single CPU to each of those units for data acquisition and storage. Comment: Proceedings of the IEEE RT202
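The sizing figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and decimal units (1 TB = 1000 GB) are an assumption, since the abstract does not state which convention it uses.

```python
# Figures taken from the abstract; decimal units (1 TB = 1000 GB) are assumed.
TOTAL_THROUGHPUT_TB_S = 1.5   # aggregate input rate
BUFFER_SECONDS = 100          # maximum storage window after a trigger signal
UNITS = 150                   # identical units operating in parallel


def aggregate_size_tb(throughput_tb_s, seconds):
    """Total data volume accumulated over the storage window, in TB."""
    return throughput_tb_s * seconds


def per_unit_throughput_gb_s(throughput_tb_s, units):
    """Share of the aggregate rate handled by each unit, in GB/s."""
    return throughput_tb_s * 1000 / units


def per_unit_buffer_tb(throughput_tb_s, seconds, units):
    """Storage each unit must absorb over the full window, in TB."""
    return aggregate_size_tb(throughput_tb_s, seconds) / units


if __name__ == "__main__":
    print(aggregate_size_tb(TOTAL_THROUGHPUT_TB_S, BUFFER_SECONDS))
    print(per_unit_throughput_gb_s(TOTAL_THROUGHPUT_TB_S, UNITS))
    print(per_unit_buffer_tb(TOTAL_THROUGHPUT_TB_S, BUFFER_SECONDS, UNITS))
```

Under these assumptions the numbers are mutually consistent: 1.5 TB/s over 100 seconds gives 150 TB, and dividing across 150 units gives 10 GB/s and about 1 TB of buffered data per unit.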

    Demystifying the Performance of HPC Scientific Applications on NVM-based Memory Systems

    The emergence of high-density byte-addressable non-volatile memory (NVM) promises to accelerate data- and compute-intensive applications. Current NVM technologies have lower performance than DRAM and, thus, are often paired with DRAM in a heterogeneous main memory. Recently, byte-addressable NVM hardware has become available. This work provides a timely evaluation of representative HPC applications from the "Seven Dwarfs" on NVM-based main memory. Our results quantify the effectiveness of DRAM-cached-NVM for accelerating HPC applications and enabling large problems beyond the DRAM capacity. On uncached-NVM, HPC applications exhibit three tiers of performance sensitivity, i.e., insensitive, scaled, and bottlenecked. We identify write throttling and concurrency control as the priorities in optimizing applications. We highlight that a change in concurrency may have diverging effects on read and write accesses in applications. Based on these findings, we explore two optimization approaches. First, we provide a prediction model that uses datasets from a small set of configurations to estimate performance at various concurrency levels and data sizes, avoiding an exhaustive search of the configuration space. Second, we demonstrate that write-aware data placement on uncached-NVM could achieve a 22x performance improvement with a 60% reduction in DRAM usage. Comment: 34th IEEE International Parallel and Distributed Processing Symposium (IPDPS2020

    ConnectIt: A Framework for Static and Incremental Parallel Graph Connectivity Algorithms

    Connected components is a fundamental kernel in graph applications due to its usefulness in measuring how well-connected a graph is, as well as its use as a subroutine in many other graph algorithms. The fastest existing parallel multicore algorithms for connectivity are based on some form of edge sampling and/or linking and compressing trees. However, many combinations of these design choices have been left unexplored. In this paper, we design the ConnectIt framework, which provides different sampling strategies as well as various tree linking and compression schemes. ConnectIt enables us to obtain several hundred new variants of connectivity algorithms, most of which extend to computing spanning forests. In addition to static graphs, we also extend ConnectIt to support mixes of insertions and connectivity queries in the concurrent setting. We present an experimental evaluation of ConnectIt on a 72-core machine, which we believe is the most comprehensive evaluation of parallel connectivity algorithms to date. Compared to a collection of state-of-the-art static multicore algorithms, we obtain an average speedup of 37.4x (2.36x average speedup over the fastest existing implementation for each graph). Using ConnectIt, we are able to compute connectivity on the largest publicly available graph (with over 3.5 billion vertices and 128 billion edges) in under 10 seconds using a 72-core machine, providing a 3.1x speedup over the fastest existing connectivity result for this graph, in any computational setting. For our incremental algorithms, we show that they can ingest graph updates at up to several billion edges per second. Finally, to guide the user in selecting the best variants in ConnectIt for different situations, we provide a detailed analysis of the different strategies in terms of their work and locality.
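For readers unfamiliar with the "linking and compressing trees" idea mentioned in this abstract, the following is a minimal sequential union-find sketch. It is not ConnectIt's implementation, which is parallel and offers many linking, compression, and sampling variants; the function names and the link-by-smaller-id rule are illustrative assumptions.

```python
def find(parent, v):
    """Return the root of v's tree, compressing the path along the way."""
    root = v
    while parent[root] != root:
        root = parent[root]
    # Second pass: point every vertex on the path directly at the root.
    while parent[v] != root:
        parent[v], v = root, parent[v]
    return root


def union(parent, u, v):
    """Link the trees containing u and v (larger root id under smaller)."""
    ru, rv = find(parent, u), find(parent, v)
    if ru == rv:
        return
    if ru < rv:
        parent[rv] = ru
    else:
        parent[ru] = rv


def connected_components(num_vertices, edges):
    """Label each vertex with the root id of its component."""
    parent = list(range(num_vertices))
    for u, v in edges:
        union(parent, u, v)
    return [find(parent, v) for v in range(num_vertices)]


if __name__ == "__main__":
    # Two components: {0, 1, 2} and {3, 4}.
    print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3]
```

Because edges can be fed to `union` one at a time, the same structure also answers connectivity queries interleaved with insertions, which is the sequential analogue of the incremental setting the abstract describes.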

    Adding Machine Intelligence to Hybrid Memory Management

    Computing platforms increasingly incorporate heterogeneous memory hardware technologies as a way to scale application performance and memory capacity and to achieve cost effectiveness. However, this heterogeneity, along with the greater irregularity in the behavior of emerging workloads, renders existing hybrid memory management approaches ineffective, calling for more intelligent methods. To this end, this thesis reveals new insights, develops novel methods and contributes system-level mechanisms towards the practical integration of machine learning into hybrid memory management, boosting application performance and system resource efficiency. First, this thesis builds Kleio, a hybrid memory page scheduler with machine intelligence. Kleio deploys Recurrent Neural Networks to learn memory access patterns at page granularity and to improve the selection of dynamic page migrations across the memory hardware components. Kleio focuses the machine learning on the subset of pages whose timely movement yields the greatest application performance improvement, while preserving lightweight history-based management for the rest of the pages. In this way, Kleio bridges on average 80% of the relative existing performance gap, while laying the groundwork for practical machine-intelligent data management with manageable learning overheads. In addition, this thesis contributes three system-level mechanisms to further boost application performance and reduce the operational and learning overheads of machine learning-based hybrid memory management. First, this thesis builds Cori, a system-level solution for tuning the operational frequency of periodic page schedulers for hybrid memories. Cori leverages insights on data reuse times to fine-tune the page migration frequency in a lightweight manner. Second, this thesis contributes Coeus, a page grouping mechanism for page schedulers like Kleio. Coeus leverages Cori’s data reuse insights to tune the granularity at which patterns are interpreted by the page scheduler and to enable the training of a single Recurrent Neural Network per page cluster, reducing model training times by 3x. The combined effects of Cori and Coeus provide 3x additional performance improvements to Kleio. Finally, this thesis proposes Cronus, an image-based page selector for page schedulers like Kleio. Cronus uses visualization to accelerate the process of selecting which page patterns should be managed with machine learning, reducing Kleio's operational overheads by 75x. Cronus lays the foundations for future use of visualization and computer vision methods in memory management, such as image-based memory access pattern classification, recognition and prediction. Ph.D