202 research outputs found
tf-Darshan: Understanding Fine-grained I/O Performance in Machine Learning Workloads
Machine Learning applications on HPC systems have been gaining popularity in
recent years. The upcoming large scale systems will offer tremendous
parallelism for training through GPUs. However, another heavy aspect of Machine
Learning is I/O, and this can potentially be a performance bottleneck.
TensorFlow, one of the most popular Deep-Learning platforms, now offers a new
profiler interface and allows instrumentation of TensorFlow operations.
However, the current profiler only enables analysis at the TensorFlow platform
level and does not provide system-level information. In this paper, we extend
TensorFlow Profiler and introduce tf-Darshan, both a profiler and tracer, that
performs instrumentation through Darshan. We use the same Darshan shared
instrumentation library and implement a runtime attachment without using a
system preload. We can extract Darshan profiling data structures during
TensorFlow execution to enable analysis through the TensorFlow profiler. We
visualize the performance results through TensorBoard, the web-based TensorFlow
visualization tool. At the same time, we do not alter Darshan's existing
implementation. We illustrate tf-Darshan by performing two case studies on
ImageNet image and Malware classification. We show that by guiding optimization
using data from tf-Darshan, we increase POSIX I/O bandwidth by up to 19% by
selecting data for staging on fast tier storage. We also show that Darshan has
the potential of being used as a runtime library for profiling and providing
information for future optimization.Comment: Accepted for publication at the 2020 International Conference on
Cluster Computing (CLUSTER 2020
10 Years Later: Cloud Computing is Closing the Performance Gap
Can cloud computing infrastructures provide HPC-competitive performance for
scientific applications broadly? Despite prolific related literature, this
question remains open. Answers are crucial for designing future systems and
democratizing high-performance computing. We present a multi-level approach to
investigate the performance gap between HPC and cloud computing, isolating
different variables that contribute to this gap. Our experiments are divided
into (i) hardware and system microbenchmarks and (ii) user application proxies.
The results show that today's high-end cloud computing can deliver
HPC-competitive performance not only for computationally intensive applications
but also for memory- and communication-intensive applications - at least at
modest scales - thanks to the high-speed memory systems and interconnects and
dedicated batch scheduling now available on some cloud platforms
A Survey of Prediction and Classification Techniques in Multicore Processor Systems
In multicore processor systems, being able to accurately predict the future provides new optimization opportunities, which otherwise could not be exploited. For example, an oracle able to predict a certain application\u27s behavior running on a smart phone could direct the power manager to switch to appropriate dynamic voltage and frequency scaling modes that would guarantee minimum levels of desired performance while saving energy consumption and thereby prolonging battery life. Using predictions enables systems to become proactive rather than continue to operate in a reactive manner. This prediction-based proactive approach has become increasingly popular in the design and optimization of integrated circuits and of multicore processor systems. Prediction transforms from simple forecasting to sophisticated machine learning based prediction and classification that learns from existing data, employs data mining, and predicts future behavior. This can be exploited by novel optimization techniques that can span across all layers of the computing stack. In this survey paper, we present a discussion of the most popular techniques on prediction and classification in the general context of computing systems with emphasis on multicore processors. The paper is far from comprehensive, but, it will help the reader interested in employing prediction in optimization of multicore processor systems
Adaptive memory-side last-level GPU caching
Emerging GPU applications exhibit increasingly high computation demands which has led GPU manufacturers to build GPUs with an increasingly large number of streaming multiprocessors (SMs). Providing data to the SMs at high bandwidth puts significant pressure on the memory hierarchy and the Network-on-Chip (NoC). Current GPUs typically partition the memory-side last-level cache (LLC) in equally-sized slices that are shared by all SMs. Although a shared LLC typically results in a lower miss rate, we find that for workloads with high degrees of data sharing across SMs, a private LLC leads to a significant performance advantage because of increased bandwidth to replicated cache lines across different LLC slices.
In this paper, we propose adaptive memory-side last-level GPU caching to boost performance for sharing-intensive workloads that need high bandwidth to read-only shared data. Adaptive caching leverages a lightweight performance model that balances increased LLC bandwidth against increased miss rate under private caching. In addition to improving performance for sharing-intensive workloads, adaptive caching also saves energy in a (co-designed) hierarchical two-stage crossbar NoC by power-gating and bypassing the second stage if the LLC is configured as a private cache. Our experimental results using 17 GPU workloads show that adaptive caching improves performance by 28.1% on average (up to 38.1%) compared to a shared LLC for sharing-intensive workloads. In addition, adaptive caching reduces NoC energy by 26.6% on average (up to 29.7%) and total system energy by 6.1% on average (up to 27.2%) when configured as a private cache. Finally, we demonstrate through a GPU NoC design space exploration that a hierarchical two-stage crossbar is both more power- and area-efficient than full and concentrated crossbars with the same bisection bandwidth, thus providing a low-cost cooperative solution to exploit workload sharing behavior in memory-side last-level caches
Evaluating Emerging CXL-enabled Memory Pooling for HPC Systems
Current HPC systems provide memory resources that are statically configured
and tightly coupled with compute nodes. However, workloads on HPC systems are
evolving. Diverse workloads lead to a need for configurable memory resources to
achieve high performance and utilization. In this study, we evaluate a memory
subsystem design leveraging CXL-enabled memory pooling. Two promising use cases
of composable memory subsystems are studied -- fine-grained capacity
provisioning and scalable bandwidth provisioning. We developed an emulator to
explore the performance impact of various memory compositions. We also provide
a profiler to identify the memory usage patterns in applications and their
optimization opportunities. Seven scientific and six graph applications are
evaluated on various emulated memory configurations. Three out of seven
scientific applications had less than 10% performance impact when the pooled
memory backed 75% of their memory footprint. The results also show that a
dynamically configured high-bandwidth system can effectively support
bandwidth-intensive unstructured mesh-based applications like OpenFOAM.
Finally, we identify interference through shared memory pools as a practical
challenge for adoption on HPC systems.Comment: 10 pages, 13 figures. Accepted for publication in Workshop on Memory
Centric High Performance Computing (MCHPC'22) at SC2
RISC-V-Based Platforms for HPC: Analyzing Non-functional Properties for Future HPC and Big-Data Clusters
High-Performance Computing (HPC) have evolved to be used to perform simulations of systems where physical experimentation is prohibitively impractical, expensive, or dangerous. This paper provides a general overview and showcases the analysis of non-functional properties
in RISC-V-based platforms for HPCs. In particular, our analyses target the evaluation of power and energy control, thermal management, and reliability assessment of promising systems, structures, and technologies devised for current and future generation of HPC machines. The main set of design methodologies and technologies developed within the activities of the Future and HPC & Big Data spoke of the National Centre of HPC, Big Data and Quantum Computing project are described along with the description of the testbed for experimenting two-phase cooling approaches
- …