A Pattern Language for High-Performance Computing Resilience
High-performance computing systems (HPC) provide powerful capabilities for
modeling, simulation, and data analytics for a broad class of computational
problems. They achieve extreme performance, on the order of a quadrillion floating-point operations per second, by aggregating the power of millions of compute, memory, networking, and storage components. With the
rapidly growing scale and complexity of HPC systems for achieving even greater
performance, ensuring their reliable operation in the face of system
degradations and failures is a critical challenge. System fault events often cause scientific applications to produce incorrect results or may even lead to their untimely termination. The sheer number of components in modern extreme-scale HPC systems, and the complex interactions and dependencies among the hardware and software components, the applications, and the physical environment, make the design of practical solutions that support fault resilience a complex undertaking. To manage this complexity, we developed a
methodology for designing HPC resilience solutions using design patterns. We
codified the well-known techniques for handling faults, errors and failures
that have been devised, applied and improved upon over the past three decades
in the form of design patterns. In this paper, we present a pattern language to
enable a structured approach to the development of HPC resilience solutions.
The pattern language reveals the relations among the resilience patterns and
provides the means to explore alternative techniques for handling a specific
fault model that may have different efficiency and complexity characteristics.
Using the pattern language enables the design and implementation of
comprehensive resilience solutions as a set of interconnected resilience
patterns that can be instantiated across layers of the system stack.
Comment: Proceedings of the 22nd European Conference on Pattern Languages of Programs
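The paper presents a pattern language rather than code; as a rough illustration of the kind of technique the patterns codify, the sketch below shows a single checkpoint/rollback resilience pattern in Python. The class and function names, the fixed checkpoint interval, and the use of an exception as the fault signal are hypothetical choices for illustration, not the paper's design.

# Minimal, illustrative sketch of one well-known resilience pattern
# (checkpoint/rollback); all names and parameters here are hypothetical.
import pickle

class CheckpointRollback:
    """Periodically saves application state and restores it after a fault."""

    def __init__(self, path="state.ckpt"):
        self.path = path

    def checkpoint(self, state):
        # Persist the current application state to stable storage.
        with open(self.path, "wb") as f:
            pickle.dump(state, f)

    def restore(self):
        # Roll back to the most recently saved state after an error is detected.
        with open(self.path, "rb") as f:
            return pickle.load(f)

def run_with_resilience(step, state, n_steps, ckpt_interval=10):
    pattern = CheckpointRollback()
    pattern.checkpoint(state)
    i = 0
    while i < n_steps:
        try:
            state = step(state)
            i += 1
            if i % ckpt_interval == 0:
                pattern.checkpoint(state)
        except RuntimeError:            # stand-in for a detected fault
            state = pattern.restore()   # roll back to the last checkpoint
            i = (i // ckpt_interval) * ckpt_interval
    return state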
ML-based Modeling to Predict I/O Performance on Different Storage Sub-systems
Parallel applications can spend a significant amount of time performing I/O
on large-scale supercomputers. Fast near-compute storage accelerators called
burst buffers can reduce the time a processor spends performing I/O and
mitigate I/O bottlenecks. However, determining if a given application could be
accelerated using burst buffers is not straightforward even for storage
experts. The relationship between an application's I/O characteristics (such as I/O volume and the number of processes involved) and the best storage sub-system for it
can be complicated. As a result, adapting parallel applications to use burst
buffers efficiently is a trial-and-error process. In this work, we present a
Python-based tool called PrismIO that enables programmatic analysis of I/O
traces. Using PrismIO, we identify bottlenecks on burst buffers and parallel
file systems and explain why certain I/O patterns perform poorly. Further, we
use machine learning to model the relationship between I/O characteristics and
burst buffer selections. We run IOR (an I/O benchmark) with various I/O
characteristics on different storage systems and collect performance data. We
use the data as the input for training the model. Our model can predict whether a file of an application should be placed on burst buffers with an accuracy of 94.47% for unseen IOR scenarios and 95.86% for four real applications.
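The abstract does not specify the model or feature set; the sketch below only illustrates the general idea of learning a burst-buffer placement decision from I/O characteristics. The features, the toy data, and the choice of a random-forest classifier are assumptions for illustration, not the paper's actual model or training data.

# Hedged sketch: train a classifier mapping I/O characteristics to a storage
# choice (1 = burst buffer, 0 = parallel file system). Features and data are
# hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical features per run: [I/O volume (GiB), number of processes,
# transfer size (MiB), read fraction]
X = np.array([
    [512, 1024, 8.0, 0.1],
    [4,   64,   1.0, 0.9],
    [256, 2048, 16.0, 0.2],
    [1,   32,   0.5, 0.8],
])
y = np.array([1, 0, 1, 0])  # 1 = place the file on a burst buffer

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))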
Infrastructure for Performance Tuning MPI Applications
Clusters of workstations are becoming increasingly popular as a low-budget alternative for supercomputing power. In these systems, message-passing is often used to allow the separate nodes to act as a single computing machine. Programmers of such systems face a daunting challenge in understanding the performance bottlenecks of their applications. This is largely due to the vast amount of performance data that is collected, and the time and expertise necessary to use traditional parallel performance tools to analyze that data. The goal of this project is to increase the level of performance tool support for message-passing application programmers on clusters of workstations. We added support for LAM/MPI into the existing parallel performance tool, Paradyn. LAM/MPI is a commonly used, freely-available implementation of the Message Passing Interface (MPI), and also includes several newer MPI features, such as dynamic process creation. In addition, we added support for non-shared filesystems into Paradyn and enhanced the existing support for the MPICH implementation of MPI. We verified that Paradyn correctly measures the performance of the majority of LAM/MPI programs on Linux clusters and show the results of those tests. In addition, we discuss MPI-2 features that are of interest to parallel performance tool developers and design support for these features for Paradyn.
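Paradyn's instrumentation is not shown in this abstract; the minimal mpi4py sketch below only illustrates the kind of per-call communication timing that a message-passing performance tool gathers for an MPI program. The two-rank point-to-point exchange and the buffer size are hypothetical.

# Tiny illustration (not Paradyn) of timing a point-to-point MPI exchange.
# Run with, e.g.: mpirun -np 2 python timing_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1_000_000, dtype="d")

t0 = MPI.Wtime()
if rank == 0:
    comm.Send(buf, dest=1, tag=0)    # time spent in the send call
elif rank == 1:
    comm.Recv(buf, source=0, tag=0)  # time spent waiting in the receive
elapsed = MPI.Wtime() - t0

# A performance tool aggregates such timings across processes to locate
# communication bottlenecks; here we simply print them per rank.
print(f"rank {rank}: {elapsed:.6f} s in point-to-point communication")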
Scalable event tracking on high-end parallel systems
Accurate performance analysis of high-end systems requires event-based traces to correctly identify the root cause of a number of the complex performance problems that arise on these highly parallel systems. These high-end architectures contain tens to hundreds of thousands of processors, pushing application scalability challenges to new heights. Unfortunately, the collection of event-based data presents scalability challenges itself: the large volume of collected data increases tool overhead, and results in data files that are difficult to store and analyze. Our solution to these problems is a new measurement technique called trace profiling that collects the information needed to diagnose performance problems that traditionally require traces, but at a greatly reduced data volume. The trace profiling technique reduces the amount of data measured and stored by capitalizing on the repeated behavior of programs, and on the similarity of the behavior and performance of parallel processes in an application run. Trace profiling is a hybrid between profiling and tracing, collecting summary information about the event patterns in an application run. Because the data has already been classified into behavior categories, we can present reduced, partially analyzed performance data to the user, highlighting the performance behaviors that comprised most of the execution time.
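As a hedged sketch of the core idea behind trace profiling, the snippet below summarizes repeated event patterns with a count and an aggregate time instead of storing every event. The fixed-length windowing and exact pattern matching are simplifications assumed for illustration; the actual technique's pattern detection is more sophisticated than this.

# Illustrative only: summarize repeated event patterns in a per-process trace.
from collections import defaultdict

def trace_profile(events, window=4):
    """events: list of (event_name, duration) tuples from one process."""
    summary = defaultdict(lambda: {"count": 0, "total_time": 0.0})
    for i in range(0, len(events), window):
        chunk = events[i:i + window]
        pattern = tuple(name for name, _ in chunk)    # the event pattern
        summary[pattern]["count"] += 1                # how often it repeats
        summary[pattern]["total_time"] += sum(t for _, t in chunk)
    return dict(summary)

# A highly repetitive trace collapses to a single pattern entry.
trace = [("MPI_Send", 1.2), ("compute", 9.0), ("MPI_Recv", 1.1), ("compute", 8.7)] * 100
for pattern, stats in trace_profile(trace).items():
    print(pattern, stats)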
Evaluating Similarity-based Trace Reduction Techniques for Scalable Performance Analysis
Event traces are required to correctly diagnose a number of performance problems that arise on today’s highly parallel systems. Unfortunately, the collection of event traces can produce a large volume of data that is difficult, or even impossible, to store and analyze. One approach for compressing a trace is to identify repeating trace patterns and retain only one representative of each pattern. However, determining the similarity of sections of traces, i.e., identifying patterns, is not straightforward. In this paper, we investigate pattern-based methods for reducing traces that will be used for performance analysis. We evaluate the different methods against several criteria, including size reduction, introduced error, and retention of performance trends, using both benchmarks with carefully chosen performance behaviors and a real application.
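To make the notion of pattern-based trace reduction concrete, the sketch below keeps one representative of each group of similar trace sections. The similarity test used here (identical event sequences whose per-event durations agree within a tolerance) is an illustrative assumption, not one of the methods evaluated in the paper.

# Illustrative sketch: retain one representative per group of similar sections.
def similar(a, b, tol=0.1):
    if [e for e, _ in a] != [e for e, _ in b]:
        return False                                   # different event sequence
    return all(abs(ta - tb) <= tol * max(ta, tb)       # durations within tolerance
               for (_, ta), (_, tb) in zip(a, b))

def reduce_trace(sections, tol=0.1):
    representatives = []
    for sec in sections:
        if not any(similar(sec, rep, tol) for rep in representatives):
            representatives.append(sec)                # new pattern representative
    return representatives

sections = [
    [("MPI_Send", 1.0), ("compute", 9.0)],
    [("MPI_Send", 1.05), ("compute", 8.9)],   # similar to the first -> dropped
    [("MPI_Allreduce", 3.0), ("compute", 4.0)],
]
print(len(reduce_trace(sections)), "representatives kept out of", len(sections))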