A Pattern Language for High-Performance Computing Resilience
High-performance computing systems (HPC) provide powerful capabilities for
modeling, simulation, and data analytics for a broad class of computational
problems. They achieve extreme performance, on the order of a quadrillion floating-point operations per second, by aggregating the power of millions of compute, memory, networking, and storage components. With the
rapidly growing scale and complexity of HPC systems for achieving even greater
performance, ensuring their reliable operation in the face of system
degradations and failures is a critical challenge. System fault events often cause scientific applications to produce incorrect results or may even lead to their untimely termination. The sheer number of components in modern extreme-scale HPC systems, and the complex interactions and dependencies among the hardware and software components, the applications, and the physical environment, make the design of practical solutions that support fault resilience a complex undertaking. To manage this complexity, we developed a
methodology for designing HPC resilience solutions using design patterns. We
codified the well-known techniques for handling faults, errors and failures
that have been devised, applied and improved upon over the past three decades
in the form of design patterns. In this paper, we present a pattern language to
enable a structured approach to the development of HPC resilience solutions.
The pattern language reveals the relations among the resilience patterns and
provides the means to explore alternative techniques for handling a specific
fault model that may have different efficiency and complexity characteristics.
Using the pattern language enables the design and implementation of
comprehensive resilience solutions as a set of interconnected resilience
patterns that can be instantiated across layers of the system stack.
Comment: Proceedings of the 22nd European Conference on Pattern Languages of Programs
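The paper presents a pattern language rather than code; as a rough illustration of the kind of technique the patterns codify, the sketch below shows a single checkpoint/rollback resilience pattern in Python. The class and function names, the fixed checkpoint interval, and the use of an exception as the fault signal are hypothetical choices for illustration, not the paper's design.

# Minimal, illustrative sketch of one well-known resilience pattern
# (checkpoint/rollback); all names and parameters here are hypothetical.
import pickle

class CheckpointRollback:
    """Periodically saves application state and restores it after a fault."""

    def __init__(self, path="state.ckpt"):
        self.path = path

    def checkpoint(self, state):
        # Persist the current application state to stable storage.
        with open(self.path, "wb") as f:
            pickle.dump(state, f)

    def restore(self):
        # Roll back to the most recently saved state after an error is detected.
        with open(self.path, "rb") as f:
            return pickle.load(f)

def run_with_resilience(step, state, n_steps, ckpt_interval=10):
    pattern = CheckpointRollback()
    pattern.checkpoint(state)
    i = 0
    while i < n_steps:
        try:
            state = step(state)
            i += 1
            if i % ckpt_interval == 0:
                pattern.checkpoint(state)
        except RuntimeError:            # stand-in for a detected fault
            state = pattern.restore()   # roll back to the last checkpoint
            i = (i // ckpt_interval) * ckpt_interval
    return state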
ML-based Modeling to Predict I/O Performance on Different Storage Sub-systems
Parallel applications can spend a significant amount of time performing I/O
on large-scale supercomputers. Fast near-compute storage accelerators called
burst buffers can reduce the time a processor spends performing I/O and
mitigate I/O bottlenecks. However, determining if a given application could be
accelerated using burst buffers is not straightforward even for storage
experts. The relationship between an application's I/O characteristics (such as I/O volume and the number of processes involved) and the best storage sub-system for it
can be complicated. As a result, adapting parallel applications to use burst
buffers efficiently is a trial-and-error process. In this work, we present a
Python-based tool called PrismIO that enables programmatic analysis of I/O
traces. Using PrismIO, we identify bottlenecks on burst buffers and parallel
file systems and explain why certain I/O patterns perform poorly. Further, we
use machine learning to model the relationship between I/O characteristics and
burst buffer selections. We run IOR (an I/O benchmark) with various I/O
characteristics on different storage systems and collect performance data. We
use the data as the input for training the model. Our model can predict whether a file of an application should be placed on burst buffers with an accuracy of 94.47% for unseen IOR scenarios and 95.86% for four real applications.
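The abstract does not specify the model or feature set; the sketch below only illustrates the general idea of learning a burst-buffer placement decision from I/O characteristics. The features, the toy data, and the choice of a random-forest classifier are assumptions for illustration, not the paper's actual model or training data.

# Hedged sketch: train a classifier mapping I/O characteristics to a storage
# choice (1 = burst buffer, 0 = parallel file system). Features and data are
# hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical features per run: [I/O volume (GiB), number of processes,
# transfer size (MiB), read fraction]
X = np.array([
    [512, 1024, 8.0, 0.1],
    [4,   64,   1.0, 0.9],
    [256, 2048, 16.0, 0.2],
    [1,   32,   0.5, 0.8],
])
y = np.array([1, 0, 1, 0])  # 1 = place the file on a burst buffer

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))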
Infrastructure for Performance Tuning MPI Applications
Clusters of workstations are becoming increasingly popular as a low-budget alternative for supercomputing power. In these systems, message-passing is often used to allow the separate nodes to act as a single computing machine. Programmers of such systems face a daunting challenge in understanding the performance bottlenecks of their applications. This is largely due to the vast amount of performance data that is collected, and the time and expertise necessary to use traditional parallel performance tools to analyze that data. The goal of this project is to increase the level of performance tool support for message-passing application programmers on clusters of workstations. We added support for LAM/MPI into the existing parallel performance tool, Paradyn. LAM/MPI is a commonly used, freely-available implementation of the Message Passing Interface (MPI), and also includes several newer MPI features, such as dynamic process creation. In addition, we added support for non-shared filesystems into Paradyn and enhanced the existing support for the MPICH implementation of MPI. We verified that Paradyn correctly measures the performance of the majority of LAM/MPI programs on Linux clusters and show the results of those tests. In addition, we discuss MPI-2 features that are of interest to parallel performance tool developers and design support for these features for Paradyn.
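Paradyn's instrumentation is not shown in this abstract; the minimal mpi4py sketch below only illustrates the kind of per-call communication timing that a message-passing performance tool gathers for an MPI program. The two-rank point-to-point exchange and the buffer size are hypothetical.

# Tiny illustration (not Paradyn) of timing a point-to-point MPI exchange.
# Run with, e.g.: mpirun -np 2 python timing_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1_000_000, dtype="d")

t0 = MPI.Wtime()
if rank == 0:
    comm.Send(buf, dest=1, tag=0)    # time spent in the send call
elif rank == 1:
    comm.Recv(buf, source=0, tag=0)  # time spent waiting in the receive
elapsed = MPI.Wtime() - t0

# A performance tool aggregates such timings across processes to locate
# communication bottlenecks; here we simply print them per rank.
print(f"rank {rank}: {elapsed:.6f} s in point-to-point communication")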
Scalable event tracking on high-end parallel systems
Accurate performance analysis of high-end systems requires event-based traces to correctly identify the root cause of a number of the complex performance problems that arise on these highly parallel systems. These high-end architectures contain tens to hundreds of thousands of processors, pushing application scalability challenges to new heights. Unfortunately, the collection of event-based data presents scalability challenges itself: the large volume of collected data increases tool overhead, and results in data files that are difficult to store and analyze. Our solution to these problems is a new measurement technique called trace profiling that collects the information needed to diagnose performance problems that traditionally require traces, but at a greatly reduced data volume. The trace profiling technique reduces the amount of data measured and stored by capitalizing on the repeated behavior of programs, and on the similarity of the behavior and performance of parallel processes in an application run. Trace profiling is a hybrid between profiling and tracing, collecting summary information about the event patterns in an application run. Because the data has already been classified into behavior categories, we can present reduced, partially analyzed performance data to the user, highlighting the performance behaviors that comprised most of the execution time.
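As a hedged sketch of the core idea behind trace profiling, the snippet below summarizes repeated event patterns with a count and an aggregate time instead of storing every event. The fixed-length windowing and exact pattern matching are simplifications assumed for illustration; the actual technique's pattern detection is more sophisticated than this.

# Illustrative only: summarize repeated event patterns in a per-process trace.
from collections import defaultdict

def trace_profile(events, window=4):
    """events: list of (event_name, duration) tuples from one process."""
    summary = defaultdict(lambda: {"count": 0, "total_time": 0.0})
    for i in range(0, len(events), window):
        chunk = events[i:i + window]
        pattern = tuple(name for name, _ in chunk)    # the event pattern
        summary[pattern]["count"] += 1                # how often it repeats
        summary[pattern]["total_time"] += sum(t for _, t in chunk)
    return dict(summary)

# A highly repetitive trace collapses to a single pattern entry.
trace = [("MPI_Send", 1.2), ("compute", 9.0), ("MPI_Recv", 1.1), ("compute", 8.7)] * 100
for pattern, stats in trace_profile(trace).items():
    print(pattern, stats)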
Evaluating Similarity-based Trace Reduction Techniques for Scalable Performance Analysis
Event traces are required to correctly diagnose a number of performance problems that arise on today’s highly parallel systems. Unfortunately, the collection of event traces can produce a large volume of data that is difficult, or even impossible, to store and analyze. One approach for compressing a trace is to identify repeating trace patterns and retain only one representative of each pattern. However, determining the similarity of sections of traces, i.e., identifying patterns, is not straightforward. In this paper, we investigate pattern-based methods for reducing traces that will be used for performance analysis. We evaluate the different methods against several criteria, including size reduction, introduced error, and retention of performance trends, using both benchmarks with carefully chosen performance behaviors and a real application.
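To make the notion of pattern-based trace reduction concrete, the sketch below keeps one representative of each group of similar trace sections. The similarity test used here (identical event sequences whose per-event durations agree within a tolerance) is an illustrative assumption, not one of the methods evaluated in the paper.

# Illustrative sketch: retain one representative per group of similar sections.
def similar(a, b, tol=0.1):
    if [e for e, _ in a] != [e for e, _ in b]:
        return False                                   # different event sequence
    return all(abs(ta - tb) <= tol * max(ta, tb)       # durations within tolerance
               for (_, ta), (_, tb) in zip(a, b))

def reduce_trace(sections, tol=0.1):
    representatives = []
    for sec in sections:
        if not any(similar(sec, rep, tol) for rep in representatives):
            representatives.append(sec)                # new pattern representative
    return representatives

sections = [
    [("MPI_Send", 1.0), ("compute", 9.0)],
    [("MPI_Send", 1.05), ("compute", 8.9)],   # similar to the first -> dropped
    [("MPI_Allreduce", 3.0), ("compute", 4.0)],
]
print(len(reduce_trace(sections)), "representatives kept out of", len(sections))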