2 research outputs found

    Infrastructure for Performance Monitoring and Analysis of Systems and Applications

    Get PDF
    The growth of High Performance Computer (HPC) systems increases the complexity with respect to understanding resource utilization, system management, and performance issues. HPC performance monitoring tools need to collect information at both the application and system levels to yield a complete performance picture. Existing approaches limit the abilities of the users to do meaningful analysis on actionable timescale. Efficient infrastructures are required to support largescale systems performance data analysis for both run-time troubleshooting and post-run processing modes. In this dissertation, we present methods to fill these gaps in the infrastructure for HPC performance monitoring and analysis. First, we enhance the architecture of a monitoring system to integrate streaming analysis capabilities at arbitrary locations within its data collection, transport, and aggregation facilities. Next, we present an approach to streaming collection of application performance data. We integrate these methods with a monitoring system used on large-scale computational platforms. Finally, we present a new approach for constructing durable transactional linked data structures that takes advantage of byte-addressable non-volatile memory technologies. Transactional data structures are building blocks of in-memory databases that are used by HPC monitoring systems to store and retrieve data efficiently. We evaluate the presented approaches on a series of case studies. The experiment results demonstrate the impact of our tools, while keeping the overhead in an acceptable margin

    Performance Observability and Monitoring of High Performance Computing with Microservices

    Get PDF
    Traditionally, High Performance Computing (HPC) softwarehas been built and deployed as bulk-synchronous, parallel executables based on the message-passing interface (MPI) programming model. The rise of data-oriented computing paradigms and an explosion in the variety of applications that need to be supported on HPC platforms have forced a re-think of the appropriate programming and execution models to integrate this new functionality. In situ workflows demarcate a paradigm shift in HPC software development methodologies enabling a range of new applications --- from user-level data services to machine learning (ML) workflows that run alongside traditional scientific simulations. By tracing the evolution of HPC software developmentover the past 30 years, this dissertation identifies the key elements and trends responsible for the emergence of coupled, distributed, in situ workflows. This dissertation's focus is on coupled in situ workflows involving composable, high-performance microservices. After outlining the motivation to enable performance observability of these services and why existing HPC performance tools and techniques can not be applied in this context, this dissertation proposes a solution wherein a set of techniques gathers, analyzes, and orients performance data from different sources to generate observability. By leveraging microservice components initially designed to build high performance data services, this dissertation demonstrates their broader applicability for building and deploying performance monitoring and visualization as services within an in situ workflow. The results from this dissertation suggest that: (1) integration of performance data from different sources is vital to understanding the performance of service components, (2) the in situ (online) analysis of this performance data is needed to enable the adaptivity of distributed components and manage monitoring data volume, (3) statistical modeling combined with performance observations can help generate better service configurations, and (4) services are a promising architecture choice for deploying in situ performance monitoring and visualization functionality. This dissertation includes previously published and co-authored material and unpublished co-authored material
    corecore