    Extending Capability and Implementing a Web Interface for the XALT Software Monitoring Tool

    As high-performance computing centers evolve in terms of hardware, software, and user base, monitoring and managing such systems requires specialized tools. The tool discussed in this thesis is XALT, a collaborative effort between the National Institute for Computational Sciences and the Texas Advanced Computing Center. XALT is designed to track link-time and job-level information for applications that are compiled and executed on any Linux cluster, workstation, or high-end supercomputer. The key objectives of this work are to extend the existing functionality of XALT and to implement a real-time web portal for easily visualizing the tracked data. A prototype is developed to track function calls resolved by external libraries, which helps software management. The web portal generates reports and metrics that would improve efficiency and effectiveness for an extensive community of stakeholders, including users, support organizations, and development teams. In addition, we discuss use cases of interest to center support staff and researchers, such as identifying users based on given counters and generating provenance reports. This work details the opportunities and challenges in pushing XALT further towards becoming a complete package.

    Making Speculative Scheduling Robust to Incomplete Data

    In this work, we study the robustness of speculative scheduling to data incompleteness. Speculative scheduling has made it possible to incorporate future types of applications into the design of HPC schedulers, specifically applications whose runtime is not perfectly known but can be modeled with probability distributions. Preliminary studies show the importance of speculative scheduling in dealing with stochastic applications when the application runtime model is completely known. In this work, we show how one can extract enough information even from incomplete behavioral data for a given HPC application so that speculative scheduling still performs well. Specifically, we show that for synthetic runtimes that follow usual probability distributions such as truncated normal or exponential, we can extract enough data from as few as 10 previous runs to be within 5% of the solution that has exact information. For real traces of applications, the performance with 10 data points varies with the application (within 20% of the full-knowledge solution) but converges fast (5% with 100 previous samples). Finally, a side effect of this study is to show the importance of the theoretical results obtained on continuous probability distributions for speculative scheduling. Indeed, we observe that the solutions for such distributions are more robust to incomplete data than the solutions for discrete distributions.

    Utilizing Software Analytics to Guide Software Development

    Modern software systems often produce vast amounts of software usage data. Previous work, however, has indicated that such data is often left unutilized. This leaves a gap for methods and practices that put the data to use. The objective of this thesis is to determine and test concrete methods for utilizing software usage data and to learn what use cases and benefits can be achieved via such methods. The study consists of two interconnected parts. First, a semi-structured literature review is conducted to identify methods and use cases for software usage data. Second, a subset of the identified methods is tested in a case study to determine how developers and managers experience the methods. We found that a wide range of methods exists for utilizing software usage data. Via these methods, a wide range of software development-related use cases can be fulfilled. However, in practice, apart from debugging purposes, software usage data is largely left unutilized. Furthermore, developers and managers share a positive attitude towards employing methods of utilizing software usage data. In conclusion, software usage data has a lot of potential, and developers and managers are interested in putting methods for utilizing it to use. Furthermore, the information available via these methods is difficult to replace. In other words, methods for utilizing software usage data can provide irreplaceable information that is relevant and useful for both managers and developers. Therefore, practitioners should consider introducing methods for utilizing software usage data in their development practices.

    Comprehensive Resource Use Monitoring for HPC Systems with TACC Stats

    No full text

    Infrastructure for Performance Monitoring and Analysis of Systems and Applications

    The growth of high-performance computing (HPC) systems increases the complexity of understanding resource utilization, system management, and performance issues. HPC performance monitoring tools need to collect information at both the application and system levels to yield a complete performance picture. Existing approaches limit the ability of users to do meaningful analysis on an actionable timescale. Efficient infrastructures are required to support large-scale systems performance data analysis in both run-time troubleshooting and post-run processing modes. In this dissertation, we present methods to fill these gaps in the infrastructure for HPC performance monitoring and analysis. First, we enhance the architecture of a monitoring system to integrate streaming analysis capabilities at arbitrary locations within its data collection, transport, and aggregation facilities. Next, we present an approach to streaming collection of application performance data. We integrate these methods with a monitoring system used on large-scale computational platforms. Finally, we present a new approach for constructing durable transactional linked data structures that takes advantage of byte-addressable non-volatile memory technologies. Transactional data structures are building blocks of in-memory databases that are used by HPC monitoring systems to store and retrieve data efficiently. We evaluate the presented approaches in a series of case studies. The experimental results demonstrate the impact of our tools while keeping the overhead within an acceptable margin.