16 research outputs found

    Software Roadmap to Plug and Play Petaflop/s


    Tools for analyzing parallel I/O

    Parallel application I/O performance often does not meet user expectations. Additionally, slight access pattern modifications may lead to significant changes in performance due to complex interactions between hardware and software. These issues call for sophisticated tools to capture, analyze, understand, and tune application I/O. In this paper, we highlight advances in monitoring tools to help address these issues. We also describe best practices, identify issues in measurement and analysis, and provide practical approaches to translate parallel I/O analysis into actionable outcomes for users, facility operators, and researchers.
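
    As a small illustration of how sensitive I/O performance can be to access patterns, the Python micro-benchmark below times contiguous versus strided writes. This is a minimal sketch, not taken from the paper; the file name, block size, and stride are arbitrary assumptions.

        import os
        import time

        # Illustrative micro-benchmark (not from the paper): compare contiguous
        # and strided writes to show how a small access-pattern change can shift
        # I/O cost. File name, block size, and stride are assumptions.
        PATH = "io_pattern_test.dat"
        BLOCK = 1 << 20          # 1 MiB per write
        COUNT = 64               # 64 writes in total
        STRIDE = 4 * BLOCK       # gap between strided writes

        def timed_writes(strided):
            buf = b"\0" * BLOCK
            start = time.perf_counter()
            with open(PATH, "wb") as f:
                for i in range(COUNT):
                    if strided:
                        f.seek(i * STRIDE)   # leave holes between blocks
                    f.write(buf)
                f.flush()
                os.fsync(f.fileno())         # include the cost of reaching the device
            return time.perf_counter() - start

        print(f"contiguous: {timed_writes(False):.3f} s")
        print(f"strided:    {timed_writes(True):.3f} s")
        os.remove(PATH)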

    Capturing the impact of external interference on HPC application performance

    HPC applications are large software packages with high computation and storage requirements. To meet these requirements, the architectures of supercomputers are continuously evolving and their capabilities are continuously increasing. Present-day supercomputers have achieved petaflops of computational power by utilizing thousands to millions of compute cores, connected through specialized communication networks, and are equipped with petabytes of storage using a centralized I/O subsystem. While fulfilling the high resource demands of HPC applications, such a design also entails its own challenges. Applications running on these systems own the computation resources exclusively, but share the communication interconnect and the I/O subsystem with other concurrently running applications. Simultaneous access to these shared resources causes contention and inter-application interference, leading to degraded application performance.

    Inter-application interference is one of the sources of run-to-run variation. While other sources of variation, such as operating system jitter, have been investigated before, this doctoral thesis specifically focuses on inter-application interference and studies it from the perspective of an application. Variation in execution time not only causes uncertainty and affects user expectations (especially during performance analysis), but also causes suboptimal usage of HPC resources. Therefore, this thesis aims to evaluate inter-application interference, establish trends among applications under contention, and approximate the impact of external influences on the runtime of an application.

    To this end, this thesis first presents a method to correlate the performance of applications running side-by-side. The method divides the runtime of a system into globally synchronized, fine-grained time slices for which application performance data is recorded separately. The evaluation of the method demonstrates that correlating application performance data can identify inter-application interference. The thesis further uses the method to study I/O interference and shows that file access patterns are a significant factor in determining the interference potential of an application. This thesis also presents a technique to estimate the impact of external influences on an application run. The technique introduces the concept of intrinsic performance characteristics to cluster similar application execution segments. Anomalies in the cluster are the result of external interference. An evaluation with several benchmarks shows high accuracy in estimating the impact of interference from a single application run. The contributions of this thesis will help establish interference trends and devise interference mitigation techniques. Similarly, estimating the impact of external interference will restore user expectations and help performance analysts separate application performance from external influence.
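
    A minimal sketch of the time-slice idea described above, assuming a sample format of (application, timestamp, bytes transferred) and a fixed one-second slice width (both assumptions, not the thesis implementation): performance samples are bucketed into globally synchronized slices, and strongly negatively correlated per-slice throughput between two applications is flagged as a possible interference pair.

        from collections import defaultdict
        from statistics import correlation  # Python 3.10+

        SLICE = 1.0  # seconds per globally synchronized time slice (assumed)

        def slice_series(samples, t0, t1):
            """samples: iterable of (app_name, timestamp, bytes_transferred)."""
            n_slices = int((t1 - t0) / SLICE) + 1
            series = defaultdict(lambda: [0.0] * n_slices)
            for app, ts, nbytes in samples:
                series[app][int((ts - t0) / SLICE)] += nbytes
            return series

        def interference_candidates(series, threshold=-0.5):
            # A strong negative correlation between two applications' per-slice
            # throughput (one drops while the other rises) hints at contention.
            apps = sorted(series)
            flagged = []
            for i, a in enumerate(apps):
                for b in apps[i + 1:]:
                    r = correlation(series[a], series[b])
                    if r < threshold:
                        flagged.append((a, b, r))
            return flagged

        # Toy example: two applications sampled over roughly 3.5 seconds.
        samples = [("A", 0.2, 100), ("A", 1.3, 20), ("A", 2.1, 110), ("A", 3.4, 30),
                   ("B", 0.5, 10), ("B", 1.1, 90), ("B", 2.6, 15), ("B", 3.2, 95)]
        print(interference_candidates(slice_series(samples, 0.0, 3.5)))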

    Holistic energy and failure aware workload scheduling in Cloud datacenters

    The global uptake of Cloud computing has attracted increased interest within both academia and industry, resulting in the formation of large-scale and complex distributed systems. This has led to increased failure occurrence within computing systems, which has a substantial negative impact on system performance and on the task reliability perceived by users. Such systems also consume vast quantities of power, resulting in significant operational costs for providers. Virtualization, a technology commonly deployed within Cloud datacenters, can enable flexible scheduling of virtual machines to maximize system reliability and energy efficiency. However, existing work addresses these two objectives separately, providing limited understanding of the explicit trade-offs involved in building dependable and energy-efficient compute infrastructure. In this paper, we propose two failure-aware, energy-efficient scheduling algorithms that exploit the holistic operational characteristics of the Cloud datacenter, comprising the cooling unit, the computing infrastructure, and server failures. By comprehensively modeling the power and failure profiles of a Cloud datacenter, we propose the workload scheduling algorithms Ella-W and Ella-B, capable of reducing cooling and compute energy while minimizing the impact of system failures. A novel overall metric that combines energy efficiency and reliability is proposed to characterize the performance of the various algorithms. We evaluate our algorithms against Random, MaxUtil, TASA, MTTE, and OBFIT under various system conditions of failure prediction accuracy and workload intensity. Evaluation results demonstrate that Ella-W reduces energy usage by 29.5% and improves the task completion rate by 3.6%, while Ella-B reduces energy usage by 32.7% with no degradation in task completion rate.
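
    To make the energy/reliability trade-off concrete, the sketch below shows a simplified host-selection rule that weighs predicted extra energy against predicted failure risk. The scoring formula, field names, and the weight alpha are illustrative assumptions and do not reproduce Ella-W or Ella-B.

        from dataclasses import dataclass

        @dataclass
        class Host:
            name: str
            idle_power_w: float   # power draw when idle
            busy_power_w: float   # power draw at full utilization
            utilization: float    # current CPU utilization in [0, 1]
            failure_prob: float   # predicted probability of failing during the task

        def placement_score(host, task_util, task_hours, alpha=0.7):
            # Lower is better: weighted sum of extra energy (kWh) and failure risk.
            extra_power_w = (host.busy_power_w - host.idle_power_w) * task_util
            energy_kwh = extra_power_w * task_hours / 1000.0
            return alpha * energy_kwh + (1.0 - alpha) * host.failure_prob

        def pick_host(hosts, task_util, task_hours):
            feasible = [h for h in hosts if h.utilization + task_util <= 1.0]
            return min(feasible, key=lambda h: placement_score(h, task_util, task_hours))

        hosts = [
            Host("h1", 120, 300, 0.40, 0.02),
            Host("h2", 100, 260, 0.10, 0.15),
            Host("h3", 110, 280, 0.70, 0.01),
        ]
        print(pick_host(hosts, task_util=0.25, task_hours=2.0).name)  # "h3" with these numbers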

    The Fifth Workshop on HPC Best Practices: File Systems and Archives

    The workshop on High Performance Computing (HPC) Best Practices on File Systems and Archives was the fifth in a series sponsored jointly by the Department of Energy (DOE) Office of Science and the DOE National Nuclear Security Administration. The workshop gathered technical and management experts in the operation of HPC file systems and archives from around the world. Attendees identified and discussed best practices in use at their facilities and documented their findings for the DOE and the HPC community in this report.

    Laboratory Directed Research and Development Program FY 2008 Annual Report


    Cyber-Human Systems, Space Technologies, and Threats

    CYBER-HUMAN SYSTEMS, SPACE TECHNOLOGIES, AND THREATS is our eighth textbook in a series covering the world of UASs / CUAS / UUVs / SPACE. Other textbooks in our series are Space Systems Emerging Technologies and Operations; Drone Delivery of CBNRECy – DEW Weapons: Emerging Threats of Mini-Weapons of Mass Destruction and Disruption (WMDD); Disruptive Technologies with applications in Airline, Marine, Defense Industries; Unmanned Vehicle Systems & Operations On Air, Sea, Land; Counter Unmanned Aircraft Systems Technologies and Operations; Unmanned Aircraft Systems in the Cyber Domain: Protecting USA’s Advanced Air Assets, 2nd edition; and Unmanned Aircraft Systems (UAS) in the Cyber Domain: Protecting USA’s Advanced Air Assets, 1st edition. Our previous seven titles have received considerable global recognition in the field (Nichols & Carter, 2022; Nichols et al., 2021; Nichols, R. K., et al., 2020; Nichols, R., et al., 2020; Nichols, R., et al., 2019; Nichols, R. K., 2018; Nichols, R. K., et al., 2022).