285,273 research outputs found

    A dynamic prediction and monitoring framework for distributed applications

    Get PDF
    This research builds on an application performance prediction and characterisation environment (known as PACE), whose aim is to characterise the performance-critical elements of both an application and its target execution environment and deduce from this model a predicted behaviour of the application prior to its execution. Underlying the research presented in this thesis are a number of themes: the tasks involved in the performance characterisation of applications and how this might be semi- automated: the level of abstraction at which these characterisations are performed in order to maintain a sufficient predictive accuracy: the automated refinement of these characterisations from runtime performance data: the extension of both the target programming languages and the class of application at which these techniques are aimed. In this thesis a number of novel extensions to PACE are described. These include: a new transaction-based performance characterisation language that provides a flexible framework for describing broader classes of application; a performance monitoring framework (based on an extension to the OpenGroup’s Application Response Measurement (ARM) standard) for the runtime monitoring of an application's data-dependent components and the automated refinement of performance models: an adaptation of this performance characterisation for the prediction of Java applications. These contributions are demonstrated through their application to a number of scientific kernels. This thesis also documents how these predictive results can be used in a real-time distributed runtime management environment, and also how these techniques can be applied to non-scientific codes, in particular to an IBM request-driven distributed web services demonstrator

    Statistical Regression Methods for GPGPU Design Space Exploration

    Get PDF
    General Purpose Graphics Processing Units (GPGPUs) have leveraged the performance and power efficiency of today\u27s heterogeneous systems to usher in a new era of innovation in high-performance scientific computing. These systems can offer significantly high performance for massively parallel applications; however, their resources may be wasted due to inefficient tuning strategies. Previous application tuning studies pre-dominantly employ low-level, architecture specific tuning which can make the performance modeling task difficult and less generic. In this research, we explore the GPGPU design space featuring the memory hierarchy for application tuning using regression-based performance prediction framework and rank the design space based on the runtime performance. The regression-based framework models the GPGPU device computations using algorithm characteristics such as the number of floating-point operations, total number of bytes, and hardware parameters pertaining to the GPGPU memory hierarchy as predictor variables. The computation component regression models are developed using several instrumented executions of the algorithms that include a range of FLOPS-to-Byte requirement. We validate our model with a Synchronous Iterative Algorithm (SIA) set that includes Spiking Neural Networks (SNNs) and Anisotropic Diffusion Filtering (ADF) for massive images. The highly parallel nature of the above mentioned algorithms, in addition to their wide range of communication-to-computation complexities, makes them good candidates for this study. A hierarchy of implementations for the SNNs and ADF is constructed and ranked using the regression-based framework. We further illustrate the Synchronous Iterative GPGPU Execution (SIGE) model on the GPGPU-augmented Palmetto Cluster. The performance prediction framework maps appropriate design space implementation for 4 out of 5 case studies used in this research. The final goal of this research is to establish the efficacy of the regression-based framework to accurately predict the application kernel runtime, allowing developers to correctly rank their design space prior to the large-scale implementation

    Applying machine learning to improve simulations of a chaotic dynamical system using empirical error correction

    Full text link
    Dynamical weather and climate prediction models underpin many studies of the Earth system and hold the promise of being able to make robust projections of future climate change based on physical laws. However, simulations from these models still show many differences compared with observations. Machine learning has been applied to solve certain prediction problems with great success, and recently it's been proposed that this could replace the role of physically-derived dynamical weather and climate models to give better quality simulations. Here, instead, a framework using machine learning together with physically-derived models is tested, in which it is learnt how to correct the errors of the latter from timestep to timestep. This maintains the physical understanding built into the models, whilst allowing performance improvements, and also requires much simpler algorithms and less training data. This is tested in the context of simulating the chaotic Lorenz '96 system, and it is shown that the approach yields models that are stable and that give both improved skill in initialised predictions and better long-term climate statistics. Improvements in long-term statistics are smaller than for single time-step tendencies, however, indicating that it would be valuable to develop methods that target improvements on longer time scales. Future strategies for the development of this approach and possible applications to making progress on important scientific problems are discussed.Comment: 26p, 7 figures To be published in Journal of Advances in Modeling Earth System

    APACE: AlphaFold2 and advanced computing as a service for accelerated discovery in biophysics

    Full text link
    The prediction of protein 3D structure from amino acid sequence is a computational grand challenge in biophysics, and plays a key role in robust protein structure prediction algorithms, from drug discovery to genome interpretation. The advent of AI models, such as AlphaFold, is revolutionizing applications that depend on robust protein structure prediction algorithms. To maximize the impact, and ease the usability, of these novel AI tools we introduce APACE, AlphaFold2 and advanced computing as a service, a novel computational framework that effectively handles this AI model and its TB-size database to conduct accelerated protein structure prediction analyses in modern supercomputing environments. We deployed APACE in the Delta supercomputer, and quantified its performance for accurate protein structure predictions using four exemplar proteins: 6AWO, 6OAN, 7MEZ, and 6D6U. Using up to 200 ensembles, distributed across 50 nodes in Delta, equivalent to 200 A100 NVIDIA GPUs, we found that APACE is up to two orders of magnitude faster than off-the-shelf AlphaFold2 implementations, reducing time-to-solution from weeks to minutes. This computational approach may be readily linked with robotics laboratories to automate and accelerate scientific discovery.Comment: 7 pages, 4 figures, 2 table

    Simulation of the performance of complex data-intensive workflows

    Get PDF
    PhD ThesisRecently, cloud computing has been used for analytical and data-intensive processes as it offers many attractive features, including resource pooling, on-demand capability and rapid elasticity. Scientific workflows use these features to tackle the problems of complex data-intensive applications. Data-intensive workflows are composed of many tasks that may involve large input data sets and produce large amounts of data as output, which typically runs in highly dynamic environments. However, the resources should be allocated dynamically depending on the demand changes of the work flow, as over-provisioning increases the cost and under-provisioning causes Service Level Agreement (SLA) violation and poor Quality of Service (QoS). Performance prediction of complex workflows is a necessary step prior to the deployment of the workflow. Performance analysis of complex data-intensive workflows is a challenging task due to the complexity of their structure, diversity of big data, and data dependencies, in addition to the required examination to the performance and challenges associated with running their workflows in the real cloud. In this thesis, a solution is explored to address these challenges, using a Next Generation Sequencing (NGS) workflow pipeline as a case study, which may require hundreds/ thousands of CPU hours to process a terabyte of data. We propose a methodology to model, simulate and predict runtime and the number of resources used by the complex data-intensive workflows. One contribution of our simulation methodology is that it provides an ability to extract the simulation parameters (e.g., MIPs and BW values) that are required for constructing a training set and a fairly accurate prediction of the run time for input for cluster sizes much larger than ones used in training of the prediction model. The proposed methodology permits the derivation of run time prediction based on historical data from the provenance fi les. We present the run time prediction of the complex workflow by considering different cases of its running in the cloud such as execution failure and library deployment time. In case of failure, the framework can apply the prediction only partially considering the successful parts of the pipeline, in the other case the framework can predict with or without considering the time to deploy libraries. To further improve the accuracy of prediction, we propose a simulation model that handles I/O contention

    GPGPU Reliability Analysis: From Applications to Large Scale Systems

    Get PDF
    Over the past decade, GPUs have become an integral part of mainstream high-performance computing (HPC) facilities. Since applications running on HPC systems are usually long-running, any error or failure could result in significant loss in scientific productivity and system resources. Even worse, since HPC systems face severe resilience challenges as progressing towards exascale computing, it is imperative to develop a better understanding of the reliability of GPUs. This dissertation fills this gap by providing an understanding of the effects of soft errors on the entire system and on specific applications. To understand system-level reliability, a large-scale study on GPU soft errors in the field is conducted. The occurrences of GPU soft errors are linked to several temporal and spatial features, such as specific workloads, node location, temperature, and power consumption. Further, machine learning models are proposed to predict error occurrences on GPU nodes so as to proactively and dynamically turning on/off the costly error protection mechanisms based on prediction results. To understand the effects of soft errors at the application level, an effective fault-injection framework is designed aiming to understand the reliability and resilience characteristics of GPGPU applications. This framework is effective in terms of reducing the tremendous number of fault injection locations to a manageable size while still preserving remarkable accuracy. This framework is validated with both single-bit and multi-bit fault models for various GPGPU benchmarks. Lastly, taking advantage of the proposed fault-injection framework, this dissertation develops a hierarchical approach to understanding the error resilience characteristics of GPGPU applications at kernel, CTA, and warp levels. In addition, given that some corrupted application outputs due to soft errors may be acceptable, we present a use case to show how to enable low-overhead yet reliable GPU computing for GPGPU applications

    A Taxonomy of Workflow Management Systems for Grid Computing

    Full text link
    With the advent of Grid and application technologies, scientists and engineers are building more and more complex applications to manage and process large data sets, and execute scientific experiments on distributed resources. Such application scenarios require means for composing and executing complex workflows. Therefore, many efforts have been made towards the development of workflow management systems for Grid computing. In this paper, we propose a taxonomy that characterizes and classifies various approaches for building and executing workflows on Grids. We also survey several representative Grid workflow systems developed by various projects world-wide to demonstrate the comprehensiveness of the taxonomy. The taxonomy not only highlights the design and engineering similarities and differences of state-of-the-art in Grid workflow systems, but also identifies the areas that need further research.Comment: 29 pages, 15 figure
    • …
    corecore