
    Resource management for extreme scale high performance computing systems in the presence of failures

    High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale applications across tens or hundreds of thousands of multicore processors. Unfortunately, as HPC systems continue to grow toward exascale, they experience an exponential growth in the number of failures occurring in the system. These failures reduce performance and increase energy use, diminishing the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer decreased performance and increased energy use because they are forced to share resources; in particular, contention among application threads for the shared last-level cache degrades performance. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute on these systems. To address these challenges, this dissertation proposes a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques for intelligently utilizing system resources through (a) the optimal scheduling of applications to HPC nodes and (b) the optimal configuration of fault resilience protocols. These techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for prediction. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation caused by both system failures and resource contention. We investigate how to better characterize and model the negative effects of system failures and of application co-location on large-scale HPC systems. Our analysis of application and system behavior also investigates: the interrelated effects of application network usage and fault resilience protocols; checkpoint interval selection and its sensitivity to system parameters for various checkpoint-based fault resilience protocols; and performance comparisons of promising fault resilience strategies for exascale-sized systems.
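    As a minimal illustration of the checkpoint-interval sensitivity analysis mentioned in this abstract, the sketch below computes the classical Young/Daly first-order approximation of the optimal checkpoint interval and the resulting expected waste. The formula is a standard result, and the parameter values are invented for the example; neither is necessarily the dissertation's own model.

```python
import math

def young_daly_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young/Daly approximation of the optimal checkpoint interval:
    tau_opt = sqrt(2 * C * MTBF), where C is the time to write one
    checkpoint and MTBF is the mean time between failures."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

def expected_waste_fraction(tau_s: float, checkpoint_cost_s: float,
                            mtbf_s: float) -> float:
    """First-order estimate of the fraction of time lost to checkpoint
    overhead plus expected re-computation after a failure."""
    return checkpoint_cost_s / tau_s + tau_s / (2.0 * mtbf_s)

if __name__ == "__main__":
    # Hypothetical parameters: 60 s per checkpoint, 6 h MTBF.
    C, MTBF = 60.0, 6 * 3600.0
    tau = young_daly_interval(C, MTBF)
    print(f"optimal interval ~ {tau / 60:.1f} min, "
          f"expected waste ~ {expected_waste_fraction(tau, C, MTBF):.1%}")
```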

    Quality of Service Driven Runtime Resource Allocation in Reconfigurable HPC Architectures

    Heterogeneous System Architectures (HSA) are gaining importance in the High Performance Computing (HPC) domain due to increasing computational requirements coupled with energy consumption concerns, which conventional CPU architectures fail to address effectively. Systems based on Field Programmable Gate Arrays (FPGAs) have recently emerged as an effective alternative to Graphics Processing Units (GPUs) for demanding HPC applications, although they lack the abstractions available in conventional CPU-based systems. This work tackles the problem of runtime resource management for a system that uses FPGA-based co-processors to accelerate multi-programmed HPC workloads. We propose a novel resource manager able to dynamically vary the number of FPGAs allocated to each of the jobs running in a multi-accelerator system, with the goal of meeting a given Quality of Service (QoS) target for the running jobs, expressed in terms of deadline or throughput. We implement the proposed resource manager in a commercial HPC system and evaluate its behavior with representative workloads.
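    A highly simplified sketch of the kind of dynamic allocation decision described above: each job reports its measured throughput against its QoS target, and one FPGA is shifted from the job with the largest surplus to the job with the largest deficit. The class and field names are assumptions made for illustration, not the paper's actual API.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    fpgas: int          # FPGAs currently assigned
    throughput: float   # measured items/s with the current assignment
    qos_target: float   # required items/s (the QoS metric)

def rebalance(jobs: list[Job]) -> None:
    """Move one FPGA per call from the job with the largest QoS surplus
    to the job with the largest QoS deficit (greedy, one device at a time)."""
    deficit = min(jobs, key=lambda j: j.throughput - j.qos_target)
    surplus = max(jobs, key=lambda j: j.throughput - j.qos_target)
    if deficit is surplus or deficit.throughput >= deficit.qos_target:
        return  # every job already meets its QoS target
    if surplus.fpgas > 1 and surplus.throughput > surplus.qos_target:
        surplus.fpgas -= 1
        deficit.fpgas += 1

jobs = [Job("cfd", 3, 120.0, 100.0), Job("mc", 1, 40.0, 80.0)]
rebalance(jobs)
print([(j.name, j.fpgas) for j in jobs])  # -> [('cfd', 2), ('mc', 2)]
```

    In a real manager, throughput would be re-measured after each reassignment before rebalancing again.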

    JMS: an open source workflow management system and web-based cluster front-end for high performance computing

    Complex computational pipelines are becoming a staple of modern scientific research. These pipelines are often resource intensive and require days of computing time. In such cases, it makes sense to run them on high performance computing (HPC) clusters, where they can take advantage of the aggregated resources of many powerful computers. In addition, researchers often want to integrate their workflows into their own web servers; in these cases, software is needed to manage the submission of jobs from the web interface to the cluster and to return the results once a job has finished executing. We have developed the Job Management System (JMS), a workflow management system and web interface for HPC. JMS provides users with a user-friendly web interface for creating complex workflows with multiple stages. It integrates this workflow functionality with the underlying cluster resource manager, the tool used to control and manage batch jobs on HPC clusters. As such, JMS combines workflow management functionality with cluster administration functionality.
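    To make the "web front-end submitting to a batch scheduler" idea concrete, here is a hedged sketch of how a web backend might render and submit a PBS/Torque-style batch script for one workflow stage. The script template, function names, and scheduler choice are assumptions for illustration; the abstract does not describe JMS's actual implementation at this level.

```python
import subprocess
import tempfile

PBS_TEMPLATE = """#!/bin/bash
#PBS -N {job_name}
#PBS -l nodes=1:ppn={cores}
#PBS -l walltime={walltime}
cd $PBS_O_WORKDIR
{command}
"""

def submit_stage(job_name: str, command: str, cores: int = 4,
                 walltime: str = "02:00:00") -> str:
    """Write a PBS/Torque batch script for one workflow stage, submit it
    with qsub, and return the scheduler-assigned job id."""
    script = PBS_TEMPLATE.format(job_name=job_name, cores=cores,
                                 walltime=walltime, command=command)
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(script)
        path = f.name
    result = subprocess.run(["qsub", path], capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()  # e.g. "12345.headnode"
```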

    A Runtime Framework for Energy Efficient HPC Systems Without a Priori Knowledge of Applications

    The rising computing demands of scientific endeavours often require the creation and management of High Performance Computing (HPC) systems for running experiments and processing vast amounts of data. These HPC systems generally operate at peak performance, consuming a large quantity of electricity, even though their workload varies over time. Understanding the behavioural patterns (i.e., phases) of HPC systems during their use is key to adjusting performance to resource demand and hence improving energy efficiency. In this paper, we describe (i) a method to detect phases of an HPC system based on its workload, and (ii) a partial phase recognition technique that works cooperatively with on-the-fly dynamic management. We implement a prototype that guides the use of energy-saving capabilities to demonstrate the benefits of our approach. Experimental results show the effectiveness of the phase detection method under real-life workloads and benchmarks. A comparison with baseline unmanaged execution shows that the partial phase recognition technique saves up to 15% of energy with less than 1% performance degradation.
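    A toy version of workload-based phase detection: consecutive resource-usage samples are grouped into the same phase while they remain within a similarity threshold of the phase's running mean. The feature vector, distance metric, and threshold value are assumptions for illustration, not the paper's exact method.

```python
import numpy as np

def detect_phases(samples: np.ndarray, threshold: float = 0.15) -> list[int]:
    """Assign each resource-usage sample (rows = time steps, columns =
    normalized metrics such as CPU, memory, network utilization) a phase id;
    a new phase starts when a sample drifts too far from the running mean
    of the current phase (mean absolute difference per metric)."""
    phase_ids, current = [], 0
    mean = samples[0].astype(float)
    count = 1
    for i, s in enumerate(samples):
        if i and np.mean(np.abs(s - mean)) > threshold:
            current += 1                       # phase change detected
            mean, count = s.astype(float), 1
        else:
            count += 1
            mean += (s - mean) / count         # update running mean
        phase_ids.append(current)
    return phase_ids

usage = np.array([[0.90, 0.80], [0.88, 0.82], [0.20, 0.10], [0.22, 0.12]])
print(detect_phases(usage))  # -> [0, 0, 1, 1]
```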

    Power Bounded Computing on Current & Emerging HPC Systems

    Power has become a critical constraint on the evolution of large-scale High Performance Computing (HPC) systems and commercial data centers. This constraint spans almost every level of computing technology, from IC chips all the way up to data centers, for physical, technical, and economic reasons. To cope with this reality, it is necessary to understand how the available or permissible power impacts the design and performance of emerging computer systems. For this reason, we propose power bounded computing and corresponding technologies to optimize performance on HPC systems with limited power budgets. This dissertation has multiple research objectives, centered on understanding the interaction between performance, power bounds, and a hierarchical power management strategy. First, we develop heuristics and application-aware power allocation methods to improve application performance on a single node. Second, we develop algorithms to coordinate power across nodes and components based on application characteristics and the cluster's power budget. Third, we investigate performance interference induced by hardware and power contention, and propose a contention-aware job scheduler that maximizes system throughput under a given power budget on node-sharing systems. Fourth, we extend the work to GPU-accelerated systems and workloads and develop an online dynamic performance and power management approach that meets both performance requirements and power efficiency goals. Power bounded computing improves performance scalability and power efficiency and decreases the operating costs of HPC systems and data centers. This dissertation opens up several new directions for research in power bounded computing to address the power challenges in HPC systems; the proposed power and resource management techniques provide new directions and guidelines for green exascale computing and other computing systems.
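    The following sketch illustrates one simple way a cluster-level power budget could be split across nodes in proportion to an application-derived "power sensitivity" weight, with each node clamped to its hardware limits. The weighting scheme, field names, and numbers are illustrative assumptions, not the dissertation's algorithms.

```python
def allocate_power(budget_w: float, nodes: dict[str, dict]) -> dict[str, float]:
    """Split a cluster power budget across nodes proportionally to each
    node's sensitivity weight (how much its application benefits from extra
    power), clamped to [min_w, max_w] per node. Budget left over after
    clamping is redistributed to nodes that are not yet at their cap."""
    total_weight = sum(n["weight"] for n in nodes.values())
    alloc = {}
    for name, n in nodes.items():
        share = budget_w * n["weight"] / total_weight
        alloc[name] = min(max(share, n["min_w"]), n["max_w"])
    leftover = budget_w - sum(alloc.values())
    open_nodes = [k for k in nodes if alloc[k] < nodes[k]["max_w"]]
    for name in open_nodes:
        extra = leftover / len(open_nodes)
        alloc[name] = min(alloc[name] + extra, nodes[name]["max_w"])
    return alloc

nodes = {
    "n1": {"weight": 3.0, "min_w": 80.0, "max_w": 250.0},  # compute-bound job
    "n2": {"weight": 1.0, "min_w": 80.0, "max_w": 250.0},  # memory-bound job
}
print(allocate_power(400.0, nodes))  # -> {'n1': 250.0, 'n2': 150.0}
```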

    The Hydro-Modeling Platform (HydroMP) - Enabling Cloud-Based Environmental Modeling Using Software-As-A-Service (SaaS) Cloud Computing

    Hydrological models have become important tools for water resources management. With rising demands on simulation precision and on the speed of decision support, models designed for sectoral applications are becoming outmoded, and the original mode of running massive numbers of schemes sequentially cannot meet real-time requirements, especially as finer discretization granularity and a broader research scope increase the computational load. Water management organizations are increasingly looking for a new generation of tools that allow integration across domains and can provide extensible computing resources to assist their decision-making processes. In response to this need, a hydro-modeling platform (HydroMP) based on cloud computing has been designed and implemented; it can be deployed on distributed and central HPC clusters and uses a resource balancer to manage load balancing. The platform dynamically integrates multiple models and computing resources (e.g., blade servers) so that models integrated into the platform obtain extensible computing capacity. A server hosting the HydroMP Web Service and its interfaces is connected to the HPC cluster and to the Internet, constituting the gateway for registered users. Any terminal (e.g., a decision-making system) can reference the HydroMP library and Web Service in its own system. Massive numbers of modeling schemes can be submitted by different users simultaneously, and terminals can retrieve simulation results from HydroMP in real time. Several key approaches and techniques are employed: i) a standard model-component wrapper that communicates with the platform through named pipes has been developed, and OpenMI-compliant model components can be integrated into this wrapper; ii) the API and Event-Handler interface provided by the HPC server, a task scheduler, and a calculation management table are employed to dispatch computing resources while controlling multiple concurrent scheme submissions; iii) an interface array (i.e., SchemesSubmit, StatusInquiry, GetResult) in the Web Service allows terminals to communicate with the platform; iv) an Oracle database is used to manage the massive volumes of model data, results, and model components. This paper describes the details of the design and implementation and presents a case study of the platform's application.
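    The abstract names three Web Service operations (SchemesSubmit, StatusInquiry, GetResult). The hedged sketch below shows how a terminal system might drive such an interface; the endpoint URL, payload fields, status values, and the assumption of a JSON-over-HTTP transport are invented for illustration, since the paper does not specify them.

```python
import time
import requests  # third-party HTTP client, used here for brevity

BASE_URL = "http://hydromp.example.org/api"  # hypothetical endpoint

def run_scheme(scheme: dict, poll_s: float = 5.0) -> dict:
    """Submit one modeling scheme, poll its status until the simulation
    finishes, then fetch and return the results. Operation names mirror the
    SchemesSubmit / StatusInquiry / GetResult interfaces from the paper."""
    scheme_id = requests.post(f"{BASE_URL}/SchemesSubmit", json=scheme,
                              timeout=30).json()["scheme_id"]
    while True:
        status = requests.get(f"{BASE_URL}/StatusInquiry",
                              params={"scheme_id": scheme_id},
                              timeout=30).json()["status"]
        if status in ("finished", "failed"):
            break
        time.sleep(poll_s)
    return requests.get(f"{BASE_URL}/GetResult",
                        params={"scheme_id": scheme_id}, timeout=30).json()
```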

    Extending a run-time resource management framework to support OpenCL and heterogeneous systems

    From mobile to High-Performance Computing (HPC) systems, performance and energy efficiency are becoming increasingly challenging requirements. In this regard, heterogeneous systems, composed of a general-purpose processor and one or more hardware accelerators, are emerging as affordable solutions. However, the effective exploitation of such platforms requires specific programming languages, such as OpenCL, and suitable run-time software layers. This work illustrates the extension of a run-time resource management (RTRM) framework to support the execution of OpenCL applications on systems featuring a multi-core CPU and multiple GPUs. Early results show how this solution benefits both the applications, in terms of performance, and the system, in terms of resource utilization, i.e., load balancing and thermal leveling across the computing devices.
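    To ground the multi-GPU OpenCL setting, here is a small device-discovery sketch using the pyopencl bindings (a choice made for this illustration, not something the paper states it uses). A run-time resource manager of the kind described above would build on such an enumeration to decide which devices each application may use.

```python
import pyopencl as cl  # Python bindings for OpenCL

def discover_gpus():
    """Enumerate every OpenCL GPU device on the machine, returning
    (platform name, device name, global memory in MiB) tuples that a
    run-time resource manager could use to build its device pool."""
    gpus = []
    for platform in cl.get_platforms():
        for dev in platform.get_devices(device_type=cl.device_type.GPU):
            gpus.append((platform.name, dev.name,
                         dev.global_mem_size // (1024 * 1024)))
    return gpus

if __name__ == "__main__":
    for plat, name, mem in discover_gpus():
        print(f"{plat}: {name} ({mem} MiB)")
```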

    Improving efficiency and resilience in large-scale computing systems through analytics and data-driven management

    Applications running in large-scale computing systems such as high performance computing (HPC) or cloud data centers are essential to many aspects of modern society, from weather forecasting to financial services. As the number and size of data centers increase with the growing computing demand, scalable and efficient management becomes crucial. However, data center management is a challenging task due to the complex interactions between applications, middleware, and hardware layers such as processors, network, and cooling units. This thesis claims that to improve robustness and efficiency of large-scale computing systems, significantly higher levels of automated support than what is available in today's systems are needed, and this automation should leverage the data continuously collected from various system layers. Towards this claim, we propose novel methodologies to automatically diagnose the root causes of performance and configuration problems and to improve efficiency through data-driven system management. We first propose a framework to diagnose software and hardware anomalies that cause undesired performance variations in large-scale computing systems. We show that by training machine learning models on resource usage and performance data collected from servers, our approach successfully diagnoses 98% of the injected anomalies at runtime in real-world HPC clusters with negligible computational overhead. We then introduce an analytics framework to address another major source of performance anomalies in cloud data centers: software misconfigurations. Our framework discovers and extracts configuration information from cloud instances such as containers or virtual machines. This is the first framework to provide comprehensive visibility into software configurations in multi-tenant cloud platforms, enabling systematic analysis for validating the correctness of software configurations. This thesis also contributes to the design of robust and efficient system management methods that leverage continuously monitored resource usage data. To improve performance under power constraints, we propose a workload- and cooling-aware power budgeting algorithm that distributes the available power among servers and cooling units in a data center, achieving up to 21% improvement in throughput per Watt compared to the state-of-the-art. Additionally, we design a network- and communication-aware HPC workload placement policy that reduces communication overhead by up to 30% in terms of hop-bytes compared to existing policies.
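    As a sketch of the "train on resource-usage data, classify anomalies at runtime" idea, here is a minimal supervised pipeline on synthetic telemetry. The feature set, the simulated "memory leak" anomaly, and the choice of a random forest are assumptions for illustration; the thesis's actual features, labels, and models are not given in the abstract.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic per-node telemetry windows: mean CPU util, mean memory util, network MB/s.
healthy = rng.normal([0.70, 0.50, 120.0], [0.05, 0.05, 10.0], size=(500, 3))
# Simulated "memory leak" anomaly: memory utilization drifts high.
leaky = rng.normal([0.70, 0.92, 120.0], [0.05, 0.03, 10.0], size=(500, 3))

X = np.vstack([healthy, leaky])
y = np.array([0] * 500 + [1] * 500)  # 0 = healthy, 1 = anomalous
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"holdout accuracy: {clf.score(X_te, y_te):.2%}")
```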