ANALYZING SUPERCOMPUTER UTILIZATION UNDER QUEUING WITH A PRIORITY FORMULA AND A STRICT BACKFILL POLICY
Supercomputers have become increasingly important in recent years due to the growing amount of data available and the increasing demand for quicker results in the scientific community. Since supercomputers carry a high cost to build and maintain, efficiency becomes more important to the owners, administrators, and users of these supercomputers. One important factor in determining the efficiency of a supercomputer is the scheduling of jobs that are submitted by users of the system. Previous work has dealt with optimizing the schedule on the system’s end while the users are blinded from the process. The work presented in this thesis investigates a scheduling system with a backfilling policy that is implemented on the Oak Ridge National Laboratory (ORNL) supercomputer Kraken. It attempts to outline the optimal methods from the user’s point of view in the scheduling system, and uses a simulation approach to optimize the priority formula. Normally the user has no idea which scheduling algorithms are used, but users at ORNL not only know how the scheduling works but can also view the current activity of the system. This gives an advantage to users who are willing to benefit from this knowledge by applying some elementary game theory to optimize their strategies. The results will show that competition benefits both the users, since they will be able to process their jobs sooner, and the system, since it will be better utilized at little expense to the administrators.
Queuing models and simulation have been well studied in almost all relevant aspects of the modern world. Higher efficiency is the goal of many researchers in several different fields; supercomputer queues are no different. Efficient use of the resources pleases the system administrator while benefiting the users with more timely results. Studying these queuing models through simulation should help all parties involved by increasing utilization. The simulation will be validated and the utilization improvement will be measured and reported. User-defined formulas will be developed for future users to help maximize utilization and minimize wait times.
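The backfilling idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes an EASY-style rule: a waiting job may jump ahead only if it does not delay the reserved start of the job at the head of the queue. This is not Kraken's actual scheduler or the thesis's priority formula; `Job`, `can_backfill`, `shadow_time`, and `extra_nodes` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int      # nodes requested
    walltime: int   # user-requested runtime, in minutes


def can_backfill(job: Job, now: int, free_nodes: int,
                 shadow_time: int, extra_nodes: int) -> bool:
    """EASY-style backfill test (an assumed rule, not the thesis's policy).

    shadow_time: reserved start time of the job at the head of the queue.
    extra_nodes: nodes that stay free at shadow_time even after the head
                 job has started.
    """
    if job.nodes > free_nodes:
        return False  # does not fit on the currently idle nodes
    finishes_in_time = now + job.walltime <= shadow_time
    fits_in_spare = job.nodes <= extra_nodes
    # Either condition guarantees the head job's reservation is not delayed.
    return finishes_in_time or fits_in_spare


short_job = Job("short", nodes=4, walltime=30)
long_job = Job("long", nodes=4, walltime=120)
print(can_backfill(short_job, now=0, free_nodes=8, shadow_time=60, extra_nodes=0))
print(can_backfill(long_job, now=0, free_nodes=8, shadow_time=60, extra_nodes=0))
```

Under this rule a user who requests a shorter walltime or fewer nodes is more likely to be backfilled, which is the kind of strategic choice the abstract refers to.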
An Elastic Scheduling Algorithm For Resource Co-Allocation Based on System Generated Predictions With Priority
Resource co-allocation is used to execute multi-site jobs in large-scale computing environments in a secure, faultless, and transparent manner. More precisely, multiple resources are allocated to different jobs while taking the time parameter into account. Here we use the scheduling queue and resource co-allocation to reduce turnaround time through system-generated predictions based on priority. Existing work schedules resource co-allocation requests from user runtime estimates, which are usually very imprecise. The proposed work schedules resource co-allocation requests based on system-generated predictions, obtained through a discovery service, and on priority (fairness and user experience), using a topological sorting technique. System-generated predictions are better parameters than user runtime estimates for resource co-allocation scheduling because they reduce the scheduling time through the proxy ser based discovery service technique. The proposed work considers priorities such as advance reservation, system-generated predictions, negotiation, co-scheduling, and policy (SLA, price, trust) for resource co-allocation. Experimental data confirm that system-generated predictions are better than user runtime estimates for resource co-allocation scheduling. End users do not need knowledge of the grid or its resources; they only submit jobs to the portal. The proposed portal handles resource co-allocation automatically, quickly, and efficiently.
Scheduling, Characterization and Prediction of HPC Workloads for Distributed Computing Environments
As High Performance Computing (HPC) has grown considerably and is expected to grow even more, effective resource management for distributed computing systems is motivated more than ever. As the computational workloads grow in quantity, it is becoming more crucial to apply efficient resource management and workload scheduling to use resources efficiently while keeping the computational performance reasonably good. The problem of efficiently scheduling workloads on resources while meeting performance standards is hard. Additionally, non-clairvoyance of job dimensions makes resource management even harder in real-world scenarios. Our research methodology investigates the scheduling problem for HPC and researches the challenges of deploying the scheduling in real-world scenarios using state-of-the-art machine learning and data science techniques. To this end, this Ph.D. dissertation makes the following core contributions: a) We perform a theoretical analysis of space-sharing, non-preemptive scheduling: we studied this scheduling problem and proposed scheduling algorithms with polynomial computation time. We also proved constant upper bounds for the performance of these algorithms. b) We studied the sensitivity of scheduling algorithms to the accuracy of runtime and devised a meta-learning approach to estimate prediction accuracy for newly submitted jobs to the HPC system. c) We studied the runtime prediction problem for HPC applications. For this purpose, we studied the distribution of available public workloads and proposed two different solutions that can predict multi-modal distributions: switching state-space models and Mixture Density Networks. d) We studied the effectiveness of recent recurrent neural network models for CPU usage trace prediction for individual VM traces as well as aggregate CPU usage traces.
In this dissertation, we explore solutions to improve the performance of scheduling workloads on distributed systems. We begin by looking at the problem from the theoretical perspective. Modeling the problem mathematically, we first propose a scheduling algorithm that finds a constant approximation of the optimal solution for the problem in polynomial time. We prove that the performance of the algorithm (average completion time) is a constant approximation of the performance of the optimal schedule. We next look at the problem in real-world scenarios. Considering High-Performance Computing (HPC) workload computing environments as the most similar real-world equivalent of our mathematical model, we explore the problem of predicting application runtime. We propose an algorithm to handle the existing uncertainties in the real world and demonstrate its effectiveness in terms of response time and resource utilization. After looking at the uncertainty problem, we focus on improving the accuracy of existing prediction approaches for HPC application runtime. We propose two solutions, one based on Kalman filters and one based on deep density mixture networks. We showcase the effectiveness of our prediction approaches by comparing them with previous prediction approaches in terms of prediction accuracy and impact on scheduling performance. Finally, we focus on predicting resource usage for individual applications during their execution. We explore the application of recurrent neural networks for predicting resource usage of applications deployed on individual virtual machines. To validate our proposed models and solutions, we performed extensive trace-driven simulation and measured the effectiveness of our approaches.
Provisioning Spot Market Cloud Resources to Create Cost-effective Virtual Clusters
Infrastructure-as-a-Service providers are offering their unused resources in
the form of variable-priced virtual machines (VMs), known as "spot instances",
at prices significantly lower than their standard fixed-priced resources. To
lease spot instances, users specify a maximum price they are willing to pay per
hour and VMs will run only when the current price is lower than the user's bid.
This paper proposes a resource allocation policy that addresses the problem of
running deadline-constrained compute-intensive jobs on a pool composed
solely of spot instances, while exploiting variations in price and performance
to run applications in a fast and economical way. Our policy relies on job
runtime estimations to decide what are the best types of VMs to run each job
and when jobs should run. Several estimation methods are evaluated and
compared, using trace-based simulations, which take real price variation traces
obtained from Amazon Web Services as input, as well as an application trace
from the Parallel Workload Archive. Results demonstrate the effectiveness of
running computational jobs on spot instances, at a fraction (up to 60% lower)
of the price they would normally cost on fixed-priced resources.
Comment: 14 pages, 4 figures, 11th International Conference on Algorithms and
Architectures for Parallel Processing (ICA3PP-11); Lecture Notes in Computer
Science, Vol. 7016, 201
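The bidding rule described in this abstract (a VM runs only while the current price is lower than the user's bid) can be sketched as follows. The price trace is made up for illustration, `simulate_bid` is a hypothetical helper, and billing at the current market price rather than at the bid is an assumption of this sketch, not something the abstract states.

```python
def simulate_bid(prices, bid):
    """Return (hours_running, total_cost) for an hourly spot price trace.

    prices: hypothetical per-hour spot prices, in dollars.
    bid:    maximum hourly price the user is willing to pay.
    """
    hours = 0
    cost = 0.0
    for price in prices:
        if price < bid:      # the VM runs only when the price is below the bid
            hours += 1
            cost += price    # assumption: billed at the market price, not the bid
    return hours, cost


trace = [0.03, 0.05, 0.04, 0.08, 0.03]   # made-up prices, not AWS data
hours, cost = simulate_bid(trace, bid=0.05)
print(hours, round(cost, 2))
```

A higher bid buys more running hours at a higher average price, which is the trade-off a deadline-constrained policy has to weigh against its runtime estimates.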
Using Unused: Non-Invasive Dynamic FaaS Infrastructure with HPC-Whisk
Modern HPC workload managers and their careful tuning contribute to the high
utilization of HPC clusters. However, due to inevitable uncertainty it is
impossible to completely avoid node idleness. Although such idle slots are
usually too short for any HPC job, they are too long to ignore. The
Function-as-a-Service (FaaS) paradigm promisingly fills this gap, and can be a
good match, as typical FaaS functions last seconds, not hours. Here we show how
to build a FaaS infrastructure on idle nodes in an HPC cluster in such a way
that it does not affect the performance of the HPC jobs significantly. We
dynamically adapt to a changing set of idle physical machines, by integrating
open-source software Slurm and OpenWhisk.
We designed and implemented a prototype solution that allowed us to cover up
to 90% of the idle time slots on a 50k-core cluster that runs production
workloads.
Supercomputing Frontiers
This open access book constitutes the refereed proceedings of the 6th Asian Supercomputing Conference, SCFA 2020, which was planned to be held in February 2020, but unfortunately, the physical conference was cancelled due to the COVID-19 pandemic. The 8 full papers presented in this book were carefully reviewed and selected from 22 submissions. They cover a range of topics including file systems, memory hierarchy, HPC cloud platform, container image configuration workflow, large-scale applications, and scheduling
Matching Renewable Energy Supply and Demand in Green Datacenters
In this paper, we propose GreenSlot, a scheduler for parallel batch jobs in a datacenter powered by a photovoltaic solar array and the electrical grid (as a backup). GreenSlot predicts the amount of solar energy that will be available in the near future, and schedules the workload to maximize the green energy consumption while meeting the jobs’ deadlines. If grid energy must be used to avoid deadline violations, the scheduler selects times when it is cheap. Our results for both scientific computing workloads and data processing workloads demonstrate that GreenSlot can increase solar energy consumption by up to 117% and decrease energy cost by up to 39%, compared to conventional schedulers. Based on these positive results, we conclude that green datacenters and green-energy-aware scheduling can have a significant role in building a more sustainable IT ecosystem
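The idea of placing a job where predicted solar energy is highest while still meeting its deadline can be sketched with a brute-force search over start slots. This is an illustration under simplifying assumptions (one job, fixed power draw, a given forecast) and not GreenSlot's actual algorithm; `best_start` and all of its parameters are hypothetical.

```python
def best_start(solar_forecast, duration, deadline, job_power):
    """Pick the start slot that maximizes solar energy used by one job.

    solar_forecast: predicted solar power per time slot (hypothetical units).
    duration:       job length in slots.
    deadline:       slot index by which the job must have finished.
    job_power:      power the job draws in every slot it runs.
    Returns (start_slot, green_energy), or (None, -1) if no start is feasible.
    """
    best, best_green = None, -1
    for start in range(0, deadline - duration + 1):
        # In each slot the job can use at most the solar power available.
        green = sum(min(job_power, solar_forecast[t])
                    for t in range(start, start + duration))
        if green > best_green:
            best, best_green = start, green
    return best, best_green


forecast = [0, 0, 2, 5, 6, 4, 1, 0]   # made-up forecast for eight slots
print(best_start(forecast, duration=3, deadline=8, job_power=4))
print(best_start(forecast, duration=3, deadline=4, job_power=4))
```

With a loose deadline the job lands in the sunniest window; with a tight one it must start earlier and draw the remainder from the grid.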
ADEPT Runtime/Scalability Predictor in support of Adaptive Scheduling
A job scheduler determines the order and duration of the allocation of resources, e.g. CPU, to the tasks waiting to run on a computer. Round-Robin and First-Come-First-Serve are examples of algorithms for making such resource allocation decisions. Parallel job schedulers make resource allocation decisions for applications that need multiple CPU cores, on computers consisting of many CPU cores connected by different interconnects. An adaptive parallel scheduler is a parallel scheduler that is capable of adjusting its resource allocation decisions based on the current resource usage and demand. Adaptive parallel schedulers that decide the number of CPU cores to allocate to a parallel job provide more flexibility and can improve performance significantly for both local and grid job scheduling compared to non-adaptive schedulers. A major reason why adaptive schedulers are not yet used in practice is the lack of knowledge of the scalability curves of applications and the high cost of existing white-box approaches to scalability prediction. We show that a runtime and scalability prediction tool can be developed that meets three requirements: accuracy comparable to white-box methods, applicability, and robustness. Applicability means depending only on knowledge that is feasible to gain in a production environment. Robustness addresses anomalous behaviour and unreliable predictions. We present ADEPT, a speedup and runtime prediction tool that satisfies all three requirements, both for a single problem size and across different problem sizes of a parallel application. ADEPT is also capable of handling anomalies and judging the reliability of its predictions. We demonstrate these capabilities using experiments with MPI and OpenMP implementations of NAS benchmarks and seven real applications.
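The two textbook algorithms named at the start of this abstract can be compared with a small single-CPU sketch. It assumes all tasks arrive at time 0 and ignores context-switch cost; the function names and the burst times are illustrative and unrelated to ADEPT itself.

```python
from collections import deque


def fcfs(bursts):
    """Completion time of each task under First-Come-First-Serve."""
    t, done = 0, []
    for burst in bursts:
        t += burst
        done.append(t)
    return done


def round_robin(bursts, quantum):
    """Completion time of each task under Round-Robin with a fixed quantum."""
    remaining = list(bursts)
    done = [0] * len(bursts)
    queue = deque(range(len(bursts)))
    t = 0
    while queue:
        i = queue.popleft()
        run = min(quantum, remaining[i])
        t += run
        remaining[i] -= run
        if remaining[i] > 0:
            queue.append(i)   # not finished: go to the back of the queue
        else:
            done[i] = t
    return done


bursts = [5, 2, 3]                      # made-up CPU bursts
print(fcfs(bursts))                     # [5, 7, 10]
print(round_robin(bursts, quantum=2))   # [10, 4, 9]
```

The short task finishes much earlier under Round-Robin, while the long task finishes later, which shows how the choice of policy changes who waits.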