520 research outputs found

    Real-time and dynamic fault-tolerant scheduling for scientific workflows in clouds

    Get PDF
    Cloud computing has become a popular technology for executing scientific workflows. However, with a large number of hosts and virtual machines (VMs) being deployed, the cloud resource failures, such as the permanent failure of hosts (HPF), the transient failure of hosts (HTF), and the transient failure of VMs (VMTF), bring the service reliability problem. Therefore, fault tolerance for time-consuming scientific workflows is highly essential in the cloud. However, existing fault-tolerant (FT) approaches consider only one or two above failure types and easily neglect the others, especially for the HTF. This paper proposes a Real-time and dynamic Fault-tolerant Scheduling (ReadyFS) algorithm for scientific workflow execution in a cloud, which guarantees deadline constraints and improves resource utilization even in the presence of any resource failure. Specifically, we first introduce two FT mechanisms, i.e., the replication with delay execution (RDE) and the checkpointing with delay execution (CDE), to cope with HPF and VMTF, simultaneously. Additionally, the rescheduling (ReSC) is devised to tackle the HTF that affects the resource availability of the entire cloud datacenter. Then, the resource adjustment (RA) strategy, including the resource scaling-up (RS-Up) and the resource scaling-down (RS-Down), is used to adjust resource demands and improve resource utilization dynamically. Finally, the ReadyFS algorithm is presented to schedule real-time scientific workflows by combining all the above FT mechanisms with RA strategy. We conduct the performance evaluation with real-world scientific workflows and compare ReadyFS with five vertical comparison algorithms and three horizontal comparison algorithms. Simulation results confirm that ReadyFS is indeed able to guarantee the fault tolerance of scientific workflow execution and improve cloud resource utilization

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)

    Get PDF
    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016) Timisoara, Romania. February 8-11, 2016.The PhD Symposium was a very good opportunity for the young researchers to share information and knowledge, to present their current research, and to discuss topics with other students in order to look for synergies and common research topics. The idea was very successful and the assessment made by the PhD Student was very good. It also helped to achieve one of the major goals of the NESUS Action: to establish an open European research network targeting sustainable solutions for ultrascale computing aiming at cross fertilization among HPC, large scale distributed systems, and big data management, training, contributing to glue disparate researchers working across different areas and provide a meeting ground for researchers in these separate areas to exchange ideas, to identify synergies, and to pursue common activities in research topics such as sustainable software solutions (applications and system software stack), data management, energy efficiency, and resilience.European Cooperation in Science and Technology. COS

    Multi-dependency and time based resource scheduling algorithm for scientific applications in cloud computing

    Get PDF
    Workflow scheduling is one of the significant issues for scientific applications among virtual machine migration, database management, security, performance, fault tolerance, server consolidation, etc. In this paper, existing time-based scheduling algorithms, such as first come first serve (FCFS), min–min, max–min, and minimum completion time (MCT), along with dependency-based scheduling algorithm MaxChild have been considered. These time-based scheduling algorithms only compare the burst time of tasks. Based on the burst time, these schedulers, schedule the sub-tasks of the application on suitable virtual machines according to the scheduling criteria. During this process, not much attention was given to the proper utilization of the resources. A novel dependency and time-based scheduling algorithm is proposed that considers the parent to child (P2C) node dependencies, child to parent node dependencies, and the time of different tasks in the workflows. The proposed P2C algorithm emphasizes proper utilization of the resources and overcomes the limitations of these time-based schedulers. The scientific applications, such as CyberShake, Montage, Epigenomics, Inspiral, and SIPHT, are represented in terms of the workflow. The tasks can be represented as the nodes, and relationships between the tasks can be represented as the dependencies in the workflows. All the results have been validated by using the simulation-based environment created with the help of the WorkflowSim simulator for the cloud environment. It has been observed that the proposed approach outperforms the mentioned time and dependency-based scheduling algorithms in terms of the total execution time by efficiently utilizing the resources.peer-reviewe

    Energy and performance-optimized scheduling of tasks in distributed cloud and edge computing systems

    Get PDF
    Infrastructure resources in distributed cloud data centers (CDCs) are shared by heterogeneous applications in a high-performance and cost-effective way. Edge computing has emerged as a new paradigm to provide access to computing capacities in end devices. Yet it suffers from such problems as load imbalance, long scheduling time, and limited power of its edge nodes. Therefore, intelligent task scheduling in CDCs and edge nodes is critically important to construct energy-efficient cloud and edge computing systems. Current approaches cannot smartly minimize the total cost of CDCs, maximize their profit and improve quality of service (QoS) of tasks because of aperiodic arrival and heterogeneity of tasks. This dissertation proposes a class of energy and performance-optimized scheduling algorithms built on top of several intelligent optimization algorithms. This dissertation includes two parts, including background work, i.e., Chapters 3–6, and new contributions, i.e., Chapters 7–11. 1) Background work of this dissertation. Chapter 3 proposes a spatial task scheduling and resource optimization method to minimize the total cost of CDCs where bandwidth prices of Internet service providers, power grid prices, and renewable energy all vary with locations. Chapter 4 presents a geography-aware task scheduling approach by considering spatial variations in CDCs to maximize the profit of their providers by intelligently scheduling tasks. Chapter 5 presents a spatio-temporal task scheduling algorithm to minimize energy cost by scheduling heterogeneous tasks among CDCs while meeting their delay constraints. Chapter 6 gives a temporal scheduling algorithm considering temporal variations of revenue, electricity prices, green energy and prices of public clouds. 2) Contributions of this dissertation. Chapter 7 proposes a multi-objective optimization method for CDCs to maximize their profit, and minimize the average loss possibility of tasks by determining task allocation among Internet service providers, and task service rates of each CDC. A simulated annealing-based bi-objective differential evolution algorithm is proposed to obtain an approximate Pareto optimal set. A knee solution is selected to schedule tasks in a high-profit and high-quality-of-service way. Chapter 8 formulates a bi-objective constrained optimization problem, and designs a novel optimization method to cope with energy cost reduction and QoS improvement. It jointly minimizes both energy cost of CDCs, and average response time of all tasks by intelligently allocating tasks among CDCs and changing task service rate of each CDC. Chapter 9 formulates a constrained bi-objective optimization problem for joint optimization of revenue and energy cost of CDCs. It is solved with an improved multi-objective evolutionary algorithm based on decomposition. It determines a high-quality trade-off between revenue maximization and energy cost minimization by considering CDCs’ spatial differences in energy cost while meeting tasks’ delay constraints. Chapter 10 proposes a simulated annealing-based bees algorithm to find a close-to-optimal solution. Then, a fine-grained spatial task scheduling algorithm is designed to minimize energy cost of CDCs by allocating tasks among multiple green clouds, and specifies running speeds of their servers. Chapter 11 proposes a profit-maximized collaborative computation offloading and resource allocation algorithm to maximize the profit of systems and guarantee that response time limits of tasks are met in cloud-edge computing systems. A single-objective constrained optimization problem is solved by a proposed simulated annealing-based migrating birds optimization. This dissertation evaluates these algorithms, models and software with real-life data and proves that they improve scheduling precision and cost-effectiveness of distributed cloud and edge computing systems

    Constructing Reliable Computing Environments on Top of Amazon EC2 Spot Instances

    Get PDF
    Cloud provider Amazon Elastic Compute Cloud (EC2) gives access to resources in the form of virtual servers, also known as instances. EC2 spot instances (SIs) offer spare computational capacity at steep discounts compared to reliable and fixed price on-demand instances. The drawback, however, is that the delay in acquiring spots can be incredible high. Moreover, SIs may not always be available as they can be reclaimed by EC2 at any given time, with a two-minute interruption notice. In this paper, we propose a multi-workflow scheduling algorithm, allied with a container migration-based mechanism, to dynamically construct and readjust virtual clusters on top of non-reserved EC2 pricing model instances. Our solution leverages recent findings on performance and behavior characteristics of EC2 spots. We conducted simulations by submitting real-life workflow applications, constrained by user-defined deadline and budget quality of service (QoS) parameters. The results indicate that our solution improves the rate of completed tasks by almost 20%, and the rate of completed workflows by at least 30%, compared with other state-of-the-art algorithms, for a worse-case scenarioinfo:eu-repo/semantics/publishedVersio

    Data-Driven Intelligent Scheduling For Long Running Workloads In Large-Scale Datacenters

    Get PDF
    Cloud computing is becoming a fundamental facility of society today. Large-scale public or private cloud datacenters spreading millions of servers, as a warehouse-scale computer, are supporting most business of Fortune-500 companies and serving billions of users around the world. Unfortunately, modern industry-wide average datacenter utilization is as low as 6% to 12%. Low utilization not only negatively impacts operational and capital components of cost efficiency, but also becomes the scaling bottleneck due to the limits of electricity delivered by nearby utility. It is critical and challenge to improve multi-resource efficiency for global datacenters. Additionally, with the great commercial success of diverse big data analytics services, enterprise datacenters are evolving to host heterogeneous computation workloads including online web services, batch processing, machine learning, streaming computing, interactive query and graph computation on shared clusters. Most of them are long-running workloads that leverage long-lived containers to execute tasks. We concluded datacenter resource scheduling works over last 15 years. Most previous works are designed to maximize the cluster efficiency for short-lived tasks in batch processing system like Hadoop. They are not suitable for modern long-running workloads of Microservices, Spark, Flink, Pregel, Storm or Tensorflow like systems. It is urgent to develop new effective scheduling and resource allocation approaches to improve efficiency in large-scale enterprise datacenters. In the dissertation, we are the first of works to define and identify the problems, challenges and scenarios of scheduling and resource management for diverse long-running workloads in modern datacenter. They rely on predictive scheduling techniques to perform reservation, auto-scaling, migration or rescheduling. It forces us to pursue and explore more intelligent scheduling techniques by adequate predictive knowledges. We innovatively specify what is intelligent scheduling, what abilities are necessary towards intelligent scheduling, how to leverage intelligent scheduling to transfer NP-hard online scheduling problems to resolvable offline scheduling issues. We designed and implemented an intelligent cloud datacenter scheduler, which automatically performs resource-to-performance modeling, predictive optimal reservation estimation, QoS (interference)-aware predictive scheduling to maximize resource efficiency of multi-dimensions (CPU, Memory, Network, Disk I/O), and strictly guarantee service level agreements (SLA) for long-running workloads. Finally, we introduced a large-scale co-location techniques of executing long-running and other workloads on the shared global datacenter infrastructure of Alibaba Group. It effectively improves cluster utilization from 10% to averagely 50%. It is far more complicated beyond scheduling that involves technique evolutions of IDC, network, physical datacenter topology, storage, server hardwares, operating systems and containerization. We demonstrate its effectiveness by analysis of newest Alibaba public cluster trace in 2017. We are the first of works to reveal the global view of scenarios, challenges and status in Alibaba large-scale global datacenters by data demonstration, including big promotion events like Double 11 . Data-driven intelligent scheduling methodologies and effective infrastructure co-location techniques are critical and necessary to pursue maximized multi-resource efficiency in modern large-scale datacenter, especially for long-running workloads
    • …
    corecore