486 research outputs found

    The Inter-cloud meta-scheduling

    Inter-cloud is a recently emerging approach that expands cloud elasticity. By facilitating an adaptable setting, it aims to realize scalable resource provisioning that enables a diversity of cloud user requirements to be handled efficiently. This study’s contribution is the optimization of inter-cloud job execution performance using meta-scheduling concepts. This includes the development of the inter-cloud meta-scheduling (ICMS) framework, the ICMS optimal schemes and the SimIC toolkit. The ICMS model is an architectural strategy for managing and scheduling user services in virtualized, dynamically inter-linked clouds. This is achieved by the development of a model that includes a set of algorithms, namely the Service-Request, Service-Distribution, Service-Availability and Service-Allocation algorithms. Together with the optimal resource management schemes, these offer the novel functionalities of the ICMS: the message exchange implements the job distribution method, the VM deployment offers the VM management features, and the local resource management system handles the management of the local cloud schedulers. The resulting system offers great flexibility by facilitating a lightweight resource management methodology while at the same time handling the heterogeneity of different clouds through advanced service level agreement coordination. Experimental results are encouraging: the proposed ICMS model improves the performance of service distribution for a variety of criteria such as service execution times, makespan, turnaround times, utilization levels and energy consumption rates for various inter-cloud entities, e.g. users, hosts and VMs. For example, ICMS improves the performance of a non-meta-brokering inter-cloud by 3%, while ICMS with full optimal schemes achieves a 9% improvement for the same configurations. 
The whole experimental platform is implemented in the inter-cloud simulation toolkit (SimIC) developed by the author, which is a discrete event simulation framework

    Market-Based Scheduling in Distributed Computing Systems

    In distributed computing systems (e.g., in cluster and grid computing), the available resources can become scarce. Here, market mechanisms have the potential to coordinate resource demand and supply through suitable incentive mechanisms and thus to increase the economic efficiency of the overall system. Using four specific application scenarios, this work addresses the question of how market mechanisms for distributed computing systems should be designed

    OCCL: a Deadlock-free Library for GPU Collective Communication

    Various distributed deep neural network (DNN) training technologies lead to increasingly complicated use of collective communications on GPU. The deadlock-prone collectives on GPU force researchers to guarantee that collectives are enqueued in a consistent order on each GPU to prevent deadlocks. In complex distributed DNN training scenarios, manual hardcoding is the only practical way for deadlock prevention, which poses significant challenges to the development of artificial intelligence. This paper presents OCCL, which is, to the best of our knowledge, the first deadlock-free collective communication library for GPU supporting dynamic decentralized preemption and gang-scheduling for collectives. Leveraging the preemption opportunity of collectives on GPU, OCCL dynamically preempts collectives in a decentralized way via the deadlock-free collective execution framework and allows dynamic decentralized gang-scheduling via the stickiness adjustment scheme. With the help of OCCL, researchers no longer have to struggle to get all GPUs to launch collectives in a consistent order to prevent deadlocks. We implement OCCL with several optimizations and integrate OCCL with a distributed deep learning framework OneFlow. Experimental results demonstrate that OCCL achieves comparable or better latency and bandwidth for collectives compared to NCCL, the state-of-the-art. When used in distributed DNN training, OCCL can improve the peak training throughput by up to 78% compared to statically sequenced NCCL, while introducing overheads of less than 6.5% across various distributed DNN training approaches

    Decentralized load balancing in heterogeneous computational grids

    With the rapid development of high-speed wide-area networks and powerful yet low-cost computational resources, grid computing has emerged as an attractive computing paradigm. The space limitations of conventional distributed systems can thus be overcome, making it possible to fully exploit under-utilised computing resources in every region around the world for distributed jobs. Workload and resource management are key grid services at the service level of grid software infrastructure, where issues of load balancing represent a common concern for most grid infrastructure developers. Although these are established research areas in parallel and distributed computing, grid computing environments present a number of new challenges, including large-scale computing resources, heterogeneous computing power, the autonomy of organisations hosting the resources, uneven job-arrival patterns among grid sites, considerable job transfer costs, and considerable communication overhead involved in capturing the load information of sites. This dissertation focuses on designing solutions for load balancing in computational grids that can cater for the unique characteristics of grid computing environments. To explore the solution space, we conducted a survey of load-balancing solutions, which enabled discussion and comparison of existing approaches, and the delimiting and exploration of a portion of the solution space. A system model was developed to study the load-balancing problems in computational grid environments. In particular, we developed three decentralised algorithms for job dispatching and load balancing that use only partial information: the desirability-aware load-balancing algorithm (DA), the performance-driven desirability-aware load-balancing algorithm (P-DA), and the performance-driven region-based load-balancing algorithm (P-RB). All three are scalable, dynamic, decentralised and sender-initiated. 
We conducted extensive simulation studies to analyse the performance of our load-balancing algorithms. Simulation results showed that the algorithms significantly outperform preexisting decentralised algorithms that are relevant to this research
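
As an illustration of what "decentralised, sender-initiated, partial information" can mean in practice, the following is a minimal sketch of a generic threshold-and-probe policy. It is not the DA, P-DA or P-RB algorithm from the dissertation, whose details the abstract does not give. The function name `dispatch`, the `threshold` and `probes` parameters, and the use of queue length as the load measure are all assumptions made for this example.

```python
import random


def dispatch(loads, sender, threshold, probes, rng):
    """Pick the node that should run a job arriving at `sender`.

    Hypothetical sender-initiated policy: the sender keeps the job unless
    its own queue length exceeds `threshold`; otherwise it probes a few
    randomly chosen peers (partial information only) and hands the job to
    the least loaded one, provided that peer is less loaded than itself.
    """
    if loads[sender] <= threshold:
        return sender  # not overloaded: run locally, no probing needed

    peers = [i for i in range(len(loads)) if i != sender]
    sampled = rng.sample(peers, min(probes, len(peers)))
    if not sampled:
        return sender  # no peers available to probe

    best = min(sampled, key=lambda i: loads[i])
    # Transfer only if it actually helps; otherwise keep the job.
    return best if loads[best] < loads[sender] else sender


# Usage: queue lengths of four sites; a job arrives at the overloaded site 0.
loads = [8, 1, 3, 6]
target = dispatch(loads, sender=0, threshold=4, probes=3, rng=random.Random(0))
loads[target] += 1
```

The sender never needs a global view: it consults at most `probes` peers per job, which is one simple way to bound the communication overhead that the abstract lists as a challenge.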

    DESIGN AND EVALUATION OF RESOURCE ALLOCATION AND JOB SCHEDULING ALGORITHMS ON COMPUTATIONAL GRIDS

    The grid, an infrastructure for resource sharing, has shown its importance in many scientific applications requiring tremendously high computational power. Grid computing enables sharing, selection and aggregation of resources for solving complex and large-scale scientific problems. Grid computing, whose resources are distributed, heterogeneous and dynamic in nature, introduces a number of fascinating issues in resource management. Grid scheduling is the key issue in a grid environment: the system must meet the functional requirements of heterogeneous domains such as user, application, and network, which are sometimes conflicting in nature. Moreover, the system must satisfy non-functional requirements like reliability, efficiency, performance, effective resource utilization, and scalability. Thus, the overall aim of this research is to introduce new grid scheduling algorithms for resource allocation as well as for job scheduling, enabling a highly efficient and effective utilization of the resources in executing various applications. The four prime aspects of this work are: firstly, a model of the grid scheduling problem for a dynamic grid computing environment; secondly, development of a new web-based simulator (SyedWSim), enabling grid users to conduct a statistical analysis of grid workload traces and providing a realistic basis for experimentation with resource allocation and job scheduling algorithms on a grid; thirdly, proposal of a new grid resource allocation method of optimal computational cost, evaluated using synthetic and real workload traces against other allocation methods; and finally, proposal of some new job scheduling algorithms of optimal performance considering parameters like waiting time, turnaround time, response time, bounded slowdown, completion time and stretch time. 
The issue is not only to develop new algorithms, but also to evaluate them on an experimental computational grid, using synthetic and real workload traces, along with the other existing job scheduling algorithms. Experimental evaluation confirmed that the proposed grid scheduling algorithms possess a high degree of optimality in performance, efficiency and scalability
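
Several of the evaluation parameters named in the abstract above have widely used definitions in the job-scheduling literature. The sketch below computes three of them from per-job timestamps, plus the makespan of a whole trace. The `Job` record, the function names and the 10-second threshold `tau` are assumptions for this example; the thesis may define its metrics differently.

```python
from dataclasses import dataclass


@dataclass
class Job:
    submit: float  # time the job entered the queue
    start: float   # time the job began executing
    finish: float  # time the job completed


def waiting_time(job):
    """Time spent queued before execution started."""
    return job.start - job.submit


def turnaround_time(job):
    """Total time from submission to completion."""
    return job.finish - job.submit


def bounded_slowdown(job, tau=10.0):
    """Turnaround relative to run time, with run time floored at `tau`
    so that very short jobs do not dominate the average."""
    run = job.finish - job.start
    return max(1.0, turnaround_time(job) / max(run, tau))


def makespan(jobs):
    """Span from the earliest submission to the latest completion."""
    return max(j.finish for j in jobs) - min(j.submit for j in jobs)


# Usage on a two-job trace (times in seconds).
trace = [Job(submit=0, start=5, finish=25), Job(submit=2, start=25, finish=30)]
for job in trace:
    print(waiting_time(job), turnaround_time(job), bounded_slowdown(job))
print(makespan(trace))
```

The floor `tau` is the reason the slowdown is called bounded: without it, a job that runs for a fraction of a second after a short wait would report an extreme slowdown.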

    Archer: A Community Distributed Computing Infrastructure for Computer Architecture Research and Education

    This paper introduces Archer, a community-based computing resource for computer architecture research and education. The Archer infrastructure integrates virtualization and batch scheduling middleware to deliver high-throughput computing resources aggregated from resources distributed across wide-area networks and owned by different participating entities in a seamless manner. The paper discusses the motivations leading to the design of Archer, describes its core middleware components, and presents an analysis of the functionality and performance of a prototype wide-area deployment running a representative computer architecture simulation workload. Comment: 11 pages, 2 figures. Describes the Archer project, http://archer-project.or

    Failure Analysis in Next-Generation Critical Cellular Communication Infrastructures

    The advent of communication technologies marks a transformative phase in critical infrastructure construction, where the meticulous analysis of failures becomes paramount in achieving the fundamental objectives of continuity, security, and availability. This survey enriches the discourse on failures, failure analysis, and countermeasures in the context of next-generation critical communication infrastructures. Through an exhaustive examination of existing literature, we discern and categorize prominent research orientations, namely those focusing on resource depletion, security vulnerabilities, and system availability concerns. We also analyze constructive countermeasures tailored to address identified failure scenarios and their prevention. Furthermore, the survey emphasizes the need for standardization in addressing failures related to Artificial Intelligence (AI) within sixth-generation (6G) networks, taking into account the intelligence envisioned for the 6G network architecture. By identifying new challenges and delineating future research directions, this survey can help guide stakeholders toward unexplored territories, fostering innovation and resilience in critical communication infrastructure development and failure prevention

    CILP: Co-simulation based imitation learner for dynamic resource provisioning in cloud computing environments

    Intelligent Virtual Machine (VM) provisioning is central to cost- and resource-efficient computation in cloud computing environments. As bootstrapping VMs is time-consuming, a key challenge for latency-critical tasks is to predict future workload demands to provision VMs proactively. However, existing AI-based solutions tend not to consider holistically all crucial aspects such as provisioning overheads, heterogeneous VM costs and Quality of Service (QoS) of the cloud system. To address this, we propose a novel method, called CILP, that formulates the VM provisioning problem as two sub-problems of prediction and optimization, where the provisioning plan is optimized based on predicted workload demands. CILP leverages a neural network as a surrogate model to predict future workload demands with a co-simulated digital-twin of the infrastructure to compute QoS scores. We extend the neural network to also act as an imitation learner that dynamically decides the optimal VM provisioning plan. A transformer-based neural model reduces training and inference overheads, while our novel two-phase decision-making loop facilitates informed provisioning decisions. Crucially, we address limitations of prior work by including resource utilization, deployment costs and provisioning overheads to inform the provisioning decisions in our imitation learning framework. Experiments with three public benchmarks demonstrate that CILP gives up to 22% higher resource utilization, 14% higher QoS scores and 44% lower execution costs compared to the current online and offline optimization-based state-of-the-art methods

    LBSim: A simulation system for dynamic load-balancing algorithms for distributed systems.

    In a distributed system consisting of autonomous computational units, the total computational power of all the units needs to be utilized efficiently by applying suitable load-balancing policies. To accomplish this, a large number of load-balancing algorithms have been proposed in the literature. To facilitate the performance study of each of these load-balancing strategies, simulation has been widely used. However, comparison of the load-balancing algorithms becomes difficult if a different simulator is used for each case. There have been few studies on generalized simulation of load-balancing algorithms in distributed systems. Most simulation systems address experiments for some particular load-balancing algorithms, whereas this thesis aims to study simulation for a broad range of algorithms. After the characterization of the distributed systems and the extraction of the common components of load-balancing algorithms, a simulation system, called LBSim, has been built. LBSim is a generalized event-driven simulator for studying load-balancing algorithms with coarse-grained applications running on distributed networks of autonomous processing nodes. In order to verify that the simulation model can represent actual systems reasonably well, we have validated LBSim both qualitatively and quantitatively. As a simulation toolkit, the LBSim programming libraries can be reused to implement load-balancing algorithms for the purpose of performance measurement and analysis from different perspectives. As a framework for algorithm simulation, LBSim can be extended with moderate effort by following object-oriented methodology, to meet any new requirements that may arise in the future. Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2004 .D8. Source: Masters Abstracts International, Volume: 43-05, page: 1747. Adviser: A. K. Aggarwal. Thesis (M.Sc.)--University of Windsor (Canada), 2004