486 research outputs found
The Inter-cloud meta-scheduling
Inter-cloud is a recently emerging approach that expands cloud elasticity. By facilitating an adaptable setting, it purposes at the realization of a scalable resource provisioning that enables a diversity of cloud user requirements to be handled efficiently. This studyâs contribution is in the inter-cloud performance optimization of job executions using metascheduling concepts. This includes the development of the inter-cloud meta-scheduling (ICMS) framework, the ICMS optimal schemes and the SimIC toolkit. The ICMS model is an architectural strategy for managing and scheduling user services in virtualized dynamically inter-linked clouds. This is achieved by the development of a model that includes a set of algorithms, namely the Service-Request, Service-Distribution, Service-Availability and Service-Allocation algorithms. These along with resource management optimal schemes offer the novel functionalities of the ICMS where the message exchanging implements the job distributions method, the VM deployment offers the VM management features and the local resource management system details the management of the local cloud schedulers. The generated system offers great flexibility by facilitating a lightweight resource management methodology while at the same time handling the heterogeneity of different clouds through advanced service level agreement coordination. Experimental results are productive as the proposed ICMS model achieves enhancement of the performance of service distribution for a variety of criteria such as service execution times, makespan, turnaround times, utilization levels and energy consumption rates for various inter-cloud entities, e.g. users, hosts and VMs. For example, ICMS optimizes the performance of a non-meta-brokering inter-cloud by 3%, while ICMS with full optimal schemes achieves 9% optimization for the same configurations. The whole experimental platform is implemented into the inter-cloud Simulation toolkit (SimIC) developed by the author, which is a discrete event simulation framework
Market-Based Scheduling in Distributed Computing Systems
In verteilten Rechensystemen (bspw. im Cluster und Grid Computing) kann eine Knappheit der zur VerfĂŒgung stehenden Ressourcen auftreten. Hier haben Marktmechanismen das Potenzial, Ressourcenbedarf und -angebot durch geeignete Anreizmechanismen zu koordinieren und somit die ökonomische Effizienz des Gesamtsystems zu steigern. Diese Arbeit beschĂ€ftigt sich anhand vier spezifischer Anwendungsszenarien mit der Frage, wie Marktmechanismen fĂŒr verteilte Rechensysteme ausgestaltet sein sollten
OCCL: a Deadlock-free Library for GPU Collective Communication
Various distributed deep neural network (DNN) training technologies lead to
increasingly complicated use of collective communications on GPU. The
deadlock-prone collectives on GPU force researchers to guarantee that
collectives are enqueued in a consistent order on each GPU to prevent
deadlocks. In complex distributed DNN training scenarios, manual hardcoding is
the only practical way for deadlock prevention, which poses significant
challenges to the development of artificial intelligence. This paper presents
OCCL, which is, to the best of our knowledge, the first deadlock-free
collective communication library for GPU supporting dynamic decentralized
preemption and gang-scheduling for collectives. Leveraging the preemption
opportunity of collectives on GPU, OCCL dynamically preempts collectives in a
decentralized way via the deadlock-free collective execution framework and
allows dynamic decentralized gang-scheduling via the stickiness adjustment
scheme. With the help of OCCL, researchers no longer have to struggle to get
all GPUs to launch collectives in a consistent order to prevent deadlocks. We
implement OCCL with several optimizations and integrate OCCL with a distributed
deep learning framework OneFlow. Experimental results demonstrate that OCCL
achieves comparable or better latency and bandwidth for collectives compared to
NCCL, the state-of-the-art. When used in distributed DNN training, OCCL can
improve the peak training throughput by up to 78% compared to statically
sequenced NCCL, while introducing overheads of less than 6.5% across various
distributed DNN training approaches
Decentralized load balancing in heterogeneous computational grids
With the rapid development of high-speed wide-area networks and powerful yet low-cost computational resources, grid computing has emerged as an attractive computing paradigm. The space limitations of conventional distributed systems can thus be overcome, to fully exploit the resources of under-utilised computing resources in every region around the world for distributed jobs. Workload and resource management are key grid services at the service level of grid software infrastructure, where issues of load balancing represent a common concern for most grid infrastructure developers. Although these are established research areas in parallel and distributed computing, grid computing environments present a number of new challenges, including large-scale computing resources, heterogeneous computing power, the autonomy of organisations hosting the resources, uneven job-arrival pattern among grid sites, considerable job transfer costs, and considerable communication overhead involved in capturing the load information of sites. This dissertation focuses on designing solutions for load balancing in computational grids that can cater for the unique characteristics of grid computing environments. To explore the solution space, we conducted a survey for load balancing solutions, which enabled discussion and comparison of existing approaches, and the delimiting and exploration of the apportion of solution space. A system model was developed to study the load-balancing problems in computational grid environments. In particular, we developed three decentralised algorithms for job dispatching and load balancingâusing only partial information: the desirability-aware load balancing algorithm (DA), the performance-driven desirability-aware load-balancing algorithm (P-DA), and the performance-driven region-based load-balancing algorithm (P-RB). All three are scalable, dynamic, decentralised and sender-initiated. We conducted extensive simulation studies to analyse the performance of our load-balancing algorithms. Simulation results showed that the algorithms significantly outperform preexisting decentralised algorithms that are relevant to this research
DESIGN AND EVALUATION OF RESOURCE ALLOCATION AND JOB SCHEDULING ALGORITHMS ON COMPUTATIONAL GRIDS
Grid, an infrastructure for resource sharing, currently has shown its importance in
many scientific applications requiring tremendously high computational power. Grid
computing enables sharing, selection and aggregation of resources for solving
complex and large-scale scientific problems. Grids computing, whose resources are
distributed, heterogeneous and dynamic in nature, introduces a number of fascinating
issues in resource management. Grid scheduling is the key issue in grid environment
in which its system must meet the functional requirements of heterogeneous domains,
which are sometimes conflicting in nature also, like user, application, and network.
Moreover, the system must satisfy non-functional requirements like reliability,
efficiency, performance, effective resource utilization, and scalability. Thus, overall
aim of this research is to introduce new grid scheduling algorithms for resource
allocation as well as for job scheduling for enabling a highly efficient and effective
utilization of the resources in executing various applications.
The four prime aspects of this work are: firstly, a model of the grid scheduling
problem for dynamic grid computing environment; secondly, development of a new
web based simulator (SyedWSim), enabling the grid users to conduct a statistical
analysis of grid workload traces and provides a realistic basis for experimentation in
resource allocation and job scheduling algorithms on a grid; thirdly, proposal of a new
grid resource allocation method of optimal computational cost using synthetic and
real workload traces with respect to other allocation methods; and finally, proposal of
some new job scheduling algorithms of optimal performance considering parameters
like waiting time, turnaround time, response time, bounded slowdown, completion
time and stretch time. The issue is not only to develop new algorithms, but also to
evaluate them on an experimental computational grid, using synthetic and real
workload traces, along with the other existing job scheduling algorithms.
Experimental evaluation confirmed that the proposed grid scheduling algorithms
possess a high degree of optimality in performance, efficiency and scalability
Archer: A Community Distributed Computing Infrastructure for Computer Architecture Research and Education
This paper introduces Archer, a community-based computing resource for
computer architecture research and education. The Archer infrastructure
integrates virtualization and batch scheduling middleware to deliver
high-throughput computing resources aggregated from resources distributed
across wide-area networks and owned by different participating entities in a
seamless manner. The paper discusses the motivations leading to the design of
Archer, describes its core middleware components, and presents an analysis of
the functionality and performance of a prototype wide-area deployment running a
representative computer architecture simulation workload.Comment: 11 pages, 2 figures. Describes the Archer project,
http://archer-project.or
Failure Analysis in Next-Generation Critical Cellular Communication Infrastructures
The advent of communication technologies marks a transformative phase in
critical infrastructure construction, where the meticulous analysis of failures
becomes paramount in achieving the fundamental objectives of continuity,
security, and availability. This survey enriches the discourse on failures,
failure analysis, and countermeasures in the context of the next-generation
critical communication infrastructures. Through an exhaustive examination of
existing literature, we discern and categorize prominent research orientations
with focuses on, namely resource depletion, security vulnerabilities, and
system availability concerns. We also analyze constructive countermeasures
tailored to address identified failure scenarios and their prevention.
Furthermore, the survey emphasizes the imperative for standardization in
addressing failures related to Artificial Intelligence (AI) within the ambit of
the sixth-generation (6G) networks, accounting for the forward-looking
perspective for the envisioned intelligence of 6G network architecture. By
identifying new challenges and delineating future research directions, this
survey can help guide stakeholders toward unexplored territories, fostering
innovation and resilience in critical communication infrastructure development
and failure prevention
CILP: Co-simulation based imitation learner for dynamic resource provisioning in cloud computing environments
Intelligent Virtual Machine (VM) provisioning is central to cost and resource efficient computation in cloud computing environments. As bootstrapping VMs is time-consuming, a key challenge for latency-critical tasks is to predict future workload demands to provision VMs proactively. However, existing AI-based solutions tend to not holistically consider all crucial aspects such as provisioning overheads, heterogeneous VM costs and Quality of Service (QoS) of the cloud system. To address this, we propose a novel method, called CILP, that formulates the VM provisioning problem as two sub-problems of prediction and optimization, where the provisioning plan is optimized based on predicted workload demands. CILP leverages a neural network as a surrogate model to predict future workload demands with a co-simulated digital-twin of the infrastructure to compute QoS scores. We extend the neural network to also act as an imitation learner that dynamically decides the optimal VM provisioning plan. A transformer based neural model reduces training and inference overheads while our novel two-phase decision making loop facilitates in making informed provisioning decisions. Crucially, we address limitations of prior work by including resource utilization, deployment costs and provisioning overheads to inform the provisioning decisions in our imitation learning framework. Experiments with three public benchmarks demonstrate that CILP gives up to 22% higher resource utilization, 14% higher QoS scores and 44% lower execution costs compared to the current online and offline optimization based state-of-the-art methods
LBSim: A simulation system for dynamic load-balancing algorithms for distributed systems.
In a distributed system consisting of autonomous computational units, the total computational power of all the units needs to be utilized efficiently by applying suitable load-balancing policies. For accomplishing the task, a large number of load balancing algorithms have been proposed in the literature. To facilitate the performance study of each of these load-balancing strategies, simulation has been widely used. However comparison of the load balancing algorithms becomes difficult if a different simulator is used for each case. There have been few studies on generalized simulation of load-balancing algorithms in distributed systems. Most of the simulation systems address the experiments for some particular load-balancing algorithms, whereas this thesis aims to study the simulation for a broad range of algorithms. After the characterization of the distributed systems and the extraction of the common components of load-balancing algorithms, a simulation system, called LBSim, has been built. LBSim is a generalized event-driven simulator for studying load-balancing algorithms with coarse-grained applications running on distributed networks of autonomous processing nodes. In order to verify that the simulation model can represent actual systems reasonably well, we have validated LBSim both qualitatively and quantitatively. As a toolkit of simulation, LBSim programming libraries can be reused to implement load-balancing algorithms for the purpose of performance measurement and analysis from different perspectives. As a framework of algorithm simulation can be extended with a moderate effort by following object-oriented methodology, to meet any new requirements that may arise in the future.Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2004 .D8. Source: Masters Abstracts International, Volume: 43-05, page: 1747. Adviser: A. K. Aggarwal. Thesis (M.Sc.)--University of Windsor (Canada), 2004
- âŠ