688 research outputs found

    Generalized Cost-Based Job Scheduling in Very Large Heterogeneous Cluster Systems

    We study job assignment in large, heterogeneous resource-sharing clusters of servers with finite buffers. This load balancing problem arises naturally in today's communication and big data systems, such as Amazon Web Services, Network Service Function Chains, and Stream Processing. Arriving jobs are dispatched to a server, following a load balancing policy that optimizes a performance criterion such as job completion time. Our contribution is a randomized Cost-Based Scheduling (CBS) policy in which the job assignment is driven by general cost functions of the server queue lengths. Beyond existing schemes, such as the Join the Shortest Queue (JSQ), the power of d or the SQ(d) and the capacity-weighted JSQ, the notion of CBS yields new application-specific policies such as hybrid locally uniform JSQ. As today's data center clusters have thousands of servers, exact analysis of CBS policies is tedious. In this article, we derive a scaling limit when the number of servers grows large, facilitating a comparison of various CBS policies with respect to their transient as well as steady state behavior. A byproduct of our derivations is the relationship between the queue filling proportions and the server buffer sizes, which cannot be obtained from infinite buffer models. Finally, we provide extensive numerical evaluations and discuss several applications including multi-stage systems
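The dispatch rule described in this abstract can be illustrated with a minimal sketch. It assumes an SQ(d)-style sampling step combined with a pluggable cost function of the queue length; the names `dispatch`, `jsq_cost`, and `make_weighted_cost` are hypothetical and are not taken from the paper, whose actual CBS policy may differ in how servers are sampled and how ties or full buffers are handled.

```python
import random
from typing import Callable, List, Optional


def dispatch(queues: List[int], buffers: List[int], d: int,
             cost: Callable[[int, int], float],
             rng: random.Random) -> Optional[int]:
    """Sample d distinct servers and send the job to the cheapest non-full one.

    Returns the chosen server index, or None when every sampled server is
    full (the job is dropped in this sketch; this is an assumption).
    """
    candidates = rng.sample(range(len(queues)), d)
    # Finite buffers: a server whose queue has reached its buffer size is skipped.
    open_servers = [i for i in candidates if queues[i] < buffers[i]]
    if not open_servers:
        return None
    best = min(open_servers, key=lambda i: cost(i, queues[i]))
    queues[best] += 1
    return best


def jsq_cost(server: int, qlen: int) -> float:
    """Plain queue length: with d equal to the cluster size this is JSQ."""
    return float(qlen)


def make_weighted_cost(capacities: List[float]) -> Callable[[int, int], float]:
    """Queue length scaled by server capacity (capacity-weighted JSQ)."""
    def cost(server: int, qlen: int) -> float:
        return qlen / capacities[server]
    return cost
```

Setting `d` to the number of servers recovers JSQ under `jsq_cost`, while a smaller `d` gives the power-of-d variant; other cost functions can be swapped in without changing the dispatcher.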

    Learning Scheduling Algorithms for Data Processing Clusters

    Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems, however, use simple generalized heuristics and ignore workload characteristics, since developing and tuning a scheduling policy for each workload is infeasible. In this paper, we show that modern machine learning techniques can generate highly-efficient policies automatically. Decima uses reinforcement learning (RL) and neural networks to learn workload-specific scheduling algorithms without any human instruction beyond a high-level objective such as minimizing average job completion time. Off-the-shelf RL techniques, however, cannot handle the complexity and scale of the scheduling problem. To build Decima, we had to develop new representations for jobs' dependency graphs, design scalable RL models, and invent RL training methods for dealing with continuous stochastic job arrivals. Our prototype integration with Spark on a 25-node cluster shows that Decima improves the average job completion time over hand-tuned scheduling heuristics by at least 21%, achieving up to 2x improvement during periods of high cluster load

    Communication-Aware Scheduling of Serial Tasks for Dispersed Computing

    There is growing interest in developing in-network dispersed computing paradigms that leverage the computing capabilities of heterogeneous resources dispersed across the network to process the massive amounts of data collected at the edge of the network. We consider the problem of task scheduling for such networks, in a dynamic setting in which arriving computation jobs are modeled as chains, with nodes representing tasks, and edges representing precedence constraints among tasks. In our proposed model, motivated by significant communication costs in dispersed computing environments, the communication times are taken into account. More specifically, we consider a network where servers are capable of serving all task types, and sending the results of processed tasks from one server to another incurs a communication delay that makes the design of an optimal scheduling policy significantly more challenging than in classical queueing networks. As the main contributions of the paper, we first characterize the capacity region of the network, then propose a novel virtual queueing network encoding the state of the network. Finally, we propose a Max-Weight type scheduling policy, and considering the virtual queueing network in the fluid limit, we use a Lyapunov argument to show that the policy is throughput-optimal. Comment: accepted to appear in IEEE/ACM Transactions on Networking
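As a rough illustration of what a Max-Weight type rule looks like, the sketch below shows the textbook single-server form: serve the queue with the largest product of backlog and service rate. This is an assumption-laden simplification; the paper applies its policy to a virtual queueing network that accounts for communication delays, which is not modeled here, and the function name `max_weight_choice` is hypothetical.

```python
from typing import Optional, Sequence


def max_weight_choice(queue_lengths: Sequence[int],
                      service_rates: Sequence[float]) -> Optional[int]:
    """Return the index of the queue maximizing backlog * service rate.

    Returns None when every weight is zero (for example, all queues empty).
    """
    best, best_weight = None, 0.0
    for k, (backlog, rate) in enumerate(zip(queue_lengths, service_rates)):
        weight = backlog * rate
        if weight > best_weight:
            best, best_weight = k, weight
    return best
```

For example, with backlogs `[4, 10, 0]` and rates `[1.0, 0.3, 5.0]` the weights are 4.0, 3.0 and 0.0, so the first queue is served even though the second one is longer.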

    A decentralized control and optimization framework for autonomic performance management of web-server systems

    Web-based services such as online banking and e-commerce are often hosted on distributed computing systems comprising heterogeneous and networked servers in a data-center setting. To operate such systems efficiently while satisfying stringent quality-of-service (QoS) requirements, multiple performance-related parameters must be dynamically tuned to track changing operating conditions. For example, the workload to be processed may be time varying and hardware/software resources may fail during system operation. To cope with their growing scale and complexity, such computing systems must become largely autonomic, capable of being managed with minimal human intervention. This study develops a distributed cooperative-control framework using concepts from optimal control theory and hybrid dynamical systems to adaptively manage the performance of computer clusters operating in dynamic and uncertain environments. As case studies, we focus on power management and dynamic resource provisioning problems in such clusters. First, we apply the control framework to minimize the power consumed by a server cluster under a time-varying workload. The overall power-management problem is decomposed into smaller sub-problems and solved in cooperative fashion by individual controllers on each server. This approach allows for the scalable control of large computing systems. The control framework also adapts to controller failures and allows for the dynamic addition and removal of controllers during system operation. We validate the proposed approach using a discrete-event simulator with real-world workload traces, and our results indicate that the controllers achieve a 55% reduction in power consumption when compared to an uncontrolled system in which each server operates at its maximum frequency at all times. We then develop a distributed resource provisioning framework to achieve differentiated QoS among multiple online services using concepts from hybrid control. We use a discrete hybrid automaton to model the operation of the computing cluster. The resource provisioning problem combining both QoS control and power management is then solved using a decentralized model predictive controller to maximize the operating profits generated by the cluster according to a specified service level agreement. Simulation results indicate that the controller generates 27% additional profit when compared to an uncontrolled system. Ph.D., Electrical Engineering -- Drexel University, 200

    Scalable and Distributed Resource Management Protocols for Cloud and Big Data Clusters

    Cloud data centers require an operating system to manage resources and satisfy operational requirements and management objectives. The growing popularity of cloud services has given rise to a new spectrum of services with sophisticated workload and resource management requirements. Data centers are also growing through the addition of various types of hardware to accommodate the ever-increasing requests of users. Nowadays a large percentage of cloud resources execute data-intensive applications whose workloads fluctuate continuously and which need specific resource management. To this end, cluster computing frameworks are shifting towards distributed resource management for better scalability and faster decision making. Such systems benefit from the parallelization of control and are resilient to failures. Throughout this thesis we investigate algorithms, protocols and techniques to address these challenges in large-scale data centers. We introduce a distributed resource management framework which consolidates virtual machines onto as few servers as possible to reduce the energy consumption of the data center and hence decrease the costs of cloud providers. This framework can characterize the workload of virtual machines and hence efficiently handle the trade-off between energy consumption and the Service Level Agreement (SLA) of customers. The algorithm is highly scalable, requires low maintenance cost under dynamic workloads, and tries to minimize virtual machine migration costs. We also introduce a scalable and distributed probe-based scheduling algorithm for Big Data analytics frameworks. This algorithm can efficiently address the problem of job heterogeneity in workloads that has appeared after increasing the level of parallelism in jobs. The algorithm is massively scalable and can significantly reduce average job completion times in comparison with the state of the art. Finally, we propose a probabilistic fault-tolerance technique as part of the scheduling algorithm.

    Activity Report 2022


    Enabling flexibility through strategic management of complex engineering systems

    “Flexibility is a highly desired attribute of many systems operating in changing or uncertain conditions. It is a common theme in complex systems to identify where flexibility is generated within a system and how to model the processes needed to maintain and sustain flexibility. The key research question that is addressed is: how do we create a new definition of workforce flexibility within a human-technology-artificial intelligence environment? Workforce flexibility is the management of organizational labor capacities and capabilities in operational environments using a broad and diffuse set of tools and approaches to mitigate system imbalances caused by uncertainties or changes. We establish a baseline reference for managers to use in choosing flexibility methods for specific applications and we determine the scope and effectiveness of these traditional flexibility methods. The unique contributions of this research are: a) a new definition of workforce flexibility for a human-technology work environment versus traditional definitions; b) using a system of systems (SoS) approach to create and sustain that flexibility; and c) applying a coordinating strategy for optimal workforce flexibility within the human-technology framework. This dissertation research fills the gap of how we can model flexibility using SoS engineering to show where flexibility emerges and what strategies a manager can use to manage flexibility within this technology construct”--Abstract, page iii

    Performance Evaluation of Transition-based Systems with Applications to Communication Networks

    Since the beginning of the twenty-first century, communication systems have witnessed a revolution in terms of their hardware capabilities. This transformation has enabled modern networks to stand up to the diversity and the scale of the requirements of the applications that they support. Compared to their predecessors that primarily consisted of a handful of homogeneous devices communicating via a single communication technology, today's networks connect myriads of systems that are intrinsically different in their functioning and purpose. In addition, many of these devices communicate via different technologies or a combination of them at a time. All these developments, coupled with the geographical disparity of the physical infrastructure, give rise to network environments that are inherently dynamic and unpredictable. To cope with heterogeneous environments and the growing demands, network units have taken a leap from the paradigm of static functioning to that of adaptivity. In this thesis, we refer to adaptive network units as transition-based systems (TBSs), and the act of adapting is termed a transition. We note that TBSs not only reside in diverse environmental conditions; their need to adapt also arises from different phenomena. Such phenomena are referred to as triggers and they can occur at different time scales. We additionally observe that the nature of a transition is dictated by the specified performance objective of the relevant TBS and we seek to build an analytical framework that helps us derive a policy for performance optimization. As the state of the art lacks a unified approach to modelling the diverse functioning of the TBSs and their varied performance objectives, we first propose a general framework based on the theory of Markov Decision Processes. This framework facilitates optimal policy derivation in TBSs in a principled manner. In addition, we note the importance of bespoke analyses in specific classes of TBSs where the general formulation leads to a high-dimensional optimization problem. Specifically, we consider performance optimization in open systems employing parallelism and closed systems exploiting the benefits of service batching. In these examples, we resort to approximation techniques such as a mean-field limit for the state evolution whenever the underlying TBS deals with a large number of entities. Our formulation enables calculation of optimal policies and provides tangible alternatives to existing frameworks for Quality of Service evaluation. Compared to the state of the art, the derived policies facilitate transitions in communication systems that yield superior performance, as shown through extensive evaluations in this thesis.