1,278 research outputs found

    Human-Machine Collaborative Optimization via Apprenticeship Scheduling

    Full text link
    Coordinating agents to complete a set of tasks with intercoupled temporal and resource constraints is computationally challenging, yet human domain experts can solve these difficult scheduling problems using paradigms learned through years of apprenticeship. A process for manually codifying this domain knowledge within a computational framework is necessary to scale beyond the ``single-expert, single-trainee" apprenticeship model. However, human domain experts often have difficulty describing their decision-making processes, causing the codification of this knowledge to become laborious. We propose a new approach for capturing domain-expert heuristics through a pairwise ranking formulation. Our approach is model-free and does not require enumerating or iterating through a large state space. We empirically demonstrate that this approach accurately learns multifaceted heuristics on a synthetic data set incorporating job-shop scheduling and vehicle routing problems, as well as on two real-world data sets consisting of demonstrations of experts solving a weapon-to-target assignment problem and a hospital resource allocation problem. We also demonstrate that policies learned from human scheduling demonstration via apprenticeship learning can substantially improve the efficiency of a branch-and-bound search for an optimal schedule. We employ this human-machine collaborative optimization technique on a variant of the weapon-to-target assignment problem. We demonstrate that this technique generates solutions substantially superior to those produced by human domain experts at a rate up to 9.5 times faster than an optimization approach and can be applied to optimally solve problems twice as complex as those solved by a human demonstrator.Comment: Portions of this paper were published in the Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) in 2016 and in the Proceedings of Robotics: Science and Systems (RSS) in 2016. The paper consists of 50 pages with 11 figures and 4 table

    Idle regulation in non-clairvoyant scheduling of parallel jobs

    Get PDF
    AbstractThe optimization of parallel applications is difficult to achieve by classical optimization techniques because of their diversity and the variety of actual parallel and distributed platforms and/or environments. Adaptive algorithmic schemes, capable of dynamically changing the allocation of jobs during the execution to optimize global system behavior, are the best alternatives for solving this problem. In this paper, we focus on non-clairvoyant scheduling of parallel jobs with known resource requirements but unknown running times, with emphasis on the regulation of idle periods in the context of general list policies. We consider a new family of scheduling strategies based on two phases which successively combine sequential and parallel execution of jobs. We generalize known worst-case performance bounds by considering two extra parameters, in addition to the number of processors and maximum processor requirements considered in the literature, namely, job parallelization penalty and idle regulation factor. Furthermore, we prove that under certain conditions of idle regulation, the performance guarantee of parallel job scheduling in space-sharing mode can be improved

    Investigating different optimization criteria for a hybrid job scheduling approach based on heuristics and metaheuristics

    Get PDF
    The Information Technology industry has revolutionized through the advent of cloud computing as the cloud offers dynamic computing utilities to global users. The performance of cloud computing services depends on the process of job scheduling. There has been a great research focus on the different amalgamation of heuristics with meta-heuristics (hybrid scheduling approaches) in the cloud computing scheduling context with the aim of optimizing several performance metrics. This paper discusses a hybrid job scheduling approach that intends to optimize the performance metrics namely makespan, average flow time, average waiting time, and throughput. The main focus of this paper is to evaluate this hybrid job scheduling approach based on different optimization criteria which includes single-objective and multi-objectives functions based on the aforementioned performance metrics on different large-scale problem instances. This helps us to investigate and identify the best optimization criteria for the hybrid job scheduling approach

    Coordinating Resource Use in Open Distributed Systems

    Get PDF
    In an open distributed system, computational resources are peer-owned, and distributed over time and space. The system is open to interactions with its environment, and the resources can dynamically join or leave the system, or can be discovered at runtime. This dynamicity leads to opportunities to carry out computations without statically owned resources, harnessing the collective compute power of the resources connected by the Internet. However, realizing this potential requires efficient and scalable resource discovery, coordination, and control, which present challenges in a dynamic, open environment. In this thesis, I present an approach to address these challenges by separating the functionality concerns of concurrent computations from those of coordinating their resource use, with the purpose of reducing programming complexity, and aiding development of correct, efficient, and resource-aware concurrent programs. As a first step towards effectively coordinating distributed resources, I developed DREAM, a Distributed Resource Estimation and Allocation Model, which enables computations to reason about future availability of resources. I then developed a fine-grained resource coordination scheme for distributed computations. The coordination scheme integrates DREAM-based resource reasoning into a distributed scheduler, for deciding and enforcing fine-grained resource-use schedules for distributed computations. To control the overhead caused by the coordination, a tuner is implemented which explicitly balances the overhead of the control mechanisms against the extent of control exercised. The effectiveness and performance of the resource coordination approach have been evaluated using a number of case studies. Experimental results show that the approach can effectively schedule computations for supporting various types of coordination objectives, such as ensuring Quality-of-Service, power-efficient execution, and dynamic load balancing. The overhead caused by the coordination mechanism is relatively modest, and adjustable through the tuner. In addition, the coordination mechanism does not add extra programming complexity to computations

    Autonomic Management And Performance Optimization For Cloud Computing Services

    Get PDF
    Cloud computing has become an increasingly important computing paradigm. It offers three levels of on-demand services to cloud users: software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS) . The success of cloud services heavily depends on the effectiveness of cloud management strategies. In this dissertation work, we aim to design and implement an automatic cloud management system to improve application performance, increase platform efficiency and optimize resource allocation. For large-scale multi-component applications, especially web-based cloud applica- tions, parameter setting is crucial to the service availability and quality. The increas- ing system complexity requires an automatic and efficient application configuration strategy. To improve the quality of application services, we propose a reinforcement learning(RL)-based autonomic configuration framework. It is able to adapt appli- cation parameter settings not only to the variations in workload, but also to the change of virtual resource allocation. The RL approach is enhanced with an efficient initialization policy to reduce the learning time for online decision. Experiments on Xen-based virtual cluster with TPC-W benchmarks show that the framework can drive applications into a optimal configuration in less than 25 iterations. For cloud platform service, one of the key challenges is to efficiently adapt the offered platforms to the virtualized environment, meanwhile maintaining their service features. MapReduce has become an important distributed parallel programming paradigm. Offering MapReduce cloud service presents an attractive usage model for enterprises. In a virtual MapReduce cluster, the interference between virtual machines (VMs) causes performance degradation of map and reduce tasks and renders existing data locality-aware task scheduling policy, like delay scheduling, no longer effective. On the other hand, virtualization offers an extra opportunity of data locality for co-hosted VMs. To address these issues, we present a task scheduling strategy to mitigate interference and meanwhile preserving task data locality for MapReduce applications. The strategy includes an interference-aware scheduling policy, based on a task performance prediction model, and an adaptive delay scheduling algorithm for data locality improvement. Experimental results on a 72-node Xen-based virtual cluster show that the scheduler is able to achieve a speedup of 1.5 to 6.5 times for individual jobs and yield an improvement of up to 1.9 times in system throughput in comparison with four other MapReduce schedulers. Cloud computing has a key requirement for resource configuration in a real-time manner. In such virtualized environments, both virtual machines (VMs) and hosted applications need to be configured on-the fly to adapt to system dynamics. The in- terplay between the layers of VMs and applications further complicates the problem of cloud configuration. Independent tuning of each aspect may not lead to optimal system wide performance. In this work, we propose a framework for coordinated configuration of VMs and resident applications. At the heart of the framework is a model-free hybrid reinforcement learning (RL) approach, which combines the advan- tages of Simplex method and RL method and is further enhanced by the use of system knowledge guided exploration policies. Experimental results on Xen based virtualized environments with TPC-W and TPC-C benchmarks demonstrate that the framework is able to drive a virtual server cluster into an optimal or near-optimal configuration state on the fly, in response to the change of workload. It improves the systems throughput by more than 30% over independent tuning strategies. In comparison with the coordinated tuning strategies based on basic RL or Simplex algorithm, the hybrid RL algorithm gains 25% to 40% throughput improvement

    Deadline-Aware Reservation-Based Scheduling

    Get PDF
    The ever-growing need to improve return-on-investment (ROI) for cluster infrastructure that processes data which is being continuously generated at a higher rate than ever before introduces new challenges for big-data processing frameworks. Highly complex mixed workload arriving at modern clusters along with a growing number of time-sensitive critical production jobs necessitates cluster management systems to evolve. Most big-data systems are not only required to guarantee that production jobs will complete before their deadline, but also minimize the latency for best-effort jobs to increase ROI. This research presents DARSS, a deadline-aware reservation-based scheduling system. DARSS addresses the above-stated problem by using a reservation-based approach to scheduling that supports temporal requirements of production jobs while keeping the latency for best-effort jobs low. Fined-grained resource allocation enables DARSS to schedule more tasks than a coarser-grained approach would. Furthermore, DARSS schedules production jobs as close to their deadlines as possible. This scheduling policy allows the system to maximize the number of low-priority tasks that can be scheduled opportunistically. DARSS is a scalable system that can be integrated with YARN. DARSS is evaluated on a simulated cluster of 300 nodes against a workload derived from Google Borg's trace. DARSS is compared with Microsoft's Rayon and YARN's built-in scheduler. DARSS achieves better production job acceptance rate than both YARN and Rayon. The experiments show that all of the production jobs accepted by DARSS complete before their deadlines. Furthermore, DARSS has a higher number of best-effort jobs serviced than Rayon. And finally, DARSS has lower latency for best-effort jobs than Rayon

    Holistic Slowdown Driven Scheduling and Resource Management for Malleable Jobs

    Get PDF
    In job scheduling, the concept of malleability has been explored since many years ago. Research shows that malleability improves system performance, but its utilization in HPC never became widespread. The causes are the difficulty in developing malleable applications, and the lack of support and integration of the different layers of the HPC software stack. However, in the last years, malleability in job scheduling is becoming more critical because of the increasing complexity of hardware and workloads. In this context, using nodes in an exclusive mode is not always the most efficient solution as in traditional HPC jobs, where applications were highly tuned for static allocations, but offering zero flexibility to dynamic executions. This paper proposes a new holistic, dynamic job scheduling policy, Slowdown Driven (SD-Policy), which exploits the malleability of applications as the key technology to reduce the average slowdown and response time of jobs. SD-Policy is based on backfill and node sharing. It applies malleability to running jobs to make room for jobs that will run with a reduced set of resources, only when the estimated slowdown improves over the static approach. We implemented SD-Policy in SLURM and evaluated it in a real production environment, and with a simulator using workloads of up to 198K jobs. Results show better resource utilization with the reduction of makespan, response time, slowdown, and energy consumption, up to respectively 7%, 50%, 70%, and 6%, for the evaluated workloads

    Optimizing Tilera's process scheduling via reinforcement learning

    Get PDF
    Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013.Cataloged from PDF version of thesis.Includes bibliographical references (pages 45-48).As multicore processors become more prevalent, system complexities are increasing. It is no longer practical for an average programmer to balance all of the system constraints to ensure that the system will always perform optimally. One apparent solution to managing these resources efficiently is to design a self-aware system that utilizes machine learning to optimally manage its own resources and tune its own parameters. Tilera is a multicore processor architecture designed to highly scalable. The aim of the proposed project is to use reinforcement learning to develop a reward function that will enable the Tilera's scheduler to tune its own parameters. By enabling the parameters to come from the system's "reward function," we aim eliminate the burden on the programmer to produce these parameters. Our contribution to this aim is a library of reinforcement learning functions, borrowed from Sutton and Barto (1998) [35], and a lightweight benchmark, capable of modifying processor affinities. When combined, these two tools should provide a sound basis for Tilera's scheduler to tune its own parameters. Furthermore, this thesis describes how this combination may effectively be done and explores several manually tuned processor affinities. The results of this exploration demonstrates the necessity of an autonomously-tuned scheduler.by Deborah Hanus.M. Eng
    • …
    corecore