43,663 research outputs found

    Learning Scheduling Algorithms for Data Processing Clusters

    Full text link
    Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems, however, use simple generalized heuristics and ignore workload characteristics, since developing and tuning a scheduling policy for each workload is infeasible. In this paper, we show that modern machine learning techniques can generate highly-efficient policies automatically. Decima uses reinforcement learning (RL) and neural networks to learn workload-specific scheduling algorithms without any human instruction beyond a high-level objective such as minimizing average job completion time. Off-the-shelf RL techniques, however, cannot handle the complexity and scale of the scheduling problem. To build Decima, we had to develop new representations for jobs' dependency graphs, design scalable RL models, and invent RL training methods for dealing with continuous stochastic job arrivals. Our prototype integration with Spark on a 25-node cluster shows that Decima improves the average job completion time over hand-tuned scheduling heuristics by at least 21%, achieving up to 2x improvement during periods of high cluster load

    UDRF: Multi-resource Fairness for Complex Jobs with Placement Constraints

    Get PDF
    In this paper, we study the problem of multi- resource fairness in systems running complex jobs that consist of multiple interconnected tasks. A job is considered finished when all its corresponding tasks have been executed in the system. Tasks can have different resource requirements. Because of special demands on particular hardware or software, tasks may have placement constraints limiting the type of machines they can run on. We develop User-Dependence Dominant Resource Fairness (UDRF), a generalized version of max-min fairness that combines graph theory and the notion of dominant re- source shares to ensure multi-resource fairness between complex workflows. UDRF satisfies several desirable properties including strategy proofness, which ensures that users do not benefit from misreporting their true resource demands. We propose an offline algorithm that computes optimal UDRF allocation. But optimality comes at a cost, especially for systems where schedulers need to make thousands of online scheduling decisions per second. Therefore, we develop a lightweight online algorithm that closely approximates UDRF. Besides that, we propose a simple mechanism to decentralize the UDRF scheduling process across multiple schedulers. Large-scale simulations driven by Google cluster-usage traces show that UDRF achieves better resource utilization and throughput compared to the current state-of-the-art in fair resource allocation

    HPC Cloud for Scientific and Business Applications: Taxonomy, Vision, and Research Challenges

    Full text link
    High Performance Computing (HPC) clouds are becoming an alternative to on-premise clusters for executing scientific applications and business analytics services. Most research efforts in HPC cloud aim to understand the cost-benefit of moving resource-intensive applications from on-premise environments to public cloud platforms. Industry trends show hybrid environments are the natural path to get the best of the on-premise and cloud resources---steady (and sensitive) workloads can run on on-premise resources and peak demand can leverage remote resources in a pay-as-you-go manner. Nevertheless, there are plenty of questions to be answered in HPC cloud, which range from how to extract the best performance of an unknown underlying platform to what services are essential to make its usage easier. Moreover, the discussion on the right pricing and contractual models to fit small and large users is relevant for the sustainability of HPC clouds. This paper brings a survey and taxonomy of efforts in HPC cloud and a vision on what we believe is ahead of us, including a set of research challenges that, once tackled, can help advance businesses and scientific discoveries. This becomes particularly relevant due to the fast increasing wave of new HPC applications coming from big data and artificial intelligence.Comment: 29 pages, 5 figures, Published in ACM Computing Surveys (CSUR

    A Taxonomy for Management and Optimization of Multiple Resources in Edge Computing

    Full text link
    Edge computing is promoted to meet increasing performance needs of data-driven services using computational and storage resources close to the end devices, at the edge of the current network. To achieve higher performance in this new paradigm one has to consider how to combine the efficiency of resource usage at all three layers of architecture: end devices, edge devices, and the cloud. While cloud capacity is elastically extendable, end devices and edge devices are to various degrees resource-constrained. Hence, an efficient resource management is essential to make edge computing a reality. In this work, we first present terminology and architectures to characterize current works within the field of edge computing. Then, we review a wide range of recent articles and categorize relevant aspects in terms of 4 perspectives: resource type, resource management objective, resource location, and resource use. This taxonomy and the ensuing analysis is used to identify some gaps in the existing research. Among several research gaps, we found that research is less prevalent on data, storage, and energy as a resource, and less extensive towards the estimation, discovery and sharing objectives. As for resource types, the most well-studied resources are computation and communication resources. Our analysis shows that resource management at the edge requires a deeper understanding of how methods applied at different levels and geared towards different resource types interact. Specifically, the impact of mobility and collaboration schemes requiring incentives are expected to be different in edge architectures compared to the classic cloud solutions. Finally, we find that fewer works are dedicated to the study of non-functional properties or to quantifying the footprint of resource management techniques, including edge-specific means of migrating data and services.Comment: Accepted in the Special Issue Mobile Edge Computing of the Wireless Communications and Mobile Computing journa
    corecore