
    Idle regulation in non-clairvoyant scheduling of parallel jobs

    The optimization of parallel applications is difficult to achieve with classical optimization techniques because of their diversity and the variety of actual parallel and distributed platforms and/or environments. Adaptive algorithmic schemes, capable of dynamically changing the allocation of jobs during execution to optimize global system behavior, are the best alternatives for solving this problem. In this paper, we focus on non-clairvoyant scheduling of parallel jobs with known resource requirements but unknown running times, with emphasis on the regulation of idle periods in the context of general list policies. We consider a new family of scheduling strategies based on two phases which successively combine sequential and parallel execution of the jobs. We generalize known worst-case performance bounds by considering two extra parameters, in addition to the number of processors and the maximum processor requirement considered in the literature, namely the job parallelization penalty and the idle regulation factor. Furthermore, we prove that under certain conditions of idle regulation, the performance guarantee of parallel job scheduling in space-sharing mode can be improved.
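
    To make the two extra parameters concrete, the following toy, time-stepped list scheduler gives every active job one processor (a sequential phase) and parallelizes remaining work only while the fraction of idle processors exceeds an idle regulation factor alpha, inflating parallel work by a penalty factor mu. The switching rule, the speedup model and all names are assumptions for illustration only; this is not the strategy analyzed in the paper.

    def schedule(jobs, m, alpha=0.25, mu=1.1, dt=1.0):
        """Toy list scheduler. jobs: list of dicts {"work": float, "qmax": int};
        m: number of processors. Returns the makespan of the toy schedule."""
        pending = [dict(j, alloc=0) for j in jobs]
        t = 0.0
        while any(j["work"] > 0 for j in pending):
            active = [j for j in pending if j["work"] > 0]
            # Sequential phase: one processor for (at most) the first m active jobs.
            for idx, j in enumerate(active):
                j["alloc"] = 1 if idx < m else 0
            used = sum(j["alloc"] for j in active)
            # Idle regulation: parallelize greedily only while idleness exceeds alpha*m.
            while m - used > alpha * m:
                expandable = [j for j in active if 0 < j["alloc"] < j["qmax"]]
                if not expandable:
                    break
                biggest = max(expandable, key=lambda x: x["work"])
                biggest["alloc"] += 1
                used += 1
            # Advance time; parallel execution pays the penalty factor mu.
            for j in active:
                if j["alloc"] > 0:
                    speed = j["alloc"] / (mu if j["alloc"] > 1 else 1.0)
                    j["work"] = max(0.0, j["work"] - speed * dt)
            t += dt
        return t

    print(schedule([{"work": 8, "qmax": 4}, {"work": 3, "qmax": 2},
                    {"work": 5, "qmax": 3}], m=4))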

    Holistic Slowdown Driven Scheduling and Resource Management for Malleable Jobs

    In job scheduling, the concept of malleability has been explored for many years. Research shows that malleability improves system performance, but its use in HPC has never become widespread. The causes are the difficulty of developing malleable applications and the lack of support for, and integration of, malleability across the different layers of the HPC software stack. In recent years, however, malleability in job scheduling has become more important because of the increasing complexity of hardware and workloads. In this context, using nodes in exclusive mode, as in traditional HPC jobs where applications are highly tuned for static allocations, is not always the most efficient solution, since it offers no flexibility for dynamic execution. This paper proposes a new holistic, dynamic job scheduling policy, Slowdown Driven (SD-Policy), which exploits the malleability of applications as the key technology to reduce the average slowdown and response time of jobs. SD-Policy is based on backfill and node sharing. It applies malleability to running jobs to make room for jobs that will run with a reduced set of resources, but only when the estimated slowdown improves over the static approach. We implemented SD-Policy in SLURM and evaluated it in a real production environment and with a simulator using workloads of up to 198K jobs. Results show better resource utilization, with reductions in makespan, response time, slowdown, and energy consumption of up to 7%, 50%, 70%, and 6%, respectively, for the evaluated workloads.
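
    The core admission rule, starting a waiting job on a reduced allocation (freed by shrinking malleable running jobs) only when its estimated slowdown improves over waiting for the full static allocation, can be sketched as below. The speedup model, field names and helper functions are illustrative assumptions, not the SD-Policy implementation in SLURM.

    def slowdown(wait, run):
        # Slowdown of a job: (waiting time + runtime) / runtime.
        return (wait + run) / run

    def estimated_runtime(job, nodes):
        # Naive linear-speedup model, capped at the job's requested node count.
        return job["work"] / min(nodes, job["nodes_requested"])

    def should_shrink_for(job, free_nodes, wait_if_static, wait_now=0.0):
        """Run `job` now on `free_nodes` (obtained by shrinking running malleable
        jobs) only if the estimated slowdown beats waiting `wait_if_static`
        seconds for its full static allocation."""
        sd_reduced = slowdown(wait_now, estimated_runtime(job, free_nodes))
        sd_static = slowdown(wait_if_static,
                             estimated_runtime(job, job["nodes_requested"]))
        return sd_reduced < sd_static

    job = {"work": 3600.0, "nodes_requested": 8}
    print(should_shrink_for(job, free_nodes=4, wait_if_static=7200.0))  # True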

    Towards ServMark, an Architecture for Testing Grid Services

    Technical University of Delft - Technical Report ServMark-2006-002, July 2006. Grid computing provides a natural way to aggregate resources from different administrative domains for building large-scale distributed environments. The Web Services paradigm proposes a way by which virtual services can be seamlessly integrated into global-scale solutions to complex problems. While the usage of Grid technology ranges from academia and research to the business world and production, two issues must be addressed: the promised functionality must be accurately quantified, and the performance must be evaluated by well-defined means. Without adequate functionality demonstrators, systems cannot be tuned or adequately configured, and Web services cannot be stressed adequately in production environments. Without performance evaluation systems, the system design and procurement processes are hampered, and the performance of Web Services in production cannot be assessed. In this paper, we present ServMark, a carefully researched tool for Grid performance evaluation. While we acknowledge that a lot of ground must be covered to fulfill the requirements of a system for testing Grid environments and Web (and Grid) Services, we believe that ServMark addresses the minimal set of critical issues.

    Discrete-Event Modeling of a High-Performance Computing Cluster with Service Rate Control

    We present a stochastic-recursion-based discrete-event model of a high-performance computing cluster with service rate switching capabilities. The model is easily adapted to many common settings of modern supercomputers, such as specific scheduling disciplines and various control policies. We also provide some illustrative numerical experiments and discuss further generalizations of the model.
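
    As a minimal illustration of service rate switching (not the model from the paper), the following continuous-time Markov chain simulation of a single-server queue serves at a fast rate whenever the number of jobs in the system reaches a threshold and at a slow rate otherwise; the distributions, the threshold rule and all parameters are assumptions chosen for illustration.

    import random

    def simulate(horizon=100000.0, lam=0.8, mu_slow=1.0, mu_fast=2.0,
                 threshold=5, seed=0):
        """Single-server queue whose service rate is mu_fast while the number of
        jobs in the system is >= threshold and mu_slow otherwise. Returns the
        time-average number of jobs in the system."""
        rng = random.Random(seed)
        t, n, area = 0.0, 0, 0.0
        while t < horizon:
            mu = mu_fast if n >= threshold else mu_slow
            total = lam + (mu if n > 0 else 0.0)
            dt = rng.expovariate(total)      # time to the next transition
            area += n * dt
            t += dt
            # Next event: an arrival with probability lam/total, else a departure.
            if n == 0 or rng.random() < lam / total:
                n += 1
            else:
                n -= 1
        return area / t

    print(simulate())   # mean number of jobs under the switching policy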

    CoMeT: An Integrated Interval Thermal Simulation Toolchain for 2D, 2.5D, and 3D Processor-Memory Systems

    Processing cores and the accompanying main memory working in tandem enable modern processors. Dissipating the heat produced by computation and memory access remains a significant problem for processors. Therefore, processor thermal management continues to be an active research topic. Most thermal management research takes place using simulations, given the challenges of measuring temperature in real processors. Since cores and memory are fabricated on separate packages in most existing processors, with the memory having lower power densities, thermal management research in processors has primarily focused on the cores. Memory bandwidth limitations associated with 2D processors have led to high-density 2.5D and 3D packaging technology. 2.5D packaging places cores and memory on the same package; 3D packaging takes it further by stacking layers of memory on top of the cores themselves. Such packaging significantly increases the power density, making processors prone to overheating. Therefore, mitigating thermal issues in high-density processors (packaged with stacked memory) becomes an even more pressing problem. However, given the lack of thermal modeling for memories in existing interval thermal simulation toolchains, they are unsuitable for studying thermal management for high-density processors. To address this issue, we present CoMeT, the first integrated Core and Memory interval Thermal simulation toolchain. CoMeT comprehensively supports thermal simulation of high- and low-density processors corresponding to four different core-memory configurations: off-chip DDR memory, off-chip 3D memory, 2.5D, and 3D. CoMeT supports several novel features that facilitate overlying system research. Compared to an equivalent state-of-the-art core-only toolchain, CoMeT adds only a ~5% simulation-time overhead. The source code of CoMeT has been made open for public use under the MIT license. Source: https://github.com/marg-tools/CoMe

    Tuning EASY-Backfilling Queues

    EASY-Backfilling is a popular scheduling heuristic for allocating jobs in large-scale High Performance Computing platforms. While its aggressive reservation mechanism is fast and prevents job starvation, it does not try to optimize any scheduling objective per se. We consider in this work the problem of tuning EASY using queue reordering policies. More precisely, we propose to tune the reordering using a simulation-based methodology. For a given system, we choose the policy that minimizes the average waiting time. This methodology departs from the First-Come, First-Served rule and introduces a risk on the maximum values of the waiting time, which we control using a queue thresholding mechanism. This new approach is evaluated through a comprehensive experimental campaign on five production logs. In particular, we show that the behavior of the systems under study is stable enough to learn a heuristic that generalizes in a train/test fashion. Indeed, the average waiting time can be reduced consistently (between 11% and 42% for the logs used) compared to EASY, with almost no increase in maximum waiting times. This work departs from previous learning-based approaches and shows that scheduling heuristics for HPC can be learned directly in a policy space.
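
    The thresholding idea, reordering the waiting queue by some criterion while keeping jobs that have waited too long at the head in arrival order, can be sketched as follows. The reordering criterion (shortest requested runtime first), the field names and the threshold value are assumptions; the paper selects the reordering policy per system by simulation.

    def reorder_queue(queue, now, wait_threshold):
        """queue: list of dicts with 'submit' and 'req_runtime' (seconds).
        Returns the order in which the backfilling scheduler considers jobs."""
        overdue = [j for j in queue if now - j["submit"] > wait_threshold]
        rest = [j for j in queue if now - j["submit"] <= wait_threshold]
        overdue.sort(key=lambda j: j["submit"])       # FCFS for overdue jobs
        rest.sort(key=lambda j: j["req_runtime"])     # reordering criterion
        return overdue + rest

    queue = [{"submit": 0, "req_runtime": 3600},
             {"submit": 50, "req_runtime": 60},
             {"submit": 800, "req_runtime": 600}]
    print([j["submit"] for j in reorder_queue(queue, now=1000, wait_threshold=600)])
    # -> [0, 50, 800]: the two overdue jobs keep their arrival order at the head.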

    The Resource Usage Aware Backfilling

    Job scheduling policies for HPC centers have been extensively studied in the last few years, especially backfilling-based policies. Almost all of these studies have been done using simulation tools. All existing simulators use the runtime (either estimated or real) provided in the workload as the basis of their simulations. In our previous work we analyzed the impact on system performance of considering the resource sharing (memory bandwidth) of running jobs by including a new resource model in the Alvio simulator. Based on these studies we proposed the LessConsume and LessConsume Threshold resource selection policies. Both are oriented to reducing the saturation of the shared resources and thus increasing the performance of the system. The results showed how the performance of the system can be improved by considering where the jobs are finally allocated. Using the LessConsume Threshold resource selection policy, we propose a new backfilling strategy: the Resource Usage Aware Backfilling job scheduling policy. This is a backfilling-based scheduling policy in which the algorithms that decide which job has to be executed, and how jobs have to be backfilled, are based on different Threshold configurations. This backfilling variant considers how the shared resources are used by the scheduled jobs. Rather than backfilling the first job that can be moved to the run queue based on job arrival time or job size, it looks ahead to the next queued jobs and tries to allocate jobs that would experience a lower penalized runtime caused by resource sharing saturation. In the paper we demonstrate how the exchange of scheduling information between the local resource manager and the scheduler can substantially improve the performance of the system when resource sharing is considered. We show how it can achieve response time performance close to that of Shortest Job First Backfilling with First Fit (oriented to improving the start time of the allocated jobs) while providing a qualitative improvement in the number of killed jobs and in the percentage of penalized runtime.
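
    The look-ahead selection described above, scoring each backfill candidate by its estimated penalized runtime under shared-resource (memory bandwidth) saturation and picking the least penalized one, can be illustrated as below. The linear penalty model and all field names are assumptions for illustration, not the LessConsume model implemented in Alvio.

    def penalized_runtime(job, node_bw_used, node_bw_capacity):
        # Toy model: inflate the runtime by the bandwidth oversubscription factor
        # of the node the job would share.
        demand = node_bw_used + job["bw_demand"]
        factor = max(1.0, demand / node_bw_capacity)
        return job["runtime"] * factor

    def pick_backfill_job(candidates, node_bw_used, node_bw_capacity):
        # candidates: queued jobs that fit in the current backfill window.
        return min(candidates,
                   key=lambda j: penalized_runtime(j, node_bw_used, node_bw_capacity))

    candidates = [{"runtime": 600, "bw_demand": 8.0},
                  {"runtime": 900, "bw_demand": 1.0}]
    print(pick_backfill_job(candidates, node_bw_used=6.0, node_bw_capacity=10.0))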

    Improving the Performance of Batch Schedulers Using Online Job Size Classification

    Job scheduling in high-performance computing platforms is a hard problem that involves uncertainties in both the job arrival process and the job execution times. Users typically provide loose upper-bound estimates for job execution times that are hardly useful. Previous studies attempted to improve these estimates using regression techniques. Although these attempts provide reasonable predictions, they require a long period of training data. Furthermore, aiming for perfect prediction may be of limited use for scheduling purposes. In this work, we propose a simpler approach: classifying jobs as small or large and prioritizing the execution of small jobs over large ones. Indeed, small jobs are the most impacted by queuing delays, but they typically represent a light load and impose a small burden on the other jobs. The classifier operates online and learns from data collected over the previous weeks, facilitating its deployment and enabling fast adaptation to changes in workload characteristics. We evaluate our approach using four scheduling policies on six HPC platform workload traces. We show that: (i) incorporating such classification reduces the average bounded slowdown of jobs in all scenarios, and (ii) the obtained improvements are comparable, in most scenarios, to the ideal hypothetical situation where the scheduler would know the exact running time of jobs in advance.
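
    A minimal sketch of this idea, assuming a toy classifier (a threshold on the median runtime of each user's recently completed jobs) and a simple small-jobs-first queue ordering, is shown below; the features, the cutoff and the scheduler integration are illustrative assumptions rather than the classifier evaluated in the paper.

    from statistics import median

    class OnlineSizeClassifier:
        def __init__(self, cutoff=600.0):
            self.cutoff = cutoff       # runtime (seconds) separating small from large
            self.history = {}          # user -> runtimes of recently completed jobs

        def observe(self, user, runtime):
            self.history.setdefault(user, []).append(runtime)
            self.history[user] = self.history[user][-50:]   # sliding window

        def predict_small(self, user):
            runs = self.history.get(user)
            if not runs:
                return True            # optimistic default for unseen users
            return median(runs) <= self.cutoff

    clf = OnlineSizeClassifier()
    clf.observe("alice", 120); clf.observe("alice", 90)
    clf.observe("bob", 7200)
    queue = [{"user": "bob"}, {"user": "alice"}]
    queue.sort(key=lambda j: not clf.predict_small(j["user"]))  # small jobs first
    print([j["user"] for j in queue])   # ['alice', 'bob']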

    Effectively utilizing global cluster memory for large data-intensive parallel programs


    CoMeT: An Integrated Interval Thermal Simulation Toolchain for 2D, 2.5D, and 3D Processor-Memory Systems

    Processing cores and the accompanying main memory working in tandem enable modern processors. Dissipating heat produced from computation remains a significant problem for processors. Therefore, the thermal management of processors continues to be an active subject of research. Most thermal management research is performed using simulations, given the challenges in measuring temperatures in real processors. Fast yet accurate interval thermal simulation toolchains remain the research tool of choice to study thermal management in processors at the system level. However, the existing toolchains focus on the thermal management of cores in the processors, since they exhibit much higher power densities than memory. The memory bandwidth limitations associated with 2D processors have led to high-density 2.5D and 3D packaging technology: 2.5D packaging technology places cores and memory on the same package; 3D packaging technology takes it further by stacking layers of memory on top of the cores themselves. These new packagings significantly increase the power density of the processors, making them prone to overheating. Therefore, mitigating thermal issues in high-density processors (packaged with stacked memory) becomes even more pressing. However, given the lack of thermal modeling for memories in existing interval thermal simulation toolchains, they are unsuitable for studying thermal management for high-density processors. To address this issue, we present the first integrated Core and Memory interval Thermal (CoMeT) simulation toolchain. CoMeT comprehensively supports thermal simulation of high- and low-density processors corresponding to four different core-memory (integration) configurations: off-chip DDR memory, off-chip 3D memory, 2.5D, and 3D. CoMeT supports several novel features that facilitate overlying system research. CoMeT adds only an additional ~5% simulation-time overhead compared to an equivalent state-of-the-art core-only toolchain. The source code of CoMeT has been made open for public use under the MIT license.