187 research outputs found

    Space sharing job scheduling policies for parallel computers

    Get PDF
    The distinguishing characteristic of space sharing parallel job scheduling policies is that applications are allocated non-overlapping processor subsets. The interference among jobs is reduced, the synchronization delays and message latencies can be predictable, and distinct processors may be allocated to cooperating processes so as to avoid the overhead of context switches associated with traditional time-multiplexing;The processor allocation strategy, the job selection criteria, and workload characteristics are fundamental factors that influence system performance under space sharing. Allocation can be static or dynamic. The processor subset allocated to an application is fixed under static space sharing, whereas it can change during execution under dynamic space sharing. Static allocation can produce more predictable run times, permits a wide range of compiler optimizations (e.g., static data distribution and binding), and avoids the processor releases and reallocations associated with dynamic allocation. Its major problem is that it can induce high processor fragmentation;In this dissertation, alternative static and dynamic space sharing policies that differ in the allocation discipline and the job selection criteria are studied. The results show that significantly superior performance can be achieved under static space sharing if applications can be folded (i.e., allocated fewer processors than they requested). Folding typically increases program efficiency and can reduce processor fragmentation. Policies that increase folding with the system load are proposed and compared to schemes that use unconstrained folding, no folding, and fixed maximum folding factors. The adaptive policies produced higher and more stable system utilization, significantly shorter mean response times, and good fairness curves. However, unconstrained folding resulted in considerably more severe processor fragmentation than no folding. Its advantage is that it exploits the efficiency improvement that typically results when an application is allocated fewer processors. Consequently, it can produce shorter mean response times than no folding under medium to heavy loads;Also because of this efficiency improvement, dynamic policies that reduce waiting times by executing a large number of jobs simultaneously are more promising than schemes that limit the number of active jobs. However, limiting the number of active applications can be the superior approach when folding does not improve application efficiency

    Provably Efficient Adaptive Scheduling for Parallel Jobs

    Get PDF
    Scheduling competing jobs on multiprocessors has always been an important issue for parallel and distributed systems. The challenge is to ensure global, system-wide efficiency while offering a level of fairness to user jobs. Various degrees of successes have been achieved over the years. However, few existing schemes address both efficiency and fairness over a wide range of work loads. Moreover, in order to obtain analytical results, most of them require prior information about jobs, which may be difficult to obtain in real applications. This paper presents two novel adaptive scheduling algorithms -- GRAD for centralized scheduling, and WRAD for distributed scheduling. Both GRAD and WRAD ensure fair allocation under all levels of workload, and they offer provable efficiency without requiring prior information of job's parallelism. Moreover, they provide effective control over the scheduling overhead and ensure efficient utilization of processors. To the best of our knowledge, they are the first non-clairvoyant scheduling algorithms that offer such guarantees. We also believe that our new approach of resource request-allotment protocol deserves further exploration. Specifically, both GRAD and WRAD are O(1)-competitive with respect to mean response time for batched jobs, and O(1)-competitive with respect to makespan for non-batched jobs with arbitrary release times. The simulation results show that, for non-batched jobs, the makespan produced by GRAD is no more than 1.39 times of the optimal on average and it never exceeds 4.5 times. For batched jobs, the mean response time produced by GRAD is no more than 2.37 times of the optimal on average, and it never exceeds 5.5 times.Singapore-MIT Alliance (SMA

    Fault- and Yield-Aware On-Chip Memory Design and Management

    Get PDF
    Ever decreasing device size causes more frequent hard faults, which becomes a serious burden to processor design and yield management. This problem is particularly pronounced in the on-chip memory which consumes up to 70% of a processor' s total chip area. Traditional circuit-level techniques, such as redundancy and error correction code, become less effective in error-prevalent environments because of their large area overhead. In this work, we suggest an architectural solution to building reliable on-chip memory in the future processor environment. Our approaches have two parts, a design framework and architectural techniques for on-chip memory structures. Our design framework provides important architectural evaluation metrics such as yield, area, and performance based on low level defects and process variations parameters. Processor architects can quickly evaluate their designs' characteristics in terms of yield, area, and performance. With the framework, we develop architectural yield enhancement solutions for on-chip memory structures including L1 cache, L2 cache and directory memory. Our proposed solutions greatly improve yield with negligible area and performance overhead. Furthermore, we develop a decoupled yield model of compute cores and L2 caches in CMPs, which show that there will be many more L2 caches than compute cores in a chip. We propose efficient utilization techniques for excess caches. Evaluation results show that excess caches significantly improve overall performance of CMPs

    Optimizing virtual machine scheduling in NUMA multicore systems

    Full text link
    An increasing number of new multicore systems use the Non-Uniform Memory Access architecture due to its scalable memory performance. However, the complex interplay among data locality, contention on shared on-chip memory resources, and cross-node data sharing overhead, makes the delivery of an optimal and predictable program performance difficult. Vir-tualization further complicates the scheduling problem. Due to abstract and inaccurate mappings from virtual hardware to machine hardware, program and system-level optimizations are often not effective within virtual machines. We find that the penalty to access the “uncore ” memory subsystem is an effective metric to predict program perfor-mance in NUMA multicore systems. Based on this metric, we add NUMA awareness to the virtual machine scheduling. We propose a Bias Random vCPU Migration (BRM) algorithm that dynamically migrates vCPUs to minimize the system-wide uncore penalty. We have implemented the scheme in the Xen virtual machine monitor. Experiment results on a two-way Intel NUMA multicore system with various workloads show that BRM is able to improve application performance by up to 31.7 % compared with the default Xen credit scheduler. More-over, BRM achieves predictable performance with, on average, no more than 2 % runtime variations. 1

    Adaptive space-time sharing with SCOJO.

    Get PDF
    Coscheduling is a technique used to improve the performance of parallel computer applications under time sharing, i.e., to provide better response times than standard time sharing or space sharing. Dynamic coscheduling and gang scheduling are two main forms of coscheduling. In SCOJO (Share-based Job Coscheduling), we have introduced our own original framework to employ loosely coordinated dynamic coscheduling and a dynamic directory service in support of scheduling cross-site jobs in grid scheduling. SCOJO guarantees effective CPU shares by taking coscheduling effects into consideration and supports both time and CPU share reservation for cross-site job. However, coscheduling leads to high memory pressure and still involves problems like fragmentation and context-switch overhead, especially when applying higher multiprogramming levels. As main part of this thesis, we employ gang scheduling as more directly suitable approach for combined space-time sharing and extend SCOJO for clusters to incorporate adaptive space sharing into gang scheduling. We focus on taking advantage of moldable and malleable characteristics of realistic job mixes to dynamically adapt to varying system workloads and flexibly reduce fragmentation. In addition, our adaptive scheduling approach applies standard job-scheduling techniques like a priority and aging system, backfilling or easy backfilling. We demonstrate by the results of a discrete-event simulation that this dynamic adaptive space-time sharing approach can deliver better response times and bounded relative response times even with a lower multiprogramming level than traditional gang scheduling.Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2004 .H825. Source: Masters Abstracts International, Volume: 43-01, page: 0237. Adviser: A. Sodan. Thesis (M.Sc.)--University of Windsor (Canada), 2004

    "Virtual malleability" applied to MPI jobs to improve their execution in a multiprogrammed environment"

    Get PDF
    This work focuses on scheduling of MPI jobs when executing in shared-memory multiprocessors (SMPs). The objective was to obtain the best performance in response time in multiprogrammed multiprocessors systems using batch systems, assuming all the jobs have the same priority. To achieve that purpose, the benefits of supporting malleability on MPI jobs to reduce fragmentation and consequently improve the performance of the system were studied. The contributions made in this work can be summarized as follows:· Virtual malleability: A mechanism where a job is assigned a dynamic processor partition, where the number of processes is greater than the number of processors. The partition size is modified at runtime, according to external requirements such as the load of the system, by varying the multiprogramming level, making the job contend for resources with itself. In addition to this, a mechanism which decides at runtime if applying local or global process queues to an application depending on the load balancing between processes of it. · A job scheduling policy, that takes decisions such as how many processes to start with and the maximum multiprogramming degree based on the type and number of applications running and queued. Moreover, as soon as a job finishes execution and where there are queued jobs, this algorithm analyzes whether it is better to start execution of another job immediately or just wait until there are more resources available. · A new alternative to backfilling strategies for the problema of window execution time expiring. Virtual malleability is applied to the backfilled job, reducing its partition size but without aborting or suspending it as in traditional backfilling. The evaluation of this thesis has been done using a practical approach. All the proposals were implemented, modifying the three scheduling levels: queuing system, processor scheduler and runtime library. The impact of the contributions were studied under several types of workloads, varying machine utilization, communication and, balance degree of the applications, multiprogramming level, and job size. Results showed that it is possible to offer malleability over MPI jobs. An application obtained better performance when contending for the resources with itself than with other applications, especially in workloads with high machine utilization. Load imbalance was taken into account obtaining better performance if applying the right queue type to each application independently.The job scheduling policy proposed exploited virtual malleability by choosing at the beginning of execution some parameters like the number of processes and maximum multiprogramming level. It performed well under bursty workloads with low to medium machine utilizations. However as the load increases, virtual malleability was not enough. That is because, when the machine is heavily loaded, the jobs, once shrunk are not able to expand, so they must be executed all the time with a partition smaller than the job size, thus degrading performance. Thus, at this point the job scheduling policy concentrated just in moldability.Fragmentation was alleviated also by applying backfilling techniques to the job scheduling algorithm. Virtual malleability showed to be an interesting improvement in the window expiring problem. Backfilled jobs even on a smaller partition, can continue execution reducing memory swapping generated by aborts/suspensions In this way the queueing system is prevented from reinserting the backfilled job in the queue and re-executing it in the future.Postprint (published version

    Doctor of Philosophy

    Get PDF
    dissertationIn recent years, a number of trends have started to emerge, both in microprocessor and application characteristics. As per Moore's law, the number of cores on chip will keep doubling every 18-24 months. International Technology Roadmap for Semiconductors (ITRS) reports that wires will continue to scale poorly, exacerbating the cost of on-chip communication. Cores will have to navigate an on-chip network to access data that may be scattered across many cache banks. The number of pins on the package, and hence available off-chip bandwidth, will at best increase at sublinear rate and at worst, stagnate. A number of disruptive memory technologies, e.g., phase change memory (PCM) have begun to emerge and will be integrated into the memory hierarchy sooner than later, leading to non-uniform memory access (NUMA) hierarchies. This will make the cost of accessing main memory even higher. In previous years, most of the focus has been on deciding the memory hierarchy level where data must be placed (L1 or L2 caches, main memory, disk, etc.). However, in modern and future generations, each level is getting bigger and its design is being subjected to a number of constraints (wire delays, power budget, etc.). It is becoming very important to make an intelligent decision about where data must be placed within a level. For example, in a large non-uniform access cache (NUCA), we must figure out the optimal bank. Similarly, in a multi-dual inline memory module (DIMM) non uniform memory access (NUMA) main memory, we must figure out the DIMM that is the optimal home for every data page. Studies have indicated that heterogeneous main memory hierarchies that incorporate multiple memory technologies are on the horizon. We must develop solutions for data management that take heterogeneity into account. For these memory organizations, we must again identify the appropriate home for data. In this dissertation, we attempt to verify the following thesis statement: "Can low-complexity hardware and OS mechanisms manage data placement within each memory hierarchy level to optimize metrics such as performance and/or throughput?" In this dissertation we argue for a hardware-software codesign approach to tackle the above mentioned problems at different levels of the memory hierarchy. The proposed methods utilize techniques like page coloring and shadow addresses and are able to handle a large number of problems ranging from managing wire-delays in large, shared NUCA caches to distributing shared capacity among different cores. We then examine data-placement issues in NUMA main memory for a many-core processor with a moderate number of on-chip memory controllers. Using codesign approaches, we achieve efficient data placement by modifying the operating system's (OS) page allocation algorithm for a wide variety of main memory architectures

    The impact of operating system scheduling policies and synchronization methods of performance of parallel applications

    Full text link

    XChange: A market-based approach to scalable dynamic multi-resource allocation in multicore architectures

    Full text link
    Efficiently allocating shared on-chip resources across cores is critical to optimize execution in chip multiprocessors (CMPs). Techniques proposed in the literature often rely on global, centralized mechanisms that seek to maximize system through-put. Global optimization may hurt scalability: as more cores are integrated on a die, the search space grows exponentially, making it harder to achieve optimal or even acceptable oper-ating points at run-time without incurring significant over-heads. In this paper, we propose XChange, a novel CMP resource allocation mechanism that delivers scalable high through-put and fairness. Through XChange, the CMP functions as a market, where each shared resource is assigned a price which changes over time, and each core seeks to maximize its own utility, by bidding for these shared resources. Because each core works largely independently, the resource alloca-tion becomes a scalable, mostly distributed decision-making process. In addition, by distributing the resources propor-tionally to the bids, the system avoids unfairness, treating each core in an unbiased manner. Our evaluation shows that, using detailed simulations of a 64-core CMP configuration running a variety of multipro-grammed workloads, the proposed XChange mechanism im-proves system throughput (weighted speedup) by about 21% on average, and fairness (harmonic speedup) by about 24% on average, compared with equal-share on-chip cache and power distribution. On both metrics, that is at least about twice as much improvement over equal-share as a state-of-the-art centralized allocation scheme. Furthermore, our re-sults show that XChange is significantly more scalable than the state-of-the-art centralized allocation scheme we com-pare against. 1
    corecore