105 research outputs found

    Randomized algorithms for fully online multiprocessor scheduling with testing

    Full text link
    We contribute the first randomized algorithm that is an integration of arbitrarily many deterministic algorithms for the fully online multiprocessor scheduling with testing problem. When there are two machines, we show that with two component algorithms its expected competitive ratio is already strictly smaller than the best proven deterministic competitive ratio lower bound. Such algorithmic results are rarely seen in the literature. Multiprocessor scheduling is one of the first combinatorial optimization problems that have received numerous studies. Recently, several research groups examined its testing variant, in which each job JjJ_j arrives with an upper bound uju_j on the processing time and a testing operation of length tjt_j; one can choose to execute JjJ_j for uju_j time, or to test JjJ_j for tjt_j time to obtain the exact processing time pjp_j followed by immediately executing the job for pjp_j time. Our target problem is the fully online version, in which the jobs arrive in sequence so that the testing decision needs to be made at the job arrival as well as the designated machine. We propose an expected (φ+3+1)(≈3.1490)(\sqrt{\varphi + 3} + 1) (\approx 3.1490)-competitive randomized algorithm as a non-uniform probability distribution over arbitrarily many deterministic algorithms, where φ=5+12\varphi = \frac {\sqrt{5} + 1}2 is the Golden ratio. When there are two machines, we show that our randomized algorithm based on two deterministic algorithms is already expected 3φ+313−7φ4(≈2.1839)\frac {3 \varphi + 3 \sqrt{13 - 7\varphi}}4 (\approx 2.1839)-competitive. Besides, we use Yao's principle to prove lower bounds of 1.66821.6682 and 1.65221.6522 on the expected competitive ratio for any randomized algorithm at the presence of at least three machines and only two machines, respectively, and prove a lower bound of 2.21172.2117 on the competitive ratio for any deterministic algorithm when there are only two machines.Comment: 21 pages with 1 plot; an extended abstract to be submitte

    Mapping tree-shaped workflows on systems with different memory sizes and processor speeds

    Get PDF
    Directed acyclic graphs are commonly used to model scientific workflows, by expressing dependencies between tasks, as well as the resource requirements of the workflow. As a special case, rooted directed trees occur in several applications, for instance in sparse matrix computations. Since typical workflows are modeled by large trees, it is crucial to schedule them efficiently, so that their execution time (or makespan) is minimized. Furthermore, it is usually beneficial to distribute the execution on several compute nodes, hence increasing the available memory, and allowing us to parallelize parts of the execution. To exploit the heterogeneity of modern clusters in this context, we investigate the partitioning and mapping of tree‐shaped workflows on two types of target architecture models: in AM1, each processor can have a different memory size, and in AM2, each processor can also have a different speed (in addition to a different memory size). We design a three‐step heuristic for AM1, which adapts and extends previous work for homogeneous clusters [Gou C, Benoit A, Marchal L. Partitioning tree‐shaped task graphs for distributed platforms with limited memory. IEEE Trans Parallel Dist Syst 2020; 31(7): 1533–1544]. The changes we propose concern the assignment to processors (accounting for the different memory sizes) and the availability of suitable processors when splitting or merging subtrees. For AM2, we extend the heuristic for AM1 with a two‐phase local search approach. Phase A is a swap‐based hill climber, while (the optional) Phase B is inspired by iterated local search. We evaluate our heuristics for AM1 and AM2 with extensive simulations, and we demonstrate that exploiting the heterogeneity in the cluster significantly reduces the makespan, compared to the state of the art for homogeneous processors.Peer Reviewe

    The rate of convergence to optimality of the LPT rule

    Get PDF
    The LPT rule is a heuristic method to distribute jobs among identical machines so as to minimize the makespan of the resulting schedule. If the processing times of the jobs are assumed to be independent identically distributed random variables, then (under a mild condition on the distribution) the absolute error of this heuristic is known to converge to 0 almost surely. In this note we analyse the asymptotic behaviour of the absolute error and its first and higher moments to show that under quite general assumptions the speed of convergence is proportional to appropriate powers of (log log n)/n and 1/n. Thus, we simplify, strengthen and extend earlier results obtained for the uniform and exponential distribution.

    Scheduling and Tuning Kernels for High-performance on Heterogeneous Processor Systems

    Get PDF
    Accelerated parallel computing techniques using devices such as GPUs and Xeon Phis (along with CPUs) have proposed promising solutions of extending the cutting edge of high-performance computer systems. A significant performance improvement can be achieved when suitable workloads are handled by the accelerator. Traditional CPUs can handle those workloads not well suited for accelerators. Combination of multiple types of processors in a single computer system is referred to as a heterogeneous system. This dissertation addresses tuning and scheduling issues in heterogeneous systems. The first section presents work on tuning scientific workloads on three different types of processors: multi-core CPU, Xeon Phi massively parallel processor, and NVIDIA GPU; common tuning methods and platform-specific tuning techniques are presented. Then, analysis is done to demonstrate the performance characteristics of the heterogeneous system on different input data. This section of the dissertation is part of the GeauxDock project, which prototyped a few state-of-art bioinformatics algorithms, and delivered a fast molecular docking program. The second section of this work studies the performance model of the GeauxDock computing kernel. Specifically, the work presents an extraction of features from the input data set and the target systems, and then uses various regression models to calculate the perspective computation time. This helps understand why a certain processor is faster for certain sets of tasks. It also provides the essential information for scheduling on heterogeneous systems. In addition, this dissertation investigates a high-level task scheduling framework for heterogeneous processor systems in which, the pros and cons of using different heterogeneous processors can complement each other. Thus a higher performance can be achieve on heterogeneous computing systems. A new scheduling algorithm with four innovations is presented: Ranked Opportunistic Balancing (ROB), Multi-subject Ranking (MR), Multi-subject Relative Ranking (MRR), and Automatic Small Tasks Rearranging (ASTR). The new algorithm consistently outperforms previously proposed algorithms with better scheduling results, lower computational complexity, and more consistent results over a range of performance prediction errors. Finally, this work extends the heterogeneous task scheduling algorithm to handle power capping feature. It demonstrates that a power-aware scheduler significantly improves the power efficiencies and saves the energy consumption. This suggests that, in addition to performance benefits, heterogeneous systems may have certain advantages on overall power efficiency

    Grain-size optimization and scheduling for distributed memory architectures

    Get PDF
    The problem of scheduling parallel programs for execution on distributed memory parallel architectures has become the subject of intense research in recent, years. Because of the high inter-processor communication overhead in existing parallel machines, a crucial step in scheduling is task clustering, the process of coalescing heavily communicating fine grain tasks into coarser ones in order to reduce the communication overhead so that the overall execution time is minimized. The thesis of this research is that the task of exposing the parallelism in a given application should be left to the algorithm designer. On the other hand, the task of limiting the parallelism in a chosen parallel algorithm is best handled by the compiler or operating system for the target parallel machine. Toward this end, we have developed CASS (for Clustering And Scheduling System), a. task management system that provides facilities for automatic granularity optimization and task scheduling of parallel programs on distributed memory parallel architectures. In CASS, a task graph generated by a profiler is used by the clustering module to find the best granularity al which to execute the program so that the overall execution time is minimized. The scheduling module maps the clusters onto a. fixed number of processors and determines the order of execution of tasks in each processor. The output of scheduling module is then used by a code generator to generate machine instructions. CASS employs two efficient heuristic algorithms for clustering static task graphs: CASS-I for clustering with task duplication, and CASS-II for clustering without task duplication. It is shown that the clustering algorithms used by CASS outperform the best known algorithms reported in the literature. For the scheduling module in CASS, a heuristic algorithm based on load balancing is used to merge clusters such that the number of clusters matches the number of available physical processors. We also investigate task clustering algorithms for dynamic task graphs and show that it is inherently more difficult than the static case

    Efficient Parallel Scheduling of Malleable Tasks

    Get PDF

    Fast and Accurate Simulation of Multithreaded Sparse Linear Algebra Solvers

    Get PDF
    International audienceThe ever growing complexity and scale of parallel architectures imposes to rewrite classical monolithic HPC scientific applications and libraries as their portability and performance optimization only comes at a prohibitive cost. There is thus a recent and general trend in using instead a modular approach where numerical algorithms are written at a high level independently of the hardware architecture as Directed Acyclic Graphs (DAG) of tasks. A task-based runtime system then dynamically schedules the resulting DAG on the different computing resources, automatically taking care of data movement and taking into account the possible speed heterogeneity and variability. Evaluating the performance of such complex and dynamic systems is extremely challenging especially for irregular codes. In this article, we explain how we crafted a faithful simulation, both in terms of performance and memory usage, of the behavior of qr_mumps, a fully-featured sparse linear algebra library, on multi-core architectures. In our approach, the target high-end machines are calibrated only once to derive sound performance models. These models can then be used at will to quickly predict and study in a reproducible way the performance of such irregular and resource-demanding applications using solely a commodity laptop
    • 

    corecore