
    Independent and Divisible Task Scheduling on Heterogeneous Star-shaped Platforms with Limited Memory

    In this paper, we consider the problem of allocating and scheduling a collection of independent, equal-sized tasks on heterogeneous star-shaped platforms. We also address the same problem for divisible tasks. In both cases, we take memory constraints into account. We prove strong NP-completeness results for different objective functions, namely makespan minimization and throughput maximization, on simple star-shaped platforms. We propose an approximation algorithm based on the unconstrained version (with unlimited memory) of the problem. We introduce several heuristics, which are evaluated and compared through extensive simulations. An unexpected conclusion drawn from these experiments is that classical scheduling heuristics that greedily minimize the completion time of each task are outperformed by the simple heuristic of assigning each task to the available processor with the smallest communication time, regardless of its computing power (hence a "bandwidth-centric" distribution).
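    A minimal sketch of the bandwidth-centric rule follows; it is an illustrative reconstruction, not the authors' code. When the master's single outgoing link is free, the next task goes to an idle worker with the smallest communication time, ignoring computation speed. The worker names, the cost fields, and the one-port event loop are assumptions made for the example.

        def bandwidth_centric_schedule(num_tasks, workers):
            """Greedy 'bandwidth-centric' task assignment on a star platform.

            workers: list of dicts {'name': str, 'comm': float, 'comp': float},
                     giving per-task communication and computation times.
            The master owns a single outgoing link (one-port model) and a
            worker only receives a new task once it is idle.
            """
            free_at = {w['name']: 0.0 for w in workers}   # when each worker is idle again
            link_free_at = 0.0                            # when the master link is free
            assigned = {w['name']: 0 for w in workers}
            makespan = 0.0

            for _ in range(num_tasks):
                now = link_free_at
                ready = [w for w in workers if free_at[w['name']] <= now]
                if not ready:
                    # No worker is idle yet: wait until the first one finishes.
                    now = min(free_at.values())
                    ready = [w for w in workers if free_at[w['name']] <= now]
                # Bandwidth-centric rule: ignore computing power and pick the
                # available worker with the smallest communication time.
                best = min(ready, key=lambda w: w['comm'])
                end_comm = now + best['comm']
                end_comp = end_comm + best['comp']
                link_free_at = end_comm
                free_at[best['name']] = end_comp
                assigned[best['name']] += 1
                makespan = max(makespan, end_comp)
            return makespan, assigned

        # Tiny usage example with made-up platform parameters.
        print(bandwidth_centric_schedule(10, [
            {'name': 'P1', 'comm': 1.0, 'comp': 5.0},
            {'name': 'P2', 'comm': 2.0, 'comp': 1.0},
        ]))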

    Revisiting Matrix Product on Master-Worker Platforms

    This paper is aimed at designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While matrix product is well understood for homogeneous 2D arrays of processors (e.g., Cannon's algorithm and the ScaLAPACK outer-product algorithm), three key hypotheses render our work original and innovative:
    - Centralized data. We assume that all matrix files originate from, and must be returned to, the master.
    - Heterogeneous star-shaped platforms. We target fully heterogeneous platforms, where computational resources have different computing powers.
    - Limited memory. Because we investigate the parallelization of large problems, we cannot assume that full matrix panels can be stored in the worker memories and re-used for subsequent updates (as in ScaLAPACK).
    We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages), and we report a set of numerical experiments on various platforms at Ecole Normale Superieure de Lyon and the University of Tennessee. We point out, however, that in this first version of the report, experiments are limited to homogeneous platforms.
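    To make the limited-memory hypothesis concrete, the sketch below shows the classical blocked outer-product update that a worker might perform when it can hold only a few square blocks at a time. It is a generic illustration under assumed block sizes and data streaming, not the resource-selection or communication-ordering algorithms of the report.

        import numpy as np

        def worker_outer_product(a_blocks, b_blocks, block_size):
            """Accumulate one block of C from a stream of (A block, B block) pairs.

            The worker is assumed to keep only three blocks resident at a time:
            one block of A, one block of B, and the block of C it owns.
            a_blocks / b_blocks are iterables of (block_size x block_size)
            arrays sent by the master, one rank-`block_size` update at a time.
            """
            c_block = np.zeros((block_size, block_size))
            for a_blk, b_blk in zip(a_blocks, b_blocks):
                # Outer-product update: C_ij += A_ik @ B_kj for the current k.
                c_block += a_blk @ b_blk
            return c_block  # returned to the master once all updates are applied

        # Usage with made-up data: four successive updates of a 2x2 block of C.
        rng = np.random.default_rng(0)
        a_stream = [rng.standard_normal((2, 2)) for _ in range(4)]
        b_stream = [rng.standard_normal((2, 2)) for _ in range(4)]
        print(worker_outer_product(a_stream, b_stream, 2))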

    Scheduling multiple bags of tasks on heterogeneous master-worker platforms: centralized versus distributed solutions

    Multiple applications that execute concurrently on heterogeneous platforms compete for CPU and network resources. In this paper we consider the problem of scheduling applications to ensure fair and efficient execution on master-worker platforms where communication is restricted to a tree embedded in the network. The goal of the scheduling is to obtain the best throughput while enforcing some fairness between applications. We show how to derive an asymptotically optimal periodic schedule by solving a linear program expressing all problem constraints. For single-level trees, the optimal solution can be computed analytically. For large-scale platforms, gathering the global knowledge needed by the linear-programming approach might be unrealistic. One solution is to adapt the multi-commodity flow algorithm of Awerbuch and Leighton, but it still requires some global knowledge. Thus, we also investigate heuristic solutions using only local information, and test them via simulations. The best of our heuristics achieves the optimal performance on about two-thirds of our test cases, but is far worse in a few cases.
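    As a rough illustration of the linear-programming formulation mentioned above, the sketch below maximizes the worst per-application throughput on a single-level tree under one-port communication and per-worker computation constraints. The rate variables, the cost notation, and the use of scipy are assumptions for the example; the program in the paper expresses the full set of tree constraints.

        import numpy as np
        from scipy.optimize import linprog

        def fair_throughput(comm, comp):
            """Max-min throughput of K applications on a single-level (star) tree.

            comm[k][i]: time to send one task of application k to worker i
            comp[k][i]: time for worker i to process one task of application k
            Variables: alpha[k, i], tasks of k processed per time unit on worker i,
            plus t, the worst per-application throughput, which is maximized.
            """
            comm, comp = np.asarray(comm, float), np.asarray(comp, float)
            K, W = comm.shape
            n = K * W + 1                      # alpha variables followed by t
            c = np.zeros(n); c[-1] = -1.0      # linprog minimizes, so minimize -t

            A_ub, b_ub = [], []
            # Each worker can compute for at most one time unit per time unit.
            for i in range(W):
                row = np.zeros(n)
                for k in range(K):
                    row[k * W + i] = comp[k, i]
                A_ub.append(row); b_ub.append(1.0)
            # One-port master: all outgoing communications share the single link.
            row = np.zeros(n)
            for k in range(K):
                for i in range(W):
                    row[k * W + i] = comm[k, i]
            A_ub.append(row); b_ub.append(1.0)
            # Fairness: every application reaches throughput at least t.
            for k in range(K):
                row = np.zeros(n); row[-1] = 1.0
                for i in range(W):
                    row[k * W + i] = -1.0
                A_ub.append(row); b_ub.append(0.0)

            res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                          bounds=[(0, None)] * n, method="highs")
            return res.x[-1], res.x[:-1].reshape(K, W)

        # Two applications, two workers, made-up per-task costs.
        t, alpha = fair_throughput(comm=[[1.0, 2.0], [2.0, 1.0]],
                                   comp=[[3.0, 4.0], [5.0, 2.0]])
        print("worst-case throughput:", t)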

    Dynamic Scheduling of MapReduce Shuffle under Bandwidth Constraints

    Whether it is for e-science or business, the amount of data produced every year is growing at a high rate. Managing and processing those data raises new challenges. MapReduce is one answer to the need for scalable tools able to handle such volumes of data. It imposes a general structure of computation and lets the implementation perform its own optimizations. During the computation there is a phase called Shuffle, in which every node sends a possibly large amount of data to every other node. This report proposes and evaluates six algorithms to improve data transfers during the Shuffle phase under bandwidth constraints.
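    The report's six algorithms are not reproduced here; as a point of reference, the sketch below shows one common baseline for bandwidth-constrained all-to-all transfers: decompose the Shuffle traffic matrix into rounds in which each node sends to at most one peer and receives from at most one peer. The demand matrix, the uniform rate, and the round length are all assumptions for the illustration.

        def schedule_shuffle(demand, rate, round_length):
            """Greedy round-based schedule for an all-to-all Shuffle.

            demand[i][j]: bytes node i still has to send to node j (i != j)
            rate:         per-link bandwidth in bytes per second (assumed uniform)
            round_length: duration of one communication round in seconds

            Each round is a matching: every node sends to at most one peer and
            receives from at most one peer, so no NIC is oversubscribed.
            Returns a list of rounds, each a list of (src, dst, bytes) transfers.
            """
            n = len(demand)
            remaining = [row[:] for row in demand]
            per_round = rate * round_length
            rounds = []
            while any(remaining[i][j] > 0 for i in range(n) for j in range(n)):
                busy_src, busy_dst, transfers = set(), set(), []
                # Serve the largest remaining demands first.
                pairs = sorted(((remaining[i][j], i, j)
                                for i in range(n) for j in range(n)
                                if remaining[i][j] > 0), reverse=True)
                for size, i, j in pairs:
                    if i in busy_src or j in busy_dst:
                        continue
                    sent = min(size, per_round)
                    remaining[i][j] -= sent
                    transfers.append((i, j, sent))
                    busy_src.add(i); busy_dst.add(j)
                rounds.append(transfers)
            return rounds

        # Three nodes exchanging made-up amounts of intermediate data.
        demand = [[0, 40, 10], [30, 0, 20], [10, 50, 0]]
        for r, transfers in enumerate(schedule_shuffle(demand, rate=10, round_length=1)):
            print("round", r, transfers)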

    Scheduling for Large Scale Distributed Computing Systems: Approaches and Performance Evaluation Issues

    Although our everyday life and society now depend heavily on communication and computation infrastructures, scientists and engineers have always been among the main consumers of computing power. This document provides a coherent overview of the research I have conducted in the last 15 years, which targets the management and performance evaluation of large-scale distributed computing infrastructures such as clusters, grids, desktop grids, volunteer computing platforms, etc., when used for scientific computing. In the first part of this document, I present how I have addressed scheduling problems arising on distributed platforms (like computing grids), with a particular emphasis on heterogeneity and multi-user issues, hence in connection with game theory. Most of these problems are relaxed from a classical combinatorial optimization formulation into a continuous form, which makes it easy to account for key platform characteristics such as heterogeneity or complex topology while providing efficient, practical, and distributed solutions. The second part presents my main contributions to the SimGrid project, a simulation toolkit for building simulators of distributed applications (originally designed for evaluating scheduling algorithms). It comprises a unified presentation of how the questions of validation and scalability have been addressed in SimGrid, as well as thoughts on specific challenges related to methodological aspects and to the application of SimGrid to the HPC context.

    Scheduling And Resource Management For Complex Systems: From Large-scale Distributed Systems To Very Large Sensor Networks

    In this dissertation, we focus on multiple levels of optimized resource management techniques. We first consider a classic resource management problem, namely the scheduling of data-intensive applications. We define the Divisible Load Scheduling (DLS) problem, outline the system model, which is based on the assumption that data staging and all communication with the sites can be done in parallel, and introduce a set of optimal divisible load scheduling algorithms and the related fault-tolerant coordination algorithm. The DLS algorithms introduced in this dissertation exploit parallel communication, consider realistic scenarios regarding the times when heterogeneous computing systems are available, and generate optimal schedules. Performance studies show that these algorithms perform better than divisible load scheduling algorithms based upon sequential communication. We have developed a self-organization model for resource management in distributed systems consisting of a very large number of sites with excess computing capacity. This self-organization model is inspired by biological metaphors and uses the concept of varying energy levels to express activity and goal satisfaction. The model is applied to Pleiades, a service-oriented architecture based on resource virtualization. The self-organization model for complex computing and communication systems is then applied to Very Large Sensor Networks (VLSNs). An algorithm for the self-organization of anonymous sensor nodes called SFSN (Scale-free Sensor Networks) and an algorithm utilizing the Small-worlds principle called SWAS (Small-worlds of Anonymous Sensors) are introduced. The SFSN algorithm is designed for VLSNs consisting of a fairly large number of inexpensive sensors with limited resources. An important feature of the algorithm is its ability to interconnect sensors without an identity or physical address, as used by traditional communication and coordination protocols. During the self-organization phase, collision-free communication channels are established that allow a sensor to synchronously forward information to the members of its proximity set; this communication pattern is then followed during the activity phases. A simulation study shows that SFSN ensures scalability and limits both the amount of communication and the complexity of coordination. The SWAS algorithm improves on SFSN by applying the Small-worlds principle. It is unique in its ability to create a sensor network with a topology approximating small-world networks. Rather than creating shortcuts between pairs of diametrically positioned nodes in a logical ring, we end up with something resembling double-stranded DNA. By exploiting the Small-worlds principle we combine two desirable features of networks, namely high clustering and small path length.
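    The divisible-load idea under parallel communication can be made concrete with a very small sketch: choose the load fractions so that all workers finish at the same time, the classical optimality condition for divisible loads. The per-unit cost model and all names below are assumptions made for the example, not the dissertation's exact algorithms.

        def divisible_load_fractions(total_load, comm, comp):
            """Split a divisible load so that all workers finish simultaneously.

            comm[i]: time to send one load unit to worker i (transfers in parallel)
            comp[i]: time for worker i to process one load unit
            Assumes each worker receives its whole chunk before computing.
            Returns (fractions, makespan).
            """
            # Worker i finishes at alpha_i * total_load * (comm[i] + comp[i]);
            # equalizing these times makes alpha_i proportional to 1/(comm[i]+comp[i]).
            weights = [1.0 / (c + w) for c, w in zip(comm, comp)]
            total = sum(weights)
            fractions = [wt / total for wt in weights]
            makespan = total_load / total
            return fractions, makespan

        # Three heterogeneous workers sharing 100 load units (made-up costs).
        print(divisible_load_fractions(100, comm=[0.1, 0.2, 0.5], comp=[1.0, 0.5, 0.8]))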

    Efficient Scheduling and High-Performance Graph Partitioning on Heterogeneous CPU-GPU Systems

    Heterogeneous CPU-GPU systems have emerged as a power-efficient platform for high-performance parallelization of applications. However, effectively exploiting these architectures faces a number of challenges, including differences in the programming models of the CPU (MIMD) and the GPU (SIMD), GPU memory constraints, and comparatively low communication bandwidth between the CPU and GPU. As a consequence, high-performance execution of applications on these platforms requires designing new adaptive parallelizing methods. In this thesis, we first explore embarrassingly parallel applications, where tasks have no inter-dependencies. Although the massive processing power of GPUs provides an attractive opportunity for high-performance execution of embarrassingly parallel tasks on CPU-GPU systems, minimized execution time can only be obtained by optimally distributing the tasks between the processors. In contemporary CPU-GPU systems, the scheduler cannot decide on an appropriate distribution ratio, so manually dividing the tasks among the processors requires high programming effort. Herein, we design and implement a new dynamic scheduling heuristic to minimize the execution time of embarrassingly parallel applications on a heterogeneous CPU-GPU system (a simple sketch of this idea follows the abstract). The scheduler is integrated into a scheduling framework that provides pre-implemented, automated scheduling modules, liberating the user from the complexities of scheduling details. The experimental results show that our scheduling approach achieves performance better than or similar to some of the scheduling algorithms proposed for CPU-GPU systems. We then investigate task-dependent applications, where the tasks have data dependencies. The computational tasks and their communication patterns are expressed by a task interaction graph. Scheduling the task interaction graph on a cluster can be done by first partitioning the graph into a set of computationally balanced partitions such that the communication cost among the partitions is minimized, and subsequently mapping the partitions onto physical processors. Aside from scheduling, graph partitioning is a common computation phase in many application domains, including social network analysis, data mining, and VLSI design. However, irregular and data-dependent graph partitioning sub-tasks pose multiple challenges for efficient GPU utilization, which favors regularity. We design and implement a multilevel graph partitioner on a heterogeneous CPU-GPU system that takes advantage of the high parallel processing power of GPUs by executing the computation-intensive parts of the partitioning sub-tasks on the GPU and assigning the parts with less parallelism to the CPU. Our partitioner aims to overcome some of the challenges arising from the irregular nature of the algorithm and from memory constraints on GPUs. We present a lock-free scheme, since fine-grained synchronization among thousands of GPU threads imposes too high a performance overhead. Experimental results demonstrate that our partitioner outperforms serial and parallel MPI-based partitioners, and performs comparably to a shared-memory CPU-based parallel graph partitioner. To optimize the graph partitioner's performance, we describe an effective and methodological approach to enable GPU-based multilevel graph partitioning tailored specifically for the SIMD architecture.
Our solution avoids thread divergence and balances the load over GPU threads by dynamically assigning an appropriate number of threads to process each graph vertex and its irregularly sized neighbor list. Our optimized design is autonomous, as all the steps are carried out by the GPU with minimal CPU interference. We show that this design outperforms the CPU-based parallel graph partitioner. Finally, we apply some of our partitioning techniques to another graph processing algorithm, minimum spanning tree (MST), which exhibits load-imbalance characteristics. We show that extending these techniques helps achieve a high-performance implementation of MST on the GPU.
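    The sketch below illustrates the general idea behind dynamically distributing independent tasks between a CPU and a GPU: both devices repeatedly pull chunks of work from a shared queue, so the faster device automatically processes a larger share. It is a generic chunk self-scheduling demo under assumed per-task times and a thread-based simulation, not the thesis's heuristic or framework.

        import threading
        import time

        def dynamic_cpu_gpu_demo(num_tasks, chunk, task_time):
            """Chunk-based self-scheduling of independent tasks between two devices.

            task_time: dict mapping a device name to its (simulated) seconds per task.
            Each device repeatedly grabs the next chunk of task indices, so the
            faster device ends up with more of the work without manual tuning.
            """
            next_task = 0
            lock = threading.Lock()
            done = {name: 0 for name in task_time}

            def worker(name):
                nonlocal next_task
                while True:
                    with lock:
                        start = next_task
                        end = min(num_tasks, start + chunk)
                        next_task = end
                    if start >= num_tasks:
                        return
                    # Simulate processing this chunk on the given device.
                    time.sleep((end - start) * task_time[name])
                    done[name] += end - start

            threads = [threading.Thread(target=worker, args=(name,)) for name in task_time]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
            return done

        # A hypothetical GPU that is four times faster per task than the CPU.
        print(dynamic_cpu_gpu_demo(200, chunk=10,
                                   task_time={"cpu": 0.004, "gpu": 0.001}))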

    Computational Methods in Science and Engineering : Proceedings of the Workshop SimLabs@KIT, November 29 - 30, 2010, Karlsruhe, Germany

    This proceedings volume is a compilation of contributed articles covering applications from a range of research fields, from capacity up to capability computing. Besides classical computing aspects such as parallelization, the focus of these proceedings is on multi-scale approaches and methods for tackling algorithmic and data complexity. Practical aspects regarding the usage of the HPC infrastructure and the tools and software available at the SCC are also presented.