
    What Makes Affinity-Based Schedulers So Efficient?

    The tremendous increase in the size and heterogeneity of supercomputers makes it very difficult to predict the performance of a scheduling algorithm. Therefore, dynamic solutions, where scheduling decisions are made at runtime, have surpassed static allocation strategies. The simplicity and efficiency of dynamic schedulers such as Hadoop are key to the success of the MapReduce framework. Dynamic schedulers such as StarPU, PaRSEC or StarSs have also been developed for more constrained computations, e.g. task graphs arising in linear algebra. To make their decisions, these runtime systems use both static information, such as the distance of tasks to the critical path or the affinity between tasks and computing resources (CPU, GPU, ...), and dynamic information, such as where input data are actually located. In this paper, we concentrate on two elementary linear algebra kernels, namely the outer product and the matrix multiplication. For each problem, we propose several dynamic strategies that can be used at runtime, and we provide an analytic study of their theoretical performance. We prove that the theoretical analysis provides a very good estimate of the amount of communication induced by a dynamic strategy, thus making it possible to choose among strategies for a given problem and architecture.
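    The affinity-driven decision described above can be illustrated by a toy dynamic scheduler (an illustrative sketch only; the names and the greedy rule are assumptions, not the paper's strategies): when a resource becomes idle, it picks the ready task requiring the least data to be fetched into its local memory.

```python
# Toy dynamic scheduling step (illustrative, not the paper's algorithm):
# choose the ready task whose input data are mostly already local.

def pick_task(ready_tasks, inputs, local_data):
    """Return the ready task minimizing bytes to fetch into local memory."""
    def to_fetch(task):
        return sum(size for name, size in inputs[task].items()
                   if name not in local_data)
    return min(ready_tasks, key=to_fetch)

# t1's inputs a and b are already resident; t2 would need 8 bytes transferred.
inputs = {'t1': {'a': 4, 'b': 4}, 't2': {'c': 8}}
local_data = {'a', 'b'}
print(pick_task(['t1', 't2'], inputs, local_data))  # → t1
```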

    Collective communications and steady-state scheduling on heterogeneous platforms

    The results presented in this document concern scheduling for large-scale heterogeneous platforms. We mainly focus on collective communications taking place during the execution of distributed applications: broadcast, scatter or reduction of data, for instance. We study the steady-state operation of these communications and we aim at maximizing the throughput of a series of similar communications. The goal is also to obtain an asymptotically optimal schedule with respect to makespan minimization. We present a general framework to study these communications, which we use to assess the complexity of the problem for each communication primitive. For a particular model of communication, the bidirectional one-port model, we develop a practical solution method, based on a linear program in rational numbers. These results are illustrated by experiments on the Grid5000 platform. The study of steady-state operations is extended to the scheduling of multiple applications on computing grids.
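    A minimal sketch of the one-port constraint at work (an assumption-level toy, not the report's linear program): in the bidirectional one-port model a node sends to at most one neighbor at a time, so the fraction of each period a node spends sending bounds the steady-state throughput of a series of broadcasts. For a star network where the source serializes its sends:

```python
# Toy steady-state throughput under the one-port model (illustrative only):
# the source of a star forwards the message to each worker in turn, so the
# period of one broadcast is the sum of the individual transfer times.

def max_throughput_star(msg_size, bandwidths):
    period = sum(msg_size / bw for bw in bandwidths)  # serialized sends
    return 1.0 / period    # broadcasts completed per time unit

print(max_throughput_star(1.0, [1.0, 0.5]))  # period = 1 + 2 = 3 → 1/3
```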

    Tree traversals with task-memory affinities

    We study the complexity of traversing tree-shaped workflows whose tasks require large I/O files. We target a heterogeneous architecture with two resource types, each with a different memory, such as a multicore node equipped with a dedicated accelerator (FPGA or GPU). The tasks in the workflow are colored according to their type and can be processed if all their input and output files can be stored in the corresponding memory. The amount of used memory of each type at a given execution step strongly depends upon the ordering in which the tasks are executed, and upon when communications between both memories are scheduled. The objective is to determine an efficient traversal that minimizes the maximum amount of memory of each type needed to traverse the whole tree. In this paper, we establish the complexity of this two-memory scheduling problem, and provide inapproximability results. In addition, we design several heuristics, based on both post-order and general traversals, and we evaluate them on a comprehensive set of tree graphs, including random trees as well as assembly trees arising in the context of sparse matrix factorizations.
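    The memory behavior of a post-order traversal can be sketched as follows (a simplified single-memory toy, ignoring the two-type coloring; the model where a node needs all children's outputs plus its own output resident at once is an illustrative assumption):

```python
# Peak memory of a post-order traversal of a task tree (illustrative model):
# to process a node, the output files of all its children plus its own
# output file must reside in memory simultaneously.

def postorder_peak(children, out_size, root):
    """Return the peak memory reached by a post-order traversal from `root`."""
    def visit(node):
        peak = 0   # maximum memory observed in this subtree
        held = 0   # outputs of already-processed children kept in memory
        for c in children.get(node, []):
            # while traversing child c, earlier siblings' outputs stay resident
            peak = max(peak, held + visit(c))
            held += out_size[c]
        # processing `node` itself needs every child output plus its own
        return max(peak, held + out_size[node])
    return visit(root)

# Root 0 with two leaf children: peak is 2 + 2 + 3 = 7.
children = {0: [1, 2]}
out_size = {0: 3, 1: 2, 2: 2}
print(postorder_peak(children, out_size, 0))  # → 7
```

Note that the child processing order matters in general: visiting memory-hungry subtrees first tends to lower the peak, which is exactly why the traversal order is the optimization lever.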

    Comments on the hierarchically structured bin packing problem

    We study the hierarchically structured bin packing problem. In this problem, the items to be packed into bins are at the leaves of a tree. The objective of the packing is to minimize the total number of bins into which the descendants of an internal node are packed, summed over all internal nodes. We investigate an existing algorithm and make a correction to the analysis of its approximation ratio. Further results regarding the structure of an optimal solution and a strengthened inapproximability result are given.
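    The objective can be made concrete with a short evaluator (an illustrative sketch, not the algorithm under study): given a packing of the leaf items, sum over all internal nodes the number of distinct bins holding that node's descendant leaves.

```python
# Evaluate the hierarchical bin packing objective for a given packing
# (illustrative helper, not the paper's approximation algorithm).

def hierarchical_cost(children, bin_of, root):
    """children: internal node -> child list; bin_of: leaf -> bin id."""
    cost = 0
    def bins_below(node):
        nonlocal cost
        if node not in children:          # leaf item
            return {bin_of[node]}
        used = set()
        for c in children[node]:
            used |= bins_below(c)
        cost += len(used)                 # bins touched by this internal node
        return used
    bins_below(root)
    return cost

# Leaves a, b share bin 0 under internal node n; leaf c sits alone in bin 1.
children = {'root': ['n', 'c'], 'n': ['a', 'b']}
bin_of = {'a': 0, 'b': 0, 'c': 1}
print(hierarchical_cost(children, bin_of, 'root'))  # n uses 1 bin, root 2 → 3
```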

    The impact of cache misses on the performance of matrix product algorithms on multicore platforms

    The multicore revolution is underway, bringing new chips with more complex memory architectures. Classical algorithms must be revisited in order to take the hierarchical memory layout into account. In this paper, we aim at designing cache-aware algorithms that minimize the number of cache misses paid during the execution of the matrix product kernel on a multicore processor. We analytically show how to achieve the best possible tradeoff between shared and distributed caches. We implement and evaluate several algorithms on two multicore platforms, one equipped with a quad-core Xeon, and the other enhanced with a GPU. It turns out that the impact of cache misses is very different across the two platforms, and we identify the main design parameters that lead to peak performance for each target hardware configuration.
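    The core cache-aware idea can be sketched with a tiled matrix product (a minimal sketch; the tile size b is the tuning parameter that the analysis would choose per cache level, and this is not the paper's exact algorithm):

```python
# Minimal blocked matrix product: work on b×b tiles so that the three active
# tiles fit together in a given cache level, reducing capacity misses.

def blocked_matmul(A, B, n, b):
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                # C tile (i0, j0) += A tile (i0, k0) × B tile (k0, j0)
                for i in range(i0, min(i0 + b, n)):
                    for k in range(k0, min(k0 + b, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + b, n)):
                            C[i][j] += a * B[k][j]
    return C

n = 4
A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
B = [[float(i * n + j) for j in range(n)] for i in range(n)]
print(blocked_matmul(A, B, n, 2) == B)  # identity × B = B → True
```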

    Memory-aware list scheduling for hybrid platforms

    This report provides memory-aware heuristics to schedule task graphs onto heterogeneous resources, such as a dual-memory cluster equipped with multicores and a dedicated accelerator (FPGA or GPU). Each task has a different processing time on each resource. The optimization objective is to schedule the graph so as to minimize execution time, given the available memory for each resource type. In addition to ordering the tasks, we must also decide on which resource to execute them, given their computation requirement and the memory currently available on each resource. The major contributions of this report are twofold: (i) the derivation of an intricate integer linear program formulation for this scheduling problem; and (ii) the design of memory-aware heuristics, which outperform the reference heuristics HEFT and MinMin on a wide variety of problem instances. The absolute performance of these heuristics is assessed for small-size graphs, with up to 30 tasks, thanks to the linear program.
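    One step of a memory-aware list-scheduling decision might look like the following sketch (names and the earliest-finish-time rule are illustrative assumptions, not the report's heuristics): among ready tasks, pick the task/resource pair with the earliest finish time, skipping placements that would overflow that resource's memory.

```python
# One memory-aware list-scheduling step (illustrative sketch): earliest
# finish time wins, but only among placements whose memory footprint fits.

def choose(ready, proc_time, mem_need, avail_time, free_mem):
    best = None
    for task in ready:
        for res in ('cpu', 'gpu'):
            if mem_need[task][res] > free_mem[res]:
                continue                              # would overflow memory
            finish = avail_time[res] + proc_time[task][res]
            if best is None or finish < best[0]:
                best = (finish, task, res)
    return best

# Task t is 5x faster on the GPU, but the GPU memory cannot hold its data,
# so the memory-aware rule falls back to the CPU.
proc_time = {'t': {'cpu': 10.0, 'gpu': 2.0}}
mem_need  = {'t': {'cpu': 1,    'gpu': 8}}
avail     = {'cpu': 0.0, 'gpu': 0.0}
free_mem  = {'cpu': 4,   'gpu': 4}
print(choose(['t'], proc_time, mem_need, avail, free_mem))  # → (10.0, 't', 'cpu')
```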

    Model and complexity results for tree traversals on hybrid platforms

    We study the complexity of traversing tree-shaped workflows whose tasks require large I/O files. We target a heterogeneous architecture with two resources of different types, where each resource has its own memory, such as a multicore node equipped with a dedicated accelerator (FPGA or GPU). Tasks in the workflow are tagged with the type of resource needed for their processing. Besides, a task can be processed on a given resource only if all its input files and output files can be stored in the corresponding memory. At a given execution step, the amount of data stored in each memory strongly depends upon the ordering in which the tasks are executed, and upon when communications between both memories are scheduled. The objective is to determine an efficient traversal that minimizes the maximum amount of memory of each type needed to traverse the whole tree. In this paper, we establish the complexity of this two-memory scheduling problem, provide inapproximability results, and show how to determine the optimal depth-first traversal. Altogether, these results lay the foundations for memory-aware scheduling algorithms on heterogeneous platforms.

    Scheduling malleable task trees

    Solving sparse linear systems can lead to processing tree workflows on a platform of processors. In this study, we use the model of malleable tasks motivated in [Prasanna96,Beaumont07] in order to study tree workflow schedules under two contradictory objectives: makespan minimization and memory minimization. First, we give a simpler proof of the result of [Prasanna96], which allows one to compute a makespan-optimal schedule for tree workflows. Then, we study a more realistic speed-up function and show that the previous schedules are not optimal in this context. Finally, we give complexity results concerning the objective of minimizing both makespan and memory.

    Malleable task-graph scheduling with a practical speed-up model

    Scientific workloads are often described by Directed Acyclic task Graphs. Indeed, DAGs represent both a model frequently studied in the theoretical literature and the structure employed by dynamic runtime schedulers to handle HPC applications. A natural problem is then to compute a makespan-minimizing schedule of a given graph. In this paper, we are motivated by task graphs arising from multifrontal factorizations of sparse matrices and therefore work under the following practical model. We focus on malleable tasks (i.e., a single task can be allotted a time-varying number of processors) and specifically on a simple yet realistic speedup model: each task can be perfectly parallelized, but only up to a limited number of processors. We first prove that the associated decision problem of minimizing the makespan is NP-complete. Then, we study a widely used algorithm, PropScheduling, under this practical model and propose a new strategy, GreedyFilling. Even though both strategies are 2-approximations, experiments on real and synthetic data sets show that GreedyFilling achieves significantly lower makespans.
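    The bounded-parallelism speedup model above can be written down directly, together with the two classic makespan lower bounds it induces (a hedged sketch; GreedyFilling and PropScheduling themselves are not reproduced here):

```python
# Speedup model assumed in this sketch: a task of work w allotted p processors
# runs in w / min(p, pmax) time, i.e. perfect parallelism up to a cap pmax.

def exec_time(w, p, pmax):
    return w / min(p, pmax)

# Three independent tasks of work 6 each, pmax = 2, on p = 3 processors:
# area bound = total work / p = 18/3 = 6; per-task bound = 6/2 = 3.
works, pmax, p = [6.0, 6.0, 6.0], 2, 3
area_bound = sum(works) / p
task_bound = max(exec_time(w, p, pmax) for w in works)
print(max(area_bound, task_bound))  # → 6.0
```

Any schedule's makespan is at least the larger of these two bounds, which is the yardstick a 2-approximation argument typically compares against.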

    Determining the optimal redistribution

    The classical redistribution problem aims at optimally scheduling communications when moving from an initial data distribution D_ini to a target distribution D_tar where each processor P_i will host a subset P(i) of data items. However, modern computing platforms are equipped with a powerful interconnection switch, and the cost of a given communication is (almost) independent of the location of its sender and receiver. This leads to generalizing the redistribution problem as follows: find the optimal permutation σ of processors such that P_i will host the set P(σ(i)), and for which the cost of the redistribution is minimal. This report studies the complexity of this generalized problem. We provide optimal algorithms and evaluate their gain over classical redistribution through simulations. We also show the NP-hardness of the problem of finding the optimal data partition and processor permutation (defined by new subsets P(σ(i))) that minimize the cost of the redistribution followed by a simple computation kernel.
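    The permutation-selection step can be illustrated with a brute-force toy (an illustrative sketch only; the report's optimal algorithms are more refined): choose σ to maximize the number of items already in place, which is equivalent to minimizing the number of items moved.

```python
# Toy search for the processor permutation that keeps the most data in place.
# overlap[i][j] = number of items processor i already holds from target subset
# P(j). Brute force over permutations; an assignment solver would scale better.

from itertools import permutations

def best_permutation(overlap):
    n = len(overlap)
    best, best_sigma = -1, None
    for sigma in permutations(range(n)):
        kept = sum(overlap[i][sigma[i]] for i in range(n))
        if kept > best:
            best, best_sigma = kept, sigma
    return best_sigma, best

# Processor 0 already holds 3 items of P(0); processor 1 holds 4 of P(1).
overlap = [[3, 1], [0, 4]]
print(best_permutation(overlap))  # → ((0, 1), 7): identity keeps 7 items
```

Because maximizing the kept items over all permutations is exactly a maximum-weight assignment problem, the brute force here can be replaced by a polynomial-time matching algorithm.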