14 research outputs found

    Memory-aware list scheduling for hybrid platforms

    This report provides memory-aware heuristics to schedule task graphs onto heterogeneous resources, such as a dual-memory cluster equipped with multicores and a dedicated accelerator (FPGA or GPU). Each task has a different processing time on each resource. The optimization objective is to schedule the graph so as to minimize execution time, given the available memory for each resource type. In addition to ordering the tasks, we must also decide on which resource to execute them, given their computation requirements and the memory currently available on each resource. The major contributions of this report are twofold: (i) the derivation of an intricate integer linear program formulation for this scheduling problem; and (ii) the design of memory-aware heuristics, which outperform the reference heuristics HEFT and MinMin on a wide variety of problem instances. The absolute performance of these heuristics is assessed for small-size graphs, with up to 30 tasks, thanks to the linear program formulation.
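    The setting can be pictured with a minimal greedy list scheduler. This is an illustrative sketch only, not the paper's ILP or its heuristics; all names, and the simplified memory model (each resource runs one task at a time, and a task only needs its footprint to fit within the resource's capacity), are assumptions.

```python
# Illustrative memory-aware greedy list scheduling over two resource
# types (e.g. "cpu" and "gpu"). A task is placed on the fitting
# resource that lets it finish earliest. All names are hypothetical.

def schedule(tasks, order, capacity):
    """tasks: {name: {"time": {res: t}, "mem": {res: m}, "deps": [...]}}
    order: a topological order of task names.
    capacity: {res: available memory on that resource}."""
    finish = {}                              # task -> finish time
    ready_at = {r: 0.0 for r in capacity}    # when each resource is free
    placement = {}
    for name in order:
        t = tasks[name]
        best = None
        for res, cap in capacity.items():
            if t["mem"][res] > cap:
                continue                     # does not fit in this memory
            start = max([ready_at[res]] + [finish[d] for d in t["deps"]])
            end = start + t["time"][res]
            if best is None or end < best[1]:
                best = (res, end)
        res, end = best
        finish[name] = end
        ready_at[res] = end
        placement[name] = res
    return placement, finish
```

    For instance, a task whose footprint exceeds the accelerator's memory is forced onto the multicore even when the accelerator would be faster, which is exactly the trade-off the abstract describes.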

    Dynamic Memory-Aware Task-Tree Scheduling

    Factorizing sparse matrices using direct multifrontal methods generates directed tree-shaped task graphs, where edges represent data dependencies between tasks. This paper revisits the execution of tree-shaped task graphs using multiple processors that share a bounded memory. A task can only be executed if all its input and output data fit into the memory. The key difficulty is to manage the order of the task executions so that we can achieve high parallelism while staying below the memory bound. In particular, because the input data of unprocessed tasks must be kept in memory, a bad scheduling strategy might compromise the termination of the algorithm. In the single-processor case, solutions that are guaranteed to stay below a memory bound are known. The multi-processor case (when one tries to minimize the total completion time) has been shown to be NP-complete. We present in this paper a novel heuristic solution that has a low complexity and is guaranteed to complete the tree within a given memory bound. We compare our algorithm to state-of-the-art strategies, and observe that on both actual execution trees and synthetic trees, we always perform better than these solutions, with average speedups between 1.25 and 1.45 on actual assembly trees. Moreover, we show that the overhead of our algorithm is negligible even on deep trees (10^5), which would allow its use at runtime.
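    As a concrete illustration of the memory model (not the paper's multi-processor heuristic), the peak memory of a sequential postorder traversal of such a tree can be evaluated as below; the dictionary layout and size units are assumptions.

```python
# Peak memory of a postorder traversal of a task tree, under the model
# in the abstract: when a node executes, the outputs of all its
# children plus its own output must be resident simultaneously.

def peak_memory(children, out, v):
    """children: {node: [child, ...]}; out: {node: output size}.
    Returns the peak memory of the postorder traversal rooted at v,
    processing children in the given order."""
    held = 0.0          # outputs of already-processed children of v
    peak = 0.0
    for c in children.get(v, []):
        # while processing child c, earlier siblings' outputs are held
        peak = max(peak, held + peak_memory(children, out, c))
        held += out[c]
    # executing v itself: all children outputs plus v's own output
    return max(peak, held + out[v])
```

    The child ordering matters: processing the child with the largest residual peak first generally lowers the overall peak, which is the lever the scheduling strategies in these papers exploit.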

    Pingo: A Framework for the Management of Storage of Intermediate Outputs of Computational Workflows

    Scientific workflows allow scientists to easily model and express entire data processing pipelines, typically as a directed acyclic graph (DAG). These scientific workflows are made of a collection of tasks that usually take a long time to compute and that produce a considerable amount of intermediate datasets. Because of the nature of scientific exploration, a scientific workflow can be modified and re-run multiple times, and new scientific workflows may be created that make use of past intermediate datasets. Storing intermediate datasets has the potential to save time in computations. Since storage is limited, one main problem that needs a solution is determining which intermediate datasets should be saved at creation time in order to minimize the computational time of the workflows to be run in the future. This thesis proposes the design and implementation of Pingo, a system that is capable of managing the computations of scientific workflows as well as the storage, provenance, and deletion of intermediate datasets. Pingo uses the history of workflows submitted to the system to predict the datasets most likely to be needed in the future, and subjects the decision of dataset deletion to the optimization of the computational time of future workflows. (Masters thesis, Computer Science.)
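    One simple way to picture the storage decision is to rank datasets by expected recompute time saved per unit of storage and keep the best ones under a capacity budget. This is a hypothetical greedy sketch, not Pingo's actual predictor; the value estimate and all names are assumptions.

```python
# Hypothetical greedy retention policy for intermediate datasets:
# rank by (compute_time * reuse_probability) / size, i.e. the expected
# recomputation time saved per unit of storage, and keep what fits.

def select_datasets(datasets, capacity):
    """datasets: list of (name, size, compute_time, reuse_prob).
    Returns the names kept within the storage capacity."""
    ranked = sorted(datasets,
                    key=lambda d: d[2] * d[3] / d[1],  # expected saving per byte
                    reverse=True)
    kept, used = [], 0
    for name, size, ctime, prob in ranked:
        if used + size <= capacity:
            kept.append(name)
            used += size
    return kept
```

    A real system would estimate the reuse probability from the submission history, which is the role the abstract assigns to Pingo's workflow-history component.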

    Tree traversals with task-memory affinities

    We study the complexity of traversing tree-shaped workflows whose tasks require large I/O files. We target a heterogeneous architecture with two resource types, each with a different memory, such as a multicore node equipped with a dedicated accelerator (FPGA or GPU). The tasks in the workflow are colored according to their type and can be processed only if all their input and output files can be stored in the corresponding memory. The amount of memory of each type used at a given execution step strongly depends upon the order in which the tasks are executed, and upon when communications between the two memories are scheduled. The objective is to determine an efficient traversal that minimizes the maximum amount of memory of each type needed to traverse the whole tree. In this paper, we establish the complexity of this two-memory scheduling problem and provide inapproximability results. In addition, we design several heuristics, based on both post-order and general traversals, and we evaluate them on a comprehensive set of tree graphs, including random trees as well as assembly trees arising in the context of sparse matrix factorizations.

    Parallel scheduling of task trees with limited memory

    This paper investigates the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents some large data. A task can only be executed if all its input and output data fit into memory, and a piece of data can only be removed from memory after the completion of the task that uses it as an input. Such trees arise, for instance, in the multifrontal method of sparse matrix factorization. The peak memory needed for the processing of the entire tree depends on the execution order of the tasks. With one processor, the objective of the tree traversal is to minimize the required memory. This problem has been well studied, and optimal polynomial algorithms have been proposed. Here, we extend the problem by considering multiple processors, which is of obvious interest in the application area of matrix factorization. With multiple processors comes the additional objective of minimizing the time needed to traverse the tree, i.e., the makespan. Not surprisingly, this problem proves to be much harder than the sequential one. We study the computational complexity of this problem and provide inapproximability results, even for unit-weight trees. We design a series of practical heuristics achieving different trade-offs between the minimization of peak memory usage and makespan. Some of these heuristics are able to process a tree while keeping the memory usage under a given memory limit. The different heuristics are evaluated in an extensive experimental evaluation using realistic trees.

    Partitionnement d’arbres de tâches pour des plates-formes distribuées avec limitation de mémoire (Task-tree partitioning for distributed platforms with limited memory)

    Scientific applications are commonly modeled as the processing of directed acyclic graphs of tasks, and for some of them, the graph takes the special form of a rooted tree. This tree expresses both the computational dependencies between tasks and their storage requirements. The problem of scheduling/traversing such a tree on a single processor to minimize its memory footprint has already been widely studied. Hence, we move to parallel processing and study how to partition the tree for a homogeneous multiprocessor platform, where each processor is equipped with its own memory. We formally state the problem of partitioning the tree into subtrees such that each subtree can be processed on a single processor and the total resulting processing time is minimized. We prove that the problem is NP-complete, and we design polynomial-time heuristics to address it. An extensive set of simulations demonstrates the usefulness of these heuristics.
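    For intuition, a much simpler bottom-up partitioner than the paper's heuristics might cut a child's subtree off into its own part whenever merging it would exceed the per-processor memory M. The sum-of-footprints memory model and all names here are assumptions, not the paper's formulation.

```python
# Hypothetical bottom-up tree partitioner: walk the tree postorder and
# greedily cut off a child's part when absorbing it would push the
# current part's memory footprint above the per-processor limit M.

def partition(children, mem, v, M, cuts):
    """children: {node: [child, ...]}; mem: {node: footprint}.
    Appends to `cuts` the roots of subtrees split off into separate
    parts, and returns the footprint of the open part containing v."""
    total = mem[v]
    for c in children.get(v, []):
        sub = partition(children, mem, c, M, cuts)
        if total + sub > M:
            cuts.append(c)      # c's part goes to another processor
        else:
            total += sub        # absorb c's part into v's part
    return total
```

    Greedy cutting like this can of course produce unbalanced parts, which is why the paper needs NP-completeness arguments and more careful heuristics for minimizing the resulting processing time.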

    Ordonnancement parallèle de DAGs sous contraintes mémoire (Parallel scheduling of DAGs under memory constraints)

    Scientific workflows are frequently modeled as Directed Acyclic Graphs (DAGs) of tasks, which represent computational modules and their dependencies, in the form of data produced by one task and used by another. This formulation allows the use of runtime systems which dynamically allocate tasks onto the resources of increasingly complex and heterogeneous computing platforms. However, for some workflows, such a dynamic schedule may run out of memory by exposing too much parallelism. This paper focuses on the problem of transforming such a DAG to prevent memory shortage, and concentrates on shared-memory platforms. We first propose a simple model of DAG which is expressive enough to emulate complex memory behaviors. We then exhibit a polynomial-time algorithm that computes the maximum peak memory of a DAG, that is, the maximum memory needed by any parallel schedule. We consider the problem of reducing this maximum peak memory to make it smaller than a given bound by adding new fictitious edges, while trying to minimize the critical path of the graph. After proving this problem NP-complete, we provide an ILP solution as well as several heuristic strategies that are thoroughly compared by simulation on synthetic DAGs modeling actual computational workflows. We show that on most instances, we are able to decrease the maximum peak memory at the cost of a small increase in the critical path, thus with little impact on the quality of the final parallel schedule.
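    To make the memory model concrete, the footprint of one given sequential schedule of a DAG can be evaluated as below. This is not the paper's polynomial algorithm for the *maximum* parallel peak; it is only an illustrative evaluation under the assumed convention that an edge's data is allocated when its producer runs and freed once its consumer completes.

```python
# Peak resident data size of one sequential schedule of a DAG. Each
# edge (u, v) carries data allocated when u executes and freed when v
# completes; a task needs its inputs and outputs resident together.

def schedule_peak(edges, size, order):
    """edges: list of (producer, consumer) pairs; size: {(u, v): bytes};
    order: a topological order of the tasks."""
    live = peak = 0.0
    for t in order:
        # allocate t's outputs before releasing its inputs, since both
        # must coexist while t runs
        live += sum(size[e] for e in edges if e[0] == t)
        peak = max(peak, live)
        live -= sum(size[e] for e in edges if e[1] == t)
    return peak
```

    Under this model, adding a fictitious edge forces one branch to complete (and free its data) before another starts, which is precisely the mechanism the paper uses to cap the maximum peak memory.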