279 research outputs found

    Parallel scheduling of task trees with limited memory

    This paper investigates the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents some large data. A task can only be executed if all of its input and output data fit into memory, and a piece of data can only be removed from memory after the completion of the task that uses it as an input. Such trees arise, for instance, in the multifrontal method of sparse matrix factorization. The peak memory needed for the processing of the entire tree depends on the execution order of the tasks. With one processor, the objective of the tree traversal is to minimize the required memory. This problem has been well studied, and optimal polynomial algorithms have been proposed. Here, we extend the problem by considering multiple processors, which is of obvious interest in the application area of matrix factorization. With multiple processors comes the additional objective of minimizing the time needed to traverse the tree, i.e., the makespan. Not surprisingly, this problem proves to be much harder than the sequential one. We study its computational complexity and provide inapproximability results, even for unit-weight trees. We design a series of practical heuristics achieving different trade-offs between the minimization of peak memory usage and makespan. Some of these heuristics are able to process a tree while keeping the memory usage under a given limit. The heuristics are compared in an extensive experimental evaluation using realistic trees.
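    A minimal sketch of the memory model described above, assuming each node's output file (the edge to its parent) has a known size and a task needs all of its children's files plus its own output resident simultaneously; the function and variable names are ours, not the paper's:

```python
# Hedged sketch of the sequential memory model: executing `node` requires
# its children's output files (already resident) plus its own output file
# in memory at once; the children's files are freed when `node` completes.

def peak_memory(children, out, order):
    """Peak memory of a sequential postorder traversal `order`.
    children[v]: list of v's children; out[v]: size of v's output file."""
    resident = 0      # total size of files currently held in memory
    peak = 0
    for node in order:
        need = sum(out[c] for c in children[node])  # inputs, already resident
        resident += out[node]                       # allocate our own output
        peak = max(peak, resident)
        resident -= need                            # free the inputs
    return peak

# Tiny example: root 0 with two leaf children 1 and 2.
children = {0: [1, 2], 1: [], 2: []}
out = {0: 1, 1: 5, 2: 3}
print(peak_memory(children, out, [1, 2, 0]))  # 5 + 3 + 1 = 9
```

The traversal order matters: with unbalanced child subtrees, processing the memory-hungry child first generally lowers the peak, which is the lever the sequential algorithms cited above optimize.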

    Mapping tree-shaped workflows on systems with different memory sizes and processor speeds

    Directed acyclic graphs are commonly used to model scientific workflows, by expressing dependencies between tasks, as well as the resource requirements of the workflow. As a special case, rooted directed trees occur in several applications, for instance in sparse matrix computations. Since typical workflows are modeled by large trees, it is crucial to schedule them efficiently, so that their execution time (or makespan) is minimized. Furthermore, it is usually beneficial to distribute the execution on several compute nodes, hence increasing the available memory, and allowing us to parallelize parts of the execution. To exploit the heterogeneity of modern clusters in this context, we investigate the partitioning and mapping of tree-shaped workflows on two types of target architecture models: in AM1, each processor can have a different memory size, and in AM2, each processor can also have a different speed (in addition to a different memory size). We design a three-step heuristic for AM1, which adapts and extends previous work for homogeneous clusters [Gou C, Benoit A, Marchal L. Partitioning tree-shaped task graphs for distributed platforms with limited memory. IEEE Trans Parallel Dist Syst 2020; 31(7): 1533–1544]. The changes we propose concern the assignment to processors (accounting for the different memory sizes) and the availability of suitable processors when splitting or merging subtrees. For AM2, we extend the heuristic for AM1 with a two-phase local search approach. Phase A is a swap-based hill climber, while (the optional) Phase B is inspired by iterated local search. We evaluate our heuristics for AM1 and AM2 with extensive simulations, and we demonstrate that exploiting the heterogeneity in the cluster significantly reduces the makespan, compared to the state of the art for homogeneous processors. Peer reviewed.
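    One plausible building block of the AM1-style assignment step described above (our illustrative sketch, not the authors' code): a best-fit-decreasing placement of subtree memory demands onto processors with heterogeneous memory sizes, where each processor hosts one subtree; when no processor fits, a real heuristic would fall back to splitting or merging subtrees.

```python
# Hedged sketch: assign subtrees (largest memory demand first) to the
# tightest-fitting processor that still has enough memory. All names and
# the one-subtree-per-processor simplification are our assumptions.

def best_fit_assign(demands, memories):
    """demands: {subtree: memory needed}; memories: per-processor sizes.
    Returns {subtree: processor index}, or None if some subtree fits nowhere."""
    free = {p: m for p, m in enumerate(memories)}
    assignment = {}
    for tree, need in sorted(demands.items(), key=lambda kv: -kv[1]):
        fits = [(m, p) for p, m in free.items() if m >= need]
        if not fits:
            return None          # no processor fits: split the subtree instead
        _, proc = min(fits)      # tightest fit (smallest sufficient memory)
        assignment[tree] = proc
        del free[proc]           # this processor now hosts one subtree
    return assignment

demands = {"T1": 8, "T2": 3, "T3": 5}
memories = [4, 16, 6]            # heterogeneous memory sizes
print(best_fit_assign(demands, memories))  # {'T1': 1, 'T3': 2, 'T2': 0}
```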

    Scheduling malleable task trees

    Solving sparse linear systems can lead to processing tree workflows on a platform of processors. In this study, we use the model of malleable tasks motivated in [Prasanna96, Beaumont07] to study tree workflow schedules under two contradictory objectives: makespan minimization and memory minimization. First, we give a simpler proof of the result of [Prasanna96], which allows us to compute a makespan-optimal schedule for tree workflows. Then, we study a more realistic speed-up function and show that the previous schedules are not optimal in this context. Finally, we give complexity results concerning the objective of minimizing both makespan and memory.
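    The malleable-task model of [Prasanna96] is commonly stated as: a task of work w executed on p processors runs in time w / p**alpha, for some 0 < alpha <= 1. Under this assumption (our sketch below, with illustrative names), the makespan-optimal way to share processors among parallel subtrees gives each branch a share proportional to w**(1/alpha), so that all branches finish simultaneously:

```python
# Hedged sketch of the malleable-task speed-up model: work w on a share p
# of processors takes w / p**alpha. Equalizing finish times across parallel
# branches yields shares proportional to w**(1/alpha).

def optimal_shares(works, p, alpha):
    """Processor shares that make parallel branches finish together."""
    weights = [w ** (1.0 / alpha) for w in works]
    total = sum(weights)
    return [p * w / total for w in weights]

def makespan(works, p, alpha):
    shares = optimal_shares(works, p, alpha)
    # all branches finish at the same time, so any branch gives the makespan
    return works[0] / shares[0] ** alpha

works = [8.0, 1.0]
print(optimal_shares(works, p=9.0, alpha=1.0))  # [8.0, 1.0]: proportional to work
print(makespan(works, 9.0, 1.0))                # 8.0 / 8.0 = 1.0
```

With alpha = 1 (perfect linear speed-up) the shares degenerate to plain work-proportional allocation; the abstract's point is precisely that more realistic speed-up functions break the optimality of such schedules.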

    Independent and Divisible Task Scheduling on Heterogeneous Star-shaped Platforms with Limited Memory

    In this paper, we consider the problem of allocating and scheduling a collection of independent, equal-sized tasks on heterogeneous star-shaped platforms. We also address the same problem for divisible tasks. In both cases, we take memory constraints into account. We prove strong NP-completeness results for different objective functions, namely makespan minimization and throughput maximization, on simple star-shaped platforms. We propose an approximation algorithm based on the unconstrained version (with unlimited memory) of the problem. We introduce several heuristics, which are evaluated and compared through extensive simulations. An unexpected conclusion drawn from these experiments is that classical scheduling heuristics that greedily minimize the completion time of each task are outperformed by the simple heuristic of assigning each task to the available processor with the smallest communication time, regardless of computation power (hence a "bandwidth-centric" distribution).
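    The winning "bandwidth-centric" rule can be sketched with a minimal event-driven simulation of a star platform (our own simplification: a single shared master link, serialized sends, and no memory limits, which the paper does model):

```python
# Hedged sketch: the master sends equal-sized tasks over one shared link;
# among workers that are idle when the link frees up, the next task goes to
# the one with the smallest communication time, ignoring compute speed.

def bandwidth_centric_makespan(n_tasks, comm, comp):
    """comm[i] / comp[i]: per-task send / compute time of worker i."""
    free_at = [0.0] * len(comm)   # when each worker becomes idle
    t_link = 0.0                  # when the master's link frees up
    for _ in range(n_tasks):
        avail = [i for i in range(len(comm)) if free_at[i] <= t_link]
        if not avail:             # nobody idle yet: wait for the first one
            avail = [min(range(len(comm)), key=lambda i: free_at[i])]
        i = min(avail, key=lambda w: comm[w])   # bandwidth-centric choice
        start = max(t_link, free_at[i])
        t_link = start + comm[i]                # link busy during the send
        free_at[i] = t_link + comp[i]           # worker computes afterwards
    return max(free_at)

# Worker 0: fast link, slow CPU; worker 1: slower link, fast CPU.
print(bandwidth_centric_makespan(3, comm=[1.0, 2.0], comp=[10.0, 1.0]))
```

The first task still goes to worker 0 despite its slow CPU, which is exactly the counter-intuitive behavior the experiments reward: with a saturated master link, freeing the link quickly matters more than per-worker compute power.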

    Memory-aware list scheduling for hybrid platforms

    This report provides memory-aware heuristics to schedule task graphs onto heterogeneous resources, such as a dual-memory cluster equipped with multicores and a dedicated accelerator (FPGA or GPU). Each task has a different processing time on each resource. The optimization objective is to schedule the graph so as to minimize execution time, given the available memory of each resource type. In addition to ordering the tasks, we must also decide on which resource to execute them, given their computation requirements and the memory currently available on each resource. The major contributions of this report are twofold: (i) the derivation of an intricate integer linear program formulation of this scheduling problem; and (ii) the design of memory-aware heuristics, which outperform the reference heuristics HEFT and MinMin on a wide variety of problem instances. The absolute performance of these heuristics is assessed for small graphs, with up to 30 tasks, thanks to the linear program.
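    The core decision described above, choosing both an order and a resource under per-resource memory, can be sketched as a memory-aware variant of earliest-finish-time placement (our simplification: dependencies and memory release are ignored, and all names are illustrative, not the report's code):

```python
# Hedged sketch: place each ready task on the resource (e.g. "cpu" vs "gpu")
# that finishes it earliest among those with enough free memory.

def schedule(ready_order, time_on, mem_need, mem_free):
    """time_on[r][t]: processing time of task t on resource r;
    mem_need[t]: footprint of t; mem_free[r]: memory left on resource r.
    Returns (placement, finish time per resource); mutates mem_free."""
    avail = {r: 0.0 for r in time_on}        # when each resource frees up
    placement = {}
    for t in ready_order:
        candidates = [r for r in time_on if mem_free[r] >= mem_need[t]]
        if not candidates:
            raise RuntimeError("no resource has enough memory for " + str(t))
        r = min(candidates, key=lambda c: avail[c] + time_on[c][t])
        placement[t] = r
        avail[r] += time_on[r][t]
        mem_free[r] -= mem_need[t]           # memory held for simplicity
    return placement, avail

time_on = {"cpu": {"a": 3.0, "b": 2.0}, "gpu": {"a": 1.0, "b": 1.0}}
placement, _ = schedule(["a", "b"], time_on, {"a": 4, "b": 1},
                        {"cpu": 8, "gpu": 2})
print(placement)  # {'a': 'cpu', 'b': 'gpu'}: memory forces 'a' off the GPU
```

Note how task "a" lands on the slower CPU because the accelerator's memory cannot hold it, which is the kind of trade-off a pure earliest-finish-time heuristic like HEFT does not see.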

    Partitionnement d’arbres de tĂąches pour des plates-formes distribuĂ©es avec limitation de mĂ©moire

    Scientific applications are commonly modeled as the processing of directed acyclic graphs of tasks, and for some of them, the graph takes the special form of a rooted tree. This tree expresses both the computational dependencies between tasks and their storage requirements. The problem of scheduling/traversing such a tree on a single processor to minimize its memory footprint has already been widely studied. Hence, we move to parallel processing and study how to partition the tree for a homogeneous multiprocessor platform, where each processor is equipped with its own memory. We formally state the problem of partitioning the tree into subtrees such that each subtree can be processed on a single processor and the total resulting processing time is minimized. We prove that the problem is NP-complete, and we design polynomial-time heuristics to address it. An extensive set of simulations demonstrates the usefulness of these heuristics.
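    The partitioning problem stated above can be illustrated by a greedy bottom-up rule (a deliberate simplification of the paper's heuristics, with names of our choosing): accumulate each subtree's memory footprint and cut a child edge whenever merging it would exceed the per-processor limit.

```python
# Hedged sketch: cut tree edges so every resulting part's total weight
# fits under `limit`. Assumes each single node's weight already fits.

def partition(children, weight, root, limit):
    """children[v]: list of v's children (leaves may be absent from the dict);
    weight[v]: memory footprint of v. Returns the cut edges (child side)."""
    cuts = []

    def visit(v):
        total = weight[v]
        for c in children.get(v, []):
            sub = visit(c)
            if total + sub > limit:
                cuts.append(c)       # c's subtree becomes its own part
            else:
                total += sub         # keep c's subtree in v's part
        return total                 # weight of the part rooted at v

    visit(root)
    return cuts

# Complete binary tree of 7 unit-weight nodes, limit 3 per processor:
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
weight = {v: 1 for v in range(7)}
print(partition(children, weight, 0, 3))  # [1, 2]: three parts of sizes 3, 3, 1
```

Such a greedy cut ignores the resulting parts' processing times, which is exactly where the paper's makespan-oriented heuristics go further.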

    Placement de workflows de type arbre sur des architectures à mémoire hétérogÚne

    Directed acyclic graphs are commonly used to model scientific workflows, by expressing dependencies between tasks, as well as the resource requirements of the workflow. As a special case, rooted directed trees occur in several applications, for instance in sparse matrix computations. Since typical workflows are modeled by huge trees, it is crucial to schedule them efficiently, so that their execution time (or makespan) is minimized. Furthermore, it might be beneficial to distribute the execution on several compute nodes, hence increasing the available memory, and allowing us to parallelize parts of the execution. To exploit the heterogeneity of modern clusters in this context, we investigate the partitioning and mapping of tree-shaped workflows on target architectures where each processor can have a different memory size. Our three-step heuristic adapts and extends previous work for homogeneous clusters [Gou et al., TPDS 2020]. The changes we propose concern the assignment to processors (which considers the different memory sizes) and the availability of suitable processors when splitting or merging subtrees. We evaluate our approach with extensive simulations and demonstrate that exploiting the heterogeneity in the cluster reduces the makespan significantly compared to the state of the art for homogeneous memory.