89 research outputs found

    Point-to-point and congestion bandwidth estimation: experimental evaluation on PlanetLab

    Get PDF
    In large scale Internet platforms, measuring the available bandwidth between nodes of the platform is difficult and costly. However, having access to this information allows to design clever algorithms to optimize resource usage for some collective communications, like broadcasting a message or organizing master/slave computations. In this paper, we analyze the feasibility to provide estimations, based on a limited number of measurements, for the point-to-point available bandwidth values, and for the congestion which happens when several communications take place at the same time. We present a dataset obtained with both types of measurements performed on a set of nodes from the PlanetLab platform. We show that matrix factorization techniques are quite efficient at predicting point-to-point available bandwidth, but are not adapted for congestion analysis. However, a LastMile modeling of the platform allows to perform congestion predictions with a reasonable level of accuracy, even with a small amount of information, despite the variability of the measured platform

    Analysis of a List Scheduling Algorithm for Task Graphs on Two Types of Resources

    Get PDF
    International audienceWe consider the problem of scheduling task graphs on two types of unrelated resources, which arises in the context of task-based runtime systems on modern platforms containing CPUs and GPUs. In this paper, we focus on an algorithm named HeteroPrio, which was originally introduced as an efficient heuristic for a particular application. HeteroPrio is an adaptation of the well known list scheduling algorithm, in which the tasks are picked by the resources in the order of their acceleration factor. This algorithm is augmented with a spoliation mechanism: a task assigned by the list algorithm can later on be reassigned to a different resource if it allows to finish this task earlier. We propose here the first theoretical analysis of the HeteroPrio algorithm in the presence of dependencies. More specifically, if the platform contains m and n processors of each type, we show that the worst-case approximation ratio of HeteroPrio is between 1 + max(m/n, n/m) and 2 + max(m/n, n/m). Our proof structure allows to precisely identify the necessary conditions on the spoliation strategy to obtain such a guarantee. We also present an in-depth experimental analysis, comparing several such spoliation strategies, and comparing HeteroPrio with other algorithms from the literature. Although the worst case analysis shows the possibility of pathological behavior, HeteroPrio is able to produce, in very reasonable time, schedules of significantly better quality

    Algorithms for Preemptive Co-scheduling of Kernels on GPUs

    Get PDF
    International audienceModern GPUs allow concurrent kernel execution and preemption to improve hardware utilization and responsiveness. Currently, the decision on the simultaneous execution of kernels is performed by the hardware, which can lead to unreasonable use of resources. In this work, we tackle the problem of co-scheduling for GPUs in high competition scenarios. We propose a novel graphbased preemptive co-scheduling algorithm, with the focus on reducing the number of preemptions. We show that the optimal preemptive makespan can be computed by solving a Linear Program in polynomial time. Based on this solution we propose graph theoretical model and an algorithm to build preemptive schedules which minimizes the number of preemptions. We show, however, that finding the minimal amount of preemptions among all preemptive solutions of optimal makespan is a NP-hard problem. We performed experiments on real-world GPU applications and our approach can achieve optimal makespan by preempting 6 to 9% of the tasks

    Column generation integer programming for allocating jobs with periodic demand variations

    Get PDF
    International audienceIn the context of service hosting in large-scale datacenters, we consider the problem faced by a provider for allocating services to machines. An analysis of a public Google trace corresponding to the use of a production cluster over a long period shows that long-running services experience demand variations with a periodic (daily) pattern, and that services with such a pattern account for most of the overall CPU demand. This leads to an allocation problem where the classical Bin-Packing issue is augmented with the possibility to co-locate jobs whose peaks occur at different times of the day, which is bound to be more efficient than the usual approach that consist in over-provisioning for the maximum demand. In this paper, we propose a column-generation approach to solving this problem, where the subproblem uses a sophisticated SOCP (Second Order Cone Program) formulation. This allows to explicitely select jobs which benefit from being co-allocated together. Experimental results comparing with theoretical lower bounds and with standard packing heuristics shows that this approach is able to provide very efficient assignments in reasonable time

    Parallel scheduling of task trees with limited memory

    Get PDF
    This paper investigates the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents some large data. A task can only be executed if all input and output data fit into memory, and a data can only be removed from memory after the completion of the task that uses it as an input data. Such trees arise, for instance, in the multifrontal method of sparse matrix factorization. The peak memory needed for the processing of the entire tree depends on the execution order of the tasks. With one processor the objective of the tree traversal is to minimize the required memory. This problem was well studied and optimal polynomial algorithms were proposed. Here, we extend the problem by considering multiple processors, which is of obvious interest in the application area of matrix factorization. With multiple processors comes the additional objective to minimize the time needed to traverse the tree, i.e., to minimize the makespan. Not surprisingly, this problem proves to be much harder than the sequential one. We study the computational complexity of this problem and provide inapproximability results even for unit weight trees. We design a series of practical heuristics achieving different trade-offs between the minimization of peak memory usage and makespan. Some of these heuristics are able to process a tree while keeping the memory usage under a given memory limit. The different heuristics are evaluated in an extensive experimental evaluation using realistic trees.Dans ce rapport, nous nous intéressons au traitement d'arbres de tâches par plusieurs processeurs. Chaque arête d'un tel arbre représente un gros fichier d'entrée/sortie. Une tâche peut être traitée seulement si l'ensemble de ses fichiers d'entrée et de sortie peut résider en mémoire, et un fichier ne peut être retiré de la mémoire que lorsqu'il a été traité. De tels arbres surviennent, par exemple, lors de la factorisation de matrices creuses par des méthodes multifrontales. La quantité de mémoire nécessaire dépend de l'ordre de traitement des tâches. Avec un seul processeur, l'objectif est naturellement de minimiser la quantité de mémoire requise. Ce problème a déjà été étudié et des algorithmes polynomiaux ont été proposés. Nous étendons ce problème en considérant plusieurs processeurs, ce qui est d'un intérêt évident pour le problème de la factorisation de grandes matrices. Avec plusieurs processeurs se pose également le problème de la minimisation du temps nécessaire pour traiter l'arbre. Nous montrons que comme attendu, ce problème est bien plus compliqué que dans le cas séquentiel. Nous étudions la complexité de ce problème et nous fournissons des résultats d'inaproximabilité, même dans le cas de poids unitaires. Nous proposons plusieurs heuristiques qui obtiennent différents compromis entre mémoire et temps d'exécution. Certaines d'entre elles sont capables de traiter l'arbre tout en gardant la consommation mémoire inférieure à une limite donnée. Nous analysons les performances de toutes ces heuristiques par une large campagne de simulations utilisant des arbres réalistes

    MadPipe: Memory Aware Dynamic Programming Algorithm for Pipelined Model Parallelism

    Get PDF
    The training phase in Deep Neural Networks (DNNs) is very computationally intensive and is nowadays often performed on parallel computing platforms, ranging from a few GPUs to several thousand GPUs. The strategy of choice for the parallelization of training is the so-called data parallel approach, based of the parallel training of the different inputs (typically images) and a the aggregation of network weights with collective communications (AllReduce). The scalability of this approach is limited both by the memory available on each node and the networking capacities for collective operations. Recently, a parallel model approach, in which the network weights are distributed and images are trained in a pipeline/stream manner over the computational nodes has been proposed (Pipedream, Gpipe). In this paper, we formalize in detail the optimization problem associated with the placement of DNN layers onto computation resources when using pipelined model parallelism, and we derive a dynamic programming based heuristic, MadPipe, that allows to significantly improve the performance of the parallel model approach compared to the literature

    Reliable Service Allocation in Clouds

    Get PDF
    International audienceWe consider several reliability problems that arise when allocating applications to processing resources in a Cloud computing platform. More specifically, we assume on the one hand that each computing resource is associated to a capacity constraint and to a probability of failure. On the other hand, we assume that each service runs as a set of independent instances of identical Virtual Machines, and that the Service Level Agreement between the Cloud provider and the client states that a minimal number of instances of the service should run with a given probability. In this context, given the capacity and failure probabilities of the machines, and the capacity and reliability demands of the services, the question for the cloud provider is to find an allocation of the instances of the services (possibly using replication) onto machines satisfying all types of constraints during a given time period. In this paper, our goal is to assess the impact of the reliability constraint on the complexity of resource allocation problems. We consider several variants of this problem, depending on the number of services and whether their reliability demand is individual or global. We prove several fundamental complexity results (#\#P' and NP-completeness results) and we provide several optimal and approximation algorithms. In particular, we prove that a basic randomized allocation algorithm, that is easy to implement, provides optimal or quasi-optimal results in several contexts, and we show through simulations that it also achieves very good results in more general settings
    corecore