10 research outputs found

    A self-mobile skeleton in the presence of external loads

    Get PDF
    Multicore clusters provide cost-effective platforms for running CPU-intensive and data-intensive parallel applications. To effectively utilise these platforms, sharing their resources is needed amongst the applications rather than dedicated environments. When such computational platforms are shared, user applications must compete at runtime for the same resource so the demand is irregular and hence the load is changeable and unpredictable. This thesis explores a mechanism to exploit shared multicore clusters taking into account the external load. This mechanism seeks to reduce runtime by finding the best computing locations to serve the running computations. We propose a generic algorithmic data-parallel skeleton which is aware of its computations and the load state of the computing environment. This skeleton is structured using the Master/Worker pattern where the master and workers are distributed on the nodes of the cluster. This skeleton divides the problem into computations where all these computations are initiated by the master and coordinated by the distributed workers. Moreover, the skeleton has built-in mobility to implicitly move the parallel computations between two workers. This mobility is data mobility controlled by the application, the skeleton. This skeleton is not problem-specific and therefore it is able to execute different kinds of problems. Our experiments suggest that this skeleton is able to efficiently compensate for unpredictable load variations. We also propose a performance cost model that estimates the continuation time of the running computations locally and remotely. This model also takes the network delay, data size and the load state as inputs to estimate the transfer time of the potential movement. Our experiments demonstrate that this model takes accurate decisions based on estimates in different load patterns to reduce the total execution time. This model is problem-independent because it considers the progress of all current computations. Moreover, this model is based on measurements so it is not dependent on the programming language. Furthermore, this model takes into account the load state of the nodes on which the computation run. This state includes the characteristics of the nodes and hence this model is architecture-independent. Because the scheduling has direct impact on system performance, we support the skeleton with a cost-informed scheduler that uses a hybrid scheduling policy to improve the dynamicity and adaptivity of the skeleton. This scheduler has agents distributed over the participating workers to keep the load information up to date, trigger the estimations, and facilitate the mobility operations. On runtime, the skeleton co-schedules its computations over computational resources without interfering with the native operating system scheduler. We demonstrate that using a hybrid approach the system makes mobility decisions which lead to improved performance and scalability over large number of computational resources. Our experiments suggest that the adaptivity of our skeleton in shared environment improves the performance and reduces resource contention on nodes that are heavily loaded. Therefore, this adaptivity allows other applications to acquire more resources. Finally, our experiments show that the load scheduler has a low incurred overhead, not exceeding 0.6%, compared to the total execution time

    Second-generation polyalgorithms for parallel dense-matrix multiplication

    Get PDF
    The polyalgorithm library, originally designed in 1991-1993 by Robert Falgout, Jin Li, and Anthony Skjellum, includes fourteen dense matrix multiplication algorithms mapped onto two-dimensional process grids using the Message Passing Interface (MPI). This thesis\u27 goal is to achieve optimized performance of parallel, dense linear algebra algorithms by varying the algorithm as a function of problem size, shape, data layout, concurrency, and architecture. We integrate these algorithms with an intra-node BLAS DGEMM kernel designed by Thomas Hines (Tennessee Tech), which improves the BLAS DGEMM performance in fat-by-thin dense matrix multiplication region. We add a rank-k-based SUMMA algorithm, which performs better than rank-1-based SUMMA. We studied performance on two cluster systems and results show the performance and improvements achieved. We compare and contrast our results with COSMA, a recent, highly optimized approach, and verify that COSMA, using optimal 3D grid decompositions, has significant advantages provided its preferred data layouts can be used

    An efficient, practical, portable mapping technique on computational grids

    Get PDF
    Grid computing provides a powerful, virtual parallel system known as a computational Grid on which users can run parallel applications to solve problems quickly. However, users must be careful to allocate tasks to nodes properly because improper allocation of only one task could result in lengthy executions of applications, or even worse, applications could crash. This allocation problem is called the mapping problem, and an entity that tackles this problem is called a mapper. In this thesis, we aim to develop an efficient, practical, portable mapper. To study the mapping problem, researchers often make unrealistic assumptions such as that nodes of Grids are always reliable, that execution times of tasks assigned to nodes are known a priori, or that detailed information of parallel applications is always known. As a result, the practicality and portability of mappers developed in such conditions are uncertain. Our review of related work suggested that a more efficient tool is required to study this problem; therefore, we developed GMap, a simulator researchers/developers can use to develop practical, portable mappers. The fact that nodes are not always reliable leads to the development of an algorithm for predicting the reliability of nodes and a predictor for identifying reliable nodes of Grids. Experimental results showed that the predictor reduced the chance of failures in executions of applications by half. The facts that execution times of tasks assigned to nodes are not known a priori and that detailed information of parallel applications is not alw ays known, lead to the evaluation of five nearest-neighbour (nn) execution time estimators: k-nn smoothing, k-nn, adaptive k-nn, one-nn, and adaptive one-nn. Experimental results showed that adaptive k-nn was the most efficient one. We also implemented the predictor and the estimator in GMap. Using GMap, we could reliably compare the efficiency of six mapping algorithms: Min-min, Max-min, Genetic Algorithms, Simulated Annealing, Tabu Search, and Quick-quality Map, with none of the preceding unrealistic assumptions. Experimental results showed that Quick-quality Map was the most efficient one. As a result of these findings, we achieved our goal in developing an efficient, practical, portable mapper

    Applications Development for the Computational Grid

    Get PDF

    Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015) Krakow, Poland

    Get PDF
    Proceedings of: Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015). Krakow (Poland), September 10-11, 2015

    Scheduling for Large Scale Distributed Computing Systems: Approaches and Performance Evaluation Issues

    Get PDF
    Although our everyday life and society now depends heavily oncommunication infrastructures and computation infrastructures,scientists and engineers have always been among the main consumers ofcomputing power. This document provides a coherent overview of theresearch I have conducted in the last 15 years and which targets themanagement and performance evaluation of large scale distributedcomputing infrastructures such as clusters, grids, desktop grids,volunteer computing platforms, ... when used for scientific computing.In the first part of this document, I present how I have addressedscheduling problems arising on distributed platforms (like computinggrids) with a particular emphasis on heterogeneity and multi-userissues, hence in connection with game theory. Most of these problemsare relaxed from a classical combinatorial optimization formulationinto a continuous form, which allows to easily account for keyplatform characteristics such as heterogeneity or complex topologywhile providing efficient practical and distributed solutions.The second part presents my main contributions to the SimGrid project,which is a simulation toolkit for building simulators of distributedapplications (originally designed for scheduling algorithm evaluationpurposes). It comprises a unified presentation of how the questions ofvalidation and scalability have been addressed in SimGrid as well asthoughts on specific challenges related to methodological aspects andto the application of SimGrid to the HPC context

    Communications collectives et ordonnancement en régime permanent sur plates-formes hétérogènes

    Get PDF
    The results presented in this document concern scheduling forlarge-scale heterogeneous platforms. We mainly focus on collectivecommunications taking place during the execution of distributedapplications: broadcast, scatter or reduction of data for instance. Westudy the steady-state operation of these communications and we aim atmaximizing the throughput of a series of similar communications. Thegoal is also to obtain an asymptotically optimal schedule as formakespan minimization. We present a general framework to study thesecommunications, which we use to assess the complexity of the problem foreach communication primitive. For a particular model of communication,the bidirectional one-port model, we develop a practical for solvingthe problem, based on a linear program in rational numbers. This resultsare illustrated by experiments on the Grid5000 platform. The study ofsteady-state operations is extended to scheduling multipleapplications on computing grids.Les travaux présentés dans cette thèse concernent l'ordonnancementpour les plates-formes hétérogènes à grande échelle. Nous nousintéressons principalement aux opérations de communicationscollectives comme la diffusion de données, la distribution de donnéesou la réduction. Nous étudions ces problèmes dans le cadre de leurrégime permanent, en optimisant le débit d'une série d'opérations decommunications, en vue d'obtenir un ordonnancement asymptotiquementoptimal du point de vue du temps d'exécution total. Après avoirprésenté un cadre général d'étude qui nous permet de connaître lacomplexité du problème pour chaque primitive, nous développons, pourle modèle de communication un-port bidirectionnel, une méthode derésolution pratique fondée sur la résolution d'un programme linéaireen rationnels. Cette étude du régime permanent est illustrée par desexpérimentations sur Grid5000 et se prolonge vers l'ordonnancementd'applications multiples sur des grilles de calcul
    corecore