27 research outputs found

    rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Parallel Independent Tasks

    Scientific applications often contain large and computationally intensive parallel loops. Dynamic loop self-scheduling (DLS) is used to achieve load-balanced execution of such applications on high performance computing (HPC) systems. Large HPC systems are vulnerable to processor or node failures and to perturbations in the availability of resources. Most self-scheduling approaches either do not consider fault-tolerant scheduling, or depend on failure or perturbation detection and react by rescheduling failed tasks. In this work, a robust dynamic load balancing (rDLB) approach is proposed for the robust self-scheduling of independent tasks. The proposed approach is proactive and does not depend on failure or perturbation detection. The theoretical analysis of the proposed approach shows that it is linearly scalable and that its cost decreases quadratically as the system size increases. rDLB is integrated into an MPI DLS library to evaluate its performance experimentally with two computationally intensive scientific applications. Results show that rDLB enables the tolerance of up to P - 1 processor failures, where P is the number of processors executing an application. In the presence of perturbations, rDLB boosted the robustness of DLS techniques by up to 30 times and decreased application execution time by up to 7 times compared to their counterparts without rDLB.
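To make the idea of dynamic loop self-scheduling concrete, the following is a minimal sketch of how a chunk-size rule can hand out loop iterations to workers on request. It uses guided self-scheduling (each request receives the remaining iterations divided by the number of workers, rounded up), which is one well-known DLS technique chosen here for illustration; the abstract does not say which techniques the MPI DLS library implements, and the function name `gss_chunks` is hypothetical. It does not model rDLB itself.

```python
import math


def gss_chunks(n_iterations, n_workers):
    """Return the sequence of chunk sizes handed out by guided self-scheduling.

    Illustrative assumption: each work request receives
    ceil(remaining / n_workers) iterations, so chunks shrink as the loop
    drains, which evens out finishing times across workers.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    chunks = []
    remaining = n_iterations
    while remaining > 0:
        size = math.ceil(remaining / n_workers)
        chunks.append(size)
        remaining -= size
    return chunks


if __name__ == "__main__":
    # Chunk sizes for a 100-iteration loop shared by 4 workers.
    print(gss_chunks(100, 4))
```

Large early chunks keep scheduling overhead low, while small late chunks let faster workers absorb leftover work from slower ones.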

    A Measure of Robustness Against Multiple Kinds of Perturbations

    Parallel and distributed heterogeneous computing systems may operate in an environment that undergoes unpredictable changes, causing certain system performance features to degrade. Such systems need robustness to guarantee limited degradation despite fluctuations in the behavior of their component parts or environment. Our previous work in this area presented a method for generating a measure of robustness for a given system. However, the focus of that approach was on a scenario where all perturbations were of the same kind, e.g., all perturbations were in message sizes or in computation times, but not in both. This paper gives an extended discussion of the case where perturbations could be of different kinds, and presents some new insights.

    10- #1123 ROBUST DESIGN OF THE MILK COLLECTION AND REFRIGERATION LOGISTICS SYSTEM THROUGH ANALYSIS OF THE TRADE-OFFS BETWEEN CO2 EMISSIONS AND NET PRESENT VALUE

    The logistics system design problem is a strategic-level problem that involves selecting one or more depots from a set of candidate locations. In recent years, many logistics and operations research problems have been extended to include greenhouse-effect concerns and financial aspects related to the environmental impact of transport activities. This work presents a robust design of the milk collection and refrigeration logistics system of a cooperative (Tordecilla-Madera, Polo, Muñoz, González-Rodríguez, 2017). The design consists of locating refrigeration tanks, each of which collects the milk of several producers. The proposed model is formulated as a bi-objective problem that minimizes the greenhouse gas emissions produced by transporting milk cans by motorcycle and maximizes the net present value (NPV) of the system configuration. Characterizing the robustness-NPV and robustness-CO2 relationships made it possible to determine which configuration is the most robust and how this robustness arises. The proposed mathematical model is solved with the classical epsilon-constraint technique, and robustness is determined by means of the FePia methodology (Ali, Maciejewski, Siegel, 2004). It was thus determined that the cooperative should set up its milk collection and refrigeration logistics system according to the chosen configuration, for which a tactical plan was designed that optimizes the use of the installed refrigeration tanks.

    Robust resource allocation in weather data processing systems

    Includes bibliographical references (pages [9-10]). Reliability of weather data processing systems is of prime importance to ensure the efficient operation of space-based weather monitoring systems. This work defines a heterogeneous weather data processing system that is susceptible to uncertainties in data set arrival times. The resource allocation must be robust with respect to these uncertainties. The tasks to be executed by the data processing system are classified into three broad categories: telemetry, tracking and control (high priority); data processing (medium priority); and data research (low priority). The high priority tasks must be completed before considering medium and low priority tasks. The goal of this research is to find a resource allocation that minimizes makespan of the high priority tasks, and to find a mapping that maximizes a function of the completion time and priority of the medium and low priority tasks. Different heuristic techniques to find near optimal solutions are studied, and their performance is evaluated.

    Robust processor allocation for independent tasks when dollar cost for processors is a constraint

    Includes bibliographical references (pages 9-10). In a distributed heterogeneous computing system, the resources have different capabilities and tasks have different requirements. Different classes of machines used in such systems typically vary in dollar cost based on their computing efficiencies. Makespan (defined as the completion time for an entire set of tasks) is often the performance feature that is optimized. Resource allocation is often done based on estimates of the computation time of each task on each class of machines. Hence, it is important that makespan be robust against errors in computation time estimates. The dollar cost to purchase the machines for use can be a constraint such that only a subset of the machines available can be purchased. The goal of this study is to: (1) select a subset of all the machines available so that the cost constraint for the machines is satisfied, and (2) find a static mapping of tasks so that the robustness of the desired system feature, makespan, is maximized against the errors in task execution time estimates. Six heuristic techniques for this problem are presented and evaluated.
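As a rough illustration of what robustness of makespan against estimation errors can mean, the sketch below computes the makespan of a static mapping and a Euclidean robustness radius. The formula is an assumption made for this example, not something stated in the abstract: for each machine, the radius is the slack between a tolerated makespan bound (`tau` times the estimated makespan) and the machine's estimated finishing time, divided by the square root of the number of tasks mapped to it, and the overall value is the minimum over machines. The names `etc`, `mapping`, and `tau` are hypothetical.

```python
import math


def machine_finish_times(mapping, etc):
    """Sum the estimated execution times of the tasks mapped to each machine."""
    finish = {}
    for task, machine in mapping.items():
        finish[machine] = finish.get(machine, 0.0) + etc[task][machine]
    return finish


def makespan(mapping, etc):
    """Estimated completion time of the whole task set under a static mapping."""
    return max(machine_finish_times(mapping, etc).values())


def robustness_radius(mapping, etc, tau):
    """Smallest Euclidean-norm error in the time estimates that could push
    some machine past tau * makespan (assumed formulation, see lead-in)."""
    finish = machine_finish_times(mapping, etc)
    counts = {}
    for machine in mapping.values():
        counts[machine] = counts.get(machine, 0) + 1
    bound = tau * max(finish.values())
    return min((bound - finish[m]) / math.sqrt(counts[m]) for m in finish)


if __name__ == "__main__":
    # Hypothetical estimated times: etc[task][machine].
    etc = {
        "t0": {"m0": 4.0, "m1": 6.0},
        "t1": {"m0": 3.0, "m1": 2.0},
        "t2": {"m0": 5.0, "m1": 1.0},
    }
    mapping = {"t0": "m0", "t1": "m1", "t2": "m1"}
    print(makespan(mapping, etc))
    print(robustness_radius(mapping, etc, tau=1.5))
```

Under this formulation, a mapping with a larger radius tolerates larger collective errors in the estimates before the makespan bound is violated, so a heuristic can compare candidate mappings by this value.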

    The robustness of resource allocation in parallel and distributed computing systems

    Includes bibliographical references (page [9]). This paper gives an overview of the material to be discussed in the invited keynote presentation by H. J. Siegel; it summarizes our research in [1]. Performing computing and communication tasks on parallel and distributed systems involves the coordinated use of different types of machines, networks, interfaces, and other resources. Decisions about how best to allocate resources are often based on estimated values of task and system parameters, due to uncertainties in the system environment. An important research problem is the development of resource management strategies that can guarantee a particular system performance given such uncertainties. We have designed a methodology for deriving the degree of robustness of a resource allocation - the maximum amount of collective uncertainty in system parameters within which a user-specified level of system performance (QoS) can be guaranteed. Our four-step procedure for deriving a robustness metric for an arbitrary system will be presented. We will illustrate this procedure and its usefulness by deriving robustness metrics for some example distributed systems.
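One way to formalize "the maximum amount of collective uncertainty within which a level of performance can be guaranteed" is as a distance to the nearest violation of a performance bound. The notation below is assumed for illustration and does not appear in the abstract: $\phi_i \in \Phi$ are performance features with tolerated bounds $[\beta_i^{\min}, \beta_i^{\max}]$, $\pi$ is the vector of perturbation parameters with estimated value $\pi^{\mathrm{orig}}$, and $f_i$ maps $\pi$ to the value of $\phi_i$.

```latex
% Robustness radius of a single feature \phi_i: the smallest change in \pi
% that drives \phi_i onto one of its tolerated bounds.
r(\phi_i, \pi) \;=\;
  \min_{\pi \,:\, f_i(\pi) = \beta_i^{\min} \ \text{or}\ f_i(\pi) = \beta_i^{\max}}
  \left\lVert \pi - \pi^{\mathrm{orig}} \right\rVert_2

% Robustness of the allocation: the most fragile feature determines it.
\rho(\Phi, \pi) \;=\; \min_{\phi_i \in \Phi} \, r(\phi_i, \pi)
```

Read this way, any perturbation whose Euclidean norm stays below $\rho(\Phi, \pi)$ leaves every feature within its bounds.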

    The Robustness of Resource Allocation in Parallel and Distributed Computing Systems

    This paper gives an overview of the material to be discussed in the invited keynote presentation by H. J. Siegel. Performing computing and communication tasks on parallel and distributed systems involves the coordinated use of different types of machines, networks, interfaces, and other resources. Decisions about how best to allocate resources are often based on estimated values of task and system parameters, due to uncertainties in the system environment. An important research problem is the development of resource management strategies that can guarantee a particular system performance given such uncertainties. We have designed a methodology for deriving the degree of robustness of a resource allocation - the maximum amount of collective uncertainty in system parameters within which a user-specified level of system performance (QoS) can be guaranteed. Our four-step procedure for deriving a robustness metric for an arbitrary system will be presented. We will illustrate this procedure and its usefulness by deriving robustness metrics for some example distributed systems.