24 research outputs found

    Collective behavior and task persistification in lazy and minimalist collectives

    Get PDF
    When individuals in a collective system are constrained in terms of sensing, memory, computation, or power reserves; the design of algorithms to control them becomes challenging. These individual limitations can be due to multiple reasons like the shrinking size of each agent for bulk manufacturing efficiency or enforced simplicity to attain cost efficiency. Whereas, in some areas like nano-medicine, the nature of the task itself warrants such simplicity. This thesis presents algorithms inspired by biological and statistical physics models to achieve useful collective behavior through simple local physical interactions and, minimalist approaches to persistify tasks for long durations in collectives with limited capabilities and energy reserves. The first part of the thesis presents a system of vibration-driven robots that embodies the features of simplicity described above. A combination of theory, experiment, and simulation is used to study dynamic aggregation behavior in these robots facilitated via short-range physical attraction potentials between agents. Collectives in a dynamically aggregated state are shown to be capable of transporting objects over relatively long distances in a finite arena. In the rest of the thesis, two different, yet complementary systems are studied and elaborated to highlight the usefulness of distributed inactivity and activity modulation in aiding persistification of tasks in collectives incapable of implementing complicated algorithms to incorporate regular energy replenishing cycles. To summarize, an approach to achieving dynamic aggregation and related tasks like object transport in a constrained brushbot system is described. Two different artificial and biological collective systems are explored to reveal strategies through which tasks can be persistified without requiring complicated computations, sensing, and memory.Ph.D

    Estudo sobre processamento maciçamente paralelo na internet

    Get PDF
    Orientador: Marco Aurélio Amaral HenriquesTese (doutorado) - Universidade Estadual de Campinas, Faculdade de Engenharia Eletrica e de ComputaçãoResumo: Este trabalho estuda a possibilidade de aproveitar o poder de processamento agregado dos computadores conectados pela Internet para resolver problemas de grande porte. O trabalho apresenta um estudo do problema tanto do ponto de vista teórico quanto prático. Desde o ponto de vista teórico estudam-se as características das aplicações paralelas que podem tirar proveito de um ambiente computacional com um grande número de computadores heterogêneos fracamente acoplados. Desde o ponto de vista prático estudam-se os problemas fundamentais a serem resolvidos para se construir um computador paralelo virtual com estas características e propõem-se soluções para alguns dos mais importantes como balanceamento de carga e tolerância a falhas. Os resultados obtidos indicam que é possível construir um computador paralelo virtual robusto, escalável e tolerante a falhas e obter bons resultados na execução de aplicações com alta razão computação/comunicaçãoAbstract: This thesis explores the possibility of using the aggregated processing power of computers connected by the Internet to solve large problems. The issue is studied both from the theoretical and practical point of views. From the theoretical perspective this work studies the characteristics that parallel applications should have to be able to exploit an environment with a large, weakly connected set of computers. From the practical perspective the thesis indicates the fundamental problems to be solved in order to construct a large parallel virtual computer, and proposes solutions to some of the most important of them, such as load balancing and fault tolerance. The results obtained so far indicate that it is possible to construct a robust, scalable and fault tolerant parallel virtual computer and use it to execute applications with high computing/communication ratioDoutoradoEngenharia de ComputaçãoDoutor em Engenharia Elétric

    Adaptive and Power-aware Fault Tolerance for Future Extreme-scale Computing

    Get PDF
    Two major trends in large-scale computing are the rapid growth in HPC with in particular an international exascale initiative, and the dramatic expansion of Cloud infrastructures accompanied by the Big Data passion. To satisfy the continuous demands for increasing computing capacity, future extreme-scale systems will embrace a multi-fold increase in the number of computing, storage, and communication components, in order to support an unprecedented level of parallelism. Despite the capacity and economies benefits, making the upward transformation to extreme-scale poses numerous scientific and technological challenges, two of which are the power consumption and fault tolerance. With the increase in system scale, failure would become a norm rather than an exception, driving the system to significantly lower efficiency with unforeseen power consumption. This thesis aims at simultaneously addressing the above two challenges by introducing a novel fault-tolerant computational model, referred to as \textit{Leaping Shadows}. Based on Shadow Replication, Leaping Shadows associates with each main process a suite of coordinated shadow processes, which execute in parallel but at differential rates, to deal with failures and meet the QoS requirements of the underlying application under strict power/energy constraints. In failure-prone extreme-scale computing environments, this new model addresses the limitations of the basic Shadow Replication model, and achieves adaptive and power-aware fault tolerance that is more time and energy efficient than existing techniques. In this thesis, we first present an analytical model based optimization framework that demonstrates Shadow Replication's adaptivity and flexibility in achieving multi-dimensional QoS requirements. Then, we introduce Leaping Shadows as a novel power-aware fault tolerance model, which tolerates multiple types of failures, guarantees forward progress, and maintains a consistent level of resilience. Lastly, the details of a Leaping Shadows implementation in MPI is discussed, along with extensive performance evaluation that includes comparison to checkpoint/restart. Collectively, these efforts advocate an adaptive and power-aware fault tolerance alternative for future extreme-scale computing

    Extensions of Task-based Runtime for High Performance Dense Linear Algebra Applications

    Get PDF
    On the road to exascale computing, the gap between hardware peak performance and application performance is increasing as system scale, chip density and inherent complexity of modern supercomputers are expanding. Even if we put aside the difficulty to express algorithmic parallelism and to efficiently execute applications at large scale, other open questions remain. The ever-growing scale of modern supercomputers induces a fast decline of the Mean Time To Failure. A generic, low-overhead, resilient extension becomes a desired aptitude for any programming paradigm. This dissertation addresses these two critical issues, designing an efficient unified linear algebra development environment using a task-based runtime, and extending a task-based runtime with fault tolerant capabilities to build a generic framework providing both soft and hard error resilience to task-based programming paradigm. To bridge the gap between hardware peak performance and application perfor- mance, a unified programming model is designed to take advantage of a lightweight task-based runtime to manage the resource-specific workload, and to control the data ow and parallel execution of tasks. Under this unified development, linear algebra tasks are abstracted across different underlying heterogeneous resources, including multicore CPUs, GPUs and Intel Xeon Phi coprocessors. Performance portability is guaranteed and this programming model is adapted to a wide range of accelerators, supporting both shared and distributed-memory environments. To solve the resilient challenges on large scale systems, fault tolerant mechanisms are designed for a task-based runtime to protect applications against both soft and hard errors. For soft errors, three additions to a task-based runtime are explored. The first recovers the application by re-executing minimum number of tasks, the second logs intermediary data between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re- execution. For hard errors, we propose two generic approaches, which augment the data logging mechanism for soft errors. The first utilizes non-volatile storage device to save logged data, while the second saves local logged data on a remote node to protect against node failure. Experimental results have confirmed that our soft and hard error fault tolerant mechanisms exhibit the expected correctness and efficiency

    Mitigation of failures in high performance computing via runtime techniques

    Get PDF
    As machines increase in scale, it is predicted that failure rates of supercomputers will correspondingly increase. Even though the mean time to failure (MTTF) of individual component is high, the large number of components significantly decreases the system MTTF. Meanwhile, the decreasing size of transistors has been critical to the increase in capacity of supercomputers. The smaller the transistors are, silent data corruptions (SDC) are likely to occur more frequently. SDCs do not inhibit execution, but may silently lead to incorrect results. In this thesis, we leverage runtime system and compiler techniques to mitigate a significant fraction of failures automatically with low overhead. The main goals of various system-level fault tolerance strategies designed in this thesis are: reducing the extra cost added to application execution while improving system reliability; automatically adjusting fault tolerance decisions without user intervention based on environmental changes; protecting applications not only from fail-stop failures but also from silent data corruptions. The main contributions of this thesis are development of a semi-blocking checkpoint protocol that overlaps application execution with fault tolerance operation to reduce the overhead of checkpointing, a runtime system technique for automatic checkpoint and restart without user intervention, a holistic framework (ACR) for automatically detecting and recovering from silent data corruptions and a framework called FlipBack that provides targeted protection against silent data corruption with low cost

    Workload Modeling for Computer Systems Performance Evaluation

    Full text link

    Using migratable objects to enhance fault tolerance schemes in supercomputers

    Get PDF
    Supercomputers have seen an exponential increase in their size in the last two decades. Such a high growth rate is expected to take us to exascale in the timeframe 2018-2022. But, to bring a productive exascale environment about, it is necessary to focus on several key challenges. One of those challenges is fault tolerance. Machines at extreme scale will experience frequent failures and will require the system to avoid or overcome those failures. Various techniques have recently been developed to tolerate failures. The impact of these techniques and their scalability can be substantially enhanced by a parallel programming model called migratable objects. In this paper, we demonstrate how the migratable-objects model facilitates and improves several fault tolerance approaches. Our experimental results on thousands of cores suggest fault tolerance schemes based on migratable objects have low performance overhead and high scalability. Additionally, we present a performance model that predicts a significant benefit of using migratable objects to provide fault tolerance at extreme scale

    DEPENDABILITY IN CLOUD COMPUTING

    Get PDF
    The technological advances and success of Service-Oriented Architectures and the Cloud computing paradigm have produced a revolution in the Information and Communications Technology (ICT). Today, a wide range of services are provisioned to the users in a flexible and cost-effective manner, thanks to the encapsulation of several technologies with modern business models. These services not only offer high-level software functionalities such as social networks or e-commerce but also middleware tools that simplify application development and low-level data storage, processing, and networking resources. Hence, with the advent of the Cloud computing paradigm, today's ICT allows users to completely outsource their IT infrastructure and benefit significantly from the economies of scale. At the same time, with the widespread use of ICT, the amount of data being generated, stored and processed by private companies, public organizations and individuals is rapidly increasing. The in-house management of data and applications is proving to be highly cost intensive and Cloud computing is becoming the destination of choice for increasing number of users. As a consequence, Cloud computing services are being used to realize a wide range of applications, each having unique dependability and Quality-of-Service (Qos) requirements. For example, a small enterprise may use a Cloud storage service as a simple backup solution, requiring high data availability, while a large government organization may execute a real-time mission-critical application using the Cloud compute service, requiring high levels of dependability (e.g., reliability, availability, security) and performance. Service providers are presently able to offer sufficient resource heterogeneity, but are failing to satisfy users' dependability requirements mainly because the failures and vulnerabilities in Cloud infrastructures are a norm rather than an exception. This thesis provides a comprehensive solution for improving the dependability of Cloud computing -- so that -- users can justifiably trust Cloud computing services for building, deploying and executing their applications. A number of approaches ranging from the use of trustworthy hardware to secure application design has been proposed in the literature. The proposed solution consists of three inter-operable yet independent modules, each designed to improve dependability under different system context and/or use-case. A user can selectively apply either a single module or combine them suitably to improve the dependability of her applications both during design time and runtime. Based on the modules applied, the overall proposed solution can increase dependability at three distinct levels. In the following, we provide a brief description of each module. The first module comprises a set of assurance techniques that validates whether a given service supports a specified dependability property with a given level of assurance, and accordingly, awards it a machine-readable certificate. To achieve this, we define a hierarchy of dependability properties where a property represents the dependability characteristics of the service and its specific configuration. A model of the service is also used to verify the validity of the certificate using runtime monitoring, thus complementing the dynamic nature of the Cloud computing infrastructure and making the certificate usable both at discovery and runtime. This module also extends the service registry to allow users to select services with a set of certified dependability properties, hence offering the basic support required to implement dependable applications. We note that this module directly considers services implemented by service providers and provides awareness tools that allow users to be aware of the QoS offered by potential partner services. We denote this passive technique as the solution that offers first level of dependability in this thesis. Service providers typically implement a standard set of dependability mechanisms that satisfy the basic needs of most users. Since each application has unique dependability requirements, assurance techniques are not always effective, and a pro-active approach to dependability management is also required. The second module of our solution advocates the innovative approach of offering dependability as a service to users' applications and realizes a framework containing all the mechanisms required to achieve this. We note that this approach relieves users from implementing low-level dependability mechanisms and system management procedures during application development and satisfies specific dependability goals of each application. We denote the module offering dependability as a service as the solution that offers second level of dependability in this thesis. The third, and the last, module of our solution concerns secure application execution. This module considers complex applications and presents advanced resource management schemes that deploy applications with improved optimality when compared to the algorithms of the second module. This module improves dependability of a given application by minimizing its exposure to existing vulnerabilities, while being subject to the same dependability policies and resource allocation conditions as in the second module. Our approach to secure application deployment and execution denotes the third level of dependability offered in this thesis. The contributions of this thesis can be summarized as follows.The contributions of this thesis can be summarized as follows. \u2022 With respect to assurance techniques our contributions are: i) de finition of a hierarchy of dependability properties, an approach to service modeling, and a model transformation scheme; ii) de finition of a dependability certifi cation scheme for services; iii) an approach to service selection that considers users' dependability requirements; iv) de finition of a solution to dependability certifi cation of composite services, where the dependability properties of a composite service are calculated on the basis of the dependability certi ficates of component services. \u2022 With respect to off ering dependability as a service our contributions are: i) de finition of a delivery scheme that transparently functions on users' applications and satisfi es their dependability requirements; ii) design of a framework that encapsulates all the components necessary to o er dependability as a service to the users; iii) an approach to translate high level users' requirements to low level dependability mechanisms; iv) formulation of constraints that allow enforcement of deployment conditions inherent to dependability mechanisms and an approach to satisfy such constraints during resource allocation; v) a resource management scheme that masks the a ffect of system changes by adapting the current allocation of the application. \u2022 With respect to security management our contributions are: i) an approach that deploys users' applications in the Cloud infrastructure such that their exposure to vulnerabilities is minimized; ii) an approach to build interruptible elastic algorithms whose optimality improves as the processing time increases, eventually converging to an optimal solution

    Tolérance aux pannes dans des environnements de calcul parallèle et distribué (optimisation des stratégies de sauvegarde/reprise et ordonnancement)

    Get PDF
    Le passage de l'échelle des nouvelles plates-formes de calcul parallèle et distribué soulève de nombreux défis scientifiques. À terme, il est envisageable de voir apparaître des applications composées d'un milliard de processus exécutés sur des systèmes à un million de coeurs. Cette augmentation fulgurante du nombre de processeurs pose un défi de résilience incontournable, puisque ces applications devraient faire face à plusieurs pannes par jours. Pour assurer une bonne exécution dans ce contexte hautement perturbé par des interruptions, de nombreuses techniques de tolérance aux pannes telle que l'approche de sauvegarde et reprise (checkpoint) ont été imaginées et étudiées. Cependant, l'intégration de ces approches de tolérance aux pannes dans le couple formé par l'application et la plate-forme d'exécution soulève des problématiques d'optimisation pour déterminer le compromis entre le surcoût induit par le mécanisme de tolérance aux pannes d'un coté et l'impact des pannes sur l'exécution d'un autre coté. Dans la première partie de cette thèse nous concevons deux modèles de performance stochastique (minimisation de l'impact des pannes et du surcoût des points de sauvegarde sur l'espérance du temps de complétion de l'exécution en fonction de la distribution d'inter-arrivées des pannes). Dans la première variante l'objectif est la minimisation de l'espérance du temps de complétion en considérant que l'application est de nature préemptive. Nous exhibons dans ce cas de figure tout d'abord une expression analytique de la période de sauvegarde optimale quand le taux de panne et le surcoût des points de sauvegarde sont constants. Par contre dans le cas où le taux de panne ou les surcoûts des points de sauvegarde sont arbitraires nous présentons une approche numérique pour calculer l'ordonnancement optimal des points de sauvegarde. Dans la deuxième variante, l'objectif est la minimisation de l'espérance de la quantité totale de temps perdu avant la première panne en considérant les applications de nature non-préemptive. Dans ce cas de figure, nous démontrons tout d'abord que si les surcoûts des points sauvegarde sont arbitraires alors le problème du meilleur ordonnancement des points de sauvegarde est NP-complet. Ensuite, nous exhibons un schéma de programmation dynamique pour calculer un ordonnancement optimal. Dans la deuxième partie de cette thèse nous nous focalisons sur la conception des stratégies d'ordonnancement tolérant aux pannes qui optimisent à la fois le temps de complétion de la dernière tâche et la probabilité de succès de l'application. Nous mettons en évidence dans ce cas de figure qu'en fonction de la nature de la distribution de pannes, les deux objectifs à optimiser sont tantôt antagonistes, tantôt congruents. Ensuite en fonction de la nature de distribution de pannes nous donnons des approches d'ordonnancement avec des ratios de performance garantis par rapport aux deux objectifs.The parallel computing platforms available today are increasingly larger. Typically the emerging parallel platforms will be composed of several millions of CPU cores running up to a billion of threads. This intensive growth of the number of parallel threads will make the application subject to more and more failures. Consequently it is necessary to develop efficient strategies providing safe and reliable completion for HPC parallel applications. Checkpointing is one of the most popular and efficient technique for developing fault-tolerant applications on such a context. However, checkpoint operations are costly in terms of time, computation and network communications. This will certainly affect the global performance of the application. In the first part of this thesis, we propose a performance model that expresses formally the checkpoint scheduling problem. Two variants of the problem have been considered. In the first variant, the objective is the minimization of the expected completion time. Under this model we prove that when the failure rate and the checkpoint cost are constant the optimal checkpoint strategy is necessarily periodic. For the general problem when the failure rate and the checkpoint cost are arbitrary we provide a numerical solution for the problem. In the second variant if the problem, we exhibit the tradeoff between the impact of the checkpoints operations and the lost computation due to failures. In particular, we prove that the checkpoint scheduling problem is NP-hard even in the simple case of uniform failure distribution. We also present a dynamic programming scheme for determining the optimal checkpointing times in all the variants of the problem. In the second part of this thesis, we design several fault tolerant scheduling algorithms that minimize the application makespan and in the same time maximize the application reliability. Mainly, in this part we point out that the growth rate of the failure distribution determines the relationship between both objectives. More precisely we show that when the failure rate is decreasing the two objectives are antagonist. In the second hand when the failure rate is increasing both objective are congruent. Finally, we provide approximation algorithms for both failure rate cases.SAVOIE-SCD - Bib.électronique (730659901) / SudocGRENOBLE1/INP-Bib.électronique (384210012) / SudocGRENOBLE2/3-Bib.électronique (384219901) / SudocSudocFranceF
    corecore