thesis

Recovery-oriented software architecture for grid applications (ROSA-Grids)

Abstract

Grids are distributed systems that dynamically coordinate a large number of heterogeneous resources to execute large-scale projects. Examples of grid resources include high-performance computers, massive data stores, high-bandwidth networking, telescopes, and synchrotrons. Failure in grids is arguably inevitable because of the massive scale and heterogeneity of grid resources, the distribution of these resources over unreliable networks, the complexity of the mechanisms needed to integrate such resources into a seamless utility, and the dynamic nature of the grid infrastructure, which allows continuous change. To make matters worse, grid applications are generally long running, and these runs repeatedly require the coordinated use of many resources at the same time.

In this thesis, we propose the Recovery-Aware Components (RAC) approach. The RAC approach enables a grid application to handle failure both reactively and proactively at the level of the smallest independent execution unit of the application. The approach also combines runtime prediction with a proactive fault tolerance strategy, and it aims to improve the reliability of the grid application with the least possible overhead. Moreover, to give a grid fault tolerance manager fine-tuned control over the tradeoff between reliability gained and overhead paid, this thesis offers architecture-aware modelling and simulation of reliability and overhead. For a few of the dozen or so classes of application architecture identified in prior research, the thesis demonstrates that the typical architectural structure of a class can be captured in a few parameters, and that these parameters suffice to achieve significant insight into, and control of, such tradeoffs.

The contributions of our research project are as follows. We defined the RAC approach. We showed how the RAC approach can be used to improve the reliability of MapReduce and Combinational Logic grid applications. We provided Markov models that represent the execution behaviour of these applications for reliability and overhead analyses. We analysed the sensitivity of the reliability-overhead tradeoff of the RAC approach to the type of fault tolerance strategy, the parameters of that strategy, the prediction interval, and the predictor's accuracy. The final contribution of our research is an experiment testbed that enables a grid fault tolerance expert to evaluate diverse fault tolerance support configurations and then choose the one that satisfies the reliability and cost requirements.
