61 research outputs found

    Recovery-oriented software architecture for grid applications (ROSA-Grids)

    Get PDF
    Grids are distributed systems that dynamically coordinate a large number of heterogeneous resources to execute large-scale projects. Examples of grid resources include high-performance computers, massive data stores, high bandwidth networking, telescopes, and synchrotrons. Failure in grids is arguably inevitable due to the massive scale and the heterogeneity of grid resources, the distribution of these resources over unreliable networks, the complexity of mechanisms that are needed to integrate such resources into a seamless utility, and the dynamic nature of the grid infrastructure that allows continuous changes to happen. To make matters worse, grid applications are generally long running, and these runs repeatedly require coordinated use of many resources at the same time. In this thesis, we propose the Recovery-Aware Components (RAC) approach. The RAC approach enables a grid application to handle failure reactively and proactively at the level of the smallest and independent execution unit of the application. The approach also combines runtime prediction with a proactive fault tolerance strategy. The RAC approach aims at improving the reliability of the grid application with the least overhead possible. Moreover, to allow a grid fault tolerance manager fine-tuned control and trading off of reliability gained and overhead paid, this thesis offers an architecture-aware modelling and simulation of reliability and overhead. The thesis demonstrates for a few of a dozen or so classes of application architecture already identified in prior research, that the typical architectural structure of the class can be captured in a few parameters. The work shows that these parameters suffice to achieve significant insight into, and control of, such tradeoffs. The contributions of our research project are as follows. We defined the RAC approach. We showed the usage of the RAC approach for improving the reliability of MapReduce and Combinational Logic grid applications. We provided Markov models that represent the execution behaviour of these applications for reliability and overhead analyses. We analysed the sensitivity of the reliability-overhead tradeoff of the RAC approach to the type of fault tolerance strategy, the parameters of a fault tolerance strategy, prediction interval and a predictor’s accuracy. The final contribution of our research is an experiment testbed that enables a grid fault tolerance expert to evaluate diverse fault tolerance support configurations, and then choose the one that will satisfy the reliability and cost requirements

    D2P: Automatically Creating Distributed Dynamic Programming Codes

    Get PDF
    Dynamic Programming (DP) algorithms are common targets for parallelization, and, as these algorithms are applied to larger inputs, distributed implementations become necessary. However, creating distributed-memory solutions involves the challenges of task creation, program and data partitioning, communication optimization, and task scheduling. In this paper we present D2P, an end-to-end system for automatically transforming a specification of any recursive DP algorithm into distributed-memory implementation of the algorithm. When given a pseudo-code of a recursive DP algorithm, D2P automatically generates the corresponding MPI-based implementation. Our evaluation of the generated distributed implementations shows that they are efficient and scalable. Moreover, D2P-generated implementations are faster than implementations generated by recent general distributed DP frameworks, and are competitive with (and often faster than) hand-written implementations

    Improving the Efficiency of Heterogeneous Clouds

    Get PDF

    Evaluation and Analysis of Distributed Graph-Parallel Processing Frameworks

    Get PDF
    A number of graph-parallel processing frameworks have been proposed to address the needs of processing complex and large-scale graph structured datasets in recent years. Although significant performance improvement made by those frameworks were reported, comparative advantages of each of these frameworks over the others have not been fully studied, which impedes the best utilization of those frameworks for a specific graph computing task and setting. In this work, we conducted a comparison study on parallel processing systems for large-scale graph computations in a systematic manner, aiming to reveal the characteristics of those systems in performing common graph algorithms with real-world datasets on the same ground. We selected three popular graph-parallel processing frameworks (Giraph, GPS and GraphLab) for the study and also include a representative general data-parallel computing system— Spark—in the comparison in order to understand how well a general data-parallel system can run graph problems. We applied basic performance metrics measuring speed, resource utilization, and scalability to answer a basic question of which graph-parallel processing platform is better suited for what applications and datasets. Three widely-used graph algorithms— clustering coefficient, shortest path length, and PageRank score—were used for benchmarking on the targeted computing systems.We ran those algorithms against three real world network datasets with diverse characteristics and scales on a research cluster and have obtained a number of interesting observations. For instance, all evaluated systems showed poor scalability (i.e., the runtime increases with more computing nodes) with small datasets likely due to communication overhead. Further, out of the evaluated graphparallel computing platforms, PowerGraph consistently exhibits better performance than others

    ATCOM: Automatically tuned collective communication system for SMP clusters.

    Get PDF
    Conventional implementations of collective communications are based on point-to-point communications, and their optimizations have been focused on efficiency of those communication algorithms. However, point-to-point communications are not the optimal choice for modern computing clusters of SMPs due to their two-level communication structure. In recent years, a few research efforts have investigated efficient collective communications for SMP clusters. This dissertation is focused on platform-independent algorithms and implementations in this area;There are two main approaches to implementing efficient collective communications for clusters of SMPs: using shared memory operations for intra-node communications, and over-lapping inter-node/intra-node communications. The former fully utilizes the hardware based shared memory of an SMP, and the latter takes advantage of the inherent hierarchy of the communications within a cluster of SMPs. Previous studies focused on clusters of SMP from certain vendors. However, the previously proposed methods are not portable to other systems. Because the performance optimization issue is very complicated and the developing process is very time consuming, it is highly desired to have self-tuning, platform-independent implementations. As proven in this dissertation, such an implementation can significantly outperform the other point-to-point based portable implementations and some platform-specific implementations;The dissertation describes in detail the architecture of the platform-independent implementation. There are four system components: shared memory-based collective communications, overlapping mechanisms for inter-node and intra-node communications, a prediction-based tuning module and a micro-benchmark based tuning module. Each component is carefully designed with the goal of automatic tuning in mind
    • …
    corecore