1,818 research outputs found

    Survey On Fault Tolerance In Grid Computing

    Full text link

    DBRS: Directed Acyclic Graph based Reliable Scheduling Approach in Large Scale Computing

    Get PDF
    In large scale environments, scheduling presents a significant challenge because it is an NP-hard problem. There are basically two types of tasks in execution: dependent tasks and independent tasks. Dependent tasks must be executed in a strict order because the output of one activity is typically the input of another. In this paper, a reliable fault-tolerant approach is proposed for scheduling dependent tasks in large scale computing environments. The workflow of dependent tasks is represented with the help of a DAG (directed acyclic graph). The proposed methodology is evaluated over various parameters by applying it in a large scale computing environment, grid computing. Grid computing is a high-performance computing paradigm for solving complex, large and data-intensive problems in various fields. The result analysis shows that the proposed DAG-based reliable scheduling (DBRS) approach increases the performance of the system by decreasing the makespan and the number of failures and by increasing the performance improvement ratio (PIR).
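    As a rough illustration of the idea (a generic sketch, not the DBRS algorithm itself), the code below represents a dependent-task workflow as a DAG, orders the tasks with a topological sort so that no task runs before its parents, and greedily maps each task to the resource giving the earliest finish time. The task names, runtimes and resource speeds are invented for the example.

        from collections import defaultdict, deque

        # Hypothetical workflow: task -> set of tasks it depends on.
        deps = {
            "A": set(),        # entry task
            "B": {"A"},
            "C": {"A"},
            "D": {"B", "C"},   # exit task, needs both B and C
        }
        runtime = {"A": 4, "B": 3, "C": 2, "D": 5}
        resources = {"r1": 1.0, "r2": 0.5}   # relative speeds (made-up)

        def topological_order(deps):
            """Kahn's algorithm: emit a task only after all of its parents."""
            indegree = {t: len(p) for t, p in deps.items()}
            children = defaultdict(list)
            for t, parents in deps.items():
                for p in parents:
                    children[p].append(t)
            ready = deque(t for t, d in indegree.items() if d == 0)
            order = []
            while ready:
                t = ready.popleft()
                order.append(t)
                for c in children[t]:
                    indegree[c] -= 1
                    if indegree[c] == 0:
                        ready.append(c)
            return order

        def schedule(deps, runtime, resources):
            """Greedy earliest-finish-time mapping; returns the finish time of each task."""
            free_at = {r: 0.0 for r in resources}   # when each resource becomes idle
            finish = {}
            for t in topological_order(deps):
                earliest_start = max((finish[p] for p in deps[t]), default=0.0)
                # pick the resource on which this task finishes soonest
                best = min(resources,
                           key=lambda r: max(free_at[r], earliest_start) + runtime[t] / resources[r])
                start = max(free_at[best], earliest_start)
                finish[t] = start + runtime[t] / resources[best]
                free_at[best] = finish[t]
            return finish

        finish = schedule(deps, runtime, resources)
        print("makespan:", max(finish.values()))

    The makespan printed at the end is the metric the abstract reports; a fault-tolerant scheduler would additionally re-map tasks whose resource fails.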

    Fault aware task scheduling in cloud using min-min and DBSCAN

    Get PDF
    Cloud computing leverages computing resources by managing them globally in a more efficient manner than individual resource services. It requires delivering resources in a heterogeneous and highly dynamic environment. Hence, there is always a risk of resource allocation failure that can increase the delay in task execution. Such adverse impact in the cloud environment also raises questions about quality of service (QoS). Resource management for cloud applications and services poses big challenges, and although many researchers have proposed solutions, there is still room for improvement. Clustering the resources and mapping them according to the task can also be an option to deal with such task failures or mismanaged resource allocation. Density-based spatial clustering of applications with noise (DBSCAN) is a density-based clustering algorithm which has the capability to cluster the resources in a cloud environment. The proposed algorithm considers powerful, high-execution data centers with the least fault probability during resource allocation, which reduces the probability of fault and increases the tolerance. The simulation is done using the CloudSim 5.0 toolkit. The results show a 25% average improvement in execution time, a 6.5% improvement in the number of tasks completed and a 3.48% improvement in the count of failed tasks as compared to ACO, PSO, BB-BC (Big Bang-Big Crunch) and WOA (whale optimization algorithm).
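    A minimal sketch of the general idea, not the paper's exact algorithm: cluster hypothetical data centers by fault probability and capacity with DBSCAN (here via scikit-learn), keep the most reliable cluster, and then assign tasks to it with min-min scheduling. All names, numbers and thresholds below are assumptions made for the example.

        import numpy as np
        from sklearn.cluster import DBSCAN

        # Hypothetical data centers: (fault probability, normalized capacity).
        dc = np.array([
            [0.02, 0.90], [0.03, 0.80], [0.04, 0.85],   # reliable, fast
            [0.20, 0.40], [0.25, 0.35],                 # failure-prone, slow
        ])
        labels = DBSCAN(eps=0.15, min_samples=2).fit(dc).labels_

        # Keep the cluster with the lowest mean fault probability (label -1 is noise).
        clusters = [np.where(labels == c)[0] for c in set(labels) if c != -1]
        best = min(clusters, key=lambda idx: dc[idx, 0].mean())
        capacity = dc[best, 1]                  # capacities of the chosen data centers

        # Min-min: repeatedly schedule the task whose minimum completion time is smallest.
        task_len = [4.0, 2.0, 6.0, 1.0]         # hypothetical task lengths
        ready_at = np.zeros(len(best))          # when each chosen data center is free
        assignment = {}
        unscheduled = set(range(len(task_len)))
        while unscheduled:
            # completion time of every unscheduled task on every chosen data center
            ct = {(t, m): ready_at[m] + task_len[t] / capacity[m]
                  for t in unscheduled for m in range(len(best))}
            t, m = min(ct, key=ct.get)          # overall minimum completion time
            assignment[t] = int(best[m])        # map back to the original data center index
            ready_at[m] = ct[(t, m)]
            unscheduled.remove(t)
        print(assignment)

    Restricting min-min to the low-fault cluster is one plausible reading of "fault aware" scheduling; the paper's own criteria for combining the two steps may differ.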

    The Contemporary Affirmation of Taxonomy and Recent Literature on Workflow Scheduling and Management in Cloud Computing

    Get PDF
    The preference for cloud computing systems over traditional forms of computing such as grid computing, utility computing and autonomic computing is attributed to their ease of access to computing, QoS preferences, SLA conformity, and the security and performance offered with minimal supervision. A cloud workflow schedule, when designed efficiently, achieves optimal resource usage, balanced workloads, deadline-specific execution, cost control according to budget specifications, efficient consumption of energy, etc., to meet the performance requirements of today's vast scientific and business demands. Business requirements arising from recent technologies like pervasive computing are motivating further advancements in cloud computing. In this paper we discuss some of the important literature published on cloud workflow scheduling.

    A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing

    Full text link
    Data Grids have been adopted as the platform for scientific communities that need to share, access, transport, process and manage large data collections distributed worldwide. They combine high-end computing technologies with high-performance networking and wide-area storage management techniques. In this paper, we discuss the key concepts behind Data Grids and compare them with other data sharing and distribution paradigms such as content delivery networks, peer-to-peer networks and distributed databases. We then provide comprehensive taxonomies that cover various aspects of architecture, data transportation, data replication and resource allocation and scheduling. Finally, we map the proposed taxonomy to various Data Grid systems not only to validate the taxonomy but also to identify areas for future exploration. Through this taxonomy, we aim to categorise existing systems to better understand their goals and their methodology. This would help evaluate their applicability for solving similar problems. This taxonomy also provides a "gap analysis" of this area through which researchers can potentially identify new issues for investigation. Finally, we hope that the proposed taxonomy and mapping also helps to provide an easy way for new practitioners to understand this complex area of research. Comment: 46 pages, 16 figures, Technical Report

    MAPREDUCE CHALLENGES ON PERVASIVE GRIDS

    No full text
    This study presents the advances in designing and implementing scalable techniques to support the development and execution of MapReduce applications in pervasive distributed computing infrastructures, in the context of the PER-MARE project. A pervasive framework for MapReduce applications is very useful in practice, especially in scientific, enterprise and educational centers which have many unused or underused computing resources that can be fully exploited to solve relevant problems demanding large computing power, such as scientific computing applications, big data processing, etc. In this study, we propose multiple techniques to support volatility and heterogeneity in MapReduce, by applying two complementary approaches: improving the Apache Hadoop middleware by including context-awareness and fault-tolerance features; and providing an alternative pervasive grid implementation, fully adapted to dynamic environments. The main design and implementation decisions for both alternatives are described and validated through experiments, demonstrating that our approaches provide high reliability when executing on pervasive environments. The analysis of the experiments also leads to several insights on the requirements and constraints of dynamic and volatile systems, reinforcing the importance of context-aware information and advanced fault-tolerance features to provide efficient and reliable MapReduce services on pervasive grids.
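    A minimal, self-contained sketch of the fault-tolerance idea (re-executing failed map tasks), not the PER-MARE or Hadoop implementation: the word-count job below simulates volatile nodes that sometimes fail and retries each map task until it succeeds or runs out of attempts. All names and probabilities are invented for the example.

        import random
        from collections import defaultdict

        random.seed(1)

        def map_word_count(chunk):
            """Map phase: emit (word, 1) pairs; a volatile node may fail mid-task."""
            if random.random() < 0.3:                  # simulated node departure
                raise RuntimeError("node left the pervasive grid")
            return [(w, 1) for w in chunk.split()]

        def run_map_with_retries(chunks, max_retries=3):
            """Re-execute a failed map task on another (simulated) node."""
            pairs = []
            for chunk in chunks:
                for attempt in range(max_retries):
                    try:
                        pairs.extend(map_word_count(chunk))
                        break
                    except RuntimeError:
                        continue                       # retried execution elsewhere
                else:
                    raise RuntimeError("task failed on every available node")
            return pairs

        def reduce_word_count(pairs):
            """Reduce phase: sum the counts per word."""
            counts = defaultdict(int)
            for word, n in pairs:
                counts[word] += n
            return dict(counts)

        chunks = ["grid computing on volatile nodes", "pervasive grid computing"]
        print(reduce_word_count(run_map_with_retries(chunks)))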

    Resiliency in numerical algorithm design for extreme scale simulations

    Get PDF
    This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, that was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of involving an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically. Peer reviewed. Article signed by 36 authors: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik Göddeke, Marco Heisig, Fabienne Jezequel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortí, Francesco Rizzi, Ulrich Rüde, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thönnes, Andreas Wagner and Barbara Wohlmuth. Postprint (author's final draft).
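    As a rough illustration of application-level checkpoint/rollback, one of the remedies mentioned above (this is a generic sketch, not a method from the seminar), the code below periodically writes the solver state to disk and resumes from the last checkpoint after a restart. The file name, state layout and checkpoint interval are assumptions made for the example.

        import os
        import pickle

        CKPT = "state.ckpt"          # hypothetical checkpoint file

        def load_checkpoint():
            """Resume from the last checkpoint if one exists, else start fresh."""
            if os.path.exists(CKPT):
                with open(CKPT, "rb") as f:
                    return pickle.load(f)
            return {"step": 0, "value": 0.0}

        def save_checkpoint(state):
            """Write atomically so a crash never leaves a half-written checkpoint."""
            tmp = CKPT + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp, CKPT)

        state = load_checkpoint()
        while state["step"] < 1000:
            state["value"] += 0.5 * state["step"]   # stand-in for one solver iteration
            state["step"] += 1
            if state["step"] % 100 == 0:            # interval trades I/O overhead vs. lost work
                save_checkpoint(state)
        print(state["value"])

    The checkpoint interval is exactly the overhead-versus-rework trade-off the abstract warns about: at exascale, writing the full state synchronously and too often becomes intolerable, which motivates the more advanced, algorithm-aware techniques discussed in the article.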