Search CORE

423 research outputs found

Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management

Author: Fernandez RC
Kalyvianaki E
Migliavacca M
Pietzuch P
Publication venue: 'American College of Medical Physics (ACMP)'
Publication date: 01/01/2013
Field of study

As users of big data applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the pay-as-you-go model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs - systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the check-pointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures. Copyright © 2013 ACM

CiteSeerX

City Research Online

Crossref

Spiral - Imperial College Digital Repository

Kent Academic Repository

An aspect-oriented approach to fault-tolerance in grid platforms

Author: Medeiros Bruno
Sobral João Luís Ferreira
Publication venue: 'Netbiblo'
Publication date: 01/01/2011
Field of study

Migrating traditional scientific applications to computational Grids requires programming tools that can help programmers to update application behaviour to this kind of platforms. Computational Grids are particularly suited for long running scientific applications, but they are also more prone to faults than desktop machines. The AspectGrid framework aims to develop methodologies and tools that can help to Grid-enable scientific applications, particularly focusing on techniques based on aspect-oriented programming. In this paper we present the aspect-oriented approach taken in the AspectGrid framework to address faults in computational Grids. In the proposed approach, scientific applications are enhanced with fault-tolerance capability by plugging additional modules. The proposed technique is portable across operating systems and minimises the changes required to base applications

Universidade do Minho: RepositoriUM

AspectGrid: Aspect-Oriented Fault-Tolerance in Grid Platforms

Author: Medeiros Bruno
Sobral João
Publication venue: Institute of Informatics, Slovak Academy of Sciences
Publication date: 20/06/2012
Field of study

Migrating traditional scientific applications to computational Grids requires programming tools that can help programmers update application behaviour to this kind of platforms. Computational Grids are particularly suited for long running scientific applications, but they are also more prone to faults than desktop machines. The AspectGrid framework aims to develop methodologies and tools that can help Grid-enable scientific applications, particularly focusing on techniques based on aspect-oriented programming. In this paper we present the aspect-oriented approach taken in the AspectGrid framework to address faults in computational Grids. In the proposed approach, scientific applications are enhanced with fault-tolerance capability by plugging additional modules. The proposed technique is portable across operating systems and minimises the changes required to base applications

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

AspectGrid: aspect-oriented fault-tolerance in grid platforms

Author: Medeiros Bruno
Sobral João Luís Ferreira
Publication venue: Slovak Academic Press Ltd.
Publication date: 01/01/2012
Field of study

Universidade do Minho: RepositoriUM

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

Recommended from our members

Analyzing Spark Performance on Spot Instances

Author: Tian Jiannan
Publication venue: ScholarWorks@UMass Amherst
Publication date: 27/10/2017
Field of study

Amazon Spot Instances provide inexpensive service for high-performance computing. With spot instances, it is possible to get at most 90% off as discount in costs by bidding spare Amazon Elastic Computer Cloud (Amazon EC2) instances. In exchange for low cost, spot instances bring the reduced reliability onto the computing environment, because this kind of instance could be revoked abruptly by the providers due to supply and demand, and higher-priority customers are first served. To achieve high performance on instances with compromised reliability, Spark is applied to run jobs. In this thesis, a wide set of spark experiments are conducted to study its performance on spot instances. Without stateful replicating, Spark suffers from cascad- ing rollback and is forced to regenerate these states for ad hoc practices repeatedly. Such downside leads to discussion on trade-off between compatible slow checkpointing and regenerating on rollback and inspires us to apply multiple fault tolerance schemes. And Spark is proven to finish a job only with proper revocation rate. To validate and evaluate our work, prototype and simulator are designed and implemented. And based on real history price records, we studied how various checkpoint write frequencies and bid level affect performance. In case study, experiments show that our presented techniques can lead to ~20% shorter completion time and ~25% lower costs than those cases without such techniques. And compared with running jobs on full-price instance, the absolute saving in costs can be ~70%

ScholarWorks@UMass Amherst

Extending Scojo-PECT by migration based on application level checkpointing

Author: Shi Jiaying
Publication venue: 'University of Windsor Leddy Library'
Publication date: 01/01/2009
Field of study

In parallel computing, jobs have different runtimes and required computation resources. With runtimes correlated with resources, scheduling these jobs would be a packing problem getting the utilization and total execution time varies. Sometimes, resources are idle while jobs are preempted or have resource conflict with no chance to take use of them. This greatly wastes system resource at certain degree. Here we propose an approach which takes periodic checkpoints of running jobs with the chance to take advantage of migration to optimize our scheduler during long term scheduling. We improve our original Scojo-PECT preemptive scheduler which does not have checkpoint support before. We evaluate the gained execution time minus overhead of checkpointing/migration, to make comparison with original execution time

Scholarship at UWindsor

Resource management for extreme scale high performance computing systems in the presence of failures

Author: Dauwe Daniel
Publication venue: Colorado State University. Libraries
Publication date: 01/01/2018
Field of study

2018 Summer.Includes bibliographical references.High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale computation of applications over tens or hundreds of thousands of multicore processors. Unfortunately, as the size of HPC systems continues to grow towards exascale complexities, these systems experience an exponential growth in the number of failures occurring in the system. These failures reduce performance and increase energy use, reducing the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer from decreased performance and increased energy use as a result of applications being forced to share resources, in particular, the contention from multiple application threads sharing the last-level cache causes performance degradation. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute in these systems. To address these challenges, in this dissertation we propose a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise various techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques for intelligently utilizing system resources through the (a) optimal scheduling of applications to HPC nodes and (b) the optimal configuration of fault resilience protocols. These resource management techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for predictions. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation from both system failures and resource contention. We investigate how to better characterize and model the negative effects from system failures as well as application co-location on large-scale HPC computing systems. Our analysis of application and system behavior also investigates: the interrelated effects of network usage of applications and fault resilience protocols; checkpoint interval selection and its sensitivity to system parameters for various checkpoint-based fault resilience protocols; and performance comparisons of various promising strategies for fault resilience in exascale-sized systems

Mountain Scholar (Digital Collections of Colorado and Wyoming)

Dynamic fault tolerant grid workflow in the water threat management project

Author: Moon Young Suk
Publication venue: RIT Scholar Works
Publication date: 01/01/2010
Field of study

Achieving fault tolerance is an inevitable problem in distributed systems, with it becoming more challenging in decentralized, heterogeneous, and dynamic-environment systems such as a Grid. When deploying applications requires time-criticality, how to allocate resources for jobs in a fault-tolerant manner is an important issue for the delivery of the services. The Water Threat Management project is a research to find solutions for the contamination incidents problems in urban water distribution systems, and it involves the development of the cyberinfrastructure in a Grid environment. To handle such urgent events properly, the deployment of the system demands real-time processing without the failure. Our approach of integrating a fault-tolerant framework into a Water Threat Management system provides fault tolerance at the queuing stage rather than the job-execution stage by scheduling jobs in fault-tolerant ways. This includes the development of the batch queuing system in the Cyberaide Shell project. In addition, we present a dynamic workflow in the Water Threat Management system that can reduce the queue wait time in the changing environment

RIT Scholar Works