Automating Fault Tolerance in High-Performance Computational Biological Jobs Using Multi-Agent Approaches
Background: Large-scale biological jobs on high-performance computing systems
require manual intervention if one or more computing cores on which they
execute fail. This places a cost not only on maintaining the job, but also on
the time taken to reinstate it, and carries the risk of losing data and
execution progress accomplished by the job before it failed. Approaches which
can proactively detect computing core failures and take action to relocate the
computing core's job onto reliable cores can make a significant step towards
automating fault tolerance.
Method: This paper describes an experimental investigation into the use of
multi-agent approaches for fault tolerance. Two approaches are studied, the
first at the job level and the second at the core level. The approaches are
investigated for single core failure scenarios that can occur in the execution
of parallel reduction algorithms on computer clusters. A third approach is
proposed that incorporates multi-agent technology both at the job and core
level. Experiments are pursued in the context of genome searching, a popular
computational biology application.
Result: The key conclusion is that the approaches proposed are feasible for
automating fault tolerance in high-performance computing systems with minimal
human intervention. In a typical experiment in which the fault tolerance is
studied, centralised and decentralised checkpointing approaches on average add
90% to the actual time for executing the job. On the other hand, in the same
experiment the multi-agent approaches add only 10% to the overall execution
time.
Comment: Computers in Biology and Medicine
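To make the overhead comparison above concrete, the following Python sketch models the two costs in the simplest possible terms: periodic checkpointing pays a fixed cost per checkpoint, while the agent-based scheme pays a small monitoring cost per step plus an occasional job migration. All timing values are illustrative assumptions chosen only to produce figures of the same order as those quoted above; they are not taken from the paper's experiments.

# Toy cost model (illustrative assumptions only, not the paper's implementation).

def checkpoint_overhead(job_time, interval, checkpoint_cost):
    """Total cost of writing a checkpoint every `interval` seconds."""
    return int(job_time // interval) * checkpoint_cost

def agent_overhead(job_time, step, monitor_cost, migrations, migration_cost):
    """Cost of lightweight health monitoring plus occasional job migrations."""
    return int(job_time // step) * monitor_cost + migrations * migration_cost

job_time = 3600.0  # a one-hour job (assumed)
cp = checkpoint_overhead(job_time, interval=60.0, checkpoint_cost=54.0)
ag = agent_overhead(job_time, step=10.0, monitor_cost=0.05,
                    migrations=1, migration_cost=342.0)
print(f"checkpointing overhead: {cp / job_time:.0%}")   # 90% with these numbers
print(f"agent-based overhead:   {ag / job_time:.0%}")   # 10% with these numbers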
Self-repairing Homomorphic Codes for Distributed Storage Systems
Erasure codes provide a storage efficient alternative to replication based
redundancy in (networked) storage systems. They however entail high
communication overhead for maintenance, when some of the encoded fragments are
lost and need to be replenished. Such overheads arise from the fundamental need
to first recreate (or keep separately) a copy of the whole object before any
individual encoded fragment can be generated and replenished. There has recently
been intense interest in exploring alternatives, the most prominent being
regenerating codes (RGC) and hierarchical codes (HC). We propose as an
alternative a new family of codes to improve the maintenance process, which we
call self-repairing codes (SRC), with the following salient features: (a)
encoded fragments can be repaired directly from other subsets of encoded
fragments without having to reconstruct first the original data, ensuring that
(b) a fragment is repaired from a fixed number of encoded fragments, the number
depending only on how many encoded blocks are missing and independent of which
specific blocks are missing. These properties allow for not only low
communication overhead to recreate a missing fragment, but also independent
reconstruction of different missing fragments in parallel, possibly in
different parts of the network. We analyze the static resilience of SRCs with
respect to traditional erasure codes, and observe that SRCs incur marginally
larger storage overhead in order to achieve the aforementioned properties. The
salient SRC properties naturally translate to low communication overheads for
reconstruction of lost fragments, and allow reconstruction with lower latency
by facilitating repairs in parallel. These desirable properties make
self-repairing codes a good and practical candidate for networked distributed
storage systems.
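The repair-without-decoding property can be illustrated with a deliberately small GF(2) example (a simplification for illustration only, not the homomorphic construction proposed in the paper): if every encoded fragment is an XOR of data blocks, a lost fragment can be rebuilt by XORing two surviving fragments whose index sets differ by exactly the lost fragment's indices, without ever reassembling the original object.

from itertools import combinations

def encode(blocks):
    """Encode data blocks into all non-empty XOR combinations (toy scheme)."""
    frags = {}
    for r in range(1, len(blocks) + 1):
        for idx in combinations(range(len(blocks)), r):
            val = 0
            for i in idx:
                val ^= blocks[i]
            frags[frozenset(idx)] = val
    return frags

def repair(lost_key, survivors):
    """Rebuild a lost fragment from two surviving fragments whose index
    sets have symmetric difference equal to the lost fragment's indices."""
    for a, b in combinations(survivors, 2):
        if a ^ b == lost_key:            # symmetric difference of index sets
            return survivors[a] ^ survivors[b]
    raise ValueError("not enough surviving fragments")

blocks = [0b1010, 0b0110, 0b1100]        # three toy data blocks
frags = encode(blocks)
lost = frozenset({0, 1})                 # suppose the fragment x1^x2 is lost
survivors = {k: v for k, v in frags.items() if k != lost}
assert repair(lost, survivors) == frags[lost]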
MOON: MapReduce On Opportunistic eNvironments
MapReduce offers a flexible programming model for processing and generating large data sets on dedicated resources, where only a small fraction of such resources are ever unavailable at any given time. In contrast, when MapReduce is run on volunteer computing systems, which opportunistically harness idle desktop computers via frameworks like Condor, it results in poor performance due to the volatility of the resources, in particular the high rate of node unavailability. Specifically, the data and task replication scheme adopted by existing MapReduce implementations is woefully inadequate for resources with high unavailability. To address this, we propose MOON, short for MapReduce On Opportunistic eNvironments. MOON extends Hadoop, an open-source implementation of MapReduce, with adaptive task and data scheduling algorithms in order to offer reliable MapReduce services on a hybrid resource architecture, where volunteer computing systems are supplemented by a small set of dedicated nodes. The adaptive task and data scheduling algorithms in MOON distinguish between (1) different types of MapReduce data and (2) different types of node outages in order to strategically place tasks and data on both volatile and dedicated nodes. Our tests demonstrate that MOON can deliver a 3-fold performance improvement to Hadoop in volatile, volunteer computing environments.
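The hybrid placement idea can be sketched as a simple policy (an illustration of the general principle only; MOON's actual scheduling algorithms are more involved and are not reproduced here): hard-to-recompute intermediate data is pinned to a dedicated node, while re-fetchable data receives enough volunteer-node replicas to keep the chance of all copies being unavailable at once below a target. The function name and the probability threshold below are assumptions made for the example.

def replication_plan(data_kind, volunteer_unavailability, target_availability=0.99):
    """Toy placement policy: returns (dedicated replicas, volunteer replicas).
    data_kind is "input" (re-fetchable) or "intermediate" (lost work if missing);
    volunteer_unavailability is the observed fraction of time a volunteer is away."""
    # Critical, hard-to-recompute data always gets a dedicated-node copy.
    dedicated = 1 if data_kind == "intermediate" else 0

    # Otherwise, add volunteer replicas until the probability that all of
    # them are unavailable simultaneously drops below the miss budget.
    miss_budget = 1.0 - target_availability
    volunteers, p_all_missing = 0, 1.0
    while dedicated == 0 and p_all_missing > miss_budget:
        volunteers += 1
        p_all_missing *= volunteer_unavailability
    return dedicated, volunteers

print(replication_plan("input", volunteer_unavailability=0.4))        # (0, 6)
print(replication_plan("intermediate", volunteer_unavailability=0.4)) # (1, 0)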
A runtime heuristic to selectively replicate tasks for application-specific reliability targets
In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification or recompilation of the OS, compiler or application code. Our heuristic, which we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective replication heuristic is low-overhead and highly scalable. In addition, results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.
This work was supported by the FI-DGR 2013 scholarship, the European Community's Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402, and in part by the European Union (FEDER funds) under contract TIN2015-65316-P.
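The selective-replication idea can be sketched as a greedy procedure (an illustration of the general idea only; the actual App_FIT heuristic is not reproduced here): duplicate the most failure-prone tasks first, until the estimated probability that the whole application completes correctly reaches the target. The per-task failure probabilities below are made-up toy values.

def select_tasks_to_replicate(task_fail_probs, reliability_target):
    """Greedily choose task indices to duplicate until the estimated
    application-level reliability meets the target (toy model)."""
    effective = list(task_fail_probs)     # per-task failure probability
    replicated = set()

    def app_reliability():
        r = 1.0
        for p in effective:
            r *= (1.0 - p)
        return r

    # Duplicating a task means it fails only if both copies fail: p -> p*p.
    for i in sorted(range(len(effective)), key=lambda i: -task_fail_probs[i]):
        if app_reliability() >= reliability_target:
            break
        effective[i] = task_fail_probs[i] ** 2
        replicated.add(i)
    return replicated

fail_probs = [0.002, 0.010, 0.0005, 0.008, 0.004]    # toy per-task error rates
print(select_tasks_to_replicate(fail_probs, reliability_target=0.995))   # {1, 3, 4}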
Next Generation Cloud Computing: New Trends and Research Directions
The landscape of cloud computing has significantly changed over the last
decade. Not only have more providers and service offerings crowded the space,
but also cloud infrastructure that was traditionally limited to single provider
data centers is now evolving. In this paper, we first discuss the changing
cloud infrastructure and consider the use of infrastructure from multiple
providers and the benefit of decentralising computing away from data centers.
These trends have resulted in the need for a variety of new computing
architectures that will be offered by future cloud infrastructure. These
architectures are anticipated to impact areas, such as connecting people and
devices, data-intensive computing, the service space and self-learning systems.
Finally, we lay out a roadmap of challenges that will need to be addressed for
realising the potential of next generation cloud systems.
Comment: Accepted to Future Generation Computer Systems, 07 September 201