MOON: MapReduce On Opportunistic eNvironments
Abstract—MapReduce offers a flexible programming model for processing and generating large data sets on dedicated resources, where only a small fraction of such resources are ever unavailable at any given time. In contrast, when MapReduce is run on volunteer computing systems, which opportunistically harness idle desktop computers via frameworks like Condor, performance suffers because of the volatility of the resources, in particular the high rate of node unavailability. Specifically, the data and task replication scheme adopted by existing MapReduce implementations is woefully inadequate for resources with high unavailability. To address this, we propose MOON, short for MapReduce On Opportunistic eNvironments. MOON extends Hadoop, an open-source implementation of MapReduce, with adaptive task and data scheduling algorithms in order to offer reliable MapReduce services on a hybrid resource architecture, where volunteer computing systems are supplemented by a small set of dedicated nodes. The adaptive task and data scheduling algorithms in MOON distinguish between (1) different types of MapReduce data and (2) different types of node outages in order to strategically place tasks and data on both volatile and dedicated nodes. Our tests demonstrate that MOON can deliver a 3-fold performance improvement over Hadoop in volatile, volunteer computing environments.
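The abstract does not spell out the scheduling algorithms themselves. Purely as a rough illustration of the idea of placing data on a hybrid of volatile and dedicated nodes, the following minimal Python sketch distinguishes data types and node volatility; all names (Node, choose_replica_targets, the 'intermediate'/'input' labels) are hypothetical, not the authors' API.

# Hypothetical sketch of a MOON-style hybrid placement policy (not the
# paper's algorithm): hard-to-regenerate data gets one replica pinned to
# a dedicated node, while remaining replicas go to the least-volatile
# volunteer nodes.
from dataclasses import dataclass
import random

@dataclass
class Node:
    name: str
    dedicated: bool          # True for the small set of dedicated nodes
    unavailability: float    # observed fraction of time the node is down

def choose_replica_targets(nodes, data_kind, replicas=3):
    """Pick nodes to hold replicas of one data block.

    data_kind: 'input' blocks can be re-fetched from stable storage, so
    volatile nodes suffice; 'intermediate' map output is expensive to
    regenerate, so one copy is pinned to a dedicated node if available.
    """
    dedicated = [n for n in nodes if n.dedicated]
    volatile = sorted((n for n in nodes if not n.dedicated),
                      key=lambda n: n.unavailability)
    targets = []
    if data_kind == "intermediate" and dedicated:
        targets.append(random.choice(dedicated))
    # Fill the remaining replica slots with the least-volatile volunteers.
    targets.extend(volatile[: replicas - len(targets)])
    return targets

if __name__ == "__main__":
    cluster = [Node("d0", True, 0.01)] + [
        Node(f"v{i}", False, random.uniform(0.1, 0.6)) for i in range(8)
    ]
    print([n.name for n in choose_replica_targets(cluster, "intermediate")])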
Resource provisioning in Science Clouds: Requirements and challenges
Cloud computing has permeated the information technology industry over the last few years, and it is now emerging in scientific environments. Science user communities demand a broad range of computing resources, such as local clusters, high-performance computing systems, and computing grids, to satisfy the needs of high-performance applications. Different computational models impose different workloads, and the cloud is already considered a promising paradigm for serving them. The scheduling and allocation of resources is always a challenging matter in any form of computation, and clouds are no exception. Scientific applications have unique features that differentiate their workloads; hence, their requirements have to be taken into consideration when building a Science Cloud. This paper discusses the main scheduling and resource allocation challenges for any Infrastructure as a Service provider supporting scientific applications.
Portability of Scientific Workflows in NGS Data Analysis: A Case Study
The analysis of next-generation sequencing (NGS) data requires complex computational workflows consisting of dozens of autonomously developed yet interdependent processing steps. Whenever large amounts of data need to be processed, these workflows must be executed on parallel and/or distributed systems to ensure reasonable runtimes. Porting a workflow developed for a particular system on a particular hardware infrastructure to another system or another infrastructure is non-trivial, which poses a major impediment to the scientific necessities of workflow reproducibility and reusability. In this work, we describe our efforts to port a state-of-the-art workflow for the detection of specific variants in whole-exome sequencing of mice. The workflow was originally developed in the scientific workflow system snakemake for execution on a high-performance cluster controlled by Sun Grid Engine. We ported it to the scientific workflow system SaasFee, which can execute workflows on (multi-core) stand-alone servers or on clusters of arbitrary size using Hadoop. The purpose of this port was to enable owners of the low-cost hardware infrastructures for which Hadoop was designed to use the workflow as well. Although both the source and the target system are called scientific workflow systems, they differ in numerous aspects, ranging from the workflow languages to the scheduling mechanisms and the file access interfaces. These differences resulted in various problems, some expected and others unexpected, that had to be resolved before the workflow could be run with equal semantics. As a side effect, we also report cost/runtime ratios for a state-of-the-art NGS workflow on very different hardware platforms: a comparably cheap stand-alone server (80 threads), a mid-cost, mid-sized cluster (552 threads), and a high-end HPC system (3784 threads).
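Neither the Snakemake nor the SaasFee (Cuneiform) workflow code is reproduced in the abstract. Solely to illustrate why such ports are non-trivial, the hypothetical Python sketch below separates a tiny two-step, file-based pipeline from its execution backend, the two concerns (workflow definition vs. scheduling and file access) that the abstract says differ between the systems; the step names and commands are invented stand-ins, and POSIX shell tools (cat, wc) are assumed.

# Hypothetical illustration (not SaasFee or Snakemake code): a two-step
# pipeline whose definition is kept independent of the execution backend.
# Real ports are hard precisely because language, scheduler, and
# file-access assumptions are usually entangled, unlike in this toy.
import os
import subprocess
import tempfile

# Each step: (name, command template, inputs, outputs); files are the
# only interface between steps, mirroring file-based workflow systems.
STEPS = [
    ("align", "cat {inp} > {out}", ["reads.txt"], ["aligned.txt"]),
    ("call", "wc -l {inp} > {out}", ["aligned.txt"], ["variants.txt"]),
]

def run_locally(steps, workdir):
    """A stand-in for a scheduler: runs the steps serially on one host.

    A cluster or Hadoop backend would substitute its own job submission
    and file-staging logic here without touching the step definitions.
    """
    for name, cmd, inputs, outputs in steps:
        inp = os.path.join(workdir, inputs[0])
        out = os.path.join(workdir, outputs[0])
        print(f"[{name}] {cmd.format(inp=inp, out=out)}")
        subprocess.run(cmd.format(inp=inp, out=out), shell=True, check=True)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "reads.txt"), "w") as f:
            f.write("ACGT\nTGCA\n")
        run_locally(STEPS, d)
        with open(os.path.join(d, "variants.txt")) as f:
            print("result:", f.read().strip())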