Stocator: A High Performance Object Store Connector for Spark
We present Stocator, a high performance object store connector for Apache
Spark, that takes advantage of object store semantics. Previous connectors have
assumed file system semantics, in particular, achieving fault tolerance and
allowing speculative execution by creating temporary files to avoid
interference between worker threads executing the same task and then renaming
these files. Rename is not a native object store operation; not only is it not
atomic, but it is implemented using a costly copy operation and a delete.
Instead, our connector leverages the inherent atomicity of object creation. By
avoiding the rename paradigm, it greatly decreases the number of operations
on the object store and enables a much simpler approach to dealing with
the eventually consistent semantics typical of object stores. We have
implemented Stocator and released it as open source. Performance testing shows
that it is as much as 18 times faster for write intensive workloads and
performs as much as 30 times fewer operations on the object store than the
legacy Hadoop connectors, reducing costs both for the client and the object
storage service provider.
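The contrast between the rename paradigm and direct object creation can be illustrated with a small operation count. The following is a minimal sketch, not Stocator's implementation: `ToyObjectStore`, the object key layout, and the attempt suffix are hypothetical names chosen for illustration, and the toy store models rename as a copy plus a delete, as the abstract describes. It does not attempt to reproduce the reported performance figures.

```python
from collections import Counter


class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object store that counts operations."""

    def __init__(self):
        self.objects = {}
        self.ops = Counter()

    def put(self, key, data):
        self.ops["PUT"] += 1
        self.objects[key] = data

    def copy(self, src, dst):
        self.ops["COPY"] += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.ops["DELETE"] += 1
        del self.objects[key]

    def rename(self, src, dst):
        # No native rename: emulated as a copy followed by a delete (not atomic).
        self.copy(src, dst)
        self.delete(src)


def write_with_rename(store, parts):
    """File-system style: write a temporary object, then rename it into place."""
    for i, data in enumerate(parts):
        tmp = f"out/_temporary/attempt_0/part-{i}"
        store.put(tmp, data)
        store.rename(tmp, f"out/part-{i}")


def write_direct(store, parts):
    """Object-store style: create the final object once, relying on atomic creation.

    Encoding the attempt in the key is an assumption made for this sketch.
    """
    for i, data in enumerate(parts):
        store.put(f"out/part-{i}-attempt_0", data)


def count_ops(writer, parts):
    """Run a writer against a fresh store; return total operations and final keys."""
    store = ToyObjectStore()
    writer(store, parts)
    return sum(store.ops.values()), sorted(store.objects)
```

In this toy model each output part costs three operations with the rename approach (one PUT, one COPY, one DELETE) and one operation with direct creation.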
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program, such as data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several follow-up works after its introduction. This article
provides a comprehensive survey of a family of approaches and mechanisms for
large scale data processing that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.
Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
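For readers unfamiliar with the programming model this survey covers, a word count is the usual illustration. The following is a minimal single-process sketch of the map, shuffle and reduce steps. The function names are hypothetical, and the code involves none of the distribution, scheduling or fault tolerance that an actual MapReduce framework provides.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The application code consists only of `map_phase` and `reduce_phase`; in a real framework the grouping and the parallel execution across machines are handled by the runtime.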
Next-Generation EU DataGrid Data Management Services
We describe the architecture and initial implementation of the
next generation of Grid Data Management Middleware in the EU DataGrid (EDG)
project.
The new architecture stems from our experience and the user requirements
gathered during the two years of running our initial set of Grid Data
Management Services. All of our new services are based on the Web Service
technology paradigm, very much in line with the emerging Open Grid Services
Architecture (OGSA). We have modularized our components and invested a great
amount of effort towards a secure, extensible and robust service, starting from
the design but also using a streamlined build and testing framework.
Our service components are: Replica Location Service, Replica Metadata
Service, Replica Optimization Service, Replica Subscription and high-level
replica management. The service security infrastructure is fully GSI-enabled,
hence compatible with the existing Globus Toolkit 2-based services; moreover,
it allows for fine-grained authorization mechanisms that can be adjusted
depending on the service semantics.
Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics
(CHEP03), La Jolla, CA, USA, March 2003. 8 pages, LaTeX; the file contains all
LaTeX sources; figures are in the directory "figures"
Sciunits: Reusable Research Objects
Science is conducted collaboratively, often requiring knowledge sharing about
computational experiments. When experiments include only datasets, they can be
shared using Uniform Resource Identifiers (URIs) or Digital Object Identifiers
(DOIs). An experiment, however, seldom includes only datasets, but more often
includes software, its past execution, provenance, and associated
documentation. The Research Object has recently emerged as a comprehensive and
systematic method for aggregation and identification of diverse elements of
computational experiments. While a necessary method, mere aggregation is not
sufficient for the sharing of computational experiments. Other users must be
able to easily recompute on these shared research objects. In this paper, we
present the sciunit, a reusable research object in which aggregated content is
recomputable. We describe a Git-like client that efficiently creates, stores,
and repeats sciunits. We show through analysis that sciunits repeat
computational experiments with minimal storage and processing overhead.
Finally, we provide an overview of sciunit-based sharing and reproducible
cyberinfrastructure that is gaining adoption in the geosciences.
Automated data reduction workflows for astronomy
Data from complex modern astronomical instruments often consist of a large
number of different science and calibration files, and their reduction requires
a variety of software tools. The execution chain of the tools represents a
complex workflow that needs to be tuned and supervised, often by individual
researchers who are not necessarily experts in any specific instrument. The
efficiency of data reduction can be improved by using automatic workflows to
organise data and execute the sequence of data reduction steps. To realize such
efficiency gains, we designed a system that allows intuitive representation,
execution and modification of the data reduction workflow, and has facilities
for inspection and interaction with the data. The European Southern Observatory
(ESO) has developed Reflex, an environment to automate data reduction
workflows. Reflex is implemented as a package of customized components for the
Kepler workflow engine. Kepler provides the graphical user interface to create
an executable flowchart-like representation of the data reduction process. Key
features of Reflex are a rule-based data organiser, infrastructure to re-use
results, thorough book-keeping, data progeny tracking, interactive user
interfaces, and a novel concept to exploit information created during data
organisation for the workflow execution. Reflex includes novel concepts to
increase the efficiency of astronomical data processing. While Reflex is a
specific implementation of astronomical scientific workflows within the Kepler
workflow engine, the overall design choices and methods can also be applied to
other environments for running automated science workflows.
Comment: 12 pages, 7 figures