Experimental Performance Evaluation of Cloud-Based Analytics-as-a-Service
An increasing number of Analytics-as-a-Service solutions have recently emerged
in the landscape of cloud-based services. These services allow flexible
composition of compute and storage components into powerful data ingestion and
processing pipelines. This work is a first attempt at an experimental
evaluation of the performance of analytic applications executed across a wide
range of storage service configurations. We present an intuitive notion of data
locality, which we use as a proxy to rank different service compositions in
terms of expected performance. Through an empirical analysis, we dissect the
performance achieved by analytic workloads and unveil problems due to the
impedance mismatch that arises in some configurations. Our work paves the way to
a better understanding of modern cloud-based analytic services and their
performance, both for their end users and their providers.
Comment: Longer version of the paper in submission at IEEE CLOUD'1
Stocator: A High Performance Object Store Connector for Spark
We present Stocator, a high-performance object store connector for Apache
Spark that takes advantage of object store semantics. Previous connectors have
assumed file system semantics. In particular, they achieve fault tolerance and
allow speculative execution by writing temporary files, which avoids
interference between worker threads executing the same task, and then renaming
these files. Rename is not a native object store operation: it is not atomic,
and it is implemented as a costly copy operation followed by a delete.
Instead, our connector leverages the inherent atomicity of object creation. By
avoiding the rename paradigm, it greatly decreases the number of operations on
the object store and enables a much simpler approach to dealing with the
eventually consistent semantics typical of object stores. We have implemented
Stocator and released it as open source. Performance testing shows that it is
as much as 18 times faster for write-intensive workloads and performs as much
as 30 times fewer operations on the object store than the legacy Hadoop
connectors, reducing costs for both the client and the object storage service
provider.
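
The difference between the rename paradigm and direct object creation can be
illustrated with a toy model. The sketch below is not Stocator's code and does
not reproduce the measurements quoted above. It assumes a hypothetical
in-memory store (ToyObjectStore) that counts operations and emulates rename as
a copy plus a delete, as the abstract describes. The object naming scheme,
including the attempt identifier in the final name, is an illustrative
assumption, and real connectors issue further operations (listings, existence
checks, multiple renames) that are omitted here.

```python
from collections import Counter


class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object store; counts operations."""

    def __init__(self):
        self.objects = {}
        self.ops = Counter()

    def put(self, key, data):
        # Creating an object is a single operation.
        self.ops["PUT"] += 1
        self.objects[key] = data

    def copy(self, src, dst):
        self.ops["COPY"] += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.ops["DELETE"] += 1
        del self.objects[key]

    def rename(self, src, dst):
        # No native rename: emulate it with a copy followed by a delete.
        self.copy(src, dst)
        self.delete(src)


def write_with_rename(store, dataset, part, attempt, data):
    """File-system-style commit: write a temporary object, then rename it."""
    tmp = f"{dataset}/_temporary/attempt_{attempt}/part-{part}"
    store.put(tmp, data)
    store.rename(tmp, f"{dataset}/part-{part}")


def write_direct(store, dataset, part, attempt, data):
    """Direct commit: create the final object once, with no rename.

    Assumption: putting the attempt id in the final name keeps concurrent
    attempts of the same task from overwriting each other.
    """
    store.put(f"{dataset}/part-{part}-attempt_{attempt}", data)


if __name__ == "__main__":
    legacy, direct = ToyObjectStore(), ToyObjectStore()
    for part in range(4):
        write_with_rename(legacy, "out", part, 0, b"payload")
        write_direct(direct, "out", part, 0, b"payload")
    print("rename-based:", dict(legacy.ops))
    print("direct:      ", dict(direct.ops))
```

In this toy model each part costs three store operations under the rename
paradigm and one under direct creation. The ratio is a property of the
simplified model only and should not be read as the paper's result.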