Institute of Electrical and Electronics Engineers (IEEE)
Abstract
Hadoop provides a scalable solution on traditional
cluster-based Big Data platforms but imposes performance
overheads because it supports only on-disk data. Data analytics
algorithms usually require multiple iterations over a dataset
and thus multiple slow disk accesses. In contrast, modern
clusters possess increasing amounts of main memory that can
provide performance benefits when main memory caching
mechanisms are used efficiently.
Apache Spark is an innovative distributed computing framework
that supports in-memory computations. Even though this
type of computation is very fast, memory is a scarce resource,
and its scarcity can cause execution bottlenecks or, even worse,
lead to failures. Spark offers various choices for memory tuning,
but making them requires in-depth systems-level knowledge, and the
best choices differ across workloads and cluster settings.
In practice, the optimal choice is usually found by trial
and error.
This work describes a first step towards an automated
selection mechanism for memory optimization that assesses
workload and cluster characteristics and selects an appropriate
caching scheme. The proposed caching mechanism decreases
execution times by up to 25% compared to the default strategy
and reduces the risk of main memory exceptions.
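
To make the idea of an automated selection concrete, the following is a minimal sketch of a rule-based chooser, not the mechanism proposed in this work. The `Workload` fields, the `serialized_ratio` default, the thresholds, and the function name `choose_storage_level` are all hypothetical and chosen for illustration only. The returned strings mirror the storage level names in Spark's Scala/Java `StorageLevel` API.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # All fields are illustrative assumptions, not quantities defined in the paper.
    dataset_bytes: int             # estimated deserialized size of the data to cache
    iterations: int                # number of passes the algorithm makes over the data
    serialized_ratio: float = 0.5  # assumed serialized size relative to deserialized size


def choose_storage_level(workload: Workload, storage_memory_bytes: int) -> str:
    """Return a Spark storage level name, or "NONE" to skip caching (hypothetical rule)."""
    # A single pass gains nothing from caching.
    if workload.iterations <= 1:
        return "NONE"
    # The deserialized data fits in memory: fastest option.
    if workload.dataset_bytes <= storage_memory_bytes:
        return "MEMORY_ONLY"
    # Only the serialized form fits: trade CPU time for memory.
    if workload.dataset_bytes * workload.serialized_ratio <= storage_memory_bytes:
        return "MEMORY_ONLY_SER"
    # Nothing fits entirely: let partitions spill to disk instead of failing.
    return "MEMORY_AND_DISK_SER"


GIB = 1024 ** 3
small = choose_storage_level(Workload(8 * GIB, iterations=10), 16 * GIB)
medium = choose_storage_level(Workload(24 * GIB, iterations=10), 16 * GIB)
large = choose_storage_level(Workload(64 * GIB, iterations=10), 16 * GIB)
single_pass = choose_storage_level(Workload(8 * GIB, iterations=1), 16 * GIB)
```

Under these assumed thresholds, an iterative workload whose data fits in storage memory is cached deserialized, a larger one is cached in serialized form, and one that exceeds memory even when serialized falls back to a level that spills to disk, which is one way to lower the risk of memory exceptions.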