Towards Automatic Memory Tuning for In-Memory Big Data Analytics in Clusters

Abstract

Hadoop provides a scalable solution on traditional cluster-based Big Data platforms but imposes performance overheads because it supports only on-disk data. Data analytics algorithms usually require multiple iterations over a dataset and thus multiple slow disk accesses. In contrast, modern clusters have increasing amounts of main memory, which can yield performance benefits through efficient in-memory caching mechanisms. Apache Spark is an innovative distributed computing framework that supports in-memory computations. Although such computations are very fast, memory is a scarce resource, and its shortage can create execution bottlenecks or, worse, lead to failures. Spark offers several options for memory tuning, but choosing among them requires in-depth systems-level knowledge, and the right choice differs across workloads and cluster settings. In practice, the optimal configuration is usually found by trial and error. This work describes a first step towards an automated selection mechanism for memory optimization that assesses workload and cluster characteristics and selects an appropriate caching scheme. The proposed caching mechanism decreases execution times by up to 25% compared to the default strategy and reduces the risk of main memory exceptions.
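To make the idea of selecting a caching scheme concrete, the sketch below shows a minimal, hypothetical heuristic in Spark (Scala) that picks a persistence level from an estimated dataset size and the memory available for storage. The object and method names, thresholds, and byte values are illustrative assumptions, not the paper's actual mechanism.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingHeuristicSketch {
  // Hypothetical heuristic: cache fully deserialized in memory only when the
  // dataset fits comfortably in storage memory; otherwise store it serialized,
  // and finally allow spilling to disk when memory is clearly insufficient.
  def chooseStorageLevel(datasetBytes: Long, storageMemoryBytes: Long): StorageLevel =
    if (datasetBytes < (0.5 * storageMemoryBytes).toLong) StorageLevel.MEMORY_ONLY
    else if (datasetBytes < storageMemoryBytes) StorageLevel.MEMORY_ONLY_SER
    else StorageLevel.MEMORY_AND_DISK_SER

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("caching-heuristic-demo").getOrCreate()
    val data = spark.range(0L, 10000000L).toDF("id")

    // Illustrative values only; a real mechanism would measure these at runtime
    // from workload and cluster characteristics.
    val estimatedDatasetBytes = 80L * 1024 * 1024   // ~80 MB
    val availableStorageBytes = 512L * 1024 * 1024  // ~512 MB per executor

    val level = chooseStorageLevel(estimatedDatasetBytes, availableStorageBytes)
    data.persist(level)

    // Iterative workloads then reuse the cached dataset across actions.
    println(s"Chosen storage level: $level, rows: ${data.count()}")
    spark.stop()
  }
}
```

A heuristic of this kind trades raw access speed for robustness: serialized or disk-backed levels are slower to read but reduce heap pressure and the risk of out-of-memory failures on constrained executors.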
