Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?

Abstract

Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial consideration. In this paper, we analyze the challenge of efficient resource allocation for distributed data processing, focusing on memory. We emphasize that in-memory processing with these frameworks can undermine resource efficiency. Based on the findings of our trace data analysis, we compile requirements for an automated solution for efficient cluster resource allocation.

Comment: 4 pages, 3 Figures; ACM SSDBM 202
