Distributed dataflow systems such as Apache Spark or Apache Flink enable
parallel, in-memory data processing on large clusters of commodity hardware.
Consequently, the appropriate amount of memory to allocate to the cluster is a
crucial consideration.
In this paper, we analyze the challenge of efficient resource allocation for
distributed data processing, focusing on memory. We emphasize that the
in-memory processing model of these frameworks can undermine resource
efficiency. Based on the findings of our trace data analysis, we compile
requirements towards an automated solution for efficient cluster resource
allocation.

Comment: 4 pages, 3 figures; ACM SSDBM 202