Thrill: High-performance algorithmic distributed batch data processing with C++

Axtmann, Michael; Bingmann, Timo; Jobstl, Emanuel; Lamm, Sebastian; Nguyen, Huyen Chau; Noe, Alexander; Sanders, Peter; Schlag, Sebastian; Stumpp, Matthias; Sturm, Tobias

Publication date: 1 January 2016
Publisher: Institute of Electrical and Electronics Engineers
Abstract

We present the design and a first performance evaluation of Thrill, a prototype of a general-purpose big data processing framework with a convenient data-flow style programming interface. Thrill is somewhat similar to Apache Spark and Apache Flink, with at least two main differences. First, Thrill is based on C++, which enables performance advantages due to direct native code compilation, a more cache-friendly memory layout, and explicit memory management. In particular, Thrill uses template meta-programming to compile chains of subsequent local operations into a single binary routine without intermediate buffering and with minimal indirections. Second, Thrill uses arrays rather than multisets as its primary data structure, which enables additional operations like sorting, prefix sums, window scans, or combining corresponding fields of several arrays (zipping). We compare Thrill with Apache Spark and Apache Flink using five kernels from the HiBench suite. Thrill is consistently faster and often several times faster than the other frameworks. At the same time, the source code has a similar level of simplicity and abstraction.
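The two ideas in the abstract can be illustrated with a small, single-machine C++ sketch. It is not Thrill's API or implementation: `make_map`, `make_filter`, `fuse`, `fused_pipeline`, `prefix_sum`, and `zip_with` are hypothetical names chosen for this example, and plain generic lambdas stand in for whatever template machinery Thrill actually uses. The first part shows how a chain of local operations can be composed at compile time so that each item flows through all stages without an intermediate buffer; the second part shows two operations that only make sense on ordered arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical illustration only: none of these names come from Thrill.

// A "stage" is a callable taking (item, emit). It pushes zero or more
// results into emit, so no intermediate container is needed between stages.
template <typename F>
auto make_map(F f) {
    return [f](auto x, auto&& emit) { emit(f(x)); };
}

template <typename P>
auto make_filter(P p) {
    return [p](auto x, auto&& emit) { if (p(x)) emit(x); };
}

// Compose two stages into one callable. The types are resolved at compile
// time, so the compiler can inline the whole chain into a single loop body.
template <typename S1, typename S2>
auto fuse(S1 s1, S2 s2) {
    return [s1, s2](auto x, auto&& emit) {
        s1(x, [&](auto y) { s2(y, emit); });
    };
}

// Example chain: square each value, keep odd squares, then add one.
std::vector<long> fused_pipeline(const std::vector<long>& input) {
    auto chain = fuse(fuse(make_map([](long x) { return x * x; }),
                           make_filter([](long x) { return x % 2 == 1; })),
                      make_map([](long x) { return x + 1; }));
    std::vector<long> out;
    for (long x : input) chain(x, [&out](long y) { out.push_back(y); });
    return out;
}

// Inclusive prefix sum: depends on element order, so it needs an array
// rather than a multiset.
std::vector<long> prefix_sum(const std::vector<long>& a) {
    std::vector<long> out(a.size());
    std::partial_sum(a.begin(), a.end(), out.begin());
    return out;
}

// Zip: combine elements at the same index of two equally long arrays.
template <typename F>
std::vector<long> zip_with(const std::vector<long>& a,
                           const std::vector<long>& b, F f) {
    assert(a.size() == b.size());
    std::vector<long> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = f(a[i], b[i]);
    return out;
}
```

Calling `fused_pipeline` on the values 1 through 5 yields 2, 10, 26, with the map, filter, and map steps executed per item inside one loop. A distributed framework would additionally need to partition the arrays across workers and exchange boundary information for operations such as prefix sums, which this sketch leaves out.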
