    Benchmarking Distributed Stream Data Processing Systems

    The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to compare the systems for simple workloads, there is a clear lack of detailed analyses of the systems' performance characteristics. In this paper, we propose a framework for benchmarking distributed stream processing engines. We use our suite to evaluate the performance of three widely used SDPSs in detail, namely Apache Storm, Apache Spark, and Apache Flink. Our evaluation focuses in particular on measuring the throughput and latency of windowed operations, which are the basic type of operations in stream analytics. For this benchmark, we design workloads based on real-life, industrial use-cases inspired by the online gaming industry. The contribution of our work is threefold. First, we give a definition of latency and throughput for stateful operators. Second, we carefully separate the system under test and the driver in order to correctly represent the open world model of typical stream processing deployments, and can therefore measure system performance under realistic conditions. Third, we build the first benchmarking framework to define and test the sustainable performance of streaming systems. Our detailed evaluation highlights the individual characteristics and use-cases of each system. Comment: Published at ICDE 201
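The abstract does not spell out its definitions, so the following is only a minimal sketch of how latency for a windowed (stateful) operator and a sustainability check might be expressed. It assumes that the latency of a windowed output is measured from the latest event time among its contributing events, and that a rate counts as sustainable when the driver-side backlog does not keep growing. The names `Event`, `windowed_event_time_latency`, `is_sustainable`, and the `max_growth` threshold are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Event:
    key: str
    value: float
    event_time: float  # seconds; assumed to be stamped by the driver, not the system under test


def windowed_event_time_latency(window_events, emit_time):
    """Latency of one windowed output (assumed definition): emission time
    minus the latest event time among the events that fed the window."""
    if not window_events:
        raise ValueError("a window output needs at least one contributing event")
    return emit_time - max(e.event_time for e in window_events)


def is_sustainable(backlog_samples, max_growth=0.1):
    """Hypothetical heuristic: compare the mean driver-side backlog in the
    second half of a run with the first half. If it grew by more than
    max_growth (relative), the system is falling behind the input rate."""
    if len(backlog_samples) < 2:
        raise ValueError("need at least two backlog samples")
    half = len(backlog_samples) // 2
    first = sum(backlog_samples[:half]) / half
    second = sum(backlog_samples[half:]) / (len(backlog_samples) - half)
    return (second - first) <= max_growth * max(first, 1.0)


if __name__ == "__main__":
    window = [Event("a", 1.0, 1.0), Event("a", 2.0, 2.5), Event("a", 3.0, 2.0)]
    print(windowed_event_time_latency(window, emit_time=4.0))
    print(is_sustainable([10, 12, 11, 11, 10, 12]))
    print(is_sustainable([10, 50, 100, 200, 400, 800]))
```

Stamping events in the driver rather than inside the system under test keeps queueing delay in front of the system visible in the measured latency, which is the point of separating the two.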

    E2Clab: Exploring the Computing Continuum through Repeatable, Replicable and Reproducible Edge-to-Cloud Experiments

    Distributed digital infrastructures for computation and analytics are now evolving towards an interconnected ecosystem allowing complex applications to be executed from IoT Edge devices to the HPC Cloud (aka the Computing Continuum, the Digital Continuum, or the Transcontinuum). Understanding end-to-end performance in such a complex continuum is challenging. This breaks down to reconciling many, typically contradicting application requirements and constraints with low-level infrastructure design choices. One important challenge is to accurately reproduce relevant behaviors of a given application workflow and representative settings of the physical infrastructure underlying this complex continuum. In this paper we introduce a rigorous methodology for such a process and validate it through E2Clab. It is the first platform to support the complete analysis cycle of an application on the Computing Continuum: (i) the configuration of the experimental environment, libraries and frameworks; (ii) the mapping between the application parts and machines on the Edge, Fog and Cloud; (iii) the deployment of the application on the infrastructure; (iv) the automated execution; and (v) the gathering of experiment metrics. We illustrate its usage with a real-life application deployed on the Grid'5000 testbed, showing that our framework allows one to understand and improve performance, by correlating it to the parameter settings, the resource usage and the specifics of the underlying infrastructure.

    Spontananfragen auf Datenströmen

    Many modern applications require processing large amounts of data in a real-time fashion. As a result, distributed stream processing engines (SPEs) have gained significant attention as an important new class of big data processing systems. The central design principle of these SPEs is to handle queries that potentially run forever on data streams with a query-at-a-time model, i.e., each query is optimized and executed separately. However, in many real applications, not only long-running queries but also many short-running queries are processed on data streams. In these applications, multiple stream queries are created and deleted concurrently, in an ad-hoc manner. The best practice to handle ad-hoc stream queries is to fork the input stream and add additional resources for each query. However, this approach leads to redundant computation and data copying. This thesis lays the foundation for efficient ad-hoc stream query processing. To bridge the gap between stream data processing and ad-hoc query processing, we follow a top-down approach. First, we propose a benchmarking framework to analyze state-of-the-art SPEs. We provide a definition of latency and throughput for stateful operators. Moreover, we carefully separate the system under test and the driver, to correctly represent the open-world model of typical stream processing deployments. This separation enables us to measure the system performance under realistic conditions. Our solution is the first benchmarking framework to define and test the sustainable performance of SPEs. Throughout our analysis, we realize that the state-of-the-art SPEs are unable to execute stream queries in an ad-hoc manner. Second, we propose the first ad-hoc stream query processing engine for distributed data processing environments.
We develop our solution based on three main requirements: (1) Integration: Ad-hoc query processing should be a composable layer that can extend stream operators, such as join, aggregation, and window operators; (2) Consistency: Ad-hoc query creation and deletion must be performed consistently and ensure exactly-once semantics and correctness; (3) Performance: In contrast to modern SPEs, ad-hoc SPEs should not only maximize data throughput but also query throughput via incremental computation and resource sharing. Third, we propose an ad-hoc stream join processing framework that integrates dynamic query processing and query re-optimization techniques with ad-hoc stream query processing. Our solution comprises an optimization layer and a stream data processing layer. The optimization layer periodically re-optimizes the query execution plan, performing join reordering and vertical and horizontal scaling at runtime without stopping the execution. The data processing layer enables incremental and consistent query processing, supporting all the actions triggered by the optimizer. The result of the second and the third contributions forms a complete ad-hoc SPE. We utilize the first contribution not only for benchmarking modern SPEs but also for evaluating the ad-hoc SPE.

    A cloud-based Analytics-Platform for user-centric Internet of Things domains – Prototype and Performance Evaluation

    Data analytics have the potential to drastically increase the value of data emitted from smart devices in user-centric Internet of Things environments, such as smart homes. In order to allow businesses and end-consumers alike to tap into this potential, appropriate analytics architectures must be present. Current solutions in this field do not tackle all of the diverse challenges and requirements that were identified in previous research. Specifically, personalized, extensible analytics solutions that still offer the means to address big data problems are scarce. In this paper, we therefore present an architectural solution that was specifically designed to address the named challenges. Furthermore, we offer insights into the prototypical implementation of the proposed concept as well as an evaluation of its performance against traditional big data architectures.

    Scalability Benchmarking of Cloud-Native Applications Applied to Event-Driven Microservices

    Cloud-native applications constitute a recent trend for designing large-scale software systems. This thesis introduces the Theodolite benchmarking method, allowing researchers and practitioners to conduct empirical scalability evaluations of cloud-native applications, their frameworks, configurations, and deployments. The benchmarking method is applied to event-driven microservices, a specific type of cloud-native application that employs distributed stream processing frameworks to scale with massive data volumes. Extensive experimental evaluations benchmark and compare the scalability of various stream processing frameworks under different configurations and deployments, including different public and private cloud environments. These experiments show that the presented benchmarking method provides statistically sound results in an adequate amount of time. In addition, three case studies demonstrate that the Theodolite benchmarking method can be applied to a wide range of applications beyond stream processing.

    Performance and Reliability Evaluation of Apache Kafka Messaging System

    Streaming data is now flowing across various devices and applications around us. This type of data means any unbounded, ever-growing, infinite data set which is continuously generated by all kinds of sources. Examples include sensor data transmitted among different Internet of Things (IoT) devices, user activity records collected on websites and payment requests sent from mobile devices. In many application scenarios, streaming data needs to be processed in real time because its value can diminish over time. A variety of stream processing systems have been developed in the last decade and are evolving to address rising challenges. A typical stream processing system consists of multiple processing nodes in the topology of a DAG (directed acyclic graph). To build real-time streaming data pipelines across those nodes, message middleware technology is widely applied. As a distributed messaging system with high durability and scalability, Apache Kafka has become very popular among modern companies. It ingests streaming data from upstream applications and stores the data in its distributed cluster, which provides a fault-tolerant data source for stream processors. Therefore, Kafka plays a critical role in ensuring the completeness, correctness and timeliness of streaming data delivery. However, it is impossible to meet all the user requirements in real-time cases with a simple and fixed data delivery strategy. In this thesis, we address the challenge of choosing a proper configuration to guarantee both performance and reliability of Kafka for complex streaming application scenarios. We investigate the features that have an impact on the performance and reliability metrics. We propose a queueing-based prediction model to predict the performance metrics, including producer throughput and packet latency of Kafka. We define two reliability metrics, the probability of message loss and the probability of message duplication.
We create an ANN model to predict these metrics given unstable network metrics like network delay and packet loss rate. To collect sufficient training data, we build a Docker-based Kafka testbed with a fault injection module. We use a new quality-of-service metric, timely throughput, to help us choose a proper batch size in Kafka. Based on this metric, we propose a dynamic configuration method, which reactively guarantees both performance and reliability of Kafka under complex operating conditions.
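The abstract names timely throughput without defining it. As an illustration only, the sketch below assumes it means the number of messages delivered within a deadline per unit of time, and shows how such a metric could rank candidate batch sizes from offline measurements. The functions `timely_throughput` and `pick_batch_size` and the sample numbers are hypothetical; nothing here calls Kafka or reflects the thesis's actual method.

```python
def timely_throughput(latencies_s, deadline_s, duration_s):
    """Messages delivered within the deadline, per second of measurement.

    Assumed definition: late messages do not count towards throughput.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    on_time = sum(1 for latency in latencies_s if latency <= deadline_s)
    return on_time / duration_s


def pick_batch_size(measurements, deadline_s, duration_s):
    """Return the batch size whose measured run has the highest timely throughput.

    measurements maps a candidate batch size to the per-message delivery
    latencies (in seconds) observed while running with that batch size.
    """
    return max(
        measurements,
        key=lambda batch: timely_throughput(measurements[batch], deadline_s, duration_s),
    )


if __name__ == "__main__":
    # Made-up latencies for three candidate batch sizes, each measured for 1 s.
    runs = {
        16: [0.01, 0.02, 0.03],
        64: [0.02, 0.04, 0.05, 0.06, 0.30],
        256: [0.50, 0.60, 0.70, 0.80, 0.90, 1.00],
    }
    print(pick_batch_size(runs, deadline_s=0.1, duration_s=1.0))
```

In this made-up data the largest batch delivers the most messages overall but none of them on time, which is the trade-off a deadline-aware metric is meant to expose.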

    A Comparative Study of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

    Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level, requiring many implementation steps even for simple analysis tasks. This has led to the development of advanced dataflow-oriented platforms, most prominently Apache Spark and Apache Flink. Those not only aim to improve performance, but also provide high-level data processing functionality, such as filtering and join operators, which should make data analysis tasks easier to develop. But without comparison data available, how would data scientists know which system they should choose? This research compares Apache Hadoop MapReduce, Apache Spark, and Apache Flink from the perspectives of performance, usability and practicality for batch-oriented data analytics. We propose and apply a methodology which guides the conception of multidimensional software comparisons and the presentation of their results. The methodology was effective, providing direction and structure to the comparison, and should prove helpful for future comparisons. The results confirm that Spark and Flink are superior to Hadoop MapReduce in performance and usability. Spark and Flink were similar in all three considerations; however, as per the methodology, readers have the flexibility to adjust weightings to their needs, which could differentiate them. We also report on the design, execution and results of a large-scale usability study with a cohort of master's students, who learn and work with all three platforms, solving different use cases in data science contexts. Our findings show that Spark and Flink are preferred platforms over MapReduce. Among participants, there was no significant difference in perceived preference or development time between Spark and Flink.
These results were included in the usability component of the multidimensional comparison.

    Theodolite: Scalability Benchmarking of Distributed Stream Processing Engines in Microservice Architectures

    Distributed stream processing engines are designed with a focus on scalability to process big data volumes in a continuous manner. We present the Theodolite method for benchmarking the scalability of distributed stream processing engines. The core of this method is the definition of use cases that microservices implementing stream processing have to fulfill. For each use case, our method identifies relevant workload dimensions that might affect the scalability of a use case. We propose to design one benchmark per use case and relevant workload dimension. We present a general benchmarking framework, which can be applied to execute the individual benchmarks for a given use case and workload dimension. Our framework executes an implementation of the use case's dataflow architecture for different workloads of the given dimension and various numbers of processing instances. This way, it identifies how resource demand evolves with increasing workloads. Within the scope of this paper, we present 4 identified use cases, derived from processing Industrial Internet of Things data, and 7 corresponding workload dimensions. We provide implementations of 4 benchmarks with Kafka Streams and Apache Flink as well as an implementation of our benchmarking framework to execute scalability benchmarks in cloud environments. We use both for evaluating the Theodolite method and for benchmarking Kafka Streams' and Flink's scalability for different deployment options.
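The idea of running a use case for different workloads and numbers of processing instances, and then reading off how resource demand evolves, can be sketched as follows. This is only an illustration: it assumes each experiment yields a boolean verdict on whether the given instance count handled the given load, which the abstract does not specify. The function `resource_demand` and the sample results are hypothetical.

```python
def resource_demand(results):
    """Map each workload to the smallest instance count that handled it.

    results maps (load, instances) to True if that run kept up with the
    load (assumed pass/fail criterion), else False. Loads that no tested
    instance count could handle map to None.
    """
    demand = {}
    for (load, instances), handled in results.items():
        demand.setdefault(load, None)
        if handled and (demand[load] is None or instances < demand[load]):
            demand[load] = instances
    return dict(sorted(demand.items()))


if __name__ == "__main__":
    # Made-up outcomes for three loads (messages/s) and two instance counts.
    experiments = {
        (1000, 1): True,
        (1000, 2): True,
        (2000, 1): False,
        (2000, 2): True,
        (4000, 1): False,
        (4000, 2): False,
    }
    for load, instances in resource_demand(experiments).items():
        print(load, instances)
```

Plotting the returned mapping with load on the x-axis gives the kind of demand curve the abstract describes; a `None` entry signals that the tested range of instances was too small for that load.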
