    Spark versus Flink: Understanding Performance in Big Data Analytics Frameworks

    Big Data analytics has recently gained increasing popularity as a tool to process large amounts of data on demand. Spark and Flink are two Apache-hosted data analytics frameworks that facilitate the development of multi-step data pipelines using directed acyclic graph patterns. Making the most of these frameworks is challenging because efficient execution strongly relies on complex parameter configurations and on an in-depth understanding of the underlying architectural choices. Although extensive research has been devoted to improving and evaluating the performance of such analytics frameworks, most studies benchmark the platforms against Hadoop as a baseline, a rather unfair comparison considering their fundamentally different design principles. This paper aims to redress the balance by directly evaluating the performance of Spark and Flink. Our goal is to identify and explain the impact of the different architectural choices and parameter configurations on the perceived end-to-end performance. To this end, we develop a methodology for correlating the parameter settings and the operator execution plan with resource usage. We use this methodology to dissect the performance of Spark and Flink with several representative batch and iterative workloads on up to 100 nodes. Our key finding is that neither framework outperforms the other for all data types, sizes, and job patterns. This paper gives a fine-grained characterization of the cases in which each framework is superior, and we highlight how this performance correlates to operators, to resource usage, and to the specifics of the internal framework design.
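The correlation methodology described above can be illustrated with a toy computation: for each run, record a parameter value and an observed resource metric, then measure how strongly the knob drives the metric. The data and function names below are illustrative assumptions, not the paper's actual methodology or measurements.

```python
# Toy sketch: correlate a framework configuration knob with a resource metric.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Hypothetical runs: parallelism setting vs. measured CPU utilisation per run.
parallelism = [4, 8, 16, 32]
cpu_util = [0.31, 0.55, 0.78, 0.92]
r = pearson(parallelism, cpu_util)  # strong positive correlation
```

A real study would repeat this per operator and per parameter, over many runs, to attribute end-to-end performance differences to specific configuration choices.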

    Virtual Log-Structured Storage for High-Performance Streaming

    Over the past decade, given the higher number of data sources (e.g., Cloud applications, Internet of Things) and critical business demands, Big Data transitioned from batch-oriented to real-time analytics. Stream storage systems, such as Apache Kafka, are well known for their increasing role in real-time Big Data analytics. For scalable stream data ingestion and processing, they logically split a data stream topic into multiple partitions. Stream storage systems keep multiple data stream copies to protect against data loss while implementing a stream partition as a replicated log. This architectural choice enables simplified development while trading cluster size against performance and the number of streams optimally managed. This paper introduces a shared virtual log-structured storage approach for improving the cluster throughput when multiple producers and consumers write and consume data streams in parallel. Stream partitions are associated with shared replicated virtual logs transparently to the user, effectively separating the implementation of stream partitioning (and data ordering) from data replication (and durability). We implement the virtual log technique in the KerA stream storage system. When compared with Apache Kafka, KerA improves the cluster ingestion throughput (for replication factor three) by up to 4x when multiple producers write over hundreds of data streams.
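The core idea above, multiplexing many stream partitions onto a smaller pool of shared replicated logs so that per-partition ordering is decoupled from replication, can be sketched as follows. All class and method names here are illustrative assumptions, not KerA's actual API.

```python
# Sketch of the shared virtual-log idea: many partitions, few replicated logs.
class VirtualLog:
    """One replicated append-only log shared by several partitions."""
    def __init__(self, log_id, replication_factor=3):
        self.log_id = log_id
        self.replication_factor = replication_factor
        self.entries = []  # (partition, logical offset, record)

    def append(self, partition, offset, record):
        self.entries.append((partition, offset, record))


class VirtualLogRouter:
    """Maps stream partitions onto a fixed pool of shared virtual logs."""
    def __init__(self, num_logs):
        self.logs = [VirtualLog(i) for i in range(num_logs)]
        self.offsets = {}  # per-partition logical offset (ordering)

    def write(self, partition, record):
        # Ordering is tracked per partition; durability is handled per log.
        offset = self.offsets.get(partition, 0)
        self.offsets[partition] = offset + 1
        log = self.logs[hash(partition) % len(self.logs)]
        log.append(partition, offset, record)
        return offset
```

The point of the separation is that hundreds of partitions can share a handful of replicated logs, so replication cost no longer scales with the number of streams.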

    Towards Unified Data Ingestion and Transfer for the Computing Continuum

    The computing continuum can enable novel big data use cases across the edge-cloud-supercomputer spectrum. Fast and high-volume data movement workflows rely on state-of-the-art architectures built on top of stream ingestion and file transfer open-source tools. Unfortunately, users struggle when dealing with such diverse architectures: stream ingestion was designed for small-size datasets and low latency, while file transfer was designed for large-size datasets and high throughput. In this paper, we propose to unify ingestion and transfer, while introducing architectural design principles and discussing future implementation challenges.

    Storage and Ingestion Systems in Support of Stream Processing: A Survey

    Under the pressure of massive, exponentially increasing amounts of heterogeneous data that are generated faster and faster, Big Data analytics applications have seen a shift from batch processing to stream processing, which can dramatically reduce the time needed to obtain meaningful insight. Stream processing is particularly well suited to address the challenges of fog/edge computing: much of this massive data comes from Internet of Things (IoT) devices and needs to be continuously funneled through an edge infrastructure towards centralized clouds. Thus, it is only natural to process data on their way as much as possible rather than wait for streams to accumulate on the cloud. Unfortunately, state-of-the-art stream processing systems are not well suited for this role: the data are accumulated (ingested), processed and persisted (stored) separately, often using different services hosted on different physical machines/clusters. Furthermore, there is only limited support for advanced data manipulations, which often forces application developers to introduce custom solutions and workarounds. In this survey article, we characterize the main state-of-the-art stream storage and ingestion systems. We identify the key aspects and discuss limitations and missing features in the context of stream processing for fog/edge and cloud computing. The goal is to help practitioners understand and prepare for potential bottlenecks when using such state-of-the-art systems. In particular, we discuss both functional (partitioning, metadata, search support, message routing, backpressure support) and non-functional aspects (high availability, durability, scalability, latency vs. throughput). As a conclusion of our study, we advocate for a unified stream storage and ingestion system to speed up data management and reduce I/O redundancy (both in terms of storage space and network utilization).
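One of the functional aspects the survey lists, backpressure, can be illustrated with a bounded buffer between a producer and a slower consumer: when downstream cannot keep up, the producer blocks instead of overrunning the ingestion layer. This is a generic sketch of the concept, not any surveyed system's mechanism.

```python
# Minimal backpressure sketch: a bounded queue throttles a fast producer.
import queue
import threading

buf = queue.Queue(maxsize=2)  # bounded buffer between producer and consumer

def producer(items):
    for it in items:
        buf.put(it)  # blocks when the buffer is full: backpressure in action

def consumer(n, out):
    for _ in range(n):
        out.append(buf.get())
        buf.task_done()

received = []
t = threading.Thread(target=consumer, args=(5, received))
t.start()
producer(range(5))  # cannot race ahead of the consumer by more than 2 items
t.join()
```

Real systems implement the same idea with credit-based flow control or reactive pull, but the invariant is identical: bounded in-flight data between stages.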

    KerA : A Unified Ingestion and Storage System for Scalable Big Data Processing

    Big Data is now the new natural resource. Current state-of-the-art Big Data analytics architectures are built on top of a three-layer stack: data streams are first acquired by the ingestion layer (e.g., Kafka) and then flow through the processing layer (e.g., Flink), which relies on the storage layer (e.g., HDFS) for storing aggregated data or for archiving streams for later processing. Unfortunately, in spite of the potential benefits brought by specialized layers (e.g., simplified implementation), moving large quantities of data through them is not efficient: instead, data should be acquired, processed and stored while minimizing the number of copies. This dissertation argues that a plausible path to alleviate these limitations is the careful design and implementation of a unified architecture for stream ingestion and storage, which can lead to the optimization of the processing of Big Data applications. This approach minimizes data movement within the analytics architecture, finally leading to better utilized resources. We identify a set of requirements for a dedicated stream ingestion/storage engine. We explain the impact of the different Big Data architectural choices on end-to-end performance. We propose a set of design principles for a scalable, unified architecture for data ingestion and storage. We implement and evaluate the KerA prototype with the goal of efficiently handling diverse access patterns: low-latency access to streams and/or high-throughput access to streams and/or objects.

    KerA: A Unified Ingestion and Storage System for Efficient Big Data Processing

    Big Data is now the new natural resource. Current state-of-the-art Big Data analytics architectures are built on top of a three-layer stack: data streams are first acquired by the ingestion layer (e.g., Kafka) and then flow through the processing layer (e.g., Flink), which relies on the storage layer (e.g., HDFS) for storing aggregated data or for archiving streams for later processing. Unfortunately, in spite of the potential benefits brought by specialized layers (e.g., simplified implementation), moving large quantities of data through them is not efficient: instead, data should be acquired, processed and stored while minimizing the number of copies. This dissertation argues that a plausible path to alleviate these limitations is the careful design and implementation of a unified architecture for stream ingestion and storage, which can lead to the optimization of the processing of Big Data applications. This approach minimizes data movement within the analytics architecture, finally leading to better utilized resources. We identify a set of requirements for a dedicated stream ingestion/storage engine. We explain the impact of the different Big Data architectural choices on end-to-end performance. We propose a set of design principles for a scalable, unified architecture for data ingestion and storage. We implement and evaluate the KerA prototype with the goal of efficiently handling diverse access patterns: low-latency access to streams and/or high-throughput access to unbounded streams and/or objects.

    In support of push-based streaming for the computing continuum

    Real-time data architectures are core tools for implementing the edge-to-cloud computing continuum, since streams are a natural abstraction for representing and predicting the needs of such applications. Over the past decade, Big Data architectures evolved into specialized layers for handling real-time storage and stream processing. Open-source streaming architectures efficiently decouple fast storage and processing engines by implementing stream reads through a pull-based interface exposed by storage. However, how much data the stream source operators have to pull from storage continuously, and how often to issue pull-based requests, are configurations left to the application and can result in increased system resource usage and reduced overall application performance. To tackle these issues, this paper proposes a unified streaming architecture that integrates co-located fast storage and streaming engines through push-based source integrations, making the data available for processing as soon as storage has them. We empirically evaluate pull-based versus push-based design alternatives of the streaming source reader and discuss the advantages of both approaches.
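The pull-based versus push-based contrast described above can be sketched with an in-memory queue standing in for the storage engine. The class names and the batching parameter are illustrative assumptions, not the paper's actual interfaces.

```python
# Sketch: pull-based vs. push-based stream source readers.
import queue

class PullSource:
    """Source operator polls storage; batch size and poll frequency are
    application-tuned knobs (the configuration burden noted above)."""
    def __init__(self, storage, batch_size=4):
        self.storage = storage
        self.batch_size = batch_size

    def poll(self):
        batch = []
        while len(batch) < self.batch_size:
            try:
                batch.append(self.storage.get_nowait())
            except queue.Empty:
                break  # nothing more to pull right now
        return batch


class PushSource:
    """Co-located storage pushes each record to the operator on arrival,
    so no polling interval or batch size needs tuning."""
    def __init__(self, on_record):
        self.on_record = on_record

    def ingest(self, record):
        self.on_record(record)  # data available to processing immediately
```

The pull design wastes cycles on empty polls when the stream is slow, while the push design moves that scheduling decision into the co-located storage.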

    Kera: A Unified Storage and Ingestion Architecture for Efficient Stream Processing

    Big Data applications are rapidly moving from a batch-oriented execution to a real-time model in order to extract value from the streams of data just as fast as they arrive. Such stream-based applications need to immediately ingest and analyze data and, in many use cases, combine live (i.e., real-time streams) and archived data in order to extract better insights. Current streaming architectures are designed with distinct components for ingestion (e.g., Kafka) and storage (e.g., HDFS) of stream data. Unfortunately, this separation is becoming an overhead, especially when data needs to be archived for later analysis (i.e., near real-time): in such use cases, stream data has to be written twice to disk and may pass twice over high-latency networks. Moreover, current ingestion mechanisms offer no support for searching the acquired streams in real time, an important requirement to promptly react to fast data. In this paper we describe the design of Kera: a unified storage and ingestion architecture that could better serve the specific needs of stream processing. We identify a set of design principles for stream-based Big Data processing that guide us in designing a novel architecture for streaming. We design Kera to reduce storage and network utilization significantly, which can lead to reduced times for stream processing and archival. To this end, we propose a set of optimization techniques for handling streams with a log-structured (in-memory and on-disk) approach. On top of our envisioned architecture we devise the implementation of an efficient interface for data ingestion, processing, and storage (DIPS), an interplay between processing engines and smart storage systems, with the goal of reducing the end-to-end stream processing latency.
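The single-write path argued for above can be sketched as a log where a record is appended once to an in-memory segment that both serves live readers and is later flushed to the archive, instead of being written separately by an ingestion layer and a storage layer. The names and the segment threshold are illustrative assumptions, not Kera's actual design.

```python
# Sketch: one log-structured append path serving both live reads and archival.
import io

class UnifiedLog:
    def __init__(self, segment_bytes=64):
        self.segment_bytes = segment_bytes
        self.active = []             # in-memory segment serving live readers
        self.active_size = 0
        self.archive = io.BytesIO()  # stands in for the on-disk archive

    def append(self, record: bytes):
        # Single write into the active segment; no second ingestion copy.
        self.active.append(record)
        self.active_size += len(record)
        if self.active_size >= self.segment_bytes:
            self._flush()

    def _flush(self):
        # One sequential write archives the whole segment.
        for rec in self.active:
            self.archive.write(rec)
        self.active, self.active_size = [], 0

    def read_live(self):
        """Live consumers read the in-memory segment directly."""
        return list(self.active)
```

Compared with a split ingestion/storage stack, each record here crosses the network once and is written to disk once, which is exactly the redundancy the unified architecture targets.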