    Spark versus Flink: Understanding Performance in Big Data Analytics Frameworks

    Big Data analytics has recently gained increasing popularity as a tool to process large amounts of data on demand. Spark and Flink are two Apache-hosted data analytics frameworks that facilitate the development of multi-step data pipelines using directed acyclic graph patterns. Making the most of these frameworks is challenging because efficient execution strongly relies on complex parameter configurations and on an in-depth understanding of the underlying architectural choices. Although extensive research has been devoted to improving and evaluating the performance of such analytics frameworks, most studies benchmark the platforms against Hadoop as a baseline, a rather unfair comparison considering their fundamentally different design principles. This paper aims to redress the balance by directly evaluating the performance of Spark and Flink. Our goal is to identify and explain the impact of the different architectural choices and parameter configurations on the perceived end-to-end performance. To this end, we develop a methodology for correlating the parameter settings and the operator execution plan with resource usage. We use this methodology to dissect the performance of Spark and Flink with several representative batch and iterative workloads on up to 100 nodes. Our key finding is that neither framework outperforms the other for all data types, sizes, and job patterns. This paper gives a fine-grained characterization of the cases in which each framework is superior, and we highlight how this performance correlates to operators, to resource usage, and to the specifics of the internal framework design.
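The correlation methodology described above can be illustrated with a toy computation: for each run, record a parameter value and an observed resource metric, then measure how strongly the knob drives the metric. The data and function names below are illustrative assumptions, not the paper's actual methodology or measurements.

```python
# Toy sketch: correlate a framework configuration knob with a resource metric.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Hypothetical runs: parallelism setting vs. measured CPU utilisation per run.
parallelism = [4, 8, 16, 32]
cpu_util = [0.31, 0.55, 0.78, 0.92]
r = pearson(parallelism, cpu_util)  # strong positive correlation
```

A real study would repeat this per operator and per parameter, over many runs, to attribute end-to-end performance differences to specific configuration choices.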

    Virtual Log-Structured Storage for High-Performance Streaming

    Over the past decade, given the higher number of data sources (e.g., Cloud applications, Internet of Things) and critical business demands, Big Data transitioned from batch-oriented to real-time analytics. Stream storage systems, such as Apache Kafka, are well known for their increasing role in real-time Big Data analytics. For scalable stream data ingestion and processing, they logically split a data stream topic into multiple partitions. Stream storage systems keep multiple data stream copies to protect against data loss while implementing a stream partition as a replicated log. This architectural choice enables simplified development while trading cluster size against performance and the number of streams optimally managed. This paper introduces a shared virtual log-structured storage approach for improving the cluster throughput when multiple producers and consumers write and consume data streams in parallel. Stream partitions are associated with shared replicated virtual logs transparently to the user, effectively separating the implementation of stream partitioning (and data ordering) from data replication (and durability). We implement the virtual log technique in the KerA stream storage system. When compared with Apache Kafka, KerA improves the cluster ingestion throughput (for replication factor three) by up to 4x when multiple producers write over hundreds of data streams.
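The core idea above, multiplexing many stream partitions onto a smaller pool of shared replicated logs so that per-partition ordering is decoupled from replication, can be sketched as follows. All class and method names here are illustrative assumptions, not KerA's actual API.

```python
# Sketch of the shared virtual-log idea: many partitions, few replicated logs.
class VirtualLog:
    """One replicated append-only log shared by several partitions."""
    def __init__(self, log_id, replication_factor=3):
        self.log_id = log_id
        self.replication_factor = replication_factor
        self.entries = []  # (partition, logical offset, record)

    def append(self, partition, offset, record):
        self.entries.append((partition, offset, record))


class VirtualLogRouter:
    """Maps stream partitions onto a fixed pool of shared virtual logs."""
    def __init__(self, num_logs):
        self.logs = [VirtualLog(i) for i in range(num_logs)]
        self.offsets = {}  # per-partition logical offset (ordering)

    def write(self, partition, record):
        # Ordering is tracked per partition; durability is handled per log.
        offset = self.offsets.get(partition, 0)
        self.offsets[partition] = offset + 1
        log = self.logs[hash(partition) % len(self.logs)]
        log.append(partition, offset, record)
        return offset
```

The point of the separation is that hundreds of partitions can share a handful of replicated logs, so replication cost no longer scales with the number of streams.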

    Towards Unified Data Ingestion and Transfer for the Computing Continuum

    The computing continuum can enable novel big data use cases across the edge-cloud-supercomputer spectrum. Fast and high-volume data movement workflows rely on state-of-the-art architectures built on top of stream ingestion and file transfer open-source tools. Unfortunately, users struggle when dealing with such diverse architectures: stream ingestion was designed for small-size datasets and low latency, while file transfer was designed for large-size datasets and high throughput. In this paper, we propose to unify ingestion and transfer, while introducing architectural design principles and discussing future implementation challenges.

    Storage and Ingestion Systems in Support of Stream Processing: A Survey

    Under the pressure of massive, exponentially increasing amounts of heterogeneous data that are generated faster and faster, Big Data analytics applications have seen a shift from batch processing to stream processing, which can dramatically reduce the time needed to obtain meaningful insight. Stream processing is particularly well suited to address the challenges of fog/edge computing: much of this massive data comes from Internet of Things (IoT) devices and needs to be continuously funneled through an edge infrastructure towards centralized clouds. Thus, it is only natural to process data on their way as much as possible rather than wait for streams to accumulate on the cloud. Unfortunately, state-of-the-art stream processing systems are not well suited for this role: the data are accumulated (ingested), processed and persisted (stored) separately, often using different services hosted on different physical machines/clusters. Furthermore, there is only limited support for advanced data manipulations, which often forces application developers to introduce custom solutions and workarounds. In this survey article, we characterize the main state-of-the-art stream storage and ingestion systems. We identify the key aspects and discuss limitations and missing features in the context of stream processing for fog/edge and cloud computing. The goal is to help practitioners understand and prepare for potential bottlenecks when using such state-of-the-art systems. In particular, we discuss both functional (partitioning, metadata, search support, message routing, backpressure support) and non-functional aspects (high availability, durability, scalability, latency vs. throughput). As a conclusion of our study, we advocate for a unified stream storage and ingestion system to speed up data management and reduce I/O redundancy (both in terms of storage space and network utilization).
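One of the functional aspects the survey lists, backpressure, can be illustrated with a bounded buffer between a producer and a slower consumer: when downstream cannot keep up, the producer blocks instead of overrunning the ingestion layer. This is a generic sketch of the concept, not any surveyed system's mechanism.

```python
# Minimal backpressure sketch: a bounded queue throttles a fast producer.
import queue
import threading

buf = queue.Queue(maxsize=2)  # bounded buffer between producer and consumer

def producer(items):
    for it in items:
        buf.put(it)  # blocks when the buffer is full: backpressure in action

def consumer(n, out):
    for _ in range(n):
        out.append(buf.get())
        buf.task_done()

received = []
t = threading.Thread(target=consumer, args=(5, received))
t.start()
producer(range(5))  # cannot race ahead of the consumer by more than 2 items
t.join()
```

Real systems implement the same idea with credit-based flow control or reactive pull, but the invariant is identical: bounded in-flight data between stages.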

    KerA : A Unified Ingestion and Storage System for Scalable Big Data Processing

    Big Data is now the new natural resource. Current state-of-the-art Big Data analytics architectures are built on top of a three-layer stack: data streams are first acquired by the ingestion layer (e.g., Kafka) and then flow through the processing layer (e.g., Flink), which relies on the storage layer (e.g., HDFS) for storing aggregated data or for archiving streams for later processing. Unfortunately, in spite of the potential benefits brought by specialized layers (e.g., simplified implementation), moving large quantities of data through them is not efficient: instead, data should be acquired, processed and stored while minimizing the number of copies. This dissertation argues that a plausible path to alleviate these limitations is the careful design and implementation of a unified architecture for stream ingestion and storage, which can lead to the optimization of the processing of Big Data applications. This approach minimizes data movement within the analytics architecture, finally leading to better utilized resources. We identify a set of requirements for a dedicated stream ingestion/storage engine. We explain the impact of the different Big Data architectural choices on end-to-end performance. We propose a set of design principles for a scalable, unified architecture for data ingestion and storage. We implement and evaluate the KerA prototype with the goal of efficiently handling diverse access patterns: low-latency access to streams and/or high-throughput access to streams and/or objects.

    KerA: A Unified Ingestion and Storage System for Efficient Big Data Processing

    Big Data is now the new natural resource. Current state-of-the-art Big Data analytics architectures are built on top of a three-layer stack: data streams are first acquired by the ingestion layer (e.g., Kafka) and then flow through the processing layer (e.g., Flink), which relies on the storage layer (e.g., HDFS) for storing aggregated data or for archiving streams for later processing. Unfortunately, in spite of the potential benefits brought by specialized layers (e.g., simplified implementation), moving large quantities of data through them is not efficient: instead, data should be acquired, processed and stored while minimizing the number of copies. This dissertation argues that a plausible path to alleviate these limitations is the careful design and implementation of a unified architecture for stream ingestion and storage, which can lead to the optimization of the processing of Big Data applications. This approach minimizes data movement within the analytics architecture, finally leading to better utilized resources. We identify a set of requirements for a dedicated stream ingestion/storage engine. We explain the impact of the different Big Data architectural choices on end-to-end performance. We propose a set of design principles for a scalable, unified architecture for data ingestion and storage. We implement and evaluate the KerA prototype with the goal of efficiently handling diverse access patterns: low-latency access to streams and/or high-throughput access to unbounded streams and/or objects.

    In support of push-based streaming for the computing continuum

    Real-time data architectures are core tools for implementing the edge-to-cloud computing continuum, since streams are a natural abstraction for representing and predicting the needs of such applications. Over the past decade, Big Data architectures evolved into specialized layers for handling real-time storage and stream processing. Open-source streaming architectures efficiently decouple fast storage and processing engines by implementing stream reads through a pull-based interface exposed by storage. However, how much data the stream source operators have to pull from storage continuously, and how often to issue pull-based requests, are configurations left to the application and can result in increased system resource usage and reduced overall application performance. To tackle these issues, this paper proposes a unified streaming architecture that integrates co-located fast storage and streaming engines through push-based source integrations, making the data available for processing as soon as storage has them. We empirically evaluate pull-based versus push-based design alternatives of the streaming source reader and discuss the advantages of both approaches.
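The pull-based versus push-based contrast described above can be sketched with an in-memory queue standing in for the storage engine. The class names and the batching parameter are illustrative assumptions, not the paper's actual interfaces.

```python
# Sketch: pull-based vs. push-based stream source readers.
import queue

class PullSource:
    """Source operator polls storage; batch size and poll frequency are
    application-tuned knobs (the configuration burden noted above)."""
    def __init__(self, storage, batch_size=4):
        self.storage = storage
        self.batch_size = batch_size

    def poll(self):
        batch = []
        while len(batch) < self.batch_size:
            try:
                batch.append(self.storage.get_nowait())
            except queue.Empty:
                break  # nothing more to pull right now
        return batch


class PushSource:
    """Co-located storage pushes each record to the operator on arrival,
    so no polling interval or batch size needs tuning."""
    def __init__(self, on_record):
        self.on_record = on_record

    def ingest(self, record):
        self.on_record(record)  # data available to processing immediately
```

The pull design wastes cycles on empty polls when the stream is slow, while the push design moves that scheduling decision into the co-located storage.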

    Kera: A Unified Storage and Ingestion Architecture for Efficient Stream Processing

    Big Data applications are rapidly moving from a batch-oriented execution to a real-time model in order to extract value from the streams of data just as fast as they arrive. Such stream-based applications need to immediately ingest and analyze data and, in many use cases, combine live (i.e., real-time streams) and archived data in order to extract better insights. Current streaming architectures are designed with distinct components for ingestion (e.g., Kafka) and storage (e.g., HDFS) of stream data. Unfortunately, this separation is becoming an overhead, especially when data needs to be archived for later analysis (i.e., near real-time): in such use cases, stream data has to be written twice to disk and may pass twice over high-latency networks. Moreover, current ingestion mechanisms offer no support for searching the acquired streams in real time, an important requirement to promptly react to fast data. In this paper we describe the design of Kera: a unified storage and ingestion architecture that could better serve the specific needs of stream processing. We identify a set of design principles for stream-based Big Data processing that guide us in designing a novel architecture for streaming. We design Kera to reduce storage and network utilization significantly, which can lead to reduced times for stream processing and archival. To this end, we propose a set of optimization techniques for handling streams with a log-structured (in-memory and on-disk) approach. On top of our envisioned architecture we devise the implementation of an efficient interface for data ingestion, processing, and storage (DIPS), an interplay between processing engines and smart storage systems, with the goal of reducing the end-to-end stream processing latency.
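The single-write path argued for above can be sketched as a log where a record is appended once to an in-memory segment that both serves live readers and is later flushed to the archive, instead of being written separately by an ingestion layer and a storage layer. The names and the segment threshold are illustrative assumptions, not Kera's actual design.

```python
# Sketch: one log-structured append path serving both live reads and archival.
import io

class UnifiedLog:
    def __init__(self, segment_bytes=64):
        self.segment_bytes = segment_bytes
        self.active = []             # in-memory segment serving live readers
        self.active_size = 0
        self.archive = io.BytesIO()  # stands in for the on-disk archive

    def append(self, record: bytes):
        # Single write into the active segment; no second ingestion copy.
        self.active.append(record)
        self.active_size += len(record)
        if self.active_size >= self.segment_bytes:
            self._flush()

    def _flush(self):
        # One sequential write archives the whole segment.
        for rec in self.active:
            self.archive.write(rec)
        self.active, self.active_size = [], 0

    def read_live(self):
        """Live consumers read the in-memory segment directly."""
        return list(self.active)
```

Compared with a split ingestion/storage stack, each record here crosses the network once and is written to disk once, which is exactly the redundancy the unified architecture targets.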