
    A Survey on the Evolution of Stream Processing Systems

    Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between early ('00-'10) and modern ('11-'18) streaming systems, and discuss recent trends and open problems. (34 pages, 15 figures, 5 tables.)

    Chi: a scalable and programmable control plane for distributed stream processing systems

    Stream-processing workloads and modern shared cluster environments exhibit high variability and unpredictability. Combined with the large parameter space and the diverse set of user SLOs, this makes modern streaming systems very challenging to statically configure and tune. To address these issues, in this paper we investigate a novel control-plane design, Chi, which supports continuous monitoring and feedback, and enables dynamic reconfiguration. Chi leverages the key insight of embedding control-plane messages in the data-plane channels to achieve a low-latency and flexible control plane for stream-processing systems. Chi introduces a new reactive programming model and design mechanisms to asynchronously execute control policies, thus avoiding global synchronization. We show how this allows us to easily implement a wide spectrum of control policies targeting different use cases observed in production. Large-scale experiments using production workloads from a popular cloud provider demonstrate the flexibility and efficiency of our approach.
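    A minimal sketch of the key idea described above: control messages are embedded in the same channels as data, so each operator reconfigures itself asynchronously when the message reaches it, with no global barrier. All class and field names here are hypothetical, not Chi's actual API.

```python
# Hypothetical illustration of control messages flowing through data channels.
from dataclasses import dataclass
from queue import Queue

@dataclass
class DataMsg:
    payload: int

@dataclass
class ControlMsg:
    new_config: dict

class Operator:
    def __init__(self, name, downstream=None):
        self.name = name
        self.config = {"batch_size": 1}
        self.inbox = Queue()
        self.downstream = downstream

    def step(self):
        msg = self.inbox.get()
        if isinstance(msg, ControlMsg):
            # Apply the new configuration locally, then forward the message:
            # each operator reacts when the marker arrives, so no global
            # synchronization of the dataflow is needed.
            self.config.update(msg.new_config)
            print(f"{self.name} reconfigured to {self.config}")
        if self.downstream is not None:
            self.downstream.inbox.put(msg)

sink = Operator("sink")
source = Operator("source", downstream=sink)

source.inbox.put(DataMsg(1))
source.inbox.put(ControlMsg({"batch_size": 64}))  # injected by the control plane
source.inbox.put(DataMsg(2))
for _ in range(3):
    source.step()
for _ in range(3):
    sink.step()
```

    Because the control message is ordered with respect to the data around it, each operator sees a consistent point in the stream at which the configuration changes.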

    Elastic techniques to handle dynamism in real-time data processing systems

    Real-time data processing is a crucial component of cloud computing today. It is widely adopted to provide an up-to-date view of data for social networks, cloud management, web applications, and edge and IoT infrastructures. Real-time processing frameworks are designed for time-sensitive tasks such as event detection, real-time data analysis, and prediction. Compared to handling offline, batched data, real-time data processing applications tend to be long-running and are prone to performance issues caused by many unpredictable environmental variables, including (but not limited to) job specification, user expectation, and available resources. To cope with this challenge, it is crucial for system designers to improve frameworks' ability to adjust their resource usage to changing environmental variables, an ability defined as system elasticity. This thesis investigates how elastic resource provisioning helps today's cloud systems process real-time data with predictable performance under workload variation, in an automated manner. We explore new algorithms, framework designs, and efficient system implementations to achieve this goal. Distributed systems today must also continuously handle diverse application specifications, hardware configurations, and workload characteristics. Maintaining stable performance requires systems to explicitly plan resource allocation when an application starts and to tailor allocation dynamically at run time. In this thesis, we show how system elasticity can help systems provide tunable performance under the dynamism of many environmental variables without compromising resource efficiency. Specifically, this thesis focuses on the two following aspects. i) Elasticity-aware scheduling: real-time data processing systems today are often designed in a resource- and workload-agnostic fashion. As a result, most users are unable to perform resource planning before launching an application or to adjust resource allocation (both within and across application boundaries) intelligently at run time. The first part of this thesis (Stela [1], Henge [2], Getafix [3]) explores efficient mechanisms for performance analysis that also enable elasticity-aware scheduling in today's cloud frameworks. ii) Resource-efficient cloud stack: the second line of work aims to improve underlying cloud stacks to support self-adaptive, highly efficient resource provisioning. Today's cloud systems enforce full isolation, which prevents fine-grained resource sharing among applications over time. This work (Cameo [4], Dirigo) builds real-time data processing systems for emerging cloud infrastructures that achieve high resource utilization through fine-grained resource sharing. Given that the market for real-time data analysis is expected to grow at an annual rate of 28.2% and reach 35.5 billion by 2024 [5], improving system elasticity can significantly reduce deployment cost and increase resource utilization. Our work improves the performance of real-time data analytics applications within resource constraints. We highlight some of the improvements: i) Stela explores elastic techniques for single-tenant, on-demand dataflow scale-out and scale-in operations. It improves post-scale throughput by 45-120% during on-demand scale-out and by 2-5× during on-demand scale-in. ii) Henge develops a mechanism to map an application's performance onto a unified scale of resource needs. It reduces resource consumption by 40-60% while maintaining the same level of SLO achievement across the cluster. iii) Getafix implements a strategy to analyze workloads dynamically and guides the system in adaptively deciding how many replicas to generate and where to place them. It achieves comparable query latency (both average and tail) while delivering 1.45-2.15× memory savings. iv) Cameo proposes a scheduler that supports data-driven, fine-grained operator execution guided by user expectations. It improves cluster utilization by 6× and reduces performance violations by 72% while packing more jobs into a shared cluster. v) Dirigo performs fully decentralized, function-state-aware, global message scheduling for stateful functions. It reduces tail latency by 60% compared to local scheduling and remote state accesses by 19× compared to state-unaware scheduling. These works can lead to profound cost savings for both cloud providers and end-users.
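    A minimal sketch of an elasticity-aware scaling decision in the spirit of the work above; the utilization thresholds and numbers are illustrative assumptions, not the actual policies of Stela or Henge.

```python
# Hypothetical threshold-based scale-out/scale-in decision for one operator.
def scaling_decision(input_rate, processing_rate, replicas, high=0.8, low=0.3):
    """Return the new replica count given per-replica processing capacity."""
    utilization = input_rate / (processing_rate * replicas)
    if utilization > high:                   # congested: scale out
        return replicas + 1
    if utilization < low and replicas > 1:   # over-provisioned: scale in
        return replicas - 1
    return replicas                          # within the comfort band

print(scaling_decision(input_rate=900, processing_rate=100, replicas=10))  # 11
print(scaling_decision(input_rate=100, processing_rate=100, replicas=10))  # 9
```

    A real controller would act on smoothed metrics and respect SLO-derived priorities rather than raw utilization, in the spirit of Henge's unified scale of resource needs.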

    Hardware-conscious query processing for the many-core era

    Exploiting the opportunities offered by modern hardware to accelerate query processing is no trivial task. Many DBMSs and DSMSs from past decades are based on assumptions that no longer hold, e.g., today's servers with terabytes of main memory capacity allow spilling data to disk to be avoided entirely, which paved the way for main-memory databases some time ago. One of the recent hardware trends is many-core processors with hundreds of logical cores on a single CPU, providing a high degree of parallelism through multithreading as well as vectorized instructions (SIMD). Their demand for memory bandwidth has driven the further development of high-bandwidth memory (HBM) to overcome the memory wall. However, many-core CPUs and HBM come with many pitfalls that can easily nullify any performance gain. In this work, we explore the many-core architecture along with HBM for database and data stream query processing. We demonstrate that a hardware-conscious cost model with a calibration approach allows reliable performance prediction of various query operations. Based on that information, we derive an adaptive partitioning and merging strategy for stream query parallelization, as well as an ideal parameter configuration for one of the most common tasks in the history of DBMSs, join processing. However, not all operations and applications can exploit a many-core processor or HBM. Stream queries optimized for low latency and quick individual responses usually benefit little from additional bandwidth and also suffer penalties such as the low clock frequencies of many-core CPUs. Data structures shared between cores further raise problems of cache coherence and high contention. Based on our insights, we give a rule of thumb for which data structures are suitable for parallelization with a focus on HBM usage. In addition, different parallelization schemes and synchronization techniques are evaluated, using the example of a multiway stream join operation.
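    A minimal sketch of a calibration-based cost model as described above: primitive operations are timed once on the target machine, and operator cost is then predicted from the calibrated per-tuple constants. The linear model and the choice of primitives are illustrative assumptions, not the thesis's actual model.

```python
# Calibrate per-tuple costs once, then predict operator runtimes from them.
import time

def calibrate(op, n=1_000_000):
    """Average cost in seconds of one primitive operation over n tuples."""
    data = list(range(n))
    start = time.perf_counter()
    op(data)
    return (time.perf_counter() - start) / n

cost_scan = calibrate(lambda d: sum(d))                     # sequential access
cost_probe = calibrate(lambda d: [x in {0, 1} for x in d])  # hash-probe-like access

def predict_join_cost(build_tuples, probe_tuples):
    # Hypothetical linear model: scan/insert the build side, probe with the other.
    return build_tuples * cost_scan + probe_tuples * cost_probe

print(f"predicted join cost: {predict_join_cost(1_000_000, 10_000_000):.3f} s")
```

    On a many-core machine the calibration would be repeated per memory type (DRAM vs. HBM) and per degree of parallelism, which is what makes the predictions hardware-conscious.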

    Complex event processing as a service in multi-cloud environments

    Advisors: Luiz Fernando Bittencourt, Miriam Akemi Manabe Capretz. PhD thesis, Universidade Estadual de Campinas, Instituto de Computação. The rise of mobile technologies and the Internet of Things, combined with advances in Web technologies, has created a new Big Data world in which the volume and velocity of data generation have achieved an unprecedented scale. As a technology created to process continuous streams of data, Complex Event Processing (CEP) has often been related to Big Data and used as a tool to obtain real-time insights. However, despite this recent surge of interest, the CEP market is still dominated by solutions that are costly and inflexible or too low-level and hard to operate. To address these problems, this research proposes the creation of a CEP system that can be offered as a service and used over the Internet. Such a CEP as a Service (CEPaaS) system would give its users CEP functionalities associated with the advantages of the services model, such as no up-front investment and low maintenance cost. Nevertheless, creating such a service involves challenges that are not addressed by current CEP systems. This research proposes solutions for three open problems that exist in this context. First, to address the problem of understanding and reusing existing CEP management procedures, this research introduces the Attributed Graph Rewriting for Complex Event Processing Management (AGeCEP) formalism as a technology- and language-agnostic representation of queries and their reconfigurations. Second, to address the problem of evaluating CEP query management and processing strategies, this research introduces CEPSim, a simulator of cloud-based CEP systems. Finally, this research also introduces a CEPaaS system based on a multi-cloud architecture, container management systems, and an AGeCEP-based multi-tenant design. To demonstrate its feasibility, AGeCEP was used to design an autonomic manager and a selected set of self-management policies. Moreover, CEPSim was thoroughly evaluated by experiments that showed it can simulate existing systems with accuracy and low execution overhead. Finally, additional experiments validated the CEPaaS system and demonstrated it achieves the goal of offering CEP functionalities as a scalable and fault-tolerant service. In tandem, these results confirm this research significantly advances the CEP state of the art and provides novel tools and methodologies that can be applied to CEP research.
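    A minimal sketch of the graph-based representation behind AGeCEP as described above: a CEP query is a graph of attributed operator vertices, and a reconfiguration is a rewrite rule matched against and applied to that graph. The filter-merging rule and the attribute names are hypothetical examples, not rules taken from AGeCEP itself.

```python
# Query graph: operator id -> (attributes, list of downstream operator ids).
query = {
    "src":  ({"type": "source"}, ["f1"]),
    "f1":   ({"type": "filter", "pred": "price > 10"}, ["f2"]),
    "f2":   ({"type": "filter", "pred": "qty > 5"}, ["sink"]),
    "sink": ({"type": "sink"}, []),
}

def merge_adjacent_filters(g):
    """Rewrite rule: filter -> filter becomes one filter with a conjunction."""
    for v, (attrs, outs) in list(g.items()):
        if attrs["type"] == "filter" and len(outs) == 1 and outs[0] in g:
            w = outs[0]
            w_attrs, w_outs = g[w]
            if w_attrs["type"] == "filter":
                attrs["pred"] = f"({attrs['pred']}) AND ({w_attrs['pred']})"
                g[v] = (attrs, w_outs)  # bypass and delete the merged vertex
                del g[w]
                return True             # apply one rule at a time
    return False

while merge_adjacent_filters(query):
    pass
print(query)  # 'f1' now filters on the conjunction and feeds 'sink' directly
```

    Because rules operate on attributes rather than on a concrete engine's API, the same reconfiguration can be expressed once and applied to queries written for different CEP technologies, which is the point of such a formalism.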

    Quality of Service Aware Data Stream Processing for Highly Dynamic and Scalable Applications

    Huge amounts of georeferenced data streams arrive daily at data stream management systems that are deployed to serve highly scalable and dynamic applications. There are innumerable ways in which those streams can be exploited to gain deep insights in various domains. Decision makers require interactive visualization of such data in the form of maps and dashboards for decision making and strategic planning. Data streams normally exhibit fluctuation and oscillation in arrival rates and skewness. Those are the two predominant factors that greatly impact the overall quality of service. This requires data stream management systems to be attuned to those factors, in addition to the spatial shape of the data, which may exaggerate their negative impact. Current systems do not natively support services with quality guarantees for dynamic scenarios, leaving the handling of those logistics to the user, which is challenging and cumbersome. Three workloads are predominant for any data stream: batch processing, scalable storage, and stream processing. In this thesis, we have designed a quality-of-service-aware system, SpatialDSMS, that comprises several subsystems covering those workloads and any mixed load that results from intermixing them. Most importantly, we have natively incorporated quality-of-service optimizations for processing avalanches of geo-referenced data streams in highly dynamic application scenarios. This has been achieved transparently on top of the codebases of emerging de facto standard, best-in-class representatives, thus relieving users in the presentation layer from having to reason about those services. Instead, users express their queries with quality goals, and our system optimizer compiles them down into query plans with an embedded quality guarantee, leaving the logistics to the underlying layers. We have developed standards-compliant prototypes for all the subsystems that constitute SpatialDSMS.
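    A minimal sketch of the quality-goal interface described above: the user states a latency goal, and a hypothetical optimizer compiles the query down to the cheapest candidate plan whose estimated latency meets that goal. Plan names, estimates, and the cost model are illustrative assumptions.

```python
# Pick the cheapest plan that satisfies the user's latency goal.
candidate_plans = [
    {"name": "full-precision", "est_latency_ms": 900, "cost": 1.0},
    {"name": "grid-sampled",   "est_latency_ms": 250, "cost": 0.4},
    {"name": "cached-tiles",   "est_latency_ms": 80,  "cost": 0.7},
]

def compile_with_goal(plans, max_latency_ms):
    feasible = [p for p in plans if p["est_latency_ms"] <= max_latency_ms]
    if not feasible:
        raise ValueError("no candidate plan meets the latency goal")
    return min(feasible, key=lambda p: p["cost"])

print(compile_with_goal(candidate_plans, max_latency_ms=300))  # -> grid-sampled
```

    The point is that the quality goal, not the plan, is the user-facing contract; keeping the chosen plan feasible under fluctuating arrival rates is then the system's job.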

    Automatic Generation of Distributed Runtime Infrastructure for Internet of Things

    Ph.D. thesis. The Internet of Things (IoT) represents a network of connected devices that are able to cooperate and interact with each other in order to reach a particular goal. To attain this, the devices are equipped with identifying, sensing, networking, and processing capabilities. Cloud computing, on the other hand, is the delivery of on-demand computing services, from applications to storage to processing power, typically over the internet. Clouds bring a number of advantages to distributed computing because of their highly available pools of virtualized computing resources. Due to the large number of connected devices, real-world IoT use cases may generate overwhelmingly large amounts of data. This prompts the use of cloud resources for processing, storage, and analysis of the data. Therefore, a typical IoT system comprises a front-end (devices that collect and transmit data) and a back-end, typically distributed Data Stream Management Systems (DSMSs) deployed on cloud infrastructure for data processing and analysis. Increasingly, new IoT devices are being manufactured to provide a limited execution environment on top of their data sensing and transmitting capabilities. This demands a change in the way data is processed in a typical IoT-cloud setup. The traditional, centralised cloud-based data processing model, in which IoT devices are used only for data collection, does not provide efficient utilisation of all available resources. In addition, fundamental requirements of real-time data processing, such as short response times, may not always be met. This prompts a new processing model based on decentralising the data processing tasks. The new decentralised architectural pattern allows some parts of the data streaming computation to be executed directly on edge devices, closer to where the data is collected. Extending processing capabilities to the IoT devices increases the robustness of applications and reduces the communication overhead between the components of an IoT system. However, this new pattern poses new challenges in the development, deployment, and management of IoT applications. First, there exists a large resource gap between the two parts of a typical IoT system (i.e., clouds and IoT devices), prompting a new approach to IoT application deployment and management. Second, the decentralised approach necessitates deploying DSMSs on distributed clusters of heterogeneous nodes, resulting in unpredictable runtime performance and complex fault characteristics. Lastly, the environment where DSMSs are deployed is very dynamic due to user or device mobility, workload variation, and resource availability. In this thesis we present solutions to address the aforementioned challenges. We investigate how a high-level description of a data streaming computation can be used to automatically generate a distributed runtime infrastructure for the Internet of Things. Subsequently, we develop a deployment and management system capable of distributing the operators of a data streaming computation onto different IoT gateway devices and cloud infrastructure. To address the remaining challenges, we propose a non-intrusive approach for performance evaluation of DSMSs and present a protocol and a set of algorithms for dynamic migration of stateful data stream operators. To improve our migration approach, we provide an optimisation technique that minimises application downtime and improves the accuracy of the data stream computation.
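    A minimal sketch of stateful operator migration in the spirit of the protocol described above: snapshot the operator's state on the source node, restore it on the target node, and replay tuples that were buffered while the operator was in transit. The class names and the buffering scheme are assumptions for illustration, not the thesis's actual protocol.

```python
# Hypothetical pause-snapshot-restore-replay migration of a stateful operator.
import pickle

class StatefulOperator:
    def __init__(self, state=None):
        self.state = state if state is not None else {"count": 0}

    def process(self, item):
        self.state["count"] += 1  # stand-in for real per-tuple logic

    def snapshot(self):
        return pickle.dumps(self.state)

    @classmethod
    def restore(cls, blob):
        return cls(state=pickle.loads(blob))

def migrate(op, in_flight):
    blob = op.snapshot()                     # 1. capture state at the source
    new_op = StatefulOperator.restore(blob)  # 2. restore at the target
    for item in in_flight:                   # 3. replay buffered tuples
        new_op.process(item)
    return new_op                            # 4. upstream reroutes to new_op

op = StatefulOperator()
for i in range(5):
    op.process(i)
moved = migrate(op, in_flight=[5, 6])
print(moved.state)  # {'count': 7}
```

    The downtime-minimising optimisation mentioned above would shrink the window between steps 1 and 3, e.g. by transferring most of the state while the old operator keeps running.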

    A flexible and decentralised approach to query processing in geo-distributed systems

    This thesis studies the design of query processing systems across a diversity of geo-distributed settings. Optimising performance metrics such as response time, freshness, or operational cost involves design decisions, such as what derived state (e.g., indexes, materialised views, or caches) to maintain, and how to distribute and where to place the corresponding computation and state. These metrics are often in tension, and the trade-offs depend on the specific application and/or environment. This requires the ability to adapt the query engine's topology and architecture, and the placement of its components. This thesis makes the following contributions:
    - A flexible architecture for geo-distributed query engines, based on components connected in a bidirectional acyclic graph.
    - A common microservice abstraction and API for these components, the Query Processing Unit (QPU). A QPU encapsulates a primitive query processing task. Multiple QPU types exist, which can be instantiated and composed into complex graphs.
    - A model for constructing modular query engine architectures as a distributed topology of QPUs, enabling flexible design and trade-offs between performance metrics.
    - Proteus, a QPU-based framework for constructing and deploying query engines.
    - Representative deployments of Proteus and an experimental evaluation thereof.
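    A minimal sketch of the QPU abstraction described above: every component implements one common query API and may delegate to upstream QPUs, so a query engine is assembled by composing QPU instances into a graph. The QPU types shown are loose analogues invented for illustration, not Proteus's actual components.

```python
# Compose a tiny query engine from QPUs sharing one common API.
class QPU:
    """Common abstraction: every QPU answers queries, possibly via other QPUs."""
    def query(self, predicate):
        raise NotImplementedError

class DataStoreQPU(QPU):
    """Leaf QPU: serves records from a local store."""
    def __init__(self, records):
        self.records = records

    def query(self, predicate):
        return [r for r in self.records if predicate(r)]

class CacheQPU(QPU):
    """Caches upstream results, trading freshness for response time."""
    def __init__(self, upstream):
        self.upstream, self.cache = upstream, {}

    def query(self, predicate):
        key = id(predicate)
        if key not in self.cache:
            self.cache[key] = self.upstream.query(predicate)
        return self.cache[key]

class UnionQPU(QPU):
    """Federates several upstream QPUs, e.g. one per region."""
    def __init__(self, upstreams):
        self.upstreams = upstreams

    def query(self, predicate):
        return [r for u in self.upstreams for r in u.query(predicate)]

# Two regional stores, each behind a cache, federated into one engine.
eu = CacheQPU(DataStoreQPU([{"region": "eu", "v": 1}]))
us = CacheQPU(DataStoreQPU([{"region": "us", "v": 2}]))
engine = UnionQPU([eu, us])
print(engine.query(lambda r: r["v"] > 0))
```

    Swapping the cache for an index, or moving it to another site, changes the performance trade-off without changing the API, which is the flexibility the architecture aims for.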

    Complex Event Processing as a Service in Multi-Cloud Environments

    The rise of mobile technologies and the Internet of Things, combined with advances in Web technologies, has created a new Big Data world in which the volume and velocity of data generation have achieved an unprecedented scale. As a technology created to process continuous streams of data, Complex Event Processing (CEP) has often been related to Big Data and used as a tool to obtain real-time insights. However, despite this recent surge of interest, the CEP market is still dominated by solutions that are costly and inflexible or too low-level and hard to operate. To address these problems, this research proposes the creation of a CEP system that can be offered as a service and used over the Internet. Such a CEP as a Service (CEPaaS) system would give its users CEP functionalities associated with the advantages of the services model, such as no up-front investment and low maintenance cost. Nevertheless, creating such a service involves challenges that are not addressed by current CEP systems. This research proposes solutions for three open problems that exist in this context. First, to address the problem of understanding and reusing existing CEP management procedures, this research introduces the Attributed Graph Rewriting for Complex Event Processing Management (AGeCEP) formalism as a technology- and language-agnostic representation of queries and their reconfigurations. Second, to address the problem of evaluating CEP query management and processing strategies, this research introduces CEPSim, a simulator of cloud-based CEP systems. Finally, this research also introduces a CEPaaS system based on a multi-cloud architecture, container management systems, and an AGeCEP-based multi-tenant design. To demonstrate its feasibility, AGeCEP was used to design an autonomic manager and a selected set of self-management policies. Moreover, CEPSim was thoroughly evaluated by experiments that showed it can simulate existing systems with accuracy and low execution overhead. Finally, additional experiments validated the CEPaaS system and demonstrated it achieves the goal of offering CEP functionalities as a scalable and fault-tolerant service. In tandem, these results confirm this research significantly advances the CEP state of the art and provides novel tools and methodologies that can be applied to CEP research.
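    A minimal sketch in the spirit of simulating a cloud-based CEP system as described above: a synthetic event-arrival trace is replayed through an operator with a fixed service rate to estimate throughput and backlog without running a real engine. The queueing model and parameters are illustrative assumptions; CEPSim's actual simulation models are considerably richer.

```python
# Tiny discrete-time queueing simulation of one CEP operator.
def simulate(arrivals_per_tick, service_per_tick, ticks):
    """Return total processed events and peak backlog over the run."""
    backlog = processed = max_backlog = 0
    for _ in range(ticks):
        backlog += arrivals_per_tick          # events arriving this tick
        served = min(backlog, service_per_tick)
        backlog -= served
        processed += served
        max_backlog = max(max_backlog, backlog)
    return {"processed": processed, "max_backlog": max_backlog}

print(simulate(arrivals_per_tick=120, service_per_tick=100, ticks=10))
# Backlog grows by 20 events per tick: this configuration is under-provisioned,
# exactly the kind of conclusion a simulator lets you reach before deploying.
```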