83 research outputs found

    A Survey on the Evolution of Stream Processing Systems

    Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between early ('00-'10) and modern ('11-'18) streaming systems, and discuss recent trends and open problems. Comment: 34 pages, 15 figures, 5 tables.

    Chi: a scalable and programmable control plane for distributed stream processing systems

    Stream-processing workloads and modern shared cluster environments exhibit high variability and unpredictability. Combined with the large parameter space and the diverse set of user SLOs, this makes modern streaming systems very challenging to configure and tune statically. To address these issues, in this paper we investigate a novel control-plane design, Chi, which supports continuous monitoring and feedback and enables dynamic reconfiguration. Chi leverages the key insight of embedding control-plane messages in the data-plane channels to achieve a low-latency and flexible control plane for stream-processing systems. Chi introduces a new reactive programming model and design mechanisms to execute control policies asynchronously, thus avoiding global synchronization. We show how this allows us to easily implement a wide spectrum of control policies targeting different use cases observed in production. Large-scale experiments using production workloads from a popular cloud provider demonstrate the flexibility and efficiency of our approach.
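The idea of carrying control messages in the same channel as the data can be illustrated with a toy operator. This is only a minimal sketch under stated assumptions, not Chi's actual API: the `Data` and `Control` classes, the `new_factor` field, and `scaling_operator` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


@dataclass
class Data:
    value: int


@dataclass
class Control:
    # Hypothetical control message: changes the operator's scaling factor.
    new_factor: int


Message = Union[Data, Control]


def scaling_operator(channel: Iterable[Message], factor: int = 1) -> Iterator[Message]:
    """Multiply data values by a factor that control messages can change in-band."""
    for msg in channel:
        if isinstance(msg, Control):
            # Reconfigure locally, then forward the control message downstream
            # so later operators see it at the same position in the stream.
            factor = msg.new_factor
            yield msg
        else:
            yield Data(msg.value * factor)


# The control message arrives between the second and third data items,
# so only the third item is processed with the new factor.
stream = [Data(1), Data(2), Control(new_factor=10), Data(3)]
outputs = [m.value for m in scaling_operator(stream) if isinstance(m, Data)]
print(outputs)  # [1, 2, 30]
```

Because the control message travels in order with the data, the operator switches configuration at a well-defined point in its input without any coordination outside the channel, which is the property the abstract attributes to in-band control.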

    Elastic techniques to handle dynamism in real-time data processing systems

    Real-time data processing is a crucial component of cloud computing today. It is widely adopted to provide an up-to-date view of data for social networks, cloud management, web applications, edge, and IoT infrastructures. Real-time processing frameworks are designed for time-sensitive tasks such as event detection, real-time data analysis, and prediction. Compared to applications that handle offline, batched data, real-time data processing applications tend to be long-running and are prone to performance issues caused by many unpredictable environmental variables, including (but not limited to) job specification, user expectation, and available resources. To cope with this challenge, system designers must improve frameworks' ability to adjust their resource usage as environmental variables change, a property defined as system elasticity. This thesis investigates how elastic resource provisioning helps today's cloud systems process real-time data while automatically maintaining predictable performance under workload fluctuations. We explore new algorithms, framework design, and efficient system implementation to achieve this goal. At the same time, distributed systems today need to continuously handle various application specifications, hardware configurations, and workload characteristics. Maintaining stable performance requires systems to plan resource allocation explicitly when an application starts and to tailor the allocation dynamically at run time. In this thesis, we show how achieving system elasticity can help systems provide tunable performance under the dynamism of many environmental variables without compromising resource efficiency.
Specifically, this thesis focuses on the following two aspects. i) Elasticity-aware Scheduling: Real-time data processing systems today are often designed in a resource- and workload-agnostic fashion. As a result, most users are unable to perform resource planning before launching an application or to adjust resource allocation (both within and across application boundaries) intelligently during the run. The first part of this thesis work (Stela [1], Henge [2], Getafix [3]) explores efficient mechanisms to conduct performance analysis while also enabling elasticity-aware scheduling in today's cloud frameworks. ii) Resource Efficient Cloud Stack: The second line of work in this thesis aims to improve underlying cloud stacks to support self-adaptive, highly efficient resource provisioning. Today's cloud systems enforce full isolation, which prevents resource sharing among applications at a fine granularity over time. This work (Cameo [4], Dirigo) builds real-time data processing systems for emerging cloud infrastructures that achieve high resource utilization through fine-grained resource sharing.
Given that the market for real-time data analysis is expected to grow at an annual rate of 28.2% and reach 35.5 billion by the year 2024 [5], improving system elasticity can significantly reduce deployment cost and increase resource utilization. Our works improve the performance of real-time data analytics applications within resource constraints. We highlight some of the improvements as follows: i) Stela explores elastic techniques for single-tenant, on-demand dataflow scale-out and scale-in operations. It improves post-scale throughput by 45-120% during on-demand scale-out and by 2-5× during on-demand scale-in. ii) Henge develops a mechanism to map an application's performance onto a unified scale of resource needs. It reduces resource consumption by 40-60% while maintaining the same level of SLO achievement throughout the cluster. iii) Getafix implements a strategy to analyze the workload dynamically and proposes a solution that guides systems in adaptively calculating the number of replicas to generate and their placement plan. It achieves comparable query latency (both average and tail) while achieving 1.45-2.15× memory savings. iv) Cameo proposes a scheduler that supports data-driven, fine-grained operator execution guided by user expectations. It improves cluster utilization by 6× and reduces performance violations by 72% while compacting more jobs into a shared cluster. v) Dirigo performs fully decentralized, function state-aware, global message scheduling for stateful functions. It reduces tail latency by 60% compared to a local scheduling approach and reduces remote state accesses by 19× compared to a scheduling approach that is unaware of function states. These works can potentially lead to profound cost savings for both cloud providers and end users.

    Hardware-conscious query processing for the many-core era

    Exploiting the opportunities offered by modern hardware to accelerate query processing is no trivial task. Many DBMS and DSMS from past decades are based on assumptions that have changed over time. For example, today's servers with terabytes of main memory capacity make it possible to avoid spilling data to disk entirely, which paved the way for main memory databases some time ago. One recent hardware trend is many-core processors with hundreds of logical cores on a single CPU, which provide a high degree of parallelism through multithreading as well as vectorized instructions (SIMD). Their demand for memory bandwidth has led to the development of high-bandwidth memory (HBM) to overcome the memory wall. However, many-core CPUs as well as HBM have many pitfalls that can easily nullify any performance gain. In this work, we explore the many-core architecture along with HBM for database and data stream query processing. We demonstrate that a hardware-conscious cost model with a calibration approach allows reliable performance prediction of various query operations. Based on that information, we derive an adaptive partitioning and merging strategy for stream query parallelization and find an ideal parameter configuration for join processing, one of the most common tasks in the history of DBMS. However, not all operations and applications can exploit a many-core processor or HBM. Stream queries optimized for low latency and quick individual responses usually benefit little from more bandwidth and also suffer from penalties such as the low clock frequencies of many-core CPUs. Data structures shared between cores also lead to problems with cache coherence as well as high contention. Based on our insights, we give a rule of thumb for which data structures are suitable for parallelization with a focus on HBM usage. In addition, different parallelization schemes and synchronization techniques are evaluated using the example of a multiway stream join operation.

    Processamento de eventos complexos como serviço em ambientes multi-nuvem

    Advisors: Luiz Fernando Bittencourt, Miriam Akemi Manabe Capretz. Thesis (doctorate), Universidade Estadual de Campinas, Instituto de Computação.
Abstract: The rise of mobile technologies and the Internet of Things, combined with advances in Web technologies, has created a new Big Data world in which the volume and velocity of data generation have reached an unprecedented scale. As a technology created to process continuous streams of data, Complex Event Processing (CEP) has often been associated with Big Data and used as a tool to obtain real-time insights. However, despite this recent surge of interest, the CEP market is still dominated by solutions that are either costly and inflexible or too low-level and hard to operate. To address these problems, this research proposes the creation of a CEP system that can be offered as a service and used over the Internet. Such a CEP as a Service (CEPaaS) system would give its users CEP functionalities together with the advantages of the services model, such as no up-front investment and low maintenance cost. Nevertheless, creating such a service involves challenges that are not addressed by current CEP systems. This research proposes solutions for three open problems that exist in this context. First, to address the problem of understanding and reusing existing CEP management procedures, this research introduces the Attributed Graph Rewriting for Complex Event Processing Management (AGeCEP) formalism as a technology- and language-agnostic representation of queries and their reconfigurations. Second, to address the problem of evaluating CEP query management and processing strategies, this research introduces CEPSim, a simulator of cloud-based CEP systems. Finally, this research also introduces a CEPaaS system based on a multi-cloud architecture, container management systems, and an AGeCEP-based multi-tenant design. To demonstrate its feasibility, AGeCEP was used to design an autonomic manager and a selected set of self-management policies. Moreover, CEPSim was thoroughly evaluated by experiments that showed it can simulate existing systems with accuracy and low execution overhead. Finally, additional experiments validated the CEPaaS system and demonstrated that it achieves the goal of offering CEP functionalities as a scalable and fault-tolerant service. Taken together, these results confirm that this research significantly advances the CEP state of the art and provides novel tools and methodologies that can be applied to CEP research.
Doctorate. Computer Science. Doctor in Computer Science. 140920/2012-9 CNP

    Quality of Service Aware Data Stream Processing for Highly Dynamic and Scalable Applications

    Huge volumes of georeferenced data streams arrive daily at data stream management systems deployed to serve highly scalable and dynamic applications. There are innumerable ways in which those loads can be exploited to gain deep insights in various domains. Decision makers require interactive visualization of such data in the form of maps and dashboards for decision making and strategic planning. Data streams normally exhibit fluctuating arrival rates and skewness, the two predominant factors that affect the overall quality of service. Data stream management systems must therefore be attuned to those factors, as well as to the spatial shape of the data, which may amplify their negative impact. Current systems do not natively support services with quality guarantees for dynamic scenarios, leaving the handling of those logistics to the user, which is challenging and cumbersome. Three workloads are predominant for any data stream: batch processing, scalable storage, and stream processing. In this thesis, we have designed a quality-of-service-aware system, SpatialDSMS, consisting of several subsystems that cover those loads and any mixed load that results from combining them. Most importantly, we have natively incorporated quality of service optimizations for processing avalanches of georeferenced data streams in highly dynamic application scenarios. This has been achieved transparently on top of the codebases of emerging de facto standard systems, relieving users at the presentation layer from having to reason about those services. Instead, users express their queries with quality goals, and our system's optimizers compile them into query plans with an embedded quality guarantee, leaving logistics handling to the underlying layers.
We have developed standards-compliant prototypes for all the subsystems that constitute SpatialDSMS.

    Automatic Generation of Distributed Runtime Infrastructure for Internet of Things

    Ph.D. thesis. The Internet of Things (IoT) represents a network of connected devices that are able to cooperate and interact with each other in order to reach a particular goal. To attain this, the devices are equipped with identifying, sensing, networking, and processing capabilities. Cloud computing, on the other hand, is the delivery of on-demand computing services (from applications, to storage, to processing power), typically over the internet. Clouds bring a number of advantages to distributed computing because they offer a highly available pool of virtualized computing resources. Due to the large number of connected devices, real-world IoT use cases may generate overwhelmingly large amounts of data. This prompts the use of cloud resources for processing, storage, and analysis of the data. Therefore, a typical IoT system comprises a front end (devices that collect and transmit data) and a back end, typically distributed Data Stream Management Systems (DSMSs) deployed on cloud infrastructure for data processing and analysis. Increasingly, new IoT devices are being manufactured to provide a limited execution environment on top of their data sensing and transmitting capabilities. This demands a change in the way data is processed in a typical IoT-cloud setup. The traditional, centralised cloud-based data processing model, in which IoT devices are used only for data collection, does not utilise all available resources efficiently. In addition, fundamental requirements of real-time data processing, such as short response time, may not always be met. This prompts a new processing model based on decentralising the data processing tasks. The new decentralised architectural pattern allows some parts of a data streaming computation to be executed directly on edge devices, closer to where the data is collected.
Extending the processing capabilities to the IoT devices increases the robustness of applications and reduces the communication overhead between different components of an IoT system. However, this new pattern poses new challenges in the development, deployment, and management of IoT applications. Firstly, there is a large resource gap between the two parts of a typical IoT system (i.e. clouds and IoT devices), which calls for a new approach to IoT application deployment and management. Secondly, the new decentralised approach necessitates deploying DSMSs on distributed clusters of heterogeneous nodes, resulting in unpredictable runtime performance and complex fault characteristics. Lastly, the environment where DSMSs are deployed is very dynamic due to user or device mobility, workload variation, and resource availability. In this thesis we present solutions to address these challenges. We investigate how a high-level description of a data streaming computation can be used to automatically generate a distributed runtime infrastructure for the Internet of Things. Subsequently, we develop a deployment and management system capable of distributing different operators of a data streaming computation onto different IoT gateway devices and cloud infrastructure. To address the other challenges, we propose a non-intrusive approach for performance evaluation of DSMSs and present a protocol and a set of algorithms for dynamic migration of stateful data stream operators. To improve our migration approach, we provide an optimisation technique that minimises application downtime and improves the accuracy of a data stream computation.

    Une approche flexible et décentralisée du traitement de requêtes dans les systèmes géo-distribués

    This thesis studies the design of query processing systems across a diversity of geo-distributed settings. Optimising performance metrics such as response time, freshness, or operational cost involves design decisions, such as what derived state (e.g., indexes, materialised views, or caches) to maintain, and how to distribute and where to place the corresponding computation and state. These metrics are often in tension, and the trade-offs depend on the specific application and/or environment. This requires the ability to adapt the query engine's topology and architecture, and the placement of its components. This thesis makes the following contributions:
    - A flexible architecture for geo-distributed query engines, based on components connected in a bidirectional acyclic graph.
    - A common microservice abstraction and API for these components, the Query Processing Unit (QPU). A QPU encapsulates some primitive query processing task. Multiple QPU types exist, which can be instantiated and composed into complex graphs.
    - A model for constructing modular query engine architectures as a distributed topology of QPUs, enabling flexible design and trade-offs between performance metrics.
    - Proteus, a QPU-based framework for constructing and deploying query engines.
    - Representative deployments of Proteus and experimental evaluation thereof.
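The notion of units that share one query interface and compose into a graph can be sketched as follows. This is a hedged illustration only, not the Proteus API: the `QPU` protocol, the `query` method, and the `ScanQPU` and `UnionQPU` classes are hypothetical names, and an in-memory list stands in for a real data store.

```python
from typing import Callable, Dict, Iterator, List, Protocol

Record = Dict[str, int]
Predicate = Callable[[Record], bool]


class QPU(Protocol):
    """Hypothetical common interface shared by every unit in the graph."""

    def query(self, predicate: Predicate) -> Iterator[Record]: ...


class ScanQPU:
    """Leaf unit: scans an in-memory list standing in for a data store."""

    def __init__(self, records: List[Record]) -> None:
        self.records = records

    def query(self, predicate: Predicate) -> Iterator[Record]:
        return (r for r in self.records if predicate(r))


class UnionQPU:
    """Inner unit: forwards the query to its children and merges the results."""

    def __init__(self, children: List[QPU]) -> None:
        self.children = children

    def query(self, predicate: Predicate) -> Iterator[Record]:
        for child in self.children:
            yield from child.query(predicate)


# Two leaf units (imagined as stores in different regions) under one root.
eu = ScanQPU([{"id": 1, "votes": 5}, {"id": 2, "votes": 12}])
us = ScanQPU([{"id": 3, "votes": 20}])
root = UnionQPU([eu, us])

popular_ids = [r["id"] for r in root.query(lambda r: r["votes"] > 10)]
print(popular_ids)  # [2, 3]
```

Because every unit exposes the same method, a `UnionQPU` can sit above other `UnionQPU` instances as easily as above leaves, which is what lets a topology be rearranged without changing the client's query code.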

    Complex Event Processing as a Service in Multi-Cloud Environments

    The rise of mobile technologies and the Internet of Things, combined with advances in Web technologies, has created a new Big Data world in which the volume and velocity of data generation have reached an unprecedented scale. As a technology created to process continuous streams of data, Complex Event Processing (CEP) has often been associated with Big Data and used as a tool to obtain real-time insights. However, despite this recent surge of interest, the CEP market is still dominated by solutions that are either costly and inflexible or too low-level and hard to operate. To address these problems, this research proposes the creation of a CEP system that can be offered as a service and used over the Internet. Such a CEP as a Service (CEPaaS) system would give its users CEP functionalities together with the advantages of the services model, such as no up-front investment and low maintenance cost. Nevertheless, creating such a service involves challenges that are not addressed by current CEP systems. This research proposes solutions for three open problems that exist in this context. First, to address the problem of understanding and reusing existing CEP management procedures, this research introduces the Attributed Graph Rewriting for Complex Event Processing Management (AGeCEP) formalism as a technology- and language-agnostic representation of queries and their reconfigurations. Second, to address the problem of evaluating CEP query management and processing strategies, this research introduces CEPSim, a simulator of cloud-based CEP systems. Finally, this research also introduces a CEPaaS system based on a multi-cloud architecture, container management systems, and an AGeCEP-based multi-tenant design. To demonstrate its feasibility, AGeCEP was used to design an autonomic manager and a selected set of self-management policies. Moreover, CEPSim was thoroughly evaluated by experiments that showed it can simulate existing systems with accuracy and low execution overhead.
Finally, additional experiments validated the CEPaaS system and demonstrated that it achieves the goal of offering CEP functionalities as a scalable and fault-tolerant service. Taken together, these results confirm that this research significantly advances the CEP state of the art and provides novel tools and methodologies that can be applied to CEP research.
