29 research outputs found

    Parallelizing Windowed Stream Joins in a Shared-Nothing Cluster

    Full text link
    The availability of large number of processing nodes in a parallel and distributed computing environment enables sophisticated real time processing over high speed data streams, as required by many emerging applications. Sliding window stream joins are among the most important operators in a stream processing system. In this paper, we consider the issue of parallelizing a sliding window stream join operator over a shared nothing cluster. We propose a framework, based on fixed or predefined communication pattern, to distribute the join processing loads over the shared-nothing cluster. We consider various overheads while scaling over a large number of nodes, and propose solution methodologies to cope with the issues. We implement the algorithm over a cluster using a message passing system, and present the experimental results showing the effectiveness of the join processing algorithm.Comment: 11 page

    Processing Exact Results for Queries over Data Streams

    Get PDF
    In a growing number of information-processing applications, such as network-traffic monitoring, sensor networks, financial analysis, data mining for e-commerce, etc., data takes the form of continuous data streams rather than traditional stored databases/relational tuples. These applications have some common features like the need for real time analysis, huge volumes of data, and unpredictable and bursty arrivals of stream elements. In all of these applications, it is infeasible to process queries over data streams by loading the data into a traditional database management system (DBMS) or into main memory. Such an approach does not scale with high stream rates. As a consequence, systems that can manage streaming data have gained tremendous importance. The need to process a large number of continuous queries over bursty, high volume online data streams, potentially in real time, makes it imperative to design algorithms that should use limited resources. This dissertation focuses on processing exact results for join queries over high speed data streams using limited resources, and proposes several novel techniques for processing join queries incorporating secondary storages and non-dedicated computers. Existing approaches for stream joins either, (a) deal with memory limitations by shedding loads, and therefore can not produce exact or highly accurate results for the stream joins over data streams with time varying arrivals of stream tuples, or (b) suffer from large I/O-overheads due to random disk accesses. The proposed techniques exploit the high bandwidth of a disk subsystem by rendering the data access pattern largely sequential, eliminating small, random disk accesses. This dissertation proposes an I/O-efficient algorithm to process hybrid join queries, that join a fast, time varying or bursty data stream and a persistent disk relation. Such a hybrid join is the crux of a number of common transformations in an active data warehouse. Experimental results demonstrate that the proposed scheme reduces the response time in output results by exploiting spatio-temporal locality within the input stream, and minimizes disk overhead through disk-I/O amortization. The dissertation also proposes an algorithm to parallelize a stream join operator over a shared-nothing system. The proposed algorithm distributes the processing loads across a number of independent, non-dedicated nodes, based on a fixed or predefined communication pattern; dynamically maintains the degree of declustering in order to minimize communication and processing overheads; and presents mechanisms for reducing storage and communication overheads while scaling over a large number of nodes. We present experimental results showing the efficacy of the proposed algorithms

    Dynamic Scaling of Parallel Stream Joins on the Cloud

    Get PDF
    Οι μεγάλοι όγκοι δεδομένων που παράγονται από πολλές αναδυόμενες εφαρμογές και συστήματα απαιτούν την πολύπλοκη επεξεργασία ροών δεδομένων υψηλής ταχύτητας σε πραγματικό χρόνο. Η σύζευξη δεδομένων ροών είναι η αντίστοιχη διαδικασία σύζευξης των συμβατικών βάσεων δεδομένων και συγκρίνει τις πλειάδες που προέρχονται από διαφορετικές σχεσιακές ροές. Ο συγκεκριμένος operator χαρακτηρίζεται ως υπολογιστικά ακριβός και ταυτόχρονα εξαιρετικά σημαντικός για την ανάλυση δεδομένων σε πραγματικό χρόνο. Η αποτελεσματική και κλιμακούμενη επεξεργασία των συζεύξεων δεδομένων ροών μπορεί να γίνει εφικτή από τη διαθεσιμότητα ενός μεγάλου αριθμού κόμβων επεξεργασίας σε ένα παράλληλο και κατανεμημένο περιβάλλον. Επιπλέον, τα υπολογιστικά νέφη έχουν εξελιχθεί ως μια ελκυστική πλατφόρμα για την επεξεργασία δεδομένων μεγάλης κλίμακας, κυρίως λόγω της έννοιας της ελαστικότητας. Με τα υπολογιστικά νέφη δίνεται η δυνατότητα εκμίσθωσης εικονικής υπολογιστικής υποδομής, η οποία μπορεί να χρησιμοποιηθεί για όσο χρόνο χρειάζεται με δυναμικό τρόπο. Στη συγκεκριμένη εργασία υιοθετούμε τις βασικές ιδέες και τα χαρακτηριστικά των Qian Lin et al. από το έργο τους "Scalable Distributed Stream Join Processing". Η βασική ιδέα που παρουσιάζεται σε αυτό το έργο είναι το μοντέλο join-biclique το οποίο οργανώνει τις μονάδες επεξεργασίας ενός υπολογιστικού cluster ως έναν ολοκληρωμένο διμερές γράφο. Με βάση αυτή την ιδέα, αναπτύξαμε και υλοποιήσαμε ένα σύνολο αλγορίθμων που σχεδιάστηκαν ως microservices σε περιβάλλον software containers. Οι αλγόριθμοι εκτελούν την επεξεργασία και σύζευξη ροών δεδομένων και μπορούν να κλιμακωθούν οριζόντια. Πραγματοποιήσαμε τα πειράματά μας σε περιβάλλον υπολογιστικού νέφους στο Google Container Engine χρησιμοποιώντας πλατφόρμα Kubernetes και Docker containers.The large and varying volumes of data generated by many emerging applications and systems demand the sophisticated processing of high speed data streams in a real-time fashion. Stream joins is the streaming counterpart of conventional database joins and compares tuples coming from different streaming relations. This operator is characterized as computationally expensive and also quite important for real-time analytics. Efficient and scalable processing of stream joins may be enabled by the availability of a large number of processing nodes in a parallel and distributed environment. Furthermore, clouds have evolved as an appealing platform for large-scale data processing mainly due to the concept of elasticity; virtual computing infrastructure can be leased on demand and used for as much time as needed in a dynamic manner. For this thesis project, we adopt the main ideas and features of Qian Lin et al. in their paper “Scalable Distributed Stream Join Processing”. The basic idea presented in that paper is the join-biclique model which organizes the processing units of a cluster as a complete bipartite graph. Based on that idea, we developed and carried out a set of algorithms designed as containerized microservices, which perform stream join processing and can be scaled horizontally on demand. We performed our experiments on Google Container Engine using Kubernetes orchestration platform and Docker containers

    Scalable and fault-tolerant data stream processing on multi-core architectures

    Get PDF
    With increasing data volumes and velocity, many applications are shifting from the classical “process-after-store” paradigm to a stream processing model: data is produced and consumed as continuous streams. Stream processing captures latency-sensitive applications as diverse as credit card fraud detection and high-frequency trading. These applications are expressed as queries of algebraic operations (e.g., aggregation) over the most recent data using windows, i.e., finite evolving views over the input streams. To guarantee correct results, streaming applications require precise window semantics (e.g., temporal ordering) for operations that maintain state. While high processing throughput and low latency are performance desiderata for stateful streaming applications, achieving both poses challenges. Computing the state of overlapping windows causes redundant aggregation operations: incremental execution (i.e., reusing previous results) reduces latency but prevents parallelization; at the same time, parallelizing window execution for stateful operations with precise semantics demands ordering guarantees and state access coordination. Finally, streams and state must be recovered to produce consistent and repeatable results in the event of failures. Given the rise of shared-memory multi-core CPU architectures and high-speed networking, we argue that it is possible to address these challenges in a single node without compromising window semantics, performance, or fault-tolerance. In this thesis, we analyze, design, and implement stream processing engines (SPEs) that achieve high performance on multi-core architectures. To this end, we introduce new approaches for in-memory processing that address the previous challenges: (i) for overlapping windows, we provide a family of window aggregation techniques that enable computation sharing based on the algebraic properties of aggregation functions; (ii) for parallel window execution, we balance parallelism and incremental execution by developing abstractions for both and combining them to a novel design; and (iii) for reliable single-node execution, we enable strong fault-tolerance guarantees without sacrificing performance by reducing the required disk I/O bandwidth using a novel persistence model. We combine the above to implement an SPE that processes hundreds of millions of tuples per second with sub-second latencies. These results reveal the opportunity to reduce resource and maintenance footprint by replacing cluster-based SPEs with single-node deployments.Open Acces

    Accelerating Event Stream Processing in On- and Offline Systems

    Get PDF
    Due to a growing number of data producers and their ever-increasing data volume, the ability to ingest, analyze, and store potentially never-ending streams of data is a mission-critical task in today's data processing landscape. A widespread form of data streams are event streams, which consist of continuously arriving notifications about some real-world phenomena. For example, a temperature sensor naturally generates an event stream by periodically measuring the temperature and reporting it with measurement time in case of a substantial change to the previous measurement. In this thesis, we consider two kinds of event stream processing: online and offline. Online refers to processing events solely in main memory as soon as they arrive, while offline means processing event data previously persisted to non-volatile storage. Both modes are supported by widely used scale-out general-purpose stream processing engines (SPEs) like Apache Flink or Spark Streaming. However, such engines suffer from two significant deficiencies that severely limit their processing performance. First, for offline processing, they load the entire stream from non-volatile secondary storage and replay all data items into the associated online engine in order of their original arrival. While this naturally ensures unified query semantics for on- and offline processing, the costs for reading the entire stream from non-volatile storage quickly dominate the overall processing costs. Second, modern SPEs focus on scaling out computations across the nodes of a cluster, but use only a fraction of the available resources of individual nodes. This thesis tackles those problems with three different approaches. First, we present novel techniques for the offline processing of two important query types (windowed aggregation and sequential pattern matching). Our methods utilize well-understood indexing techniques to reduce the total amount of data to read from non-volatile storage. We show that this improves the overall query runtime significantly. In particular, this thesis develops the first index-based algorithms for pattern queries expressed with the Match_Recognize clause, a new and powerful language feature of SQL that has received little attention so far. Second, we show how to maximize resource utilization of single nodes by exploiting the capabilities of modern hardware. Therefore, we develop a prototypical shared-memory CPU-GPU-enabled event processing system. The system provides implementations of all major event processing operators (filtering, windowed aggregation, windowed join, and sequential pattern matching). Our experiments reveal that regarding resource utilization and processing throughput, such a hardware-enabled system is superior to hardware-agnostic general-purpose engines. Finally, we present TPStream, a new operator for pattern matching over temporal intervals. TPStream achieves low processing latency and, in contrast to sequential pattern matching, is easily parallelizable even for unpartitioned input streams. This results in maximized resource utilization, especially for modern CPUs with multiple cores

    Parallel Patterns for Adaptive Data Stream Processing

    Get PDF
    In recent years our ability to produce information has been growing steadily, driven by an ever increasing computing power, communication rates, hardware and software sensors diffusion. This data is often available in the form of continuous streams and the ability to gather and analyze it to extract insights and detect patterns is a valuable opportunity for many businesses and scientific applications. The topic of Data Stream Processing (DaSP) is a recent and highly active research area dealing with the processing of this streaming data. The development of DaSP applications poses several challenges, from efficient algorithms for the computation to programming and runtime systems to support their execution. In this thesis two main problems will be tackled: * need for high performance: high throughput and low latency are critical requirements for DaSP problems. Applications necessitate taking advantage of parallel hardware and distributed systems, such as multi/manycores or cluster of multicores, in an effective way; * dynamicity: due to their long running nature (24hr/7d), DaSP applications are affected by highly variable arrival rates and changes in their workload characteristics. Adaptivity is a fundamental feature in this context: applications must be able to autonomously scale the used resources to accommodate dynamic requirements and workload while maintaining the desired Quality of Service (QoS) in a cost-effective manner. In the current approaches to the development of DaSP applications are still missing efficient exploitation of intra-operator parallelism as well as adaptations strategies with well known properties of stability, QoS assurance and cost awareness. These are the gaps that this research work tries to fill, resorting to well know approaches such as Structured Parallel Programming and Control Theoretic models. The dissertation runs along these two directions. The first part deals with intra-operator parallelism. A DaSP application can be naturally expressed as a set of operators (i.e. intermediate computations) that cooperate to reach a common goal. If QoS requirements are not met by the current implementation, bottleneck operators must be internally parallelized. We will study recurrent computations in window based stateful operators and propose patterns for their parallel implementation. Windowed operators are the most representative class of stateful data stream operators. Here computations are applied on the most recent received data. Windows are dynamic data structures: they evolve over time in terms of content and, possibly, size. Therefore, with respect to traditional patterns, the DaSP domain requires proper specializations and enhanced features concerning data distribution and management policies for different windowing methods. A structured approach to the problem will reduce the effort and complexity of parallel programming. In addition, it simplifies the reasoning about the performance properties of a parallel solution (e.g. throughput and latency). The proposed patterns exhibit different properties in terms of applicability and profitability that will be discussed and experimentally evaluated. The second part of the thesis is devoted to the proposal and study of predictive strategies and reconfiguration mechanisms for autonomic DaSP operators. Reconfiguration activities can be implemented in a transparent way to the application programmer thanks to the exploitation of parallel paradigms with well known structures. Furthermore, adaptation strategies may take advantage of the QoS predictability of the used parallel solution. Autonomous operators will be driven by means of a Model Predictive Control approach, with the intent of giving QoS assurances in terms of throughput or latency in a resource-aware manner. An experimental section will show the effectiveness of the proposed approach in terms of execution costs reduction as well as the stability degree of a system reconfiguration. The experiments will target shared and distributed memory architectures

    StreamCloud: An elastic and scalable data streaming system

    Get PDF
    Many applications in several domains such as telecommunications, network security, large scale sensor networks, require online processing of continuous data lows. They produce very high loads that requires aggregating the processing capacity of many nodes. Current Stream Processing Engines do not scale with the input load due to single-node bottlenecks. Additionally, they are based on static con?gurations that lead to either under or over-provisioning. In this paper, we present StreamCloud, a scalable and elastic stream processing engine for processing large data stream volumes. StreamCloud uses a novel parallelization technique that splits queries into subqueries that are allocated to independent sets of nodes in a way that minimizes the distribution overhead. Its elastic protocols exhibit low intrusiveness, enabling effective adjustment of resources to the incoming load. Elasticity is combined with dynamic load balancing to minimize the computational resources used. The paper presents the system design, implementation and a thorough evaluation of the scalability and elasticity of the fully implemented system

    A PROCRUSTEAN APPROACH TO STREAM PROCESSING

    Get PDF
    The increasing demand for real-time data processing and the constantly growing data volume have contributed to the rapid evolution of Stream Processing Engines (SPEs), which are designed to continuously process data as it arrives. Low operational cost and timely delivery of results are both objectives of paramount importance for SPEs. Given the volatile and uncharted nature of data streams, achieving the aforementioned goals under fixed resources is a challenge. This calls for adaptable SPEs, which can react to fluctuations in processing demands. In the past, three techniques have been developed for improving an SPE’s ability to adapt. Those techniques are classified based on applications’ requirements on exact or approximate results: stream partitioning, and re-partitioning target exact, and load shedding targets approximate processing. Stream partitioning strives to balance load among processors, and previous techniques neglected hidden costs of distributed execution. Load Shedding lowers the accuracy of results by dropping part of the input, and previous techniques did not cope with evolving streams. Stream re-partitioning is used to reconfigure execution while processing takes place, and previous techniques did not fully utilize window semantics. In this dissertation, we put stream processing in a procrustean bed, in terms of the manner and the degree that processing takes place. To this end, we present new approaches, for window-based aggregate operators, which are applicable to both exact and approximate stream processing in modern SPEs. Our stream partitioning, re-partitioning, and load shedding solutions offer improvements in performance and accuracy on real-world data by exploiting the semantics of both data and operations. In addition, we present SPEAr, the design of an SPE that accelerates processing by delivering approximate results with accuracy guarantees and avoiding unnecessary load. Finally, we contribute a hybrid technique, ShedPart, which can further improve load balance and performance of an SPE

    Incremental Processing and Optimization of Update Streams

    Get PDF
    Over the recent years, we have seen an increasing number of applications in networking, sensor networks, cloud computing, and environmental monitoring, which monitor, plan, control, and make decisions over data streams from multiple sources. We are interested in extending traditional stream processing techniques to meet the new challenges of these applications. Generally, in order to support genuine continuous query optimization and processing over data streams, we need to systematically understand how to address incremental optimization and processing of update streams for a rich class of queries commonly used in the applications. Our general thesis is that efficient incremental processing and re-optimization of update streams can be achieved by various incremental view maintenance techniques if we cast the problems as incremental view maintenance problems over data streams. We focus on two incremental processing of update streams challenges currently not addressed in existing work on stream query processing: incremental processing of transitive closure queries over data streams, and incremental re-optimization of queries. In addition to addressing these specific challenges, we also develop a working prototype system Aspen, which serves as an end-to-end stream processing system that has been deployed as the foundation for a case study of our SmartCIS application. We validate our solutions both analytically and empirically on top of our prototype system Aspen, over a variety of benchmark workloads such as TPC-H and LinearRoad Benchmarks
    corecore