
    Dynamic Scaling of Parallel Stream Joins on the Cloud

    The large and varying volumes of data generated by many emerging applications and systems demand the sophisticated processing of high-speed data streams in real time. The stream join is the streaming counterpart of the conventional database join and compares tuples coming from different streaming relations. This operator is computationally expensive and at the same time quite important for real-time analytics. Efficient and scalable processing of stream joins can be enabled by the availability of a large number of processing nodes in a parallel and distributed environment. Furthermore, clouds have evolved into an appealing platform for large-scale data processing, mainly due to the concept of elasticity: virtual computing infrastructure can be leased on demand and used for as much time as needed in a dynamic manner. For this thesis project, we adopt the main ideas and features of Qian Lin et al. in their paper "Scalable Distributed Stream Join Processing". The basic idea presented in that paper is the join-biclique model, which organizes the processing units of a cluster as a complete bipartite graph. Based on that idea, we developed and implemented a set of algorithms designed as containerized microservices, which perform stream join processing and can be scaled horizontally on demand. We performed our experiments on Google Container Engine using the Kubernetes orchestration platform and Docker containers.
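    As a rough illustration of the join-biclique idea (a sketch of the general layout, not the thesis's actual microservices), the snippet below splits a cluster's workers into two sides, one per streaming relation: an arriving tuple is stored on one worker of its own side and probed against every worker of the opposite side, which is what makes the topology a complete bipartite graph. The names `JoinBiclique` and `process` and the round-robin store policy are assumptions made for the example.

```python
from collections import defaultdict
from itertools import count

class JoinBiclique:
    """Toy router for the join-biclique layout: R-workers and S-workers
    form the two sides of a complete bipartite graph."""

    def __init__(self, num_r_workers, num_s_workers):
        self.state = defaultdict(list)  # state[worker_id] -> stored tuples
        self.r_workers = [f"R{i}" for i in range(num_r_workers)]
        self.s_workers = [f"S{i}" for i in range(num_s_workers)]
        self._rr = count()              # round-robin counter for the store step

    def process(self, stream, tup, key):
        own, other = (self.r_workers, self.s_workers) if stream == "R" \
                     else (self.s_workers, self.r_workers)
        # store: exactly one worker of the tuple's own side keeps it
        store_at = own[next(self._rr) % len(own)]
        self.state[store_at].append((key, tup))
        # probe: every worker of the opposite side is consulted
        return [(stored, tup) for w in other
                for k, stored in self.state[w] if k == key]

# usage: join two tiny streams on a shared key
jb = JoinBiclique(num_r_workers=2, num_s_workers=3)
jb.process("R", {"user": 1, "page": "a"}, key=1)
print(jb.process("S", {"user": 1, "ad": "x"}, key=1))  # one match on key 1
```

    One property this layout is known for is that either side can be scaled out independently of the other, which fits the kind of horizontal scaling of containerized operators described above.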

    Quality-Driven Disorder Handling for M-way Sliding Window Stream Joins

    Sliding window join is one of the most important operators for stream applications. To produce high-quality join results, a stream processing system must deal with the ubiquitous disorder within input streams, which is caused by network delay, asynchronous source clocks, etc. Disorder handling involves an inevitable tradeoff between the latency and the quality of the produced join results. To meet the different requirements of stream applications, it is desirable to provide a user-configurable result-latency vs. result-quality tradeoff. Existing disorder handling approaches either do not provide such configurability, or support only user-specified latency constraints. In this work, we advocate the idea of quality-driven disorder handling and propose a buffer-based disorder handling approach for sliding window joins, which minimizes the sizes of the input-sorting buffers, and thus the result latency, while respecting user-specified result-quality requirements. The core of our approach is an analytical model which directly captures the relationship between the sizes of the input buffers and the produced result quality. Our approach is generic: it supports m-way sliding window joins with arbitrary join conditions. Experiments on real-world and synthetic datasets show that, compared to the state of the art, our approach can reduce the result latency incurred by disorder handling by up to 95% while providing the same level of result quality. Comment: 12 pages, 11 figures, IEEE ICDE 201
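    The following sketch shows the generic buffer-based idea such approaches build on (it is not the paper's analytical model): arriving tuples are held in a small sorting buffer and released in timestamp order, so the downstream window join sees ordered input. The buffer bound `k` stands in for the quantity the paper derives from the user-specified quality requirement; the class and parameter names are hypothetical.

```python
import heapq

class SortingBuffer:
    """Minimal K-slack-style input buffer: tuples are released in
    timestamp order once the buffer holds more than k elements.
    A larger k tolerates more disorder (better result quality) at the
    cost of higher result latency."""

    def __init__(self, k):
        self.k = k
        self.heap = []          # min-heap ordered by timestamp
        self.watermark = None   # largest timestamp already emitted

    def insert(self, ts, tup):
        if self.watermark is not None and ts < self.watermark:
            return []           # too late: would be dropped (a quality loss)
        heapq.heappush(self.heap, (ts, tup))
        out = []
        while len(self.heap) > self.k:
            ts_out, t_out = heapq.heappop(self.heap)
            self.watermark = ts_out
            out.append((ts_out, t_out))
        return out              # tuples now safe to feed to the window join

# usage: a slightly disordered stream is emitted in timestamp order
buf = SortingBuffer(k=2)
for ts, t in [(1, "a"), (3, "c"), (2, "b"), (5, "e"), (4, "d")]:
    for released in buf.insert(ts, t):
        print(released)
```

    Choosing `k` is exactly the latency-vs-quality knob the paper studies: too small and late tuples are lost, too large and every result is delayed.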

    Squall: Scalable Real-time Analytics

    Squall is a scalable online query engine that runs complex analytics in a cluster using skew-resilient, adaptive operators. Squall builds on state-of-the-art partitioning schemes and local algorithms, including some of our own. This paper presents an overview of Squall, including some novel join operators. The paper also presents lessons learned over the five years of working on this system, and outlines the plan for the proposed system demonstration.

    Streaming Data Algorithm Design for Big Trajectory Data Analysis

    Trajectory streams consist of large volumes of time-stamped spatial data that are constantly generated from diverse and geographically distributed sources. Discovering traveling patterns on trajectory streams, such as gathering and companies, requires processing each record when it arrives and correlating across multiple records in near real time. Techniques for handling high-speed trajectory streams should therefore scale on distributed cluster computing. The main issues encapsulate three aspects, namely a data model to represent the continuous trajectory data, the parallelism of a discovery algorithm, and end-to-end performance improvement. In this thesis, I propose two parallel discovery methods, namely the snapshot model and the slot model, each consisting of 1) a model for partitioning trajectories sampled on different time intervals; 2) a definition of distance measurements for trajectories; and 3) a parallel discovery algorithm. I develop these methods in a stream processing workflow and evaluate the solution with a public dataset on an Amazon Web Services (AWS) cloud cluster. From a parallelization point of view, I investigate system performance, scalability, and stability, and pinpoint the principal operations that contribute most to the run-time cost of computation and data shuffling. I improve data locality with fine-tuned data partitioning and data aggregation techniques. I observe that both models can scale on a cluster of nodes as the intensity of the trajectory data streams grows. Generally, the snapshot model has higher throughput and thus lower latency, while the slot model produces more accurate trajectory discovery.
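    As a minimal sketch of snapshot-style partitioning (an assumed structure, not the thesis code), the snippet below buckets time-stamped trajectory points into fixed-length time snapshots so that each snapshot can be handled by a separate worker, and uses plain Euclidean distance as a stand-in for the thesis's trajectory distance measures. The function names and the tuple layout `(object_id, timestamp, x, y)` are assumptions.

```python
from collections import defaultdict

def snapshot_partition(points, interval):
    """Group time-stamped trajectory points into fixed-length time
    snapshots; each snapshot can then be clustered or compared on a
    separate worker, which is the source of parallelism."""
    snapshots = defaultdict(list)
    for obj_id, ts, x, y in points:
        snapshots[ts // interval].append((obj_id, x, y))
    return dict(snapshots)

def euclidean(p, q):
    """Plain Euclidean distance between two points within a snapshot,
    standing in for a real trajectory distance measure."""
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

# usage: points as (object_id, timestamp, x, y)
pts = [("t1", 0, 0.0, 0.0), ("t2", 1, 0.1, 0.0),
       ("t1", 10, 5.0, 5.0), ("t2", 11, 5.2, 5.1)]
parts = snapshot_partition(pts, interval=10)
print(sorted(parts))                 # snapshot keys: [0, 1]
a, b = parts[1][0], parts[1][1]
print(euclidean(a[1:], b[1:]))       # distance between t1 and t2 in snapshot 1
```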

    EXPLOITING THE SYNERGY BETWEEN SCHEDULING AND LOAD SHEDDING TO FACILITATE DIFFERENTIATED LEVELS OF SERVICE FOR CONTINUOUS QUERIES

    Data Stream Management Systems (DSMSs) offer the most effective solution for processing data streams by efficiently executing continuous queries (CQs) over the incoming data. CQs inherently have different levels of criticality and hence different levels of expected quality of service (QoS) and quality of data (QoD). Adhering to such expected QoS/QoD metrics is even more important in cases of multi-tenant data stream management services. In this dissertation, we propose DILoS, a framework that supports differentiated QoS and QoD for multiple classes of CQs by tightly integrating priority-based scheduling and load shedding. Unlike existing works that consider scheduling and load shedding separately, DILoS is a novel unified framework that exploits the synergy between them. For the realization of DILoS, we propose ALoMa and SEaMLeSS, two general, adaptive load managers. Our load managers can also be used standalone and outperform the state-of-the-art in three dimensions: (1) they automatically tune the headroom factor, (2) they honor the delay target, and (3) they are applicable to complex query networks with shared operators. We implemented DILoS, ALoMa and SEaMLeSS in our real DSMS prototype system (AQSIOS) and systematically evaluate their performance using real and synthetic workloads. Our experimental evaluation of ALoMa and SEaMLeSS verified their advantages over the state-of-the-art approaches. Our evaluation of DILoS showed that it (a) allows the scheduler and load shedder to consistently honor CQs’ priorities, (b) significantly increases system capacity utilization by exploiting batch processing, and (c) enables operator sharing among query classes of different priorities while avoiding priority inversion. To further support differentiated QoS and QoD for CQs in distributed DSMSs, we propose ARMaDILoS, a conceptual framework for large scale adaptive resource management using DILoS. A fundamental component in ARMaDILoS is CQ migration. For this reason, we propose and implement UniMiCo, a protocol to migrate CQs without interrupting the execution of the queries. Our experiments showed that UniMiCo produced correct outputs and did not introduce any hiccup in the response time of the queries.
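    As a toy illustration of the priority-aware load shedding described above (not AQSIOS, ALoMa, or SEaMLeSS code), the sketch below raises per-class drop probabilities whenever the observed queuing delay exceeds a delay target, shedding the lowest-priority query class first so that higher-priority CQs keep their QoS for as long as possible. All names and the adjustment step are hypothetical.

```python
import random

class PriorityLoadShedder:
    """Toy delay-target load shedder: when the observed queuing delay
    exceeds the target, the drop probability of the lowest-priority
    query class is raised first; when load subsides, shedding is
    relaxed again."""

    def __init__(self, delay_target, classes):
        self.delay_target = delay_target
        self.classes = classes                      # lowest priority first
        self.drop_prob = {c: 0.0 for c in classes}

    def adjust(self, observed_delay, step=0.1):
        overloaded = observed_delay > self.delay_target
        for c in self.classes:
            if overloaded and self.drop_prob[c] < 1.0:
                self.drop_prob[c] = min(1.0, self.drop_prob[c] + step)
                return                              # shed one class at a time
            if not overloaded and self.drop_prob[c] > 0.0:
                self.drop_prob[c] = max(0.0, self.drop_prob[c] - step)
                return

    def admit(self, query_class, tup):
        # returns None for tuples that are shed
        return None if random.random() < self.drop_prob[query_class] else tup

# usage: delay above the 50 ms target sheds class "C" (lowest priority) first
shedder = PriorityLoadShedder(delay_target=0.050, classes=["C", "B", "A"])
shedder.adjust(observed_delay=0.080)
print(shedder.drop_prob)   # {'C': 0.1, 'B': 0.0, 'A': 0.0}
```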