
    Distributed Stream Filtering for Database Applications

    Distributed stream filtering is a mechanism for implementing a new class of real-time applications with distributed processing requirements. These applications require scalable architectures to support the efficient processing and multiplexing of large volumes of continuously generated data. This paper provides an overview of a stream-oriented model for database query processing and presents a supporting implementation. To facilitate distributed stream filtering, we introduce several new query processing operations, including pipelined filtering, which efficiently joins and eliminates duplicates from database streams, and a new join method, the progressive join, which joins streams of tuples. Finally, recognizing that the stream-oriented model results in performance tradeoffs that differ significantly from those in traditional databases, we present a new query optimization strategy specifically designed for stream-oriented databases.
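
    The abstract introduces the progressive join without describing its mechanics, so here is a minimal, self-contained sketch of the standard way to join two unbounded tuple streams incrementally, a symmetric hash join; the stream contents, key functions, and round-robin interleaving below are illustrative assumptions, not the paper's actual operators.

```python
from collections import defaultdict
from itertools import zip_longest

def interleave(a, b):
    """Naive round-robin merge of two finite iterables, tagged by origin.
    A real stream engine would be event-driven instead."""
    for x, y in zip_longest(a, b):
        if x is not None:
            yield ("A", x)
        if y is not None:
            yield ("B", y)

def progressive_join(stream_a, stream_b, key_a, key_b):
    """Symmetric hash join sketch: emit join results as soon as matching
    tuples have arrived, without waiting for either input to finish."""
    table_a = defaultdict(list)  # join key -> tuples seen so far from stream A
    table_b = defaultdict(list)  # join key -> tuples seen so far from stream B
    for side, tup in interleave(stream_a, stream_b):
        if side == "A":
            k = key_a(tup)
            table_a[k].append(tup)
            for match in table_b[k]:      # probe the other side's table
                yield (tup, match)
        else:
            k = key_b(tup)
            table_b[k].append(tup)
            for match in table_a[k]:
                yield (match, tup)

# Illustrative data: join click and purchase streams on a user id.
clicks = [("u1", "ad1"), ("u2", "ad2")]
buys = [("u1", "item9")]
print(list(progressive_join(clicks, buys, key_a=lambda t: t[0], key_b=lambda t: t[0])))
# [(('u1', 'ad1'), ('u1', 'item9'))]
```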

    Dynamic Scaling of Parallel Stream Joins on the Cloud

    The large and varying volumes of data generated by many emerging applications and systems demand sophisticated processing of high-speed data streams in real time. The stream join is the streaming counterpart of the conventional database join and compares tuples coming from different streaming relations. This operator is computationally expensive and at the same time quite important for real-time analytics. Efficient and scalable processing of stream joins can be enabled by the availability of a large number of processing nodes in a parallel and distributed environment. Furthermore, clouds have evolved into an appealing platform for large-scale data processing, mainly due to the concept of elasticity: virtual computing infrastructure can be leased on demand and used dynamically for as long as it is needed. For this thesis project, we adopt the main ideas and features of Qian Lin et al. in their paper "Scalable Distributed Stream Join Processing". The basic idea presented in that paper is the join-biclique model, which organizes the processing units of a cluster as a complete bipartite graph. Based on that idea, we developed and implemented a set of algorithms designed as containerized microservices, which perform stream join processing and can be scaled horizontally on demand. We ran our experiments on Google Container Engine using the Kubernetes orchestration platform and Docker containers.
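
    The join-biclique model is only named in the abstract; the toy sketch below illustrates the routing idea under one common reading of a complete bipartite organization: an incoming tuple is stored on exactly one unit of its own relation's side (chosen by hashing its join key) and probed against every unit of the opposite side. Unit names and the routing rule are assumptions for illustration, not the implementation from the thesis or from Lin et al.

```python
import hashlib

R_UNITS = [f"r-unit-{i}" for i in range(3)]  # processing units holding relation R state
S_UNITS = [f"s-unit-{i}" for i in range(3)]  # processing units holding relation S state

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def route(join_key: str, side: str):
    """Return (store_at, probe_at) for one incoming tuple.

    The tuple is stored on a single unit of its own side and probed against
    all units of the opposite side, mirroring the complete bipartite
    (join-biclique) organization of the cluster."""
    own, other = (R_UNITS, S_UNITS) if side == "R" else (S_UNITS, R_UNITS)
    store_at = own[_hash(join_key) % len(own)]
    probe_at = list(other)  # complete bipartite graph: every opposite-side unit
    return store_at, probe_at

# Example: an R-tuple with join key "user42"
print(route("user42", "R"))
```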

    Query Driven Operator Placement for Complex Event Detection over Data Streams

    We consider the problem of efficiently processing subscription queries over data streams in large-scale interconnected sensor networks. We propose a scalable algorithm for distributed data stream processing, applicable on top of any platform granting access to interconnected sensor networks. We use a probabilistic algorithm to check whether subscriptions are subsumed by other subscriptions and can thus be pruned for more efficient processing. Our proposed methods are query driven and hence do not replicate data streams, but intelligently place join operators inside the global network of sources. A performance evaluation using real-world sensor data shows the suitability of our approach.
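
    The probabilistic subsumption test itself is not given in the abstract; the sketch below only shows the underlying notion for simple range subscriptions: one subscription is subsumed by another when every tuple it matches is also matched by the other, so it can be served from the broader subscription's stream instead of being processed separately. The attribute-range data model here is an assumption for illustration.

```python
# A subscription is modelled as: attribute -> (low, high) inclusive range predicate.

def subsumes(outer: dict, inner: dict) -> bool:
    """True if every tuple matching `inner` also matches `outer`,
    i.e. `inner` can be answered from `outer`'s result stream."""
    for attr, (lo, hi) in outer.items():
        if attr not in inner:
            return False          # inner places no bound on this attribute
        in_lo, in_hi = inner[attr]
        if in_lo < lo or in_hi > hi:
            return False          # inner reaches outside outer's range
    return True

q_broad = {"temperature": (0, 100)}
q_narrow = {"temperature": (20, 30)}
print(subsumes(q_broad, q_narrow))  # True: q_narrow is subsumed and can be pruned
```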

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several follow-up works after its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been introduced to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions. Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors.
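
    As a reminder of the programming model the survey builds on, here is the canonical word-count example as a minimal Python sketch, with the shuffle step simulated in-process rather than performed by a distributed framework; it is meant only to show the map and reduce roles, not any particular system's API.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: sum all partial counts for one word."""
    return word, sum(counts)

def run_mapreduce(lines):
    # Shuffle: group intermediate values by key (handled by the framework in practice).
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(run_mapreduce(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```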

    Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management

    As users of big data applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the pay-as-you-go model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs, so systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures. Copyright © 2013 ACM.
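
    The abstract refers to state management primitives without listing them; the sketch below illustrates the general idea under assumed names: a stateful operator exposes its keyed state for periodic checkpointing, a checkpoint can restore a failed instance on a new VM, and it can be partitioned by key to scale the operator out. This is not the system's actual interface.

```python
class StatefulOperator:
    """Toy counting operator whose keyed state is externalised to the SPS."""

    def __init__(self, state=None):
        self.state = dict(state or {})  # key -> partial aggregate

    def process(self, key, value):
        self.state[key] = self.state.get(key, 0) + value

    # --- assumed state management primitives (sketch) ---
    def checkpoint(self):
        """Expose internal state so the SPS can back it up to an upstream VM."""
        return dict(self.state)

    @staticmethod
    def restore(checkpoint):
        """Recover a failed operator on a new VM from its last checkpoint."""
        return StatefulOperator(checkpoint)

    @staticmethod
    def partition(checkpoint, n):
        """Split checkpointed state by key hash to scale out to n parallel instances."""
        parts = [{} for _ in range(n)]
        for key, value in checkpoint.items():
            parts[hash(key) % n][key] = value
        return [StatefulOperator(p) for p in parts]
```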

    Quality-Driven Disorder Handling for M-way Sliding Window Stream Joins

    The sliding window join is one of the most important operators for stream applications. To produce high-quality join results, a stream processing system must deal with the ubiquitous disorder within input streams, which is caused by network delay, asynchronous source clocks, etc. Disorder handling involves an inevitable tradeoff between the latency and the quality of produced join results. To meet different requirements of stream applications, it is desirable to provide a user-configurable result-latency vs. result-quality tradeoff. Existing disorder handling approaches either do not provide such configurability, or support only user-specified latency constraints. In this work, we advocate the idea of quality-driven disorder handling and propose a buffer-based disorder handling approach for sliding window joins, which minimizes the sizes of input-sorting buffers, and thus the result latency, while respecting user-specified result-quality requirements. The core of our approach is an analytical model that directly captures the relationship between the sizes of the input buffers and the produced result quality. Our approach is generic: it supports m-way sliding window joins with arbitrary join conditions. Experiments on real-world and synthetic datasets show that, compared to the state of the art, our approach can reduce the result latency incurred by disorder handling by up to 95% while providing the same level of result quality. Comment: 12 pages, 11 figures, IEEE ICDE 201
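
    The paper's analytical model is not reproduced in the abstract; the sketch below shows only the basic mechanism it reasons about: out-of-order tuples are held in a bounded sorting buffer and released in timestamp order, so a larger buffer tolerates more disorder (higher result quality) at the cost of higher result latency. In the paper the buffer size would be derived from the quality requirement; here it is just a parameter.

```python
import heapq
import itertools

class SortingBuffer:
    """Hold up to `capacity` tuples and release them in timestamp order.
    A larger capacity tolerates more disorder but adds result latency."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []                 # min-heap ordered by timestamp
        self._seq = itertools.count()  # tie-breaker for equal timestamps

    def insert(self, timestamp, tup):
        """Buffer one tuple; return any tuples pushed out, oldest first."""
        heapq.heappush(self.heap, (timestamp, next(self._seq), tup))
        released = []
        while len(self.heap) > self.capacity:
            ts, _, t = heapq.heappop(self.heap)
            released.append((ts, t))
        return released

    def flush(self):
        """Release all remaining tuples in timestamp order (e.g. at stream end)."""
        out = []
        while self.heap:
            ts, _, t = heapq.heappop(self.heap)
            out.append((ts, t))
        return out
```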