12 research outputs found

    Parallelizing Windowed Stream Joins in a Shared-Nothing Cluster

    The availability of a large number of processing nodes in a parallel and distributed computing environment enables sophisticated real-time processing over high-speed data streams, as required by many emerging applications. Sliding window stream joins are among the most important operators in a stream processing system. In this paper, we consider the issue of parallelizing a sliding window stream join operator over a shared-nothing cluster. We propose a framework, based on a fixed or predefined communication pattern, to distribute the join processing load over the shared-nothing cluster. We consider various overheads that arise while scaling over a large number of nodes, and propose solution methodologies to cope with them. We implement the algorithm over a cluster using a message passing system, and present experimental results showing the effectiveness of the join processing algorithm. Comment: 11 page
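    For readers unfamiliar with the operator, a sliding window stream join can be sketched on a single node as a symmetric hash join. This is only an illustration under stated assumptions (an equi-join on a key, a time-based window, non-decreasing timestamps); it is not the distributed framework or communication pattern proposed in the paper, and the class and method names are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowJoin:
    """Symmetric hash join over two streams, "R" and "S" (illustrative only).

    Assumes an equi-join on `key`, a time-based window of length `window`,
    and tuples arriving with non-decreasing timestamps.
    """

    def __init__(self, window):
        self.window = window
        # One hash index per stream: key -> deque of (timestamp, value).
        self.index = {"R": defaultdict(deque), "S": defaultdict(deque)}

    def _expire(self, side, key, now):
        # Drop tuples that have slid out of the window. Expiry is lazy:
        # a bucket is only cleaned when its key is probed again.
        bucket = self.index[side][key]
        while bucket and bucket[0][0] < now - self.window:
            bucket.popleft()

    def insert(self, side, ts, key, value):
        """Insert a tuple and return the join results it produces."""
        other = "S" if side == "R" else "R"
        self._expire(other, key, ts)
        partners = [v for _, v in self.index[other][key]]
        # Results are always ordered as (R value, S value).
        if side == "R":
            matches = [(value, v) for v in partners]
        else:
            matches = [(v, value) for v in partners]
        self.index[side][key].append((ts, value))
        return matches
```

    Each arriving tuple first probes the opposite stream's index and is then stored in its own, so every matching pair within the window is reported exactly once, when the later of the two tuples arrives.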

    DRS: Dynamic Resource Scheduling for Real-Time Analytics over Fast Streams

    In a data stream management system (DSMS), users register continuous queries, and receive result updates as data arrive and expire. We focus on applications with real-time constraints, in which the user must receive each result update within a given period after the update occurs. To handle fast data, the DSMS is commonly placed on top of a cloud infrastructure. Because stream properties such as arrival rates can fluctuate unpredictably, cloud resources must be dynamically provisioned and scheduled accordingly to ensure real-time response. It is essential for existing and future systems to be able to schedule resources dynamically according to the current workload, in order to avoid wasting resources or failing to deliver correct results on time. Motivated by this, we propose DRS, a novel dynamic resource scheduler for cloud-based DSMSs. DRS overcomes three fundamental challenges: (a) how to model the relationship between the provisioned resources and query response time; (b) where to best place resources; and (c) how to measure system load with minimal overhead. In particular, DRS includes an accurate performance model based on the theory of Jackson open queueing networks and is capable of handling arbitrary operator topologies, possibly with loops, splits and joins. Extensive experiments with real data confirm that DRS achieves real-time response with close to optimal resource consumption. Comment: This is our latest version with certain modificatio
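    As background on the queueing theory named above, the following is a minimal sketch of how an open Jackson network yields a mean response time. It assumes every operator is a single-server M/M/1 queue, which is a simplification chosen here and not necessarily the model used in DRS. The function names and the inputs `gamma` (external arrival rates), `P` (routing probabilities) and `mu` (service rates) are hypothetical.

```python
def arrival_rates(gamma, P, tol=1e-12, max_iter=100000):
    """Solve the traffic equations lam_j = gamma_j + sum_i lam_i * P[i][j].

    Uses fixed-point iteration, which converges for an open network
    (every tuple eventually leaves the system).
    """
    n = len(gamma)
    lam = list(gamma)
    for _ in range(max_iter):
        new = [gamma[j] + sum(lam[i] * P[i][j] for i in range(n))
               for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, lam)) < tol:
            return new
        lam = new
    raise RuntimeError("traffic equations did not converge")


def mean_sojourn_time(gamma, P, mu):
    """Mean end-to-end time per external arrival, assuming M/M/1 operators."""
    lam = arrival_rates(gamma, P)
    if any(l >= m for l, m in zip(lam, mu)):
        raise ValueError("an operator is overloaded (lambda >= mu)")
    # Mean number of tuples at an M/M/1 node is lam / (mu - lam);
    # Little's law over the whole network converts that into time.
    total_in_system = sum(l / (m - l) for l, m in zip(lam, mu))
    return total_in_system / sum(gamma)
```

    For a two-operator pipeline with `gamma=[2, 0]`, `P=[[0, 1], [0, 0]]` and `mu=[4, 5]`, both operators see an arrival rate of 2, and the estimate is 1/(4-2) + 1/(5-2). A loop is expressed by a nonzero entry such as `P[0][0]`, which raises the effective arrival rate at that operator.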

    Exploring run-time reduction in programming codes via query optimization and caching

    Object-oriented programming languages have raised the level of abstraction by supporting explicit first-class query constructs in program code. These query constructs allow programmers to express operations on collections more abstractly than relying on their realization in loops or through provided libraries. Join optimization techniques from the field of database technology support efficient realizations of such language constructs. However, existing techniques, such as query optimization in the Java Query Language (JQL), incur run-time overhead. Besides programming languages supporting first-class query constructs, the use of annotations has also increased in the software engineering community recently. Annotations are a common means of providing metadata to the source code. C# provides attribute constraints, and Java has its own annotation constructs, which allow developers to include metadata in program code. This work introduces a series of query optimization approaches to reduce the run time of programs involving explicit queries over collections. The proposed approaches rely on histograms to estimate the selectivity of the predicates and the joins in order to construct the query plans. The annotations in the source code are also used to gather the metadata required for the selectivity estimation of numerical as well as string-valued predicates and joins in the queries. Several cache heuristics are proposed that effectively cache the results of repeated queries in the program code. The cached query results are incrementally kept up to date after update operations on the collections. (Abstract, page iv)
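    To illustrate histogram-based selectivity estimation in general terms, here is a small sketch for a numerical predicate of the form `x < c`. It assumes an equi-width histogram and uniformly distributed values inside each bucket. These choices and the function names are assumptions made for this example, not the thesis's actual design.

```python
def build_histogram(values, num_buckets):
    """Equi-width histogram: returns (lowest value, bucket width, counts)."""
    lo, hi = min(values), max(values)
    width = ((hi - lo) / num_buckets) or 1.0  # avoid zero width if all equal
    counts = [0] * num_buckets
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        i = min(int((v - lo) / width), num_buckets - 1)
        counts[i] += 1
    return lo, width, counts


def selectivity_less_than(hist, c):
    """Estimated fraction of values satisfying x < c."""
    lo, width, counts = hist
    covered = 0.0
    for i, n in enumerate(counts):
        b_lo = lo + i * width
        b_hi = b_lo + width
        if c >= b_hi:
            covered += n  # bucket lies entirely below c
        elif c > b_lo:
            # Partial bucket: assume values are spread uniformly inside it.
            covered += n * (c - b_lo) / width
    return covered / sum(counts)
```

    An optimizer can use such estimates to order predicates and joins so that the most selective ones run first, which is the general role the abstract assigns to histograms.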

    Exploring time related issues in data stream processing


    Scalable and responsive real time event processing using cloud computing

    PhD Thesis. Cloud computing provides the potential for scalability and adaptability in a cost-effective manner. However, when it comes to achieving scalability for real-time applications, response time cannot be high. Many applications require good performance and low response time, which need to be matched with dynamic resource allocation. Real-time processing requirements can also be characterized by unpredictable rates of incoming data streams and dynamic bursts of data. This raises the issue of processing the data streams across multiple cloud computing nodes. This research analyzes possible methodologies for processing real-time data in which applications can be structured as multiple event processing networks and partitioned over the set of available cloud nodes. The approach applies queuing theory principles to cloud computing. The transformation of the raw data into useful outputs occurs in various stages of processing networks which are distributed across multiple computing nodes in a cloud. A set of valid options is created to understand the response time requirements for each application. Under a given valid set of conditions to meet the response time criteria, multiple instances of event processing networks are distributed across the cloud nodes. A generic methodology to scale up and scale down the event processing networks in accordance with the response time criteria is defined. Real-time applications that support sophisticated decision support mechanisms need to comply with response time criteria consisting of interdependent data flow paradigms, making it harder to improve performance. Consideration is given to ways to reduce latency and improve the response time and throughput of real-time applications by distributing the event processing networks across multiple computing nodes.