8 research outputs found

    DRS: Dynamic Resource Scheduling for Real-Time Analytics over Fast Streams

    Full text link
    In a data stream management system (DSMS), users register continuous queries, and receive result updates as data arrive and expire. We focus on applications with real-time constraints, in which the user must receive each result update within a given period after the update occurs. To handle fast data, the DSMS is commonly placed on top of a cloud infrastructure. Because stream properties such as arrival rates can fluctuate unpredictably, cloud resources must be dynamically provisioned and scheduled accordingly to ensure real-time response. It is quite essential, for the existing systems or future developments, to possess the ability of scheduling resources dynamically according to the current workload, in order to avoid wasting resources, or failing in delivering correct results on time. Motivated by this, we propose DRS, a novel dynamic resource scheduler for cloud-based DSMSs. DRS overcomes three fundamental challenges: (a) how to model the relationship between the provisioned resources and query response time (b) where to best place resources; and (c) how to measure system load with minimal overhead. In particular, DRS includes an accurate performance model based on the theory of \emph{Jackson open queueing networks} and is capable of handling \emph{arbitrary} operator topologies, possibly with loops, splits and joins. Extensive experiments with real data confirm that DRS achieves real-time response with close to optimal resource consumption.Comment: This is the our latest version with certain modificatio

    Generic windowing support for extensible stream processing systems

    Get PDF
    Cataloged from PDF version of article.Stream processing applications process high volume, continuous feeds from live data sources, employ data-in-motion analytics to analyze these feeds, and produce near real-time insights with low latency. One of the fundamental characteristics of such applications is the on-the-fly nature of the computation, which does not require access to disk resident data. Stream processing applications store the most recent history of streams in memory and use it to perform the necessary modeling and analysis tasks. This recent history is often managed using windows. All data stream management systems provide some form of windowing functionality. Windowing makes it possible to implement streaming versions of the traditionally blocking relational operators, such as streaming aggregations, joins, and sorts, as well as any other analytic operator that requires keeping the most recent tuples as state, such as time series analysis operators and signal processing operators. In this paper, we provide a categorization of different window types and policies employed in stream processing applications and give detailed operational semantics for various window configurations. We describe an extensibility mechanism that makes it possible to integrate windowing support into user-defined operators, enabling consistent syntax and semantics across system-provided and third-party toolkits of streaming operators. We describe the design and implementation of a runtime windowing library that significantly simplifies the construction of window-based operators by decoupling the handling of window policies and operator logic from each other. We present our experience using the windowing library to implement a relational operators toolkit and compare the efficacy of the solution to an earlier implementation that did not employ a common windowing library. Copyright (c) 2013 John Wiley & Sons, Ltd

    StreamCloud: An elastic and scalable data streaming system

    Get PDF
    Many applications in several domains such as telecommunications, network security, large scale sensor networks, require online processing of continuous data lows. They produce very high loads that requires aggregating the processing capacity of many nodes. Current Stream Processing Engines do not scale with the input load due to single-node bottlenecks. Additionally, they are based on static con?gurations that lead to either under or over-provisioning. In this paper, we present StreamCloud, a scalable and elastic stream processing engine for processing large data stream volumes. StreamCloud uses a novel parallelization technique that splits queries into subqueries that are allocated to independent sets of nodes in a way that minimizes the distribution overhead. Its elastic protocols exhibit low intrusiveness, enabling effective adjustment of resources to the incoming load. Elasticity is combined with dynamic load balancing to minimize the computational resources used. The paper presents the system design, implementation and a thorough evaluation of the scalability and elasticity of the fully implemented system

    A Game-Theoretic Approach for Elastic Distributed Data Stream Processing

    Get PDF
    Distributed data stream processing applications are structured as graphs of interconnected modules able to ingest high-speed data and to transform them in order to generate results of interest. Elasticity is one of the most appealing features of stream processing applications. It makes it possible to scale up/down the allocated computing resources on demand in response to fluctuations of the workload. On clouds, this represents a necessary feature to keep the operating cost at affordable levels while accommodating user-defined QoS requirements. In this article, we study this problem from a game-theoretic perspective. The control logic driving elasticity is distributed among local control agents capable of choosing the right amount of resources to use by each module. In a first step, we model the problem as a noncooperative game in which agents pursue their self-interest. We identify the Nash equilibria and we design a distributed procedure to reach the best equilibrium in the Pareto sense. As a second step, we extend the noncooperative formulation with a decentralized incentive-based mechanism in order to promote cooperation by moving the agreement point closer to the system optimum. Simulations confirm the results of our theoretical analysis and the quality of our strategies

    Task Scheduling in Data Stream Processing Systems

    Get PDF
    In the era of big data, with streaming applications such as social media, surveillance monitoring and real-time search generating large volumes of data, efficient Data Stream Processing Systems (DSPSs) have become essential. When designing an efficient DSPS, a number of challenges need to be considered including task allocation, scalability, fault tolerance, QoS, parallelism degree, and state management, among others. In our research, we focus on task allocation as it has a significant impact on performance metrics such as data processing latency and system throughput. An application processed by DSPSs is represented as a Directed Acyclic Graph (DAG), where each vertex represents a task and the edges show the dataflow between the tasks. Task allocation can be defined as the assignment of the vertices in the DAG to the physical compute nodes such that the data movement between the nodes is minimised. Finding an optimal task placement for stream processing systems is NP-hard. Thus, approximate scheduling approaches are required to improve the performance of DSPSs. In this thesis, we present our three proposed schedulers, each having a different heuristic partitioning approach to minimise inter-node communication for either homogeneous or heterogeneous clusters. We demonstrate how each scheduler can efficiently assign groups of highly communicating tasks to compute nodes. Our schedulers are able to outperform two state-of-the-art schedulers for three micro-benchmarks and two real-world applications, increasing throughput and reducing data processing latency as a result of a better task placement

    Parallel Patterns for Adaptive Data Stream Processing

    Get PDF
    In recent years our ability to produce information has been growing steadily, driven by an ever increasing computing power, communication rates, hardware and software sensors diffusion. This data is often available in the form of continuous streams and the ability to gather and analyze it to extract insights and detect patterns is a valuable opportunity for many businesses and scientific applications. The topic of Data Stream Processing (DaSP) is a recent and highly active research area dealing with the processing of this streaming data. The development of DaSP applications poses several challenges, from efficient algorithms for the computation to programming and runtime systems to support their execution. In this thesis two main problems will be tackled: * need for high performance: high throughput and low latency are critical requirements for DaSP problems. Applications necessitate taking advantage of parallel hardware and distributed systems, such as multi/manycores or cluster of multicores, in an effective way; * dynamicity: due to their long running nature (24hr/7d), DaSP applications are affected by highly variable arrival rates and changes in their workload characteristics. Adaptivity is a fundamental feature in this context: applications must be able to autonomously scale the used resources to accommodate dynamic requirements and workload while maintaining the desired Quality of Service (QoS) in a cost-effective manner. In the current approaches to the development of DaSP applications are still missing efficient exploitation of intra-operator parallelism as well as adaptations strategies with well known properties of stability, QoS assurance and cost awareness. These are the gaps that this research work tries to fill, resorting to well know approaches such as Structured Parallel Programming and Control Theoretic models. The dissertation runs along these two directions. The first part deals with intra-operator parallelism. A DaSP application can be naturally expressed as a set of operators (i.e. intermediate computations) that cooperate to reach a common goal. If QoS requirements are not met by the current implementation, bottleneck operators must be internally parallelized. We will study recurrent computations in window based stateful operators and propose patterns for their parallel implementation. Windowed operators are the most representative class of stateful data stream operators. Here computations are applied on the most recent received data. Windows are dynamic data structures: they evolve over time in terms of content and, possibly, size. Therefore, with respect to traditional patterns, the DaSP domain requires proper specializations and enhanced features concerning data distribution and management policies for different windowing methods. A structured approach to the problem will reduce the effort and complexity of parallel programming. In addition, it simplifies the reasoning about the performance properties of a parallel solution (e.g. throughput and latency). The proposed patterns exhibit different properties in terms of applicability and profitability that will be discussed and experimentally evaluated. The second part of the thesis is devoted to the proposal and study of predictive strategies and reconfiguration mechanisms for autonomic DaSP operators. Reconfiguration activities can be implemented in a transparent way to the application programmer thanks to the exploitation of parallel paradigms with well known structures. Furthermore, adaptation strategies may take advantage of the QoS predictability of the used parallel solution. Autonomous operators will be driven by means of a Model Predictive Control approach, with the intent of giving QoS assurances in terms of throughput or latency in a resource-aware manner. An experimental section will show the effectiveness of the proposed approach in terms of execution costs reduction as well as the stability degree of a system reconfiguration. The experiments will target shared and distributed memory architectures

    Distributed Contextual Anomaly Detection from Big Event Streams

    Get PDF
    The age of big digital data is emerged and the size of generating data is rapidly increasing in a millisecond through the Internet of Things (IoT) and Internet of Everything (IoE) objects. Specifically, most of today’s available data are generated in a form of streams through different applications including sensor networks, bioinformatics, smart airport, smart highway traffic, smart home applications, e-commerce online shopping, and social media streams. In this context, processing and mining such high volume of data stream becomes one of the research priority concern and challenging tasks. On the one hand, processing high volumes of streaming data with low-latency response is a critical concern in most of the real-time application before the important information can be missed or disregarded. On the other hand, detecting events from data stream is becoming a new research challenging task since the existing traditional anomaly detection method is mainly focusing on; a) limited size of data, b) centralised detection with limited computing resource, and c) specific anomaly detection types of either point or collective rather than the Contextual behaviour of the data. Thus, detecting Contextual events from high sequence volume of data stream is one of the research concerns to be addressed in this thesis. As the size of IoT data stream is scaled up to a high volume, it is impractical to propose existing processing data structure and anomaly detection method. This is due to the space, time and the complexity of the existing data processing model and learning algorithms. In this thesis, a novel distributed anomaly detection method and algorithm is proposed to detect Contextual behaviours from the sequence of bounded streams. Capturing event streams and partitioning them over several windows to control the high rate of event streams mainly base on, the proposed solution firstly. Secondly, by proposing a parallel and distributed algorithm to detect Contextual anomalous event. The experimental results are evaluated based on the algorithm’s performances, processing low-latency response, and detecting Contextual anomalous behaviour accuracy rate from the event streams. Finally, to address scalability concerned of the Contextual events, appropriate computational metrics are proposed to measure and evaluate the processing latency of distributed method. The achieved result is evidenced distributed detection is effective in terms of learning from high volumes of streams in real-time
    corecore