10 research outputs found

    BriskStream: Scaling Data Stream Processing on Shared-Memory Multicore Architectures

    We introduce BriskStream, an in-memory data stream processing system (DSPS) specifically designed for modern shared-memory multicore architectures. BriskStream's key contribution is an execution plan optimization paradigm, namely RLAS, which takes the relative location (i.e., NUMA distance) of each pair of producer-consumer operators into consideration. We propose a branch-and-bound-based approach with three heuristics to resolve the resulting nontrivial optimization problem. The experimental evaluations demonstrate that BriskStream yields much higher throughput and better scalability than existing DSPSs on multi-core architectures when processing different types of workloads. Comment: To appear in SIGMOD'1

    Continuous Workflows: From Model to Enactment System

    Workflows are actively being used in both business and scientific domains to automate processes and facilitate collaboration. A workflow management (or enactment) system (WfMS) defines, creates and manages the execution of workflows on one or more workflow engines, which are able to interpret workflow definitions, allocate resources, interact with workflow participants and, where required, invoke the needed tools (e.g., databases, job schedulers, etc.) and applications. Traditional WfMSs and workflow design processes view the workflow as a one-time interaction with the various data sources, i.e., when a workflow is invoked, its steps are executed once and in-order. The fundamental underlying assumption has been that data sources are passive and all interactions are structured along the request/reply (query) model. Hence, traditional WfMS cannot effectively support business or scientific monitoring applications that require the processing of data streams such as those generated by sensing devices as well as mobile and web applications. It is the hypothesis of this dissertation that Workflow Management Systems can be extended to support data stream semantics to enable monitoring applications. This includes the ability to apply flexible bounds on unbounded data streams and the ability to facilitate on-the-fly processing of bounded bundles of data (window semantics). To support this hypothesis this dissertation has produced new specifications, a design, an implementation and a thorough evaluation of a novel Continuous Workflows (CWf) model, which is backwards compatible with currently available workflow models. The CWf model was implemented in a CONtinuous workFLow ExeCution Engine, CONFLuEnCE, as an extension of Kepler, which is a popular scientific WfMS. 
    The applicability of the CWf model in both scientific and business applications was demonstrated by utilizing CONFLuEnCE in Astroshelf to support live annotations (i.e., monitoring of astronomical data), and to support supply chain monitoring and management. The implementation of CONFLuEnCE led to the realization that different applications have different performance requirements and hence an integrated workflow scheduling framework is essential. Towards meeting this need, STAFiLOS, a Stream FLOw Scheduling framework for Continuous Workflows, was designed and implemented within CONFLuEnCE. The performance of STAFiLOS was evaluated using the Linear Road Benchmark for continuous workflows.

    Scalable and fault-tolerant data stream processing on multi-core architectures

    With increasing data volumes and velocity, many applications are shifting from the classical “process-after-store” paradigm to a stream processing model: data is produced and consumed as continuous streams. Stream processing captures latency-sensitive applications as diverse as credit card fraud detection and high-frequency trading. These applications are expressed as queries of algebraic operations (e.g., aggregation) over the most recent data using windows, i.e., finite evolving views over the input streams. To guarantee correct results, streaming applications require precise window semantics (e.g., temporal ordering) for operations that maintain state. While high processing throughput and low latency are performance desiderata for stateful streaming applications, achieving both poses challenges. Computing the state of overlapping windows causes redundant aggregation operations: incremental execution (i.e., reusing previous results) reduces latency but prevents parallelization; at the same time, parallelizing window execution for stateful operations with precise semantics demands ordering guarantees and state access coordination. Finally, streams and state must be recovered to produce consistent and repeatable results in the event of failures. Given the rise of shared-memory multi-core CPU architectures and high-speed networking, we argue that it is possible to address these challenges in a single node without compromising window semantics, performance, or fault-tolerance. In this thesis, we analyze, design, and implement stream processing engines (SPEs) that achieve high performance on multi-core architectures. 
    To this end, we introduce new approaches for in-memory processing that address the previous challenges: (i) for overlapping windows, we provide a family of window aggregation techniques that enable computation sharing based on the algebraic properties of aggregation functions; (ii) for parallel window execution, we balance parallelism and incremental execution by developing abstractions for both and combining them into a novel design; and (iii) for reliable single-node execution, we enable strong fault-tolerance guarantees without sacrificing performance by reducing the required disk I/O bandwidth using a novel persistence model. We combine the above to implement an SPE that processes hundreds of millions of tuples per second with sub-second latencies. These results reveal the opportunity to reduce resource and maintenance footprint by replacing cluster-based SPEs with single-node deployments.
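
    The computation sharing for overlapping windows mentioned in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes count-based windows, a sum aggregate, and a generic pane-based scheme in which the stream is cut into non-overlapping panes whose partial sums are reused by every window that covers them. The function names and the `size`/`slide` parameters are hypothetical.

```python
from math import gcd


def sliding_sums_naive(values, size, slide):
    """Recompute every window from scratch (redundant work on overlaps)."""
    return [sum(values[s:s + size])
            for s in range(0, len(values) - size + 1, slide)]


def sliding_sums_shared(values, size, slide):
    """Share work across overlapping windows via per-pane partial sums.

    Assumption: sum is associative, so a window's result can be rebuilt
    by combining the partial sums of the panes it covers.
    """
    pane = gcd(size, slide)  # largest pane that tiles both size and slide
    # One partial aggregate per complete, non-overlapping pane.
    partials = [sum(values[i:i + pane])
                for i in range(0, len(values) - pane + 1, pane)]
    panes_per_window = size // pane
    panes_per_slide = slide // pane
    return [sum(partials[p:p + panes_per_window])
            for p in range(0, len(partials) - panes_per_window + 1,
                           panes_per_slide)]


if __name__ == "__main__":
    data = list(range(1, 11))  # 1, 2, ..., 10
    print(sliding_sums_naive(data, size=4, slide=2))
    print(sliding_sums_shared(data, size=4, slide=2))
```

    Both functions return the same window results; the shared version touches each input value once and then combines `size // pane` partial sums per window instead of `size` raw values.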

    Stateful data-parallel processing

    Democratisation of data means that more people than ever are involved in the data analysis process. This is beneficial, since it brings domain-specific knowledge from broad fields, but data scientists do not have adequate tools to write algorithms and execute them at scale. Processing models of current data-parallel processing systems, designed for scalability and fault tolerance, are stateless. Stateless processing facilitates capturing parallelisation opportunities and hides fault tolerance. However, data scientists want to write stateful programs (with explicit state that they can update, such as matrices in machine learning algorithms) and are used to imperative-style languages. These programs struggle to execute with high performance in stateless data-parallel systems. Representing state explicitly makes data-parallel processing at scale challenging. To achieve scalability, state must be distributed and coordinated across machines. In the event of failures, state must be recovered to provide correct results. We introduce stateful data-parallel processing that addresses the previous challenges by: (i) representing state as a first-class citizen so that a system can manipulate it; (ii) introducing two distributed mutable state abstractions for scalability; and (iii) an integrated approach to scale out and fault tolerance that recovers large state spanning the memory of multiple machines. To support imperative-style programs, a static analysis tool analyses Java programs that manipulate state and translates them to a representation that can execute on SEEP, an implementation of a stateful data-parallel processing model. SEEP is evaluated with stateful Big Data applications and shows comparable or better performance than state-of-the-art stateless systems.

    Task Scheduling in Data Stream Processing Systems

    In the era of big data, with streaming applications such as social media, surveillance monitoring and real-time search generating large volumes of data, efficient Data Stream Processing Systems (DSPSs) have become essential. When designing an efficient DSPS, a number of challenges need to be considered, including task allocation, scalability, fault tolerance, QoS, parallelism degree, and state management, among others. In our research, we focus on task allocation as it has a significant impact on performance metrics such as data processing latency and system throughput. An application processed by DSPSs is represented as a Directed Acyclic Graph (DAG), where each vertex represents a task and the edges show the dataflow between the tasks. Task allocation can be defined as the assignment of the vertices in the DAG to the physical compute nodes such that the data movement between the nodes is minimised. Finding an optimal task placement for stream processing systems is NP-hard. Thus, approximate scheduling approaches are required to improve the performance of DSPSs. In this thesis, we present our three proposed schedulers, each having a different heuristic partitioning approach to minimise inter-node communication for either homogeneous or heterogeneous clusters. We demonstrate how each scheduler can efficiently assign groups of highly communicating tasks to compute nodes. Our schedulers are able to outperform two state-of-the-art schedulers for three micro-benchmarks and two real-world applications, increasing throughput and reducing data processing latency as a result of a better task placement.
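
    The task allocation objective described in this abstract can be made concrete with a small sketch. It is not one of the thesis's three schedulers: it assumes a hypothetical DAG with known edge rates, homogeneous nodes with a fixed task capacity, and a simple greedy heuristic that co-locates the endpoints of the heaviest edges first. All names and numbers are illustrative, and the greedy step assumes `capacity * len(nodes) >= len(tasks)`.

```python
def inter_node_traffic(edges, placement):
    """Sum the rates of DAG edges whose endpoints sit on different nodes."""
    return sum(rate for (src, dst), rate in edges.items()
               if placement[src] != placement[dst])


def greedy_placement(tasks, edges, nodes, capacity):
    """Visit edges from heaviest to lightest and try to co-locate endpoints.

    Assumption: capacity * len(nodes) >= len(tasks), so a node with spare
    capacity always exists when a task needs a fallback location.
    """
    placement = {}
    load = {node: 0 for node in nodes}

    def put(task, node):
        placement[task] = node
        load[node] += 1

    def least_loaded():
        return min(nodes, key=lambda node: load[node])

    for (src, dst), _rate in sorted(edges.items(), key=lambda kv: -kv[1]):
        for task, peer in ((src, dst), (dst, src)):
            if task in placement:
                continue
            target = placement.get(peer)
            # Fall back to the least-loaded node if the peer is unplaced
            # or its node is already full.
            if target is None or load[target] >= capacity:
                target = least_loaded()
            put(task, target)
    for task in tasks:  # tasks that appear in no edge
        if task not in placement:
            put(task, least_loaded())
    return placement


if __name__ == "__main__":
    tasks = ["source", "parse", "aggregate", "sink"]
    edges = {("source", "parse"): 100,
             ("parse", "aggregate"): 10,
             ("aggregate", "sink"): 80}
    nodes = ["n1", "n2"]
    round_robin = {t: nodes[i % len(nodes)] for i, t in enumerate(tasks)}
    greedy = greedy_placement(tasks, edges, nodes, capacity=2)
    print(inter_node_traffic(edges, round_robin))
    print(inter_node_traffic(edges, greedy))
```

    In this toy pipeline the greedy placement keeps the two heaviest edges inside a node, so only the light `parse` to `aggregate` edge crosses the network, whereas a round-robin placement sends every edge across nodes.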

    Policy research working papers : catalog of numbers 801-1200

    This paper contains a numerical listing of working papers produced by the Central Vicepresidencies. Each citation contains a brief abstract, and the contact point for the paper. Environmental Economics & Policies, Economic Theory & Research, Banks & Banking Reform, Poverty Assessment, Health Economics & Finance

    Design, implementation, and evaluation of the linear road benchmark on the stream processing core
