
    Accelerating Event Stream Processing in On- and Offline Systems

    Due to a growing number of data producers and their ever-increasing data volumes, the ability to ingest, analyze, and store potentially never-ending streams of data is a mission-critical task in today's data processing landscape. A widespread form of data stream is the event stream, which consists of continuously arriving notifications about some real-world phenomenon. For example, a temperature sensor naturally generates an event stream by periodically measuring the temperature and reporting it, together with the measurement time, whenever it differs substantially from the previous measurement.

    In this thesis, we consider two kinds of event stream processing: online and offline. Online refers to processing events solely in main memory as soon as they arrive, while offline means processing event data previously persisted to non-volatile storage. Both modes are supported by widely used scale-out general-purpose stream processing engines (SPEs) such as Apache Flink or Spark Streaming. However, such engines suffer from two significant deficiencies that severely limit their processing performance. First, for offline processing, they load the entire stream from non-volatile secondary storage and replay all data items into the associated online engine in order of their original arrival. While this naturally ensures unified query semantics for on- and offline processing, the cost of reading the entire stream from non-volatile storage quickly dominates the overall processing cost. Second, modern SPEs focus on scaling out computations across the nodes of a cluster but use only a fraction of the available resources of individual nodes.

    This thesis tackles these problems with three approaches. First, we present novel techniques for the offline processing of two important query types: windowed aggregation and sequential pattern matching. Our methods use well-understood indexing techniques to reduce the amount of data read from non-volatile storage, and we show that this significantly improves overall query runtime. In particular, this thesis develops the first index-based algorithms for pattern queries expressed with the MATCH_RECOGNIZE clause, a new and powerful SQL language feature that has received little attention so far. Second, we show how to maximize the resource utilization of single nodes by exploiting the capabilities of modern hardware. To this end, we develop a prototypical shared-memory CPU-GPU-enabled event processing system that provides implementations of all major event processing operators (filtering, windowed aggregation, windowed join, and sequential pattern matching). Our experiments reveal that, in terms of resource utilization and processing throughput, such a hardware-enabled system is superior to hardware-agnostic general-purpose engines. Finally, we present TPStream, a new operator for pattern matching over temporal intervals. TPStream achieves low processing latency and, in contrast to sequential pattern matching, is easily parallelizable even for unpartitioned input streams, maximizing resource utilization, especially on modern multi-core CPUs.
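    To make the windowed-aggregation operator concrete, the following is a minimal Java sketch of a tumbling-window average over timestamped temperature events like those in the sensor example above. The types and names (Event, TumblingWindowAverage) are illustrative only and are not taken from the thesis's system; a real SPE additionally handles out-of-order events, watermarks, and state persistence.

```java
import java.util.ArrayList;
import java.util.List;

public class TumblingWindowAverage {
    /** A timestamped temperature reading, as in the sensor example. */
    record Event(long timestampMillis, double temperature) {}

    /** The aggregate emitted when a window closes. */
    record WindowResult(long windowStart, double avgTemperature, int count) {}

    private final long windowSizeMillis;
    private long currentWindowStart = -1;
    private double sum = 0;
    private int count = 0;

    public TumblingWindowAverage(long windowSizeMillis) {
        this.windowSizeMillis = windowSizeMillis;
    }

    /** Feed events in arrival order; returns the closed window's result, if any. */
    public List<WindowResult> onEvent(Event e) {
        List<WindowResult> out = new ArrayList<>();
        // Align the event to the start of its tumbling window.
        long windowStart = e.timestampMillis() - (e.timestampMillis() % windowSizeMillis);
        if (currentWindowStart >= 0 && windowStart != currentWindowStart) {
            out.add(new WindowResult(currentWindowStart, sum / count, count));
            sum = 0;
            count = 0;
        }
        currentWindowStart = windowStart;
        sum += e.temperature();
        count++;
        return out;
    }
}
```

    In the online mode described above, every arriving event passes through such an operator; the thesis's offline contribution is, in essence, to use an index so that only the persisted events relevant to the queried windows need to be read back, rather than replaying the whole stream.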

    Implementation of a general purpose dataflow multiprocessor

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1988. By Gregory Michael Papadopoulos. GRSN 409671. Includes bibliographical references (leaves 151-155).

    The tasks with effects model for safe concurrency

    Today's state-of-the-art concurrent programming models either provide weak safety guarantees, making it easy to write code with subtle errors, or are limited in the class of programs they can express. I believe that a concurrent programming model should offer strong safety guarantees such as data race freedom, atomicity, and optional determinism, while being flexible enough to express the wide range of uses for concurrency in realistic programs and offering good performance and scalability. In my thesis research, I have defined a new concurrent programming model called tasks with effects (TWE) that is intended to fulfill these goals. The core unit of work in this model is a dynamically created task. The model's key feature is that each task has programmer-specified effects, and a run-time scheduler ensures that two tasks run concurrently only if they have non-interfering effects. Through the combination of statically verifying the declared effects of tasks and using an effect-aware run-time scheduler, the TWE model guarantees strong safety properties such as data race freedom and atomicity.

    I have implemented this programming model in a language called TWEJava and its accompanying runtime system. I have evaluated TWEJava's expressiveness by implementing several programs in it. This evaluation shows that it can express a variety of parallel and concurrent programs, including programs that combine unstructured concurrency driven by user input with structured parallelism for performance, a pattern that cannot be expressed by some more restrictive models for safe parallel programming. I describe the TWE programming model and provide a formal dynamic semantics for it. I also formalize the data flow analysis used to statically check effects in TWEJava programs; this analysis ensures that the effect of each operation is included within the covering effect at that point in the program. The combination of these static checks with the dynamic semantics provided by the run-time system gives rise to the TWE model's strong safety guarantees.

    To make the TWE model usable, particularly for programs with many fine-grain tasks, a scalable, high-performance scheduler is crucial. I have designed a highly scalable scheduling algorithm for tasks with effects. It uses a tree structure corresponding to the hierarchical memory descriptions used in effect specifications, allowing scheduling operations for effects on separate parts of the tree to proceed concurrently without explicitly checking such effects against each other. This allows high scalability while preserving the expressiveness afforded by hierarchical effect specifications. I describe the scalable scheduling algorithm in detail and prove that it correctly guarantees task isolation. I have evaluated this scheduler on six TWEJava programs, including programs using many fine-grain tasks. The evaluation shows that it gives significant speedups on all the programs, often comparable to versions of the programs written with low-level synchronization and parallelism constructs that provide no safety guarantees.

    As originally defined, the TWE model could not express algorithms in which the side effects of each task depend on dynamic data structures and cannot be expressed statically; existing systems that can handle such algorithms give substantially weaker safety guarantees than TWE. I developed an extension of the TWE model that enables it to be used for such algorithms. A new feature of the TWEJava language lets programmers specify a set of object references that define some of the effects of a task, to which references can be added dynamically as the task executes. The static covering-effect checks used by TWE are extended to remain sound in the presence of such dynamic effect sets, and the formal dynamic semantics of the TWE model are extended to support these dynamic effects. I describe and implement a run-time system for handling dynamic effects, including mechanisms to detect conflicts between them and to abort and retry tasks when such conflicts arise. I show that this system can express parallel algorithms in which the effects of a task can only be expressed dynamically. Preliminary performance experiments show that my dynamic effect system can deliver self-relative speedups on certain benchmarks, but the current version imposes substantial performance overheads. I believe these overheads can be significantly reduced, making TWE practical for programs that require dynamic effects; this is a focus of my ongoing and future work.
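    As an illustration of the model's central invariant, that two tasks may run concurrently only if their declared effects do not interfere, here is a deliberately simplified Java sketch. The names (Effect, Task, EffectAwareScheduler) are hypothetical; TWEJava's real scheduler organizes effects in a tree matching the hierarchical memory descriptions and avoids the naive pairwise checks used below.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** An effect names a memory region a task may read or write. */
record Effect(String region, boolean write) {
    /** Effects interfere if they touch the same region and at least one writes. */
    boolean interferesWith(Effect other) {
        return region.equals(other.region) && (write || other.write);
    }
}

/** A unit of work with programmer-declared effects. */
record Task(String name, Set<Effect> effects, Runnable body) {
    boolean interferesWith(Task other) {
        return effects.stream().anyMatch(
                e -> other.effects().stream().anyMatch(e::interferesWith));
    }
}

/** Runs a task only when it interferes with no currently running task. */
class EffectAwareScheduler {
    private final Set<Task> running = ConcurrentHashMap.newKeySet();
    private final Queue<Task> pending = new ArrayDeque<>();

    synchronized void submit(Task t) {
        pending.add(t);
        dispatch();
    }

    private synchronized void dispatch() {
        // Start every pending task whose effects are disjoint from all running tasks.
        pending.removeIf(t -> {
            if (running.stream().anyMatch(t::interferesWith)) return false;
            running.add(t);
            new Thread(() -> {
                t.body().run();
                finish(t);
            }).start();
            return true;
        });
    }

    private synchronized void finish(Task t) {
        running.remove(t);
        dispatch(); // a completed task may unblock pending ones
    }
}
```

    A read/write or write/write overlap on the same region delays the later task until the earlier one finishes; enforcing exactly that discipline at run time, combined with static effect checking, is what yields data race freedom and atomicity in the full model.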

    Automatic Parallelization of Tiled Stencil Loop Nests on GPUs

    This thesis designs and implements a compiler framework based on the polyhedral model. The compiler automatically parallelizes loop nests, especially stencil kernels, into efficient GPU code via the loop tiling transformations that the polyhedral model describes. To enhance parallel performance, we introduce three practically efficient techniques for processing different types of loop nests, and our experimental results demonstrate that these techniques outperform previous approaches.

    First, we aim to find efficient tiling transformations without violating data dependences. How to select a tile's shape and size is an open, performance-critical issue influenced by the GPU's hardware constraints. We propose an approach that determines tile shapes with a view to improving the two-level parallelism of GPUs. The approach finds appropriate tiling hyperplanes by embedding parallelism-enhancing constraints into the polyhedral model to maximize intra-tile (i.e., intra-SM) parallelism; this improves the load balance among the streaming processors (SPs), which execute a wavefront of loop iterations within a tile. We also eliminate parallelism-hindering false dependences to optimize inter-tile (i.e., inter-SM) parallelism; this improves the load balance among the streaming multiprocessors (SMs), which execute a wavefront of tiles.

    Furthermore, to avoid a combinatorial explosion of tile-size configurations, we present a model-driven approach to automating tile size selection, which is performance-critical for loop tiling transformations, especially for DOACROSS loop nests. Our tile size selection model accurately estimates the execution times of tiled loop nests running on GPUs, and the selected tile sizes yield performance close to the best observed across the range of problem sizes tested.

    Finally, to address the difficulty and low performance of parallelizing the widely used SOR stencil loop nests, we present a new tiled parallel SOR method, called MLSOR, which admits more efficient data-parallel SIMD execution on GPUs. Unlike the previous two approaches, which are dependence-preserving, the basic idea is to algorithmically restructure a stencil kernel based on a non-dependence-preserving parallelization scheme, avoiding pipelining and exposing more parallelism. The new approach can be implemented in compilers through a pattern matching pass that optimizes SOR-like DOACROSS loop nests on GPUs.
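    The classic illustration of a non-dependence-preserving restructuring for SOR is red-black ordering, sketched below in plain single-threaded Java. MLSOR itself is a different, tile-based scheme; this sketch, with hypothetical names, only shows the underlying idea of trading the original dependence order for independent color classes that admit data-parallel execution.

```java
public class RedBlackSor {
    /**
     * One red-black SOR sweep over a square 2D grid (5-point stencil).
     * Splitting points into two color classes makes every point of a color
     * independent of the others, removing the pipelining that the
     * lexicographic (DOACROSS) sweep forces.
     */
    static void sorRedBlackSweep(double[][] u, double omega) {
        int n = u.length; // assumes a square grid
        for (int color = 0; color < 2; color++) {        // 0 = red, 1 = black
            for (int i = 1; i < n - 1; i++) {
                for (int j = 1; j < n - 1; j++) {
                    if ((i + j) % 2 != color) continue;  // only this color's points
                    double gs = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);
                    u[i][j] = u[i][j] + omega * (gs - u[i][j]);
                }
            }
        }
    }
}
```

    On a GPU, each color class maps naturally to one data-parallel kernel launch, since all points of a color can be updated simultaneously; the update order, and hence the numerical iteration, differs from the original lexicographic SOR, which is what "non-dependence-preserving" means here.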

    A computing origami: Optimized code generation for emerging parallel platforms

    Ph.D., Doctor of Philosophy.

    Parallel persistent object-oriented simulation with applications


    Parallel functional programming for message-passing multiprocessors

    We propose a framework for the evaluation of implicitly parallel functional programs on message-passing multiprocessors, with special emphasis on the issue of load bounding. The model is based on a new encoding of the lambda calculus in Milner's pi-calculus and combines lazy evaluation and eager (parallel) evaluation in the same framework. The pi-calculus encoding serves as the specification of a more concrete compilation scheme that maps a simple functional language into a message-passing parallel program. We show how, and under which conditions, we can guarantee successful load bounding based on this compilation scheme. Finally, we discuss the architectural requirements for a machine to support our model efficiently and present a simple RISC-style processor architecture that meets those criteria.
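    The combination of lazy evaluation with bounded eager parallelism can be illustrated with a small Java sketch: a thunk is forced on demand as usual, but may be "sparked" for parallel evaluation only while a load bound has spare capacity. All names here (Thunk, spark, LOAD_BOUND) are hypothetical, and the thesis's actual model is specified in the pi-calculus and compiled to message-passing code rather than shared-memory threads; the sketch captures only the lazy/eager combination under a load bound.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** A suspended computation: forced lazily on demand, optionally sparked eagerly. */
class Thunk<T> {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4);
    private static final Semaphore LOAD_BOUND = new Semaphore(8); // max outstanding sparks

    private final FutureTask<T> task;

    Thunk(Supplier<T> computation) {
        this.task = new FutureTask<>(computation::get);
    }

    /** Eagerly evaluate in parallel, but only if the load bound permits;
     *  otherwise stay lazy and let force() evaluate on demand. */
    void spark() {
        if (LOAD_BOUND.tryAcquire()) {
            POOL.execute(() -> {
                try { task.run(); } finally { LOAD_BOUND.release(); }
            });
        }
    }

    /** Demand the value: runs the computation here unless a spark already did. */
    T force() throws ExecutionException, InterruptedException {
        task.run(); // FutureTask.run() is a no-op if already run or running
        return task.get();
    }
}
```

    The semaphore plays the role of the load bound: when the machine is already saturated, spark() degrades gracefully to laziness instead of creating further parallel work, which is the behavior a successful load-bounding scheme must guarantee.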

    Mechanisms for Unbounded, Conflict-Robust Hardware Transactional Memory

    Conventional lock implementations serialize access to critical sections guarded by the same lock, presenting programmers with a difficult tradeoff between the granularity of synchronization and the amount of parallelism realized. Recently, researchers have been investigating an emerging synchronization mechanism called transactional memory as an alternative to such conventional lock-based synchronization. Memory transactions have the semantics of executing in isolation from one another while in reality executing speculatively in parallel, aborting when necessary to maintain the appearance of isolation. This combination of coarse-grained isolation and optimistic parallelism has the potential to ease the tradeoff imposed by lock-based programming.

    This dissertation studies the hardware implementation of transactional memory and makes three main contributions. First, we propose the permissions-only cache, a mechanism that efficiently increases the size of transactions that can be handled in the local cache hierarchy, optimizing performance. Second, we propose OneTM, an unbounded hardware transactional memory system that serializes transactions that escape the local cache hierarchy. Finally, we propose RetCon, a novel conflict detection mechanism that reduces the impact of conflicts by allowing transactions to commit with values different from those with which they executed, as long as dataflow and control-flow constraints are maintained.
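    The transactional semantics described above, speculative parallel execution that aborts on conflict to preserve the appearance of isolation, can be sketched as a software analogue in Java. This is not the dissertation's hardware design (the permissions-only cache, OneTM, and RetCon are hardware mechanisms); the names and the version-validation scheme below are illustrative assumptions only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** A transactional cell with a version stamp for optimistic conflict detection. */
class TxCell {
    volatile long value;
    final AtomicLong version = new AtomicLong();
}

/** Buffers writes, tracks read versions, and validates at commit. */
class Transaction {
    private final Map<TxCell, Long> readVersions = new HashMap<>();
    private final Map<TxCell, Long> writeBuffer = new HashMap<>();
    // A global lock stands in for the hardware's commit arbitration.
    private static final Object COMMIT_LOCK = new Object();

    long read(TxCell c) {
        Long buffered = writeBuffer.get(c);
        if (buffered != null) return buffered;      // read-your-own-write
        readVersions.putIfAbsent(c, c.version.get()); // record version for validation
        return c.value;
    }

    void write(TxCell c, long v) {
        writeBuffer.put(c, v); // speculative: invisible until commit
    }

    /** Returns false (abort) if any cell read was committed by another transaction. */
    boolean commit() {
        synchronized (COMMIT_LOCK) {
            for (Map.Entry<TxCell, Long> e : readVersions.entrySet())
                if (e.getKey().version.get() != e.getValue()) return false; // conflict
            for (Map.Entry<TxCell, Long> e : writeBuffer.entrySet()) {
                e.getKey().value = e.getValue();
                e.getKey().version.incrementAndGet();
            }
            return true; // caller retries the whole transaction on false
        }
    }
}
```

    The sketch takes the simple abort-on-any-changed-version route; RetCon's insight is that some such conflicts are harmless, so a transaction may still commit with updated values as long as the dataflow and control-flow constraints of its execution remain valid.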