21 research outputs found

    Implementing Performance Competitive Logical Recovery

    Full text link
    New hardware platforms, e.g. cloud, multi-core, etc., have led to a reconsideration of database system architecture. Our Deuteronomy project separates transactional functionality from data management functionality, enabling a flexible response to exploiting new platforms. This separation requires, however, that recovery is described logically. In this paper, we extend current recovery methods to work in this logical setting. While this is straightforward in principle, performance is an issue. We show how ARIES style recovery optimizations can work for logical recovery where page information is not captured on the log. In side-by-side performance experiments using a common log, we compare logical recovery with a state-of-the art ARIES style recovery implementation and show that logical redo performance can be competitive.Comment: VLDB201

    Enabling Operator Reordering in Data Flow Programs Through Static Code Analysis

    Full text link
    In many massively parallel data management platforms, programs are represented as small imperative pieces of code connected in a data flow. This popular abstraction makes it hard to apply algebraic reordering techniques employed by relational DBMSs and other systems that use an algebraic programming abstraction. We present a code analysis technique based on reverse data and control flow analysis that discovers a set of properties from user code, which can be used to emulate algebraic optimizations in this setting.Comment: 4 pages, accepted and presented at the First International Workshop on Cross-model Language Design and Implementation (XLDI), affiliated with ICFP 2012, Copenhage

    Lightweight Asynchronous Snapshots for Distributed Dataflows

    Full text link
    Distributed stateful stream processing enables the deployment and execution of large scale continuous computations in the cloud, targeting both low latency and high throughput. One of the most fundamental challenges of this paradigm is providing processing guarantees under potential failures. Existing approaches rely on periodic global state snapshots that can be used for failure recovery. Those approaches suffer from two main drawbacks. First, they often stall the overall computation which impacts ingestion. Second, they eagerly persist all records in transit along with the operation states which results in larger snapshots than required. In this work we propose Asynchronous Barrier Snapshotting (ABS), a lightweight algorithm suited for modern dataflow execution engines that minimises space requirements. ABS persists only operator states on acyclic execution topologies while keeping a minimal record log on cyclic dataflows. We implemented ABS on Apache Flink, a distributed analytics engine that supports stateful stream processing. Our evaluation shows that our algorithm does not have a heavy impact on the execution, maintaining linear scalability and performing well with frequent snapshots.Comment: 8 pages, 7 figure

    Spinning Fast Iterative Data Flows

    Full text link
    Parallel dataflow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel dataflow frameworks, these systems cannot exploit computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are inefficiently executed and have led to specialized systems based on other paradigms, such as message passing or shared memory. We propose a method to integrate incremental iterations, a form of workset iterations, with parallel dataflows. After showing how to integrate bulk iterations into a dataflow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension alleviates for the lack of mutable state in dataflows and allows for exploiting the sparse computational dependencies inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that those aspects lead to up to two orders of magnitude speedup in algorithm runtime, when exploited. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified dataflow abstraction.Comment: VLDB201

    Myriad: Scalable and Expressive Data Generation

    No full text
    ABSTRACT The current research focus on Big Data systems calls for a rethinking of data generation methods. The traditional sequential data generation approach is not well suited to large-scale systems as generating a terabyte of data may require days or even weeks depending on the number of constraints imposed on the generated model. We demonstrate Myriad, a new data generation toolkit that enables the specification of semantically rich data generator programs that can scale out linearly in a shared-nothing environment. Data generation programs built on top of Myriad implement an efficient parallel execution strategy leveraged by the extensive use of pseudo-random number generators with random access support

    Myriad: Scalable and Expressive Data Generation

    No full text
    ABSTRACT The current research focus on Big Data systems calls for a rethinking of data generation methods. The traditional sequential data generation approach is not well suited to large-scale systems as generating a terabyte of data may require days or even weeks depending on the number of constraints imposed on the generated model. We demonstrate Myriad, a new data generation toolkit that enables the specification of semantically rich data generator programs that can scale out linearly in a shared-nothing environment. Data generation programs built on top of Myriad implement an efficient parallel execution strategy leveraged by the extensive use of pseudo-random number generators with random access support
    corecore