21 research outputs found
Implementing Performance Competitive Logical Recovery
New hardware platforms, e.g. cloud, multi-core, etc., have led to a
reconsideration of database system architecture. Our Deuteronomy project
separates transactional functionality from data management functionality,
enabling a flexible response to exploiting new platforms. This separation
requires, however, that recovery is described logically. In this paper, we
extend current recovery methods to work in this logical setting. While this is
straightforward in principle, performance is an issue. We show how ARIES-style
recovery optimizations can work for logical recovery where page information is
not captured on the log. In side-by-side performance experiments using a common
log, we compare logical recovery with a state-of-the-art ARIES-style recovery
implementation and show that logical redo performance can be competitive.
Comment: VLDB201
Enabling Operator Reordering in Data Flow Programs Through Static Code Analysis
In many massively parallel data management platforms, programs are
represented as small imperative pieces of code connected in a data flow. This
popular abstraction makes it hard to apply algebraic reordering techniques
employed by relational DBMSs and other systems that use an algebraic
programming abstraction. We present a code analysis technique based on reverse
data and control flow analysis that discovers a set of properties from user
code, which can be used to emulate algebraic optimizations in this setting.
Comment: 4 pages, accepted and presented at the First International Workshop
on Cross-model Language Design and Implementation (XLDI), affiliated with
ICFP 2012, Copenhage
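The abstract does not say which properties the analysis extracts. As a minimal sketch, assume they are the sets of record fields each operator reads and writes; the class OperatorProperties and the function can_reorder are hypothetical names used only for illustration. Under that assumption, a conservative test for swapping two record-at-a-time operators is that neither touches a field the other writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorProperties:
    """Assumed result of code analysis: fields an operator reads and writes."""
    name: str
    read_set: frozenset
    write_set: frozenset


def can_reorder(first, second):
    """Conservative check: no read-write, write-read, or write-write overlap."""
    conflicts = (
        first.write_set & second.read_set
        | first.read_set & second.write_set
        | first.write_set & second.write_set
    )
    return not conflicts


# A filter on "price" and a map that derives "name_upper" from "name"
# touch disjoint fields, so swapping them is considered safe.
price_filter = OperatorProperties("filter", frozenset({"price"}), frozenset())
name_map = OperatorProperties("map", frozenset({"name"}), frozenset({"name_upper"}))
# This map overwrites "price", which the filter reads, so the order matters.
discount_map = OperatorProperties("discount", frozenset({"qty"}), frozenset({"price"}))
```

The check is deliberately pessimistic: it may reject reorderings that are in fact valid, but it should not accept one in which an operator would observe a different field value after the swap.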
Lightweight Asynchronous Snapshots for Distributed Dataflows
Distributed stateful stream processing enables the deployment and execution
of large scale continuous computations in the cloud, targeting both low latency
and high throughput. One of the most fundamental challenges of this paradigm is
providing processing guarantees under potential failures. Existing approaches
rely on periodic global state snapshots that can be used for failure recovery.
Those approaches suffer from two main drawbacks. First, they often stall the
overall computation, which impacts ingestion. Second, they eagerly persist all
records in transit along with the operation states, which results in larger
snapshots than required. In this work we propose Asynchronous Barrier
Snapshotting (ABS), a lightweight algorithm suited for modern dataflow
execution engines that minimises space requirements. ABS persists only operator
states on acyclic execution topologies while keeping a minimal record log on
cyclic dataflows. We implemented ABS on Apache Flink, a distributed analytics
engine that supports stateful stream processing. Our evaluation shows that our
algorithm does not have a heavy impact on the execution, maintaining linear
scalability and performing well with frequent snapshots.
Comment: 8 pages, 7 figure
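The following is a minimal single-process sketch of how barrier-based snapshotting might look for one operator on an acyclic topology. It assumes that a barrier marker flows through every input channel and that the operator snapshots its state once the barrier has arrived on all inputs; the names BARRIER and SummingOperator, and the in-memory lists standing in for durable storage and downstream channels, are hypothetical and not taken from the paper or from Apache Flink.

```python
from collections import deque

BARRIER = object()  # hypothetical marker separating snapshot epochs


class SummingOperator:
    """Toy stateful operator that aligns barriers across its input channels."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.state = 0          # operator state: a running sum
        self.blocked = set()    # inputs that already delivered the barrier
        self.buffered = {i: deque() for i in range(num_inputs)}
        self.snapshots = []     # stands in for durable snapshot storage
        self.output = []        # stands in for the downstream channel

    def receive(self, channel, item):
        if channel in self.blocked:
            # Input behind the barrier belongs to the next epoch: hold it back.
            self.buffered[channel].append(item)
            return
        if item is BARRIER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:
                self._complete_snapshot()
            return
        self.state += item
        self.output.append(self.state)

    def _complete_snapshot(self):
        # Only the operator state is persisted, not the records in transit.
        self.snapshots.append(self.state)
        self.output.append(BARRIER)  # forward the barrier downstream
        self.blocked.clear()
        pending = self.buffered
        self.buffered = {i: deque() for i in range(self.num_inputs)}
        for channel, items in pending.items():
            for item in items:
                self.receive(channel, item)
```

Feeding the operator `1, BARRIER, 10` on channel 0 and `2, BARRIER` on channel 1 holds back the record `10` until the second barrier arrives, so the snapshot contains only the pre-barrier sum. The sketch ignores cyclic dataflows, for which the abstract says a minimal record log is kept.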
Spinning Fast Iterative Data Flows
Parallel dataflow systems are a central part of most analytic pipelines for
big data. The iterative nature of many analysis and machine learning
algorithms, however, is still a challenge for current systems. While certain
types of bulk iterative algorithms are supported by novel dataflow frameworks,
these systems cannot exploit computational dependencies present in many
algorithms, such as graph algorithms. As a result, these algorithms are
inefficiently executed and have led to specialized systems based on other
paradigms, such as message passing or shared memory. We propose a method to
integrate incremental iterations, a form of workset iterations, with parallel
dataflows. After showing how to integrate bulk iterations into a dataflow
system and its optimizer, we present an extension to the programming model for
incremental iterations. The extension compensates for the lack of mutable state
in dataflows and allows for exploiting the sparse computational dependencies
inherent in many iterative algorithms. The evaluation of a prototypical
implementation shows that exploiting those aspects yields up to two orders of
magnitude speedup in algorithm runtime. In our experiments, the improved
dataflow system is highly competitive with specialized systems while
maintaining a transparent and unified dataflow abstraction.
Comment: VLDB201
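To illustrate the idea of a workset iteration, here is a small sequential sketch using connected components as the example algorithm. It is not the paper's programming model: the function name connected_components and the use of plain Python dictionaries for the solution set and workset are assumptions made for illustration. The point is that each round touches only vertices whose label changed in the previous round, instead of recomputing every vertex as a bulk iteration would.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    solution = {v: v for v in vertices}  # solution set: vertex -> component id
    workset = set(vertices)              # vertices whose label changed recently

    while workset:
        next_workset = set()
        for v in workset:
            for n in neighbors[v]:
                # Propagate the smaller label; only updated vertices are
                # scheduled for the next round.
                if solution[v] < solution[n]:
                    solution[n] = solution[v]
                    next_workset.add(n)
        workset = next_workset
    return solution
```

The loop ends when the workset is empty, which happens once no label can be lowered any further. The mutable `solution` dictionary plays the role of the state that, according to the abstract, plain dataflows lack.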
Myriad: Scalable and Expressive Data Generation
The current research focus on Big Data systems calls for a rethinking of data generation methods. The traditional sequential data generation approach is not well suited to large-scale systems, as generating a terabyte of data may require days or even weeks depending on the number of constraints imposed on the generated model. We demonstrate Myriad, a new data generation toolkit that enables the specification of semantically rich data generator programs that can scale out linearly in a shared-nothing environment. Data generation programs built on top of Myriad implement an efficient parallel execution strategy that relies on extensive use of pseudo-random number generators with random access support.
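A generator with random access support can produce the value at any position without producing the values before it, which is what lets independent nodes generate disjoint ranges of the same dataset. The sketch below shows the idea with a hash of the seed and position; this is a swapped-in technique chosen for brevity, not Myriad's actual generator, and the names random_at and generate_partition are hypothetical.

```python
import hashlib


def random_at(seed, index):
    """Return a pseudo-random float in [0, 1) for a given position."""
    digest = hashlib.sha256(f"{seed}:{index}".encode()).digest()
    # Keep 53 bits so the result is exactly representable and below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def generate_partition(seed, start, end):
    """Generate records [start, end) without touching any other position."""
    return [(i, random_at(seed, i)) for i in range(start, end)]


# Two hypothetical nodes generate disjoint halves with no coordination.
node_a = generate_partition(seed=42, start=0, end=5)
node_b = generate_partition(seed=42, start=5, end=10)
sequential = generate_partition(seed=42, start=0, end=10)
```

Concatenating the partitions yields the same records as one sequential run, so the generated dataset does not depend on how many nodes take part.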