Labeling Workflow Views with Fine-Grained Dependencies
This paper considers the problem of efficiently answering reachability
queries over views of provenance graphs, derived from executions of workflows
that may include recursion. Such views include composite modules and model
fine-grained dependencies between module inputs and outputs. A novel
view-adaptive dynamic labeling scheme is developed for efficient query
evaluation, in which view specifications are labeled statically (i.e. as they
are created) and data items are labeled dynamically as they are produced during
a workflow execution. Although the combination of fine-grained dependencies and
recursive workflows entails, in general, long (linear-size) data labels, we show
that for a large natural class of workflows and views, labels are compact
(logarithmic-size) and reachability queries can be evaluated in constant time.
Experimental results demonstrate the benefit of this approach over the
state-of-the-art technique when applied for labeling multiple views.
Comment: VLDB201
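As a rough illustration of what a reachability label buys, the sketch below uses classic interval labeling on a tree. This is a swapped-in textbook technique, not the view-adaptive scheme of the paper, and it handles neither views nor recursion. Each node receives a (start, end) pair during a depth-first traversal, and one node reaches another exactly when its interval contains the other's, so a query costs two comparisons. The function names and the toy graph are assumptions made for this example.

```python
def label_tree(children, root):
    """Assign a (start, end) interval to every node of a tree by DFS.

    `children` maps a node to the list of its child nodes (hypothetical input format).
    """
    labels = {}
    clock = 0

    def visit(node):
        nonlocal clock
        start = clock
        clock += 1
        for child in children.get(node, []):
            visit(child)
        # The interval closes after all descendants have been numbered.
        labels[node] = (start, clock)

    visit(root)
    return labels


def reaches(labels, a, b):
    """Constant-time test: a reaches b iff a's interval contains b's."""
    start_a, end_a = labels[a]
    start_b, end_b = labels[b]
    return start_a <= start_b and end_b <= end_a


# Toy provenance tree: an input feeds two modules, each producing one data item.
toy = {"in": ["m1", "m2"], "m1": ["d1"], "m2": ["d2"]}
toy_labels = label_tree(toy, "in")
```

With these labels, `reaches(toy_labels, "in", "d2")` holds while `reaches(toy_labels, "m1", "d2")` does not, and neither query traverses the graph.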
Replicated Data and Partition Failures
In a distributed database system, data is often replicated to improve performance and availability. By storing copies of shared data on processors where it is frequently accessed, the need for expensive, remote read accesses is decreased. By storing copies of critical data on processors with independent failure modes, the probability that at least one copy of the data will be accessible increases. In theory, data replication makes it possible to provide arbitrarily high data availability.
In practice, realizing the benefits of data replication is difficult since the correctness of data must be maintained. One important aspect of correctness with replicated data is mutual consistency: all copies of the same logical data-item must agree on exactly one current value for the data-item. Furthermore, this value should make sense in terms of the transactions executed on copies of the data-item. When communication fails between sites containing copies of the same logical data-item, mutual consistency between copies becomes complicated to ensure. The most disruptive of these communication failures are partition failures, which fragment the network into isolated subnetworks called partitions. Unless partition failures are detected and recognized by all affected processors, independent and uncoordinated updates may be applied to different copies of the data, thereby compromising the correctness of data. Consider, for example, an Airline Reservation System implemented by a distributed database which splits into two partitions when the communication network fails. If, at the time of the failure, all the nodes have one seat remaining for PAN AM 537, reservations could be made in both partitions. This would violate correctness: who should get the last seat? There should not be more seats reserved for a flight than physically exist on the plane. (Some airlines do not implement this constraint and allow overbookings.)
The design of a replicated data management algorithm tolerating partition failures (or partition processing strategy) is a notoriously hard problem. Typically, the cause or extent of a partition failure cannot be discerned by the processors themselves. At best, a processor may be able to identify the other processors in its partition; but, for the processors outside of its partition, it will not be able to distinguish between the case where those processors are simply isolated from it and the case where those processors are down. In addition, slow responses can cause the network to appear partitioned even when it is not, further complicating the design of a fault-tolerant algorithm.
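The airline example above can be made concrete with a small simulation. It is only a sketch under simplifying assumptions: two sites, one seat remaining, and a shared Python object standing in for copies that are kept mutually consistent. The names `Replica` and `seats_sold` are hypothetical and introduced only for this illustration.

```python
class Replica:
    """One copy of the seat counter for a single flight."""

    def __init__(self, seats_left):
        self.seats_left = seats_left

    def reserve(self):
        """Sell one seat if this copy believes a seat is still available."""
        if self.seats_left > 0:
            self.seats_left -= 1
            return True
        return False


def seats_sold(partitioned, seats_left=1):
    """Count the reservations accepted when each of two sites tries to sell one seat."""
    if partitioned:
        # Each partition updates its own copy with no coordination.
        site_a, site_b = Replica(seats_left), Replica(seats_left)
    else:
        # Assumption: a single shared object models copies that stay consistent.
        site_a = site_b = Replica(seats_left)
    return int(site_a.reserve()) + int(site_b.reserve())
```

With one seat remaining, `seats_sold(partitioned=False)` accepts one reservation, while `seats_sold(partitioned=True)` accepts two, which is the correctness violation described above.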
Search and Result Presentation in Scientific Workflow Repositories
We study the problem of searching a repository of complex hierarchical
workflows whose component modules, both composite and atomic, have been
annotated with keywords. Since keyword search does not use the graph structure
of a workflow, we develop a model of workflows using context-free bag grammars.
We then give efficient polynomial-time algorithms that, given a workflow and a
keyword query, determine whether some execution of the workflow matches the
query. Based on these algorithms we develop a search and ranking solution that
efficiently retrieves the top-k grammars from a repository. Finally, we propose
a novel result presentation method for grammars matching a keyword query, based
on representative parse-trees. The effectiveness of our approach is validated
through an extensive experimental evaluation.
Answering Regular Path Queries on Workflow Provenance
This paper proposes a novel approach for efficiently evaluating regular path
queries over provenance graphs of workflows that may include recursion. The
approach assumes that an execution g of a workflow G is labeled with
query-agnostic reachability labels using an existing technique. At query time,
given g, G and a regular path query R, the approach decomposes R into a set of
subqueries R1, ..., Rk that are safe for G. For each safe subquery Ri, G is
rewritten so that, using the reachability labels of nodes in g, whether or not
there is a path which matches Ri between two nodes can be decided in constant
time. The results of each safe subquery are then composed, possibly with some
small unsafe remainder, to produce an answer to R. The approach results in an
algorithm that significantly reduces the number of subqueries k over existing
techniques by increasing their size and complexity, and that evaluates each
subquery in time bounded by its input and output size. Experimental results
demonstrate the benefit of this approach.
Provenance Views for Module Privacy
Scientific workflow systems increasingly store provenance information about
the module executions used to produce a data item, as well as the parameter
settings and intermediate data items passed between module executions. However,
authors/owners of workflows may wish to keep some of this information
confidential. In particular, a module may be proprietary, and users should not
be able to infer its behavior by seeing mappings between all data inputs and
outputs. The problem we address in this paper is the following: given a
workflow, abstractly modeled by a relation R, a privacy requirement \Gamma, and
costs associated with data, the owner of the workflow decides which data
(attributes) to hide and provides the user with a view R', which is the
projection of R over the attributes that have not been hidden. The goal is to
minimize the cost of hidden data while guaranteeing that individual modules are
\Gamma-private. We call this the "secureview" problem. We formally define the
problem, study its complexity, and offer algorithmic solutions.
A Performance Analysis of Timed Synchronous Communication Primitives
Two algorithms for timed synchronous communication between a single sender and single receiver have recently appeared in the literature. Each weakens the definition of correct timed synchronous communication in a different way, and exhibits a different undesirable behavior. In this paper, their performance is analyzed, and their sensitivity to various parameters is discussed. These parameters include how long the processes are willing to wait for communication to be successful, how well synchronized the processes are, the assumed upper bound on message delay, and the actual end-to-end message delay distribution. We conclude by discussing the fault tolerance of the algorithms, and propose a mixed strategy that avoids some of the performance problems.
Updating Complex Value Databases
Query languages and their optimizations have been a very important issue in the database community. Languages for updating databases, however, have not been studied to the same extent, although they are clearly important since databases must change over time. The structure and expressiveness of updates is largely dependent on the data model. In relational databases, for example, the update language typically allows the user to specify changes to individual fields of a subset of a relation that meets some selection criterion. The syntax is terse, specifying only the pieces of the database that are to be altered. Because of its simplicity, most of the optimizations take place in the internal processing of the update rather than at the language level. In complex value databases, the need for a terse and optimizable update language is much greater, due to the deeply nested structures involved.
Starting with a query language for complex value databases called the Collection Programming Language (CPL), we describe an extension called CPL+ which provides a convenient and intuitive specification of updates on complex values. CPL is a functional language, with powerful optimizations achieved through rewrite rules. Additional rewrite rules are derived for CPL+ and a notion of deltafication is introduced to transform complete updates, expressed as conventional CPL expressions, into equivalent update expressions in CPL+. As a result of applying these transformations, the performance of complex updates can increase substantially.
Adding Time to Synchronous Process Communications
In distributed real-time systems, communicating processes cannot be delayed for arbitrary amounts of time while waiting for messages. Thus, communication primitives used for real-time programming usually allow the inclusion of a deadline or timeout to limit potential delays due to synchronization. This paper interprets timed synchronous communication as having absolute deadlines. Various ways of implementing deadlines are discussed, and two useful timed synchronous communication problems are identified which differ in the number of participating senders and receivers and type of synchronous communication. For each problem, a simple algorithm is presented and shown to be correct. The algorithms are shown to guarantee maximal success and to require the smallest delay intervals during which processes wait for synchronous communication. We also evaluate the number of messages used to reach agreement.
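To illustrate the difference between an absolute deadline and a relative timeout, the sketch below wraps a blocking receive so that the caller supplies a point in time rather than a duration. It is not either of the paper's algorithms and does not implement synchronous (rendezvous) communication; it only shows how a deadline on a local monotonic clock can be converted into the remaining wait. The name `receive_by` and the use of a standard-library queue as the channel are assumptions made for this example.

```python
import queue
import time


def receive_by(channel, deadline):
    """Return the next message on `channel`, or None if `deadline` passes first.

    `deadline` is an absolute reading of time.monotonic(), not a duration.
    """
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        # The deadline has already expired, so do not wait at all.
        return None
    try:
        return channel.get(timeout=remaining)
    except queue.Empty:
        return None


# Usage: wait for a message until at most one second from now.
channel = queue.Queue()
channel.put("request")
message = receive_by(channel, time.monotonic() + 1.0)
```

Because the deadline is absolute, a caller that retries after a partial wait does not extend the total delay, which a fixed relative timeout per attempt would do.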