
    Provenance in Collaborative Data Sharing

    This dissertation focuses on recording, maintaining and exploiting provenance information in Collaborative Data Sharing Systems (CDSS). These are systems that support data sharing across loosely-coupled, heterogeneous collections of relational databases related by declarative schema mappings. A fundamental challenge in a CDSS is to support the capability of update exchange --- which publishes a participant's updates and then translates others' updates to the participant's local schema and imports them --- while tolerating disagreement between participants and recording the provenance of exchanged data, i.e., information about the sources and mappings involved in their propagation. This provenance information can be useful during update exchange, e.g., to evaluate provenance-based trust policies. It can also be exploited after update exchange, to answer a variety of user queries about the quality, uncertainty or authority of the data, for applications such as trust assessment, ranking for keyword search over databases, or query answering in probabilistic databases. To address these challenges, in this dissertation we develop a novel model of provenance graphs that is informative enough to satisfy the needs of CDSS users and captures the semantics of query answering on various forms of annotated relations. We extend techniques from data integration, data exchange, incremental view maintenance and view update to define the formal semantics of unidirectional and bidirectional update exchange. We develop algorithms to perform update exchange incrementally while maintaining provenance information. We present strategies for implementing our techniques over an RDBMS and experimentally demonstrate their viability in the Orchestra prototype system. We define ProQL, a query language for provenance graphs that can be used by CDSS users to combine data querying with provenance testing, as well as to compute annotations for their data, based on their provenance, that are useful for a variety of applications. Finally, we develop a prototype implementation of ProQL over an RDBMS together with indexing techniques to speed up provenance querying, and we experimentally evaluate the performance of provenance querying and the benefits of our indexing techniques.
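    As a rough illustration of the kind of provenance tracking the abstract refers to, the sketch below annotates each tuple with a set of "monomials" (sets of contributing base-tuple identifiers) and propagates them through join and union, in the spirit of provenance polynomials for annotated relations. All names and the representation here are assumptions made for illustration; this is not the dissertation's actual provenance-graph model or ProQL.

        # Minimal sketch of provenance-annotated relations: each tuple carries a set
        # of monomials, where a monomial is a frozenset of base-tuple ids that
        # jointly derive it. Illustrative only; not the dissertation's model.
        from itertools import product

        def base(rel):
            """Annotate every base tuple with the singleton monomial of its own id."""
            return {t: {frozenset([tid])} for tid, t in rel.items()}

        def join(r, s, on):
            """Join two annotated relations; the provenance of an output tuple is the
            pairwise union of the monomials of the inputs that produced it."""
            out = {}
            for (t1, p1), (t2, p2) in product(r.items(), s.items()):
                if all(t1[i] == t2[j] for i, j in on):
                    t = t1 + t2
                    out[t] = out.get(t, set()) | {m1 | m2 for m1 in p1 for m2 in p2}
            return out

        def union(r, s):
            """Union of annotated relations: provenance sets are merged."""
            out = dict(r)
            for t, p in s.items():
                out[t] = out.get(t, set()) | p
            return out

        # Which base tuples does a derived tuple depend on?
        R = base({"r1": ("a", "b"), "r2": ("a", "c")})
        S = base({"s1": ("b", "d")})
        Q = join(R, S, on=[(1, 0)])          # join on R's 2nd and S's 1st attribute
        print(Q[("a", "b", "b", "d")])       # -> {frozenset({'r1', 's1'})}

    Annotations of this shape can be reinterpreted under different semirings (for example as trust scores or probabilities), which is one way to read the abstract's remark about exploiting provenance for trust assessment or probabilistic query answering.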

    Asynchronous replication of eventually consistent updatable views

    Users of software applications expect fast response times and high availability, even as applications move from local devices into the cloud. A cloud-based application that could otherwise function locally becomes unavailable if a network partition occurs. A fundamental challenge in distributed systems is maintaining the right tradeoff between strong consistency, high availability, and tolerance to network partitions; the impossibility of achieving all three properties is described by the CAP theorem. To guarantee the highest degree of responsiveness and availability, applications can be run entirely locally on a device without directly relying on cloud services. Software that can run locally without a direct dependency on cloud services is called local-first software. Being local-first means that consistency guarantees may need to be relaxed: weaker consistency, such as eventual consistency, can be used instead of strong consistency. Implementing conflict-free replicated data types (CRDTs) is a provably correct way to achieve eventual consistency. These data types guarantee that the states of different replicas converge towards a common state when the system becomes connected and quiescent. The drawback of using CRDTs is that they are unbounded in their growth, which means they can quickly become too large to handle on less capable devices such as smartphones, tablets, or other edge devices. To mitigate this, partial replication can be used to replicate only the data each device needs. This comes with the added benefit of limiting the information users obtain, thus potentially improving security and privacy. The main contribution of this thesis is a new approach to partial replication. It builds on an existing asynchronously replicated relational database that supports local-first software and guarantees eventual consistency. The new approach uses database views to define partial replicas. The views are made updatable by drawing inspiration from the large body of research on updatable views; we differentiate ourselves from earlier work on non-distributed updatable views by guaranteeing that the views are eventually consistent. The approach is evaluated against realistic scenarios and has proved usable in them, and the replication of database views has been tested experimentally to confirm that our approach to partial replication is viable for less capable devices.
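    To make the CRDT idea above concrete, here is a minimal state-based grow-only counter whose merge is a pointwise maximum, so replica states converge once they have exchanged state. It is a generic, well-known example used only to illustrate eventual consistency; it is not the thesis's partial-replication or updatable-view mechanism.

        # Minimal state-based CRDT (grow-only counter). Merge is commutative,
        # associative, and idempotent, so replicas converge after exchanging state.
        class GCounter:
            def __init__(self, replica_id):
                self.replica_id = replica_id
                self.counts = {}                 # replica id -> local increment count

            def increment(self, n=1):
                self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

            def value(self):
                return sum(self.counts.values())

            def merge(self, other):
                """Pointwise maximum over per-replica counts."""
                for rid, c in other.counts.items():
                    self.counts[rid] = max(self.counts.get(rid, 0), c)

        # Two replicas update independently while partitioned, then merge and agree.
        a, b = GCounter("a"), GCounter("b")
        a.increment(3); b.increment(2)
        a.merge(b); b.merge(a)
        assert a.value() == b.value() == 5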

    The RDF-3X Engine for Scalable Management of RDF Data

    RDF is a data model for schema-free structured information that is gaining momentum in the context of Semantic-Web data, life sciences, and also Web 2.0 platforms. The "pay-as-you-go" nature of RDF and the flexible pattern-matching capabilities of its query language SPARQL entail efficiency and scalability challenges for complex queries including long join paths. This paper presents the RDF-3X engine, an implementation of SPARQL that achieves excellent performance by pursuing a RISC-style architecture with streamlined indexing and query processing. The physical design is identical for all RDF-3X databases regardless of their workloads, and completely eliminates the need for index tuning by maintaining exhaustive indexes for all permutations of subject-property-object triples and their binary and unary projections. These indexes are highly compressed, and the query processor can aggressively leverage fast merge joins with excellent performance of processor caches. The query optimizer is able to choose optimal join orders even for complex queries, with a cost model that includes statistical synopses for entire join paths. Although RDF-3X is optimized for queries, it also provides good support for efficient online updates by means of a staging architecture: direct updates to the main database indexes are deferred and instead applied to compact differential indexes, which are later merged into the main indexes in a batched manner. Experimental studies with several large-scale datasets of more than 50 million RDF triples and benchmark queries that include pattern matching, many-way star joins, and long path joins demonstrate that RDF-3X can outperform the previously best alternatives by one or two orders of magnitude.
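    The exhaustive-indexing idea can be illustrated with a toy triple store that keeps one sorted index per permutation of (subject, predicate, object) and answers a triple pattern as a range scan over the permutation whose prefix matches the pattern's bound components, so results come out sorted and ready for merge joins. This is a plain-Python sketch under those assumptions, not RDF-3X's compressed indexes, optimizer, or storage layout.

        # Toy store with one sorted index per SPO permutation; a triple pattern with
        # bound components becomes a prefix range scan over the matching permutation.
        from bisect import bisect_left, bisect_right
        from itertools import permutations

        class TripleStore:
            def __init__(self, triples):
                self.indexes = {perm: sorted(tuple(t[i] for i in perm) for t in triples)
                                for perm in permutations((0, 1, 2))}

            def scan(self, pattern):
                """pattern is (s, p, o) with None marking variables."""
                bound = [i for i in range(3) if pattern[i] is not None]
                free = [i for i in range(3) if pattern[i] is None]
                perm = tuple(bound + free)
                idx = self.indexes[perm]
                prefix = tuple(pattern[i] for i in bound)
                lo = bisect_left(idx, prefix)
                hi = bisect_right(idx, prefix + (chr(0x10FFFF),) * len(free))
                for key in idx[lo:hi]:
                    triple = [None] * 3
                    for pos, i in enumerate(perm):
                        triple[i] = key[pos]
                    yield tuple(triple)

        store = TripleStore([("alice", "knows", "bob"),
                             ("bob", "knows", "carol"),
                             ("alice", "age", "42")])
        print(list(store.scan(("alice", "knows", None))))  # -> [('alice', 'knows', 'bob')]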

    Updating Recursive XML Views of Relations

    This paper investigates the view update problem for XML views published from relational data. We consider XML views defined in terms of mappings directed by possibly recursive DTDs, compressed into DAGs and stored in relations. We provide new techniques to efficiently support XML view updates specified in terms of XPath expressions with recursion and complex filters. The interaction between XPath recursion and DAG compression of XML views makes the analysis of XML view updates rather intriguing. In addition, many issues are still open even for relational view updates and need to be explored. In response, on the XML side, we revise the notion of side effects and update semantics based on the semantics of XML views, and present efficient algorithms to translate XML updates to relational view updates. On the relational side, we propose a mild condition on SPJ views, and show that under this condition the analysis of deletions on relational views becomes PTIME, while the insertion analysis is NP-complete. We develop an efficient algorithm to process relational view deletions, and a heuristic algorithm to handle view insertions. Finally, we present an experimental study to verify the effectiveness of our techniques.
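    The deletion side of the view update problem can be illustrated with a toy select-project-join (SPJ) view: a deletion requested on the view is translated into a base-table deletion only if applying it removes exactly the target view tuple and nothing else (no side effects). The brute-force check below is purely illustrative and is not the paper's PTIME analysis, its condition on SPJ views, or its XML-to-relational translation.

        # Toy SPJ view V(a, c) = project_{a,c}( R(a, b) join S(b, c) ), with a
        # brute-force search for a side-effect-free base-table deletion.
        def view(R, S):
            return {(a, c) for (a, b) in R for (b2, c) in S if b == b2}

        def translate_deletion(R, S, target):
            before = view(R, S)
            # Try each base tuple as the candidate deletion.
            for rel_name, tup in [("R", t) for t in R] + [("S", t) for t in S]:
                newR = R - {tup} if rel_name == "R" else R
                newS = S - {tup} if rel_name == "S" else S
                if view(newR, newS) == before - {target}:
                    return rel_name, tup     # deletes the target and nothing else
            return None                      # no side-effect-free translation exists

        R = {(1, "x"), (2, "x"), (3, "y")}
        S = {("x", 10), ("y", 20)}
        print(translate_deletion(R, S, (3, 20)))   # e.g. ('R', (3, 'y')) or ('S', ('y', 20))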