    Compressed materialised views of semi-structured data

    Query performance issues over semi-structured data have led to the emergence of materialised XML views as a means of restricting the data structure processed by a query. However, preserving the conventional representation of such views remains a significant limiting factor, especially on mobile devices, where processing power, memory and bandwidth are limited. To explore the concept of a compressed materialised view, we extend our earlier work on structural XML compression to produce a combination of structural summarisation and data compression techniques. These techniques provide a basis for efficiently dealing with both structural queries and value-based predicates. We evaluate the effectiveness of such a scheme, presenting results and performance measures that show the advantages of using such structures.

    Framework for Live Synchronization of RDF Views of Relational Data

    This demo presents a framework for the live synchronization of an RDF view defined on top of a relational database. In the proposed framework, rules are responsible for computing and publishing the changeset required for the RDB-RDF view to stay synchronized with the relational database. The computed changesets are then used for the incremental maintenance of the RDB-RDF views as well as application views. The demo is based on the LinkedBrainz Live tool, developed to validate the proposed framework.
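    The changeset idea described above can be illustrated with a minimal sketch. It assumes a changeset is simply a pair of triple sets (removed, added) and that each relational row maps to one triple per non-key column. The function names, the `id` key column and the URI-like strings are hypothetical and are not taken from the LinkedBrainz Live tool.

```python
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def row_to_triples(table: str, row: Dict[str, object]) -> Set[Triple]:
    """Map one relational row to RDF-like triples (hypothetical mapping)."""
    subject = f"{table}/{row['id']}"
    return {
        (subject, f"{table}#{column}", str(value))
        for column, value in row.items()
        if column != "id" and value is not None
    }


def changeset_for_update(
    table: str,
    old_row: Optional[Dict[str, object]],
    new_row: Optional[Dict[str, object]],
) -> Tuple[Set[Triple], Set[Triple]]:
    """Compute (removed, added) triples for an insert, update or delete.

    Pass old_row=None for an insert and new_row=None for a delete.
    """
    old = row_to_triples(table, old_row) if old_row else set()
    new = row_to_triples(table, new_row) if new_row else set()
    return old - new, new - old


def apply_changeset(
    view: Set[Triple], removed: Set[Triple], added: Set[Triple]
) -> Set[Triple]:
    """Incrementally maintain the view without recomputing it from scratch."""
    return (view - removed) | added


old = {"id": 1, "name": "Muse", "country": "GB"}
new = {"id": 1, "name": "MUSE", "country": "GB"}
view = row_to_triples("artist", old)
removed, added = changeset_for_update("artist", old, new)
view = apply_changeset(view, removed, added)
```

    Only the triple for the changed column appears in the changeset, so the cost of maintaining the view depends on the size of the change rather than on the size of the view.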

    XML Reconstruction View Selection in XML Databases: Complexity Analysis and Approximation Scheme

    Query evaluation in an XML database requires reconstructing XML subtrees rooted at nodes found by an XML query. Since XML subtree reconstruction can be expensive, one approach to improve query response time is to use reconstruction views - materialized XML subtrees of an XML document, whose nodes are frequently accessed by XML queries. For this approach to be efficient, the principal requirement is a framework for view selection. In this work, we are the first to formalize and study the problem of XML reconstruction view selection. The input is a tree $T$, in which every node $i$ has a size $c_i$ and profit $p_i$, and the size limitation $C$. The target is to find a subset of subtrees rooted at nodes $i_1, \cdots, i_k$ respectively such that $c_{i_1} + \cdots + c_{i_k} \le C$, and $p_{i_1} + \cdots + p_{i_k}$ is maximal. Furthermore, there is no overlap between any two subtrees selected in the solution. We prove that this problem is NP-hard and present a fully polynomial-time approximation scheme (FPTAS) as a solution.
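    To make the problem statement concrete, here is a small exact solver written as a sketch. It is a pseudo-polynomial dynamic program over the tree, not the FPTAS from the paper. It assumes integer sizes, that $c_i$ is the size of the whole subtree rooted at node $i$, and that "no overlap" means no selected node is an ancestor of another. The `Node` class and `best_profit` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    size: int    # c_i: size of the subtree rooted at this node (assumed integer)
    profit: int  # p_i: profit of materializing that subtree
    children: List["Node"] = field(default_factory=list)


def best_profit(root: Node, capacity: int) -> int:
    """Maximum total profit of non-overlapping subtrees with total size <= capacity."""

    def solve(node: Node) -> List[int]:
        # table[w] = best profit achievable inside this node's subtree
        # using total size at most w.
        table = [0] * (capacity + 1)

        # Option A: skip this node and combine independent choices made
        # in the children's subtrees (a knapsack-style merge).
        for child in node.children:
            child_table = solve(child)
            table = [
                max(table[w - u] + child_table[u] for u in range(w + 1))
                for w in range(capacity + 1)
            ]

        # Option B: select the subtree rooted at this node, which excludes
        # every descendant, so its profit stands alone.
        for w in range(node.size, capacity + 1):
            table[w] = max(table[w], node.profit)
        return table

    return solve(root)[capacity]


# Toy instance: root -> {a -> {a1}, b}
a1 = Node(size=2, profit=3)
a = Node(size=4, profit=4, children=[a1])
b = Node(size=5, profit=3)
root = Node(size=10, profit=5, children=[a, b])
result = best_profit(root, 7)
```

    With capacity 7 the best choice in this toy tree is the pair of subtrees rooted at `a1` and `b`, with total size 7 and total profit 6. The running time grows with $C^2$ per tree edge, which is why an approximation scheme is of interest when $C$ is large.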

    K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources

    The integration of heterogeneous data sources and software systems is a major issue in the biomedical community and several approaches have been explored: linking databases, on-the-fly integration through views, and integration through warehousing. In this paper we report on our experiences with two systems that were developed at the University of Pennsylvania: an integration system called K2, which has primarily been used to provide views over multiple external data sources and software systems; and a data warehouse called GUS which downloads, cleans, integrates and annotates data from multiple external data sources. Although the view and warehouse approaches each have their advantages, there is no clear winner. Therefore, users must consider how the data is to be used, what the performance guarantees must be, and how much programmer time and expertise is available to choose the best strategy for a particular application.

    Online Integration of Semistructured Data

    Data integration systems play an important role in the development of distributed multi-database systems. Data integration collects data from heterogeneous and distributed sources, and provides a global view of data to the users. Systems need to process users' applications in the shortest possible time. The virtualization approach to data integration systems ensures that the answers to user requests are the most up-to-date ones. In contrast, the materialization approach reduces data transmission time at the expense of data consistency between the central and remote sites. The virtualization approach to data integration systems can be applied in either batch or online mode. Batch processing requires all data to be available at a central site before processing is started. Delays in transmission of data over a network contribute to a longer processing time. On the other hand, in an online processing mode data integration is performed piece-by-piece as soon as a unit of data is available at the central site. An online processing mode presents the partial results to the users earlier. Due to the heterogeneity of data models at the remote sites, a semistructured global view of data is required. The performance of data integration systems depends on an appropriate data model and the appropriate data integration algorithms used. This thesis presents a new algorithm for immediate processing of data collected from remote and autonomous database systems. The algorithm utilizes the idle processing states while the central site waits for completion of data transmission to produce instant partial results. A decomposition strategy included in the algorithm balances the computations between the central and remote sites to force maximum resource utilization at both sites. The thesis chooses the XML data model for the representation of semistructured data, and presents a new formalization of the XML data model together with a set of algebraic operations. 
The XML data model is used to provide a virtual global view of semistructured data. The algebraic operators are consistent with operations of relational algebra, such that any existing syntax-based query optimization technique developed for the relational model of data can be directly applied. The thesis shows how to optimize online processing by generating one online integration plan for several data increments. Further, the thesis shows how each independent increment expression can be processed in parallel on a multi-core processor system. The dynamic scheduling system proposed in the thesis is able to defer or terminate a plan such that materialization updates and unnecessary computations are minimized. The thesis shows that processing data chunks of fragmented XML documents allows for data integration in a shorter period of time. Finally, the thesis provides a clear formalization of the semistructured data model, a set of algorithms with high-level descriptions, and running examples. These formal backgrounds show that the proposed algorithms are implementable.

    Managing Uncertainty and Ontologies in Databases

    Nowadays a vast amount of data is generated in Extensible Markup Language (XML). However, it is necessary for applications in some domains to store and manipulate uncertain information, e.g. when the sensor inputs are noisy, or we want to store data that is uncertain. Another big change we can see in applications and web data is the increasing use of ontologies to describe the semantics of data, i.e., the semantic relationships between the terms in the databases. As such information is usually absent from traditional databases, there is tremendous opportunity to ask new kinds of queries that could not be handled in the past. This provides new challenges on how to manipulate and maintain such new kinds of database systems. In this dissertation, we will see how we can (i) incorporate and manipulate uncertainty in databases, and (ii) efficiently compute aggregates and maintain views on ontology databases. First, I explain applications that require manipulating uncertain information in XML databases and maintaining web ontology databases written in Resource Description Framework (RDF). I then introduce the probabilistic semistructured PXML data model with two formal semantics. I describe a set of algebraic operations and its efficient implementation. Aggregations of PXML instances are studied with two semantics proposed: possible-worlds semantics and expectation semantics. Efficient algorithms with pruning are given and evaluated to show their feasibility. I introduce PIXML, an interval probability version of PXML, and develop a formal semantics for it. A query language and its operational semantics are given and proved to be sound and complete. Based on XML, RDF is a language used to describe web ontologies. RDQL, an RDF query language, is extended to support view definition and aggregations. Two sets of algorithms are given to maintain non-aggregate and aggregate views. 
Experimental results show that they are efficient compared with standard relational view maintenance algorithms.

    Materialized view maintenance for XML documents

    Modeling ontology views: An abstract view model for semantic web

    The emergence of the Semantic Web (SW) and the related technologies promises to make the web a meaningful experience. However, high-level modelling, design and querying techniques prove to be a challenging task for organizations that are hoping to utilize the SW paradigm for their industrial applications. To address one such issue, in this paper, we propose an abstract view model with conceptual extensions for the SW. First we outline the view model, its properties and some modelling issues with the help of an industrial case study example. Then, we provide some discussion on constructing such views (at the conceptual level) using a set of operators. Later we provide a brief discussion on how this view model can be utilized in the MOVE [1] system to design and construct materialized ontology views to support ontology extraction.

    Bulkloading and Maintaining XML Documents

    The popularity of XML as an exchange and storage format brings about massive amounts of documents to be stored, maintained and analyzed -- a challenge that traditionally has been tackled with Database Management Systems (DBMS). To open up the content of XML documents to analysis with declarative query languages, efficient bulk loading techniques are necessary. Database technology has traditionally offered support for these tasks, yet it falls short of providing efficient automation techniques for the challenges that large collections of XML data raise. As a storage back-end, many applications rely on relational databases, which are designed for large data volumes. This paper studies bulk load and update algorithms for XML data stored in relational format and outlines opportunities and problems. We investigate both (1) bulk insertion and deletion and (2) updates in the form of edit scripts, which rely heavily on pointer-chasing techniques that are often considered orthogonal to the algebraic operations relational databases are optimized for. To get the most out of relational database systems, we show that one should make careful use of edit scripts and replace them with bulk operations if more than a very small portion of the database is updated. We implemented our ideas on top of the Monet Database System and benchmarked their performance.