    Online Integration of Semistructured Data

    Data integration systems play an important role in the development of distributed multi-database systems. Data integration collects data from heterogeneous and distributed sources, and provides a global view of data to the users. Systems need to process user's applications in the shortest possible time. The virtualization approach to data integration systems ensures that the answers to user requests are the most up-to-date ones. In contrast, the materialization approach reduces data transmission time at the expense of data consistency between the central and remote sites. The virtualization approach to data integration systems can be applied in either batch or online mode. Batch processing requires all data to be available at a central site before processing is started. Delays in transmission of data over a network contribute to a longer processing time. On the other hand, in an online processing mode data integration is performed piece-by-piece as soon as a unit of data is available at the central site. An online processing mode presents the partial results to the users earlier. Due to the heterogeneity of data models at the remote sites, a semistructured global view of data is required. The performance of data integration systems depends on an appropriate data model and the appropriate data integration algorithms used. This thesis presents a new algorithm for immediate processing of data collected from remote and autonomous database systems. The algorithm utilizes the idle processing states while the central site waits for completion of data transmission to produce instant partial results. A decomposition strategy included in the algorithm balances the computations between the central and remote sites to force maximum resource utilization at both sites. The thesis chooses the XML data model for the representation of semistructured data, and presents a new formalization of the XML data model together with a set of algebraic operations. The XML data model is used to provide a virtual global view of semistructured data. The algebraic operators are consistent with operations of relational algebra, such that any existing syntax-based query optimization technique developed for the relational model of data can be directly applied. The thesis shows how to optimize online processing by generating one online integration plan for several data increments. Further, the thesis shows how each independent increment expression can be processed in a parallel mode on a multi-core processor system. The dynamic scheduling system proposed in the thesis is able to defer or terminate a plan such that materialization updates and unnecessary computations are minimized. The thesis shows that processing data chunks of fragmented XML documents allows for data integration in a shorter period of time. Finally, the thesis provides a clear formalization of the semistructured data model, a set of algorithms with high-level descriptions, and running examples. These formal backgrounds show that the proposed algorithms are implementable.
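
The contrast between batch and online processing described in this abstract can be illustrated with a small sketch. The code below is not the thesis's algorithm. It swaps in a standard symmetric hash join, and the site labels `A` and `B`, the record layout, and the function name `online_join` are all assumptions made for illustration. It shows how each arriving increment can be combined with the data already received, so that partial results appear before transmission completes.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Record = Tuple[str, str]  # (join key, payload); a hypothetical record layout


def online_join(
    arrivals: Iterator[Tuple[str, Record]],
) -> Iterator[Tuple[str, str, str]]:
    """Symmetric hash join over increments arriving from two sites, "A" and "B".

    Each increment is stored and immediately probed against everything
    already received from the other site, so results are emitted as soon
    as both matching pieces are present.
    """
    seen: Dict[str, Dict[str, List[str]]] = {
        "A": defaultdict(list),
        "B": defaultdict(list),
    }
    for site, (key, payload) in arrivals:
        other = "B" if site == "A" else "A"
        seen[site][key].append(payload)
        # Probe the other site's data received so far.
        for match in seen[other].get(key, []):
            if site == "A":
                yield (key, payload, match)
            else:
                yield (key, match, payload)


# Increments arrive interleaved, in whatever order the network delivers them.
arrivals = [
    ("A", ("k1", "a1")),
    ("B", ("k2", "b2")),
    ("B", ("k1", "b1")),  # first partial result becomes available here
    ("A", ("k2", "a2")),
]
results = list(online_join(iter(arrivals)))
```

A batch version would wait for all four increments before computing anything. Here the first joined tuple is produced after the third arrival.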

    Compressed materialised views of semi-structured data

    Query performance issues over semi-structured data have led to the emergence of materialised XML views as a means of restricting the data structure processed by a query. However, preserving the conventional representation of such views remains a significant limiting factor, especially in the context of mobile devices, where processing power, memory usage and bandwidth are significant factors. To explore the concept of a compressed materialised view, we extend our earlier work on structural XML compression to produce a combination of structural summarisation and data compression techniques. These techniques provide a basis for efficiently dealing with both structural queries and value-based predicates. We evaluate the effectiveness of such a scheme, presenting results and performance measures that show the advantages of using such structures.

    XML Schema Clustering with Semantic and Hierarchical Similarity Measures

    With the growing popularity of XML as a data representation language, collections of XML data have exploded in number. Methods are required to manage these collections and to discover useful information in them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistic and contextual aspects of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.

    Efficient data representation for XML in peer-based systems

    Purpose - New directions in the provision of end-user computing experiences mean that the best way to share data between small mobile computing devices needs to be determined. Partitioning large structures so that they can be shared efficiently provides a basis for data-intensive applications on such platforms. The partitioned structure can be compressed using dictionary-based approaches and then directly queried without firstly decompressing the whole structure. Design/methodology/approach - The paper describes an architecture for partitioning XML into structural and dictionary elements and the subsequent manipulation of the dictionary elements to make the best use of available space. Findings - The results indicate that considerable savings are available by removing duplicate dictionaries. The paper also identifies the most effective strategy for defining dictionary scope. Research limitations/implications - This evaluation is based on a range of benchmark XML structures and the approach to minimising dictionary size shows benefit in the majority of these. Where structures are small and regular, the benefits of efficient dictionary representation are lost. The authors' future research now focuses on heuristics for further partitioning of structural elements. Practical implications - Mobile applications that need access to large data collections will benefit from the findings of this research. Traditional client/server architectures are not suited to dealing with high volume demands from a multitude of small mobile devices. Peer data sharing provides a more scalable solution and the experiments that the paper describes demonstrate the most effective way of sharing data in this context. Social implications - Many services are available via smartphone devices but users are wary of exploiting the full potential because of the need to conserve battery power. 
The approach mitigates this challenge and consequently expands the potential for users to benefit from mobile information systems. This will have impact in areas such as advertising, entertainment and education but will depend on the acceptability of file sharing being extended from the desktop to the mobile environment. Originality/value - The original work characterises the most effective way of sharing large data sets between small mobile devices. This will save battery power on devices such as smartphones, thus providing benefits to users of such devices
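
The idea of partitioning XML into structural and dictionary elements, and querying the result without decompressing it, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's architecture: the single shared dictionary, the `(depth, tag_code, text_code)` structure entries, and the function names `partition` and `count_value` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

Entry = Tuple[int, int, int]  # (depth, tag code, text code); -1 means no text


def partition(xml_text: str) -> Tuple[List[Entry], List[str]]:
    """Split a document into a structure list and one shared string dictionary."""
    dictionary: List[str] = []
    codes: Dict[str, int] = {}

    def code(s: str) -> int:
        # Each distinct string is stored once; repeats reuse the same code.
        if s not in codes:
            codes[s] = len(dictionary)
            dictionary.append(s)
        return codes[s]

    structure: List[Entry] = []

    def walk(node: ET.Element, depth: int) -> None:
        text = (node.text or "").strip()
        structure.append((depth, code(node.tag), code(text) if text else -1))
        for child in node:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 0)
    return structure, dictionary


def count_value(structure: List[Entry], dictionary: List[str],
                tag: str, value: str) -> int:
    """Count <tag>value</tag> elements by comparing integer codes only."""
    try:
        t, v = dictionary.index(tag), dictionary.index(value)
    except ValueError:
        return 0  # a string absent from the dictionary cannot match anything
    return sum(1 for _, tc, xc in structure if tc == t and xc == v)


doc = ("<lib><book><lang>en</lang></book><book><lang>en</lang></book>"
       "<book><lang>fr</lang></book></lib>")
structure, dictionary = partition(doc)
matches = count_value(structure, dictionary, "lang", "en")
```

The query never rebuilds the original text. It translates the search strings to codes once and then scans the structure list, which is the property the abstract relies on when it refers to querying without decompressing the whole structure.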

    Accelerating data retrieval steps in XML documents


    State-of-the-art on evolution and reactivity

    In Chapter 1, this report outlines aspects of querying and updating resources on the Web and on the Semantic Web, including the development of query and update languages to be carried out within the Rewerse project. From this outline, it becomes clear that several existing research areas and topics are of interest for this work in Rewerse. In the remainder of the report we present state-of-the-art surveys in a selection of these areas and topics. More precisely: in Chapter 2 we give an overview of logics for reasoning about state change and updates; Chapter 3 briefly describes existing update languages for the Web, and also for updating logic programs; in Chapter 4 event-condition-action rules, both in the context of active database systems and in the context of semistructured data, are surveyed; in Chapter 5 we give an overview of some relevant rule-based agent frameworks.

    K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources

    The integration of heterogeneous data sources and software systems is a major issue in the biomedical community and several approaches have been explored: linking databases, on-the-fly integration through views, and integration through warehousing. In this paper we report on our experiences with two systems that were developed at the University of Pennsylvania: an integration system called K2, which has primarily been used to provide views over multiple external data sources and software systems; and a data warehouse called GUS which downloads, cleans, integrates and annotates data from multiple external data sources. Although the view and warehouse approaches each have their advantages, there is no clear winner. Therefore, users must consider how the data is to be used, what the performance guarantees must be, and how much programmer time and expertise is available to choose the best strategy for a particular application.