18 research outputs found

    XML Matchers: approaches and challenges

    Full text link
    Schema Matching, i.e. the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in Database and Artificial Intelligence research areas for many years. In the past, it was largely investigated especially for classical database models (e.g., E/R schemas, relational databases, etc.). However, in the latest years, the widespread adoption of XML in the most disparate application fields pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, aiming at finding semantic matchings between concepts defined in DTDs and XSDs. XML Matchers do not just take well-known techniques originally designed for other data models and apply them on DTDs/XSDs, but they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs impact on the Schema Matching task. Then we introduce a template, called XML Matcher Template, that describes the main components of an XML Matcher, their role and behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as the baseline for objectively comparing approaches that, at first glance, might appear as unrelated. The introduction of this template can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers.Comment: 34 pages, 8 tables, 7 figure

    Investigation into Indexing XML Data Techniques

    Get PDF
    The rapid development of XML technology improves the WWW, since the XML data has many advantages and has become a common technology for transferring data cross the internet. Therefore, the objective of this research is to investigate and study the XML indexing techniques in terms of their structures. The main goal of this investigation is to identify the main limitations of these techniques and any other open issues. Furthermore, this research considers most common XML indexing techniques and performs a comparison between them. Subsequently, this work makes an argument to find out these limitations. To conclude, the main problem of all the XML indexing techniques is the trade-off between the size and the efficiency of the indexes. So, all the indexes become large in order to perform well, and none of them is suitable for all users’ requirements. However, each one of these techniques has some advantages in somehow

    Accelerating data retrieval steps in XML documents

    Get PDF

    Clustering-based Labelling Scheme - A Hybrid Approach for Efficient Querying and Updating XML Documents

    Get PDF
    Extensible Markup Language (XML) has become a dominant technology for transferring data through the worldwide web. The XML labelling schemes play a key role in handling XML data efficiently and robustly. Thus, many labelling schemes have been proposed. However, these labelling schemes have limitations and shortcomings. Thus, the aim of this research was to investigate the existing XML labelling schemes and their limitations in order to address the issue of efficiency of XML query performance. This thesis investigated the existing labelling schemes and classified them into three categories based on certain criteria, in order to identify the limitations and challenges of these labelling schemes. Based on the outcomes of this investigation, this thesis proposed a state-of-theart labelling scheme, called clustering-based labelling scheme, to resolve or improve the key limitations such as the efficiency of the XML query processing, labelling XML nodes, and XML updates cost. This thesis argued that using certain existing labelling schemes to label nodes, and using the clustering-based techniques can improve query and labelling nodes efficiency. Theoretically, the proposed scheme is based on dividing the nodes of an XML document into clusters. Two existing labelling schemes, which are the Dewey and LLS labelling schemes, were selected for labelling these clusters and their nodes. Subsequently, the proposed scheme was designed and implemented. In addition, the Dewey and LLS labelling scheme were implemented for the purpose of evaluating the proposed scheme. Subsequently, four experiments were designed in order to test the proposed scheme against the Dewey and LLS labelling schemes. The results of these experiments suggest that the proposed scheme achieved better results than the Dewey and LLS schemes. Consequently, the research hypothesis was accepted overall with few exceptions, and the proposed scheme showed an improvement in the performance and all the targeted features and aspects

    DescribeX: A Framework for Exploring and Querying XML Web Collections

    Full text link
    This thesis introduces DescribeX, a powerful framework that is capable of describing arbitrarily complex XML summaries of web collections, providing support for more efficient evaluation of XPath workloads. DescribeX permits the declarative description of document structure using all axes and language constructs in XPath, and generalizes many of the XML indexing and summarization approaches in the literature. DescribeX supports the construction of heterogeneous summaries where different document elements sharing a common structure can be declaratively defined and refined by means of path regular expressions on axes, or axis path regular expression (AxPREs). DescribeX can significantly help in the understanding of both the structure of complex, heterogeneous XML collections and the behaviour of XPath queries evaluated on them. Experimental results demonstrate the scalability of DescribeX summary refinements and stabilizations (the key enablers for tailoring summaries) with multi-gigabyte web collections. A comparative study suggests that using a DescribeX summary created from a given workload can produce query evaluation times orders of magnitude better than using existing summaries. DescribeX's light-weight approach of combining summaries with a file-at-a-time XPath processor can be a very competitive alternative, in terms of performance, to conventional fully-fledged XML query engines that provide DB-like functionality such as security, transaction processing, and native storage.Comment: PhD thesis, University of Toronto, 2008, 163 page

    A Labeling DOM-Based Tree Walking Algorithm for Mapping XML Documents into Relational Databases

    Get PDF
    XML has emerged as the standard format for representing and exchanging data on the World Wide Web. For practical purposes, it is found to be critical to have efficient mechanisms to store and query XML data, as well as to exploit the full power of this new technology. Several researchers have proposed to use relational databases to store and query XML data. With the understanding the limitations of current approaches, this thesis aims to propose an algorithm for automatic mapping XML documents to RDBMS with XML-API as a database utility. The algorithm uses best fit auto mapping technique, and dynamic shredding, of a specified selected XML document type (datacentric, document-centric, and mixed documents).e. The propose algorithm use DOM(Data Object Model) as a warehouse and stack as a data structure to mapping the XML document into relational database and reconstructing the XML document from the relational database. The experiment study show that the algorithm mapping document and reconstructing it again well. Finally, the algorithm compare with other algorithms the result is good in time and efficiency, also the algorithm complexity is O(11n+2)

    Labelling Dynamic XML Documents: A GroupBased Approach

    Get PDF
    Documents that comply with the XML standard are characterised by inherent ordering and their modelling usually takes the form of a tree. Nowadays, applications generate massive amounts of XML data, which requires accurate and efficient query-able XML database systems. XML querying depends on XML labelling in much the same way as relational databases rely on indexes. Document order and structural information are encoded by labelling schemes, thus facilitating their use by queries without having to access the original XML document. Dynamic XML data, data which changes, complicates the labelling scheme. As demonstrated by much research efforts, it is difficult to allocate unique labels to nodes in a dynamic XML tree so that all structural relationships between the nodes are encoded by the labels. Static XML documents are generally managed with labelling schemes that use simple labels. By contrast, dynamic labelling schemes have extra labelling costs and lower query performance to allow random updates irrespective of the document update frequency. Given that static and dynamic XML documents are often not clearly distinguished, a labelling scheme whose efficiency does not depend on updating frequency would be useful. The GroupBased labelling scheme proposed in this thesis is compatible with static as well as dynamic XML documents. In particular, this scheme has a high performance in processing dynamic XML data updates. What differentiates it from other dynamic labelling schemes is its uniform behaviour irrespective of whether the document is static or dynamic, ability to determine all structural relationships between nodes, and the improved query performance in both types of document. The advantages of the GroupBased scheme in comparison to earlier schemes are highlighted by the experiment results
    corecore