    An efficient and scalable algorithm for clustering XML documents by structure

    An Optimistic Approach for Clustering Multi-version XML Documents Using Compressed Delta

    With the standardization of XML as an information exchange format on the web, a huge amount of information is now formatted as XML documents. XML documents are large, so the amount of information that has to be transmitted, processed, stored, and queried is often larger than with other data formats. In real-world applications XML documents are also dynamic. The broad applicability of XML in different fields of information maintenance and management is increasing the demand to store different versions of XML documents over time. However, storing all versions of an XML document may introduce redundancy, and the self-describing nature of XML makes documents verbose and therefore large. This paper proposes an optimistic approach to re-clustering multi-version XML documents that change over time: the distance between them is reassessed using knowledge from the initial clustering solution and the changes stored in a compressed delta. The growing size of the XML documents is reduced by applying homomorphic compression before clustering, which retains the original structure. The compressed delta stores the changes responsible for document versions without decompressing them. Test results show that our approach performs much better than full pairwise document comparison.

    A Progressive Clustering Algorithm to Group the XML Data by Structural and Semantic Similarity

    Since XML became popular for data representation and exchange over the Web, the number of XML documents in circulation has increased rapidly. Turning these documents into a more useful information utility has become a challenge for researchers. In this paper, we introduce a novel clustering algorithm, PCXSS, that groups heterogeneous XML documents according to their structural and semantic similarity. We develop a global criterion function, CPSim, that progressively measures the similarity between an XML document and the existing clusters, which removes the need to compute the similarity between pairs of individual documents. The experimental analysis shows the method to be fast and accurate.
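
    The idea of comparing each incoming document against existing clusters, rather than against every other document, can be illustrated with a minimal sketch. This is not PCXSS or CPSim: the set-of-features document model, the Jaccard measure, the `threshold` parameter, and the function names are all assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def progressive_cluster(docs, threshold=0.5):
    """Assign each document to the most similar existing cluster, or open a new one.

    docs: list of sets of features (hypothetical stand-in for structural and
    semantic features). Each document is compared only with cluster summaries,
    never with other individual documents.
    """
    clusters = []  # each cluster: {"features": set, "members": [doc indices]}
    for i, feats in enumerate(docs):
        best, best_sim = None, 0.0
        for c in clusters:
            sim = jaccard(feats, c["features"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            best["features"] |= feats  # cluster summary grows with each member
        else:
            clusters.append({"features": set(feats), "members": [i]})
    return [c["members"] for c in clusters]
```

    With n documents and k clusters this performs on the order of n times k comparisons instead of n squared, which is the general motivation for cluster-level comparison; the result depends on the order in which documents arrive.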

    Measuring the similarity of PML documents with RFID-based sensors

    The Electronic Product Code (EPC) Network is an important part of the Internet of Things. The Physical Mark-Up Language (PML) is used to represent and describe data related to objects in the EPC Network. The PML documents that each component exchanges in an EPC Network system are XML documents based on the PML Core schema. To manage the huge number of PML documents for tags captured by radio frequency identification (RFID) readers, it is necessary to develop high-performance technology, such as filtering and integrating these tag data. In this paper, we propose an approach for measuring the similarity of PML documents based on a Bayesian Network of several sensors. With respect to the features of PML, when measuring the similarity, we first remove redundant data other than EPC information. On this basis, the Bayesian Network model derived from the structure of the PML documents being compared is constructed. Comment: International Journal of Ad Hoc and Ubiquitous Computin

    Mining XML Documents

    XML documents are becoming ubiquitous because of their rich and flexible format, which can be used for a variety of applications. Given the increasing size of XML collections as information sources, mining techniques that traditionally exist for text collections or databases need to be adapted, and new methods need to be invented, to exploit the particular structure of XML documents. XML documents can be seen as trees, which are well known to be complex structures. This chapter describes various ways of using and simplifying this tree structure to model documents and support efficient mining algorithms. We focus on three mining tasks: classification and clustering, which are standard for text collections, and the discovery of frequent tree structures, which is especially important for heterogeneous collections. This chapter presents some recent approaches and algorithms to support these tasks, together with experimental evaluation on a variety of large XML collections.
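
    One common way to simplify an XML tree is to flatten it into its set of root-to-leaf tag paths. The sketch below shows that idea using Python's standard `xml.etree.ElementTree`; it is an illustrative assumption, not a method taken from the chapter, and the function name `tag_paths` is hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-leaf tag paths of an XML document.

    Text content, attributes, and repeated siblings are ignored, so only the
    structural skeleton of the tree is kept.
    """
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        children = list(node)
        if not children:
            paths.add(path)  # leaf element: record the full path
        for child in children:
            walk(child, path)

    walk(root, "")
    return paths


# Example: two <d> siblings collapse into a single path.
print(sorted(tag_paths("<a><b><c/></b><d/><d/></a>")))
```

    Two documents can then be compared through their path sets, for example with a set-overlap measure, at the cost of losing sibling order and repetition counts.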

    XML Matchers: approaches and challenges

    Schema Matching, i.e. the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in Database and Artificial Intelligence research areas for many years. In the past, it was largely investigated for classical database models (e.g., E/R schemas, relational databases, etc.). However, in recent years, the widespread adoption of XML in the most disparate application fields has pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, which aim at finding semantic matchings between concepts defined in DTDs and XSDs. XML Matchers do not just take well-known techniques originally designed for other data models and apply them to DTDs/XSDs; they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs affect the Schema Matching task. Then we introduce a template, called XML Matcher Template, that describes the main components of an XML Matcher, their role and behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as the baseline for objectively comparing approaches that, at first glance, might appear unrelated. The introduction of this template can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers. Comment: 34 pages, 8 tables, 7 figure

    Ant colony optimization based clustering for data partitioning.

    Woo Kwan Ho. Thesis (M.Phil.), Chinese University of Hong Kong, 2005. Includes bibliographical references (leaves 148-155). Abstracts in English and Chinese.

    Contents
    Abstract
    Acknowledgements
    List of Figures
    List of Tables
    Chapter 1. Introduction
    Chapter 2. Literature Reviews
        2.1 Block Clustering
        2.2 Clustering XML by structure
            2.2.1 Definition of XML schematic information
            2.2.2 Identification of XML schematic information
    Chapter 3. Bi-Tour Ant Colony Optimization for diagonal clustering
        3.1 Motivation
        3.2 Framework of Bi-Tour Ant Colony Algorithm
        3.3 Re-order of the data matrix in BTACO clustering method
            3.3.1 Review of Ant Colony Optimization
            3.3.2 Bi-Tour Ant Colony Optimization
        3.4 Determination of partitioning scheme
            3.4.1 Weighed Sum of Error (WSE)
            3.4.2 Materialization of partitioning scheme via hypothetic matrix
            3.4.3 Search of best-fit hypothetic matrix
            3.4.4 Dynamic programming approach
            3.4.5 Heuristic partitioning approach
        3.5 Experimental Study
            3.5.1 Data set
            3.5.2 Study on DP Approach and HP Approach
            3.5.3 Study on parameter settings
            3.5.4 Comparison with GA-based & hierarchical clustering methods
        3.6 Chapter conclusion
    Chapter 4. Application of BTACO-based clustering in XML database system
        4.1 Introduction
        4.2 Overview of normalization and vertical partitioning in relational DB design
            4.2.1 Normalization of relational models in database design
            4.2.2 Vertical partitioning in database design
        4.3 Clustering XML documents
        4.4 Proposed approach using BTACO-based clustering
            4.4.1 Clustering XML documents by structure
            4.4.2 Clustering XML documents by user transaction patterns
            4.4.3 Implementation of Query Manager for our experimental study
        4.5 Experimental Study
            4.5.1 Experimental Study on the clustering by structure
            4.5.2 Experimental Study on the clustering by user access patterns
        4.6 Chapter conclusion
    Chapter 5. Conclusions
        5.1 Contributions
        5.2 Future works
    Bibliography
    Appendix I
    Appendix II
        Index tables for Profile A
        Index tables for Profile B
    Appendix III