3 research outputs found

    Exploiting Index Pruning Methods for Clustering XML Collections

    Get PDF
    In this paper, we first employ the well known Cover-Coefficient Based Clustering Methodology (C3 M) for clustering XML documents. Next, we apply index pruning techniques from the literature to reduce the size of the document vectors. Our experiments show that for certain cases, it is possible to prune up to 70% of the collection (or, more specifically, underlying document vectors) and still generate a clustering structure that yields the same quality with that of the original collection, in terms of a set of evaluation metrics

    Exploiting index pruning methods for clustering XML collections

    Get PDF
    In this paper, we first employ the well known Cover-Coefficient Based Clustering Methodology (C3M) for clustering XML documents. Next, we apply index pruning techniques from the literature to reduce the size of the document vectors. Our experiments show that for certain cases, it is possible to prune up to 70% of the collection (or, more specifically, underlying document vectors) and still generate a clustering structure that yields the same quality with that of the original collection, in terms of a set of evaluation metrics. © 2010 Springer-Verlag Berlin Heidelberg

    Utilizing the structure and content information for XML document clustering

    No full text
    This paper reports on the experiments and results of a clustering approach used in the INEX 2008 document mining challenge. The clustering approach utilizes both the structure and content information of the Wikipedia XML document collection. A latent semantic kernel (LSK) is used to measure the semantic similarity between XML documents based on their content features. The construction of a latent semantic kernel involves the computing of singular vector decomposition (SVD). On a large feature space matrix, the computation of SVD is very expensive in terms of time and memory requirements. Thus in this clustering approach, the dimension of the document space of a term-document matrix is reduced before performing SVD. The document space reduction is based on the common structural information of the Wikipedia XML document collection. The proposed clustering approach has shown to be effective on the Wikipedia collection in the INEX 2008 document mining challenge.</p
    corecore