Search CORE

3 research outputs found

Exploiting Index Pruning Methods for Clustering XML Collections

Author: C.M. Vries De
F. Can
N. Jardine
S. Kutty
S. Zhang
T. Tran
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

In this paper, we first employ the well-known Cover-Coefficient Based Clustering Methodology (C3M) for clustering XML documents. Next, we apply index pruning techniques from the literature to reduce the size of the document vectors. Our experiments show that, in certain cases, it is possible to prune up to 70% of the collection (or, more specifically, the underlying document vectors) and still generate a clustering structure of the same quality as that of the original collection, in terms of a set of evaluation metrics.

Crossref

OpenMETU (Middle East Technical University)

Exploiting index pruning methods for clustering XML collections

Author: Altingovde I.S.
Atilgan D.
Ulusoy Ö.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

In this paper, we first employ the well-known Cover-Coefficient Based Clustering Methodology (C3M) for clustering XML documents. Next, we apply index pruning techniques from the literature to reduce the size of the document vectors. Our experiments show that, in certain cases, it is possible to prune up to 70% of the collection (or, more specifically, the underlying document vectors) and still generate a clustering structure of the same quality as that of the original collection, in terms of a set of evaluation metrics. © 2010 Springer-Verlag Berlin Heidelberg

Bilkent University Institutional Repository

Utilizing the structure and content information for XML document clustering

Author: Kutty Sangeetha
Nayak Richi
Tran Tien
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

This paper reports on the experiments and results of a clustering approach used in the INEX 2008 document mining challenge. The clustering approach utilizes both the structure and content information of the Wikipedia XML document collection. A latent semantic kernel (LSK) is used to measure the semantic similarity between XML documents based on their content features. Constructing a latent semantic kernel involves computing a singular value decomposition (SVD). On a large feature space matrix, computing the SVD is very expensive in terms of time and memory requirements. Thus, in this clustering approach, the dimension of the document space of a term-document matrix is reduced before performing SVD. The document space reduction is based on the common structural information of the Wikipedia XML document collection. The proposed clustering approach has been shown to be effective on the Wikipedia collection in the INEX 2008 document mining challenge.

Queensland University of Technology ePrints Archive