3 research outputs found

XML documents clustering using a tensor space model

Authors: Kutty Sangeetha, Li Yuefeng, Nayak Richi
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

The traditional Vector Space Model (VSM) cannot represent both the structure and the content of XML documents. This paper introduces a novel method of representing XML documents in a Tensor Space Model (TSM) and then using that representation for clustering. Empirical analysis shows that the proposed method scales to large datasets; in addition, the factorized matrices it produces improve cluster quality through an enriched document representation of both structure and content information.

CiteSeerX

Queensland University of Technology ePrints Archive

Mining XML Documents

Authors: Candillier Laurent, Denoyer Ludovic, Gallinari Patrick, Rousset Marie-Christine, Termier Alexandre, Vercoustre Anne-Marie
Publication venue: 'IGI Global'
Publication date: 01/01/2007

XML documents are becoming ubiquitous because of their rich and flexible format, which can be used for a variety of applications. Given the increasing size of XML collections as information sources, mining techniques that traditionally exist for text collections or databases need to be adapted, and new methods need to be invented, to exploit the particular structure of XML documents. XML documents can be seen as trees, which are well known to be complex structures. This chapter describes various ways of using and simplifying this tree structure to model documents and support efficient mining algorithms. We focus on three mining tasks: classification and clustering, which are standard for text collections, and the discovery of frequent tree structures, which is especially important for heterogeneous collections. This chapter presents some recent approaches and algorithms to support these tasks, together with experimental evaluation on a variety of large XML collections.

HAL - Lille 3

INRIA a CCSD electronic archive server

A Flexible Structured-Based Representation for XML Document Mining

Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2006

Crossref