On the Midpoint of a Set of XML Documents.

Author: Abelló Alberto
de Palol Xavier
Hacid Mohand-Said
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/08/2005

The WWW contains a huge number of documents. Some of them share a subject but are generated by different people or even different organizations. To guarantee the interchange of such documents, we can use XML, which allows sharing documents that do not have the same structure. However, this makes it difficult to understand the core of such heterogeneous documents (in general, a schema is not available). In this paper, we offer a characterization and an algorithm to obtain the midpoint (in terms of a resemblance function) of a set of semi-structured, heterogeneous documents without optional elements. The trivial case of a midpoint would be the elements common to all documents. Nevertheless, with several heterogeneous documents this may result in an empty set. Thus, we consider that those elements present in a given number of documents belong to the midpoint. An exact schema could always be found by generating optional elements. However, the exact schema of the whole set may result in overspecialization (lots of optional elements), which would make it useless.
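
The thresholding idea in the abstract (keep the elements that occur in a given number of documents) can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not define the resemblance function, so the sketch simply counts root-to-element tag paths. The helper names `element_paths` and `frequent_paths`, the `threshold` parameter, and the sample documents are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(xml_text):
    """Return the set of root-to-element tag paths occurring in one document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def frequent_paths(documents, threshold):
    """Keep paths present in at least `threshold` (a fraction in [0, 1]) of the documents.

    Assumption: a plain frequency count stands in for the paper's
    resemblance-based midpoint, which the abstract does not specify.
    """
    counts = Counter()
    for doc in documents:
        counts.update(element_paths(doc))
    minimum = threshold * len(documents)
    return {path for path, n in counts.items() if n >= minimum}


# Hypothetical sample documents about the same subject with different structure.
docs = [
    "<book><title/><author/><year/></book>",
    "<book><title/><author/><isbn/></book>",
    "<book><title/><publisher/></book>",
]

common_to_all = frequent_paths(docs, 1.0)  # strict intersection
majority = frequent_paths(docs, 0.6)       # present in at least 60% of documents
```

With a threshold of 1.0 the result is the strict intersection, which shrinks quickly as documents diverge; lowering the threshold admits elements such as `book/author` that appear in most but not all documents.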

HAL

Hal-Diderot

On the Midpoint of a Set of XML Documents.

Author: Abelló Alberto
de Palol Xavier
Hacid Mohand-Said
Publication venue: Springer
Publication date: 01/08/2005

HAL

On the Midpoint of a Set of XML Documents

Author: Alberto Abelló
Xavier De Palol
Mohand-Saïd Hacid
Affiliation: LIRIS - UFR d'Informatique, U. Claude Bernard Lyon
Publication date: 01/01/2005

Abstract. The WWW contains a huge number of documents. Some of them share a subject but are generated by different people or even different organizations. To guarantee the interchange of such documents, we can use XML, which allows sharing documents that do not have the same structure. However, this makes it difficult to understand the core of such heterogeneous documents (in general, a schema is not available). In this paper, we offer a characterization and an algorithm to obtain the midpoint (in terms of a resemblance function) of a set of semi-structured, heterogeneous documents without optional elements. The trivial case of a midpoint would be the elements common to all documents. Nevertheless, with several heterogeneous documents this may result in an empty set. Thus, we consider that those elements present in a given number of documents belong to the midpoint. An exact schema could always be found by generating optional elements. However, the exact schema of the whole set may result in overspecialization (lots of optional elements), which would make it useless.

CiteSeerX