
3 research outputs found

Approximating the schema of a set of documents by means of resemblance

Author: Abelló Gamazo Alberto
Hacid Mohand-Saïd
Palol Xavier de
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

The WWW contains a huge number of documents. Some of them share the same subject but are generated by different people or even by different organizations. A semi-structured model makes it possible to share documents that do not have exactly the same structure. However, it does not facilitate the understanding of such heterogeneous documents. In this paper, we offer a characterization and an algorithm to obtain a representative (in terms of a resemblance function) of a set of heterogeneous semi-structured documents. We approximate the representative so that the resemblance function is maximized. Then, the algorithm is generalized to deal with repetitions and different classes of documents. Although an exact representative could always be found using an unlimited number of optional elements, it would cause an overfitting problem. The size of an exact representative for a set of heterogeneous documents may even make it useless. Our experiments show that, for users, it is easier and faster to deal with smaller representatives, even compensating for the loss in the approximation. Peer reviewed. Postprint (author's final draft).
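The idea of approximating a representative that maximizes a resemblance function can be illustrated with a small sketch. This is not the algorithm from the paper, whose resemblance function and search procedure are not given in the abstract. It assumes that each document is reduced to a set of element paths, that resemblance is the Jaccard index, and that a greedy search is acceptable; the names `jaccard`, `mean_resemblance`, `greedy_representative`, and `max_size` are hypothetical.

```python
from typing import List, Optional, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Assumed resemblance function: Jaccard index of two path sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_resemblance(candidate: Set[str], docs: List[Set[str]]) -> float:
    """Average resemblance between a candidate representative and all documents."""
    return sum(jaccard(candidate, d) for d in docs) / len(docs)


def greedy_representative(docs: List[Set[str]],
                          max_size: Optional[int] = None) -> Set[str]:
    """Greedily add the path that most improves the mean resemblance.

    Stops when no path improves the score or when max_size is reached,
    so the result is an approximation, not an exact representative.
    """
    if not docs:
        raise ValueError("at least one document is required")
    universe = set().union(*docs)
    rep: Set[str] = set()
    best = mean_resemblance(rep, docs)
    while universe - rep and (max_size is None or len(rep) < max_size):
        # Score every remaining path; sorting makes tie-breaking deterministic.
        scored = [(mean_resemblance(rep | {p}, docs), p)
                  for p in sorted(universe - rep)]
        score, path = max(scored, key=lambda t: t[0])
        if score <= best:
            break  # adding more elements would only lower the resemblance
        rep.add(path)
        best = score
    return rep


docs = [
    {"title", "author", "year"},
    {"title", "author", "isbn"},
    {"title", "author"},
]
representative = greedy_representative(docs)
```

In this toy input the search keeps only the paths shared by all three documents, and the resulting representative scores a higher mean resemblance than the union of all paths, which loosely mirrors the abstract's point that a smaller representative can be preferable to an exhaustive one.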

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

HAL

Hal-Diderot

Approximating the schema of a set of documents by means of resemblance

Author: Abelló Gamazo Alberto
Hacid Mohand-Saïd
Palol Xavier de
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'

RECERCAT

Approximating the Schema of a Set of Documents by Means of Resemblance

Author: AV Leonov
E Bertino
E Gallinucci
G Guerrini
GJ Bex
J Albert
J Widom
J-K Min
L Wang
M Garofalakis
R Nayak
S Abiteboul
T Dalamagas
V Batagelj
W Lian
Z Zhang
Publication venue: 'Springer Science and Business Media LLC'

Crossref