Contents

Publication venue: Západočeská univerzita v Plzni
Publication date: 01/01/2017

On the synthesis of metadata tags for HTML files

Author: Corchuelo Gil Rafael
Gallego Fernando O.
Jiménez Aguirre Patricia
Roldán Salvador Juan Carlos
Publication venue: 'Wiley'
Publication date: 01/01/2020

RDFa, JSON-LD, Microdata, and Microformats make it possible to endow the data in HTML files with metadata tags that help software agents understand them. Unfortunately, many HTML files do not have any metadata tags, which has motivated many authors to work on proposals to synthesize them. These proposals have some problems: the authors either provide an overall picture of their designs without much detail on the underlying techniques, or focus on the techniques without describing the design of the software systems that support them; many of them cannot deal with data encoded in semistructured formats like forms, listings, or tables; and the few proposals that can work on tables can deal with horizontal listings only. In this article, we describe the design of a system that overcomes these limitations using a novel embedding approach that has proven to outperform four state-of-the-art techniques on a repository of randomly selected HTML files from 40 different sites. According to our experimental analysis, our proposal can achieve an F1 score that outperforms the others by 10.14%; this difference was confirmed to be statistically significant at the standard confidence level.

Junta de Andalucía P18-RT-1060; Ministerio de Economía y Competitividad TIN2013-40848-R; Ministerio de Economía y Competitividad TIN2016-75394-

Crossref

idUS. Depósito de Investigación Universidad de Sevilla

Information extraction from the web by matching visual presentation patterns

Author: Burget Radek
Minárik Matej
Publication venue: Západočeská univerzita v Plzni
Publication date: 01/01/2017

There is a large amount of data available on the Web. Data are often represented as text, enriched with tables, lists, images, or other visual structures. These data are usually coded in HTML without any additional semantics, which makes them nearly impossible to process and extract automatically. There are approaches based on top-down document segmentation according to visual information and layout. We present a bottom-up approach which starts with the smallest consistent elements and matches the visual relationships among these elements to a pre-defined ontological structure of extracted records. This method considers not only the visual attributes of a particular segment, but also its position amongst other segments.

University of West Bohemia Digital Library

DSpace at University of West Bohemia

Information Extraction from the Web by Matching Visual Presentation Patterns

Author: AD Iorio
D Weng
DW Embley
JL Hong
M Kolchin
M Milicka
N Anderson
R Burget
W Su
Publication venue: 'Springer Science and Business Media LLC'

Crossref