Extracting Web Information using Representation Patterns

Corchuelo Gil, Rafael; Jiménez Aguirre, Patricia; Roldán Salvador, Juan Carlos

Authors: Rafael Corchuelo Gil
Patricia Jiménez Aguirre
Juan Carlos Roldán Salvador
Publication date: 1 January 2017
Publisher: 'Association for Computing Machinery (ACM)'

Abstract

Feeding decision support systems with Web information typically requires sifting through an unwieldy amount of information that is available in human-friendly formats only. Our focus is on a proposal to extract information from semi-structured documents in a structured format, with an emphasis on it being scalable and open. By semi-structured, we mean that it must focus on information that is rendered using regular formats, not free text; by scalable, we mean that the system must require a minimum amount of human intervention and must not be targeted at extracting information from a particular domain or web site; by open, we mean that it must extract as much useful information as possible and not be subject to any pre-defined data model. In the literature, there is only one open proposal, and it is not scalable, since it requires human supervision on a per-domain basis. In this paper, we present a new proposal that relies on a number of heuristics to identify patterns that are typically used to represent the information in a web document. Our experimental results confirm that our proposal is very competitive in terms of effectiveness and efficiency.

Ministerio de Economía y Competitividad TIN2016-75394-R
Ministerio de Economía y Competitividad TIN2013-40848-

Available Versions

idUS. Depósito de Investigación Universidad de Sevilla

oai:idus.us.es:11441/131931

Last time updated on 19/05/2022