Extracting Web Information using Representation Patterns

Abstract

Feeding decision support systems with Web information typically requires sifting through an unwieldy amount of information that is available in human-friendly formats only. Our focus is on a proposal to extract information from semi-structured documents in a structured format, with an emphasis on it being scalable and open. By semi-structured, we mean that it must focus on information that is rendered using regular formats, not free text; by scalable, we mean that the system must require a minimum amount of human intervention and must not be targeted to extracting information from a particular domain or web site; by open, we mean that it must extract as much useful information as possible and not be subject to any pre-defined data model. In the literature, there is only one open proposal, but it is not scalable, since it requires human supervision on a per-domain basis. In this paper, we present a new proposal that relies on a number of heuristics to identify patterns that are typically used to represent the information in a web document. Our experimental results confirm that our proposal is very competitive in terms of effectiveness and efficiency.

Ministerio de Economía y Competitividad TIN2016-75394-R
Ministerio de Economía y Competitividad TIN2013-40848-
