Feeding decision support systems with Web information typically
requires sifting through an unwieldy amount of information that is
available in human-friendly formats only. Our focus is on a proposal
to extract information from semi-structured documents in a structured
format, with an emphasis on it being scalable and open. By
semi-structured, we mean that it must focus on information that is
rendered using regular formats, not free text; by scalable, we mean
that the system must require a minimum amount of human intervention
and must not be targeted at extracting information from a particular
domain or web site; by open, we mean
that it must extract as much useful information as possible and not
be subject to any pre-defined data model. In the literature, there is
only one open proposal, but it is not scalable since it requires human
supervision on a per-domain basis. In this paper, we present a new
proposal that relies on a number of heuristics to identify patterns
that are typically used to represent the information in a web
document. Our experimental results confirm that our proposal is very
competitive in terms of effectiveness and efficiency.

Ministerio de Economía y Competitividad TIN2016-75394-R
Ministerio de Economía y Competitividad TIN2013-40848-