28 research outputs found

    An Unsupervised Technique to Extract Information from Semi-structured Web Pages

    We propose a technique that takes two or more web pages generated by the same server-side template and tries to learn a regular expression that represents the template and helps extract relevant information from similar pages. Our experimental results on real-world web sites demonstrate that our technique outperforms others in terms of both effectiveness and efficiency and is not affected by HTML errors.
    Funding: Ministerio de Ciencia y Tecnología TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-
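    The core idea lends itself to a compact illustration. The sketch below is ours, not the authors' algorithm: it aligns the token streams of two template-generated pages with Python's difflib, keeps shared runs as literals, and generalises the divergent spans into capture groups.

        import re
        from difflib import SequenceMatcher

        def induce_template_regex(page_a, page_b):
            # Align the token streams of two pages generated by the same
            # template: shared runs become literals, divergent spans become
            # capture groups that will hold the relevant information.
            a, b = page_a.split(), page_b.split()
            matcher = SequenceMatcher(None, a, b, autojunk=False)
            parts, prev = [], 0
            for i, _, size in matcher.get_matching_blocks():
                if i > prev:                     # divergent span in page_a
                    parts.append(r"(.*?)")
                if size:
                    parts.append(r"\s+".join(re.escape(t) for t in a[i:i + size]))
                prev = i + size
            return re.compile(r"\s+".join(parts), re.DOTALL)

    Applying the compiled expression to a third page generated by the same template then yields the variable fields as groups.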

    A Survey on Region Extractors from Web Documents

    Extracting information from web documents has become a research area in which new proposals sprout year after year. This has motivated several researchers to work on surveys that attempt to provide an overall picture of the many existing proposals. Unfortunately, none of these surveys provides a complete picture, because they do not take region extractors into account. These tools are a kind of preprocessor: they help information extractors focus on the regions of a web document that contain relevant information. With the increasing complexity of web documents, region extractors are becoming essential to extract information from many websites. Beyond information extraction, region extractors have also found their way into information retrieval, focused web crawling, topic distillation, adaptive content delivery, mashups, and metasearch engines. In this paper, we survey the existing proposals regarding region extractors and compare them side by side.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
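    To make the preprocessor role concrete, here is a minimal sketch of how a region extractor slots in front of an information extractor; region_extractor and information_extractor are hypothetical callables, not components described in the survey.

        def extract(document, region_extractor, information_extractor):
            # The region extractor acts as a preprocessor: it discards
            # navigation bars, advertisements, and other noise, so the
            # information extractor only sees candidate data regions.
            records = []
            for region in region_extractor(document):
                records.extend(information_extractor(region))
            return records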

    Information Extraction Framework

    The literature provides many techniques to infer rules that can be used to configure web information extractors. Unfortunately, these techniques have been developed independently, which makes it very difficult to compare their results: there is not even a collection of datasets on which they can be assessed. Furthermore, there is no common infrastructure on which to implement these techniques, which makes implementing them costly. In this paper, we propose a framework that helps software engineers implement their techniques and compare the results. Having such a framework allows techniques to be compared side by side, and our experiments prove that it helps reduce development costs.
    Funding: Ministerio de Ciencia e Innovación TIN2010-21744-C02-01; Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-09988-
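    A framework of this kind typically rests on a shared plug-in interface plus a common evaluation harness. The sketch below illustrates that general pattern under our own assumed names (ExtractionTechnique, benchmark); it is not the framework's actual API.

        from abc import ABC, abstractmethod

        class ExtractionTechnique(ABC):
            # Hypothetical plug-in interface: every rule-learning technique
            # implements the same two methods, so results become comparable.
            @abstractmethod
            def learn(self, annotated_pages): ...

            @abstractmethod
            def extract(self, page): ...

        def benchmark(techniques, datasets):
            # Shared harness: run every technique on every dataset and score
            # it with one common metric (exact-match accuracy here).
            scores = {}
            for tech in techniques:
                for name, (train, test) in datasets.items():
                    tech.learn(train)
                    hits = sum(tech.extract(page) == gold for page, gold in test)
                    scores[(type(tech).__name__, name)] = hits / len(test)
            return scores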

    A Reference Architecture to Devise Web Information Extractors

    The Web is the largest repository of human-friendly information. Unfortunately, web information is embedded in formatting tags and is surrounded by irrelevant information. Researchers are working on information extractors that allow transforming this information into structured data for its later integration into automated processes. Devising a new information extraction technique requires an array of tasks that are specific to that technique and many tasks that are actually common to all techniques. The literature lacks a reference architecture to guide software engineers in the design and implementation of information extractors, which results in little reuse and a focus that is often blurred by irrelevant details. In this paper, we present a reference architecture to design and implement rule learners for information extractors. We have implemented a software framework to support our architecture, and we have validated it by means of four case studies and a number of experiments that prove that our proposal helps reduce development costs significantly.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
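    A reference architecture for rule learners maps naturally onto the template-method pattern: the tasks common to all techniques are fixed in a base class and only the technique-specific inference step varies. The following sketch uses hypothetical names (RuleLearner, infer_rules) and is not the architecture proposed in the paper.

        from abc import ABC, abstractmethod

        class RuleLearner(ABC):
            # Template method: the pipeline that is common to all techniques
            # is fixed here; only the inference step varies per technique.
            def devise(self, pages, annotations):
                token_streams = [self.tokenise(p) for p in pages]  # shared task
                return self.infer_rules(token_streams, annotations)

            def tokenise(self, page):
                # Shared, reusable preprocessing; naive whitespace split here.
                return page.split()

            @abstractmethod
            def infer_rules(self, token_streams, annotations):
                """The only technique-specific step a new proposal must write."""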

    Trinity: On Using Trinary Trees for Unsupervised Web Data Extraction

    Web data extractors are used to extract data from web documents in order to feed automated processes. In this article, we propose a technique that works on two or more web documents generated by the same server-side template and learns a regular expression that models the template and can later be used to extract data from similar documents. The technique builds on the hypothesis that the template introduces some shared patterns that do not provide any relevant data and can thus be ignored. We have evaluated and compared our technique to others in the literature on a large collection of web documents; our results demonstrate that our proposal performs better than the others and that input errors do not have a negative impact on its effectiveness; furthermore, its efficiency can easily be boosted by means of a couple of parameters, without sacrificing its effectiveness.
    Funding: Ministerio de Ciencia y Tecnología TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-
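    The hypothesis can be pictured as follows: the longest pattern shared by all documents is assumed to come from the template, so each document is split into a prefix, the separator, and a suffix, and the process recurses, yielding a trinary tree. The sketch below follows that idea with a deliberately naive quadratic search; it is not Trinity's actual search strategy.

        def shared_pattern(texts, min_len=2):
            # Longest substring of the first text that occurs in every text.
            # A quadratic sketch; Trinity searches far more efficiently.
            base = texts[0]
            for size in range(len(base), min_len - 1, -1):
                for start in range(len(base) - size + 1):
                    candidate = base[start:start + size]
                    if all(candidate in t for t in texts):
                        return candidate
            return None

        def build_trinary_tree(texts):
            # Split every text around the shared pattern into prefix,
            # separator, and suffix, then recurse; leaves hold variable data.
            separator = shared_pattern(texts)
            if separator is None:
                return {"variable": texts}
            heads_tails = [t.split(separator, 1) for t in texts]
            return {"separator": separator,
                    "prefix": build_trinary_tree([h for h, _ in heads_tails]),
                    "suffix": build_trinary_tree([t for _, t in heads_tails])}

    A regular expression can then be read off the tree: separators become literals and variable leaves become capture groups.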

    ARIEX: Automated ranking of information extractors

    Information extractors are used to transform the user-friendly information in a web document into structured information that can be used to feed a knowledge-based system. Researchers are interested in ranking them to find out which one performs best. Unfortunately, many rankings in the literature are deficient. There are a number of formal methods to rank information extractors, but they also have many problems and have not reached widespread popularity. In this article, we present ARIEX, an automated method to rank web information extraction proposals. It does not have any of the problems that we have identified in the literature. Our proposal should help authors make sure that they have advanced the state of the art not only conceptually, but also from an empirical point of view; it should also help practitioners make informed decisions on which proposal is the most adequate for a particular problem.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-
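    For intuition only, the following sketch shows the simplest conceivable ranking, ordering extractors by mean F1 over a suite of datasets; ARIEX itself is a more elaborate, principled method, and none of these names come from the paper.

        from statistics import mean

        def f1(precision, recall):
            total = precision + recall
            return 0.0 if total == 0 else 2 * precision * recall / total

        def rank_extractors(results):
            # results: {extractor_name: [(precision, recall), ...]} measured
            # over a suite of datasets; extractors are ordered by mean F1.
            scored = {name: mean(f1(p, r) for p, r in runs)
                      for name, runs in results.items()}
            return sorted(scored.items(), key=lambda item: item[1], reverse=True)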

    A Transducer Model for Web Information Extraction

    In recent years, many authors have paid attention to web information extractors. They usually build on an algorithm that interprets extraction rules inferred from examples. Several rule learning techniques are based on transducers, but none of them proposes a generic transducer model for web information extraction. In this paper, we propose a new transducer model that is specifically tailored to web information extraction. The model has proven quite flexible, since we have adapted three techniques in the literature to infer state transitions, and the results prove that it can achieve high precision and recall rates.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
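    A transducer for this task consumes a token stream and emits extracted tokens on selected transitions. The sketch below is a generic illustration under assumed names (run_transducer, classify), not the model proposed in the paper; inferring the transition table from examples is precisely what the three adapted techniques do.

        def classify(token):
            # Hypothetical token classifier mapping raw tokens onto the
            # transducer's input alphabet: markup versus text.
            return "tag" if token.startswith("<") else "text"

        def run_transducer(transitions, start, accept, tokens):
            # transitions: {(state, token_class): (next_state, slot)} where
            # slot is None when the token is template noise to be skipped.
            state, record = start, {}
            for token in tokens:
                key = (state, classify(token))
                if key not in transitions:
                    return None                      # document rejected
                state, slot = transitions[key]
                if slot is not None:
                    record.setdefault(slot, []).append(token)
            return record if state in accept else None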

    A Conceptual Framework for Efficient Web Crawling in Virtual Integration Contexts

    Virtual Integration systems require a crawling tool able to navigate and reach relevant pages on the Web in an efficient way. Existing proposals in the crawling area are aware of the efficiency problem, but most of them still need to download pages in order to classify them as relevant or not. In this paper, we present a conceptual framework for designing crawlers supported by a web page classifier that relies solely on URLs to determine page relevance. Such a crawler is able to choose at each step only the URLs that lead to relevant pages, and therefore reduces the number of unnecessary pages downloaded, optimising bandwidth and making the crawler efficient and suitable for virtual integration systems. Our preliminary experiments show that such a classifier is able to distinguish between links leading to different kinds of pages, without previous intervention from the user.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
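    The following sketch conveys the idea of URL-only classification under assumed names; note that it is supervised for simplicity, whereas the proposal works without prior intervention from the user.

        import re
        from collections import Counter

        def url_tokens(url):
            # Split a URL into lower-case tokens; runs of digits collapse into
            # a NUM token so record identifiers do not fragment the vocabulary.
            return ["NUM" if t.isdigit() else t.lower()
                    for t in re.split(r"[/.?=&:_-]+", url) if t]

        class UrlClassifier:
            # Naive frequency-based sketch: learn which URL tokens co-occur
            # with relevant pages, then score unseen URLs before downloading.
            def __init__(self):
                self.relevant, self.irrelevant = Counter(), Counter()

            def train(self, url, is_relevant):
                counter = self.relevant if is_relevant else self.irrelevant
                counter.update(url_tokens(url))

            def score(self, url):
                tokens = url_tokens(url)
                pos = sum(self.relevant[t] for t in tokens) + 1    # add-one smoothing
                neg = sum(self.irrelevant[t] for t in tokens) + 1
                return pos / (pos + neg)                           # > 0.5 -> follow link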

    A Tool for Web Links Prototyping

    Crawlers for Virtual Integration processes must be efficient, given that a VI process runs online: while the system is looking for the required information, the user is waiting for a response. Therefore, downloading a minimum number of irrelevant pages is mandatory in order to improve the crawler's efficiency. Most crawlers need to download a page in order to determine its relevance, which results in a high number of irrelevant pages being downloaded. We propose a tool that builds a set of prototype links for a given site, where each prototype represents links leading to pages containing a certain concept. These prototypes can then be used to classify pages before downloading them, just by analysing their URL. Therefore, they support crawlers in navigating through sites while downloading a minimum number of irrelevant pages and reducing bandwidth, making them suitable for VI systems.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
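    A prototype of this kind can be pictured as a generalised URL: path segments shared by all example links stay literal and the rest become wildcards. The sketch below is our illustration, not the tool's actual algorithm, and handles equal-length URLs only.

        def build_prototype(urls):
            # Generalise example URLs that lead to the same concept: segments
            # shared by every example stay literal, the rest become wildcards.
            split = [u.rstrip("/").split("/") for u in urls]
            if len({len(s) for s in split}) != 1:
                return None
            return "/".join(column[0] if len(set(column)) == 1 else "*"
                            for column in zip(*split))

        def matches(prototype, url):
            # A URL matches when every literal segment agrees.
            parts, segments = prototype.split("/"), url.rstrip("/").split("/")
            return len(parts) == len(segments) and all(
                p in ("*", s) for p, s in zip(parts, segments))

    For instance, two hypothetical example links such as http://shop.example/book/123 and http://shop.example/book/456 would yield the prototype http://shop.example/book/*, which then classifies unseen links to that concept without downloading them.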

    An Assisted Workflow for the Early Design of Nearly Zero Emission Healthcare Buildings

    Energy efficiency in buildings is one of the main goals of many governmental policies due to their high impact on carbon dioxide emissions in Europe. One of these targets is to reduce the energy consumption of healthcare buildings, which are known to be among the most energy-demanding building types. Although design decisions made at early design phases have a significant impact on the energy performance of the realized buildings, only a small portion of the possible early designs is analyzed, which does not ensure an optimal building design. We propose an automated early design support workflow, accompanied by a set of tools, for achieving nearly zero emission healthcare buildings. It is intended to be used by decision makers during the early design phase. It starts with the user-defined brief and the design rules, which are the input for the Early Design Configurator (EDC). The EDC generates multiple design alternatives following an evolutionary algorithm while trying to satisfy user requirements and geometric constraints. The generated alternatives are then validated by means of an Early Design Validator (EDV), and early energy and cost assessments are made using two early assessment tools. A user-friendly dashboard is used to guide the user and to illustrate the workflow results, and the alternative chosen at the end of the workflow is considered the starting point for the next design phases. Our proposal has been implemented using Building Information Models (BIM) and validated by means of a case study on a healthcare building and several real demonstrations from different countries in the context of the European project STREAMER.
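    The evolutionary generation of alternatives can be sketched generically as follows; fitness, mutate, and crossover are hypothetical callables standing in for the EDC's requirement checks and design operators, and this is not the EDC's actual algorithm.

        import random

        def evolve_designs(seeds, fitness, mutate, crossover,
                           population=50, generations=100):
            # Generic evolutionary loop: alternatives that better satisfy the
            # brief and the geometric constraints survive and recombine.
            pool = list(seeds)
            while len(pool) < population:
                pool.append(mutate(random.choice(seeds)))
            for _ in range(generations):
                pool.sort(key=fitness, reverse=True)
                parents = pool[:population // 2]       # keep the fitter half
                children = [mutate(crossover(random.choice(parents),
                                             random.choice(parents)))
                            for _ in range(population - len(parents))]
                pool = parents + children
            return max(pool, key=fitness)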