28 research outputs found

    An Unsupervised Technique to Extract Information from Semi-structured Web Pages

    We propose a technique that takes two or more web pages generated by the same server-side template and tries to learn a regular expression that represents the template and helps extract relevant information from similar pages. Our experimental results on real-world web sites demonstrate that our technique outperforms others in terms of both effectiveness and efficiency and is not affected by HTML errors.
    Funding: Ministerio de Ciencia y Tecnología TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-
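    The core idea lends itself to a compact illustration. The sketch below is ours, not the authors' algorithm: it aligns the token streams of two template-generated pages with Python's difflib, keeps shared runs as literals, and generalises the divergent spans into capture groups.

        import re
        from difflib import SequenceMatcher

        def induce_template_regex(page_a, page_b):
            # Align the token streams of two pages generated by the same
            # template: shared runs become literals, divergent spans become
            # capture groups that will hold the relevant information.
            a, b = page_a.split(), page_b.split()
            matcher = SequenceMatcher(None, a, b, autojunk=False)
            parts, prev = [], 0
            for i, _, size in matcher.get_matching_blocks():
                if i > prev:                     # divergent span in page_a
                    parts.append(r"(.*?)")
                if size:
                    parts.append(r"\s+".join(re.escape(t) for t in a[i:i + size]))
                prev = i + size
            return re.compile(r"\s+".join(parts), re.DOTALL)

    Applying the compiled expression to a third page generated by the same template then yields the variable fields as groups.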

    A Survey on Region Extractors from Web Documents

    Extracting information from web documents has become a research area in which new proposals sprout year after year. This has motivated several researchers to work on surveys that attempt to provide an overall picture of the many existing proposals. Unfortunately, none of these surveys provides a complete picture, because they do not take region extractors into account. These tools are a kind of preprocessor: they help information extractors focus on the regions of a web document that contain relevant information. With the increasing complexity of web documents, region extractors are becoming essential to extract information from many websites. Beyond information extraction, region extractors have also found their way into information retrieval, focused web crawling, topic distillation, adaptive content delivery, mashups, and metasearch engines. In this paper, we survey the existing proposals regarding region extractors and compare them side by side.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
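    To make the preprocessor role concrete, here is a minimal sketch of how a region extractor slots in front of an information extractor; region_extractor and information_extractor are hypothetical callables, not components described in the survey.

        def extract(document, region_extractor, information_extractor):
            # The region extractor acts as a preprocessor: it discards
            # navigation bars, advertisements, and other noise, so the
            # information extractor only sees candidate data regions.
            records = []
            for region in region_extractor(document):
                records.extend(information_extractor(region))
            return records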

    Information Extraction Framework

    The literature provides many techniques to infer rules that can be used to configure web information extractors. Unfortunately, these techniques have been developed independently, which makes it very difficult to compare their results: there is not even a collection of datasets on which they can be assessed. Furthermore, there is no common infrastructure on which to implement these techniques, which makes implementing them costly. In this paper, we propose a framework that helps software engineers implement their techniques and compare the results. Having such a framework allows techniques to be compared side by side, and our experiments prove that it helps reduce development costs.
    Funding: Ministerio de Ciencia e Innovación TIN2010-21744-C02-01; Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-09988-
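    A framework of this kind typically rests on a shared plug-in interface plus a common evaluation harness. The sketch below illustrates that general pattern under our own assumed names (ExtractionTechnique, benchmark); it is not the framework's actual API.

        from abc import ABC, abstractmethod

        class ExtractionTechnique(ABC):
            # Hypothetical plug-in interface: every rule-learning technique
            # implements the same two methods, so results become comparable.
            @abstractmethod
            def learn(self, annotated_pages): ...

            @abstractmethod
            def extract(self, page): ...

        def benchmark(techniques, datasets):
            # Shared harness: run every technique on every dataset and score
            # it with one common metric (exact-match accuracy here).
            scores = {}
            for tech in techniques:
                for name, (train, test) in datasets.items():
                    tech.learn(train)
                    hits = sum(tech.extract(page) == gold for page, gold in test)
                    scores[(type(tech).__name__, name)] = hits / len(test)
            return scores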

    A Reference Architecture to Devise Web Information Extractors

    The Web is the largest repository of human-friendly information. Unfortunately, web information is embedded in formatting tags and is surrounded by irrelevant information. Researchers are working on information extractors that allow transforming this information into structured data for its later integration into automated processes. Devising a new information extraction technique requires an array of tasks that are specific to that technique and many tasks that are actually common to all techniques. The literature lacks a reference architecture to guide software engineers in the design and implementation of information extractors, which results in little reuse and a focus that is often blurred by irrelevant details. In this paper, we present a reference architecture to design and implement rule learners for information extractors. We have implemented a software framework to support our architecture, and we have validated it by means of four case studies and a number of experiments that prove that our proposal helps reduce development costs significantly.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
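    A reference architecture for rule learners maps naturally onto the template-method pattern: the tasks common to all techniques are fixed in a base class and only the technique-specific inference step varies. The following sketch uses hypothetical names (RuleLearner, infer_rules) and is not the architecture proposed in the paper.

        from abc import ABC, abstractmethod

        class RuleLearner(ABC):
            # Template method: the pipeline that is common to all techniques
            # is fixed here; only the inference step varies per technique.
            def devise(self, pages, annotations):
                token_streams = [self.tokenise(p) for p in pages]  # shared task
                return self.infer_rules(token_streams, annotations)

            def tokenise(self, page):
                # Shared, reusable preprocessing; naive whitespace split here.
                return page.split()

            @abstractmethod
            def infer_rules(self, token_streams, annotations):
                """The only technique-specific step a new proposal must write."""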

    Trinity: On Using Trinary Trees for Unsupervised Web Data Extraction

    Web data extractors are used to extract data from web documents in order to feed automated processes. In this article, we propose a technique that works on two or more web documents generated by the same server-side template and learns a regular expression that models the template and can later be used to extract data from similar documents. The technique builds on the hypothesis that the template introduces some shared patterns that do not provide any relevant data and can thus be ignored. We have evaluated and compared our technique to others in the literature on a large collection of web documents; our results demonstrate that our proposal performs better than the others and that input errors do not have a negative impact on its effectiveness; furthermore, its efficiency can easily be boosted by means of a couple of parameters, without sacrificing its effectiveness.
    Funding: Ministerio de Ciencia y Tecnología TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-
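    The hypothesis can be pictured as follows: the longest pattern shared by all documents is assumed to come from the template, so each document is split into a prefix, the separator, and a suffix, and the process recurses, yielding a trinary tree. The sketch below follows that idea with a deliberately naive quadratic search; it is not Trinity's actual search strategy.

        def shared_pattern(texts, min_len=2):
            # Longest substring of the first text that occurs in every text.
            # A quadratic sketch; Trinity searches far more efficiently.
            base = texts[0]
            for size in range(len(base), min_len - 1, -1):
                for start in range(len(base) - size + 1):
                    candidate = base[start:start + size]
                    if all(candidate in t for t in texts):
                        return candidate
            return None

        def build_trinary_tree(texts):
            # Split every text around the shared pattern into prefix,
            # separator, and suffix, then recurse; leaves hold variable data.
            separator = shared_pattern(texts)
            if separator is None:
                return {"variable": texts}
            heads_tails = [t.split(separator, 1) for t in texts]
            return {"separator": separator,
                    "prefix": build_trinary_tree([h for h, _ in heads_tails]),
                    "suffix": build_trinary_tree([t for _, t in heads_tails])}

    A regular expression can then be read off the tree: separators become literals and variable leaves become capture groups.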

    ARIEX: Automated ranking of information extractors

    Information extractors are used to transform the user-friendly information in a web document into structured information that can be used to feed a knowledge-based system. Researchers are interested in ranking them to find out which one performs best. Unfortunately, many rankings in the literature are deficient. There are a number of formal methods to rank information extractors, but they also have many problems and have not reached widespread popularity. In this article, we present ARIEX, an automated method to rank web information extraction proposals. It does not have any of the problems that we have identified in the literature. Our proposal should help authors make sure that they have advanced the state of the art not only conceptually, but also from an empirical point of view; it should also help practitioners make informed decisions on which proposal is the most adequate for a particular problem.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-
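    For intuition only, the following sketch shows the simplest conceivable ranking, ordering extractors by mean F1 over a suite of datasets; ARIEX itself is a more elaborate, principled method, and none of these names come from the paper.

        from statistics import mean

        def f1(precision, recall):
            total = precision + recall
            return 0.0 if total == 0 else 2 * precision * recall / total

        def rank_extractors(results):
            # results: {extractor_name: [(precision, recall), ...]} measured
            # over a suite of datasets; extractors are ordered by mean F1.
            scored = {name: mean(f1(p, r) for p, r in runs)
                      for name, runs in results.items()}
            return sorted(scored.items(), key=lambda item: item[1], reverse=True)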

    A Transducer Model for Web Information Extraction

    In recent years, many authors have paid attention to web information extractors. They usually build on an algorithm that interprets extraction rules inferred from examples. Several rule learning techniques are based on transducers, but none of them proposes a generic transducer model for web information extraction. In this paper, we propose a new transducer model that is specifically tailored to web information extraction. The model has proven quite flexible, since we have adapted three techniques in the literature to infer state transitions, and the results prove that it can achieve high precision and recall rates.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
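    A transducer for this task consumes a token stream and emits extracted tokens on selected transitions. The sketch below is a generic illustration under assumed names (run_transducer, classify), not the model proposed in the paper; inferring the transition table from examples is precisely what the three adapted techniques do.

        def classify(token):
            # Hypothetical token classifier mapping raw tokens onto the
            # transducer's input alphabet: markup versus text.
            return "tag" if token.startswith("<") else "text"

        def run_transducer(transitions, start, accept, tokens):
            # transitions: {(state, token_class): (next_state, slot)} where
            # slot is None when the token is template noise to be skipped.
            state, record = start, {}
            for token in tokens:
                key = (state, classify(token))
                if key not in transitions:
                    return None                      # document rejected
                state, slot = transitions[key]
                if slot is not None:
                    record.setdefault(slot, []).append(token)
            return record if state in accept else None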

    A Conceptual Framework for Efficient Web Crawling in Virtual Integration Contexts

    Virtual Integration systems require a crawling tool able to navigate and reach relevant pages on the Web in an efficient way. Existing proposals in the crawling area are aware of the efficiency problem, but most of them still need to download pages in order to classify them as relevant or not. In this paper, we present a conceptual framework for designing crawlers supported by a web page classifier that relies solely on URLs to determine page relevance. Such a crawler is able to choose at each step only the URLs that lead to relevant pages, and therefore reduces the number of unnecessary pages downloaded, optimising bandwidth and making the crawler efficient and suitable for virtual integration systems. Our preliminary experiments show that such a classifier is able to distinguish between links leading to different kinds of pages, without previous intervention from the user.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
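    The following sketch conveys the idea of URL-only classification under assumed names; note that it is supervised for simplicity, whereas the proposal works without prior intervention from the user.

        import re
        from collections import Counter

        def url_tokens(url):
            # Split a URL into lower-case tokens; runs of digits collapse into
            # a NUM token so record identifiers do not fragment the vocabulary.
            return ["NUM" if t.isdigit() else t.lower()
                    for t in re.split(r"[/.?=&:_-]+", url) if t]

        class UrlClassifier:
            # Naive frequency-based sketch: learn which URL tokens co-occur
            # with relevant pages, then score unseen URLs before downloading.
            def __init__(self):
                self.relevant, self.irrelevant = Counter(), Counter()

            def train(self, url, is_relevant):
                counter = self.relevant if is_relevant else self.irrelevant
                counter.update(url_tokens(url))

            def score(self, url):
                tokens = url_tokens(url)
                pos = sum(self.relevant[t] for t in tokens) + 1    # add-one smoothing
                neg = sum(self.irrelevant[t] for t in tokens) + 1
                return pos / (pos + neg)                           # > 0.5 -> follow link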

    A Tool for Web Links Prototyping

    Crawlers for Virtual Integration processes must be efficient, given that a VI process runs online: while the system is looking for the required information, the user is waiting for a response. Therefore, downloading a minimum number of irrelevant pages is mandatory in order to improve the crawler's efficiency. Most crawlers need to download a page in order to determine its relevance, which results in a high number of irrelevant pages being downloaded. We propose a tool that builds a set of prototype links for a given site, where each prototype represents links leading to pages containing a certain concept. These prototypes can then be used to classify pages before downloading them, just by analysing their URL. Therefore, they support crawlers in navigating through sites while downloading a minimum number of irrelevant pages and reducing bandwidth, making them suitable for VI systems.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
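    A prototype of this kind can be pictured as a generalised URL: path segments shared by all example links stay literal and the rest become wildcards. The sketch below is our illustration, not the tool's actual algorithm, and handles equal-length URLs only.

        def build_prototype(urls):
            # Generalise example URLs that lead to the same concept: segments
            # shared by every example stay literal, the rest become wildcards.
            split = [u.rstrip("/").split("/") for u in urls]
            if len({len(s) for s in split}) != 1:
                return None
            return "/".join(column[0] if len(set(column)) == 1 else "*"
                            for column in zip(*split))

        def matches(prototype, url):
            # A URL matches when every literal segment agrees.
            parts, segments = prototype.split("/"), url.rstrip("/").split("/")
            return len(parts) == len(segments) and all(
                p in ("*", s) for p, s in zip(parts, segments))

    For instance, two hypothetical example links such as http://shop.example/book/123 and http://shop.example/book/456 would yield the prototype http://shop.example/book/*, which then classifies unseen links to that concept without downloading them.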

    An Assisted Workflow for the Early Design of Nearly Zero Emission Healthcare Buildings

    Energy efficiency in buildings is one of the main goals of many governmental policies due to their high impact on carbon dioxide emissions in Europe. One of these targets is to reduce the energy consumption of healthcare buildings, which are known to be among the most energy-demanding building types. Although design decisions made at early design phases have a significant impact on the energy performance of the realized buildings, only a small portion of the possible early designs is analyzed, which does not ensure an optimal building design. We propose an automated early design support workflow, accompanied by a set of tools, for achieving nearly zero emission healthcare buildings. It is intended to be used by decision makers during the early design phase. It starts with the user-defined brief and the design rules, which are the input for the Early Design Configurator (EDC). The EDC generates multiple design alternatives following an evolutionary algorithm while trying to satisfy user requirements and geometric constraints. The generated alternatives are then validated by means of an Early Design Validator (EDV), and early energy and cost assessments are made using two early assessment tools. A user-friendly dashboard is used to guide the user and to illustrate the workflow results, and the alternative chosen at the end of the workflow is considered the starting point for the next design phases. Our proposal has been implemented using Building Information Models (BIM) and validated by means of a case study on a healthcare building and several real demonstrations from different countries in the context of the European project STREAMER.
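    The evolutionary generation of alternatives can be sketched generically as follows; fitness, mutate, and crossover are hypothetical callables standing in for the EDC's requirement checks and design operators, and this is not the EDC's actual algorithm.

        import random

        def evolve_designs(seeds, fitness, mutate, crossover,
                           population=50, generations=100):
            # Generic evolutionary loop: alternatives that better satisfy the
            # brief and the geometric constraints survive and recombine.
            pool = list(seeds)
            while len(pool) < population:
                pool.append(mutate(random.choice(seeds)))
            for _ in range(generations):
                pool.sort(key=fitness, reverse=True)
                parents = pool[:population // 2]       # keep the fitter half
                children = [mutate(crossover(random.choice(parents),
                                             random.choice(parents)))
                            for _ in range(population - len(parents))]
                pool = parents + children
            return max(pool, key=fitness)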