13 research outputs found

    A Workflow-Based Approach for Creating Complex Web Wrappers

    Get PDF
    This version of the article has been accepted for publication, after peer review and is subject to Springer Nature’s AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: https://doi.org/10.1007/978-3-540-85481-4_30.[Abstract]: In order to let software programs access and use the information and services provided by web sources, wrapper programs must be built to provide a “machine-readable” view over them. Although research literature on web wrappers is vast, the problem of how to specify the internal logic of complex wrappers in a graphical and simple way remains mainly ignored. In this paper, we propose a new language for addressing this task. Our approach leverages on the existing work on intelligent web data extraction and automatic web navigation as building blocks, and uses a workflow-based approach to specify the wrapper control logic. The features included in the language have been decided from the results of a study of a wide range of real web automation applications from different business areas. In this paper, we also present the most salient results of the study.This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730. Alberto Pan’s work was partially supported by the “Ramón y Cajal” programme of the Spanish Ministry of Education and Scienc

    Interactive Tuples Extraction from Semi-Structured Data

    Get PDF
    International audienceThis paper studies from a machine learning viewpoint the problem of extracting tuples of a target n-ary relation from tree structured data like XML or XHTML documents. Our system can extract, without any post-processing, tuples for all data structures including nested, rotated and cross tables. The wrapper induction algorithm we propose is based on two main ideas. It is incremental: partial tuples are extracted by increasing length. It is based on a representation-enrichment procedure: partial tuples of length i are encoded with the knowledge of extracted tu- ples of length i − 1. The algorithm is then set in a friendly interactive wrapper induction system for Web documents. We evaluate our system on several information extraction tasks over corporate Web sites. It achieves state-of-the-art results on simple data structures and succeeds on complex data structures where previous approaches fail. Experiments also show that our interactive framework significantly reduces the number of user interactions needed to build a wrapper

    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

    Automatically Extract Information from Web Documents

    Get PDF
    The Internet could be considered to be a reservoir of useful information in textual form — product catalogs, airline schedules, stock market quotations, weather forecast etc. There has been much interest in building systems that gather such information on a user\u27s behalf. But because these information resources are formatted differently, mechanically extracting their content is difficult. Systems using such resources typically use hand-coded wrappers, customized procedures for information extraction. Structured data objects are a very important type of information on the Web. Such data objects are often records from underlying databases and displayed in Web pages with some fixed templates. Mining data records in Web pages is useful because they typically present their host pages\u27 essential information, such as lists of products and services. Extracting these structured data objects enables one to integrate data/information from multiple Web pages to provide value-added services, e.g., comparative shopping, meta-querying and search. Web content mining has thus become an area of interest for many researchers because of the phenomenal growth of the Web contents and the economic benefits associated with it. However, due to the heterogeneity of Web pages, automated discovery of targeted information is still posing as a challenging problem

    Towards Comparative Web Content Mining using Object Oriented Model

    Get PDF
    Web content data are heterogeneous in nature; usually composed of different types of contents and data structure. Thus, extraction and mining of web content data is a challenging branch of data mining. Traditional web content extraction and mining techniques are classified into three categories: programming language based wrappers, wrapper (data extraction program) induction techniques, and automatic wrapper generation techniques. First category constructs data extraction system by providing some specialized pattern specification languages, second category is a supervised learning, which learns data extraction rules and third category is automatic extraction process. All these data extraction techniques rely on web document presentation structures, which need complicated matching and tree alignment algorithms, routine maintenance, hard to unify for vast variety of websites and fail to catch heterogeneous data together. To catch more diversity of web documents, a feasible implementation of an automatic data extraction technique based on object oriented data model technique, 00Web, had been proposed in Annoni and Ezeife (2009). This thesis implements, materializes and extends the structured automatic data extraction technique. We developed a system (called WebOMiner) for extraction and mining of structured web contents based on object-oriented data model. Thesis extends the extraction algorithms proposed by Annoni and Ezeife (2009) and develops an automata based automatic wrapper generation algorithm for extraction and mining of structured web content data. Our algorithm identifies data blocks from flat array data structure and generates Non-Deterministic Finite Automata (NFA) pattern for different types of content data for extraction. Objective of this thesis is to extract and mine heterogeneous web content and relieve the hard effort of matching, tree alignment and routine maintenance. Experimental results show that our system is highly effective and it performs the mining task with 100% precision and 96.22% recall value

    Spatially Aware Computing for Natural Interaction

    Get PDF
    Spatial information refers to the location of an object in a physical or digital world. Besides, it also includes the relative position of an object related to other objects around it. In this dissertation, three systems are designed and developed. All of them apply spatial information in different fields. The ultimate goal is to increase the user friendliness and efficiency in those applications by utilizing spatial information. The first system is a novel Web page data extraction application, which takes advantage of 2D spatial information to discover structured records from a Web page. The extracted information is useful to re-organize the layout of a Web page to fit mobile browsing. The second application utilizes the 3D spatial information of a mobile device within a large paper-based workspace to implement interactive paper that combines the merits of paper documents and mobile devices. This application can overlay digital information on top of a paper document based on the location of a mobile device within a workspace. The third application further integrates 3D space information with sound detection to realize an automatic camera management system. This application automatically controls multiple cameras in a conference room, and creates an engaging video by intelligently switching camera shots among meeting participants based on their activities. Evaluations have been made on all three applications, and the results are promising. In summary, this dissertation comprehensively explores the usage of spatial information in various applications to improve the usability
    corecore