    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains; others heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims to provide a structured and comprehensive overview of the literature on Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes: applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for data analysis in Business and Competitive Intelligence systems, as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed for one domain in other domains.
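    As a minimal illustration of the wrapper-style extraction techniques such surveys cover, the sketch below pulls structured records out of semi-structured HTML using only Python's standard library. The markup, class names and fields are invented for the example and are not taken from the survey.

```python
# A hand-written "wrapper": extract (name, price) records from HTML whose
# layout is known in advance. Markup and field names are illustrative.
from html.parser import HTMLParser

html = """
<ul>
  <li class="record"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="record"><span class="name">Gadget</span><span class="price">4.50</span></li>
</ul>
"""

class RecordWrapper(HTMLParser):
    FIELDS = {"name", "price"}  # field markers the wrapper looks for

    def __init__(self):
        super().__init__()
        self.records = []        # completed records
        self._current = None     # record currently being built
        self._field = None       # field whose text content we are awaiting

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls == "record":
            self._current = {}
        elif self._current is not None and cls in self.FIELDS:
            self._field = cls

    def handle_data(self, data):
        if self._field is not None:
            self._current[self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self.records.append(self._current)
            self._current = None

wrapper = RecordWrapper()
wrapper.feed(html)
print(wrapper.records)  # [{'name': 'Widget', 'price': '9.99'}, ...]
```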

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities for extracting the semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
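    To make the "exploiting available standards" point concrete, here is a hedged sketch of harvesting schema.org microdata (itemscope/itemprop) from a blog post with Python's standard library. The markup and property names below are invented for the example, not taken from the report.

```python
# Harvest microdata items embedded in blog HTML. Illustrative markup only.
from html.parser import HTMLParser

html = """
<article itemscope itemtype="http://schema.org/BlogPosting">
  <h1 itemprop="headline">Preserving Blogs</h1>
  <span itemprop="author">A. Blogger</span>
  <time itemprop="datePublished" datetime="2013-02-01">1 Feb 2013</time>
</article>
"""

class MicrodataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []      # one dict per itemscope element
        self._prop = None    # itemprop whose text content we are awaiting

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "itemscope" in a:
            self.items.append({"type": a.get("itemtype", ""), "props": {}})
        elif "itemprop" in a and self.items:
            if "datetime" in a:  # <time> carries its value as an attribute
                self.items[-1]["props"][a["itemprop"]] = a["datetime"]
            else:
                self._prop = a["itemprop"]

    def handle_data(self, data):
        if self._prop and self.items:
            self.items[-1]["props"][self._prop] = data.strip()
            self._prop = None

parser = MicrodataParser()
parser.feed(html)
print(parser.items)
# [{'type': 'http://schema.org/BlogPosting',
#   'props': {'headline': 'Preserving Blogs', 'author': 'A. Blogger',
#             'datePublished': '2013-02-01'}}]
```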

    A Mining Algorithm for Extracting Decision Process Data Models

    The paper introduces an algorithm that mines logs of user interaction with simulation software and outputs a model that explicitly shows the data perspective of the decision process: the Decision Data Model (DDM). In the first part of the paper we focus on how the DDM is extracted by our mining algorithm. We introduce it as pseudo-code and then provide explanations and examples of how it actually works. In the second part of the paper, we use a series of small case studies to demonstrate the robustness of the mining algorithm and how it deals with the most common patterns we found in real logs.

    Keywords: Decision Process Data Model, Decision Process Mining, Decision Mining Algorithm
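    As a rough sketch of the underlying idea (not the paper's actual pseudo-code), the fragment below replays an invented interaction log and records, for each derived data element, which elements it was computed from, yielding a DDM as a dependency graph. The log format and element names are illustrative assumptions; the real algorithm handles many more patterns.

```python
# Each log entry: (kind, element, source elements). Invented example data.
log = [
    ("input",  "price", []),
    ("input",  "quantity", []),
    ("derive", "subtotal", ["price", "quantity"]),
    ("input",  "tax_rate", []),
    ("derive", "total", ["subtotal", "tax_rate"]),
]

def mine_ddm(log):
    """Return {element: set of direct inputs} -- a DAG over data elements."""
    ddm = {}
    for kind, element, sources in log:
        # A later derivation of the same element overwrites an earlier one,
        # keeping only the user's final way of computing it.
        ddm[element] = set(sources) if kind == "derive" else set()
    return ddm

for element, inputs in mine_ddm(log).items():
    print(f"{element} <- {sorted(inputs)}")
# price <- [] ... total <- ['subtotal', 'tax_rate']
```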

    Construction of a taxonomy for requirements engineering commercial-off-the-shelf components

    This article presents a procedure for constructing a taxonomy of COTS products in the field of Requirements Engineering (RE). The taxonomy and the information obtained bring substantial benefits to the selection of systems and tools, helping RE-related actors to simplify and facilitate their work. The taxonomy is built by means of a goal-oriented methodology inspired by GBRAM (Goal-Based Requirements Analysis Method), called GBTCM (Goal-Based Taxonomy Construction Method), which provides a guide to analyzing sources of information, modeling requirements and domains, and gathering and organizing knowledge in any segment of the COTS market. GBTCM aims to promote the use of standards and the reuse of requirements in order to support different processes of selection and integration of components.

    The Locus Algorithm II: A robust software system to maximise the quality of fields of view for Differential Photometry

    We present the software system developed to implement the Locus Algorithm, a novel algorithm designed to maximise the performance of differential photometry systems by optimising the number and quality of reference stars in the Field of View together with the target. First, we state the design requirements, constraints and ambitions for the software system required to implement this algorithm. Then, a detailed software design is presented for the system in operation. Next, the data design, including the file structures used and the data environment required for the system, is defined. Finally, we conclude by illustrating the scaling requirements which mandate a high-performance computing implementation of this system, which is discussed in the other papers in this series.
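    The paper itself concerns the software design rather than the algorithm's internals, but the scoring idea can be sketched as follows. The field size, fall-off weights and star catalogue here are invented illustrative values, not the Locus Algorithm's actual parameters.

```python
# Score a candidate pointing by summing quality ratings of reference stars
# that fall inside the Field of View. All numbers below are illustrative.
FOV = 0.25  # assumed field half-width in degrees

target = {"ra": 180.0, "dec": 30.0, "mag": 12.0, "col": 0.65}
stars = [
    {"ra": 180.1, "dec": 30.05, "mag": 12.3, "col": 0.60},
    {"ra": 180.2, "dec": 29.9,  "mag": 14.8, "col": 1.40},
    {"ra": 179.9, "dec": 30.1,  "mag": 11.9, "col": 0.70},
]

def rating(star, target):
    """Quality of one reference star: 1.0 for a perfect match, falling off
    linearly with magnitude and colour difference (clipped at zero)."""
    dm = abs(star["mag"] - target["mag"])
    dc = abs(star["col"] - target["col"])
    return max(0.0, 1 - dm / 2.0) * max(0.0, 1 - dc / 0.5)

def score(pointing_ra, pointing_dec):
    """Sum of ratings of stars inside the FoV centred on the pointing."""
    return sum(
        rating(s, target)
        for s in stars
        if abs(s["ra"] - pointing_ra) < FOV and abs(s["dec"] - pointing_dec) < FOV
    )

# A full implementation would search candidate pointings that keep the
# target in the field and pick the one with the highest score.
print(score(target["ra"], target["dec"]))
```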

    Using visualization for visualization : an ecological interface design approach to inputting data

    Visualization is experiencing growing use by a diverse community, with continuing improvements in the availability and usability of systems. In spite of these developments, the problem of how to get the data in in the first place has received scant attention: the established approach of pre-defined readers and programming aids has changed little in the last two decades. This paper proposes a novel way of inputting data for scientific visualization that employs rapid interaction and visual feedback in order to understand how the data is stored. The approach draws on ideas from the discipline of ecological interface design to extract and control important parameters describing the data, at the same time harnessing our innate human ability to recognize patterns. Crucially, the emphasis is on file format discovery rather than file format description, so the method can still work when nothing is known initially of how the file was originally written, as is often the case with legacy binary data.
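    A tiny non-interactive sketch of the discovery idea: reinterpret unknown binary data under candidate parameters (header length, element type, byte order) and summarize each guess so a human can spot the plausible one. The synthetic file and parameter grid are invented for the example; the paper's actual system does this visually and interactively.

```python
# Try several (header, format) guesses against an unknown binary blob and
# print summary statistics for each. Synthetic data for illustration only.
import struct

# A fake "legacy" file: a 16-byte header followed by little-endian floats.
payload = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
blob = b"HDR" + bytes(13) + struct.pack("<6f", *payload)

def reinterpret(data, header, fmt):
    """Decode everything after `header` bytes as elements of type `fmt`."""
    size = struct.calcsize(fmt)
    n = (len(data) - header) // size
    return struct.unpack_from(f"{fmt[0]}{n}{fmt[1]}", data, header)

for header in (0, 16):
    for fmt in ("<f", ">f", "<i"):
        vals = reinterpret(blob, header, fmt)
        print(f"header={header:2d} fmt={fmt}: "
              f"min={min(vals):.3g} max={max(vals):.3g}")
# Only header=16 with fmt="<f" yields a sensible range (0 to 2.5); the
# other guesses produce denormals or huge integers -- the kind of pattern
# a human can recognize at a glance with visual feedback.
```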