5,990 research outputs found

    Web Data Extraction, Applications and Techniques: A Survey

    Full text link
    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provided a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities to analyze human behavior at a very large scale. We discuss also the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.Comment: Knowledge-based System

    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

    The Grid[Way] Job Template Manager, a tool for parameter sweeping

    Full text link
    Parameter sweeping is a widely used algorithmic technique in computational science. It is specially suited for high-throughput computing since the jobs evaluating the parameter space are loosely coupled or independent. A tool that integrates the modeling of a parameter study with the control of jobs in a distributed architecture is presented. The main task is to facilitate the creation and deletion of job templates, which are the elements describing the jobs to be run. Extra functionality relies upon the GridWay Metascheduler, acting as the middleware layer for job submission and control. It supports interesting features like multi-dimensional sweeping space, wildcarding of parameters, functional evaluation of ranges, value-skipping and job template automatic indexation. The use of this tool increases the reliability of the parameter sweep study thanks to the systematic bookkeping of job templates and respective job statuses. Furthermore, it simplifies the porting of the target application to the grid reducing the required amount of time and effort.Comment: 26 pages, 1 figure

    Natural language processing

    Get PDF
    Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems - text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of www and digital libraries ; and (iv) evaluation of NLP systems

    Preface of the Proceedings of WRAP 2004

    Get PDF
    corecore