
    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains; others heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for data analysis in Business and Competitive Intelligence systems, as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, offering unprecedented opportunities to analyze human behavior at very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed for one domain in other domains. (Comment: Knowledge-Based Systems)
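    To make the surveyed task concrete, here is a minimal sketch of wrapper-style Web Data Extraction, assuming a hypothetical product-listing page; the URL and CSS selectors are invented for illustration and not taken from the survey.

```python
# Minimal wrapper-style extraction sketch (hypothetical page layout).
import requests
from bs4 import BeautifulSoup

def extract_products(url):
    """Extract (name, price) records from a product-listing page.

    The selectors below encode the assumed page structure; a real wrapper
    would be induced or hand-crafted for each target site.
    """
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.select("div.product"):        # assumed container class
        name = item.select_one("h2.title")
        price = item.select_one("span.price")
        if name and price:
            records.append({"name": name.get_text(strip=True),
                            "price": price.get_text(strip=True)})
    return records
```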

    Personalization by Partial Evaluation

    The central contribution of this paper is to model personalization by the programmatic notion of partial evaluation. Partial evaluation is a technique used to automatically specialize programs, given incomplete information about their input. The methodology presented here models a collection of information resources as a program (which abstracts the underlying schema of organization and flow of information), partially evaluates the program with respect to user input, and recreates a personalized site from the specialized program. This enables a customizable methodology, called PIPE, that supports the automatic specialization of resources without enumerating the interaction sequences beforehand. Issues relating to the scalability of PIPE, information integration, sessionizing scenarios, and case studies are presented.
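    A minimal sketch of the partial-evaluation idea may help: a toy "site" is modeled as nested questions over user attributes, and known inputs are resolved away, leaving a specialized residual site. The data structure and names are illustrative, not from the paper.

```python
# Toy partial evaluator: a "site" is a tree of (attribute, {value: subtree})
# nodes; leaves are concrete pages. Given partial user input, known branches
# are resolved immediately and the residual (specialized) site is returned.

def specialize(site, known):
    """Partially evaluate `site` with respect to the `known` user attributes."""
    if not isinstance(site, tuple):          # leaf: a concrete page
        return site
    attribute, branches = site
    if attribute in known:                   # static input: take the branch now
        return specialize(branches[known[attribute]], known)
    # dynamic input: keep the question, but specialize every branch
    return (attribute, {v: specialize(sub, known) for v, sub in branches.items()})

# A toy site asking for language, then level, before reaching a page.
site = ("language", {
    "python": ("level", {"beginner": "py-intro.html", "expert": "py-advanced.html"}),
    "c":      ("level", {"beginner": "c-intro.html",  "expert": "c-advanced.html"}),
})

# A visitor who already said "python" gets a site with one question left.
print(specialize(site, {"language": "python"}))
# -> ('level', {'beginner': 'py-intro.html', 'expert': 'py-advanced.html'})
```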

    A Review on Extraction and Recommendation of Educational Resources from WWW

    Keyphrases give a concise way of describing a document, giving the reader some clues about its content. Wrapper adaptation aims at automatically adapting a previously learned wrapper from a source Web site to a new, unseen site for information extraction. It is based on a generative model for the generation of text fragments related to attribute items and layout information in a Web page. To tackle the wrapper adaptation problem, we consider two kinds of knowledge from the source Web site. The first is the extraction knowledge contained in the wrapper previously learned from the source site. The second is the set of items previously extracted or collected. We use a Bayesian learning approach to automatically select a set of training examples for adapting a wrapper to the new, unseen site. To tackle the new attribute discovery problem, we develop a model that analyzes the text fragments surrounding the attributes in the new, unseen site; a Bayesian learning method is developed to discover the new attributes and their headers. We conduct extensive experiments on various real-world Web sites to show the effectiveness of our framework. Keyphrases can be helpful in a variety of applications, such as retrieval engines, browsing interfaces, thesaurus construction, and text mining; there are also other tasks for which keyphrases are useful.
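    As a rough illustration of the training-example selection step, here is a minimal sketch: a smoothed token model is estimated from items already extracted on the source site, and candidate fragments on the new site are scored under it, with the top-scored fragments kept as training examples. The feature choice (tokens) and smoothing are simplifying assumptions, not the paper's exact model.

```python
import math
from collections import Counter

def token_model(items):
    """Estimate token log-likelihoods from items extracted on the source site."""
    counts = Counter(tok for item in items for tok in item.lower().split())
    total, vocab = sum(counts.values()), len(counts)
    # Laplace-smoothed log-probability of a token under the attribute model.
    return lambda tok: math.log((counts[tok] + 1) / (total + vocab))

def select_training_examples(source_items, candidates, k=3):
    """Keep the k candidate fragments most plausible under the source model."""
    logp = token_model(source_items)
    def score(c):
        toks = c.lower().split()
        return sum(logp(t) for t in toks) / max(len(toks), 1)
    return sorted(candidates, key=score, reverse=True)[:k]

# Items previously extracted for a hypothetical "price" attribute, and
# candidate fragments scraped from the new, unseen site.
prices = ["$ 12 . 99", "$ 5 . 00", "$ 120 . 50"]
fragments = ["$ 7 . 25", "free shipping", "$ 49 . 99", "contact us"]
print(select_training_examples(prices, fragments, k=2))
```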

    Information Extraction in Illicit Domains

    Extracting useful entities and attribute values from illicit domains such as human trafficking is a challenging problem with the potential for widespread social impact. Such domains employ atypical language models, have 'long tails' and suffer from the problem of concept drift. In this paper, we propose a lightweight, feature-agnostic Information Extraction (IE) paradigm specifically designed for such domains. Our approach uses raw, unlabeled text from an initial corpus, and a few (12-120) seed annotations per domain-specific attribute, to learn robust IE models for unobserved pages and websites. Empirically, we demonstrate that our approach can outperform feature-centric Conditional Random Field baselines by over 18% F-measure on five annotated sets of real-world human trafficking datasets, in both low-supervision and high-supervision settings. We also show that our approach is demonstrably robust to concept drift, and can be efficiently bootstrapped even in a serial computing environment. (Comment: 10 pages, ACM WWW 2017)
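    A heavily simplified sketch of the idea of feature-agnostic IE: word representations are derived unsupervised from the raw corpus (here, plain co-occurrence counts stand in for the paper's representations), and a classifier is then fit on a handful of seed annotations using only a candidate's context vector. The corpus, seeds and labels below are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cooccurrence_vectors(sentences, window=2):
    """Unsupervised word vectors from raw text: simple co-occurrence counts."""
    vocab = {w: i for i, w in enumerate(sorted({w for s in sentences for w in s}))}
    M = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    M[vocab[w], vocab[s[j]]] += 1.0
    return vocab, M

def context_vector(sentence, pos, vocab, M, window=2):
    """Represent a candidate token by the mean vector of its context words."""
    ctx = [M[vocab[w]] for k, w in enumerate(sentence)
           if k != pos and abs(k - pos) <= window]
    return np.mean(ctx, axis=0) if ctx else np.zeros(M.shape[1])

corpus = [["call", "maria", "in", "houston", "tonight"],
          ["call", "anna", "in", "chicago", "today"],
          ["visit", "houston", "or", "chicago", "soon"]]
vocab, M = cooccurrence_vectors(corpus)

# A few seed annotations: (sentence index, token position, is_location).
seeds = [(0, 3, 1), (1, 3, 1), (0, 1, 0), (1, 1, 0)]
X = np.array([context_vector(corpus[s], p, vocab, M) for s, p, _ in seeds])
y = np.array([label for _, _, label in seeds])
clf = LogisticRegression().fit(X, y)

# Classify an unseen mention by its context alone (feature-agnostic).
probe = context_vector(corpus[2], 1, vocab, M)   # "houston" in a new context
print(clf.predict([probe]))
```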

    An Expressive Language and Efficient Execution System for Software Agents

    Software agents can be used to automate many of the tedious, time-consuming information processing tasks that humans currently have to complete manually. However, to do so, agent plans must be capable of representing the myriad of actions and control flows required to perform those tasks. In addition, since these tasks can require integrating multiple sources of remote information (typically a slow, I/O-bound process), it is desirable to make execution as efficient as possible. To address both of these needs, we present a flexible software agent plan language and a highly parallel execution system that enable the efficient execution of expressive agent plans. The plan language allows complex tasks to be more easily expressed by providing a variety of operators for flexibly processing the data, as well as supporting subplans (for modularity) and recursion (for indeterminate looping). The executor is based on a streaming dataflow model of execution to maximize the amount of operator and data parallelism possible at runtime. We have implemented both the language and the executor in a system called THESEUS. Our results from testing THESEUS show that streaming dataflow execution can yield significant speedups over both traditional serial (von Neumann) execution and the non-streaming dataflow-style execution that existing software and robot agent execution systems currently support. In addition, we show how plans written in the language we present can represent certain types of subtasks that cannot be accomplished using the languages supported by network query engines. Finally, we demonstrate that the increased expressivity of our plan language does not hamper performance; specifically, we show how data can be integrated from multiple remote sources just as efficiently using our architecture as is possible with a state-of-the-art streaming-dataflow network query engine.
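    A minimal sketch of the streaming-dataflow idea (not THESEUS itself): each operator runs in its own thread and forwards tuples downstream as soon as they are produced, so slow remote fetches overlap with downstream processing instead of serializing the plan. Operator names and data are invented.

```python
import threading, queue, time

SENTINEL = object()            # end-of-stream marker

def stage(fn, inq, outq):
    """Run an operator as a thread: apply fn to each tuple as it arrives."""
    def run():
        while (item := inq.get()) is not SENTINEL:
            outq.put(fn(item))
        outq.put(SENTINEL)     # propagate end-of-stream downstream
    threading.Thread(target=run, daemon=True).start()

def fetch(url):                # stands in for a slow, I/O-bound remote source
    time.sleep(0.1)
    return f"page({url})"

def extract(page):             # downstream operator works while fetches continue
    return page.upper()

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
stage(fetch, q1, q2)
stage(extract, q2, q3)

for url in ["a", "b", "c"]:
    q1.put(url)
q1.put(SENTINEL)

while (result := q3.get()) is not SENTINEL:
    print(result)              # results stream out as soon as they are ready
```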

    Semantic lifting and reasoning on the personalised activity big data repository for healthcare research

    The fast-growing market of smart health monitoring devices and mobile applications gives ordinary citizens the capability to understand and manage their own health situations. However, there are many challenges for data engineering and knowledge discovery research in enabling the efficient extraction of knowledge from data collected, in large volumes and at high velocity, from heterogeneous devices and applications. This paper presents research that initially started with the EC MyHealthAvatar project and has been under continual improvement since the project's completion. The major contribution of the work is a comprehensive big data and semantic knowledge discovery framework which integrates data from varied data resources. The framework applies a hybrid database architecture of NoSQL and RDF repositories and introduces semantic-oriented data mining and knowledge lifting algorithms. The activity stream data is collected through a Kafka big data processing component. The motivation of the research is to enhance knowledge management and discovery capabilities and efficiency, so as to support more accurate health risk analysis and lifestyle summarisation.
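    A minimal sketch of the ingestion-and-lifting step, assuming the kafka-python and rdflib libraries; the topic name, message schema and vocabulary URIs are invented for illustration and are not the project's actual configuration.

```python
# Sketch: consume activity-stream events from Kafka and lift them into RDF.
import json
from kafka import KafkaConsumer            # pip install kafka-python
from rdflib import Graph, Literal, Namespace, URIRef

ACT = Namespace("http://example.org/activity#")   # hypothetical vocabulary
g = Graph()

consumer = KafkaConsumer(
    "activity-stream",                     # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for msg in consumer:                       # blocks, consuming the stream
    event = msg.value                      # e.g. {"user": "u1", "steps": 4200, "date": "2024-01-01"}
    subject = URIRef(f"http://example.org/user/{event['user']}")
    g.add((subject, ACT.steps, Literal(event["steps"])))
    g.add((subject, ACT.date, Literal(event["date"])))
    # in a real pipeline, g would be persisted to the RDF repository here
```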

    WAQS: a web-based approximate query system

    The Web is often viewed as a gigantic database holding vast stores of information and providing ubiquitous accessibility to end-users. Since its inception, the Internet has experienced explosive growth, both in the number of users and in the amount of content available on it. However, searching for information on the Web has become increasingly difficult. Although query languages have long been part of database management systems, the standard query language, the Structured Query Language (SQL), is not suitable for Web content retrieval. In this dissertation, a new technique for document retrieval on the Web is presented. This technique is designed to allow more precise retrieval and hence reduce the number of matches returned by typical search engines. Its main objective is to allow queries based not just on keywords but also on the location of the keywords within the logical structure of a document. In addition, the technique provides approximate search capabilities based on the notions of Distance and Variable Length Don't Cares. The proposed techniques have been implemented in a system called the Web-Based Approximate Query System (WAQS), which contains an SQL-like query language called the Web-Based Approximate Query Language. The Web-Based Approximate Query Language has also been integrated with EnviroDaemon, an environmental domain-specific search engine, providing it with more detailed searching capabilities than keyword-based search alone. Implementation details, technical results and future work are presented in this dissertation.
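    A minimal sketch of the two ideas combined, structural location plus approximate matching: a path pattern with a variable-length don't-care ('*') is matched against a document's logical section tree, and keywords match within a bounded edit distance. The document representation and query syntax are invented, not WAQS's actual language.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance, for approximate keyword matching."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def match_path(pattern, path):
    """Match a structural pattern against a section path.
    '*' is a variable-length don't-care spanning any number of levels."""
    if not pattern:
        return not path
    if pattern[0] == "*":
        return any(match_path(pattern[1:], path[i:]) for i in range(len(path) + 1))
    return bool(path) and pattern[0] == path[0] and match_path(pattern[1:], path[1:])

def query(doc, pattern, keyword, distance=1):
    """Yield words at sections matching `pattern` that lie within
    `distance` edits of `keyword`. `doc` maps section paths to text."""
    for path, text in doc.items():
        if match_path(pattern, list(path)):
            for word in text.split():
                if edit_distance(word.lower(), keyword) <= distance:
                    yield path, word

doc = {("report", "methods", "sampling"): "measurment of ozone levels",
       ("report", "results"): "ozone measurements over time"}
print(list(query(doc, ["report", "*", "sampling"], "measurement")))
# -> [(('report', 'methods', 'sampling'), 'measurment')]
```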