1,242 research outputs found

    Web Data Extraction, Applications and Techniques: A Survey

    Full text link
    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provided a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities to analyze human behavior at a very large scale. We discuss also the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.Comment: Knowledge-based System

    On Extracting Information from Semi-structured Deep Web Documents

    Get PDF
    Some software agents need information that is provided by some web sites, which is difficult if they lack a query API. Information extractors are intended to extract the information of interest automati cally and offer it in a structured format. Unfortunately, most of them rely on ad-hoc techniques, which make them fade away as the Web evolves. In this paper, we present a proposal that relies on an open catalogue of features that allows to adapt it easily; we have also devised an optimi sation that allows it to be very efficient. Our experimental results prove that our proposal outperforms other state-of-the-art proposals.Ministerio de Educación y Ciencia TIN2007-64119Junta de Andalucía P07-TIC-2602Junta de Andalucía P08-TIC-4100Ministerio de Ciencia e Innovación TIN2008-04718-EMinisterio de Ciencia e Innovación TIN2010-21744Ministerio de Economía, Industria y Competitividad TIN2010-09809-EMinisterio de Ciencia e Innovación TIN2010-10811-EMinisterio de Ciencia e Innovación TIN2010-09988-EMinisterio de Economía y Competitividad TIN2011-15497-EMinisterio de Economía y Competitividad TIN2013-40848-

    An Expressive Language and Efficient Execution System for Software Agents

    Full text link
    Software agents can be used to automate many of the tedious, time-consuming information processing tasks that humans currently have to complete manually. However, to do so, agent plans must be capable of representing the myriad of actions and control flows required to perform those tasks. In addition, since these tasks can require integrating multiple sources of remote information ? typically, a slow, I/O-bound process ? it is desirable to make execution as efficient as possible. To address both of these needs, we present a flexible software agent plan language and a highly parallel execution system that enable the efficient execution of expressive agent plans. The plan language allows complex tasks to be more easily expressed by providing a variety of operators for flexibly processing the data as well as supporting subplans (for modularity) and recursion (for indeterminate looping). The executor is based on a streaming dataflow model of execution to maximize the amount of operator and data parallelism possible at runtime. We have implemented both the language and executor in a system called THESEUS. Our results from testing THESEUS show that streaming dataflow execution can yield significant speedups over both traditional serial (von Neumann) as well as non-streaming dataflow-style execution that existing software and robot agent execution systems currently support. In addition, we show how plans written in the language we present can represent certain types of subtasks that cannot be accomplished using the languages supported by network query engines. Finally, we demonstrate that the increased expressivity of our plan language does not hamper performance; specifically, we show how data can be integrated from multiple remote sources just as efficiently using our architecture as is possible with a state-of-the-art streaming-dataflow network query engine

    Towards Comparative Web Content Mining using Object Oriented Model

    Get PDF
    Web content data are heterogeneous in nature; usually composed of different types of contents and data structure. Thus, extraction and mining of web content data is a challenging branch of data mining. Traditional web content extraction and mining techniques are classified into three categories: programming language based wrappers, wrapper (data extraction program) induction techniques, and automatic wrapper generation techniques. First category constructs data extraction system by providing some specialized pattern specification languages, second category is a supervised learning, which learns data extraction rules and third category is automatic extraction process. All these data extraction techniques rely on web document presentation structures, which need complicated matching and tree alignment algorithms, routine maintenance, hard to unify for vast variety of websites and fail to catch heterogeneous data together. To catch more diversity of web documents, a feasible implementation of an automatic data extraction technique based on object oriented data model technique, 00Web, had been proposed in Annoni and Ezeife (2009). This thesis implements, materializes and extends the structured automatic data extraction technique. We developed a system (called WebOMiner) for extraction and mining of structured web contents based on object-oriented data model. Thesis extends the extraction algorithms proposed by Annoni and Ezeife (2009) and develops an automata based automatic wrapper generation algorithm for extraction and mining of structured web content data. Our algorithm identifies data blocks from flat array data structure and generates Non-Deterministic Finite Automata (NFA) pattern for different types of content data for extraction. Objective of this thesis is to extract and mine heterogeneous web content and relieve the hard effort of matching, tree alignment and routine maintenance. Experimental results show that our system is highly effective and it performs the mining task with 100% precision and 96.22% recall value

    A Novel Approach to Web Information Extraction

    Get PDF
    Business Intelligence requires the acquisition and aggrega tion of key pieces of knowledge from multiple sources in order to provide valuable information to customers. The Web is the largest source of infor mation nowadays. Unfortunately, the information it provides is available in semi-structured human-friendly formats, which makes it difficult to be processed by automated business processes. Classical propositional and ILP machine-learning techniques have been applied for this pur pose. However, the former have not enough expressive power, whereas the latter are more expressive but intractable with large datasets. Propo sitionalisation was devised as a means to provide propositional techniques with more expressive power, enabling them to exploit structural infor mation in a propositional way that allows them to be efficient. In this paper, we present a proposal to extract information from semi-structured web documents that uses this approach. It leverages a classical propo sitional machine learning technique and enhances it with the ability to learn from an unbounded context, which helps increase its precision and recall. Our experiments prove that our proposal outperforms other state of-art techniques in the literature.Ministerio de Educación y Ciencia TIN2007-64119Junta de Andalucía P07-TIC-2602Junta de Andalucía P08-TIC-4100Ministerio de Ciencia e Innovación TIN2008-04718-EMinisterio de Ciencia e Innovación TIN2010-21744Ministerio de Economía, Industria y Competitividad TIN2010-09809-EMinisterio de Ciencia e Innovación TIN2010-10811-EMinisterio de Ciencia e Innovación TIN2010-09988-EMinisterio de Economía y Competitividad TIN2011-15497-EMinisterio de Economía y Competitividad TIN2013-40848-
    corecore