98,747 research outputs found

    Logic-based Web Information Extraction


    Automatic Wrapper Adaptation by Tree Edit Distance Matching

    Information distributed through the Web keeps growing faster day by day, and for this reason several techniques for extracting Web data have been suggested in recent years. Extraction tasks are often performed through so-called wrappers, procedures that extract information from Web pages, e.g. by implementing logic-based techniques. Many fields of application today require highly robust wrappers, so as not to compromise information assets or the reliability of the extracted data. Unfortunately, wrappers may fail to extract data from a Web page if its structure changes, sometimes even slightly, so new techniques are needed to adapt the wrapper automatically to the new structure of the page in case of failure. In this work we present a novel approach to automatic wrapper adaptation based on measuring the similarity of trees through improved tree edit distance matching techniques.
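
To make the idea of tree similarity concrete, here is a minimal sketch using Simple Tree Matching, a well-known restricted top-down matching between ordered trees. It is a stand-in chosen for illustration, not the improved technique the abstract refers to, which is not specified here. The tuple-based tree encoding, the function names, and the normalization by average tree size are all assumptions.

```python
from typing import List, Tuple

# Hypothetical encoding: a node is (label, children), e.g. a DOM element
# and its child elements in document order.
Tree = Tuple[str, List["Tree"]]


def simple_tree_matching(a: Tree, b: Tree) -> int:
    """Size of the maximum top-down, order-preserving matching of a and b."""
    if a[0] != b[0]:
        return 0  # roots differ, so nothing below them can be matched
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)
    # best[i][j]: best matching between the first i children of a
    # and the first j children of b (a sequence-alignment style table).
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1]
                + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


def size(t: Tree) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in t[1])


def similarity(a: Tree, b: Tree) -> float:
    """Matched nodes divided by the average tree size (assumed normalization)."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


# Made-up example: the new page has an extra <ul> inserted into <body>.
old_page = ("html", [("body", [("div", [("h1", []), ("p", [])]),
                               ("div", [("span", [])])])])
new_page = ("html", [("body", [("div", [("h1", []), ("p", [])]),
                               ("ul", []),
                               ("div", [("span", [])])])])
print(similarity(old_page, new_page))
```

A wrapper could, under these assumptions, treat a subtree of the changed page as the new extraction target when its similarity to the originally targeted subtree exceeds a chosen threshold.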

    A Fuzzy Logic intelligent agent for Information Extraction: Introducing a new Fuzzy Logic-based term weighting scheme

    In this paper, we propose a novel method for Information Extraction (IE) from a set of knowledge in order to answer user queries posed in natural language. The system is based on a Fuzzy Logic engine, whose flexibility is used to manage sets of accumulated knowledge. These sets may be organized in hierarchical levels as a tree structure. The aim is to design and implement an intelligent agent able to manage any set of knowledge where information is abundant, vague or imprecise. The method was applied to a major university web portal, that of the University of Seville, which contains a huge amount of information. We also propose a novel method for term weighting (TW). This method is likewise based on Fuzzy Logic and, because of its flexibility, replaces the classical TF–IDF method usually used for TW.
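
For reference, the classical TF–IDF baseline mentioned above can be sketched as follows. This shows only the conventional scheme, not the fuzzy weighting the abstract proposes, whose details are not given here. The particular variant (term frequency normalized by document length, unsmoothed natural-log IDF), the function name, and the sample documents are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up sample: two tiny "pages" from a university portal.
pages = [
    ["enrolment", "deadline", "library"],
    ["library", "opening", "hours"],
]
for page_weights in tf_idf(pages):
    print(page_weights)
```

Note that a term occurring in every document (here `library`) receives weight zero under this variant, which is one of the rigidities a more flexible weighting scheme might aim to soften.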

    User Preference Web Search -- Experiments with a System Connecting Web and User

    We present models, methods, implementations and experiments with a system enabling personalized web search for many users with different preferences. The system consists of a web information extraction part, a text search engine, a middleware supporting top-k answers and a user interface for querying and evaluation of search results. We integrate several tools (implementing our models and methods) into one framework connecting the user with the web. The model represents user preferences with fuzzy sets and fuzzy logic, here understood as a scoring describing user satisfaction. This model can be acquired with explicit or implicit methods. The model-theoretic semantics is based on the fuzzy description logic f-EL. User preference learning is based on our model of fuzzy inductive logic programming. Our system works for both English and Slovak resources. The primary application domain is job offers and job search; however, we show an extension to mutual investment fund search and the possibility of extension to other application domains. Our top-k search is optimized with our own heuristics and a repository with special indexes. Our model was experimentally implemented, and the integration was tested and is web accessible. We focus on experiments with several users and measure their satisfaction according to correlation coefficients.

    Document Grouping by Using Meronyms and Type-2 Fuzzy Association Rule Mining

    The number of textual documents in the digital world, especially on the World Wide Web, is growing incredibly fast. This causes an accumulation of information, so efficient organization is needed to manage textual documents. One way to classify documents accurately is to use fuzzy association rules. The quality of document clustering is affected by the key term extraction phase and by the type of fuzzy logic system (FLS) used for clustering. Using meronyms in the extraction of key terms helps to obtain meaningful cluster labels; in addition, the ambiguities and uncertainties that occur in the rules of type-1 fuzzy logic systems can be overcome by using type-2 fuzzy sets. This study proposes a method of key term extraction based on meronyms, with cluster initialization using fuzzy association rule mining, for document clustering. The method consists of four stages: preprocessing of the document, extraction of key terms with meronyms, extraction of candidate clusters, and cluster tree construction. The method was tested on three different datasets: classic, Reuters, and 20 Newsgroup. Testing compared the overall F-measure of the method without meronyms and with meronyms. In these tests, the method with meronyms in the extraction of keywords produced an overall F-measure of 0.5753 for the classic dataset, 0.3984 for the Reuters dataset, and 0.6285 for the 20 Newsgroup dataset.
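
As an illustration of how an overall F-measure for a clustering can be computed, the sketch below assumes the definition commonly used in document clustering: for each true class, take the best F-score over all clusters, then average these weighted by class size. The abstract does not state which definition was used, so this choice, the function name, and the toy labels are assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def overall_f_measure(true_labels: Sequence[Hashable],
                      cluster_ids: Sequence[Hashable]) -> float:
    """Class-size-weighted average of each class's best F-score over clusters."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    # overlap[(c, k)]: number of documents of class c placed in cluster k.
    overlap = Counter(zip(true_labels, cluster_ids))
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            shared = overlap[(cls, clu)]
            if shared == 0:
                continue  # F-score is zero for this pair
            precision = shared / n_clu
            recall = shared / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


# Toy example: four documents, two true classes, everything in one cluster.
print(overall_f_measure(["a", "a", "b", "b"], [0, 0, 0, 0]))
```

Under this definition a perfect clustering scores 1.0, so the reported values can be read as lying on a 0 to 1 scale.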

    Entity set expansion from the Web via ASP

    Knowledge on the Web is in large part stored in various semantic resources that formalize, represent and organize it differently. Combining information from several sources can improve the results of tasks such as recognizing similarities among objects. In this paper, we propose a logic-based method for the problem of entity set expansion (ESE), i.e. extending a list of named entities given a set of seeds. This problem has relevant applications in the Information Extraction domain, specifically in automatic lexicon generation for dictionary-based annotation tools. In contrast to typical approaches in natural language processing, which are based on co-occurrence statistics of words, we determine the common category of the seeds by analyzing the semantic relations of the objects the words represent. To do so, we integrate information from selected Web resources. We introduce the notion of an entity network, which uniformly represents the combined knowledge and allows reasoning over it. We show how to use the network to disambiguate word senses by relying on the concept of an optimal common ancestor, and how to discover similarities between two entities. Finally, we show how to expand a set of entities by using answer set programming with external predicates.

    Structured and Unstructured Information Extraction Using Text Mining and Natural Language Processing Techniques

    Information on the web is increasing ad infinitum. The web has thus become an unstructured global space where information, even if available, cannot be used directly for the desired applications. Users often face information overload and need automated help. Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents by means of Text Mining and Natural Language Processing (NLP) techniques. The extracted structured information can be used for a variety of enterprise-level or personal tasks of varying complexity. IE can also be applied to a set of knowledge in order to answer user queries posed in natural language. The system is based on a Fuzzy Logic engine, whose flexibility is used to manage sets of accumulated knowledge. These sets may be organized in hierarchical levels as a tree structure. Information extraction distills structured data or knowledge from unstructured text by identifying references to named entities as well as stated relationships between such entities. Data mining research assumes that the information to be “mined” is already in the form of a relational database. IE can serve as an important technology for text mining. If the knowledge to be discovered is expressed directly in the documents to be mined, then IE alone can serve as an effective approach to text mining. However, if the documents contain concrete data in unstructured form rather than abstract knowledge, it may be useful to first use IE to transform the unstructured data in the document corpus into a structured database, and then use traditional data mining tools to identify abstract patterns in this extracted data. We propose a novel method for text mining with natural language processing techniques to extract information from a database efficiently; extraction time and accuracy are measured and plotted in simulation. The attributes of entities and the relationships between entities are extracted from structured and semi-structured information. Results are compared with conventional methods.

    A Workflow-Based Approach for Creating Complex Web Wrappers

    This version of the article has been accepted for publication, after peer review, and is subject to Springer Nature’s AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: https://doi.org/10.1007/978-3-540-85481-4_30. [Abstract]: In order to let software programs access and use the information and services provided by web sources, wrapper programs must be built to provide a “machine-readable” view over them. Although the research literature on web wrappers is vast, the problem of how to specify the internal logic of complex wrappers in a graphical and simple way remains largely ignored. In this paper, we propose a new language for addressing this task. Our approach leverages existing work on intelligent web data extraction and automatic web navigation as building blocks, and uses a workflow-based approach to specify the wrapper control logic. The features included in the language were chosen based on the results of a study of a wide range of real web automation applications from different business areas. In this paper, we also present the most salient results of the study. This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730. Alberto Pan’s work was partially supported by the “Ramón y Cajal” programme of the Spanish Ministry of Education and Science.