30,023 research outputs found

    Active Learning of Regular Expressions for Entity Extraction

    Get PDF
    We consider the automatic synthesis of an entity extractor, in the form of a regular expression, from examples of the desired extractions in an unstructured text stream. This is a long-standing problem for which many different approaches have been proposed, which all require the preliminary construction of a large dataset fully annotated by the user. In this work we propose an active learning approach aimed at minimizing the user annotation effort: the user annotates only one desired extraction and then merely answers extraction queries generated by the system. During the learning process, the system digs into the input text for selecting the most appropriate extraction query to be submitted to the user in order to improve the current extractor. We construct candidate solutions with Genetic Programming and select queries with a form of querying-by-committee, i.e., based on a measure of disagreement within the best candidate solutions. All the components of our system are carefully tailored to the peculiarities of active learning with Genetic Programming and of entity extraction from unstructured text. We evaluate our proposal in depth, on a number of challenging datasets and based on a realistic estimate of the user effort involved in answering each single query. The results demonstrate high accuracy with significant savings in terms of computational effort, annotated characters and execution time over a state-of-the-art baseline

    Towards a relation extraction framework for cyber-security concepts

    Full text link
    In order to assist security analysts in obtaining information pertaining to their network, such as novel vulnerabilities, exploits, or patches, information retrieval methods tailored to the security domain are needed. As labeled text data is scarce and expensive, we follow developments in semi-supervised Natural Language Processing and implement a bootstrapping algorithm for extracting security entities and their relationships from text. The algorithm requires little input data, specifically, a few relations or patterns (heuristics for identifying relations), and incorporates an active learning component which queries the user on the most important decisions to prevent drifting from the desired relations. Preliminary testing on a small corpus shows promising results, obtaining precision of .82.Comment: 4 pages in Cyber & Information Security Research Conference 2015, AC

    Information Extraction, Data Integration, and Uncertain Data Management: The State of The Art

    Get PDF
    Information Extraction, data Integration, and uncertain data management are different areas of research that got vast focus in the last two decades. Many researches tackled those areas of research individually. However, information extraction systems should have integrated with data integration methods to make use of the extracted information. Handling uncertainty in extraction and integration process is an important issue to enhance the quality of the data in such integrated systems. This article presents the state of the art of the mentioned areas of research and shows the common grounds and how to integrate information extraction and data integration under uncertainty management cover

    Evaluation of automatic hypernym extraction from technical corpora in English and Dutch

    Get PDF
    In this research, we evaluate different approaches for the automatic extraction of hypernym relations from English and Dutch technical text. The detected hypernym relations should enable us to semantically structure automatically obtained term lists from domain- and user-specific data. We investigated three different hypernymy extraction approaches for Dutch and English: a lexico-syntactic pattern-based approach, a distributional model and a morpho-syntactic method. To test the performance of the different approaches on domain-specific data, we collected and manually annotated English and Dutch data from two technical domains, viz. the dredging and financial domain. The experimental results show that especially the morpho-syntactic approach obtains good results for automatic hypernym extraction from technical and domain-specific texts
    • …
    corecore