54 research outputs found

    On the Automatic Construction of Regular Expressions from Examples (GP vs. Humans 1-0)

    Regular expressions are systematically used in a number of different application domains. Writing a regular expression for solving a specific task is usually quite difficult, requiring significant technical skills and creativity. We have developed a tool based on Genetic Programming capable of constructing regular expressions for text extraction automatically, based on examples of the text to be extracted. We have recently demonstrated that our tool is human-competitive in terms of both accuracy of the regular expressions and time required for their construction. We base this claim on a large-scale experiment involving more than 1700 users on 10 text extraction tasks of realistic complexity. The F-measure of the expressions constructed by our tool was almost always higher than the average F-measure of the expressions constructed by each of the three categories of users involved in our experiment (Novice, Intermediate, Experienced). The time required by our tool was almost always smaller than the average time required by each of the three categories of users. The experiment is described in full detail in "Can a machine replace humans? A case study", IEEE Intelligent Systems, 2016.
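
    The F-measure comparison mentioned above can be illustrated with a minimal sketch. It assumes that an extraction counts as correct only when its character span matches a desired span exactly; the abstract does not state the precise definition used in the experiment, and the function name, sample text, and patterns below are hypothetical.

```python
import re

def extraction_f_measure(pattern, text, expected_spans):
    """Return (precision, recall, F1) of `pattern` on `text`.

    Assumption: a match is correct only if its (start, end) span is
    exactly one of `expected_spans`.
    """
    found = {m.span() for m in re.finditer(pattern, text)}
    true_positives = len(found & expected_spans)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(expected_spans) if expected_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical task: extract the two phone-like numbers, not "12-34".
text = "Call 555-1234 or 555-9876, not 12-34."
expected = {(text.index(s), text.index(s) + len(s))
            for s in ("555-1234", "555-9876")}

strict = extraction_f_measure(r"\d{3}-\d{4}", text, expected)
loose = extraction_f_measure(r"\d+-\d+", text, expected)
```

    Here the stricter pattern extracts exactly the two desired items, while the looser one also extracts "12-34" and so loses precision.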

    Can a Machine Replace Humans in Building Regular Expressions? A Case Study

    Regular expressions are routinely used in a variety of different application domains. But building a regular expression involves a considerable amount of skill, expertise, and creativity. In this work, the authors investigate whether a machine can substitute for these qualities and automatically construct regular expressions for tasks of realistic complexity. They discuss a large-scale experiment involving more than 1,700 users on 10 challenging tasks. The authors compare the solutions constructed by these users to those constructed by a tool based on genetic programming that they recently developed and made publicly available. The quality of automatically constructed solutions turned out to be similar to the quality of those constructed by the most skilled user group; the time for automatic construction was likewise similar to the time required by human users.

    Active Learning of Regular Expressions for Entity Extraction

    We consider the automatic synthesis of an entity extractor, in the form of a regular expression, from examples of the desired extractions in an unstructured text stream. This is a long-standing problem for which many different approaches have been proposed, which all require the preliminary construction of a large dataset fully annotated by the user. In this work we propose an active learning approach aimed at minimizing the user annotation effort: the user annotates only one desired extraction and then merely answers extraction queries generated by the system. During the learning process, the system digs into the input text to select the most appropriate extraction query to be submitted to the user in order to improve the current extractor. We construct candidate solutions with Genetic Programming and select queries with a form of querying-by-committee, i.e., based on a measure of disagreement within the best candidate solutions. All the components of our system are carefully tailored to the peculiarities of active learning with Genetic Programming and of entity extraction from unstructured text. We evaluate our proposal in depth, on a number of challenging datasets and based on a realistic estimate of the user effort involved in answering each single query. The results demonstrate high accuracy with significant savings in terms of computational effort, annotated characters and execution time over a state-of-the-art baseline.
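
    As a rough illustration of querying-by-committee, the sketch below picks, among candidate text spans, the one on which a committee of regular expressions is most evenly split. The vote-split disagreement measure, the hand-written committee, and all names are assumptions made for illustration; the abstract does not specify the measure actually used, and in the described system the committee would come from Genetic Programming.

```python
import re

def extracts(pattern, text, span):
    """True if `pattern` extracts exactly the substring at `span`."""
    return any(m.span() == span for m in re.finditer(pattern, text))

def pick_query(committee, text, candidate_spans):
    """Return the candidate span the committee disagrees on most."""
    def disagreement(span):
        votes = sum(extracts(p, text, span) for p in committee)
        # 0 when unanimous; largest when the committee splits evenly.
        return min(votes, len(committee) - votes)
    return max(candidate_spans, key=disagreement)

# Hypothetical committee of candidate extractors and candidate spans.
text = "id 42, code A7, year 2016"
committee = [r"\d+", r"\d{4}", r"\d{2,}", r"[0-9]{4}"]
candidates = [(3, 5), (21, 25), (13, 14)]  # "42", "2016", "7"

query = pick_query(committee, text, candidates)
```

    In this example all four patterns agree on "2016" and three of four reject "7", so the span "42", on which the committee splits two against two, would be the one shown to the user.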

    Automatic Search-and-Replace From Examples With Coevolutionary Genetic Programming

    We describe the design and implementation of a system for executing search-and-replace text processing tasks automatically, based only on examples of the desired behavior. The examples consist of pairs describing the original string and the desired modified string. Their construction, thus, does not require any specific technical skill. The system constructs a solution to the specified task that can be used unchanged on popular existing software for text processing. The solution consists of a search pattern coupled with a replacement expression: the former is a regular expression which describes both the strings to be replaced and their portions to be reused in the latter, which describes how to build the modified strings. Our proposed system is internally based on genetic programming and implements a form of cooperative coevolution in which two separate populations are evolved independently, one for search patterns and the other for replacement expressions. We assess our proposal on six tasks of realistic complexity obtaining very good results, both in terms of absolute quality of the solutions and with respect to the challenging baselines considered
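
    The shape of such a solution, a search pattern whose capture groups are reused by a replacement expression, can be shown with a small hand-written example. This is not output of the described system: the task, the pattern, the replacement, and the helper name are hypothetical, and the syntax is that of Python's `re` module.

```python
import re

# Search pattern: captures day, month, and year of a dd/mm/yyyy date.
search_pattern = r"(\d{2})/(\d{2})/(\d{4})"
# Replacement expression: reuses the captured groups in ISO order.
replacement = r"\3-\2-\1"

# Examples in the form (original string, desired modified string).
examples = [
    ("due 25/12/2016", "due 2016-12-25"),
    ("from 01/02/2015 to 03/04/2015", "from 2015-02-01 to 2015-04-03"),
]

def solves(pattern, repl, pairs):
    """True if applying the pair to every original yields the desired string."""
    return all(re.sub(pattern, repl, original) == desired
               for original, desired in pairs)

ok = solves(search_pattern, replacement, examples)
```

    A check like `solves` is one plausible way to score a candidate pair against the examples; the fitness function actually used is not given in the abstract.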

    Regex-based Entity Extraction with Active Learning and Genetic Programming

    We consider the long-standing problem of the automatic generation of regular expressions for text extraction, based solely on examples of the desired behavior. We investigate several active learning approaches in which the user annotates only one desired extraction and then merely answers extraction queries generated by the system. The resulting framework is attractive because it is the system, not the user, which digs out the data in search of the samples most suitable to the specific learning task. We tailor our proposals to a state-of-the-art learner based on Genetic Programming and we assess them experimentally on a number of challenging tasks of realistic complexity. The results indicate that active learning is indeed a viable framework in this application domain and may thus significantly decrease the amount of costly annotation effort required

    Personalized, Browser-Based Visual Phishing Detection Based on Deep Learning

    Phishing defense mechanisms that are close to browsers and that do not rely on any forms of website reputation may be a powerful tool for combating phishing campaigns that are increasingly more targeted and last for increasingly shorter life spans. Browser-based phishing detectors that are specialized for a user-selected set of targeted web sites and that are based only on the overall visual appearance of a target could be a very effective tool in this respect. Approaches of this kind have not been very successful for several reasons, including the difficulty of coping with the large set of genuine pages encountered in normal browser usage without flooding the user with false positives. In this work we intend to investigate whether the power of modern deep learning methodologies for image classification may enable solutions that are more practical and effective. Our experimental assessment of a convolutional neural network resulted in very high classification accuracy for targeted sets of 15 websites (the largest size that we analyzed) even when immersed in a set of login pages taken from 100 websites

    Learning Text Patterns using Separate-and-Conquer Genetic Programming

    The problem of extracting knowledge from large volumes of unstructured textual information has become increasingly important. We consider the problem of extracting text slices that adhere to a syntactic pattern and propose an approach capable of generating the desired pattern automatically, from a few annotated examples. Our approach is based on Genetic Programming and generates extraction patterns in the form of regular expressions that may be input to existing engines without any post-processing. A key feature of our proposal is its ability to discover automatically whether the extraction task may be solved by a single pattern or whether a set of multiple patterns is required. We obtain this property by means of a separate-and-conquer strategy: once a candidate pattern provides adequate performance on a subset of the examples, the pattern is inserted into the set of final solutions and the evolutionary search continues on a smaller set of examples including only those not yet solved adequately. Our proposal outperforms an earlier state-of-the-art approach on three challenging datasets.
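
    The separate-and-conquer loop can be sketched as follows. This is a simplification under stated assumptions: a fixed pool of hand-written candidate patterns stands in for the evolutionary search, "adequate performance" is reduced to fully matching at least one still-unsolved example, and negative examples are ignored. All names and data are hypothetical.

```python
import re

def covered_by(pattern, examples):
    """Examples that `pattern` matches in full."""
    return [e for e in examples if re.fullmatch(pattern, e)]

def separate_and_conquer(candidates, examples):
    """Greedily build a set of patterns that together cover `examples`.

    Each round picks the candidate covering the most unsolved examples,
    adds it to the solution, and continues on the remaining examples.
    """
    remaining = list(examples)
    solution = []
    while remaining:
        best = max(candidates, key=lambda p: len(covered_by(p, remaining)))
        solved = covered_by(best, remaining)
        if not solved:
            break  # no candidate makes progress on what is left
        solution.append(best)
        remaining = [e for e in remaining if e not in solved]
    return solution, remaining

# Hypothetical task: dates written in two different formats.
examples = ["2016-01-05", "2015-12-31", "05/01/2016", "31/12/2015", "1999-07-04"]
candidates = [r"\d{4}-\d{2}-\d{2}", r"\d{2}/\d{2}/\d{4}", r"\d+"]

patterns, unsolved = separate_and_conquer(candidates, examples)
```

    On this input the loop ends with two patterns, one per date format, which mirrors the idea that the number of patterns is discovered rather than fixed in advance.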

    Syntactical Similarity Learning by Means of Grammatical Evolution

    Several research efforts have shown that a similarity function synthesized from examples may capture an application-specific similarity criterion in a way that fits the application needs more effectively than a generic distance definition. In this work, we propose a similarity learning algorithm tailored to problems of syntax-based entity extraction from unstructured text streams. The algorithm takes as input pairs of strings along with an indication of whether or not they adhere to the same syntactic pattern. Our approach is based on Grammatical Evolution and systematically explores a similarity definition space including all functions that may be expressed with a specialized, simple language that we have defined for this purpose. We assessed our proposal on patterns representative of practical applications. The results suggest that the proposed approach is indeed feasible and that the learned similarity function is more effective than the Levenshtein distance and the Jaccard similarity index.
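
    For reference, the two generic baselines named above can be sketched in a few lines. The Levenshtein distance follows its standard definition; for the Jaccard index, computing it over sets of character bigrams is an assumption, since the abstract does not say which units were used. Function names and sample strings are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def jaccard(a, b, n=2):
    """Jaccard index over character n-gram sets (bigrams by default)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0

# Two strings that follow the same syntactic pattern (a date).
same_pattern = jaccard("2016-01-05", "1999-07-04")
```

    The two dates share almost no bigrams, so their bigram Jaccard index is low even though they follow the same syntactic pattern; this is the kind of mismatch that motivates learning a task-specific similarity instead.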

    Interactive Example-based Finding of Text Items

    We consider the problem of identifying within a given document all text items which follow a certain pattern to be specified by a user. In particular, we focus on scenarios in which the task is to be completed very quickly and the user is not able to specify the exact pattern of interest. The key use case corresponds to the interactive exploration of documents in search of snippets that do not fit Boolean, word-based search expressions. We propose an interactive framework in which the user provides examples of the items he is interested in, the system identifies items similar to those provided by the user and progressively refines the similarity criterion by submitting selected queries to the user, in an active learning fashion. The fact that the search is to be executed very quickly places severe requirements on the algorithms that can be used by the system, both for identifying the items and for constructing the queries. We propose and assess experimentally in detail a number of different design options for the components of the learning machinery. The results demonstrate the ability of our approach to achieve effectiveness close to state-of-the-art approaches based on regular expressions, while requiring an execution time which is orders of magnitude shorter

    Inference of Regular Expressions for Text Extraction from Examples

    A large class of entity extraction tasks from text that is either semistructured or fully unstructured may be addressed by regular expressions, because in many practical cases the relevant entities follow an underlying syntactical pattern and this pattern may be described by a regular expression. In this work we consider the long-standing problem of synthesizing such expressions automatically, based solely on examples of the desired behavior. We present the design and implementation of a system capable of addressing extraction tasks of realistic complexity. Our system is based on an evolutionary procedure carefully tailored to the specific needs of regular expression generation by examples. The procedure executes a search driven by a multiobjective optimization strategy aimed at simultaneously improving multiple performance indexes of candidate solutions while at the same time ensuring an adequate exploration of the huge solution space. We assess our proposal experimentally in great depth, on a number of challenging datasets. The accuracy of the obtained solutions seems to be adequate for practical usage and improves over earlier proposals significantly. Most importantly, our results are highly competitive even with respect to human operators. A prototype is available as a web application at http://regex.inginf.units.it