
    String Matching with Variable Length Gaps

    We consider string matching with variable length gaps. Given a string T and a pattern P consisting of strings separated by variable length gaps (arbitrary strings of length in a specified range), the problem is to find all ending positions of substrings in T that match P. This problem is a basic primitive in computational biology applications. Let m and n be the lengths of P and T, respectively, and let k be the number of strings in P. We present a new algorithm achieving time O(n log k + m + α) and space O(m + A), where A is the sum of the lower bounds of the lengths of the gaps in P and α is the total number of occurrences of the strings in P within T. Compared to previous results, this bound essentially achieves the best known time and space complexities simultaneously. Consequently, our algorithm obtains the best known bounds for almost all combinations of m, n, k, A, and α. Our algorithm is surprisingly simple and straightforward to implement. We also present algorithms for finding and encoding the positions of all strings in P for every match of the pattern. Comment: draft of full version, extended abstract at SPIRE 201
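    As a concrete reference point, the problem statement above can be implemented by a brute-force matcher in a few lines of Python. This is only a quadratic-time sketch of the problem, not the paper's O(n log k + m + α) algorithm, and the list-of-strings-plus-gap-bounds encoding of P used here is an assumed one:

        def find_gapped_matches(text, strings, gaps):
            # strings[0..k-1] are the pattern strings; gaps[i] = (lo, hi)
            # bounds the gap length between strings[i] and strings[i+1].
            def starts(s):
                # all (overlapping) start positions of s in text
                return [i for i in range(len(text) - len(s) + 1)
                        if text.startswith(s, i)]

            # ending positions of matches of the pattern prefix seen so far
            ends = {i + len(strings[0]) for i in starts(strings[0])}
            for (lo, hi), s in zip(gaps, strings[1:]):
                ends = {i + len(s) for i in starts(s)
                        if any(lo <= i - e <= hi for e in ends)}
            return sorted(ends)

        # pattern "AT", gap of 1..3 characters, "GC": "ATXXGC" ends at 6
        print(find_gapped_matches("ATXXGCATGC", ["AT", "GC"], [(1, 3)]))  # [6]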

    Dictionary Matching with One Gap

    The dictionary matching with gaps problem is to preprocess a dictionary D of d gapped patterns P_1, …, P_d over an alphabet Σ, where each gapped pattern P_i is a sequence of subpatterns separated by bounded sequences of don't cares. Then, given a query text T of length n over the alphabet Σ, the goal is to output all locations in T at which a pattern P_i ∈ D, 1 ≤ i ≤ d, ends. There is renewed interest in the gapped matching problem stemming from cyber security. In this paper we solve the problem where all patterns in the dictionary have one gap with at least α and at most β don't cares, where α and β are given parameters. Specifically, we show that the dictionary matching with a single gap problem can be solved either with preprocessing time O(d log d + |D|) and space O(d log^ε d + |D|) and query time O(n(β − α) log log d log² min{d, log |D|} + occ), or with preprocessing time and space O(d² + |D|) and query time O(n(β − α) + occ), where occ is the number of patterns found. As far as we know, this is the best solution for this setting of the problem, where many overlaps may exist in the dictionary. Comment: A preliminary version was published at CPM 201
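    For illustration only, here is a naive Python query for the single-gap setting, with each pattern given as an assumed (left, lo, hi, right) tuple. With no preprocessing at all, it is a sketch of the problem rather than of the paper's data structures:

        def one_gap_query(text, patterns):
            # patterns[i] = (left, lo, hi, right); report (end, index) pairs
            out = []
            for idx, (left, lo, hi, right) in enumerate(patterns):
                for i in range(len(text) - len(left) + 1):
                    if text.startswith(left, i):
                        for g in range(lo, hi + 1):
                            j = i + len(left) + g   # where right must start
                            if (j + len(right) <= len(text)
                                    and text.startswith(right, j)):
                                out.append((j + len(right), idx))
            return sorted(set(out))

        print(one_gap_query("abXXcdab", [("ab", 1, 3, "cd")]))  # [(6, 0)]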

    Automated Feedback for 'Fill in the Gap' Programming Exercises

    Timely feedback is a vital component of the learning process. It is especially important for beginner students in Information Technology, since many have not yet formed an effective internal model of a computer that they can use to construct viable knowledge. Research has shown that learning efficiency increases when students receive immediate feedback. Automatic analysis of student programs has the potential to provide immediate feedback for students and to assist teaching staff in the marking process. This paper describes a “fill in the gap” programming analysis framework which tests students’ solutions, gives feedback on their correctness, detects logic errors, and provides hints on how to fix these errors. Currently, the framework is being used with the Environment for Learning to Program (ELP) system at Queensland University of Technology (QUT); however, it can be integrated into any existing online learning environment or programming Integrated Development Environment (IDE).
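    The ELP internals are not described here, so the following Python sketch is purely hypothetical: it illustrates the general “fill in the gap” idea by splicing a student's answer into an instructor template, running it, and reporting the first syntax or logic error found. The template, function name, and tests are all invented:

        # single-line gap answers only, for brevity
        TEMPLATE = "def absolute(x):\n    {gap}\n"

        def grade(answer, tests):
            ns = {}
            try:
                # splice the answer into the template and run the result
                exec(TEMPLATE.format(gap=answer), ns)
            except SyntaxError as e:
                return "Syntax error: %s" % e
            for args, expected in tests:
                got = ns["absolute"](*args)
                if got != expected:
                    return ("Logic error: absolute%r returned %r, expected %r"
                            % (args, got, expected))
            return "All tests passed."

        print(grade("return x if x >= 0 else -x", [((5,), 5), ((-3,), 3)]))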

    Opaque Service Virtualisation: A Practical Tool for Emulating Endpoint Systems

    Large enterprise software systems make many complex interactions with other services in their environment. Developing and testing for production-like conditions is therefore a very challenging task. Current approaches include emulation of dependent services using either explicit modelling or record-and-replay. Models require deep knowledge of the target services, while record-and-replay is limited in accuracy. Both face developmental and scaling issues. We present a new technique that improves the accuracy of record-and-replay approaches, without requiring prior knowledge of the service protocols. The approach uses Multiple Sequence Alignment to derive message prototypes from recorded system interactions and a scheme to match incoming request messages against prototypes to generate response messages. We use a modified Needleman-Wunsch algorithm for distance calculation during message matching. Our approach has shown greater than 99% accuracy for four evaluated enterprise system messaging protocols. The approach has been successfully integrated into the CA Service Virtualization commercial product to complement its existing techniques. Comment: In Proceedings of the 38th International Conference on Software Engineering Companion (pp. 202-211). arXiv admin note: text overlap with arXiv:1510.0142
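    The message-matching step relies on alignment scores. The following Python sketch shows the standard (unmodified) Needleman-Wunsch global alignment score with textbook scoring constants; the paper's modified variant and its parameters are not reproduced here:

        def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
            # score[i][j] = best alignment score of a[:i] against b[:j]
            n, m = len(a), len(b)
            score = [[0] * (m + 1) for _ in range(n + 1)]
            for i in range(1, n + 1):
                score[i][0] = i * gap
            for j in range(1, m + 1):
                score[0][j] = j * gap
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    diag = score[i-1][j-1] + (match if a[i-1] == b[j-1]
                                              else mismatch)
                    score[i][j] = max(diag, score[i-1][j] + gap,
                                      score[i][j-1] + gap)
            return score[n][m]

        # pick the closest recorded prototype for an incoming request
        def closest_prototype(request, prototypes):
            return max(prototypes, key=lambda p: needleman_wunsch(request, p))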

    Applications of Machine Learning to Threat Intelligence, Intrusion Detection and Malware

    Artificial Intelligence (AI) and Machine Learning (ML) are emerging technologies with applications in many fields. This paper is a survey of use cases of ML for threat intelligence, intrusion detection, and malware analysis and detection. Threat intelligence, especially attack attribution, can benefit from the use of ML classification. False positives from rule-based intrusion detection systems can be reduced with the use of ML models. Malware analysis and classification can be made easier by developing ML frameworks that distill similarities between malicious programs. Adversarial machine learning is also discussed, because while ML can be used to solve problems or reduce analyst workload, it also introduces new attack surfaces.
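    As a toy illustration of the false-positive-reduction use case, one might train a classifier over features of rule-based alerts. The feature set and data below are invented, and scikit-learn is assumed to be available:

        from sklearn.ensemble import RandomForestClassifier

        # each row: [alert_priority, bytes_transferred, distinct_ports, hour]
        X_train = [[3, 12000, 40, 2], [1, 300, 1, 14],
                   [3, 9000, 35, 3], [1, 500, 2, 11]]
        y_train = [1, 0, 1, 0]   # 1 = analyst confirmed, 0 = false positive

        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_train, y_train)
        print(clf.predict([[3, 11000, 30, 1]]))   # triage a new alert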

    Which Regular Expression Patterns are Hard to Match?

    Regular expressions constitute a fundamental notion in formal language theory and are frequently used in computer science to define search patterns. A classic algorithm for regular expression matching constructs and simulates a non-deterministic finite automaton corresponding to the expression, resulting in an O(mn) running time (where m is the length of the pattern and n is the length of the text). This running time can be improved slightly (by a polylogarithmic factor), but no significantly faster solutions are known. At the same time, much faster algorithms exist for various special cases of regular expressions, including dictionary matching, wildcard matching, subset matching, the word break problem, etc. In this paper, we show that the complexity of regular expression matching can be characterized based on its depth (when interpreted as a formula). Our results hold for expressions involving concatenation, OR, Kleene star, and Kleene plus. For regular expressions of depth two (involving any combination of the above operators), we show the following dichotomy: matching and membership testing can be solved in near-linear time, except for "concatenations of stars", which cannot be solved in strongly sub-quadratic time assuming the Strong Exponential Time Hypothesis (SETH). For regular expressions of depth three the picture is more complex. Nevertheless, we show that all problems can either be solved in strongly sub-quadratic time, or cannot be solved in strongly sub-quadratic time assuming SETH. An intriguing special case of membership testing involves regular expressions of the form "a star of an OR of concatenations", e.g., [a|ab|bc]*. This corresponds to the so-called word break problem, for which a dynamic programming algorithm with a runtime of (roughly) O(n√m) is known. We show that the latter bound is not tight and improve the runtime to O(nm^{0.44…}).
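    For reference, the word break problem mentioned above admits a simple textbook dynamic program, sketched here in Python. This is the basic DP, which is slower than both the O(n√m) baseline and the paper's O(nm^{0.44…}) improvement:

        def word_break(text, words):
            # membership test for text in (w1|w2|...|wk)*
            words = set(words)
            longest = max(map(len, words))
            ok = [False] * (len(text) + 1)
            ok[0] = True   # the empty prefix is in the language
            for i in range(1, len(text) + 1):
                ok[i] = any(ok[i - L] and text[i - L:i] in words
                            for L in range(1, min(longest, i) + 1))
            return ok[len(text)]

        # the [a|ab|bc]* example from the abstract
        print(word_break("ababc", {"a", "ab", "bc"}))   # True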

    CORLEONE - Core Linguistic Entity Online Extraction

    This report presents CORLEONE (Core Linguistic Entity Online Extraction), a pool of loosely coupled, general-purpose, lightweight linguistic processing resources which can be used independently to identify core linguistic entities and their features in free text. Currently, CORLEONE consists of five processing resources: (a) a basic tokenizer, (b) a tokenizer which performs fine-grained token classification, (c) a component for performing morphological analysis, (d) a memory-efficient, database-like dictionary look-up component, and (e) a sentence splitter. Linguistic resources for several languages are provided. Additionally, CORLEONE includes a comprehensive library of string distance metrics relevant to the task of name variant matching. CORLEONE has been developed in the Java programming language and makes heavy use of state-of-the-art finite-state techniques. Notably, CORLEONE components are used as basic linguistic processing resources in ExPRESS, a pattern matching engine based on regular expressions over feature structures, and in the real-time news event extraction system, both of which were developed by the Web Mining and Intelligence Group of the Support to External Security Unit of the IPSC. This report constitutes an end-user guide for CORLEONE and provides scientifically interesting details of how it was implemented. JRC.G.2 - Support to External Security
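    CORLEONE itself is a Java library and its exact metric set is not listed above, so as a language-neutral illustration here is plain Levenshtein edit distance in Python, one of the simplest string distance metrics usable for name variant matching:

        def levenshtein(a, b):
            # row-by-row edit distance DP in O(len(b)) space
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1,              # deletion
                                   cur[j - 1] + 1,           # insertion
                                   prev[j - 1] + (ca != cb)))  # substitution
                prev = cur
            return prev[-1]

        # name variant matching: accept variants within a small distance
        print(levenshtein("Muammar Gaddafi", "Moammar Gadhafi"))   # 2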