2,496 research outputs found

    Fast and Compact Regular Expression Matching

    Get PDF
    We study 4 problems in string matching, namely, regular expression matching, approximate regular expression matching, string edit distance, and subsequence indexing, on a standard word RAM model of computation that allows logarithmic-sized words to be manipulated in constant time. We show how to improve the space and/or remove a dependency on the alphabet size for each problem using either an improved tabulation technique of an existing algorithm or by combining known algorithms in a new way

    Fast and Tiny Structural Self-Indexes for XML

    Full text link
    XML document markup is highly repetitive and therefore well compressible using dictionary-based methods such as DAGs or grammars. In the context of selectivity estimation, grammar-compressed trees were used before as synopsis for structural XPath queries. Here a fully-fledged index over such grammars is presented. The index allows to execute arbitrary tree algorithms with a slow-down that is comparable to the space improvement. More interestingly, certain algorithms execute much faster over the index (because no decompression occurs). E.g., for structural XPath count queries, evaluating over the index is faster than previous XPath implementations, often by two orders of magnitude. The index also allows to serialize XML results (including texts) faster than previous systems, by a factor of ca. 2-3. This is due to efficient copy handling of grammar repetitions, and because materialization is totally avoided. In order to compare with twig join implementations, we implemented a materializer which writes out pre-order numbers of result nodes, and show its competitiveness.Comment: 13 page

    State minimization problems in finite state automata

    Get PDF
    In this thesis, we analyze the problem of state minimization in 2-MDFAs. The class of 2-MDFAs is an extension of the class of DFAs, allowing a small amount of nondeterminism; specifically two start states. Since nondeterminism allows finite automata to be more succinct, it is worthwhile to investigate the problem of minimizing such finite automata. In the case of unbounded non-determinism, i.e., NFAs, such automata can be exponentially more succinct than DFAs [1], but the corresponding minimization problem is PSPACE-complete [2]. Even in the case of 2-MDFAs, which are only polynomially more succinct than DFAs, the minimization problem remains non-trivial; indeed, [3] shows that the corresponding decision problem is NP-complete. We are concerned with the approximability of the 2-MDFA minimization problem. Our main contribution in the current work is the design of an n-factor approximation algorithm for state minimization in 2-MDFAs

    LexiDB: Patterns & Methods for Corpus Linguistic Database Management

    Get PDF
    LexiDB is a tool for storing, managing and querying corpus data. In contrast to other database management systems (DBMSs), itis designed specifically for text corpora. It improves on other corpus management systems (CMSs) because data can be added anddeleted from corpora on the fly with the ability to add live data to existing corpora. LexiDB sits between these two categories ofDBMSs and CMSs, more specialised to language data than a general-purpose DBMS but more flexible than a traditional static corpusmanagement system. Previous work has demonstrated the scalability of LexiDB in response to the growing need to be able to scale outfor ever-growing corpus datasets. Here, we present the patterns and methods developed in LexiDB for storage, retrieval and querying ofmulti-level annotated corpus data. These techniques are evaluated and compared to an existing CMS (Corpus Workbench CWB - CQP)and indexer (Lucene). We find that LexiDB consistently outperforms existing tools for corpus queries. This is particularly apparent withlarge corpora and when handling queries with large result sets

    A parallel grid-based implementation for real time processing of event log data in collaborative applications

    Get PDF
    Collaborative applications usually register user interaction in the form of semi-structured plain text event log data. Extracting and structuring of data is a prerequisite for later key processes such as the analysis of interactions, assessment of group activity, or the provision of awareness and feedback. Yet, in real situations of online collaborative activity, the processing of log data is usually done offline since structuring event log data is, in general, a computationally costly process and the amount of log data tends to be very large. Techniques to speed and scale up the structuring and processing of log data with minimal impact on the performance of the collaborative application are thus desirable to be able to process log data in real time. In this paper, we present a parallel grid-based implementation for processing in real time the event log data generated in collaborative applications. Our results show the feasibility of using grid middleware to speed and scale up the process of structuring and processing semi-structured event log data. The Grid prototype follows the Master-Worker (MW) paradigm. It is implemented using the Globus Toolkit (GT) and is tested on the Planetlab platform

    An Experiment in Ping-Pong Protocol Verification by Nondeterministic Pushdown Automata

    Get PDF
    An experiment is described that confirms the security of a well-studied class of cryptographic protocols (Dolev-Yao intruder model) can be verified by two-way nondeterministic pushdown automata (2NPDA). A nondeterministic pushdown program checks whether the intersection of a regular language (the protocol to verify) and a given Dyck language containing all canceling words is empty. If it is not, an intruder can reveal secret messages sent between trusted users. The verification is guaranteed to terminate in cubic time at most on a 2NPDA-simulator. The interpretive approach used in this experiment simplifies the verification, by separating the nondeterministic pushdown logic and program control, and makes it more predictable. We describe the interpretive approach and the known transformational solutions, and show they share interesting features. Also noteworthy is how abstract results from automata theory can solve practical problems by programming language means.Comment: In Proceedings MARS/VPT 2018, arXiv:1803.0866

    Subpath Queries on Compressed Graphs: A Survey

    Get PDF
    Text indexing is a classical algorithmic problem that has been studied for over four decades: given a text T, pre-process it off-line so that, later, we can quickly count and locate the occurrences of any string (the query pattern) in T in time proportional to the query’s length. The earliest optimal-time solution to the problem, the suffix tree, dates back to 1973 and requires up to two orders of magnitude more space than the plain text just to be stored. In the year 2000, two breakthrough works showed that efficient queries can be achieved without this space overhead: a fast index be stored in a space proportional to the text’s entropy. These contributions had an enormous impact in bioinformatics: today, virtually any DNA aligner employs compressed indexes. Recent trends considered more powerful compression schemes (dictionary compressors) and generalizations of the problem to labeled graphs: after all, texts can be viewed as labeled directed paths. In turn, since finite state automata can be considered as a particular case of labeled graphs, these findings created a bridge between the fields of compressed indexing and regular language theory, ultimately allowing to index regular languages and promising to shed new light on problems, such as regular expression matching. This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today’s compressed indexes for labeled graphs and regular languages
    corecore