75 research outputs found

    Online determinization of large mutating automata

    Get PDF
    A mutating finite automaton (MFA) is a nondeterministic finite automaton (NFA) which changes its morphology over discrete time by a sequence of mutations, one mutation at each time instant. A mutation involves the insertion and/or removal of a set of states and/or transitions. This results in a sequence of NFAs, one mutated NFA for each mutation. Some application domains, including model-based diagnosis and monitoring of active systems in artificial intelligence and model-based testing in software engineering, require online determinization of MFAs. Determinizing an MFA online means generating a deterministic finite automaton (DFA) as soon as a mutation occurs, which is equivalent to the mutated NFA. Since the classical Subset Construction determinization algorithm may be inadequate for MFAs, a conservative algorithm is proposed, called Subset Restructuring, that generates the new DFA by restructuring the previous DFA based on the mutation occurred, instead of building it from scratch. Experimental results indicate the effectiveness of the approach, especially so when large MFAs change in time by small mutations

    A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

    Get PDF
    International audienceThis work presents a method that enables Arabic NLP community to build scalable lexical resources. The proposed method is low cost and efficient in time in addition to its scalability and extendibility. The latter is reflected in the ability for the method to be incremental in both aspects, processing resources and generating lexicons. Using a corpus; firstly, tokens are drawn from the corpus and lemmatized. Secondly, finite state transducers (FSTs) are generated semi-automatically. Finally, FSTsare used to produce all possible inflected verb forms with their full morphological features. Among the algorithm’s strength is its ability to generate transducers having 184 transitions, which is very cumbersome, if manually designed. The second strength is a new inflection scheme of Arabic verbs; this increases the efficiency of FST generation algorithm. The experimentation uses a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The resulting open lexical resources coverage is high. Our resources cover more than 70% Arabic verbs. The built resources contain 16,855 verb lemmas and 11,080,355 fully, partially and not vocalized verbal inflected forms. All these resources are being made public and currently used as an open package in the Unitex framework available under the LGPL license

    DFKI finite-state machine toolkit

    Get PDF
    Finite-state devices such as finite-state automata and finite-state transducers have been known since the emergence of computer science and are recently extensively used in many areas of language technology. The use of finite-state devices is mainly motivated by their time and space efficiency. In this paper we present the Finite-State Machine Toolkit for building, combining and optimizing the finite-state machines, developed at the Language Technology Lab of the German Research Center for Artificial Intelligence

    Regular Languages meet Prefix Sorting

    Full text link
    Indexing strings via prefix (or suffix) sorting is, arguably, one of the most successful algorithmic techniques developed in the last decades. Can indexing be extended to languages? The main contribution of this paper is to initiate the study of the sub-class of regular languages accepted by an automaton whose states can be prefix-sorted. Starting from the recent notion of Wheeler graph [Gagie et al., TCS 2017]-which extends naturally the concept of prefix sorting to labeled graphs-we investigate the properties of Wheeler languages, that is, regular languages admitting an accepting Wheeler finite automaton. Interestingly, we characterize this family as the natural extension of regular languages endowed with the co-lexicographic ordering: when sorted, the strings belonging to a Wheeler language are partitioned into a finite number of co-lexicographic intervals, each formed by elements from a single Myhill-Nerode equivalence class. Moreover: (i) We show that every Wheeler NFA (WNFA) with nn states admits an equivalent Wheeler DFA (WDFA) with at most 2n1Σ2n-1-|\Sigma| states that can be computed in O(n3)O(n^3) time. This is in sharp contrast with general NFAs. (ii) We describe a quadratic algorithm to prefix-sort a proper superset of the WDFAs, a O(nlogn)O(n\log n)-time online algorithm to sort acyclic WDFAs, and an optimal linear-time offline algorithm to sort general WDFAs. By contribution (i), our algorithms can also be used to index any WNFA at the moderate price of doubling the automaton's size. (iii) We provide a minimization theorem that characterizes the smallest WDFA recognizing the same language of any input WDFA. The corresponding constructive algorithm runs in optimal linear time in the acyclic case, and in O(nlogn)O(n\log n) time in the general case. (iv) We show how to compute the smallest WDFA equivalent to any acyclic DFA in nearly-optimal time.Comment: added minimization theorems; uploaded submitted version; New version with new results (W-MH theorem, linear determinization), added author: Giovanna D'Agostin

    Quick Subset Construction

    Get PDF
    A finite automaton can be either deterministic (DFA) or nondeterministic (NFA). An automaton-based task is in general more efficient when performed with a DFA rather than an NFA. For any NFA there is an equivalent DFA that can be generated by the classical Subset Construction algorithm. When, however, a large NFA may be transformed into an equivalent DFA by a series of actions operating directly on the NFA, Subset Construction may be unnecessarily expensive in computation, as a (possibly large) deterministic portion of the NFA is regenerated as is, a waste of processing. This is why a conservative algorithm for NFA determinization is proposed, called Quick Subset Construction, which progressively transforms an NFA into an equivalent DFA instead of generating the DFA from scratch, thereby avoiding unnecessary processing. Quick Subset Construction is proven, both formally and empirically, to be equivalent to Subset Construction, inasmuch it generates exactly the same DFA. Experimental results indicate that, the smaller the number of repair actions performed on the NFA, as compared to the size of the equivalent DFA, the faster Quick Subset Construction over Subset Construction

    Formal Languages and Compilation

    Get PDF
    This textbook describes the essential principles and methods used for defining the syntax of artificial languages, and for designing efficient parsing algorithms and syntax-directed translators with semantic attributes. A comprehensive selection of topics is presented within a rigorous, unified framework, illustrated by numerous practical examples. Features and topics: presents a novel conceptual approach to parsing algorithms that applies to extended BNF grammars, together with a parallel parsing algorithm; supplies supplementary teaching tools, including course slides and exercises with solutions, at an associated website; unifies the concepts and notations used in different approaches, enabling an extended coverage of methods with a reduced number of definitions; systematically discusses ambiguous forms, allowing readers to avoid pitfalls when designing grammars; describes all algorithms in pseudocode, so that detailed knowledge of a specific programming language is not necessary; makes extensive usage of theoretical models of automata, transducers and formal grammars; includes concise coverage of algorithms for processing regular expressions and finite automata; and introduces static program analysis based on flow equations. This clearly-written, classroom-tested textbook is an ideal guide to the fundamentals of this field for advanced undergraduate and graduate students in computer science and computer engineering. Some background in programming is required, and readers should also be familiar with basic set theory, algebra and logic

    Constructing minimal acyclic deterministic finite automata

    Get PDF
    This thesis is submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Ph.D) in the FASTAR group of the Department of Computer Science, University of Pretoria, South Africa. I present a number of algorithms for constructing minimal acyclic deterministic finite automata (MADFAs), most of which I originally derived/designed or co-discovered. Being acyclic, such automata represent finite languages and have proven useful in applications such as spellchecking, virus-searching and text indexing. In many of those applications, the automata grow to billions of states, making them difficult to store without using various compression techniques — the most important of which is minimization. Results from the late 1950’s show that minimization yields a unique automaton (for a given language), and later results show that minimization of acyclic automata is possible in time linear in the number of states. These two results make for a rich area of algorithmics research; automata and algorithmics research are relatively old fields of computing science and the discovery/invention of new algorithms in the field is an exciting result. I present both incremental and nonincremental algorithms. With nonincremental techniques, the unminimized acyclic deterministic finite automaton (ADFA) is first constructed and then minimized. As mentioned above, the unminimized ADFA can be very large indeed — often even too large to fit within the virtual memory space of the computer. As a result, incremental techniques for minimization (i.e. the ADFA is minimized during its construction) become interesting. Incremental algorithms frequently have some overhead: if the unminimized ADFA fits easily within physical memory, it may still be faster to use nonincremental techniques. The presentation used in this thesis has a few unusual characteristics: Few other presentations follow a correctness-by-construction style for presenting and deriving algorithms. The presentations given here include correctness arguments or sketches thereof. The presentation is taxonomic — emphasizing the similarities and differences between the algorithms at a fundamental level. While it is possible to present these algorithms in a formal-language-theoretic setting, this thesis remains somewhat closer to the actual implementation issues. In several chapters, new algorithms and interesting new variants of existing algorithms are presented. It gives new presentations of many existing algorithms — all in a common format with common examples. There are extensive links to the existing literature. Thesis (PhD)--University of Pretoria, 2010.Computer Scienceunrestricte
    corecore