
    Online determinization of large mutating automata

    A mutating finite automaton (MFA) is a nondeterministic finite automaton (NFA) that changes its morphology over discrete time through a sequence of mutations, one mutation at each time instant. A mutation is the insertion and/or removal of a set of states and/or transitions, so that the MFA gives rise to a sequence of NFAs, one mutated NFA per mutation. Some application domains, including model-based diagnosis and monitoring of active systems in artificial intelligence and model-based testing in software engineering, require online determinization of MFAs: as soon as a mutation occurs, a deterministic finite automaton (DFA) equivalent to the mutated NFA must be generated. Since the classical Subset Construction determinization algorithm may be inadequate for MFAs, a conservative algorithm called Subset Restructuring is proposed, which generates the new DFA by restructuring the previous DFA according to the mutation that occurred, instead of building it from scratch. Experimental results indicate the effectiveness of the approach, especially when large MFAs change over time by small mutations.
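
    The abstract's terminology can be made concrete with a small sketch. Below is a minimal, hypothetical Python representation of an NFA and of a mutation as a batch of state/transition insertions and removals; the names and fields are illustrative assumptions and do not reflect the paper's implementation.

        # Illustrative only: an NFA as explicit state/transition sets, and a
        # mutation as a batch of insertions and removals applied at one time instant.
        from dataclasses import dataclass, field

        @dataclass
        class NFA:
            states: set = field(default_factory=set)
            transitions: dict = field(default_factory=dict)    # (state, symbol) -> set of targets

        @dataclass
        class Mutation:
            add_states: set = field(default_factory=set)
            remove_states: set = field(default_factory=set)
            add_transitions: set = field(default_factory=set)     # {(src, symbol, dst), ...}
            remove_transitions: set = field(default_factory=set)

        def apply_mutation(nfa: NFA, m: Mutation) -> None:
            """Apply one mutation in place, yielding the next NFA in the sequence."""
            nfa.states |= m.add_states
            for src, sym, dst in m.add_transitions:
                nfa.transitions.setdefault((src, sym), set()).add(dst)
            for src, sym, dst in m.remove_transitions:
                nfa.transitions.get((src, sym), set()).discard(dst)
            nfa.states -= m.remove_states
            # drop transitions touching removed states
            nfa.transitions = {(src, sym): {d for d in dsts if d not in m.remove_states}
                               for (src, sym), dsts in nfa.transitions.items()
                               if src not in m.remove_states}

    An online determinizer such as Subset Restructuring would run after each such apply_mutation call, updating the previous DFA rather than rebuilding it from scratch.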

    A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

    This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low-cost and time-efficient, as well as scalable and extendible: it can operate incrementally both in processing resources and in generating lexicons. Starting from a corpus, tokens are first drawn from the corpus and lemmatized. Secondly, finite-state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms together with their full morphological features. One strength of the algorithm is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experiments use a representative corpus of Modern Standard Arabic, from which 171 transducers are generated semi-automatically. The resulting open lexical resources have high coverage, spanning more than 70% of Arabic verbs: they contain 16,855 verb lemmas and 11,080,355 fully, partially, and non-vocalized inflected verb forms. All these resources are publicly released and are currently used as an open package in the Unitex framework, available under the LGPL license.
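
    To illustrate the general idea of expanding a lemma into inflected forms with their features via an FST, here is a toy generator; the suffixes and feature labels are hypothetical placeholders, unrelated to the paper's Arabic inflection scheme or to Unitex's graph format.

        # Toy finite-state transducer: each path from the start state to a final
        # state appends output material to the lemma and accumulates features.
        from collections import defaultdict

        class FST:
            def __init__(self, start, finals):
                self.start, self.finals = start, finals
                self.arcs = defaultdict(list)    # state -> [(output, feature, next_state)]

            def add_arc(self, src, output, feature, dst):
                self.arcs[src].append((output, feature, dst))

            def generate(self, lemma):
                """Enumerate (inflected form, features) pairs for a lemma."""
                stack = [(self.start, lemma, [])]
                while stack:
                    state, form, feats = stack.pop()
                    if state in self.finals:
                        yield form, "+".join(feats)
                    for output, feat, nxt in self.arcs[state]:
                        stack.append((nxt, form + output, feats + [feat]))

        # Hypothetical paradigm with two arcs, one per (suffix, feature) pair.
        fst = FST(start=0, finals={1})
        fst.add_arc(0, "a", "3sg.masc.perf", 1)
        fst.add_arc(0, "at", "3sg.fem.perf", 1)
        print(list(fst.generate("katab")))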

    Regular Languages meet Prefix Sorting

    Indexing strings via prefix (or suffix) sorting is, arguably, one of the most successful algorithmic techniques developed in the last decades. Can indexing be extended to languages? The main contribution of this paper is to initiate the study of the subclass of regular languages accepted by an automaton whose states can be prefix-sorted. Starting from the recent notion of Wheeler graph [Gagie et al., TCS 2017], which naturally extends the concept of prefix sorting to labeled graphs, we investigate the properties of Wheeler languages, that is, regular languages admitting an accepting Wheeler finite automaton. Interestingly, we characterize this family as the natural extension of regular languages endowed with the co-lexicographic ordering: when sorted, the strings belonging to a Wheeler language are partitioned into a finite number of co-lexicographic intervals, each formed by elements from a single Myhill-Nerode equivalence class. Moreover: (i) We show that every Wheeler NFA (WNFA) with n states admits an equivalent Wheeler DFA (WDFA) with at most 2n-1-|Σ| states that can be computed in O(n^3) time. This is in sharp contrast with general NFAs. (ii) We describe a quadratic algorithm to prefix-sort a proper superset of the WDFAs, an O(n log n)-time online algorithm to sort acyclic WDFAs, and an optimal linear-time offline algorithm to sort general WDFAs. By contribution (i), our algorithms can also be used to index any WNFA at the moderate price of doubling the automaton's size. (iii) We provide a minimization theorem that characterizes the smallest WDFA recognizing the same language as any input WDFA. The corresponding constructive algorithm runs in optimal linear time in the acyclic case, and in O(n log n) time in the general case. (iv) We show how to compute the smallest WDFA equivalent to any acyclic DFA in nearly optimal time.
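
    As a concrete reference point, the Wheeler conditions of Gagie et al. can be checked directly from their definition. The sketch below is a quadratic brute-force check, not one of the paper's sorting algorithms; the edge format (source, target, label) and the candidate node order given as a rank map are assumptions of this illustration.

        # Check whether `order` (node -> rank) makes the labeled graph a Wheeler graph.
        def is_wheeler(edges, nodes, order):
            indeg = {v: 0 for v in nodes}
            for _, v, _ in edges:
                indeg[v] += 1
            # (0) nodes with in-degree 0 precede all other nodes
            if any(order[u] > order[v]
                   for u in nodes if indeg[u] == 0
                   for v in nodes if indeg[v] > 0):
                return False
            for (u1, v1, a1) in edges:
                for (u2, v2, a2) in edges:
                    # (1) edges with smaller labels must reach strictly smaller nodes
                    if a1 < a2 and not order[v1] < order[v2]:
                        return False
                    # (2) equal labels: source order propagates non-strictly to targets
                    if a1 == a2 and order[u1] < order[u2] and not order[v1] <= order[v2]:
                        return False
            return True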

    Quick Subset Construction

    A finite automaton can be either deterministic (DFA) or nondeterministic (NFA). An automaton-based task is in general more efficient when performed with a DFA rather than an NFA. For any NFA there is an equivalent DFA, which can be generated by the classical Subset Construction algorithm. However, when a large NFA can be transformed into an equivalent DFA by a series of actions operating directly on the NFA, Subset Construction may be unnecessarily expensive, since a (possibly large) portion of the NFA that is already deterministic is regenerated as is, which wastes processing. This is why a conservative algorithm for NFA determinization is proposed, called Quick Subset Construction, which progressively transforms the NFA into an equivalent DFA instead of generating the DFA from scratch, thereby avoiding unnecessary processing. Quick Subset Construction is proven, both formally and empirically, to be equivalent to Subset Construction, inasmuch as it generates exactly the same DFA. Experimental results indicate that the smaller the number of repair actions performed on the NFA, compared to the size of the equivalent DFA, the faster Quick Subset Construction is relative to Subset Construction.
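
    For reference, the classical Subset Construction baseline can be sketched in a few lines. The version below ignores ε-transitions and uses frozensets of NFA states as DFA states; the function signature and representation are assumptions of this sketch, not the paper's.

        # Textbook Subset Construction: determinize an NFA whose transition
        # function is given as delta[(state, symbol)] -> set of states.
        from collections import deque

        def subset_construction(start_states, delta, alphabet, accepting):
            start = frozenset(start_states)
            dfa_delta, dfa_accepting = {}, set()
            queue, seen = deque([start]), {start}
            while queue:
                S = queue.popleft()
                if S & accepting:
                    dfa_accepting.add(S)
                for a in alphabet:
                    # the DFA successor of S on symbol a is the union of NFA successors
                    T = frozenset(q for s in S for q in delta.get((s, a), ()))
                    dfa_delta[(S, a)] = T
                    if T not in seen:
                        seen.add(T)
                        queue.append(T)
            return start, dfa_delta, dfa_accepting

    The point made by the paper is that, when the NFA is already largely deterministic, most of the work done by this loop merely reproduces the existing structure and can instead be replaced by localized repair actions on the NFA itself.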

    Constructing minimal acyclic deterministic finite automata

    This thesis is submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (PhD) in the FASTAR group of the Department of Computer Science, University of Pretoria, South Africa. I present a number of algorithms for constructing minimal acyclic deterministic finite automata (MADFAs), most of which I originally derived/designed or co-discovered. Being acyclic, such automata represent finite languages and have proven useful in applications such as spellchecking, virus-searching and text indexing. In many of those applications, the automata grow to billions of states, making them difficult to store without various compression techniques, the most important of which is minimization. Results from the late 1950s show that minimization yields a unique automaton (for a given language), and later results show that minimization of acyclic automata is possible in time linear in the number of states. These two results make for a rich area of algorithmics research; automata and algorithmics are relatively old fields of computing science, and the discovery/invention of new algorithms in the field is an exciting result. I present both incremental and nonincremental algorithms. With nonincremental techniques, the unminimized acyclic deterministic finite automaton (ADFA) is first constructed and then minimized. As mentioned above, the unminimized ADFA can be very large indeed, often too large to fit within the virtual memory space of the computer. As a result, incremental techniques, in which the ADFA is minimized during its construction, become interesting. Incremental algorithms frequently have some overhead: if the unminimized ADFA fits easily within physical memory, it may still be faster to use nonincremental techniques. The presentation used in this thesis has a few unusual characteristics. Few other presentations follow a correctness-by-construction style for presenting and deriving algorithms; the presentations given here include correctness arguments or sketches thereof. The presentation is taxonomic, emphasizing the similarities and differences between the algorithms at a fundamental level. While it is possible to present these algorithms in a formal-language-theoretic setting, this thesis remains somewhat closer to actual implementation issues. Several chapters present new algorithms and interesting new variants of existing algorithms, and many existing algorithms are given new presentations, all in a common format with common examples and with extensive links to the existing literature. Thesis (PhD)--University of Pretoria, 2010.
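
    As a minimal sketch of the nonincremental route described above (build an unminimized ADFA, then minimize it), the following builds a trie for a finite word list and merges equivalent states bottom-up via a registry of state signatures. It is illustrative only, not one of the thesis's derivations, and the class and function names are assumptions.

        # Build an unminimized ADFA (a trie), then minimize it in post-order by
        # merging states with identical right languages.
        class State:
            def __init__(self):
                self.final = False
                self.edges = {}          # symbol -> State

        def build_trie(words):
            root = State()
            for w in words:
                s = root
                for c in w:
                    s = s.edges.setdefault(c, State())
                s.final = True
            return root

        def minimize(state, registry):
            """Replace each child by its registered representative, then register this state."""
            for c, child in state.edges.items():
                state.edges[c] = minimize(child, registry)
            signature = (state.final,
                         tuple(sorted((c, id(t)) for c, t in state.edges.items())))
            return registry.setdefault(signature, state)

        madfa_root = minimize(build_trie(["cat", "cats", "car"]), {})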

    On Dependency Analysis via Contractions and Weighted FSTs

    Arc contractions in syntactic dependency graphs can be used to decide which graphs are trees. The paper observes that these contractions can be expressed with weighted finite-state transducers (weighted FSTs) that operate on string-encoded trees. The observation gives rise to a finite-state parsing algorithm that computes the parse forest and extracts the best parses from it. The algorithm is customizable to functional and bilexical dependency parsing, and it can be extended to non-projective parsing via a multi-planar encoding for which prior results report high recall. Our experiments support an analysis of projective parsing according to which the worst-case time complexity of the algorithm is quadratic in the sentence length, and linear in the number of overlapping arcs and in the number of functional categories of the arcs. The results suggest several interesting directions towards efficient and high-precision dependency parsing that takes advantage of the flexibility and the demonstrated ambiguity-packing capacity of such a parser.
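
    For readers unfamiliar with the underlying decision problem, the property being decided is ordinary treeness of a dependency graph. The sketch below is the standard direct check (one head per token, no cycles when following heads), not the paper's weighted-FST contraction encoding; the arc format is an assumption of this illustration.

        # A dependency graph over tokens 1..n, with arcs (head, dependent) and 0
        # denoting the artificial root, is a tree iff every token has exactly one
        # head and following heads never loops back.
        def is_dependency_tree(n, arcs):
            head = {}
            for h, d in arcs:
                if d in head:                # two heads for one token
                    return False
                head[d] = h
            if set(head) != set(range(1, n + 1)):
                return False                 # some token has no head
            for tok in range(1, n + 1):      # walk up to the root, detecting cycles
                seen, cur = set(), tok
                while cur != 0:
                    if cur in seen:
                        return False
                    seen.add(cur)
                    cur = head[cur]
            return True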

    A Fast Compiler for NetKAT

    High-level programming languages play a key role in a growing number of networking platforms, streamlining application development and enabling precise formal reasoning about network behavior. Unfortunately, current compilers only handle "local" programs that specify behavior in terms of hop-by-hop forwarding behavior, or modest extensions such as simple paths. To encode richer "global" behaviors, programmers must add extra state -- something that is tricky to get right and makes programs harder to write and maintain. Making matters worse, existing compilers can take tens of minutes to generate the forwarding state for the network, even on relatively small inputs. This forces programmers to waste time working around performance issues or even revert to using hardware-level APIs. This paper presents a new compiler for the NetKAT language that handles rich features including regular paths and virtual networks, and yet is several orders of magnitude faster than previous compilers. The compiler uses symbolic automata to calculate the extra state needed to implement "global" programs, and an intermediate representation based on binary decision diagrams to dramatically improve performance. We describe the design and implementation of three essential compiler stages: from virtual programs (which specify behavior in terms of virtual topologies) to global programs (which specify network-wide behavior in terms of physical topologies), from global programs to local programs (which specify behavior in terms of single-switch behavior), and from local programs to hardware-level forwarding tables. We present results from experiments on real-world benchmarks that quantify performance in terms of compilation time and forwarding table size
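
    To make the "local program" level concrete: a NetKAT policy denotes a function from a packet to a set of packets. The toy interpreter below covers only the star-free core (tests, modifications, union, sequencing) and is purely illustrative; it has nothing to do with the compiler's BDD-based intermediate representation, and the field names are hypothetical.

        # A policy maps a packet (a dict of header fields) to a list of packets.
        def test(field, value):
            return lambda pkt: [pkt] if pkt.get(field) == value else []

        def modify(field, value):
            return lambda pkt: [dict(pkt, **{field: value})]

        def union(p, q):
            return lambda pkt: p(pkt) + q(pkt)

        def seq(p, q):
            return lambda pkt: [out for mid in p(pkt) for out in q(mid)]

        # Forward by destination: 10.0.0.1 -> port 1, 10.0.0.2 -> port 2.
        policy = union(seq(test("dst", "10.0.0.1"), modify("port", 1)),
                       seq(test("dst", "10.0.0.2"), modify("port", 2)))
        print(policy({"dst": "10.0.0.1", "port": 0}))   # [{'dst': '10.0.0.1', 'port': 1}]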

    An Intelligent Text Extraction and Navigation System

    We present sppc, a high-performance system for intelligent text extraction and navigation in German free-text documents. The main purpose of sppc is to extract as much linguistic structure as possible for performing domain-specific processing. sppc consists of a set of domain-independent shallow core components that are realized by means of cascaded weighted finite-state machines and generic dynamic tries. All extracted information is represented uniformly in one data structure (called the text chart) in a highly compact and linked form in order to support indexing and navigation through the set of solutions.
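
    As an illustration of one ingredient mentioned above, a dynamic trie supporting longest-match lookup can be sketched as follows; this is a generic example, not sppc's actual component, and the method names are assumptions.

        # A dynamic trie: entries can be inserted at any time, and lookup returns
        # the longest entry matching the text from a given position.
        class Trie:
            def __init__(self):
                self.children = {}
                self.payload = None          # e.g. linguistic annotations for the entry

            def insert(self, token, payload):
                node = self
                for ch in token:
                    node = node.children.setdefault(ch, Trie())
                node.payload = payload

            def longest_match(self, text, start=0):
                """Return (end, payload) of the longest entry matching text[start:]."""
                node, best = self, None
                for i in range(start, len(text)):
                    node = node.children.get(text[i])
                    if node is None:
                        break
                    if node.payload is not None:
                        best = (i + 1, node.payload)
                return best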