Incremental construction of minimal acyclic finite-state automata
In this paper, we describe a new method for constructing minimal,
deterministic, acyclic finite-state automata from a set of strings. Traditional
methods consist of two phases: the first to construct a trie, the second one to
minimize it. Our approach is to construct a minimal automaton in a single phase
by adding new strings one by one and minimizing the resulting automaton
on-the-fly. We present a general algorithm as well as a specialization that
relies upon the lexicographical ordering of the input strings. Comment: 14 pages, 7 figures
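The sorted-input specialization mentioned in the abstract can be sketched as follows. This is a minimal illustration of the commonly described register-based approach, not code from the paper; the names `State`, `build_dafsa`, `accepts`, and `count_states` are hypothetical, and the input is assumed to be sorted and free of duplicates.

```python
class State:
    """A DAFSA state: outgoing edges plus a final flag (illustrative names)."""

    def __init__(self):
        self.edges = {}      # label -> State
        self.final = False

    def signature(self):
        # Two states are equivalent when finality and (label, target) pairs agree.
        # Targets are already canonical (registered), so object identity suffices.
        return (self.final,
                tuple(sorted((c, id(t)) for c, t in self.edges.items())))


def build_dafsa(sorted_words):
    """Build a minimal DAFSA from lexicographically sorted, distinct words."""
    root = State()
    register = {}    # signature -> canonical state
    unchecked = []   # (parent, label, child) along the path of the previous word
    previous = None

    def minimize(down_to):
        # Merge states of the previous word's path, deepest first.
        while len(unchecked) > down_to:
            parent, label, child = unchecked.pop()
            sig = child.signature()
            if sig in register:
                parent.edges[label] = register[sig]   # reuse equivalent state
            else:
                register[sig] = child

    for word in sorted_words:
        if previous is not None and word <= previous:
            raise ValueError("input must be sorted and free of duplicates")
        # Length of the common prefix with the previous word.
        common = 0
        if previous is not None:
            limit = min(len(word), len(previous))
            while common < limit and word[common] == previous[common]:
                common += 1
        minimize(common)
        node = unchecked[-1][2] if unchecked else root
        for ch in word[common:]:
            child = State()
            node.edges[ch] = child
            unchecked.append((node, ch, child))
            node = child
        node.final = True
        previous = word

    minimize(0)
    return root


def accepts(root, word):
    """Return True if the automaton accepts the word."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


def count_states(root):
    """Count the states reachable from the root."""
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.edges.values())
    return len(seen)
```

For the words tap, taps, top, tops, a trie needs 8 states, whereas `count_states(build_dafsa(["tap", "taps", "top", "tops"]))` gives 5, because the two "p(s)" suffix paths are shared. Minimization happens while words are added, so no separate trie-then-minimize pass is needed.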
Finite Automata for the Sub- and Superword Closure of CFLs: Descriptional and Computational Complexity
We answer two open questions by (Gruber, Holzer, Kutrib, 2009) on the
state-complexity of representing sub- or superword closures of context-free
grammars (CFGs): (1) We prove a (tight) upper bound of $2^{\mathcal{O}(n)}$ on
the size of nondeterministic finite automata (NFAs) representing the subword
closure of a CFG of size $n$. (2) We present a family of CFGs for which the
minimal deterministic finite automata representing their subword closure
match the upper bound of $2^{2^{\mathcal{O}(n)}}$ following from (1).
Furthermore, we prove that the inequivalence problem for NFAs representing sub-
or superword-closed languages is only NP-complete as opposed to PSPACE-complete
for general NFAs. Finally, we extend our results into an approximation method
to attack inequivalence problems for CFGs.
DAFSA: a Python library for Deterministic Acyclic Finite State Automata [Software]
This work describes dafsa, a Python library for computing graphs from lists of strings for identifying, visualizing, and inspecting patterns of substrings. The library is designed for use by linguists in studies on morphology and formal grammars, and is intended for faster, easier, and simpler generation of visualizations. It collects frequency weights by default, it can condense structures, and it provides several export options. Figure 1 depicts a basic DAFSA, based upon five English words and generated with default settings.
Joining Extractions of Regular Expressions
Regular expressions with capture variables, also known as "regex formulas,"
extract relations of spans (interval positions) from text. These relations can
be further manipulated via Relational Algebra as studied in the context of
document spanners, Fagin et al.'s formal framework for information extraction.
We investigate the complexity of querying text by Conjunctive Queries (CQs) and
Unions of CQs (UCQs) on top of regex formulas. We show that the lower bounds
(NP-completeness and W[1]-hardness) from the relational world also hold in our
setting; in particular, hardness already arises for single-character text! Yet, the
upper bounds from the relational world do not carry over. Unlike the relational
world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source
of hardness is that it may be intractable to instantiate the relation defined
by a regex formula, simply because it has an exponential number of tuples. Yet,
we are able to establish general upper bounds. In particular, UCQs can be
evaluated with polynomial delay, provided that every CQ has a bounded number of
atoms (while unions and projection can be arbitrary). Furthermore, UCQ
evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the
parameter is the size of the UCQ.
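The remark about relation size can be made concrete with a small counting sketch. It assumes spans are written as 0-based half-open pairs `(i, j)` with `0 <= i <= j <= len(text)`, which is a convention chosen here for illustration; the function names are hypothetical. It only bounds how many tuples a relation over k capture variables could contain and does not evaluate any regex formula.

```python
from itertools import product


def all_spans(text):
    """Enumerate every span (i, j) of the text, including empty spans."""
    n = len(text)
    return [(i, j) for i in range(n + 1) for j in range(i, n + 1)]


def max_relation_size(text_length, num_variables):
    """Upper bound on the number of span tuples over the given variables."""
    spans = (text_length + 1) * (text_length + 2) // 2
    return spans ** num_variables


# A single-character text already has three spans.
spans = all_spans("a")

# With three variables there are 3**3 candidate tuples.
candidates = list(product(spans, repeat=3))
```

Even for a text of length one, the bound is 3 to the power k for k variables, so the candidate space grows exponentially with the number of variables. This matches the abstract's point that materializing the relation before joining can be the bottleneck.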
A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs
This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low-cost and time-efficient, in addition to being scalable and extensible. The latter is reflected in the method's ability to be incremental in both respects: processing resources and generating lexicons. The method works from a corpus in three steps. First, tokens are drawn from the corpus and lemmatized. Second, finite state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One of the algorithm's strengths is its ability to generate transducers having 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experimentation uses a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The coverage of the resulting open lexical resources is high: they cover more than 70% of Arabic verbs. The built resources contain 16,855 verb lemmas and 11,080,355 fully, partially, and non-vocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.
CAIR: Using Formal Languages to Study Routing, Leaking, and Interception in BGP
The Internet routing protocol BGP expresses topological reachability and
policy-based decisions simultaneously in path vectors. A complete view on the
Internet backbone routing is given by the collection of all valid routes, which
is infeasible to obtain due to information hiding of BGP, the lack of
omnipresent collection points, and data complexity. Commonly, graph-based data
models are used to represent the Internet topology from a given set of BGP
routing tables but fall short of explaining policy contexts. As a consequence,
routing anomalies such as route leaks and interception attacks cannot be
explained with graphs.
In this paper, we use formal languages to represent the global routing system
in a rigorous model. Our CAIR framework translates BGP announcements into a
finite route language that allows for the incremental construction of minimal
route automata. CAIR preserves route diversity, is highly efficient, and
well-suited to monitor BGP path changes in real-time. We formally derive
implementable search patterns for route leaks and interception attacks. In
contrast to the state-of-the-art, we can detect these incidents. In practical
experiments, we analyze public BGP data over the last seven years.