43 research outputs found
Stream Processing using Grammars and Regular Expressions
In this dissertation we study regular expression based parsing and the use of
grammatical specifications for the synthesis of fast, streaming
string-processing programs.
In the first part we develop two linear-time algorithms for regular
expression based parsing with Perl-style greedy disambiguation. The first
algorithm operates in two passes in a semi-streaming fashion, using a constant
amount of working memory and an auxiliary tape storage which is written in the
first pass and consumed by the second. The second algorithm is a single-pass
and optimally streaming algorithm which outputs as much of the parse tree as is
semantically possible based on the input prefix read so far, and resorts to
buffering as many symbols as is required to resolve the next choice. Optimality
is obtained by performing a PSPACE-complete pre-analysis on the regular
expression.
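As a small illustration of what Perl-style greedy disambiguation means, the sketch below uses Python's standard `re` module, which follows the same leftmost-alternative-first policy. It only shows which parse is selected; `re` is a backtracking matcher, not one of the linear-time algorithms described in the dissertation, and the pattern and the helper name `greedy_parse` are assumptions made for this example.

```python
import re

# Perl-style (greedy) disambiguation: among all parses of the input, prefer
# the one that takes the leftmost alternative at each choice point, even if
# a later alternative would consume more input at that point.
pattern = re.compile(r"(a|ab)(c|bcd)?")


def greedy_parse(text):
    """Return the sub-matches of the greedy parse of `text`, or None."""
    m = pattern.fullmatch(text)
    return m.groups() if m else None
```

On the input `abcd` the first group is `a` rather than the longer `ab`, because the left alternative already leads to a complete parse; on `abc` the left alternative fails and the parse falls back to `ab`.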
In the second part we present Kleenex, a language for expressing
high-performance streaming string processing programs as regular grammars with
embedded semantic actions, and its compilation to streaming string transducers
with worst-case linear-time performance. Its underlying theory is based on
transducer decomposition into oracle and action machines, and a finite-state
specialization of the streaming parsing algorithm presented in the first part.
In the second part we also develop a new linear-time streaming parsing
algorithm for parsing expression grammars (PEG) which generalizes the regular
grammars of Kleenex. The algorithm is based on a bottom-up tabulation algorithm
reformulated using least fixed points and evaluated using an instance of the
chaotic iteration scheme by Cousot and Cousot
Comparison-Free Polyregular Functions.
This paper introduces a new automata-theoretic class of string-to-string functions with polynomial growth. Several equivalent definitions are provided: a machine model which is a restricted variant of pebble transducers, and a few inductive definitions that close the class of regular functions under certain operations. Our motivation for studying this class comes from another characterization, which we merely mention here but prove elsewhere, based on a λ-calculus with a linear type system. As their name suggests, these comparison-free polyregular functions form a subclass of polyregular functions; we prove that the inclusion is strict. We also show that they are incomparable with HDT0L transductions, closed under usual function composition – but not under a certain “map” combinator – and satisfy a comparison-free version of the pebble minimization theorem. On the broader topic of polynomial growth transductions, we also consider the recently introduced layered streaming string transducers (SSTs), or equivalently k-marble transducers. We prove that a function can be obtained by composing such transducers together if and only if it is polyregular, and that k-layered SSTs (or k-marble transducers) are closed under “map” and equivalent to a corresponding notion of (k + 1)-layered HDT0L systems
Symbolic Model Learning: New Algorithms and Applications
In this thesis, we study algorithms which can be used to extract, or learn, formal mathematical models from software systems, and then use these models to test whether the given software systems satisfy certain security properties such as robustness against code injection attacks. Specifically, we focus on studying learning algorithms for automata and transducers and the symbolic extensions of these models, namely symbolic finite automata (SFAs). At a high level, this thesis contributes the following results:
1. In the first part of the thesis, we present a unified treatment of many common variations of the seminal L* algorithm for learning deterministic finite automata (DFAs) as a congruence learning algorithm for the underlying Nerode congruence which forms the basis of automata theory. Under this formulation the basic data structures used by different variations are unified as different ways to implement the Nerode congruence using queries.
2. Next, building on the new formulation of L*-style algorithms, we proceed to develop new algorithms for learning transducer models. Firstly, we present the first algorithm for learning deterministic partial transducers. Furthermore, we extend our algorithm to non-deterministic models by introducing a novel, generalized congruence relation over string transformations which is able to capture a subclass of string transformations with regular lookahead. We demonstrate that this class is able to capture many practical string transformations from the domain of string sanitizers in Web applications.
3. Classical learning algorithms for automata and transducers operate over finite alphabets and have a query complexity that scales linearly with the size of the alphabet. However, in practice, this dependence on the alphabet size hinders the performance of the algorithms. To address this issue, we develop the MAT* algorithm for learning symbolic finite state automata (SFAs) which operate over infinite alphabets. In practice, the MAT* learning algorithm allows us to plug in custom transition learning algorithms which efficiently infer the predicates in the transitions of the SFA without querying the whole alphabet set.
4. Finally, we use our learning algorithm toolbox as the basis for the development of a set of black-box testing algorithms. More specifically, we present Grammar Oriented Filter Auditing (GOFA), a novel technique which allows one to utilize our learning algorithms to evaluate the robustness of a string sanitizer or filter against a set of attack strings given as a context-free grammar. Furthermore, because such grammars are often unavailable, we develop sfadiff, a differential testing technique based on symbolic automata learning which can be used to perform differential testing of two different parser implementations using SFA learning algorithms, and we demonstrate how our algorithm can be used to develop program fingerprints. We evaluate our algorithms against state-of-the-art Web Application Firewalls and discover over 15 previously unknown vulnerabilities which result in evading the firewalls and performing code injection attacks in the backend Web application. Finally, we show how our learning algorithms can uncover vulnerabilities which are missed by other black-box methods such as fuzzing and grammar-based testing
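To make the notion of a symbolic finite automaton from item 3 concrete, here is a minimal sketch in which each transition carries a predicate over characters instead of a single symbol, so the alphabet never has to be enumerated. The class name `SFA`, its fields, and the example filter `no_angle` are hypothetical and are not taken from the thesis; the sketch assumes a deterministic automaton and simply follows the first transition whose predicate holds.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

# A transition guard is a predicate over single characters.
Predicate = Callable[[str], bool]


@dataclass
class SFA:
    initial: int
    accepting: Set[int]
    # state -> list of (guard, target state)
    transitions: Dict[int, List[Tuple[Predicate, int]]]

    def accepts(self, word: str) -> bool:
        state = self.initial
        for ch in word:
            for guard, target in self.transitions.get(state, []):
                if guard(ch):
                    state = target
                    break
            else:
                # No guard matched this character: reject.
                return False
        return state in self.accepting


# Hypothetical filter: accepts exactly the strings that contain no '<'.
no_angle = SFA(
    initial=0,
    accepting={0},
    transitions={0: [(lambda c: c != "<", 0)]},
)
```

A single guard such as `c != "<"` stands in for what would otherwise be one transition per character of a very large alphabet.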
Automata for branching and layered temporal structures: An investigation into regularities of infinite transition systems
This manuscript is a revised version of the PhD Thesis I wrote under the supervision of Prof. Angelo Montanari at Udine University. The leitmotif underlying the results herein provided is that, given any infinite complex system (e.g., a computer program) to be verified against a finite set of properties, there often exists a simpler system that satisfies the same properties and, in addition, presents strong regularities (e.g., periodicity) in its structure. Those regularities can then be exploited to decide, in an effective way, which property is satisfied by the system and which is not. Perhaps the most natural and effective way to deal with inherent regularities of infinite systems is through the notion of finite-state automaton. Intuitively, a finite-state automaton is an abstract machine with only a bounded amount of memory at its disposal, which processes an input (e.g., a sequence of symbols) and eventually outputs true or false, depending on the way the machine was designed and on the input itself. The present book focuses precisely on automaton-based approaches that ease the representation of and the reasoning on properties of infinite complex systems. The simplest notion of finite-state automaton is that of single-string automaton. Such a device outputs true on a single (finite or infinite) sequence of symbols and false on any other sequence. We will show how single-string automata processing infinite sequences of symbols can be successfully applied in various frameworks for temporal representation and reasoning. In particular, we will use them to model single ultimately periodic time granularities, namely, temporal structures that are left-bounded and that, ultimately, periodically group instants of the underlying temporal domain (a simple example of such a structure is given by the partitioning of the temporal domain of days into weeks).
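The days-into-weeks example can be sketched as an ultimately periodic infinite word, stored finitely as a prefix followed by a repeating period. The encoding below (`w` for the first day of a week, `.` for the other six days) and the names `UltimatelyPeriodicWord`, `symbol_at`, and `week_of` are assumptions chosen for illustration, not the representation used in the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UltimatelyPeriodicWord:
    prefix: str  # finite, possibly empty, initial segment
    period: str  # non-empty segment repeated forever after the prefix

    def symbol_at(self, i: int) -> str:
        """Symbol at position i of the infinite word prefix . period^omega."""
        if i < len(self.prefix):
            return self.prefix[i]
        return self.period[(i - len(self.prefix)) % len(self.period)]


# Days grouped into weeks: 'w' marks the first day of each week.
weeks = UltimatelyPeriodicWord(prefix="", period="w......")


def week_of(day: int) -> int:
    """Index of the week containing `day` (both counted from 0)."""
    starts = sum(1 for i in range(day + 1) if weeks.symbol_at(i) == "w")
    return starts - 1
```

Because the word is determined by its finite prefix and period, questions about the infinite structure reduce to computations on these two finite strings.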
The notion of single-string automaton can be further refined by introducing counters in order to compactly represent repeated occurrences of the same subsequence in the given input. By introducing restricted policies of counter update and by exploiting suitable abstractions of the configuration space for the resulting class of automata, we will devise efficient algorithms for reasoning on quasi-periodic time granularities (e.g., the partitioning of the temporal domain of days into years). Similar abstractions can be used when reasoning on infinite branching (temporal) structures. In such a case, one has to consider a generalized notion of automaton, which is able to process labeled branching structures (hereafter called trees), rather than linear sequences of symbols. We will show that sets of trees featuring the same properties can be identified with the equivalence classes induced by a suitable automaton. More precisely, given a property to be verified, one can first define a corresponding automaton that accepts all and only the trees satisfying that property, then introduce a suitable equivalence relation that refines the standard language equivalence and groups all trees being indistinguishable by the automaton, and, finally, exploit such an equivalence to reduce several instances of the verification problem to equivalent simpler instances, which can be eventually decided
Deciding Linear Height and Linear Size-to-Height Increase for Macro Tree Transducers
In this paper we study Macro Tree Transducers (MTT), specifically the Linear
Height Increase ("LHI") and Linear input Size to output Height ("LSHI")
constraints. In order to decide whether a Macro tree transducer (MTT) is of LHI
or LSHI, we define a notion of depth-properness: an MTT is depth-proper if, for
each state, there is no bound to the depth at which it places its argument
trees. We show how to effectively put an MTT in depth-proper form. For MTTs in
depth-proper form, we characterize the LSHI property as equivalent to the
finite-nesting property, and we characterize the LHI property as equivalent to
the finiteness of a new type of nesting which we call Multi-Leaf-nesting (or
ML-nesting). As opposed to regular nesting where we look at the nesting of
states applied to a single input node, we count the nesting of states applied
to nodes that are not ancestors of each other. We use this characterization to
give a decision procedure for the LSHI and LHI properties. Finally we consider
the decision problem of the LSOI (Linear input Size to number of distinct
Output subtrees Increase) property. A long-standing open problem is whether MTTs
of LSOI are as expressive as Attribute Tree Transducers (ATT). In this paper we
show that deciding whether an MTT is of LSOI is as hard as deciding the
equivalence of ATTs
Programming Using Automata and Transducers
Automata, the simplest model of computation, have proven to be an effective tool in reasoning about programs that operate over strings. Transducers augment automata to produce outputs and have been used to model string and tree transformations such as natural language translations. The success of these models is primarily due to their closure properties and decidable procedures, but good properties come at the price of limited expressiveness. Concretely, most models only support finite alphabets and can only represent small classes of languages and transformations. We focus on addressing these limitations and bridge the gap between the theory of automata and transducers and complex real-world applications: Can we extend automata and transducer models to operate over structured and infinite alphabets? Can we design languages that hide the complexity of these formalisms? Can we define executable models that can process the input efficiently? First, we introduce succinct models of transducers that can operate over large alphabets and design BEX, a language for analysing string coders. We use BEX to prove the correctness of UTF and BASE64 encoders and decoders. Next, we develop a theory of tree transducers over infinite alphabets and design FAST, a language for analysing tree-manipulating programs. We use FAST to detect vulnerabilities in HTML sanitizers, check whether augmented reality taggers conflict, and optimize and analyze functional programs that operate over lists and trees. Finally, we focus on laying the foundations of stream processing of hierarchical data such as XML files and program traces. We introduce two new efficient and executable models that can process the input in a left-to-right linear pass: symbolic visibly pushdown automata and streaming tree transducers. Symbolic visibly pushdown automata are closed under Boolean operations and can specify and efficiently monitor complex properties for hierarchical structures over infinite alphabets. 
Streaming tree transducers can express and efficiently process complex XML transformations while enjoying decidable procedures
Probabilistic Logic, Probabilistic Regular Expressions, and Constraint Temporal Logic
The classic theorems of Büchi and Kleene state the expressive equivalence of finite automata to monadic second order logic and regular expressions, respectively. These fundamental results enjoy applications in nearly every field of theoretical computer science. Around the same time as Büchi and Kleene, Rabin investigated probabilistic finite automata. This equally well established model has applications ranging from natural language processing to probabilistic model checking.
Here, we extend Büchi's theorem and Kleene's theorem to the probabilistic setting. We obtain a probabilistic MSO logic by adding an expected second order quantifier. In the scope of this quantifier, membership is determined by a Bernoulli process. This approach turns out to be universal and is applicable for finite and infinite words as well as for finite trees. In order to prove the expressive equivalence of this probabilistic MSO logic to probabilistic automata, we show a Nivat-theorem, which decomposes a recognisable function into a regular language, homomorphisms, and a probability measure.
For regular expressions, we build upon existing work to obtain probabilistic regular expressions on finite and infinite words. We show the expressive equivalence between these expressions and probabilistic Muller-automata. To handle Muller-acceptance conditions, we give a new construction from probabilistic regular expressions to Muller-automata. Concerning finite trees, we define probabilistic regular tree expressions using a new iteration operator, called infinity-iteration. Again, we show that these expressions are expressively equivalent to probabilistic tree automata.
On a second track of our research we investigate Constraint LTL over multidimensional data words with data values from the infinite tree. Such LTL formulas are evaluated over infinite words, where every position possesses several data values from the infinite tree. Within Constraint LTL one can compare these values from different positions. We show that the model checking problem for this logic is PSPACE-complete via investigating the emptiness problem of Constraint Büchi automata