Learning Automata and Transducers: A Categorical Approach
In this paper, we present a categorical approach to learning automata over words, in the sense of Angluin's L* algorithm. This yields a new generic L*-like algorithm which can be instantiated for learning deterministic automata, automata weighted over fields, as well as subsequential transducers. The generic nature of our algorithm is obtained by adopting an approach in which automata are simply functors from a particular category representing words to a "computation category". We establish that the sufficient properties guaranteeing the existence of minimal automata (disclosed in a previous paper), in combination with some additional hypotheses relative to termination, ensure the correctness of our generic algorithm.
Minimal Synthesis of String To String Functions From Examples
We study the problem of synthesizing string-to-string transformations from a set of input/output examples. The transformations we consider are expressed using deterministic finite automata (DFA) that read pairs of letters, one letter from the input and one from the output. The DFA corresponding to these transformations have additional constraints, ensuring that each input string is mapped to exactly one output string.
We suggest that, given a set of input/output examples, the smallest DFA consistent with the examples is a good candidate for the transformation the user was expecting. We therefore study the problem of, given a set of examples, finding a minimal DFA consistent with the examples and satisfying the functionality and totality constraints mentioned above.
We prove that, in general, this problem (the corresponding decision problem) is NP-complete. This is unlike the standard DFA minimization problem, which can be solved in polynomial time. We provide several NP-hardness proofs that show the hardness of multiple (independent) variants of the problem.
Finally, we propose an algorithm for finding the minimal DFA consistent with input/output examples that uses a reduction to SMT solvers. We implemented the algorithm and used it to evaluate the likelihood that the minimal DFA indeed corresponds to the DFA expected by the user.
Comment: SYNT 201
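The pair-letter encoding described above can be illustrated with a minimal sketch (hypothetical names, not the paper's implementation): a DFA whose alphabet consists of (input, output) letter pairs, run over aligned examples.

```python
# Sketch (hypothetical encoding): a DFA over (input, output) letter pairs.
# This one realizes the function swapping 'a' and 'b' position-wise.
TRANSITIONS = {
    ("q0", ("a", "b")): "q0",
    ("q0", ("b", "a")): "q0",
}
START, ACCEPTING = "q0", {"q0"}

def accepts_pair(inp: str, out: str) -> bool:
    """Check whether the DFA accepts the aligned example (inp, out)."""
    if len(inp) != len(out):      # pair-letter DFAs read equal-length strings
        return False
    state = START
    for pair in zip(inp, out):
        state = TRANSITIONS.get((state, pair))
        if state is None:
            return False
    return state in ACCEPTING

print(accepts_pair("ab", "ba"))   # True
print(accepts_pair("ab", "bb"))   # False
```

Functionality here means that for each input string, exactly one output string yields an accepted pair; the paper's synthesis task is to find the smallest such DFA consistent with the given examples.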
The computational nature of stress assignment
While computational studies of stress patterns as phonotactics have yielded restrictive characterizations of stress (Rogers et al., 2013) with provably correct learning procedures (Heinz, 2009), an outstanding question is the nature of stress assignment as a function which assigns stress to an underlying bare string of syllables. This paper fills this gap by locating stress patterns with respect to the subsequential class of functions (Mohri, 1997), which are argued to be important for phonology in that the vast majority of phonological functions fall within the subsequential boundary (Heinz & Lai, 2013; Chandlee, 2014), with the notable exceptions of tone and vowel harmony (Jardine, 2016; McCollum et al., under review). The main result is that, while most, if not all, quantity-insensitive (QI) stress systems are subsequential functions, the same does not hold for quantity-sensitive (QS) systems. Counter-intuitively, so-called default-to-opposite QS patterns are subsequential, but default-to-same QS patterns are provably not. This also supports the claim of Jardine (2016) that certain tonal patterns are non-sequential because their suprasegmental nature allows for a more powerful computation. As stress assignment is also suprasegmental, the existence of non-sequential stress functions adds evidence for this conclusion.
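The subsequential class discussed above can be illustrated with a toy example (hypothetical encoding, not from the paper): a deterministic transducer that emits an output string on each transition plus a final output, here computing a quantity-insensitive initial-stress pattern.

```python
# Minimal subsequential transducer sketch (hypothetical encoding).
# 's' = unstressed syllable, 'S' = stressed syllable; the pattern assigns
# stress to the initial syllable, a quantity-insensitive (QI) system.
TRANS = {
    ("q0", "s"): ("q1", "S"),   # stress the first syllable
    ("q1", "s"): ("q1", "s"),   # copy the remaining syllables unchanged
}
FINAL_OUT = {"q0": "", "q1": ""}   # subsequential final-output function

def apply(word: str) -> str:
    state, out = "q0", []
    for c in word:
        state, o = TRANS[(state, c)]
        out.append(o)
    return "".join(out) + FINAL_OUT[state]

print(apply("ssss"))  # 'Ssss'
```

Determinism on the input side is what makes the function subsequential; the paper's result is that default-to-same QS patterns provably admit no such deterministic one-pass transducer.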
RedTyp: A Database of Reduplication with Computational Models
Reduplication is a theoretically and typologically well-studied phenomenon, but there is no database of reduplication patterns that includes explicit computational models. This paper introduces RedTyp, an SQL database which provides a computational resource that can be used by both theoretical and computational linguists who work on reduplication. It catalogs 138 reduplicative morphemes across 91 languages, which are modeled with 57 distinct finite-state machines. The finite-state machines are 2-way transducers, which provide an explicit, compact, and convenient representation for reduplication patterns, and which arguably capture the linguistic generalizations more directly than the 1-way transducers more commonly used for modeling natural language morphophonology.
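The 2-way-transducer representation can be sketched as follows (a hypothetical simulation, not RedTyp's actual format): the head copies the input, rewinds to the left boundary marker, and copies it again, capturing total reduplication with a constant number of states, where a 1-way transducer would need to memorize the whole string.

```python
# Sketch (hypothetical): simulating a 2-way transducer for total reduplication.
def total_reduplication(word: str) -> str:
    tape = "#" + word + "#"            # '#' marks the input boundaries
    pos, state, out = 1, "copy1", []
    while state != "done":
        c = tape[pos]
        if state == "copy1":           # first left-to-right copying pass
            if c == "#":
                state, step = "rewind", -1
            else:
                out.append(c); step = 1
        elif state == "rewind":        # move the head back to the left '#'
            if c == "#":
                state, step = "copy2", 1
            else:
                step = -1
        elif state == "copy2":         # second copying pass
            if c == "#":
                state, step = "done", 0
            else:
                out.append(c); step = 1
        pos += step
    return "".join(out)

print(total_reduplication("wiki"))  # 'wikiwiki'
```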
Symbolic Model Learning: New Algorithms and Applications
In this thesis, we study algorithms that can be used to extract, or learn, formal mathematical models from software systems, and then use these models to test whether the given software systems satisfy certain security properties, such as robustness against code injection attacks. Specifically, we focus on learning algorithms for automata and transducers and the symbolic extensions of these models, namely symbolic finite automata (SFAs). At a high level, this thesis contributes the following results:
1. In the first part of the thesis, we present a unified treatment of many common variations of the seminal L* algorithm for learning deterministic finite automata (DFAs) as a congruence-learning algorithm for the underlying Nerode congruence, which forms the basis of automata theory. Under this formulation, the basic data structures used by the different variations are unified as different ways of implementing the Nerode congruence using queries.
2. Next, building on this formulation of L*-style algorithms, we proceed to develop new algorithms for learning transducer models. First, we present the first algorithm for learning deterministic partial transducers. Furthermore, we extend this algorithm to non-deterministic models by introducing a novel, generalized congruence relation over string transformations which is able to capture a subclass of string transformations with regular lookahead. We demonstrate that this class captures many practical string transformations from the domain of string sanitizers in Web applications.
3. Classical learning algorithms for automata and transducers operate over finite alphabets and have a query complexity that scales linearly with the size of the alphabet. In practice, however, this dependence on the alphabet size hinders the performance of the algorithms. To address this issue, we develop the MAT* algorithm for learning symbolic finite automata (SFAs), which operate over infinite alphabets. In practice, the MAT* learning algorithm allows us to plug in custom transition-learning algorithms which efficiently infer the predicates on the transitions of the SFA without querying the whole alphabet.
4. Finally, we use our learning-algorithm toolbox as the basis for the development of a set of black-box testing algorithms. More specifically, we present Grammar Oriented Filter Auditing (GOFA), a novel technique which allows one to utilize our learning algorithms to evaluate the robustness of a string sanitizer or filter against a set of attack strings given as a context-free grammar. Furthermore, because such grammars are often unavailable, we developed sfadiff, a differential testing technique based on symbolic automata learning, which can be used to perform differential testing of two different parser implementations using SFA learning algorithms, and we demonstrate how our algorithm can be used to develop program fingerprints. We evaluate our algorithms against state-of-the-art Web Application Firewalls and discover over 15 previously unknown vulnerabilities which result in evading the firewalls and performing code injection attacks in the backend Web application. Finally, we show how our learning algorithms can uncover vulnerabilities that are missed by other black-box methods such as fuzzing and grammar-based testing.
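The congruence-learning view of L* described in item 1 can be illustrated with a compact observation-table learner (a sketch with hypothetical helper names; the equivalence query is approximated by bounded exhaustive testing rather than a true oracle):

```python
# Compact L* sketch: rows of the observation table approximate the Nerode
# congruence classes of the target language ("even number of 'a's").
from itertools import product

ALPHABET = "ab"

def member(w: str) -> bool:
    """Membership oracle for the (secret) target language."""
    return w.count("a") % 2 == 0

def row(s, E):
    """Observation-table row of prefix s over the suffix set E."""
    return tuple(member(s + e) for e in E)

def lstar(max_len=6):
    S, E = [""], [""]                  # access prefixes and experiments
    while True:
        # Close the table: every one-letter extension must match a row in S.
        changed = True
        while changed:
            changed = False
            rows = {row(t, E) for t in S}
            for s, a in product(list(S), ALPHABET):
                if row(s + a, E) not in rows:
                    S.append(s + a)
                    rows.add(row(s + a, E))
                    changed = True
        reps = {row(s, E): s for s in S}   # one representative per row/state
        def accepts(w, reps=reps, E=list(E)):
            s = ""
            for c in w:
                s = reps[row(s + c, E)]
            return member(s)
        # Approximate equivalence query: test all words up to max_len.
        cex = next(("".join(w) for n in range(max_len + 1)
                    for w in product(ALPHABET, repeat=n)
                    if accepts("".join(w)) != member("".join(w))), None)
        if cex is None:
            return accepts
        E.extend(cex[i:] for i in range(len(cex)))  # add all suffixes of cex

hyp = lstar()
print(hyp("aab"), hyp("ab"))  # True False
```

Distinct rows correspond to distinct Nerode classes, which is exactly the sense in which the thesis recasts L* variants as different query-based implementations of the same congruence.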
Méthodes d'Inférence Symbolique pour les Bases de Données (Symbolic Inference Methods for Databases)
This dissertation is a summary of a line of research, in which I was actively involved, on learning in databases from examples. This research focused on traditional as well as novel database models and languages for querying, transforming, and describing the schema of a database. In the case of schemas, our contributions involve proposing original languages for the emerging data models of Unordered XML and RDF. We have studied learning from examples of schemas for Unordered XML, schemas for RDF, twig queries for XML, join queries for relational databases, and XML transformations defined with a novel model of tree-to-word transducers. Investigating learnability of the proposed languages required us to examine closely a number of their fundamental properties, often of independent interest, including normal forms, minimization, containment and equivalence, consistency of a set of examples, and finite characterizability. A good understanding of these properties allowed us to devise learning algorithms that explore a possibly large search space with the help of a diligently designed set of generalization operations in search of an appropriate solution. Learning (or inference) is a problem that has two parameters: the precise class of languages we wish to infer and the type of input that the user can provide. We focused on the setting where the user input consists of positive examples, i.e., elements that belong to the goal language, and negative examples, i.e., elements that do not belong to the goal language. In general, using both negative and positive examples allows learning richer classes of goal languages than using positive examples alone.
However, using negative examples is often difficult because, together with positive examples, they may cause the search space to take a very complex shape, and its exploration may turn out to be computationally challenging.
Learning automata with help.
We analyze and propose several enhanced versions of existing learning algorithms for deterministic finite automata. Our improvements belong to the category of helpful learning, aiming to speed up, or to influence the quality of, the learning process. A considerable part of our work is based on a practical approach; for each algorithm we discuss, there are comparative results, obtained after implementing and testing the learning processes on sets of randomly generated automata. After extensive experiments, we present graphs and numerical data with the comparative results that we obtained.
We study algorithms belonging to two different learning models: active and passive. An increased number of output symbols allows for a reduced number of queries; some partial guidance along the learning path gives the results of several queries as a single one; enhancing the learning structure permits a better exploration of the learning environment.
In the active learning framework, a modified query learning algorithm benefiting from a nontrivial helpful labeling is able to learn automata without counterexamples.
We revisit correction queries, defining them as particular types of labeling, and introduce minimal, maximal, and random corrections.
A classic algorithm learns typical automata from random walks in the online passive learning framework. For the original algorithm, in some cases we cannot estimate the number of trials needed to completely learn a target automaton. By adding inverse transitions to the underlying graph of the target automaton, the random walk acts as a random walk on an undirected graph. The advantage is that, for such graphs, there exists a polynomial upper bound on the cover time. The new algorithm is still efficient, with polynomial upper bounds on the number of default mistakes and the number of trials.
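The inverse-transition construction can be sketched as follows (a hypothetical representation of the automaton's underlying graph, not the thesis's implementation): symmetrizing the transition graph turns the walk into a random walk on an undirected graph, for which the expected cover time is polynomial.

```python
# Sketch (hypothetical encoding): add inverse transitions to a DFA's
# underlying graph, then measure how long a random walk takes to cover it.
import random

def undirected_version(delta):
    """delta maps (state, letter) -> state; returns undirected adjacency."""
    adj = {}
    for (p, _), q in delta.items():
        adj.setdefault(p, set()).add(q)
        adj.setdefault(q, set()).add(p)   # the added inverse transition
    return adj

def cover_steps(adj, start, rng=random.Random(0)):
    """Steps until a random walk has visited every state at least once."""
    seen, cur, steps = {start}, start, 0
    while len(seen) < len(adj):
        cur = rng.choice(sorted(adj[cur]))
        seen.add(cur)
        steps += 1
    return steps

delta = {(0, "a"): 1, (1, "a"): 2, (2, "a"): 0}   # a directed 3-cycle
adj = undirected_version(delta)
print(cover_steps(adj, 0))
```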
Regular Methods for Operator Precedence Languages
The operator precedence languages (OPLs) represent the largest known subclass of the context-free languages which enjoys all desirable closure and decidability properties. This includes the decidability of language inclusion, which is the ultimate verification problem. Operator precedence grammars, automata, and logics have been investigated and used, for example, to verify programs with arithmetic expressions and exceptions (both of which are deterministic pushdown but lie outside the scope of the visibly pushdown languages). In this paper, we complete the picture and give, for the first time, an algebraic characterization of the class of OPLs in the form of a syntactic congruence that has finitely many equivalence classes exactly for the operator precedence languages. This is a generalization of the celebrated Myhill-Nerode theorem for the regular languages to OPLs. As one of the consequences, we show that universality and language inclusion for nondeterministic operator precedence automata can be solved by an antichain algorithm. Antichain algorithms avoid determinization and complementation through an explicit subset construction by leveraging a quasi-order on words, which allows the pruning of the search space for counterexample words without sacrificing completeness. Antichain algorithms can be implemented symbolically, and these implementations are today the best-performing algorithms in practice for the inclusion of finite automata. We give a generic construction of the quasi-order needed for antichain algorithms from a finite syntactic congruence. This yields the first antichain algorithm for OPLs, an algorithm that solves the ExpTime-hard language inclusion problem for OPLs in exponential time.
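The antichain idea can be illustrated on ordinary nondeterministic finite automata (a simplified sketch; the paper lifts the construction to operator precedence automata via the syntactic congruence): inclusion L(A) ⊆ L(B) is checked on pairs of an A-state and a set of B-states, and any pair subsumed by an already-explored pair with a smaller B-set is pruned, avoiding full determinization of B.

```python
# Antichain-based inclusion check for NFAs (hypothetical encoding).
from collections import deque

def nfa_included(A, B):
    """A, B = (states, alphabet, delta, init, final); delta[(q, a)] is a set.
    Returns True iff L(A) is a subset of L(B)."""
    _, sigma, dA, iA, fA = A
    _, _, dB, iB, fB = B
    frontier = deque((q, frozenset(iB)) for q in iA)
    antichain = []                     # minimal (q, S) pairs explored so far
    while frontier:
        q, S = frontier.popleft()
        if q in fA and not (S & fB):   # word accepted by A, rejected by B
            return False
        if any(q == p and T <= S for p, T in antichain):
            continue                   # subsumed: a smaller B-set suffices
        antichain = [(p, T) for p, T in antichain if not (p == q and S <= T)]
        antichain.append((q, S))
        for a in sigma:
            S2 = frozenset(r for s in S for r in dB.get((s, a), ()))
            for q2 in dA.get((q, a), ()):
                frontier.append((q2, S2))
    return True

A = ({0}, "a", {(0, "a"): {0}}, {0}, {0})                      # L(A) = a*
B = ({0, 1}, "a", {(0, "a"): {1}, (1, "a"): {0}}, {0}, {0})    # even-length a's
print(nfa_included(B, A))  # True
print(nfa_included(A, B))  # False
```

The subsumption test is the quasi-order at work: a counterexample found from (q, S) would also be found from (q, T) whenever T ⊆ S, so only the minimal pairs need to be kept.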