    A Theory of Formal Synthesis via Inductive Learning

    Formal synthesis is the process of generating a program satisfying a high-level formal specification. In recent times, effective formal synthesis methods have been proposed based on the use of inductive learning. We refer to this class of methods that learn programs from examples as formal inductive synthesis. In this paper, we present a theoretical framework for formal inductive synthesis. We discuss how formal inductive synthesis differs from traditional machine learning. We then describe oracle-guided inductive synthesis (OGIS), a framework that captures a family of synthesizers that operate by iteratively querying an oracle. An instance of OGIS that has had much practical impact is counterexample-guided inductive synthesis (CEGIS). We present a theoretical characterization of CEGIS for learning any program that computes a recursive language. In particular, we analyze the relative power of CEGIS variants in which the types of counterexamples generated by the oracle vary. We also consider the impact of bounded versus unbounded memory available to the learning algorithm. In the special case where the universe of candidate programs is finite, we relate the speed of convergence to the notion of teaching dimension studied in machine learning theory. Altogether, the results of the paper take a first step towards a theoretical foundation for the emerging field of formal inductive synthesis.

    Classifying the Arithmetical Complexity of Teaching Models

    This paper classifies the complexity of various teaching models by their position in the arithmetical hierarchy. In particular, we determine the arithmetical complexity of the index sets of the following classes: (1) the class of uniformly r.e. families with finite teaching dimension, and (2) the class of uniformly r.e. families with finite positive recursive teaching dimension witnessed by a uniformly r.e. teaching sequence. We also derive the arithmetical complexity of several other decision problems in teaching, such as the problem of deciding, given an effective coding $\{\mathcal L_0,\mathcal L_1,\mathcal L_2,\ldots\}$ of all uniformly r.e. families, any $e$ such that $\mathcal L_e = \{L^e_0,L^e_1,\ldots\}$, any $i$ and $d$, whether or not the teaching dimension of $L^e_i$ with respect to $\mathcal L_e$ is upper bounded by $d$. Comment: 15 pages in International Conference on Algorithmic Learning Theory, 201
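
    As a rough illustration of the quantity in that decision problem, the sketch below computes the teaching dimension of a concept in a finite family by brute force. It is only a finite analogue: the abstract concerns uniformly r.e. families, where no such exhaustive search is available in general. The names `teaching_dimension`, `family`, and `domain` are assumptions made for this example and do not come from the paper.

```python
from itertools import combinations

def teaching_dimension(concept, family, domain):
    """Smallest number of labeled points that separates `concept`
    from every other set in `family` (finite setting only)."""
    others = [c for c in family if c != concept]
    for k in range(len(domain) + 1):
        for sample in combinations(domain, k):
            # The sample teaches `concept` if each other concept
            # labels at least one sampled point differently.
            if all(any((x in concept) != (x in other) for x in sample)
                   for other in others):
                return k
    return None  # some other concept agrees with `concept` on the whole domain

# Hypothetical toy family: the empty set and the three singletons over {0, 1, 2}.
domain = [0, 1, 2]
family = [frozenset(), frozenset({0}), frozenset({1}), frozenset({2})]
td_empty = teaching_dimension(frozenset(), family, domain)
td_single = teaching_dimension(frozenset({0}), family, domain)
```

    In this toy family a singleton is taught by its one positive point, while the empty set needs all three negative points. Checking whether the teaching dimension is upper bounded by $d$ then reduces to comparing the returned value with $d$.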

    Identification of biRFSA languages

    The task of identifying a language from a set of its words is not an easy one. For instance, it is not feasible to identify regular languages in the general case. Therefore, looking for subclasses of regular languages that can be identified in this framework is an interesting problem. One of the most classical identifiable classes is the class of reversible languages, introduced by D. Angluin, also called bideterministic languages as they can be represented by deterministic automata (DFA) whose reverse is also deterministic. Residual Finite State Automata (RFSA), on the other hand, are a class of nondeterministic automata that share some properties with DFA. In particular, DFA are RFSA, and RFSA can be much smaller. We study here the learnability of the class of languages that can be represented by biRFSA: RFSA whose reverses are also RFSA. We prove that this class is not identifiable in general, but we present two subclasses that are learnable, the second one being identifiable in polynomial time.
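
    The bideterministic condition mentioned above (a deterministic automaton whose reverse is also deterministic) can be checked directly on a transition set. The following is a minimal sketch under an assumed encoding: transitions are `(source, symbol, target)` triples, and the function names are hypothetical. It does not cover the RFSA or biRFSA conditions, which need residual languages.

```python
def is_deterministic(initial, transitions):
    """At most one initial state and at most one target per (state, symbol)."""
    if len(initial) > 1:
        return False
    seen = set()
    for (p, a, _q) in set(transitions):
        if (p, a) in seen:
            return False
        seen.add((p, a))
    return True

def reverse(initial, final, transitions):
    """Swap initial and final states and flip every transition."""
    return final, initial, {(q, a, p) for (p, a, q) in transitions}

def is_bideterministic(initial, final, transitions):
    r_init, _r_final, r_trans = reverse(initial, final, transitions)
    return (is_deterministic(initial, transitions)
            and is_deterministic(r_init, r_trans))

# a*b: deterministic in both directions.
star_b = is_bideterministic({0}, {1}, {(0, "a", 0), (0, "b", 1)})
# a+: state 1 has two incoming a-transitions, so the reverse is nondeterministic.
a_plus = is_bideterministic({0}, {1}, {(0, "a", 1), (1, "a", 1)})
```

    The automaton for a*b passes the check, while the one for a+ fails because its reverse has two a-transitions leaving state 1.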

    Learning Residual Finite-State Automata Using Observation Tables

    We define a two-step learner for RFSAs based on an observation table: we use an algorithm for minimal DFAs to build a table for the reversal of the language in question and show that the minimal RFSA can be derived from it after some simple modifications. We compare the algorithm to two other table-based ones, of which one (by Bollig et al. 2009) infers an RFSA directly, and the other is another two-step learner proposed by the author. We focus on the criterion of query complexity. Comment: In Proceedings DCFS 2010, arXiv:1008.127

    Are There Good Mistakes? A Theoretical Analysis of CEGIS

    Counterexample-guided inductive synthesis (CEGIS) is used to synthesize programs from a candidate space of programs. The technique is guaranteed to terminate and synthesize the correct program if the space of candidate programs is finite, but it may or may not terminate with the correct program if the candidate space is infinite. In this paper, we perform a theoretical analysis of the counterexample-guided inductive synthesis technique. We investigate whether the set of candidate spaces for which the correct program can be synthesized using CEGIS depends on the counterexamples used in inductive synthesis, that is, whether there are good mistakes which would increase the synthesis power. We investigate whether the use of minimal counterexamples instead of arbitrary counterexamples expands the set of candidate spaces of programs for which inductive synthesis can successfully synthesize a correct program. We consider two kinds of counterexamples: minimal counterexamples and history bounded counterexamples. The history bounded counterexample used in any iteration of CEGIS is bounded by the examples used in previous iterations of inductive synthesis. We examine the relative change in power of inductive synthesis in both cases. We show that the synthesis technique using minimal counterexamples (MinCEGIS) has the same synthesis power as CEGIS, whereas the synthesis technique using history bounded counterexamples (HCEGIS) has different power from CEGIS, with neither dominating the other. Comment: In Proceedings SYNT 2014, arXiv:1407.493
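
    The CEGIS loop for a finite candidate space can be sketched as follows. This is an illustrative toy under stated assumptions, not the formalization analyzed in the paper: candidates are Python functions on integers, the specification is a reference function `spec`, and the verifier searches a finite `domain` in increasing order, so it happens to return the smallest counterexample (in the spirit of MinCEGIS). All names are hypothetical.

```python
def cegis(candidates, spec, domain):
    """Return (index of a correct candidate or None, examples collected)."""
    examples = []  # (input, expected output) pairs from the verifier
    while True:
        # Learner: first candidate consistent with every example so far.
        idx = next((i for i, c in enumerate(candidates)
                    if all(c(x) == y for x, y in examples)), None)
        if idx is None:
            return None, examples  # no candidate fits the examples
        cand = candidates[idx]
        # Verifier oracle: look for an input where the candidate is wrong.
        cex = next((x for x in domain if cand(x) != spec(x)), None)
        if cex is None:
            return idx, examples   # candidate agrees with spec on the domain
        # The new example rules out `cand`, so each round removes at least
        # one candidate and the loop ends after at most len(candidates) rounds.
        examples.append((cex, spec(cex)))

# Hypothetical candidate space and specification (doubling).
candidates = [lambda x: x, lambda x: x + 1, lambda x: 2 * x, lambda x: x * x]
idx, examples = cegis(candidates, lambda x: 2 * x, range(5))
```

    In this run the learner proposes the identity, then `x + 1`, and finally `2 * x`, after receiving counterexamples at inputs 1 and 0. The termination argument in the comment relies on the candidate list being finite; with an infinite space the same loop may never converge, which is the case the abstract studies.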

    A Grammatical Inference Approach to Language-Based Anomaly Detection in XML

    False positives are a problem in anomaly-based intrusion detection systems. To counter this issue, we discuss anomaly detection for the eXtensible Markup Language (XML) from a language-theoretic view. We argue that many XML-based attacks target the syntactic level, i.e. the tree structure or element content, and that syntax validation of XML documents reduces the attack surface. XML offers so-called schemas for validation, but in the real world, schemas are often unavailable, ignored, or too general. In this work-in-progress paper we describe a grammatical inference approach to learn an automaton from example XML documents for detecting documents with anomalous syntax. We discuss properties and expressiveness of XML to understand limits of learnability. Our contributions are an XML Schema compatible lexical datatype system to abstract content in XML and an algorithm to learn visibly pushdown automata (VPA) directly from a set of examples. The proposed algorithm does not require the tree representation of XML, so it can process large documents or streams. The resulting deterministic VPA then allows stream validation of documents to recognize deviations in the underlying tree structure or datatypes. Comment: Paper accepted at First Int. Workshop on Emerging Cyberthreats and Countermeasures ECTCM 201

    Inferring Symbolic Automata

    We study the learnability of symbolic finite state automata (SFAs), a model shown to be useful in many applications in software verification. The state-of-the-art literature on this topic follows the query learning paradigm, and so far all obtained results are positive. We provide a necessary condition for efficient learnability of SFAs in this paradigm, from which we obtain the first negative result. The main focus of our work lies in the learnability of SFAs under the paradigm of identification in the limit using polynomial time and data. We provide a necessary condition and a sufficient condition for efficient learnability of SFAs in this paradigm, from which we derive a positive and a negative result.

    Polynomial Identification of omega-Automata

    We study identification in the limit using polynomial time and data for models of omega-automata. On the negative side we show that non-deterministic omega-automata (of types Buchi, coBuchi, Parity, Rabin, Streett, or Muller) cannot be polynomially learned in the limit. On the positive side we show that the omega-language classes IB, IC, IP, IR, IS, and IM, which are defined by deterministic Buchi, coBuchi, Parity, Rabin, Streett, and Muller acceptors that are isomorphic to their right-congruence automata, are identifiable in the limit using polynomial time and data. We give polynomial time inclusion and equivalence algorithms for deterministic Buchi, coBuchi, Parity, Rabin, Streett, and Muller acceptors, which are used to show that the characteristic samples for IB, IC, IP, IR, IS, and IM can be constructed in polynomial time. We also provide polynomial time algorithms to test whether a given deterministic automaton of type X (for X in {B, C, P, R, S, M}) is in the class IX (i.e. recognizes a language that has a deterministic automaton that is isomorphic to its right-congruence automaton). Comment: This is an extended version of a paper with the same name that appeared in TACAS2
