
    FlashProfile: A Framework for Synthesizing Data Profiles

    We address the problem of learning a syntactic profile for a collection of strings, i.e., a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios. Prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words, etc.), and provide no control over the granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, which also allows for interactive refinement by requesting a desired number of clusters. Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only ∼0.7 s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as FlashFill.
    Comment: 28 pages, SPLASH (OOPSLA) 2018
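
    The cluster-then-describe idea in the abstract can be pictured with a toy sketch. The pattern language below (runs of digits, uppercase, lowercase, whitespace, plus literal symbols) and all names are illustrative stand-ins; clustering by exact signature ignores FlashProfile's similarity-based clustering and granularity control, so this is a minimal sketch of the idea, not the PROSE-based algorithm.

        import re
        from collections import defaultdict

        # Toy pattern language: runs of digits/letters/whitespace, or a literal symbol.
        TOKEN = re.compile(
            r"(?P<Digits>\d+)|(?P<Upper>[A-Z]+)|(?P<Lower>[a-z]+)|(?P<Space>\s+)|(?P<Symbol>.)")

        def signature(s):
            """Abstract a string into a sequence of token classes; symbols stay literal."""
            return tuple(m.group() if m.lastgroup == "Symbol" else m.lastgroup
                         for m in TOKEN.finditer(s))

        def profile(strings):
            """Group strings with identical signatures; one pattern describes each cluster."""
            clusters = defaultdict(list)
            for s in strings:
                clusters[" ".join(signature(s))].append(s)
            return dict(clusters)

        print(profile(["2018-01-02", "1999-12-31", "N/A"]))
        # {'Digits - Digits - Digits': ['2018-01-02', '1999-12-31'],
        #  'Upper / Upper': ['N/A']}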

    Inductive logic programming as satisfiability modulo theories

    At the intersection of machine learning, program synthesis and automated reasoning lies the field of Inductive Logic Programming (ILP). The aim of ILP is to automatically learn relational programs from input/output examples, typically through logic-based techniques. Inspired by Karl Popper's falsification perspective on science, this dissertation sets out a new approach to ILP: Learning From Failures (LFF). In science, starting from a huge set of a priori viable hypotheses, we select a hypothesis to test. This hypothesis typically gets falsified due to failing in some specific way. By examining the failure we learn that an entire space of related hypotheses is now ruled out. Having thus reduced our set of viable hypotheses, we subsequently select from just those that remain. LFF applies this methodology to program induction, codifying it as a three-stage loop. The generate stage maintains a formula whose satisfying assignments correspond to the set of viable hypotheses. The test stage takes a satisfying assignment, interprets it as a logic program and tests it against training examples – imperfect fit is considered a failure. The constrain stage turns a failure into constraints to add to the generate stage's formula, thereby eliminating a class of hypotheses which will fail for the same reason. The thesis of this dissertation is threefold. The first claim is that LFF frames the ILP problem as one of Satisfiability Modulo Theories (SMT). With the search for viable hypotheses handed off to a SAT solver, the test stage can be regarded as a theory solver collaborating with the SAT solver, effectively making ILP's notion of background knowledge into a (Horn) background theory. The second claim is that LFF's three-stage loop is an effective basis for falsification-based program induction. Chapter 4 develops the above correspondence into a feature-rich and flexible three-stage ILP system. Experimental evidence is provided for this system going beyond the state of the art in ILP, e.g., by supporting large hypothesis spaces and large domains. The third claim is that the LFF-as-SMT perspective helps apply satisfiability solving techniques to ILP, in particular to reduce hypothesis space exploration. In Chapter 5, we show that SMT-based techniques for explaining conflicts have a natural analog in terms of explaining which parts of a hypothesis are responsible for its failure. In Chapter 6, we incorporate other theory solvers alongside the test stage to reason about the (satisfiability of) over-approximating properties of hypotheses. We show that both of these techniques can significantly reduce the number of iterations of the three-stage loop.
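
    The generate/test/constrain loop reads naturally as code. The sketch below replaces the SAT-backed generate stage with a plain enumerator and uses hypothetical callables test and explain_failure; it is a schematic of LFF's control flow under those assumptions, not the dissertation's system.

        def learn_from_failures(hypotheses, examples, test, explain_failure):
            """Schematic LFF loop; all four parameters are hypothetical stand-ins."""
            constraints = []                          # pruning predicates from past failures
            for h in hypotheses:                      # generate: next candidate hypothesis
                if any(ruled_out(h) for ruled_out in constraints):
                    continue                          # eliminated by an earlier failure
                failure = test(h, examples)           # test: None means every example passes
                if failure is None:
                    return h                          # viable hypothesis found
                constraints.append(explain_failure(h, failure))  # constrain
            return None                               # hypothesis space exhausted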

    Program Synthesis using Natural Language

    Interacting with computers is a ubiquitous activity for millions of people. Repetitive or specialized tasks often require the creation of small, often one-off, programs. End-users struggle with learning and using the myriad of domain-specific languages (DSLs) to effectively accomplish these tasks. We present a general framework for constructing program synthesizers that take natural language (NL) inputs and produce expressions in a target DSL. The framework takes as input a DSL definition and training data consisting of NL/DSL pairs. From these it constructs a synthesizer by learning optimal weights and classifiers (using NLP features) that rank the outputs of a keyword-programming based translation. We applied our framework to three domains: repetitive text editing, an intelligent tutoring system, and flight information queries. On 1200+ English descriptions, the respective synthesizers rank the desired program as the top candidate for 80% of descriptions and within the top 3 for 90%.
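
    The final ranking step can be pictured as a linear model over NLP features. The sketch below assumes a candidate set already produced by keyword-programming translation; featurize, weights, and every other name here are hypothetical.

        def rank_candidates(description, candidates, featurize, weights):
            """Rank DSL candidates by a learned linear score (hypothetical API)."""
            def score(program):
                feats = featurize(description, program)  # NLP features, e.g. keyword overlap
                return sum(weights.get(name, 0.0) * value for name, value in feats.items())
            return sorted(candidates, key=score, reverse=True)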

    Overfitting in Synthesis: Theory and Practice (Extended Version)

    In syntax-guided synthesis (SyGuS), a synthesizer's goal is to automatically generate a program belonging to a grammar of possible implementations that meets a logical specification. We investigate a common limitation across state-of-the-art SyGuS tools that perform counterexample-guided inductive synthesis (CEGIS). We empirically observe that as the expressiveness of the provided grammar increases, the performance of these tools degrades significantly. We claim that this degradation is not only due to a larger search space, but also due to overfitting. We formally define this phenomenon and prove no-free-lunch theorems for SyGuS, which reveal a fundamental tradeoff between synthesizer performance and grammar expressiveness. A standard approach to mitigate overfitting in machine learning is to run multiple learners with varying expressiveness in parallel. We demonstrate that this insight can immediately benefit existing SyGuS tools. We also propose a novel single-threaded technique called hybrid enumeration that interleaves different grammars and outperforms the winner of the 2018 SyGuS competition (Inv track), solving more problems and achieving a 5× mean speedup.
    Comment: 24 pages (5 pages of appendices), 7 figures, includes proofs of theorems
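
    A toy rendering of the interleaving idea: run enumerators for grammars of increasing expressiveness in round-robin and stop at the first program that meets the specification. The scheduling here is deliberately naive; the paper's hybrid enumeration is more refined, and both names below are illustrative.

        def hybrid_enumerate(enumerators, satisfies_spec):
            """Round-robin over per-grammar enumerators, least expressive first.
            A naive sketch of the interleaving idea, not the paper's scheduler."""
            active = [iter(e) for e in enumerators]
            while active:
                for it in list(active):
                    try:
                        program = next(it)
                    except StopIteration:
                        active.remove(it)              # this grammar is exhausted
                        continue
                    if satisfies_spec(program):
                        return program                 # first verified program wins
            return None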

    Learning a Static Analyzer from Data

    To be practically useful, modern static analyzers must precisely model the effects of both statements in the programming language and frameworks used by the program under analysis. While important, manually addressing these challenges is difficult for at least two reasons: (i) the effects on the overall analysis can be non-trivial, and (ii) as the size and complexity of modern libraries increase, so does the number of cases the analysis must handle. In this paper we present a new, automated approach for creating static analyzers: instead of manually providing the various inference rules of the analyzer, the key idea is to learn these rules from a dataset of programs. Our method consists of two ingredients: (i) a synthesis algorithm capable of learning a candidate analyzer from a given dataset, and (ii) a counter-example guided learning procedure which generates new programs beyond those in the initial dataset, critical for discovering corner cases and ensuring the learned analysis generalizes to unseen programs. We implemented and instantiated our approach to the task of learning JavaScript static analysis rules for a subset of points-to analysis and for allocation sites analysis. These are challenging yet important problems that have received significant research attention. We show that our approach is effective: our system automatically discovered practical and useful inference rules for many cases that are tricky to manually identify and are missed by state-of-the-art, manually tuned analyzers.
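
    The two ingredients compose into a familiar counter-example guided loop. In the sketch below, synthesize, find_counterexample, and the iteration bound are hypothetical stand-ins used only to show the control flow described in the abstract.

        def learn_analyzer(synthesize, find_counterexample, dataset, max_rounds=100):
            """CEGIS-style loop: fit rules, probe for disagreement, grow the dataset."""
            data = list(dataset)
            for _ in range(max_rounds):
                candidate = synthesize(data)             # learn rules from programs so far
                cex = find_counterexample(candidate)     # program where candidate is wrong
                if cex is None:
                    return candidate                     # no disagreement found
                data.append(cex)                         # corner case feeds the next round
            return None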

    Neural Programming by Example

    Programming by Example (PBE) aims to automatically infer a computer program for accomplishing a certain task from sample input and output. In this paper, we propose a deep neural network (DNN) based PBE model called Neural Programming by Example (NPBE), which can learn from input-output strings and induce programs that solve string manipulation problems. Our NPBE model has four neural network based components: a string encoder, an input-output analyzer, a program generator, and a symbol selector. We demonstrate the effectiveness of NPBE by training it end-to-end to solve some common string manipulation problems in spreadsheet systems. The results show that our model can induce string manipulation programs effectively. Our work is one step towards teaching DNNs to generate computer programs.
    Comment: 7 pages, Association for the Advancement of Artificial Intelligence (AAAI)
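
    The four components can be wired together as a toy PyTorch module. The dimensions, one-hot string encoding, and the fusion layer below are guesses for illustration only; the paper's actual architecture differs in detail.

        import torch
        import torch.nn as nn

        class NPBESketch(nn.Module):
            """Toy wiring of NPBE's four components (all dimensions are guesses)."""
            def __init__(self, vocab=128, hidden=64, n_ops=16, n_symbols=32):
                super().__init__()
                self.string_encoder = nn.LSTM(vocab, hidden, batch_first=True)
                self.io_analyzer = nn.Linear(2 * hidden, hidden)      # fuse I/O encodings
                self.program_generator = nn.Linear(hidden, n_ops)     # scores DSL operations
                self.symbol_selector = nn.Linear(hidden, n_symbols)   # scores their arguments

            def forward(self, inp_onehot, out_onehot):  # (batch, seq, vocab) float tensors
                _, (h_in, _) = self.string_encoder(inp_onehot)
                _, (h_out, _) = self.string_encoder(out_onehot)
                fused = torch.relu(self.io_analyzer(
                    torch.cat([h_in[-1], h_out[-1]], dim=-1)))
                return self.program_generator(fused), self.symbol_selector(fused)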