47 research outputs found

    Teaching Large Language Models to Self-Debug

    Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in one go becomes challenging, so prior work has designed program repair approaches to improve code generation performance. In this work, we propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations. In particular, we demonstrate that Self-Debugging can teach the large language model to perform rubber duck debugging; i.e., without any feedback on code correctness or error messages, the model is able to identify its mistakes by explaining the generated code in natural language. Self-Debugging achieves state-of-the-art performance on several code generation benchmarks, including the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark, where there are no unit tests to verify the correctness of predictions, Self-Debugging with code explanation consistently improves the baseline by 2-3% and improves prediction accuracy on problems with the hardest difficulty label by 9%. On TransCoder and MBPP, where unit tests are available, Self-Debugging improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback messages and reusing failed predictions, Self-Debugging notably improves sample efficiency, and can match or outperform baseline models that generate more than 10x as many candidate programs.
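
    The debugging loop described above is straightforward to sketch. The following is a minimal illustration, not the paper's released implementation: it assumes a hypothetical llm callable that maps a prompt string to generated text, and uses Python unit tests as the feedback signal for the case where tests are available.

        import traceback

        def self_debug(llm, task, tests, max_rounds=3):
            # llm is a hypothetical prompt -> text callable (an assumption of
            # this sketch, not an API from the paper); tests is a string of
            # Python assert statements exercising the generated function.
            code = llm("Write a Python function for this task:\n" + task)
            for _ in range(max_rounds):
                env = {}
                try:
                    exec(code, env)   # define the candidate program
                    exec(tests, env)  # unit tests raise on failure
                    return code       # all tests passed
                except Exception:
                    feedback = traceback.format_exc()
                # Rubber duck step: the model explains its own code, then
                # repairs it conditioned on the explanation and the feedback.
                explanation = llm("Explain this code line by line:\n" + code)
                code = llm("Task: " + task + "\nCode:\n" + code +
                           "\nExplanation:\n" + explanation +
                           "\nTest output:\n" + feedback + "\nFix the code.")
            return code

    For the no-unit-test setting (as on Spider), the same loop would drop the test execution and rely on the explanation step alone as the feedback signal.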

    *-CFQ: Analyzing the Scalability of Machine Learning on a Compositional Task

    We present *-CFQ ("star-CFQ"): a suite of large-scale datasets of varying scope based on the CFQ semantic parsing benchmark, designed for principled investigation of the scalability of machine learning systems in a realistic compositional task setting. Using this suite, we conduct a series of experiments investigating the ability of Transformers to benefit from increased training set size under conditions of fixed computational cost. We show that compositional generalization remains a challenge at all training sizes, and that increasing the scope of natural language leads to consistently higher error rates, which are only partially offset by increased training data. We further show that while additional training data from a related domain improves accuracy in data-starved situations, this improvement is limited and diminishes as the distance from the related domain to the target domain increases.
    Comment: Accepted, AAAI-2

    Large Language Models Can Be Easily Distracted by Irrelevant Context

    Large language models have achieved impressive performance on various natural language processing tasks. However, so far they have been evaluated primarily on benchmarks where all information in the input context is relevant for solving the task. In this work, we investigate the distractibility of large language models, i.e., how a model's problem-solving accuracy is affected by irrelevant context. In particular, we introduce Grade-School Math with Irrelevant Context (GSM-IC), an arithmetic reasoning dataset with irrelevant information in the problem description. We use this benchmark to measure the distractibility of cutting-edge prompting techniques for large language models, and find that model performance drops dramatically when irrelevant information is included. We also identify several approaches for mitigating this deficiency, such as self-consistency decoding and adding an instruction to the prompt that tells the language model to ignore the irrelevant information.
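
    Both mitigations mentioned in the abstract are easy to sketch together. The snippet below is a minimal illustration under two assumptions: generate is a hypothetical sampling call that returns a final answer string, and the ignore-instruction is a paraphrase rather than a quote from the paper.

        from collections import Counter

        # Paraphrase of the kind of instruction the paper studies; the exact
        # wording here is an assumption of this sketch.
        HINT = "Feel free to ignore irrelevant information in the question."

        def self_consistent_answer(generate, question, n_samples=8):
            # generate is a hypothetical sampler: prompt -> final answer string.
            prompt = HINT + "\n\nQ: " + question + "\nA: Let's think step by step."
            answers = [generate(prompt) for _ in range(n_samples)]
            # Self-consistency: majority vote over independently sampled answers.
            return Counter(answers).most_common(1)[0][0]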

    Compositional Semantic Parsing with Large Language Models

    Humans can reason compositionally when presented with new tasks. Previous research shows that appropriate prompting techniques enable large language models (LLMs) to solve artificial compositional generalization tasks such as SCAN. In this work, we identify additional challenges in more realistic semantic parsing tasks with larger vocabularies and refine these prompting techniques to address them. Our best method is based on least-to-most prompting: it decomposes the problem using prompting-based syntactic parsing, then uses this decomposition to select appropriate exemplars and to sequentially generate the semantic parse. This method allows us to set a new state of the art for CFQ while requiring only 1% of the training data used by traditional approaches. Due to the general nature of our approach, we expect similar efforts will lead to new results in other tasks and domains, especially for knowledge-intensive applications.
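
    The two-stage structure of the method lends itself to a short sketch. The version below is a simplification assuming a hypothetical llm callable; in particular, it fixes the exemplars up front, whereas the paper selects them dynamically based on the decomposition.

        def least_to_most(llm, question, exemplars):
            # llm is a hypothetical prompt -> text callable (an assumption).
            # Stage 1: prompting-based decomposition into simpler subquestions.
            decomposition = llm("List the subquestions needed to answer:\n" + question)
            subquestions = [s for s in decomposition.splitlines() if s.strip()]
            # Stage 2: answer the subquestions in order, appending each answer
            # to the context so later steps can build on earlier ones.
            context = exemplars
            answer = ""
            for sq in subquestions:
                answer = llm(context + "\nQ: " + sq + "\nA:")
                context += "\nQ: " + sq + "\nA: " + answer
            return answer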

    Virtualization Support for Dynamic Core Library Update

    Dynamically updating the language runtime and core libraries such as collections and threading is challenging, since the update mechanism uses those same libraries while it modifies them. To tackle this challenge, we present Dynamic Core Library Update (DCU), an extension of Dynamic Software Update (DSU), together with an approach based on a virtualization architecture. Our solution supports the update of core libraries like any other library, avoiding circular dependencies between the updater and the core libraries. Our benchmarks show no evident performance overhead compared with a default execution. Finally, we show that our approach can be applied to a real-life scenario by introducing a critical update into a web application with 20 simulated concurrent users.
    Acknowledgments: We thank the European Smalltalk User Group for their support (www.esug.org).
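
    The underlying idea, routing client calls through one swappable indirection so that no caller holds a direct reference to the old library version, can be illustrated outside Smalltalk. Below is a minimal Python sketch of that indirection; it shows the update-through-a-proxy pattern only, not the paper's virtualization architecture.

        import importlib
        import threading

        class LibraryProxy:
            # Clients call proxy.some_function(...); attribute lookups are
            # delegated to whichever module version is currently installed.
            def __init__(self, module_name):
                self._lock = threading.Lock()
                self._module = importlib.import_module(module_name)

            def __getattr__(self, name):
                return getattr(self._module, name)

            def update(self, new_module_name):
                # Swap the reference atomically: in-flight calls complete
                # against the old version; new calls see the new one.
                with self._lock:
                    self._module = importlib.import_module(new_module_name)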

    Traits: Composing Classes from Behavioral Building Blocks

    Inheritance is well-known and accepted as a fundamental mechanism for reuse in object-oriented languages. Unfortunately, the main variants (single inheritance, multiple inheritance, and mixin inheritance) all suffer from conceptual and practical problems related to software reuse and robustness with respect to changes. In the first part of this thesis, we identify and illustrate these problems. To overcome them, we then present traits, a simple compositional model that extends single inheritance. A trait is essentially a (parameterized) set of methods; it serves as a behavioral building block for classes and is the primitive unit of code reuse. We develop a formal model of traits that establishes how traits can be composed to form other traits or classes, and we describe how we implemented traits in Squeak Smalltalk by bootstrapping a new language kernel. We present an experimental validation in which we apply traits to refactor parts of the Smalltalk kernel and library, and we develop a programming methodology around the usage of traits and the trait browser, the tool we implemented to take full advantage of traits in the Squeak programming environment.
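
    The trait model described above can be approximated in any language with first-class classes. The following Python sketch is an analogue, not the Squeak implementation: it treats a trait as a plain mapping from method names to functions, detects composition conflicts explicitly, and lets class-local definitions take precedence over trait methods, mirroring the model's precedence rule. The example traits at the end are hypothetical.

        class TraitConflict(Exception):
            pass

        def compose(*traits):
            # Flatten several traits into one method set; a name provided by
            # two different traits is a conflict that must be resolved
            # explicitly rather than silently overridden.
            methods = {}
            for trait in traits:
                for name, fn in trait.items():
                    if name in methods and methods[name] is not fn:
                        raise TraitConflict(name)
                    methods[name] = fn
            return methods

        def make_class(name, traits, **locals_):
            # Class-local definitions win over trait methods.
            return type(name, (object,), {**compose(*traits), **locals_})

        # Hypothetical traits: two behavioral building blocks.
        TPrintable = {"show": lambda self: print(self.value)}
        TComparable = {"less": lambda self, other: self.value < other.value}

        Number = make_class("Number", [TPrintable, TComparable],
                            __init__=lambda self, v: setattr(self, "value", v))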

    Object Encapsulation for Dynamically Typed Languages

    A version of this paper has been submitted to the 200