Teaching Large Language Models to Self-Debug
Large language models (LLMs) have achieved impressive performance on code
generation. However, for complex programming tasks, generating the correct
solution in one go becomes challenging, so some prior works have designed
program repair approaches to improve code generation performance. In this work,
we propose Self-Debugging, which teaches a large language model to debug its
predicted program via few-shot demonstrations. In particular, we demonstrate
that Self-Debugging can teach the large language model to perform rubber duck
debugging; i.e., without any feedback on the code correctness or error
messages, the model is able to identify its mistakes by explaining the
generated code in natural language. Self-Debugging achieves
state-of-the-art performance on several code generation benchmarks, including
the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python
translation, and MBPP for text-to-Python generation. On the Spider benchmark
where there are no unit tests to verify the correctness of predictions,
Self-Debugging with code explanation consistently improves the baseline by
2-3%, and improves the prediction accuracy on problems of the hardest label by
9%. On TransCoder and MBPP where unit tests are available, Self-Debugging
improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback
messages and reusing failed predictions, Self-Debugging notably improves sample
efficiency, and can match or outperform baseline models that generate more than
10x candidate programs.
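
The feedback loop described above can be pictured with a short sketch. The following is an illustration only, not the paper's released code: generate is a hypothetical callable wrapping an LLM, and the actual few-shot demonstrations and prompt formats used in Self-Debugging are not reproduced here.

# Minimal sketch of a Self-Debugging-style loop (illustrative only).
# Assumptions: `generate` is a hypothetical LLM call; the paper's real
# prompts and few-shot demonstrations are omitted.
from typing import Callable, List, Tuple

def run_tests(code: str, unit_tests: List[Tuple[str, object]]) -> str:
    """Execute the candidate program and report the first failure, if any."""
    env: dict = {}
    try:
        exec(code, env)  # sketch only; assumes trusted or sandboxed code
        for expr, expected in unit_tests:
            result = eval(expr, env)
            if result != expected:
                return f"{expr} returned {result!r}, expected {expected!r}"
    except Exception as exc:
        return f"execution error: {exc}"
    return "all tests passed"

def self_debug(problem: str,
               generate: Callable[[str], str],
               unit_tests: List[Tuple[str, object]],
               max_turns: int = 3) -> str:
    """Generate a program, then iteratively explain and repair it."""
    code = generate(f"Write a Python function for:\n{problem}")
    for _ in range(max_turns):
        feedback = run_tests(code, unit_tests)
        if feedback == "all tests passed":
            break
        # Rubber-duck step: the model explains its own code; the explanation
        # and any test feedback are fed back to produce a revised program.
        explanation = generate(f"Explain this code line by line:\n{code}")
        code = generate(
            f"Problem:\n{problem}\n\nCurrent code:\n{code}\n\n"
            f"Explanation:\n{explanation}\n\nFeedback:\n{feedback}\n\n"
            "Rewrite the code so it solves the problem."
        )
    return code
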
*-CFQ: Analyzing the Scalability of Machine Learning on a Compositional Task
We present *-CFQ ("star-CFQ"): a suite of large-scale datasets of varying
scope based on the CFQ semantic parsing benchmark, designed for principled
investigation of the scalability of machine learning systems in a realistic
compositional task setting. Using this suite, we conduct a series of
experiments investigating the ability of Transformers to benefit from increased
training size under conditions of fixed computational cost. We show that
compositional generalization remains a challenge at all training sizes, and we
show that increasing the scope of natural language leads to consistently higher
error rates, which are only partially offset by increased training data. We
further show that while additional training data from a related domain improves
the accuracy in data-starved situations, this improvement is limited and
diminishes as the distance from the related domain to the target domain
increases.
Large Language Models Can Be Easily Distracted by Irrelevant Context
Large language models have achieved impressive performance on various natural
language processing tasks. However, so far they have been evaluated primarily
on benchmarks where all information in the input context is relevant for
solving the task. In this work, we investigate the distractibility of large
language models, i.e., how their problem-solving accuracy can be influenced
by irrelevant context. In particular, we introduce Grade-School Math with
Irrelevant Context (GSM-IC), an arithmetic reasoning dataset with irrelevant
information in the problem description. We use this benchmark to measure the
distractibility of cutting-edge prompting techniques for large language models,
and find that model performance drops dramatically when irrelevant
information is included. We also identify several approaches for mitigating
this deficiency, such as decoding with self-consistency and adding an
instruction to the prompt that tells the language model to ignore the
irrelevant information.
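
As a rough illustration of those two mitigations (not the paper's exact prompt wording or code), the sketch below combines an "ignore irrelevant information" instruction with self-consistency, i.e., majority voting over several sampled answers; sample_answer is a hypothetical stand-in for one sampled LLM completion.

# Illustrative sketch of the two mitigations mentioned above; the actual
# instruction wording and decoding setup in the paper may differ.
from collections import Counter
from typing import Callable

IGNORE_INSTRUCTION = ("Solve the grade-school math problem. Ignore any "
                      "information that is irrelevant to the question.")

def answer_with_self_consistency(question: str,
                                 sample_answer: Callable[[str], str],
                                 n_samples: int = 20) -> str:
    """Sample several reasoning paths and return the majority final answer."""
    prompt = f"{IGNORE_INSTRUCTION}\n\nQ: {question}\nA:"
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
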
Compositional Semantic Parsing with Large Language Models
Humans can reason compositionally when presented with new tasks. Previous
research shows that appropriate prompting techniques enable large language
models (LLMs) to solve artificial compositional generalization tasks such as
SCAN. In this work, we identify additional challenges in more realistic
semantic parsing tasks with larger vocabulary and refine these prompting
techniques to address them. Our best method is based on least-to-most
prompting: it decomposes the problem using prompting-based syntactic parsing,
then uses this decomposition to select appropriate exemplars and to
sequentially generate the semantic parse. This method allows us to set a new
state of the art for CFQ while requiring only 1% of the training data used by
traditional approaches. Due to the general nature of our approach, we expect
similar efforts will lead to new results in other tasks and domains, especially
for knowledge-intensive applications.
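
A rough outline of such a least-to-most pipeline is sketched below. It is not the authors' implementation: generate is a hypothetical LLM callable, the exemplar pool is a placeholder, and exemplar selection is reduced to simple token overlap rather than the paper's decomposition-based matching.

# Sketch of a least-to-most prompting pipeline for semantic parsing
# (illustrative; decomposition, prompts, and exemplar selection are simplified).
from typing import Callable, Dict, List

def least_to_most_parse(question: str,
                        generate: Callable[[str], str],
                        exemplar_pool: List[Dict[str, str]]) -> str:
    # 1. Decomposition: prompt the model to split the question into simpler
    #    subquestions (the paper uses prompting-based syntactic parsing).
    decomposition = generate(f"Decompose into simpler subquestions:\n{question}")
    subquestions = [line.strip() for line in decomposition.splitlines() if line.strip()]

    # 2. Exemplar selection: pick demonstrations that best match the
    #    subquestions (crude token overlap here, as a placeholder).
    def overlap(exemplar: Dict[str, str]) -> int:
        return sum(token in exemplar["question"]
                   for sq in subquestions for token in sq.split())
    exemplars = sorted(exemplar_pool, key=overlap, reverse=True)[:4]
    demos = "\n\n".join(f"Q: {ex['question']}\nParse: {ex['parse']}" for ex in exemplars)

    # 3. Sequential generation: parse the subquestions one by one, feeding
    #    earlier question/parse pairs back into the prompt.
    context, parse = demos, ""
    for sq in subquestions:
        parse = generate(f"{context}\n\nQ: {sq}\nParse:")
        context += f"\n\nQ: {sq}\nParse: {parse}"
    return parse
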
Virtualization Support for Dynamic Core Library Update
Dynamically updating the language runtime and core libraries such as collections and threading is challenging, since the update mechanism uses those libraries at the same time that it modifies them. To tackle this challenge, we present Dynamic Core Library Update (DCU), an extension of Dynamic Software Update (DSU), and an approach based on a virtualization architecture. Our solution supports updating core libraries like any other library, avoiding circular dependencies between the updater and the core libraries. Our benchmarks show no evident performance overhead compared with a default execution. Finally, we show that our approach can be applied to a real-life scenario by introducing a critical update into a web application with 20 simulated concurrent users. Acknowledgments: We thank the European Smalltalk User Group for their support (www.esug.org).
Traits: Composing Classes from Behavioral Building Blocks
Inheritance is well-known and accepted as a fundamental mechanism for reuse in object-oriented languages. Unfortunately, the main variants (single inheritance, multiple inheritance, and mixin inheritance) all suffer from conceptual and practical problems related to software reuse and robustness with respect to changes. In the first part of this thesis, we identify and illustrate these problems. To overcome them, we then present traits, a simple compositional model that extends single inheritance. A trait is essentially a (parameterized) set of methods; it serves as a behavioral building block for classes and is the primitive unit of code reuse. We develop a formal model of traits that establishes how traits can be composed to form other traits or classes, and we describe how we implemented traits in Squeak Smalltalk by bootstrapping a new language kernel. We present our experimental validation, in which we apply traits to refactor parts of the Smalltalk kernel and library, and we develop a programming methodology around the use of traits and the trait browser, the tool we implemented to take full advantage of traits in the Squeak programming environment.
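
Traits were designed for, and implemented in, Smalltalk; as a loose analogy only, the Python sketch below treats a trait as a named set of methods, composes traits with conflict detection, and builds a class from the result plus glue code supplying state and a required method. It does not model Smalltalk-specific aspects such as flattening, aliasing, or exclusion.

# Loose Python analogy of the trait model (the actual work targets Squeak
# Smalltalk); a trait here is just a named dictionary of methods.

class Trait:
    """A (parameterized) set of methods: a behavioral building block."""
    def __init__(self, name, methods):
        self.name = name
        self.methods = methods  # selector -> function

    def __add__(self, other):
        # Symmetric composition: the same selector with two different
        # bodies is a conflict and must be resolved by the composer.
        conflicts = {s for s in self.methods
                     if s in other.methods and self.methods[s] is not other.methods[s]}
        if conflicts:
            raise TypeError(f"trait conflict on {sorted(conflicts)}")
        return Trait(f"{self.name}+{other.name}", {**self.methods, **other.methods})

def class_from_trait(name, trait, state=()):
    """Build a class from a composed trait plus local state (glue code)."""
    def __init__(self, **kwargs):
        for slot in state:
            setattr(self, slot, kwargs.get(slot))
    return type(name, (object,), {"__init__": __init__, **trait.methods})

# Usage: compose two building blocks; the class supplies the required 'lt'.
TComparable = Trait("TComparable", {"ge": lambda self, other: not self.lt(other)})
TPrintable = Trait("TPrintable", {"describe": lambda self: f"{type(self).__name__}({self.value})"})
Magnitude = class_from_trait("Magnitude", TComparable + TPrintable, state=("value",))
Magnitude.lt = lambda self, other: self.value < other.value

a, b = Magnitude(value=1), Magnitude(value=2)
print(a.describe(), a.ge(b))  # Magnitude(1) False
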
Object Encapsulation for Dynamically Typed Languages
A version of this paper has been submitted to the 200