
    No Zombie Types: Liveness-Based Justification For Monotonic Gradual Types

    Gradual type systems with the monotonic dynamic semantics, such as HiggsCheck implementing SafeTypeScript, are able to achieve decent performance, making them a viable option for JavaScript programmers seeking run-time-checkable type annotations. However, the type restrictions for objects in the monotonic dynamic semantics are, as the name suggests, monotonic: once a typed reference is defined or assigned to refer to an object, the contract carrying the reference's type obligation remains part of the object for the remainder of execution. In some cases, such contracts become "zombies": the reference that justifies a contract is out of scope, yet the object still retains the type obligation. In this thesis, we propose the novel idea of contract liveness and describe its implementation. Briefly, contracts must be justified by live stack references that were defined with associated type obligations. Our implementation, taking inspiration from how garbage collectors approximate object liveness by object reachability, approximates contract liveness by contract reachability. Then, to achieve a much closer approximation to contract liveness, we introduce a poisoning process: we nullify the stack references justifying a violated contract and associate the location that triggered the violation with a poisoned reference for blame. We compare our implementation with the original implementation of HiggsCheck; the comparison shows our system is fully compatible with code that raised no errors, at a small performance penalty of 8.14% average slowdown. We also discuss the performance of the contract removal process and possible worst cases for the liveness-based system. Finally, we modify the semantics of HiggsCheck SafeTypeScript to formalize the liveness-based type system. Our work shows that relaxing contractual obligations in a gradually typed system with the monotonic semantics is viable and realistic.
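    To make the zombie scenario concrete, the following is a minimal sketch (not taken from the thesis) of how a monotonic contract can outlive the reference that justified it; the SafeTypeScript-style annotations and all identifiers are illustrative assumptions.

```typescript
// Minimal illustration of a "zombie" contract under the monotonic
// dynamic semantics (SafeTypeScript-style; identifiers are hypothetical).

interface Point { x: number; y: number; }

function readX(p: Point): number {
  // The typed parameter `p` places a monotonic type obligation on the
  // object it refers to: `x` and `y` must remain numbers.
  return p.x;
}

const obj: any = { x: 1, y: 2 }; // dynamically typed object
readX(obj);                      // obj now carries the Point contract

// `p` is out of scope, yet under monotonic semantics obj keeps the Point
// obligation for the rest of execution: a zombie contract. The write below
// is therefore rejected at run time, even though no live stack reference
// still requires obj.x to be a number; a liveness-based system could
// discard the contract once its justifying reference is gone.
obj.x = "hello";
```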

    A Scalable and Extensible Approach to Benchmarking NL2Code for 18 Programming Languages

    Large language models have demonstrated the ability to condition on and generate both natural language and programming language text. Such models open up the possibility of multi-language code generation: could code generation models generalize knowledge from one language to another? Although contemporary code generation models can generate semantically correct Python code, little is known about their abilities with other languages. We facilitate the exploration of this topic by proposing MultiPL-E, the first multi-language parallel benchmark for natural-language-to-code generation. MultiPL-E extends the HumanEval benchmark (Chen et al., 2021) to support 18 more programming languages, encompassing a range of programming paradigms and popularity. We evaluate two state-of-the-art code generation models on MultiPL-E: Codex and InCoder. We find that on several languages, Codex matches and even exceeds its performance on Python. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible. We describe a general approach for easily adding support for new benchmarks and languages to MultiPL-E.
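    For illustration, the first HumanEval problem (Chen et al., 2021) looks roughly like the sketch below once rendered as a TypeScript prompt in the style the abstract describes: the Python signature and docstring become a typed signature and doc comment, and the model completes the body. The exact formatting is an assumption, not MultiPL-E's verbatim output, and the completion shown is simply one the translated tests would accept.

```typescript
/**
 * Check whether, in the given array of numbers, any two numbers are closer
 * to each other than the given threshold.
 * >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
 * false
 * >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
 * true
 */
function has_close_elements(numbers: number[], threshold: number): boolean {
  // The model is prompted with everything above this line; a correct
  // completion such as the following is then checked by unit tests.
  for (let i = 0; i < numbers.length; i++) {
    for (let j = i + 1; j < numbers.length; j++) {
      if (Math.abs(numbers[i] - numbers[j]) < threshold) {
        return true;
      }
    }
  }
  return false;
}
```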

    MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation

    Large language models have demonstrated the ability to generate both natural language and programming language text. Although contemporary code generation models are trained on corpora with several programming languages, they are tested using benchmarks that are typically monolingual. The most widely used code generation benchmarks only target Python, so there is little quantitative evidence of how code generation models perform on other programming languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python benchmarks, HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), to 18 additional programming languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex (Chen et al., 2021), CodeGen (Nijkamp et al., 2022), and InCoder (Fried et al., 2022). We find that Codex matches or even exceeds its Python performance on several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
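    Continuing the earlier sketch, the test-driven half of such a translated problem might look like the following in TypeScript: the original Python assertions become native assertions, and a sampled completion counts as correct only if the whole script runs without failure. The assertion values and harness shape are illustrative assumptions, not MultiPL-E's verbatim translation.

```typescript
import { strict as assert } from "node:assert";

// Translated unit tests for has_close_elements: a completion passes the
// benchmark problem only if every assertion below succeeds.
function check(candidate: (numbers: number[], threshold: number) => boolean): void {
  assert.equal(candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3), true);
  assert.equal(candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05), false);
  assert.equal(candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95), true);
  assert.equal(candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8), false);
}

check(has_close_elements); // the completion sketched in the previous listing
```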

    StarCoder: may the source be with you!

    The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B-parameter models with 8K context length, infilling capabilities, and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.
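    The pass@1 figure refers to the pass@k metric of Chen et al. (2021). As a reminder of what is being reported, here is a small sketch of the standard unbiased estimator, pass@k = 1 - C(n-c, k)/C(n, k), in product form; this is just the published formula, not the BigCode evaluation harness, and the example numbers are hypothetical.

```typescript
// Unbiased pass@k estimator (Chen et al., 2021): given n sampled completions
// for a problem, of which c pass the unit tests, estimate the probability
// that at least one of k randomly chosen completions passes.
// pass@k = 1 - C(n - c, k) / C(n, k), computed as a running product.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) {
    return 1.0; // every size-k sample must contain a passing completion
  }
  let failAll = 1.0;
  for (let i = n - c + 1; i <= n; i++) {
    failAll *= 1.0 - k / i;
  }
  return 1.0 - failAll;
}

// Example: 200 samples for one problem, 80 of which pass, reported as pass@1.
// A benchmark-level score is the mean of this estimate over all problems.
console.log(passAtK(200, 80, 1)); // ≈ 0.4
```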