4 research outputs found
No Zombie Types: Liveness-Based Justification For Monotonic Gradual Types
Gradual type systems with the monotonic dynamic semantics, such as HiggsCheck, an implementation of SafeTypeScript, achieve decent performance, making them a viable option for JavaScript programmers seeking run-time-checkable type annotations.
However, the type restrictions for objects in the monotonic dynamic semantics are, as the name suggests, monotonic.
Once a typed reference is defined or assigned to refer to an object, the contract carrying the type obligation of the reference is part of the object for the remainder of execution.
In some cases, such contracts become "zombies": the reference that justifies a contract is out of scope, yet the object still retains the type obligation.
In this thesis, we propose a novel idea of contract liveness and its implementation.
Briefly speaking, contracts must be justified by live stack references defined with associated type obligations.
Our implementation, taking inspiration from how garbage collectors approximate object liveness by reachability of objects, approximates contract liveness by reachability of contracts.
Then, to achieve a much closer approximation to contract liveness, we introduce a poisoning process:
we nullify the stack references justifying the violated contract, and associate the location that triggered the contract violation with a poisoned reference for blame.
We compare our implementation with the original implementation of HiggsCheck. The comparison shows that our system is fully compatible with code that raises no errors, at the cost of a small performance penalty: an average slowdown of 8.14%.
We also discuss the performance of the contract removal process, and possible worst cases for the liveness-based system.
We also modify the semantics of HiggsCheck's SafeTypeScript to formalize the liveness-based type system.
Our work demonstrates that relaxing contractual obligations in a gradually typed system with the monotonic semantics is viable and realistic.
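The zombie scenario described above can be sketched in a few lines of Python. Note that the real system lives inside a JavaScript VM; the `Monitored` class and its API below are illustrative assumptions, not HiggsCheck's actual design. The sketch shows a contract that stays attached to an object after the stack reference that justified it has gone out of scope:

```python
# Illustrative sketch (not HiggsCheck's implementation) of a "zombie"
# contract: a monotonic type obligation that outlives the reference
# that justified it. Class and method names are assumptions.

class Monitored:
    """An object that accumulates type obligations monotonically."""
    def __init__(self):
        # Obligations are only ever added, never removed (monotonicity).
        object.__setattr__(self, "_contracts", [])

    def add_contract(self, field, typ):
        self._contracts.append((field, typ))

    def __setattr__(self, name, value):
        # Every write is checked against every accumulated contract.
        for field, typ in self._contracts:
            if field == name and not isinstance(value, typ):
                raise TypeError(f"contract violated: {name} must be {typ.__name__}")
        object.__setattr__(self, name, value)

def typed_use(obj):
    # A typed stack reference obligates obj.x to be an int...
    obj.add_contract("x", int)
    obj.x = 1
    # ...but when this function returns, the reference dies while the
    # contract stays attached to obj: it has become a zombie.

o = Monitored()
typed_use(o)
try:
    o.x = "hello"        # no live typed reference justifies this check,
except TypeError:        # yet the monotonic contract still fires
    print("zombie contract fired")
```

In a liveness-based system, the contract on `o` would be removable once `typed_use` returns, because no live stack reference justifies it any longer.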
A Scalable and Extensible Approach to Benchmarking NL2Code for 18 Programming Languages
Large language models have demonstrated the ability to condition on and
generate both natural language and programming language text. Such models open
up the possibility of multi-language code generation: could code generation
models generalize knowledge from one language to another? Although contemporary
code generation models can generate semantically correct Python code, little is
known about their abilities with other languages. We facilitate the exploration
of this topic by proposing MultiPL-E, the first multi-language parallel
benchmark for natural-language-to-code generation.
MultiPL-E extends the HumanEval benchmark (Chen et al., 2021) to support 18
more programming languages, encompassing a range of programming paradigms and
popularity. We evaluate two state-of-the-art code generation models on
MultiPL-E: Codex and InCoder. We find that on several languages, Codex matches
and even exceeds its performance on Python. The range of programming languages
represented in MultiPL-E allows us to explore the impact of language frequency
and language features on model performance. Finally, the MultiPL-E approach of
compiling code generation benchmarks to new programming languages is both
scalable and extensible. We describe a general approach for easily adding
support for new benchmarks and languages to MultiPL-E.
MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation
Large language models have demonstrated the ability to generate both natural language and programming language text. Although contemporary code generation models are trained on corpora with several programming languages, they are tested using benchmarks that are typically monolingual. The most widely used code generation benchmarks only target Python, so there is little quantitative evidence of how code generation models perform on other programming languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python code generation benchmarks, HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), to 18 additional programming languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex (Chen et al., 2021), CodeGen (Nijkamp et al., 2022), and InCoder (Fried et al., 2022). We find that Codex matches or even exceeds its performance on Python for several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
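HumanEval-style benchmarks such as MultiPL-E report results with the unbiased pass@k estimator of Chen et al. (2021): given n sampled completions of which c pass the unit tests, estimate the probability that at least one of k random samples passes. A minimal sketch of that published formula:

```python
# Unbiased pass@k estimator (Chen et al., 2021):
#   pass@k = 1 - C(n-c, k) / C(n, k)
# computed as a numerically stable product instead of raw binomials.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total samples, c = samples passing the tests, k = budget."""
    if n - c < k:
        return 1.0  # fewer than k failures: every k-subset contains a pass
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

print(pass_at_k(200, 50, 1))   # ≈ 0.25: with 50/200 passing, pass@1 is 50/200
```

The product form avoids overflow for large n; it is algebraically identical to the binomial ratio, since C(n-c, k)/C(n, k) telescopes into a product of (i - k)/i terms.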
StarCoder: may the source be with you!
The BigCode community, an open-scientific collaboration working on the
responsible development of Large Language Models for Code (Code LLMs),
introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context
length, infilling capabilities and fast large-batch inference enabled by
multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced
from The Stack, a large collection of permissively licensed GitHub repositories
with inspection tools and an opt-out process. We fine-tuned StarCoderBase on
35B Python tokens, resulting in the creation of StarCoder. We perform the most
comprehensive evaluation of Code LLMs to date and show that StarCoderBase
outperforms every open Code LLM that supports multiple programming languages
and matches or outperforms the OpenAI code-cushman-001 model. Furthermore,
StarCoder outperforms every model that is fine-tuned on Python, can be prompted
to achieve 40% pass@1 on HumanEval, and still retains its performance on other
programming languages. We take several important steps towards a safe
open-access model release, including an improved PII redaction pipeline and a
novel attribution tracing tool, and make the StarCoder models publicly
available under a more commercially viable version of the Open Responsible AI
Model license.
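The multi-query attention mentioned above (Shazeer, 2019) is what enables StarCoder's fast large-batch inference: all query heads share a single key/value head, shrinking the key/value cache by a factor of the head count. A minimal numpy sketch of the idea (shapes, names, and the omission of causal masking are simplifying assumptions, not StarCoder's actual code):

```python
# Multi-query attention sketch: n_heads query heads, ONE shared K/V head.
# Causal masking and batching are omitted for brevity.
import numpy as np

def multi_query_attention(x, wq, wk, wv, n_heads):
    t, d = x.shape
    hd = d // n_heads                     # per-head dimension
    q = (x @ wq).reshape(t, n_heads, hd)  # one query projection per head
    k = x @ wk                            # single shared key head: (t, hd)
    v = x @ wv                            # single shared value head: (t, hd)
    scores = np.einsum("thd,sd->ths", q, k) / np.sqrt(hd)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    out = np.einsum("ths,sd->thd", weights, v)      # (t, n_heads, hd)
    return out.reshape(t, d)

rng = np.random.default_rng(0)
t, d, h = 4, 8, 2
y = multi_query_attention(rng.standard_normal((t, d)),
                          rng.standard_normal((d, d)),
                          rng.standard_normal((d, d // h)),
                          rng.standard_normal((d, d // h)), h)
print(y.shape)  # (4, 8)
```

During autoregressive decoding only `k` and `v` need caching, and here they are `(t, hd)` rather than `(t, n_heads, hd)`, which is why the cache shrinks by `n_heads`.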