Generation Probabilities Are Not Enough: Exploring the Effectiveness of Uncertainty Highlighting in AI-Powered Code Completions
Large-scale generative models enabled the development of AI-powered code
completion tools to assist programmers in writing code. However, much like
other AI-powered tools, AI-powered code completions are not always accurate,
potentially introducing bugs or even security vulnerabilities into code if not
properly detected and corrected by a human programmer. One technique that has
been proposed and implemented to help programmers identify potential errors is
to highlight uncertain tokens. However, there have been no empirical studies
exploring the effectiveness of this technique, nor investigating the different
and not-yet-agreed-upon notions of uncertainty in the context of generative
models. We explore the question of whether conveying information about
uncertainty enables programmers to more quickly and accurately produce code
when collaborating with an AI-powered code completion tool, and if so, what
measure of uncertainty best fits programmers' needs. Through a mixed-methods
study with 30 programmers, we compare three conditions: providing the AI
system's code completion alone, highlighting tokens with the lowest likelihood
of being generated by the underlying generative model, and highlighting tokens
with the highest predicted likelihood of being edited by a programmer. We find
that highlighting tokens with the highest predicted likelihood of being edited
leads to faster task completion and more targeted edits, and is subjectively
preferred by study participants. In contrast, highlighting tokens according to
their probability of being generated does not provide any benefit over the
baseline with no highlighting. We further explore the design space of how to
convey uncertainty in AI-powered code completion tools, and find that
programmers prefer highlights that are granular, informative, interpretable,
and not overwhelming.
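To make the generation-probability notion of uncertainty concrete, the sketch below (assuming a HuggingFace causal code model such as Salesforce/codegen-350M-mono; any similar model would do) scores each generated token by the probability the model assigned to it and flags the least likely tokens, i.e., the generation-probability highlighting condition compared in the study. The edit-likelihood condition, which performed best, would require a separately trained predictor of human edits and is not shown.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model; any causal code LM with the same interface would work.
model_name = "Salesforce/codegen-350M-mono"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "def mean(xs):\n    return "
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=16,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
    )

# Probability the model assigned to each token it actually emitted.
completion_ids = out.sequences[0, inputs["input_ids"].shape[1]:].tolist()
probs = [
    torch.softmax(step_logits[0], dim=-1)[tok_id].item()
    for step_logits, tok_id in zip(out.scores, completion_ids)
]

# "Highlight" (here: wrap in <<...>>) tokens in the lowest-probability quartile.
threshold = sorted(probs)[len(probs) // 4]
for tok_id, p in zip(completion_ids, probs):
    piece = tok.decode([tok_id])
    print(f"<<{piece}>>" if p <= threshold else piece, end="")
print()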
Large Language Models of Code Fail at Completing Code with Potential Bugs
Large language models of code (Code-LLMs) have recently brought tremendous
advances to code completion, a fundamental feature of programming assistance
and code intelligence. However, most existing works ignore the possible
presence of bugs in the code context for generation, which are inevitable in
software development. Therefore, we introduce and study the buggy-code
completion problem, inspired by the realistic scenario of real-time code
suggestion where the code context contains potential bugs -- anti-patterns that
can become bugs in the completed program. To systematically study the task, we
introduce two datasets: one with synthetic bugs derived from semantics-altering
operator changes (buggy-HumanEval) and one with realistic bugs derived from
user submissions to coding problems (buggy-FixEval). We find that the presence
of potential bugs significantly degrades the generation performance of the
high-performing Code-LLMs. For instance, the passing rates of CodeGen-2B-mono
on test cases of buggy-HumanEval drop more than 50% given a single potential
bug in the context. Finally, we investigate several post-hoc methods for
mitigating the adverse effect of potential bugs and find that there remains a
large gap in post-mitigation performance.
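As an illustration of how a synthetic potential bug of the kind used in buggy-HumanEval can be introduced, the sketch below swaps a single semantics-altering operator in a code prefix. The operator table and selection strategy are illustrative assumptions, not the authors' actual dataset construction.

import random
import re

# Hypothetical table of semantics-altering operator substitutions.
OPERATOR_SWAPS = {
    "<=": ">=",
    ">=": "<=",
    "==": "!=",
    "+": "-",
    "*": "//",
}

def inject_operator_bug(prefix: str, seed: int = 0) -> str:
    """Return the code prefix with one randomly chosen operator replaced."""
    rng = random.Random(seed)
    candidates = [
        (m.start(), op)
        for op in OPERATOR_SWAPS
        for m in re.finditer(re.escape(op), prefix)
    ]
    if not candidates:
        return prefix  # nothing to perturb
    pos, op = rng.choice(candidates)
    return prefix[:pos] + OPERATOR_SWAPS[op] + prefix[pos + len(op):]

clean_prefix = "def is_adult(age):\n    if age >= 18:\n"
print(inject_operator_bug(clean_prefix))  # e.g. '>=' becomes '<=' in the context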
Syntax-Aware On-the-Fly Code Completion
Code completion aims to help improve developers' productivity by suggesting
the next code tokens from a given context. Various approaches have been
proposed to incorporate abstract syntax tree (AST) information for model
training, ensuring that code completion is aware of the syntax of the
programming languages. However, existing syntax-aware code completion
approaches are not on-the-fly: we found that for roughly two-thirds of the
characters that developers type, an AST cannot be extracted because AST
construction requires syntactically correct source code, limiting the
practicality of such approaches in real-world scenarios. On the other hand,
existing on-the-fly code completion does not
consider syntactic information yet. In this paper, we propose PyCoder to
leverage token types, a kind of lightweight syntactic information, which is
readily available and aligns with the natural order of source code. PyCoder is
trained in a multi-task manner: by learning the auxiliary task of predicting
token types during training, the model achieves better performance at
predicting tokens and lines of code without needing token types at inference
time. Comprehensive experiments show that PyCoder
achieves the first rank on the CodeXGLUE leaderboard with an accuracy of 77.12%
for the token-level predictions, which is 0.43%-24.25% more accurate than
baselines. In addition, PyCoder achieves an exact match of 43.37% for the
line-level predictions, which is 3.63%-84.73% more accurate than baselines.
These results lead us to conclude that token type information (a lightweight
alternative to full syntactic information) that has rarely been used in the
past can greatly improve the performance of code completion approaches without
requiring syntactically correct source code, as AST-based approaches do. Our
PyCoder is publicly available on HuggingFace.
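The token types PyCoder relies on can be obtained cheaply even for code that does not yet parse; the sketch below uses Python's standard tokenize module to illustrate this kind of lightweight syntactic signal. It illustrates the idea only and is not PyCoder's implementation.

import io
import tokenize

def token_types(source: str):
    """Return (lexeme, token-type) pairs, tolerating incomplete code."""
    pairs = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.string.strip():
                pairs.append((tok.string, tokenize.tok_name[tok.type]))
    except (tokenize.TokenError, IndentationError):
        pass  # incomplete input: keep whatever was tokenized so far
    return pairs

# An unfinished statement for which AST extraction would fail:
print(token_types("result = compute(x, "))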
An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation
Unit tests play a key role in ensuring the correctness of software. However,
manually creating unit tests is a laborious task, motivating the need for
automation. Large Language Models (LLMs) have recently been applied to this
problem, utilizing additional training or few-shot learning on examples of
existing tests. This paper presents a large-scale empirical evaluation on the
effectiveness of LLMs for automated unit test generation without additional
training or manual effort, providing the LLM with the signature and
implementation of the function under test, along with usage examples extracted
from documentation. We also attempt to repair failed generated tests by
re-prompting the model with the failing test and error message. We implement
our approach in TestPilot, a test generation tool for JavaScript that
automatically generates unit tests for all API functions in an npm package. We
evaluate TestPilot using OpenAI's gpt-3.5-turbo LLM on 25 npm packages with a
total of 1,684 API functions. The generated tests achieve a median statement
coverage of 70.2% and branch coverage of 52.8%, significantly improving on
Nessie, a recent feedback-directed JavaScript test generation technique, which
achieves only 51.3% statement coverage and 25.6% branch coverage. We also find
that 92.8% of TestPilot's generated tests have no more than 50% similarity with
existing tests (as measured by normalized edit distance), with none of them
being exact copies. Finally, we run TestPilot with two additional LLMs,
OpenAI's older code-cushman-002 LLM and the open LLM StarCoder. Overall, we
observed similar results with the former (68.2% median statement coverage), and
somewhat worse results with the latter (54.0% median statement coverage),
suggesting that the effectiveness of the approach is influenced by the size and
training set of the LLM, but does not fundamentally depend on the specific
model.
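The prompt-and-repair scheme described above can be sketched generically as follows; llm_complete (queries the LLM) and run_test (executes a generated test and returns an error message or None) are hypothetical stand-ins for TestPilot's actual infrastructure, and the prompt wording is an assumption.

def build_prompt(signature: str, implementation: str, usage_examples: str) -> str:
    # The prompt carries the function signature, its implementation, and
    # usage examples extracted from the documentation.
    return (
        "// Function under test:\n"
        f"{signature}\n{implementation}\n\n"
        "// Usage examples from the documentation:\n"
        f"{usage_examples}\n\n"
        "// Unit test for the function above:\n"
    )

def generate_test(signature, implementation, usage_examples,
                  llm_complete, run_test, max_repairs=2):
    prompt = build_prompt(signature, implementation, usage_examples)
    test = llm_complete(prompt)
    for _ in range(max_repairs):
        error = run_test(test)
        if error is None:
            return test  # the generated test passes
        # Repair attempt: re-prompt with the failing test and its error message.
        prompt = (
            f"{prompt}{test}\n\n"
            f"// The test above failed with: {error}\n"
            "// Corrected unit test:\n"
        )
        test = llm_complete(prompt)
    return test  # may still fail after the repair budget is exhausted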
Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs
Over the past few years, Large Language Models of Code (Code LLMs) have
started to have a significant impact on programming practice. Code LLMs are
also emerging as a building block for research in programming languages and
software engineering. However, the quality of code produced by a Code LLM
varies significantly by programming languages. Code LLMs produce impressive
results on programming languages that are well represented in their training
data (e.g., Java, Python, or JavaScript), but struggle with low-resource
languages, like OCaml and Racket.
This paper presents an effective approach for boosting the performance of
Code LLMs on low-resource languages using semi-synthetic data. Our approach
generates high-quality datasets for low-resource languages, which can then be
used to fine-tune any pretrained Code LLM. Our approach, called MultiPL-T,
translates training data from high-resource languages into training data for
low-resource languages. We apply our approach to generate tens of thousands of
new, validated training items for Racket, OCaml, and Lua from Python. Moreover,
we use an open dataset (The Stack) and model (StarCoderBase), which allow us to
decontaminate benchmarks and train models on this data without violating the
model license.
With MultiPL-T generated data, we present fine-tuned versions of
StarCoderBase that achieve state-of-the-art performance for Racket, OCaml, and
Lua on benchmark problems. For Lua, our fine-tuned model achieves the same
performance as StarCoderBase does on Python -- a very high-resource language --
on the MultiPL-E benchmarks. For Racket and OCaml, we double their performance on
MultiPL-E, bringing their performance close to higher-resource languages such
as Ruby and C#.
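The translate-then-validate idea behind MultiPL-T can be sketched as a simple filter: translated items are kept only if their translated tests pass in the target language. translate_with_llm and run_target_tests below are hypothetical helpers standing in for the authors' actual pipeline.

def build_finetuning_set(python_items, target_lang,
                         translate_with_llm, run_target_tests):
    """Keep only translated items whose tests pass in the target language."""
    validated = []
    for item in python_items:  # each item: {'function': ..., 'tests': ...} in Python
        translated = translate_with_llm(item["function"], item["tests"], target_lang)
        if translated is None:
            continue  # the model failed to produce a translation
        if run_target_tests(translated["function"], translated["tests"], target_lang):
            validated.append(translated)  # semi-synthetic, test-validated item
    return validated

# e.g. build_finetuning_set(stack_python_items, "lua", translate_with_llm, run_target_tests)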