Learning Transfers over Several Programming Languages
Large language models (LLMs) have recently become remarkably good at
improving developer productivity for high-resource programming languages. These
models use two kinds of data: large amounts of unlabeled code samples for
pretraining and relatively smaller amounts of labeled code samples for
fine-tuning or in-context learning. Unfortunately, many programming languages
are low-resource, lacking labeled samples for most tasks and often even lacking
unlabeled samples. Therefore, users of low-resource languages (e.g., legacy or
new languages) miss out on the benefits of LLMs. Cross-lingual transfer
learning uses data from a source language to improve model performance on a
target language. It has been well-studied for natural languages, but has
received little attention for programming languages. This paper reports
extensive experiments on four tasks using a transformer-based LLM and 11 to 41
programming languages to explore the following questions. First, how well does
cross-lingual transfer work for a given task across different language pairs?
Second, given a task and target language, how should a source language be chosen?
Third, which characteristics of a language pair are predictive of transfer
performance? Fourth, how does that depend on the given task?
Comment: 16 pages, 5 figures, 5 tables
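As a concrete illustration of the setup this abstract describes, the sketch below fine-tunes a pretrained code model on labeled data from a high-resource source language and then evaluates it, with no target-language training, on a low-resource target language. The model name, file names, the "code"/"label" schema, and the choice of a binary classification task are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of cross-lingual transfer for code: fine-tune on a source
# language, then evaluate on a target language without target-language training.
# Model, file names, and the "code"/"label" schema are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "microsoft/codebert-base"  # assumed encoder; the paper only says "transformer-based LLM"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["code"], truncation=True, max_length=512)

# Labeled samples in the high-resource source language (e.g., Java) ...
source = load_dataset("json", data_files="java_train.jsonl")["train"].map(tokenize, batched=True)
# ... and a small labeled test set in the low-resource target language (e.g., COBOL).
target = load_dataset("json", data_files="cobol_test.jsonl")["train"].map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xfer", num_train_epochs=3, per_device_train_batch_size=8),
    train_dataset=source,
    tokenizer=tokenizer,            # enables dynamic padding of the tokenized batches
)
trainer.train()                     # fine-tune on the source language only
print(trainer.evaluate(target))     # cross-lingual transfer: test on the target language
```

Repeating this loop over many source/target pairs and tasks is what allows one to ask which source languages transfer best to a given target.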
Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain
Code Large Language Models (Code LLMs) are being increasingly employed in
real-life applications, so evaluating them is critical. While the general
accuracy of Code LLMs on individual tasks has been extensively evaluated, their
self-consistency across different tasks is overlooked. Intuitively, a
trustworthy model should be self-consistent when generating natural language
specifications for its own code and generating code for its own specifications.
Failure to preserve self-consistency reveals a lack of understanding of the
shared semantics underlying natural language and programming language, and
therefore undermines the trustworthiness of a model. In this paper, we first
formally define the self-consistency of Code LLMs and then design a framework,
IdentityChain, which effectively and efficiently evaluates the self-consistency
and general accuracy of a model at the same time. We study eleven Code LLMs and
show that they fail to preserve self-consistency, which is indeed a distinct
aspect from general accuracy. Furthermore, we show that IdentityChain can serve
as a model-debugging tool that exposes weaknesses of Code LLMs, and we demonstrate
three major weaknesses that it reveals in current models. Our code is available at
https://github.com/marcusm117/IdentityChain.
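The round-trip loop sketched below illustrates the notion of self-consistency that IdentityChain formalizes: a model summarizes its own code, re-implements its own summary, and the chain breaks as soon as a generated program no longer behaves like the original. The helper functions and prompts are hypothetical stand-ins, not IdentityChain's actual interface.

```python
# Minimal sketch of a self-consistency chain: spec -> code -> spec -> code ...,
# with a semantic check (e.g., unit tests) after every code-generation step.
# generate_spec, generate_code, and the prompts are hypothetical stand-ins.
def generate_spec(model, code: str) -> str:
    """Ask the model for a natural-language specification of its own code."""
    return model(f"Summarize what this function does:\n{code}")

def generate_code(model, spec: str) -> str:
    """Ask the model to implement its own specification as code."""
    return model(f"Write a Python function that does the following:\n{spec}")

def self_consistency_chain(model, seed_spec: str, tests, length: int = 3) -> bool:
    """Return True only if every program generated along the chain still
    passes the reference tests, i.e., the model preserves its own semantics."""
    spec = seed_spec
    for _ in range(length):
        code = generate_code(model, spec)
        if not tests(code):          # semantic check, e.g., execute unit tests
            return False             # self-consistency broken at this step
        spec = generate_spec(model, code)
    return True
```

A model can score well on the first generation step yet fail this chained check, which is why self-consistency is evaluated separately from general accuracy.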
The TechQA Dataset
We introduce TechQA, a domain-adaptation question answering dataset for the
technical support domain. The TechQA corpus highlights two real-world issues
from the automated customer support domain. First, it contains actual questions
posed by users on a technical forum, rather than questions generated
specifically for a competition or a task. Second, it has a real-world size --
600 training, 310 dev, and 490 evaluation question/answer pairs -- thus
reflecting the cost of creating large labeled datasets with actual data.
Consequently, TechQA is meant to stimulate research in domain adaptation rather
than being a resource to build QA systems from scratch. The dataset was
obtained by crawling the IBM Developer and IBM DeveloperWorks forums for
questions with accepted answers that appear in a published IBM Technote---a
technical document that addresses a specific technical issue. We also release a
collection of the 801,998 publicly available Technotes as of April 4, 2019 as a
companion resource that might be used for pretraining, to learn representations
of the IT domain language.
Comment: Long version of conference paper to be submitted
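For orientation, the snippet below shows how TechQA-style records could be loaded and the reported split sizes checked. The file names and the "question"/"technote_id" field names are assumptions for illustration; the released dataset defines its own schema.

```python
# Illustrative loader for TechQA-style splits. File names and field names
# ("question", "technote_id") are assumptions, not the dataset's actual schema.
import json

def load_split(path: str):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

splits = {name: load_split(f"techqa_{name}.jsonl") for name in ("train", "dev", "eval")}

# The paper reports 600 training, 310 dev, and 490 evaluation question/answer pairs.
for name, rows in splits.items():
    print(name, len(rows))

# Each QA pair links a forum question to the Technote containing its accepted answer.
example = splits["train"][0]
print(example["question"][:80], "->", example["technote_id"])
```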
Automated Code generation for Information Technology Tasks in YAML through Large Language Models
The recent improvement in code generation capabilities due to the use of
large language models has mainly benefited general purpose programming
languages. Domain specific languages, such as the ones used for IT Automation,
have received far less attention, despite involving many active developers and
being an essential component of modern cloud platforms. This work focuses on
the generation of Ansible-YAML, a widely used markup language for IT
Automation. We present Ansible Wisdom, a natural-language to Ansible-YAML code
generation tool, aimed at improving IT automation productivity. Ansible Wisdom
is a transformer-based model, extended by training with a new dataset
containing Ansible-YAML. We also develop two novel performance metrics for YAML
and Ansible to capture the specific characteristics of this domain. Results
show that Ansible Wisdom can accurately generate Ansible scripts from natural
language prompts, with performance comparable to or better than existing
state-of-the-art code generation models.
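To make the motivation for YAML-specific metrics concrete, the sketch below compares a generated Ansible snippet with a reference structurally rather than textually, so that key order, quoting, and indentation differences are not penalized. This structural check is only an illustrative stand-in; the paper defines its own Ansible and YAML metrics.

```python
# Hedged illustration of YAML-aware evaluation: two Ansible snippets can differ
# textually yet be structurally identical. This is an illustrative stand-in,
# not the paper's actual metrics.
import yaml  # PyYAML

def structurally_equal(reference: str, prediction: str) -> bool:
    """Parse both snippets and compare the resulting data structures,
    so key order and surface formatting do not affect the score."""
    try:
        return yaml.safe_load(reference) == yaml.safe_load(prediction)
    except yaml.YAMLError:
        return False  # an unparsable prediction counts as a miss

reference = """
- name: install nginx
  apt:
    name: nginx
    state: present
"""
prediction = """
- apt: {state: present, name: nginx}
  name: 'install nginx'
"""
print(structurally_equal(reference, prediction))  # True: same task, different surface form
```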