Topical: Learning Repository Embeddings from Source Code using Attention
Machine learning on source code (MLOnCode) promises to transform how software
is delivered. By mining the context of and relationships between software
artefacts, MLOnCode augments software developers' capabilities with code
auto-generation, code recommendation, code auto-tagging and other data-driven
enhancements. For many of these tasks a script-level representation of code is
sufficient; in many cases, however, a repository-level representation that takes
into account dependencies and repository structure is imperative, for
example when auto-tagging repositories with topics or auto-documenting
repository code. Existing methods for computing repository-level
representations suffer from (a) reliance on natural-language documentation of
code (for example, README files) and (b) naive aggregation of method- or
script-level representations, for example by concatenation or averaging. This
paper introduces Topical, a deep neural network that generates repository-level
embeddings of publicly available GitHub code repositories directly from source
code. Topical incorporates an attention mechanism that projects the source
code, the full dependency graph, and the script-level textual information into a
dense repository-level representation. To compute the repository-level
representations, Topical is trained to predict the topics associated with a
repository, on a dataset of publicly available GitHub repositories that were
crawled along with their ground-truth topic tags. Our experiments show that the
embeddings computed by Topical outperform multiple baselines at the task of
repository auto-tagging, including baselines that naively combine the
method-level representations through averaging or concatenation.
Comment: Pre-print, under review
Dev2vec: Representing Domain Expertise of Developers in an Embedding Space
Accurate assessment of the domain expertise of developers is important for
assigning the proper candidate to contribute to a project or to fill a job
role. Since potential candidates can come from a large pool, the automated
assessment of this domain expertise is a desirable goal. While previous methods
have had some success within a single software project, the assessment of a
developer's domain expertise from contributions across multiple projects is
more challenging. In this paper, we employ doc2vec to represent the domain
expertise of developers as embedding vectors. These vectors are derived from
different sources that contain evidence of developers' expertise, such as the
descriptions of the repositories they contributed to, their issue-resolving
history, and the API calls in their commits. We name this representation
dev2vec and demonstrate its
effectiveness in representing the technical specialization of developers. Our
results indicate that encoding the expertise of developers in an embedding
vector outperforms state-of-the-art methods and improves the F1-score by up to
21%. Moreover, our findings suggest that the issue-resolving history of
developers is the most informative source for representing their
domain expertise in embedding spaces.
Comment: 30 pages, 5 figures
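The core idea, embedding textual evidence of expertise and comparing developers by vector similarity, can be illustrated with a dependency-free stand-in. The paper uses doc2vec; this sketch substitutes a hashed bag-of-words embedding, and every developer name and evidence string below is invented.

```python
# Stand-in for the dev2vec idea: turn each developer's textual evidence
# (repo descriptions, resolved-issue titles, commit API calls) into a
# fixed-size vector, then compare developers by cosine similarity.
import hashlib
import math

DIM = 16  # toy dimensionality; the paper learns dense doc2vec vectors

def word_vec(word):
    # Deterministic pseudo-random vector per word, derived from its hash.
    h = hashlib.sha256(word.encode()).digest()
    return [(b - 128) / 128.0 for b in h[:DIM]]

def embed(evidence_texts):
    """Average the word vectors of all evidence tokens."""
    vec = [0.0] * DIM
    n = 0
    for text in evidence_texts:
        for word in text.lower().split():
            vec = [a + b for a, b in zip(vec, word_vec(word))]
            n += 1
    return [v / max(n, 1) for v in vec]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return num / den if den else 0.0

backend_dev = embed(["fix database connection pooling", "optimize sql query planner"])
also_backend = embed(["database index tuning", "sql migration scripts"])
frontend_dev = embed(["css layout animation", "react component rendering"])

print(cosine(backend_dev, also_backend))
print(cosine(backend_dev, frontend_dev))
```

A learned doc2vec model would capture semantic similarity between different words; this hashed version only rewards exact vocabulary overlap, which is why it is a sketch rather than a replacement.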
Antipatterns in Software Classification Taxonomies
Empirical results in software engineering have long shown that
findings are unlikely to be applicable to all software systems, or to any domain:
results need to be evaluated in specified contexts, and limited to the type of
systems that they were extracted from. This is a known issue, and requires the
establishment of a classification of software types.
This paper makes two contributions: the first is to evaluate the quality of
the current software classifications landscape. The second is to perform a case
study showing how to create a classification of software types using a curated
set of software systems.
Our contributions show that existing, and very likely even new,
classification attempts fail due to one or more issues, which we
name the 'antipatterns' of software classification tasks. We collected 7 of
these antipatterns that emerge from both our case study, and the existing
classifications.
These antipatterns represent recurring issues in a classification, so we
discuss practical ways to help researchers avoid these pitfalls. It becomes
clear that classification attempts must also face the daunting task of
formulating a taxonomy of software types, with the objective of establishing a
hierarchy of categories in a classification.
Comment: Accepted for publication in the Journal of Systems and Software
So Much in So Little: Creating Lightweight Embeddings of Python Libraries
In software engineering, different approaches and machine learning models
leverage different types of data: source code, textual information, historical
data. An important part of any project is its dependencies. The list of
dependencies is relatively small but carries a lot of semantics with it, which
can be used to compare projects or make judgements about them.
In this paper, we focus on Python projects and their PyPi dependencies in the
form of requirements.txt files. We compile a dataset of 7,132 Python projects
and their dependencies, as well as use Git to pull their versions from previous
years. Using this data, we build 32-dimensional embeddings of libraries by
applying Singular Value Decomposition to the co-occurrence matrix of projects
and libraries. We then cluster the embeddings and study their semantic
relations.
To showcase the usefulness of such lightweight library embeddings, we
introduce a prototype tool for suggesting relevant libraries to a given
project. The tool computes project embeddings and uses dependencies of projects
with similar embeddings to form suggestions. To compare different library
recommenders, we have created a benchmark based on the evolution of dependency
sets in open-source projects. Approaches based on the created embeddings
significantly outperform the baseline of showing the most popular libraries in
a given year. We have also conducted a user study that showed that the
suggestions differ in quality for different project domains and that even
relevant suggestions might not be particularly useful. Finally, to facilitate
potentially more useful recommendations, we extended the recommender system
with an option to suggest rarer libraries.
Comment: The work was carried out at the end of 2020. 11 pages, 4 figures
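The recommendation step described above can be sketched without the embedding machinery. The paper builds 32-dimensional SVD embeddings of the project-library co-occurrence matrix; this simplified stand-in skips the SVD and measures project similarity directly on raw dependency sets with Jaccard similarity. All project and library names are invented.

```python
# Simplified library recommender: find the projects most similar to a
# target dependency set, then suggest their libraries that the target
# does not already use.

# Toy project -> dependency sets (names are illustrative).
projects = {
    "web_app": {"flask", "sqlalchemy", "requests"},
    "api_svc": {"flask", "requests", "gunicorn"},
    "ml_exp":  {"numpy", "pandas", "scikit-learn"},
    "etl_job": {"pandas", "sqlalchemy", "requests"},
}

def jaccard(a, b):
    """Set overlap as a crude stand-in for embedding similarity."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(target_deps, k=2):
    """Suggest libraries used by the k most similar projects
    but absent from the target's own dependency list."""
    ranked = sorted(projects.items(),
                    key=lambda kv: jaccard(target_deps, kv[1]),
                    reverse=True)
    counts = {}
    for _, deps in ranked[:k]:
        for lib in deps - target_deps:
            counts[lib] = counts.get(lib, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)

print(recommend({"flask", "requests"}))
```

Replacing the Jaccard step with cosine similarity over learned low-dimensional project embeddings, as the paper does, lets the recommender generalize beyond exact dependency overlap.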
Deep learning applied to the assessment of online student programming exercises
Massive online open courses (MOOCs) teaching coding are increasing in number
and popularity. They commonly include homework assignments in which the
students must write code that is evaluated by functional tests. Functional
testing may to some extent be automated; however, the provision of more
qualitative evaluation and feedback may be prohibitively labor-intensive.
Providing qualitative evaluation at scale, automatically, is the subject of
much research effort.
In this thesis, deep learning is applied to the task of performing
automatic assessment of source code, with a focus on provision of
qualitative feedback. Four tasks are considered in detail: language modeling,
detecting idiomatic code, semantic code search, and predicting variable names.
First, deep learning models are applied to the task of language modeling source code. A comparison is made between the performance of
different deep learning language models, and it is shown how language
models can be used for source code auto-completion. It is also demonstrated how language models trained on source code can be used for
transfer learning, providing improved performance on other tasks.
Next, an analysis is made of how the language models from the
previous task can be used to detect idiomatic code. It is shown that
these language models are able to locate where a student has deviated
from correct code idioms. These locations can be highlighted to the
student in order to provide qualitative feedback.
Then, results are shown on semantic code search, again comparing
the performance across a variety of deep learning models. It is demonstrated how semantic code search can be used to reduce the time taken
for qualitative evaluation, by automatically pairing a student submission with an instructor’s hand-written feedback.
Finally, it is examined how deep learning can be used to predict
variable names within source code. These models can be used in a
qualitative evaluation setting where the deep learning models can be
used to suggest more appropriate variable names. It is also shown that
these models can even be used to predict the presence of functional
errors.
Novel experimental results show that: fine-tuning a pre-trained
language model is an effective way to improve performance across a
variety of tasks on source code, improving performance by 5% on average; pre-trained language models can be used as zero-shot learners across a variety of tasks, with the zero-shot performance of some architectures outperforming the fine-tuned performance of others; and
that language models can be used to detect both semantic and syntactic errors. Other novel findings include: removing the non-variable
tokens within source code has negligible impact on the performance of
models, and that these remaining tokens can be shuffled with only a
minimal decrease in performance.
Engineering and Physical Sciences Research Council (EPSRC) funding
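The idiom-detection idea, highlighting positions where a student's code surprises a language model trained on idiomatic code, can be sketched with a tiny bigram model. The thesis uses deep language models; the corpus, tokenization, and threshold below are all invented for illustration.

```python
# Minimal idiom detector: a bigram language model over code tokens
# assigns low conditional probability to transitions it has not seen,
# so surprising positions can be flagged for qualitative feedback.
from collections import Counter, defaultdict

idiomatic_corpus = [
    "for i in range ( n ) :",
    "for item in items :",
    "with open ( path ) as f :",
]

bigrams = defaultdict(Counter)
for line in idiomatic_corpus:
    toks = ["<s>"] + line.split()
    for a, b in zip(toks, toks[1:]):
        bigrams[a][b] += 1

def surprisal_flags(line, threshold=0.2):
    """Return tokens whose conditional probability given the previous
    token falls below the threshold (unseen transitions score 0)."""
    toks = ["<s>"] + line.split()
    flagged = []
    for a, b in zip(toks, toks[1:]):
        total = sum(bigrams[a].values())
        p = bigrams[a][b] / total if total else 0.0
        if p < threshold:
            flagged.append(b)
    return flagged

# A 'range(len(items))' loop deviates from the 'for item in items' idiom,
# so several of its tokens are flagged as surprising.
print(surprisal_flags("for i in range ( len ( items ) ) :"))
```

A deep model replaces the bigram table with a learned context representation, which is what lets it generalize to idioms never seen verbatim in training.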
Predicting Imports in Java Code with Graph Neural Networks
Programmers tend to split their code into multiple files or sub-modules. When a program is executed, these sub-modules interact to produce the desired effect. One can, therefore, represent programs with graphs, where each node corresponds to some file and each edge corresponds to some relationship between files, such as two files being located in the same package or one file importing the content of another. This project trains Graph Neural Networks on such graphs to learn to predict future imports in Java programs and shows that Graph Neural Networks outperform various baseline methods by a wide margin.
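The graph formulation above can be sketched with the simplest form of graph-neural-network computation: one round of neighbourhood averaging to produce node embeddings, then scoring candidate import edges by embedding similarity. The file names, features, and scoring function are toy assumptions, not the project's actual model.

```python
# Files as nodes, existing relations as edges; one message-passing step
# smooths each node's features with its neighbours', and candidate import
# edges are scored by cosine similarity of the resulting embeddings.
import math

# Undirected "same package / existing import" relations (illustrative).
edges = {
    "Main":   ["Utils", "Config"],
    "Utils":  ["Main", "Config"],
    "Config": ["Main", "Utils"],
    "Parser": ["Lexer"],
    "Lexer":  ["Parser"],
}

# Toy 3-dim initial node features (e.g. from file-content statistics).
feats = {
    "Main":   [1.0, 0.0, 0.0],
    "Utils":  [0.0, 1.0, 0.0],
    "Config": [0.0, 0.0, 1.0],
    "Parser": [1.0, 1.0, 0.0],
    "Lexer":  [1.0, 0.0, 1.0],
}

def message_pass(feats, edges):
    """One propagation step: each node averages itself with its neighbours."""
    out = {}
    for node, vec in feats.items():
        allv = [vec] + [feats[n] for n in edges.get(node, [])]
        out[node] = [sum(v[i] for v in allv) / len(allv) for i in range(len(vec))]
    return out

def edge_score(h, a, b):
    """Cosine similarity between node embeddings as a link-prediction score."""
    num = sum(x * y for x, y in zip(h[a], h[b]))
    den = math.sqrt(sum(x * x for x in h[a])) * math.sqrt(sum(x * x for x in h[b]))
    return num / den if den else 0.0

h = message_pass(feats, edges)
# Files inside the densely connected cluster score higher with each other
# than with files from the separate Parser/Lexer cluster.
print(edge_score(h, "Main", "Utils"))
print(edge_score(h, "Main", "Parser"))
```

A trained GNN would learn the aggregation and scoring functions from observed import histories instead of using fixed averaging and cosine similarity, but the structure of the computation is the same.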