Semantic Source Code Models Using Identifier Embeddings
The emergence of online open source repositories in the recent years has led
to an explosion in the volume of openly available source code, coupled with
metadata that relate to a variety of software development activities. As an
effect, in line with recent advances in machine learning research, software
maintenance activities are switching from symbolic formal methods to
data-driven methods. In this context, the rich semantics hidden in source code
identifiers provide opportunities for building semantic representations of code
which can assist tasks of code search and reuse. To this end, we deliver in the
form of pretrained vector space models, distributed code representations for
six popular programming languages, namely, Java, Python, PHP, C, C++, and C#.
The models are produced using fastText, a state-of-the-art library for learning
word representations. Each model is trained on data from a single programming
language; the code mined for producing all models amounts to over 13,000
repositories. We indicate dissimilarities between natural language and source
code, as well as variations in coding conventions between the different
programming languages we processed. We describe how these heterogeneities
guided the data preprocessing decisions we took and the selection of the
training parameters in the released models. Finally, we propose potential
applications of the models and discuss their limitations.
Comment: 16th International Conference on Mining Software Repositories (MSR
2019): Data Showcase Track
Looking Over the Research Literature on Software Engineering from 2016 to 2018
This paper carries out a bibliometric analysis to determine (i) which research on software engineering is currently the most influential, (ii) where that research is being published, (iii) which topics are most commonly researched, and (iv) where that research is being undertaken (i.e., in which countries and institutions). To that end, 6,365 software engineering articles, published from 2016 to 2018 in a variety of conferences and journals, are examined.
This work has been funded by the Spanish Ministry of Science, Innovation, and Universities under Project
DPI2016-77677-P, the Community of Madrid under Grant RoboCity2030-DIH-CM P2018/NMT-4331, and grant
TIN2016-75850-R from the FEDER funds.
Open Vocabulary Learning on Source Code with a Graph-Structured Cache
Machine learning models that take computer program source code as input
typically use Natural Language Processing (NLP) techniques. However, a major
challenge is that code is written using an open, rapidly changing vocabulary
due to, e.g., the coinage of new variable and method names. Reasoning over such
a vocabulary is not something for which most NLP methods are designed. We
introduce a Graph-Structured Cache to address this problem; this cache contains
a node for each new word the model encounters with edges connecting each word
to its occurrences in the code. We find that combining this graph-structured
cache strategy with recent Graph-Neural-Network-based models for supervised
learning on code improves the models' performance on a code completion task and
a variable naming task --- with over relative improvement on the latter
--- at the cost of a moderate increase in computation time.
Comment: Published in the International Conference on Machine Learning (ICML
2019), 13 pages
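The cache described in this abstract (one node per newly encountered word, with edges to that word's occurrences in the code) can be illustrated with a minimal Python sketch. This shows only the data structure, not the paper's model: the function name `build_word_cache`, the pre-tokenized input, and the fixed `known_vocab` set are all assumptions made for illustration, and the abstract does not say how tokens are obtained or how the cache is fed to the graph neural network.

```python
from collections import defaultdict


def build_word_cache(tokens, known_vocab):
    """Map each out-of-vocabulary word to the token positions where it occurs.

    Each key plays the role of a cache node; each stored position plays the
    role of an edge from that node to one occurrence in the code.
    (Hypothetical helper, not taken from the paper.)
    """
    cache = defaultdict(list)
    for position, token in enumerate(tokens):
        if token not in known_vocab:
            cache[token].append(position)
    return dict(cache)


# Toy token stream; the vocabulary below is an illustrative assumption.
tokens = ["int", "fooCount", "=", "0", ";", "fooCount", "+=", "1", ";"]
known_vocab = {"int", "=", "0", "1", ";", "+="}
cache = build_word_cache(tokens, known_vocab)
print(cache)
```

Here the unseen identifier `fooCount` becomes a single cache entry linked to both of its occurrences, so information about one use of the name can, in principle, reach the other.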
500+ Times Faster Than Deep Learning (A Case Study Exploring Faster Methods for Text Mining StackOverflow)
Deep learning methods are useful for high-dimensional data and are becoming
widely used in many areas of software engineering. Deep learners utilize
extensive computational power and can take a long time to train, making it
difficult to widely validate, repeat, and improve their results. Further,
they are not the best solution in all domains. For example, recent results show
that for finding related Stack Overflow posts, a tuned SVM performs similarly
to a deep learner, but is significantly faster to train. This paper extends
that recent result by clustering the dataset, then tuning the learners within
each cluster. This approach is over 500 times faster than deep learning (and
over 900 times faster if we use all the cores on a standard laptop computer).
Significantly, this faster approach generates classifiers nearly as good
(within 2% F1 score) as the much slower deep learning method. Hence we
recommend this faster method since it is much easier to reproduce and utilizes
far fewer CPU resources. More generally, we recommend that, before releasing
research results, researchers compare their supposedly sophisticated
methods against simpler alternatives (e.g., applying simpler learners to build
local models).
A Mocktail of Source Code Representations
Efficient representation of source code is essential for various software
engineering tasks such as code search and code clone detection. One such
technique for representing source code involves extracting paths from the AST
and using a learning model to capture program properties. Code2vec is a
commonly used path-based approach that uses an attention-based neural network
to learn code embeddings which can then be used for various software
engineering tasks. However, this approach uses only ASTs and does not leverage
other graph structures such as Control Flow Graphs (CFG) and Program Dependency
Graphs (PDG). Similarly, most recent approaches for representing source code
still use AST and do not leverage semantic graph structures. Even though there
exists an integrated graph approach (Code Property Graph) for representing
source code, it has only been explored in the domain of software security.
Moreover, it does not leverage the paths from the individual graphs. In our
work, we extend the path-based approach code2vec to include semantic graphs,
CFG, and PDG, along with AST, which is still largely unexplored in the domain
of software engineering. We evaluate our approach on the task of MethodNaming
using a custom C dataset of 730K methods collected from 16 C projects from
GitHub. In comparison to code2vec, our approach improves the F1 Score by 11% on
the full dataset and up to 100% with individual projects. We show that semantic
features from the CFG and PDG paths are indeed helpful. We envision that
looking at a mocktail of source code representations for various software
engineering tasks can lay the foundation for a new line of research and a
re-haul of existing research.
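The path-based idea this abstract builds on (pairs of AST leaves joined by the path through their lowest common ancestor) can be sketched as follows. This is a rough illustration only, under several assumptions: it uses Python's standard `ast` module rather than the C tooling the abstract describes, it covers AST paths only (no CFG or PDG paths), and the function name `path_contexts` and the `^`/`_` path notation are hypothetical choices, not the format used by code2vec or by the authors.

```python
import ast
from itertools import combinations


def path_contexts(source):
    """Return (leaf_a, path, leaf_b) triples for every pair of AST leaves.

    A leaf is a variable name or a constant. The path lists node types going
    up from leaf_a to the lowest common ancestor ('^'), then down to leaf_b ('_').
    """
    leaves = []  # (token text, list of nodes from the root down to the leaf)

    def visit(node, trail):
        trail = trail + [node]
        if isinstance(node, (ast.Name, ast.Constant)):
            token = node.id if isinstance(node, ast.Name) else repr(node.value)
            leaves.append((token, trail))
            return
        for child in ast.iter_child_nodes(node):
            visit(child, trail)

    visit(ast.parse(source), [])

    contexts = []
    for (tok_a, trail_a), (tok_b, trail_b) in combinations(leaves, 2):
        # Count ancestors shared by identity; the last shared one is the LCA.
        shared = 0
        while (shared < min(len(trail_a), len(trail_b))
               and trail_a[shared] is trail_b[shared]):
            shared += 1
        up = [type(n).__name__ for n in reversed(trail_a[shared:])]
        top = type(trail_a[shared - 1]).__name__
        down = [type(n).__name__ for n in trail_b[shared:]]
        path = "^".join(up + [top]) + "_" + "_".join(down)
        contexts.append((tok_a, path, tok_b))
    return contexts


contexts = path_contexts("x = y + 1")
for context in contexts:
    print(context)
```

In a path-based model, triples like these would be embedded and aggregated by a learned network; extending the same extraction to control-flow and dependency graphs is the part the abstract proposes and this sketch does not attempt.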