On the naturalness of software
Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension.
We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations, and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether (a) code can be usefully modeled by statistical language models and (b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very regular, and, in fact, even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's completion capability. We conclude the paper by laying out a vision for future research in this area.
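The n-gram idea in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a bigram model with add-one smoothing over a toy, pre-tokenized Java-like snippet, and the function names (`train_bigram`, `cross_entropy`, `suggest`) are hypothetical. Lower cross-entropy means the model finds a token sequence more predictable, and ranking the tokens that followed a given context is the simplest form of completion.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Count unigrams and adjacent token pairs in a token sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def cross_entropy(tokens, unigrams, bigrams, vocab_size):
    """Average bits per token pair under an add-one smoothed bigram model."""
    pairs = list(zip(tokens, tokens[1:]))
    total = 0.0
    for prev, cur in pairs:
        # Add-one smoothing (an assumption of this sketch) keeps unseen pairs non-zero.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        total -= math.log2(p)
    return total / len(pairs)


def suggest(prev, bigrams, k=3):
    """Rank candidate next tokens by how often they followed `prev` in training."""
    followers = Counter({cur: n for (p, cur), n in bigrams.items() if p == prev})
    return [tok for tok, _ in followers.most_common(k)]


# Toy training corpus: one tokenized loop (hypothetical data).
corpus = "for ( int i = 0 ; i < n ; i ++ ) { sum += i ; }".split()
unigrams, bigrams = train_bigram(corpus)
vocab_size = len(unigrams)

familiar = "for ( int i = 0 ;".split()      # token order seen in training
unfamiliar = "0 = i int ( for".split()      # same tokens, unseen order

h_familiar = cross_entropy(familiar, unigrams, bigrams, vocab_size)
h_unfamiliar = cross_entropy(unfamiliar, unigrams, bigrams, vocab_size)
```

In this toy setting `h_familiar` comes out lower than `h_unfamiliar`, and `suggest("(", bigrams)` proposes `int`. A realistic model would use longer contexts, a large corpus, and a stronger smoothing method.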
Do the Fix Ingredients Already Exist? An Empirical Inquiry into the Redundancy Assumptions of Program Repair Approaches
Much initial research on automatic program repair has focused on experimental
results to probe their potential to find patches and reduce development effort.
Relatively less effort has been put into understanding the hows and whys of
such approaches. For example, a critical assumption of the GenProg technique is
that certain bugs can be fixed by copying and re-arranging existing code. In
other words, GenProg assumes that the fix ingredients already exist elsewhere
in the code. In this paper, we formalize these assumptions around the concept
of ''temporal redundancy''. A temporally redundant commit is only composed of
what has already existed in previous commits. Our experiments show that a large
proportion of commits that add existing code are temporally redundant. This
validates the fundamental redundancy assumption of GenProg. Comment: ICSE - 36th IEEE International Conference on Software Engineering
(2014)
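The definition of temporal redundancy above can be sketched as a simple check. This is only an illustration under stated assumptions: commits are represented as lists of added lines, lines are compared after stripping surrounding whitespace, and the function name `is_temporally_redundant` is hypothetical. The abstract does not state the granularity or normalization the paper actually uses.

```python
def is_temporally_redundant(added_lines, history):
    """Return True if every non-blank line the commit adds already
    appeared in some earlier commit.

    added_lines: lines added by the commit under inspection.
    history: earlier commits, each given as a list of added lines.
    """
    # Everything that has existed before, normalized by stripping whitespace
    # (the normalization is an assumption of this sketch).
    seen = {line.strip() for commit in history for line in commit}
    added = [line.strip() for line in added_lines if line.strip()]
    # A commit that adds nothing is not counted as redundant here.
    return bool(added) and all(line in seen for line in added)


# Hypothetical history of two earlier commits.
history = [
    ["int x = 0;", "x++;"],
    ["return x;"],
]

reuses_old_code = is_temporally_redundant(["    x++;", "return x;"], history)
adds_new_code = is_temporally_redundant(["x--;"], history)
```

Here `reuses_old_code` is true because both added lines occurred earlier, while `adds_new_code` is false because `x--;` never appeared in the history.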
A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes
We propose a model to automatically describe changes introduced in the source
code of a program using natural language. Our method receives as input a set of
code commits, which contains both the modifications and message introduced by
a user. These two modalities are used to train an encoder-decoder
architecture. We evaluated our approach on twelve real world open source
projects from four different programming languages. Quantitative and
qualitative results showed that the proposed approach can generate feasible and
semantically sound descriptions not only in standard in-project settings, but
also in a cross-project setting. Comment: Accepted at ACL 201
Semantic Source Code Models Using Identifier Embeddings
The emergence of online open source repositories in recent years has led
to an explosion in the volume of openly available source code, coupled with
metadata that relate to a variety of software development activities. As a
result, in line with recent advances in machine learning research, software
maintenance activities are switching from symbolic formal methods to
data-driven methods. In this context, the rich semantics hidden in source code
identifiers provide opportunities for building semantic representations of code
which can assist tasks of code search and reuse. To this end, we deliver in the
form of pretrained vector space models, distributed code representations for
six popular programming languages, namely, Java, Python, PHP, C, C++, and C#.
The models are produced using fastText, a state-of-the-art library for learning
word representations. Each model is trained on data from a single programming
language; the code mined for producing all models amounts to over 13,000
repositories. We indicate dissimilarities between natural language and source
code, as well as variations in coding conventions between the different
programming languages we processed. We describe how these heterogeneities
guided the data preprocessing decisions we took and the selection of the
training parameters in the released models. Finally, we propose potential
applications of the models and discuss limitations of the models. Comment: 16th International Conference on Mining Software Repositories (MSR
2019): Data Showcase Track
Classifying Web Exploits with Topic Modeling
This short empirical paper investigates how well topic modeling and database
meta-data characteristics can classify web and other proof-of-concept (PoC)
exploits for publicly disclosed software vulnerabilities. By using a dataset
comprised of over 36 thousand PoC exploits, an accuracy rate of nearly 0.9 is
obtained in the empirical experiment. Text mining and topic modeling contribute
significantly to this classification performance. In addition to
these empirical results, the paper contributes to the research tradition of
enhancing software vulnerability information with text mining, providing also a
few scholarly observations about the potential for semi-automatic
classification of exploits in the existing tracking infrastructures. Comment: Proceedings of the 2017 28th International Workshop on Database and
Expert Systems Applications (DEXA).
http://ieeexplore.ieee.org/abstract/document/8049693