Learning natural coding conventions
Coding conventions are ubiquitous in software engineering practice. Maintaining a uniform
coding style allows software development teams to communicate through code by
making the code clear and, thus, readable and maintainable—two important properties
of good code since developers spend the majority of their time maintaining software
systems. This dissertation introduces a set of probabilistic machine learning models
of source code that learn coding conventions directly from source code written in a
mostly conventional style. This alleviates the coding convention enforcement problem,
in which conventions must first be formulated as clear, unambiguous rules and then
implemented in order to be enforced, a tedious and costly process.
First, we introduce the problem of inferring a variable’s name given its usage context
and address this problem by creating Naturalize, a machine learning framework
that learns to suggest conventional variable names. Two machine learning models, a
simple n-gram language model and a specialized neural log-bilinear context model, are
trained to understand the role and function of each variable and to suggest new, stylistically
consistent variable names. The neural log-bilinear model can even suggest previously
unseen names by composing them from subtokens (i.e. sub-components of code identifiers).
The models achieve 90% accuracy when suggesting variable
names at the 20% most confident locations, rendering the suggestion system usable
in practice.
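The n-gram approach can be illustrated with a minimal sketch (an illustrative reconstruction, not the Naturalize implementation; the training function, the add-alpha smoothing, and the `<VAR>` placeholder are assumptions): a smoothed n-gram language model is trained on tokenized code, and each candidate name is scored by substituting it into the variable's usage sites and asking the model which substitution looks most conventional.

```python
import math
from collections import defaultdict

def train_ngram(token_streams, n=3):
    """Count n-gram and context occurrences over tokenized source files."""
    counts = defaultdict(int)
    context_counts = defaultdict(int)
    for tokens in token_streams:
        padded = ["<s>"] * (n - 1) + tokens
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return counts, context_counts

def score(tokens, counts, context_counts, n=3, alpha=1.0, vocab_size=1000):
    """Log-probability of a token sequence under an add-alpha smoothed n-gram model."""
    padded = ["<s>"] * (n - 1) + tokens
    logp = 0.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        num = counts[context + (padded[i],)] + alpha
        den = context_counts[context] + alpha * vocab_size
        logp += math.log(num / den)
    return logp

def suggest_name(usage_template, candidates, model):
    """Rank candidate variable names by the language-model score of the
    code with each candidate substituted at the <VAR> usage sites."""
    counts, ctx = model
    scored = [(score([t if t != "<VAR>" else c for t in usage_template],
                     counts, ctx), c)
              for c in candidates]
    return max(scored)[1]
```

Trained on a corpus where loop counters are conventionally named `i`, the model prefers `i` over an unconventional name for the same usage context.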
We then turn our attention to the significantly harder method naming problem.
Learning to name methods by looking only at the code tokens within their body requires
a good understanding of the semantics of the code contained in a single method.
To achieve this, we introduce a novel neural convolutional attention network that learns
to generate the name of a method by sequentially predicting its subtokens. This is
achieved by focusing on different parts of the code and potentially directly using body
(sub)tokens even when they have never been seen before. This model achieves an F1
score of 51% on the top five suggestions when naming methods of real-world open-source
projects.
Learning naming conventions uses the syntactic structure of the code
to infer names that implicitly relate to code semantics. However, syntactic similarities
and differences can obscure code semantics. Therefore, to capture features of semantic
operations with machine learning, we need methods that learn semantic continuous
logical representations. To achieve this ambitious goal, we focus our investigation on
logic and algebraic symbolic expressions and design a neural equivalence network architecture
that learns semantic vector representations of expressions in a syntax-driven
way, while retaining only semantic information. We show that equivalence networks learn
significantly better semantic vector representations than existing neural
network architectures.
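As a point of reference for what semantic equivalence means here, a brute-force baseline (not the equivalence-network architecture itself, which learns fixed-size continuous vectors) can decide equivalence of small boolean expressions by exhaustive evaluation; the helper name `truth_table` and the example expressions are illustrative:

```python
from itertools import product

def truth_table(expr_fn, var_names):
    """Semantic signature of a boolean expression: its value under every
    assignment of its variables. Two expressions are equivalent iff their
    signatures coincide, regardless of syntactic differences."""
    return tuple(expr_fn(dict(zip(var_names, vals)))
                 for vals in product([False, True], repeat=len(var_names)))

# Syntactically different, semantically identical (absorption law):
e1 = lambda env: env["a"] and (env["a"] or env["b"])
e2 = lambda env: env["a"]
```

For n variables this costs 2^n evaluations; the learned vector representations aim to capture the same equivalence relation without exhaustive enumeration.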
Finally, we present an unsupervised machine learning model for mining syntactic
and semantic code idioms. Code idioms are conventional “mental chunks” of code that
serve a single semantic purpose and are commonly used by practitioners. To achieve
this, we employ Bayesian nonparametric inference on tree substitution grammars. We
present a wide range of evidence that the resulting syntactic idioms are meaningful,
demonstrating that they do indeed recur across software projects and that they occur
more frequently in illustrative code examples collected from a Q&A site. These syntactic
idioms can be used as a form of automatic documentation of coding practices
of a programming language or an API. We also mine semantic loop idioms, i.e. highly
abstracted but semantics-preserving idioms of loop operations. We show that semantic
idioms provide data-driven guidance during the creation of software engineering tools
by mining common semantic patterns, such as candidate refactoring locations. This
gives tool, API, and language designers data-based evidence about general, domain-specific,
and project-specific coding patterns; instead of relying solely on their intuition, they
can use semantic idioms to achieve greater coverage of their tool, new API, or language
feature. We demonstrate this by creating a tool that suggests refactoring loops into
functional constructs in LINQ. Semantic loop idioms also provide data-driven evidence
for introducing new APIs or programming language features.
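The kind of rewrite such a tool suggests can be sketched in Python (LINQ itself is a C# facility; the filter-map-accumulate loop below and its functional equivalent are illustrative stand-ins for a mined loop idiom and its suggested refactoring):

```python
def sum_of_squares_loop(xs):
    # Imperative loop idiom: filter, map, accumulate.
    total = 0
    for x in xs:
        if x % 2 == 0:
            total += x * x
    return total

def sum_of_squares_functional(xs):
    # The same idiom as a functional pipeline, analogous to
    # LINQ's Where / Select / Sum chain.
    return sum(x * x for x in xs if x % 2 == 0)
```

A refactoring tool built on mined loop idioms would match the imperative form and propose the pipeline form, which makes the filter and map steps explicit.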
Using Historical Data From Source Code Revision Histories to Detect Source Code Properties
In this dissertation, we describe several techniques for using historical data mined from the source code revision histories of software projects to determine important properties of the source code. These properties are then used to improve the results of various bug-finding techniques as well as to provide documentation to the developer.

We describe a method to mine source code revision histories, in this case CVS repositories, to extract relevant information to be fed into a static source code bug finder in order to improve the results generated by the bug-finding tool. We apply this technique to the CVS repositories of two widely used open-source software projects, Apache httpd and Wine. We show how source code revision history can be used to reduce false positives from a static source code checker that identifies the misuse of values returned from a function call.

A method of mining source code revision histories for the purpose of learning about project-specific idioms is then discussed. Specifically, we show how source code revision history can be used to identify patterns of calling sequences that describe how functions in the software should be used in relation to each other. With this data, we are able to find bugs in the source code, document API usage, and identify refactoring events. In short, this dissertation shows that it is possible to automatically determine meaningful properties of the source code by studying source code changes cataloged in the software revision history.
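The calling-sequence idea can be sketched as follows (a minimal illustration, not the dissertation's actual technique; the commit representation, the `min_support` threshold, and the function names are assumptions): functions whose calls are repeatedly added together across revisions form candidate usage rules, and code that uses one side of a rule without the other is flagged for review.

```python
from collections import Counter
from itertools import combinations

def mine_call_pairs(commits, min_support=2):
    """Each commit is modeled as the set of function calls added in that
    revision. Pairs of calls that co-occur in at least min_support commits
    become candidate usage rules (e.g. lock() should pair with unlock())."""
    pair_counts = Counter()
    for calls in commits:
        for pair in combinations(sorted(calls), 2):
            pair_counts[pair] += 1
    return {p for p, c in pair_counts.items() if c >= min_support}

def flag_violations(file_calls, rules):
    """Flag uses of one function from a mined pair without its partner."""
    violations = []
    for a, b in rules:
        if a in file_calls and b not in file_calls:
            violations.append((a, b))
        elif b in file_calls and a not in file_calls:
            violations.append((b, a))
    return violations
```

Running the miner over a toy history where `lock` and `unlock` are repeatedly added together yields the rule `(lock, unlock)`, and a file calling `lock` alone is then reported as a candidate bug.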
On the Feasibility of Malware Authorship Attribution
There are many occasions in which the security community is interested in
discovering the authorship of malware binaries, either for digital forensics
analysis of malware corpora or for thwarting live threats of malware invasion.
Such a discovery of authorship might be possible due to stylistic features
inherent to software code written by human programmers. Existing studies of
authorship attribution of general-purpose software mainly focus on source code,
typically relying on the style of the program and its environment. However,
those features critically depend on the availability of the program source
code, which is usually not the case when dealing with malware binaries. Such
program binaries often do not retain many semantic or stylistic features due to
the compilation process. Therefore, authorship attribution in the domain of
malware binaries based on features and styles that will survive the compilation
process is challenging. This paper surveys the state of the art in this
literature. Further, we analyze the features involved in those techniques. By
using a case study, we identify features that can survive the compilation
process. Finally, we analyze existing works on binary authorship attribution
and study their applicability to real malware binaries.
Comment: FPS 201
Mining Application-Specific Coding Patterns for Software Maintenance
LATE '08 Proceedings of the 2008 AOSD workshop on Linking aspect technology and evolution