Neural Networks for Modeling Source Code Edits
Programming languages are emerging as a challenging and interesting domain
for machine learning. A core task, which has received significant attention in
recent years, is building generative models of source code. However, to our
knowledge, previous generative models have always been framed in terms of
generating static snapshots of code. In this work, we instead treat source code
as a dynamic object and tackle the problem of modeling the edits that software
developers make to source code files. This requires extracting intent from
previous edits and leveraging it to generate subsequent edits. We develop
several neural networks and use synthetic data to test their ability to learn
challenging edit patterns that require strong generalization. We then collect
and train our models on a large-scale dataset of Google source code, consisting
of millions of fine-grained edits from thousands of Python developers. From the
modeling perspective, our main conclusion is that a new composition of
attentional and pointer network components provides the best overall
performance and scalability. From the application perspective, our results
provide preliminary evidence of the feasibility of developing tools that learn
to predict future edits.

Comment: Deanonymized version of ICLR 2019 submission
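The paper's code is not reproduced here, but the "pointer network component" it credits can be illustrated with a minimal numpy sketch: additive attention whose softmax over input positions is itself the output distribution, letting the model point at a location in the edit history. All names (`pointer_attention`, the weight matrices `W1`, `W2`, `v`) and dimensions are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def pointer_attention(states, query, W1, W2, v):
    """Additive (Bahdanau-style) attention used as a pointer:
    the softmax over input positions is the output distribution,
    so the model 'points' at a position in the edit history."""
    scores = np.array([v @ np.tanh(W1 @ s + W2 @ query) for s in states])
    return softmax(scores)

# Toy setup: 5 encoded edit-history positions, one decoder query.
rng = np.random.default_rng(0)
d = 4
states = [rng.standard_normal(d) for _ in range(5)]  # encoder states
query = rng.standard_normal(d)                       # decoder state
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
v = rng.standard_normal(d)

dist = pointer_attention(states, query, W1, W2, v)  # one prob per position
```

Because the output vocabulary is the set of input positions rather than a fixed token set, this component scales with the length of the edit sequence, which is consistent with the scalability claim in the abstract.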
Latent Variable Models for Predicting File Dependencies in Large-Scale Software Development
When software developers modify one or more files in a large code base, they must also identify and update other related files. Many file dependencies can be detected by mining the development history of the code base: in essence, groups of related files are revealed by the logs of previous workflows. From data of this form, we show how to detect dependent files by solving a problem in binary matrix completion. We explore different latent variable models (LVMs) for this problem, including Bernoulli mixture models, exponential family PCA, restricted Boltzmann machines, and fully Bayesian approaches. We evaluate these models on the development histories of three large, open-source software systems: Mozilla Firefox, Eclipse Subversive, and Gimp. In all of these applications, we find that LVMs improve the performance of related file prediction over current leading methods.
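The simplest of the latent variable models listed above, a Bernoulli mixture, can be sketched as follows: each commit is a binary row over files, EM fits mixture components, and a partially observed commit is completed by averaging component prototypes under the posterior responsibilities. This is a generic EM sketch under assumed names (`fit_bernoulli_mixture`, `predict_missing`) and toy data, not the paper's implementation or evaluation setup.

```python
import numpy as np

def fit_bernoulli_mixture(X, K, iters=50, seed=0, eps=1e-9):
    """EM for a K-component Bernoulli mixture over binary rows of X.
    X[n, d] = 1 if commit n touched file d."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                    # mixing weights
    mu = rng.uniform(0.25, 0.75, size=(K, D))   # per-component touch probs
    for _ in range(iters):
        # E-step: log responsibilities, computed stably
        log_p = (np.log(pi + eps)[None, :]
                 + X @ np.log(mu.T + eps)
                 + (1 - X) @ np.log(1 - mu.T + eps))
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: reweighted counts, lightly smoothed
        nk = r.sum(axis=0)
        pi = nk / N
        mu = (r.T @ X + eps) / (nk[:, None] + 2 * eps)
    return pi, mu

def predict_missing(x_obs, observed, pi, mu, eps=1e-9):
    """p(file d touched | observed entries of the commit), for every d."""
    ll = (x_obs[observed] * np.log(mu[:, observed] + eps)
          + (1 - x_obs[observed]) * np.log(1 - mu[:, observed] + eps)).sum(axis=1)
    log_r = np.log(pi + eps) + ll
    log_r -= log_r.max()
    r = np.exp(log_r)
    r /= r.sum()
    return r @ mu  # posterior-weighted average of component prototypes

# Toy history: commits touch either files {0, 1} or files {2, 3}.
X = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1],
              [0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
pi, mu = fit_bernoulli_mixture(X, K=2)

# A new commit where only file 0 is known to be touched:
probs = predict_missing(np.array([1.0, 0, 0, 0]), np.array([0]), pi, mu)
```

On this toy history, observing file 0 should make file 1 the most probable related file, which is the "related file prediction" task the abstract evaluates. The richer models in the paper (exponential family PCA, RBMs, fully Bayesian variants) replace the mixture, but the completion interface is the same.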