20 research outputs found
Towards Automatic Generation of Short Summaries of Commits
Committing to a version control system means submitting a software change to
the system. Each commit can have a message to describe the submission. Several
approaches have been proposed to automatically generate the content of such
messages. However, the quality of the automatically generated messages falls
far short of what humans write. In studying the differences between
auto-generated and human-written messages, we found that 82% of the
human-written messages have only one sentence, while the automatically
generated messages often have multiple lines. Furthermore, we found that the
commit messages often begin with a verb followed by a direct object. This
finding inspired us to use a "verb+object" format in this paper to generate
short commit summaries. We split the approach into two parts: verb generation
and object generation. As a first step, we trained a classifier that maps a
diff to a verb. We are seeking feedback from the community before we continue
to work on generating direct objects for the commits.
Comment: 4 pages, accepted in ICPC 2017 ERA Track
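The verb-generation step described above can be illustrated with a toy heuristic baseline. This is an illustrative sketch, not the paper's trained classifier; the rules below are assumptions about what simple diff features might signal:

```python
# Toy baseline for the verb-generation half of a "verb + object" commit
# summary: guess a leading verb from simple features of a unified diff.
# These heuristics are illustrative assumptions, not the paper's classifier.
def predict_verb(diff):
    """Guess the summary verb for a unified diff string."""
    lines = diff.splitlines()
    # Count added/removed lines, skipping the +++/--- file headers.
    added = sum(1 for l in lines if l.startswith("+") and not l.startswith("+++"))
    removed = sum(1 for l in lines if l.startswith("-") and not l.startswith("---"))
    if added and not removed:
        return "add"        # purely additive change
    if removed and not added:
        return "remove"     # purely deletive change
    if "fix" in diff.lower() or "bug" in diff.lower():
        return "fix"        # mixed change mentioning a bug
    return "update"         # fallback for mixed changes

predict_verb("+++ b/util.py\n+def helper():\n+    pass")  # -> "add"
```

A learned classifier would replace these hand-written rules with features extracted from the diff, but the input/output contract (diff in, verb out) is the same.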
ARENA: An Approach for the Automated Generation of Release Notes
Release notes document corrections, enhancements, and, in general, changes that were implemented in a new release of a software project. They are usually created manually and may include hundreds of different items, such as descriptions of new features, bug fixes, structural changes, new or deprecated APIs, and changes to software licenses. Thus, producing them can be a time-consuming and daunting task. This paper describes ARENA (Automatic RElease Notes generAtor), an approach for the automatic generation of release notes. ARENA extracts changes from the source code, summarizes them, and integrates them with information from versioning systems and issue trackers. ARENA was designed based on the manual analysis of 990 existing release notes. To evaluate the quality of the release notes automatically generated by ARENA, we performed four empirical studies involving a total of 56 participants (48 professional developers and eight students). The obtained results indicate that the generated release notes are very good approximations of the ones manually produced by developers and often include important information that is missing in the manually created release notes.
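The three stages the abstract names (extract changes, summarize them, integrate issue-tracker information) can be sketched schematically. This is an assumed toy pipeline, not ARENA's actual implementation; the data shapes and the trivial summarization step are placeholders:

```python
# Schematic release-note pipeline (an assumption, not ARENA's code):
# each extracted code change is summarized and merged with issue-tracker data.
def generate_release_notes(code_changes, issue_entries):
    """Toy generator: code_changes is a list of dicts describing extracted
    changes; issue_entries maps issue ids to tracker descriptions."""
    notes = []
    for change in code_changes:
        # Summarization step: a real tool would produce a natural-language
        # summary; here we just format the change kind and affected element.
        summary = f"{change['kind']}: {change['element']}"
        # Integration step: attach linked issue-tracker information, if any.
        issue = issue_entries.get(change.get("issue_id"))
        if issue:
            summary += f" (fixes {issue})"
        notes.append(summary)
    return notes

generate_release_notes(
    [{"kind": "New feature", "element": "ExportDialog", "issue_id": 42}],
    {42: "issue #42: add CSV export"},
)
# -> ["New feature: ExportDialog (fixes issue #42: add CSV export)"]
```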
Eye of the Mind: Image Processing for Social Coding
Developers are increasingly sharing images in social coding environments
alongside the growth in visual interactions within social networks. The
analysis of the ratio of textual to visual content in Mozilla's change
requests and in StackOverflow programming Q/As revealed a steady increase in
image sharing over the past five years. Developers' shared images are
meaningful and provide information complementary to their associated text.
Often, the shared images are essential to understanding the
change requests, questions, or the responses submitted. Relying on these
observations, we delve into the potential of automatic completion of textual
software artifacts with visual content.Comment: This is the author's version of ICSE 2020 pape
Exploring and Evaluating Personalized Models for Code Generation
Large Transformer models achieved the state-of-the-art status for Natural
Language Understanding tasks and are increasingly becoming the baseline model
architecture for modeling source code. Transformers are usually pre-trained on
large unsupervised corpora, learning token representations and transformations
relevant to modeling generally available text, and are then fine-tuned on a
particular downstream task of interest. While fine-tuning is a tried-and-true
method for adapting a model to a new domain -- for example, question-answering
on a given topic -- generalization remains an ongoing challenge. In this
paper, we explore and evaluate transformer model fine-tuning for
personalization. In the context of generating unit tests for Java methods, we
evaluate learning to personalize to a specific software project using several
personalization techniques. We consider three key approaches: (i) custom
fine-tuning, which allows all the model parameters to be tuned; (ii)
lightweight fine-tuning, which freezes most of the model's parameters, allowing
tuning of the token embeddings and softmax layer only or the final layer alone;
(iii) prefix tuning, which keeps model parameters frozen, but optimizes a small
project-specific prefix vector. Each of these techniques offers a trade-off in
total compute cost and predictive performance, which we evaluate by code and
task-specific metrics, training time, and total computational operations. We
compare these fine-tuning strategies for code generation and discuss the
potential generalization and cost benefits of each in various deployment
scenarios.
Comment: Accepted to the ACM Joint European Software Engineering Conference
and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022),
Industry Track, Singapore, November 14-18, 2022, to appear; 9 pages
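The three personalization strategies differ in which parameters stay trainable. The sketch below is a schematic illustration of that distinction, not the authors' implementation; the parameter names are hypothetical:

```python
# Schematic view of the three personalization strategies: which named model
# parameters each one updates. Parameter names are hypothetical placeholders,
# not real Transformer checkpoint keys.
def trainable_parameters(param_names, strategy):
    """Return the subset of parameter names a strategy would fine-tune."""
    if strategy == "custom":
        # (i) Custom fine-tuning: every parameter is tuned.
        return set(param_names)
    if strategy == "lightweight":
        # (ii) Lightweight fine-tuning: freeze most parameters, tune only
        # the token embeddings and the softmax/output layer.
        return {n for n in param_names if "embedding" in n or "softmax" in n}
    if strategy == "prefix":
        # (iii) Prefix tuning: the model stays frozen; only a small
        # project-specific prefix vector is optimized.
        return {n for n in param_names if n == "prefix_vector"}
    raise ValueError(f"unknown strategy: {strategy}")

params = ["token_embedding", "layer_0.attention", "layer_0.ffn",
          "softmax_out", "prefix_vector"]
trainable_parameters(params, "lightweight")  # -> {"token_embedding", "softmax_out"}
```

The trade-off the paper evaluates follows directly from these set sizes: fewer trainable parameters mean lower compute and storage cost per project, at a possible cost in predictive performance.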
Label Smoothing Improves Neural Source Code Summarization
Label smoothing is a regularization technique for neural networks. Normally
neural models are trained to an output distribution that is a vector with a
single 1 for the correct prediction, and 0 for all other elements. Label
smoothing converts the correct prediction location to something slightly less
than 1, then distributes the remainder to the other elements such that they are
slightly greater than 0. A conceptual explanation behind label smoothing is
that it helps prevent a neural model from becoming "overconfident" by forcing
it to consider alternatives, even if only slightly. Label smoothing has been
shown to help several areas of language generation, yet typically requires
considerable tuning and testing to achieve the optimal results. This tuning and
testing has not been reported for neural source code summarization - a growing
research area in software engineering that seeks to generate natural language
descriptions of source code behavior. In this paper, we demonstrate the effect
of label smoothing on several baselines in neural code summarization, and
conduct an experiment to find good parameters for label smoothing and make
recommendations for its use.
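The mechanism described above can be written out directly. This is a minimal sketch of standard label smoothing, assuming a uniform redistribution of the smoothing mass over the incorrect classes:

```python
# Minimal label smoothing: the correct class gets 1 - epsilon, and epsilon
# is spread evenly over the remaining classes, so every element of the
# target distribution is slightly greater than 0 and it still sums to 1.
def smooth_labels(correct_index, num_classes, epsilon=0.1):
    """Return a smoothed target distribution instead of a one-hot vector."""
    off_value = epsilon / (num_classes - 1)   # small mass per wrong class
    target = [off_value] * num_classes
    target[correct_index] = 1.0 - epsilon     # slightly less than 1
    return target

smooth_labels(2, 4)  # correct class 2 of 4 -> [0.0333..., 0.0333..., 0.9, 0.0333...]
```

The epsilon value here is the parameter the paper's experiment tunes; in a training loop this smoothed vector replaces the one-hot target in the cross-entropy loss.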
Data-Driven Decisions and Actions in Today’s Software Development
Today’s software development is all about data: data about the software product itself, about the process and its different stages, about the customers and markets, about the development, the testing, the integration, the deployment, or the runtime aspects in the cloud. We use static and dynamic data of various kinds and quantities to analyze market feedback, feature impact, code quality, architectural design alternatives, or effects of performance optimizations. Development environments are no longer limited to IDEs in a desktop application or the like but span the Internet using live programming environments such as Cloud9 or large-volume repositories such as BitBucket, GitHub, GitLab, or StackOverflow. Software development has become “live” in the cloud, be it the coding, the testing, or the experimentation with different product options on the Internet. The inherent complexity puts a further burden on developers, since they need to stay alert when constantly switching between tasks in different phases. Research has been analyzing the development process, its data and stakeholders, for decades and is working on various tools that can help developers in their daily tasks to improve the quality of their work and their productivity. In this chapter, we critically reflect on the challenges faced by developers in a typical release cycle, identify inherent problems of the individual phases, and present the current state of the research that can help overcome these issues.
Image-based Communication on Social Coding Platforms
Visual content in the form of images and videos has taken over
general-purpose social networks in a variety of ways, streamlining and
enriching online communications. We are interested to understand if and to what
extent the use of images is popular and helpful in social coding platforms. We
mined nine years of data from two popular software developers' platforms: the
Mozilla issue tracking system, i.e., Bugzilla, and the most well-known platform
for developers' Q/A, i.e., Stack Overflow. We further triangulated and extended
our mining results by performing a survey with 168 software developers. We
observed that, between 2013 and 2022, the number of posts containing image data
on Bugzilla and Stack Overflow doubled. Furthermore, we found that sharing
images makes other developers engage more and faster with the content. In the
majority of cases in which an image is included in a developer's post, the
information in that image is complementary to the text provided. Finally, our
results showed that when an image is shared, understanding the content without
the information in the image is unlikely for 86.9% of the cases. Based on
these observations, we discuss the importance of considering visual content
when analyzing developers and designing automation tools.