4 research outputs found

    Towards Automatic Generation of Short Summaries of Commits

    Committing to a version control system means submitting a software change to the system. Each commit can have a message to describe the submission. Several approaches have been proposed to automatically generate the content of such messages. However, the quality of the automatically generated messages falls far short of what humans write. In studying the differences between auto-generated and human-written messages, we found that 82% of the human-written messages have only one sentence, while the automatically generated messages often have multiple lines. Furthermore, we found that commit messages often begin with a verb followed by a direct object. This finding inspired us to use a "verb+object" format in this paper to generate short commit summaries. We split the approach into two parts: verb generation and object generation. As a first step, we trained a classifier that maps a diff to a verb. We are seeking feedback from the community before we continue to work on generating direct objects for the commits. Comment: 4 pages, accepted in ICPC 2017 ERA Track
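The "verb+object" format described in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that the first word of a human-written summary line is the verb and the rest is the object phrase; the function name `split_summary` is hypothetical, no part-of-speech tagging is done, and this is not the authors' classifier, which predicts the verb from the diff.

```python
def split_summary(message):
    """Split a commit message's first line into (verb, object_phrase).

    Assumption for illustration: the first word is treated as the verb.
    """
    stripped = message.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    verb, _, obj = first_line.partition(" ")
    return verb.lower(), obj.strip()


# Example: only the summary line of a multi-line message is used.
pair = split_summary("Fix null pointer in parser\n\nLonger description here.")
```

Pairs produced this way could serve as verb labels when assembling training data, although the abstract does not say how the labels were obtained.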

    Toward the Automatic Classification of Self-Affirmed Refactoring

    The concept of Self-Affirmed Refactoring (SAR) was introduced to explore how developers document their refactoring activities in commit messages, i.e., developers' explicit documentation of refactoring operations intentionally introduced during a code change. In our previous study, we manually identified refactoring patterns and defined three main common quality improvement categories (internal quality attributes, external quality attributes, and code smells) by considering only refactoring-related commits. However, this approach depends heavily on the manual inspection of commit messages. In this paper, we propose a two-step approach that first identifies whether a commit describes developer-related refactoring events and then classifies it according to the common refactoring quality improvement categories. Specifically, we combine N-Gram TF-IDF feature selection with binary and multiclass classifiers to build a new model that automates the classification of refactorings based on their quality improvement categories. We challenge our model using a total of 2,867 commit messages extracted from well-engineered open-source Java projects. Our findings show that (1) our model is able to accurately classify SAR commits, outperforming the pattern-based and random classifier approaches and allowing the discovery of 40 more relevant SAR patterns, and (2) our model reaches an F-measure of up to 90% even with a relatively small training dataset.
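The N-gram TF-IDF features mentioned in this abstract could be computed roughly as in the sketch below. It is an assumption-laden illustration, not the authors' pipeline: it uses whitespace tokenization, unigrams plus bigrams, and the plain weighting tf × log(N/df); the function names `ngrams` and `tfidf` are hypothetical, and the binary and multiclass classifiers that would consume these features are omitted.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return the word n-grams (n = 1..n_max) of a lower-cased message."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(messages, n_max=2):
    """Return one {ngram: weight} dict per commit message."""
    docs = [Counter(ngrams(m, n_max)) for m in messages]
    # Document frequency: iterating a Counter yields each n-gram once per message.
    df = Counter(term for doc in docs for term in doc)
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({term: (count / total) * math.log(n_docs / df[term])
                        for term, count in doc.items()})
    return vectors


# Example: n-grams shared by every message receive weight 0.
vectors = tfidf(["refactor parser code", "fix parser bug"])
```

In a two-step setup like the one described, the same vectors would feed first a binary classifier (SAR or not) and then a multiclass classifier over the quality improvement categories.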

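    Toward a New Deep Learning-Based Approach for Classifying Source Code Changes by Maintenance Activity (original title: Vers une nouvelle approche basée sur l'apprentissage profond pour la classification des changements du code source par activités de maintenance)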

    Software development has a wealth of information in the form of the history of changes applied to software during its life cycle. Part of this history, publicly accessible from version control systems, is the subject of exploration and scientific analysis through mining software repositories (MSR), which aims to improve several aspects that stakeholders encounter during software development. In this work, we are interested in determining the types of maintenance activity present in a source code change. Several studies have addressed this subject by exploiting the information provided by the programmer, such as the message describing the changes made and the modified code in the form of added and removed lines. However, most of them assume that a change includes only one type of maintenance activity, which is not always true in practice. Also, in using textual data, these studies limit themselves to the message, which often includes only a description of the modified code and not the reason for the change. Additionally, their approaches are limited to projects that use the same programming language. In this study, we respond to these challenges by proposing a model that classifies changes by maintenance activity using deep learning models, which are also responsible for feature extraction, whether from textual information (message and issue description) or from the modified code, regardless of its programming language. We also provide a new dataset for this task to address another issue: the scarcity of available datasets. This dataset takes into account the fact that a change can belong to several classes of changes. The architecture of our model is composed of a pre-trained model that generates distributed representations of the textual data and a classifier in the form of a neural network, whose inputs are the output of the pre-trained model and the features of the modified code. Our approach, whose training is based on transfer learning, has given encouraging results not only on our dataset but also on the datasets of related work. Keywords: maintenance activities, version control systems, software repository mining, deep learning, transfer learning, distributed representation, classification.

    Software Design Change Artifacts Generation through Software Architectural Change Detection and Categorisation

    Software is designed, implemented, tested, and inspected solely by experts, unlike other engineering projects, which are mostly implemented by non-expert workers after engineers have designed them. Researchers and practitioners have linked software bugs, security holes, problematic integration of changes, hard-to-understand codebases, unwarranted mental pressure, and similar problems in software development and maintenance to inconsistent and complex design and to a lack of ways to easily understand what is going on and what to plan in a software system. The unavailability of the information and insights that development teams need to make good decisions makes these challenges worse. Therefore, extracting software design documents and other insightful information is essential to reduce the above-mentioned anomalies. Moreover, architectural design artifacts extraction is required to create the developer’s profile to be available to the market for many crucial scenarios. To that end, architectural change detection, categorization, and change description generation are crucial because they are the primary artifacts for tracing other software artifacts. However, it is not feasible for humans to analyze all the changes in a single release to detect changes and their impact, because doing so is time-consuming, laborious, costly, and inconsistent. In this thesis, we conduct six studies that address these challenges by automating architectural change information extraction and document generation, which could assist development and maintenance teams. In particular, (1) we detect architectural changes using lightweight techniques leveraging textual and codebase properties, (2) categorize them considering intelligent perspectives, and (3) generate design change documents by exploiting precise contexts of components’ relations and change purposes, which were previously unexplored.
Our experiment using 4000+ architectural change samples and 200+ design change documents suggests that our proposed approaches are promising in accuracy and scalable enough to deploy frequently. Our proposed change detection approach can detect up to 100% of the architectural change instances (and is very scalable). Our proposed change classifier’s F1 score is 70%, which is promising given the challenges. Finally, our proposed system can produce descriptive design change artifacts with 75% significance. Since most of our studies are foundational, our approaches and prepared datasets can be used as baselines for advancing research in design change information extraction and documentation.