15 research outputs found

    Delving into Commit-Issue Correlation to Enhance Commit Message Generation Models

    Full text link
    Commit message generation (CMG) is a challenging task in automated software engineering that aims to generate natural language descriptions of code changes for commits. Previous methods all start from the modified code snippets, outputting commit messages through template-based, retrieval-based, or learning-based models. While these methods can summarize what is modified from the perspective of code, they struggle to provide reasons for the commit. The correlation between commits and issues that could be a critical factor for generating rational commit messages is still unexplored. In this work, we delve into the correlation between commits and issues from the perspective of dataset and methodology. We construct the first dataset anchored on combining correlated commits and issues. The dataset consists of an unlabeled commit-issue parallel part and a labeled part in which each example is provided with human-annotated rational information in the issue. Furthermore, we propose \tool (\underline{Ex}traction, \underline{Gro}unding, \underline{Fi}ne-tuning), a novel paradigm that can introduce the correlation between commits and issues into the training phase of models. To evaluate whether it is effective, we perform comprehensive experiments with various state-of-the-art CMG models. The results show that compared with the original models, the performance of \tool-enhanced models is significantly improved.Comment: ASE2023 accepted pape

    Ccbert:Self-Supervised Code Change Representation Learning

    Get PDF
    Numerous code changes are made by developers in their daily work, and a superior representation of code changes is desired for effective code change analysis. Recently, Hoang et al. proposed CC2Vec, a neural network-based approach that learns a distributed representation of code changes to capture the semantic intent of the changes. Despite demonstrated effectiveness in multiple tasks, CC2Vec has several limitations: 1) it considers only coarse-grained information about code changes, and 2) it relies on log messages rather than the self-contained content of the code changes. In this work, we propose CCBERT (\underline{C}ode \underline{C}hange \underline{BERT}), a new Transformer-based pre-trained model that learns a generic representation of code changes based on a large-scale dataset containing massive unlabeled code changes. CCBERT is pre-trained on four proposed self-supervised objectives that are specialized for learning code change representations based on the contents of code changes. CCBERT perceives fine-grained code changes at the token level by learning from the old and new versions of the content, along with the edit actions. Our experiments demonstrate that CCBERT significantly outperforms CC2Vec or the state-of-the-art approaches of the downstream tasks by 7.7\%--14.0\% in terms of different metrics and tasks. CCBERT consistently outperforms large pre-trained code models, such as CodeBERT, while requiring 6--10×\times less training time, 5--30×\times less inference time, and 7.9×\times less GPU memory

    Automating Self Evaluations for Software Engineers

    Get PDF
    Software engineers frequently compose self-evaluations as part of employee perfor- mance reviews. These evaluations can be a key artifact for assessing a software engineer’s contributions to a team and organization, and for generating useful feed- back. Self-evaluations can be challenging because a) they can be time consuming, b) individuals may forget about important contributions especially when the review period is long such as a full year, c) some individuals can consciously or unconsciously overstate their contributions, and d) some individuals can be reluctant to describe their contributions for fear of appearing too proud [24]. UNBIASED, Useful New Basic Interactive Automated Self-Evaluation Demon- stration, is a web application designed to tackle the challenges of performing a self- evaluation by automatically gathering data from existing third party APIs, perform- ing an analysis on the data, and generating a self-evaluation starting point for soft- ware engineers to build off. The third party APIs currently supported are: Bitbucket, Gmail, Google Calendar, GitHub, and JIRA

    Aide-mémoire: Improving a Project’s Collective Memory via Pull Request–Issue Links

    Get PDF
    Links between pull request and the issues they address document and accelerate the development of a software project but are often omitted. We present a new tool, Aide-mémoire, to suggest such links when a developer submits a pull request or closes an issue, smoothly integrating into existing workflows. In contrast to previous state-of-the-art approaches that repair related commit histories, Aide-mémoire is designed for continuous, real-time, and long-term use, employing Mondrian forest to adapt over a project’s lifetime and continuously improve traceability. Aide-mémoire is tailored for two specific instances of the general traceability problem—namely, commit to issue and pull request to issue links, with a focus on the latter—and exploits data inherent to these two problems to outperform tools for general purpose link recovery. Our approach is online, language-agnostic, and scalable. We evaluate over a corpus of 213 projects and six programming languages, achieving a mean average precision of 0.95. Adopting Aide-mémoire is both efficient and effective: A programmer need only evaluate a single suggested link 94% of the time, and 16% of all discovered links were originally missed by developers

    Pitfalls and Guidelines for Using Time-Based Git Data

    Get PDF
    Many software engineering research papers rely on time-based data (e.g., commit timestamps, issue report creation/update/close dates, release dates). Like most real-world data however, time-based data is often dirty. To date, there are no studies that quantify how frequently such data is used by the software engineering research community, or investigate sources of and quantify how often such data is dirty. Depending on the research task and method used, including such dirty data could aect the research results. This paper presents an extended survey of papers that utilize time-based data, published in the Mining Software Repositories (MSR) conference series. Out of the 754 technical track and data papers published in MSR 2004{2021, we saw at least 290 (38%) papers utilized time-based data. We also observed that most time-based data used in research papers comes in the form of Git commits, often from GitHub. Based on those results, we then used the Boa and Software Heritage infrastructures to help identify and quantify several sources of dirty Git timestamp data. Finally we provide guidelines/best practices for researchers utilizing time-based data from Git repositories

    Changeset-based Retrieval of Source Code Artifacts for Bug Localization

    Get PDF
    Modern software development is extremely collaborative and agile, with unprecedented speed and scale of activity. Popular trends like continuous delivery and continuous deployment aim at building, fixing, and releasing software with greater speed and frequency. Bug localization, which aims to automatically localize bug reports to relevant software artifacts, has the potential to improve software developer efficiency by reducing the time spent on debugging and examining code. To date, this problem has been primarily addressed by applying information retrieval techniques based on static code elements, which are intrinsically unable to reflect how software evolves over time. Furthermore, as prior approaches frequently rely on exact term matching to measure relatedness between a bug report and a software artifact, they are prone to be affected by the lexical gap that exists between natural and programming language. This thesis explores using software changes (i.e., changesets), instead of static code elements, as the primary data unit to construct an information retrieval model toward bug localization. Changesets, which represent the differences between two consecutive versions of the source code, provide a natural representation of a software change, and allow to capture both the semantics of the source code, and the semantics of the code modification. To bridge the lexical gap between source code and natural language, this thesis investigates using topic modeling and deep learning architectures that enable creating semantically rich data representation with the goal of identifying latent connection between bug reports and source code. To show the feasibility of the proposed approaches, this thesis also investigates practical aspects related to using a bug localization tool, such retrieval delay and training data availability. The results indicate that the proposed techniques effectively leverage historical data about bugs and their related source code components to improve retrieval accuracy, especially for bug reports that are expressed in natural language, with little to no explicit code references. Further improvement in accuracy is observed when the size of the training dataset is increased through data augmentation and data balancing strategies proposed in this thesis, although depending on the model architecture the magnitude of the improvement varies. In terms of retrieval delay, the results indicate that the proposed deep learning architecture significantly outperforms prior work, and scales up with respect to search space size

    Improving Software Project Health Using Machine Learning

    Get PDF
    In recent years, systems that would previously live on different platforms have been integrated under a single umbrella. The increased use of GitHub, which offers pull-requests, issue trackingand version history, and its integration with other solutions such as Gerrit, or Travis, as well as theresponse from competitors, created development environments that favour agile methodologiesby increasingly automating non-coding tasks: automated build systems, automated issue triagingetc. In essence, source-code hosting platforms shifted to continuous integration/continuousdelivery (CI/CD) as a service. This facilitated a shift in development paradigms, adherents ofagile methodology can now adopt a CI/CD infrastructure more easily. This has also created large,publicly accessible sources of source-code together with related project artefacts: GHTorrent andsimilar datasets now offer programmatic access to the whole of GitHub. Project health encompasses traceability, documentation, adherence to coding conventions,tasks that reduce maintenance costs and increase accountability, but may not directly impactfeatures. Overfocus on health can slow velocity (new feature delivery) so the Agile Manifestosuggests developers should travel light — forgo tasks focused on a project health in favourof higher feature velocity. Obviously, injudiciously following this suggestion can undermine aproject’s chances for success. Simultaneously, this shift to CI/CD has allowed the proliferation of Natural Language orNatural Language and Formal Language textual artefacts that are programmatically accessible:GitHub and their competitors allow API access to their infrastructure to enable the creation ofCI/CD bots. This suggests that approaches from Natural Language Processing and MachineLearning are now feasible and indeed desirable. This thesis aims to (semi-)automate tasks forthis new paradigm and its attendant infrastructure by bringing to the foreground the relevant NLPand ML techniques. Under this umbrella, I focus on three synergistic tasks from this domain: (1) improving theissue-pull-request traceability, which can aid existing systems to automatically curate the issuebacklog as pull-requests are merged; (2) untangling commits in a version history, which canaid the beforementioned traceability task as well as improve the usability of determining a faultintroducing commit, or cherry-picking via tools such as git bisect; (3) mixed-text parsing, to allowbetter API mining and open new avenues for project-specific code-recommendation tools
    corecore