92 research outputs found

    Mining unstructured software data

    Our thesis is that the analysis of unstructured data supports software understanding and evolution analysis, and complements the data mined from structured sources. To this aim, we implemented the necessary toolset and investigated methods for exploring, exposing, and exploiting unstructured data. To validate our thesis, we focused on development email data. We found two main challenges in using it to support program comprehension and software development: the disconnection between emails and code artifacts, and the noisy and mixed-language nature of email content. We tackle these challenges by proposing novel approaches. First, we devise lightweight techniques for linking email data to code artifacts. We use these techniques to create a tool that supports program comprehension with email data, and to create a new set of email-based metrics that improve existing defect prediction approaches. Subsequently, we devise techniques for giving structure to the content of emails, and we use this structure to conduct novel software analyses that support program comprehension. In this dissertation we show that unstructured data, in the form of development emails, is a valuable addition to structured data and, if correctly mined, can be used successfully to support software engineering activities

    Software security during modern code review: The developer’s perspective

    To avoid software vulnerabilities, organizations are shifting security to earlier stages of software development, such as code review time. In this paper, we aim to understand the developers’ perspective on assessing software security during code review, the challenges they encounter, and the support that companies and projects provide. To this end, we conduct a two-step investigation: we interview 10 professional developers and survey 182 practitioners about software security assessment during code review. The outcome is an overview of how developers perceive software security during code review and a set of identified challenges. Our study revealed that most developers do not spontaneously report focusing on security issues during code review. Only after being asked about software security do developers state that they always consider it during review and acknowledge its importance. Most companies do not provide security training, yet still expect developers to ensure security during reviews. Accordingly, developers report the lack of training and security knowledge as the main challenges they face when checking for security issues. In addition, they struggle with third-party libraries and with identifying interactions between parts of code that could have security implications. Moreover, security may be disregarded during reviews due to developers’ assumptions about the security dynamic of the application they develop

    The effects of change decomposition on code review -- a controlled experiment

    Background: Code review is a cognitively demanding and time-consuming process. Previous qualitative studies hinted that decomposing change sets into multiple, internally coherent ones would improve the reviewing process. So far, the literature has provided no quantitative analysis of this hypothesis. Aims: (1) Quantitatively measure the effects of change decomposition on the outcome of code review (in terms of number of found defects, wrongly reported issues, suggested improvements, time, and understanding); (2) Qualitatively analyze how subjects approach the review and navigate the code, building knowledge and addressing existing issues, in large vs. decomposed changes. Method: Controlled experiment using the pull-based development model, involving 28 software developers drawn from professionals and graduate students. Results: Change decomposition leads to fewer wrongly reported issues and influences how subjects approach and conduct the review activity (by increasing context-seeking), yet impacts neither understanding of the change rationale nor the number of found defects. Conclusions: Change decomposition reduces the noise for subsequent data analyses but also significantly supports the tasks of the developers in charge of reviewing the changes. As such, commits belonging to different concepts should be separated, adopting this as a best practice in software engineering

    Toward Eliminating Hallucinations: GPT-based Explanatory AI for Intelligent Textbooks and Documentation

    Traditional explanatory resources, such as user manuals and textbooks, often contain content that may not cater to the diverse backgrounds and information needs of users. Yet, developing intuitive, user-centered methods to effectively explain complex or large amounts of information is still an open research challenge. In this paper we present ExplanatoryGPT, an approach we devised and implemented to transform textual documents into interactive, intelligent resources, capable of offering dynamic, personalized explanations. Our approach uses state-of-the-art question-answering technology to generate on-demand, expandable explanations, with the aim of allowing readers to efficiently navigate and comprehend static materials. ExplanatoryGPT integrates ChatGPT, a state-of-the-art language model, with Achinstein’s philosophical theory of explanations. By combining question generation and answer retrieval algorithms with ChatGPT, our method generates interactive, user-centered explanations, while mitigating common issues associated with ChatGPT, such as hallucinations and memory shortcomings. To showcase the effectiveness of our Explanatory AI, we conducted tests using a variety of sources, including a legal textbook and documentation of some health and financial software. Specifically, we provide several examples that illustrate how ExplanatoryGPT excels over ChatGPT in generating more precise explanations, accomplished through thoughtful macro-planning of explanation content. Notably, our approach also avoids the need to provide the entire context of the explanation as a prompt to ChatGPT, a process that is often not feasible due to common memory constraints

    Interpersonal Conflicts During Code Review

    Code review consists of manual inspection, discussion, and judgment of source code by developers other than the code's author. Due to discussions around competing ideas and group decision-making processes, interpersonal conflicts during code reviews are expected. This study systematically investigates how developers perceive code review conflicts and addresses interpersonal conflicts during code reviews as a theoretical construct. Through the thematic analysis of interviews conducted with 22 developers, we confirm that conflicts during code reviews are commonplace, anticipated, and seen as normal by developers. Even though conflicts do happen and carry a negative impact for the review, conflicts, if resolved constructively, can also create value and bring improvement. Moreover, the analysis provided insights on how strongly conflicts during code review and their context (i.e., code, developer, team, organization) are intertwined. Finally, there are aspects specific to code review conflicts that call for the research and application of customized conflict resolution and management techniques, some of which are discussed in this paper. Data and material: https://doi.org/10.5281/zenodo.584879

    The evolution of the code during review: an investigation on review changes

    Code review is a software engineering practice in which reviewers manually inspect the code written by a fellow developer and propose any change that is deemed necessary or useful. The main goal of code review is to improve the quality of the code under review. Despite the widespread use of code review, only a few studies have investigated its outcomes, for example, the code changes that happen to the code under review. The goal of this paper is to expand our knowledge on the outcome of code review while re-evaluating results from previous work. To this aim, we analyze changes that happened during the review process, which we define as review changes. Considering three popular open-source software projects, we investigate the types of review changes (based on existing taxonomies) and what triggers them; we also study which code factors in a code review are most related to the number of review changes. Our results show that the majority of changes relate to evolvability concerns, with a strong prevalence of documentation and structure changes at type-level. Furthermore, unlike past work, we found that the majority of review changes are not triggered by reviewers’ comments. Finally, we find that the number of review changes in a code review is related to the size of the initial patch as well as the new lines of code that it adds. However, other factors, such as lines deleted or the author of the review patchset, do not always show an empirically supported relationship with the number of changes

    A Security Perspective on Code Review: The Case of Chromium

    Modern Code Review (MCR) is an established software development process that aims to improve software quality. Although evidence shows that higher levels of review coverage relate to fewer post-release bugs, the effectiveness of MCR at specifically finding security issues remains unknown. We present work that aims to fill this gap by exploring the MCR process in the Chromium open source project. We manually analyzed large sets of registered (114 cases) and missed (71 cases) security issues by backtracking in the project's issue, review, and code histories. This enabled us to qualify MCR in Chromium from the security perspective from several angles: Are security issues being discussed frequently? What categories of security issues are often missed or found? What characteristics of code reviews appear relevant to the discovery rate? Within the cases we analyzed, MCR in Chromium addresses security issues at a rate of 1% of reviewers' comments. Chromium code reviews mostly tend to miss language-specific issues (e.g., C++ issues and buffer overflows) and domain-specific ones (such as Cross-Site Scripting); when code reviews do address issues, they mostly address those of the latter type. Initial evidence points to reviews conducted by more than 2 reviewers being more successful at finding security issues

    Workflow analysis of data science code in public GitHub repositories

    Despite the ubiquity of data science, we are far from rigorously understanding how coding in data science is performed. Even though the scientific literature has hinted at the iterative and explorative nature of data science coding, we need further empirical evidence to understand this practice and its workflows in detail. Such understanding is critical to recognise the needs of data scientists and, for instance, inform tooling support. To obtain a deeper understanding of the iterative and explorative nature of data science coding, we analysed 470 Jupyter notebooks publicly available in GitHub repositories. We focused on the extent to which data scientists transition between different types of data science activities, or steps (such as data preprocessing and modelling), as well as the frequency and co-occurrence of such transitions. For our analysis, we developed a dataset with the help of five data science experts, who manually annotated the data science steps for each code cell within the aforementioned 470 notebooks. Using the first-order Markov chain model, we extracted the transitions and analysed the transition probabilities between the different steps. In addition to providing deeper insights into the implementation practices of data science coding, our results provide evidence that the steps in a data science workflow are indeed iterative and reveal specific patterns. We also evaluated the use of the annotated dataset to train machine-learning classifiers to predict the data science step(s) of a given code cell. We investigate the representativeness of the classification by comparing the workflow analysis applied to (a) the predicted data set and (b) the data set labelled by experts, finding an F1-score of about 71% for the 10-class data science step prediction problem
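The transition analysis described in this abstract can be illustrated with a minimal sketch of a first-order Markov chain estimate: count how often one step label directly follows another, then normalise each row of counts. This is not the authors' implementation. The step labels, the sample sequences, and the function name `transition_probabilities` are assumptions made for illustration only; the abstract does not list the ten step classes.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order Markov transition probabilities.

    `sequences` is a list of label sequences, one per notebook, where each
    label is the (hypothetical) data science step assigned to a code cell.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair each cell's step with the step of the cell that follows it.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1

    probs = {}
    for current, next_counts in counts.items():
        total = sum(next_counts.values())
        # Normalise the outgoing counts so each row sums to 1.
        probs[current] = {nxt: n / total for nxt, n in next_counts.items()}
    return probs


# Hypothetical annotated notebooks (labels are illustrative, not the paper's).
notebooks = [
    ["load", "preprocess", "model", "evaluate"],
    ["load", "preprocess", "preprocess", "model", "preprocess", "model"],
]
probs = transition_probabilities(notebooks)

for step, row in sorted(probs.items()):
    print(step, dict(sorted(row.items())))
```

In this toy data, a nonzero probability for a backward transition such as "model" to "preprocess" is the kind of signal that would indicate an iterative rather than strictly linear workflow.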

    Visualising data science workflows to support third-party notebook comprehension: an empirical study

    Data science is an exploratory and iterative process that often leads to complex and unstructured code. This code is usually poorly documented and, consequently, hard to understand by a third party. In this paper, we first collect empirical evidence for the non-linearity of data science code from real-world Jupyter notebooks, confirming the need for new approaches that aid in data science code interaction and comprehension. Second, we propose a visualisation method that elucidates implicit workflow information in data science code and assists data scientists in navigating the so-called garden of forking paths in non-linear code. The visualisation also provides information such as the rationale and the identification of the data science pipeline step based on cell annotations. We conducted a user experiment with data scientists to evaluate the proposed method, assessing the influence of (i) different workflow visualisations and (ii) cell annotations on code comprehension. Our results show that visualising the exploration helps the users obtain an overview of the notebook, significantly improving code comprehension. Furthermore, our qualitative analysis provides more insights into the difficulties faced during data science code comprehension