63 research outputs found

    Aide-mĂ©moire: Improving a Project’s Collective Memory via Pull Request–Issue Links

    Get PDF
    Links between pull request and the issues they address document and accelerate the development of a software project but are often omitted. We present a new tool, Aide-mĂ©moire, to suggest such links when a developer submits a pull request or closes an issue, smoothly integrating into existing workflows. In contrast to previous state-of-the-art approaches that repair related commit histories, Aide-mĂ©moire is designed for continuous, real-time, and long-term use, employing Mondrian forest to adapt over a project’s lifetime and continuously improve traceability. Aide-mĂ©moire is tailored for two specific instances of the general traceability problem—namely, commit to issue and pull request to issue links, with a focus on the latter—and exploits data inherent to these two problems to outperform tools for general purpose link recovery. Our approach is online, language-agnostic, and scalable. We evaluate over a corpus of 213 projects and six programming languages, achieving a mean average precision of 0.95. Adopting Aide-mĂ©moire is both efficient and effective: A programmer need only evaluate a single suggested link 94% of the time, and 16% of all discovered links were originally missed by developers

    Avatud lĂ€htekoodiga tarkvaraprojektide vearaportite ja tehniliste sĂ”ltuvuste haldamise analĂŒĂŒsimine

    Get PDF
    NĂŒĂŒdisaegses tarkvaraarenduses kasutatakse avatud lĂ€htekoodiga tarkvara komponente, et vĂ€hendada korratava töö hulka. Tarkvaraarendajad lisavad vaba lĂ€htekoodiga komponente oma projektidesse, omamata ĂŒlevaadet kasutatud komponentide arendamisest ja hooldamisest. Selle töö eesmĂ€rk on analĂŒĂŒsida tarkvaraprojektide vearaporteid ja sĂ”ltuvuste haldamist ning arendada vĂ€lja kohased meetodid. Tarkvaraprojektides kasutatakse töö organiseerimiseks veahaldussĂŒsteeme, mille abil hallatakse tĂ¶Ă¶ĂŒlesandeid, vearaporteid ja uusi kasutajanĂ”udeid. Enamat kui 4000 avatud lĂ€htekoodiga projekti analĂŒĂŒsides selgus, et paljud vearaportid jÀÀvad pikaks ajaks lahendamata. Muu hulgas vĂ”ib nii ka mĂ”ni kriitiline turvaviga parandamata jÀÀda. Doktoritöös arendatakse vĂ€lja meetod, mis vĂ”imaldab automaatselt hinnata vearaporti lahendamiseks kuluvat aega. Meetod pĂ”hineb veahaldussĂŒsteemi talletunud andmete analĂŒĂŒsil. Vearaporti eluaja hindamine aitab projektiosalistel prioriseerida tĂ¶Ă¶ĂŒlesandeid ja planeerida ressursse. Töö teises osas uuritakse, kuidas avatud lĂ€htekoodiga projektide koodis kolmanda poole komponente kasutatakse. Tarkvaraarendajad kasutavad varem vĂ€ljaarendatud komponente, et kiirendada arendust ja vĂ€hendada korratava töö hulka. Samamoodi kasutavad spetsiifilised komponendid veel omakorda teisi komponente, mislĂ€bi moodustub komponentide vaheliste seoste kaudu sĂ”ltuvuslik vĂ”rgustik. Selles doktoritöös analĂŒĂŒsitakse sĂ”ltuvuste vĂ”rgustikku populaarsete programmeerimiskeelte nĂ€idetel. Töö kĂ€igus arendatud meetod on rakendatav sĂ”ltuvuste vĂ”rgustiku struktuuri ja kasvu analĂŒĂŒsimiseks. Töös demonstreeritakse, kuidas vĂ”rgustiku struktuuri analĂŒĂŒsi abil saab hinnata tarkvaraprojektide riski hĂ”lmata sĂ”ltuvusahela kaudu mĂ”ni turvaviga. Doktoritöös arendatud meetodid ja tulemused aitavad avatud lĂ€htekoodiga projektide vearaportite ja tehniliste sĂ”ltuvuste haldamise praktikat lĂ€bipaistvamaks muuta.Modern software development relies on open-source software to facilitate reuse and reduce redundant work. Software developers use open-source packages in their projects without having insights into how these components are being developed and maintained. The aim of this thesis is to develop approaches for analyzing issue and dependency management in software projects. Software projects organize their work with issue trackers, tools for tracking issues such as development tasks, bug reports, and feature requests. By analyzing issue handling in more than 4,000 open-source projects, we found that many issues are left open for long periods of time, which can result in bugs and vulnerabilities not being fixed in a timely manner. This thesis proposes a method for predicting the amount of time it takes to resolve an issue by using the historical data available in issue trackers. Methods for predicting issue lifetime can help software project managers to prioritize issues and allocate resources accordingly. Another problem studied in this thesis is how software dependencies are used. Software developers often include third-party open-source software packages in their project code as a dependency. The included dependencies can also have their own dependencies. A complex network of dependency relationships exists among open-source software packages. This thesis analyzes the structure and the evolution of dependency networks of three popular programming languages. We propose an approach to measure the growth and the evolution of dependency networks. This thesis demonstrates that dependency network analysis can quantify what is the likelihood of acquiring vulnerabilities through software packages and how it changes over time. The approaches and findings developed here could help to bring transparency into open-source projects with respect to how issues are handled, or dependencies are updated

    “Won’t we fix this issue?” : qualitative characterization and automated identification of wontfix issues on GitHub

    Get PDF
    Context: Addressing user requests in the form of bug reports and Github issues represents a crucial task of any successful software project. However, user-submitted issue reports tend to widely differ in their quality, and developers spend a considerable amount of time handling them. Objective: By collecting a dataset of around 6,000 issues of 279 GitHub projects, we observe that developers take significant time (i.e., about five months, on average) before labeling an issue as a wontfix. For this reason, in this paper, we empirically investigate the nature of wontfix issues and methods to facilitate issue management process. Method: We first manually analyze a sample of 667 wontfix issues, extracted from heterogeneous projects, investigating the common reasons behind a “wontfix decision”, the main characteristics of wontfix issues and the potential factors that could be connected with the time to close them. Furthermore, we experiment with approaches enabling the prediction of wontfix issues by analyzing the titles and descriptions of reported issues when submitted. Results and conclusion: Our investigation sheds some light on the wontfix issues’ characteristics, as well as the potential factors that may affect the time required to make a “wontfix decision”. Our results also demonstrate that it is possible to perform prediction of wontfix issues with high average values of precision, recall, and F-measure (90%-93%)

    Exploring the characteristics of issue-related behaviors in GitHub using visualization techniques

    Get PDF

    Predicting Software Fault Proneness Using Machine Learning

    Get PDF
    Context: Continuous Integration (CI) is a DevOps technique which is widely used in practice. Studies show that its adoption rates will increase even further. At the same time, it is argued that maintaining product quality requires extensive and time consuming, testing and code reviews. In this context, if not done properly, shorter sprint cycles and agile practices entail higher risk for the quality of the product. It has been reported in literature [68], that lack of proper test strategies, poor test quality and team dependencies are some of the major challenges encountered in continuous integration and deployment. Objective: The objective of this thesis, is to bridge the process discontinuity that exists between development teams and testing teams, due to continuous deployments and shorter sprint cycles, by providing a list of potentially buggy or high risk files, which can be used by testers to prioritize code inspection and testing, reducing thus the time between development and release. Approach: Out approach is based on a five step process. The first step is to select a set of systems, a set of code metrics, a set of repository metrics, and a set of machine learning techniques to consider for training and evaluation purposes. The second step is to devise appropriate client programs to extract and denote information obtained from GitHub repositories and source code analyzers. The third step is to use this information to train the models using the selected machine learning techniques. This step allowed to identify the best performing machine learning techniques out of the initially selected in the first step. The fourth step is to apply the models with a voting classifier (with equal weights) and provide answers to five research questions pertaining to the prediction capability and generality of the obtained fault proneness prediction framework. The fifth step is to select the best performing predictors and apply it to two systems written in a completely different language (C++) in order to evaluate the performance of the predictors in a new environment. Obtained Results: The obtained results indicate that a) The best models were the ones applied on the same system as the one trained on; b) The models trained using repository metrics outperformed the ones trained using code metrics; c) The models trained using code metrics were proven not adequate for predicting fault prone modules; d) The use of machine learning as a tool for building fault-proneness prediction models is promising, but still there is work to be done as the models show weak to moderate prediction capability. Conclusion: This thesis provides insights into how machine learning can be used to predict whether a source code file contains one or more faults that may contribute to a major system failure. The proposed approach is utilizing information extracted both from the system’s source code, such as code metrics, and from a series of DevOps tools, such as bug repositories, version control systems and, testing automation frameworks. The study involved five Java and five Python systems and indicated that machine learning techniques have potential towards building models for alerting developers about failure prone code

    Improving Software Project Health Using Machine Learning

    Get PDF
    In recent years, systems that would previously live on different platforms have been integrated under a single umbrella. The increased use of GitHub, which offers pull-requests, issue trackingand version history, and its integration with other solutions such as Gerrit, or Travis, as well as theresponse from competitors, created development environments that favour agile methodologiesby increasingly automating non-coding tasks: automated build systems, automated issue triagingetc. In essence, source-code hosting platforms shifted to continuous integration/continuousdelivery (CI/CD) as a service. This facilitated a shift in development paradigms, adherents ofagile methodology can now adopt a CI/CD infrastructure more easily. This has also created large,publicly accessible sources of source-code together with related project artefacts: GHTorrent andsimilar datasets now offer programmatic access to the whole of GitHub. Project health encompasses traceability, documentation, adherence to coding conventions,tasks that reduce maintenance costs and increase accountability, but may not directly impactfeatures. Overfocus on health can slow velocity (new feature delivery) so the Agile Manifestosuggests developers should travel light — forgo tasks focused on a project health in favourof higher feature velocity. Obviously, injudiciously following this suggestion can undermine aproject’s chances for success. Simultaneously, this shift to CI/CD has allowed the proliferation of Natural Language orNatural Language and Formal Language textual artefacts that are programmatically accessible:GitHub and their competitors allow API access to their infrastructure to enable the creation ofCI/CD bots. This suggests that approaches from Natural Language Processing and MachineLearning are now feasible and indeed desirable. This thesis aims to (semi-)automate tasks forthis new paradigm and its attendant infrastructure by bringing to the foreground the relevant NLPand ML techniques. Under this umbrella, I focus on three synergistic tasks from this domain: (1) improving theissue-pull-request traceability, which can aid existing systems to automatically curate the issuebacklog as pull-requests are merged; (2) untangling commits in a version history, which canaid the beforementioned traceability task as well as improve the usability of determining a faultintroducing commit, or cherry-picking via tools such as git bisect; (3) mixed-text parsing, to allowbetter API mining and open new avenues for project-specific code-recommendation tools

    Investigating the Impact of Continuous Integration Practices on the Productivity and Quality of Open-Source Projects

    Full text link
    Background: Much research has been conducted to investigate the impact of Continuous Integration (CI) on the productivity and quality of open-source projects. Most of studies have analyzed the impact of adopting a CI server service (e.g, Travis-CI) but did not analyze CI sub-practices. Aims: We aim to evaluate the impact of five CI sub-practices with respect to the productivity and quality of GitHub open-source projects. Method: We collect CI sub-practices of 90 relevant open-source projects for a period of 2 years. We use regression models to analyze whether projects upholding the CI sub-practices are more productive and/or generate fewer bugs. We also perform a qualitative document analysis to understand whether CI best practices are related to a higher quality of projects. Results: Our findings reveal a correlation between the Build Activity and Commit Activity sub-practices and the number of merged pull requests. We also observe a correlation between the Build Activity, Build Health and Time to Fix Broken Builds sub-practices and number of bug-related issues. The qualitative analysis reveals that projects with the best values for CI sub-practices face fewer CI-related problems compared to projects that exhibit the worst values for CI sub-practices. Conclusions: We recommend that projects should strive to uphold the several CI sub-practices as they can impact in the productivity and quality of projects.Comment: Paper accepted for publication by The ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM

    Image-based Communication on Social Coding Platforms

    Full text link
    Visual content in the form of images and videos has taken over general-purpose social networks in a variety of ways, streamlining and enriching online communications. We are interested to understand if and to what extent the use of images is popular and helpful in social coding platforms. We mined nine years of data from two popular software developers' platforms: the Mozilla issue tracking system, i.e., Bugzilla, and the most well-known platform for developers' Q/A, i.e., Stack Overflow. We further triangulated and extended our mining results by performing a survey with 168 software developers. We observed that, between 2013 and 2022, the number of posts containing image data on Bugzilla and Stack Overflow doubled. Furthermore, we found that sharing images makes other developers engage more and faster with the content. In the majority of cases in which an image is included in a developer's post, the information in that image is complementary to the text provided. Finally, our results showed that when an image is shared, understanding the content without the information in the image is unlikely for 86.9\% of the cases. Based on these observations, we discuss the importance of considering visual content when analyzing developers and designing automation tools

    An Event-based Analysis Framework for Open Source Software Development Projects

    Get PDF
    The increasing popularity and success of Open Source Software (OSS) development projects has drawn significant attention of academics and open source participants over the last two decades. As one of the key areas in OSS research, assessing and predicting OSS performance is of great value to both OSS communities and organizations who are interested in investing in OSS projects. Most existing research, however, has considered OSS project performance as the outcome of static cross-sectional factors such as number of developers, project activity level, and license choice. While variance studies can identify some predictors of project outcomes, they tend to neglect the actual process of development. Without a closer examination of how events occur, an understanding of OSS projects is incomplete. This dissertation aims to combine both process and variance strategy, to investigate how OSS projects change over time through their development processes; and to explore how these changes affect project performance. I design, instantiate, and evaluate a framework and an artifact, EventMiner, to analyze OSS projects’ evolution through development activities. This framework integrates concepts from various theories such as distributed cognition (DCog) and complexity theory, applying data mining techniques such as decision trees, motif analysis, and hidden Markov modeling to automatically analyze and interpret the trace data of 103 OSS projects from an open source repository. The results support the construction of process theories on OSS development. The study contributes to literature in DCog, design routines, OSS development, and OSS performance. The resulting framework allows OSS researchers who are interested in OSS development processes to share and reuse data and data analysis processes in an open-source manner

    Action-based recommendation in pull-request development

    Get PDF
    Pull requests (PRs) selection is a challenging task faced by integrators in pull-based development (PbD), with hundreds of PRs submitted on a daily basis to large open-source projects. Managing these PRs manually consumes integrators' time and resources and may lead to delays in the acceptance, response, or rejection of PRs that can propose bug fixes or feature enhancements. On the one hand, well-known platforms for performing PbD, like GitHub, do not provide built-in recommendation mechanisms for facilitating the management of PRs. On the other hand, prior research on PRs recommendation has focused on the likelihood of either a PR being accepted or receive a response by the integrator. In this paper, we consider both those likelihoods, this to help integrators in the PRs selection process by suggesting to them the appropriate actions to undertake on each specific PR. To this aim, we propose an approach, called CARTESIAN (aCceptance And Response classificaTion-based requESt IdentificAtioN) modeling the PRs recommendation according to PR actions. In particular, CARTESIAN is able to recommend three types of PR actions: accept, respond, and reject. We evaluated CARTESIAN on the PRs of 19 popular GitHub projects. The results of our study demonstrate that our approach can identify PR actions with an average precision and recall of about 86%. Moreover, our findings also highlight that CARTESIAN outperforms the results of two baseline approaches in the task of PRs selection
    • 

    corecore