3 research outputs found

    Pitfalls and Guidelines for Using Time-Based Git Data

    Get PDF
    Many software engineering research papers rely on time-based data (e.g., commit timestamps, issue report creation/update/close dates, release dates). Like most real-world data however, time-based data is often dirty. To date, there are no studies that quantify how frequently such data is used by the software engineering research community, or investigate sources of and quantify how often such data is dirty. Depending on the research task and method used, including such dirty data could aect the research results. This paper presents an extended survey of papers that utilize time-based data, published in the Mining Software Repositories (MSR) conference series. Out of the 754 technical track and data papers published in MSR 2004{2021, we saw at least 290 (38%) papers utilized time-based data. We also observed that most time-based data used in research papers comes in the form of Git commits, often from GitHub. Based on those results, we then used the Boa and Software Heritage infrastructures to help identify and quantify several sources of dirty Git timestamp data. Finally we provide guidelines/best practices for researchers utilizing time-based data from Git repositories

    Characterizing and Improving Logging Practices in Java-based Open Source Software Projects - A Large-scale Case Study in Apache Software Foundation

    Get PDF
    Log messages (generated by logging code) contain rich information about the runtime behavior of software systems. Although more logging code can provide more context of the system's behavior, it is undesirable to include too much logging code. Yuan et al. performed the first empirical study on characterizing the logging. In the first part of the thesis, we conduct a large-scale replication study on characterizing the logging practices on Java-based open source projects. A significantly higher portion of log updates are for enhancing the quality rather than co-changes with feature implementations. However, there are no well-defined coding guidelines for performing effective logging. In the second part, we studied the problem of characterizing and detecting the anti-patterns in the logging code. We have encoded these anti-patterns into a static code analysis tool, LCAnalyzer. Case studies show that LCAnalyzer has an average recall of 95% and precision of 60%
    corecore