Analysis and Detection of Information Types of Open Source Software Issue Discussions
Most modern Issue Tracking Systems (ITSs) for open source software (OSS)
projects allow users to add comments to issues. Over time, these comments
accumulate into discussion threads embedded with rich information about the
software project, which can potentially satisfy the diverse needs of OSS
stakeholders. However, discovering and retrieving relevant information from the
discussion threads is a challenging task, especially when the discussions are
lengthy and the number of issues in ITSs is vast. In this paper, we address
this challenge by identifying the information types presented in OSS issue
discussions. Through qualitative content analysis of 15 complex issue threads
across three projects hosted on GitHub, we uncovered 16 information types and
created a labeled corpus containing 4656 sentences. Our investigation of
supervised, automated classification techniques indicated that, when prior
knowledge about the issue is available, Random Forest can effectively detect
most sentence types using conversational features such as the sentence length
and its position. When classifying sentences from new issues, Logistic
Regression can yield satisfactory performance using textual features for
certain information types, while falling short on others. Our work represents a
nontrivial first step towards tools and techniques for identifying and
obtaining the rich information recorded in the ITSs to support various software
engineering activities and to satisfy the diverse needs of OSS stakeholders.
Comment: 41st ACM/IEEE International Conference on Software Engineering
(ICSE2019)
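As an illustration of the "conversational features" mentioned in the abstract above (sentence length and position), the following is a minimal sketch of how such features might be computed for each sentence of an issue thread. It is not the authors' implementation: the function name `conversational_features`, the feature names, the regex-based sentence splitter, and the normalization of positions to the range 0 to 1 are all assumptions made for this example. A classifier such as Random Forest would then be trained on rows like these.

```python
import re


def conversational_features(comments):
    """Return one feature dict per sentence in an issue thread.

    `comments` is a list of comment strings in posting order.
    All feature names here are hypothetical, chosen for illustration.
    """
    rows = []
    for c_idx, comment in enumerate(comments):
        # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", comment.strip()) if s]
        for s_idx, sentence in enumerate(sentences):
            rows.append({
                "sentence": sentence,
                # Sentence length in whitespace-separated tokens.
                "length_tokens": len(sentence.split()),
                # Relative position of the comment within the thread (0 = first).
                "comment_position": c_idx / max(len(comments) - 1, 1),
                # Relative position of the sentence within its comment.
                "sentence_position": s_idx / max(len(sentences) - 1, 1),
            })
    return rows


# Made-up two-comment thread used only to exercise the function.
thread = [
    "The build fails on Windows. Can anyone reproduce this?",
    "Yes, I see the same error.",
]
features = conversational_features(thread)
for row in features:
    print(row)
```

Positions are normalized so that threads of different lengths yield comparable values; whether the original study normalized them this way is not stated in the abstract.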
On Wasted Contributions: Understanding the Dynamics of Contributor-Abandoned Pull Requests
Pull-based development has enabled numerous volunteers to contribute to
open-source projects with fewer barriers. Nevertheless, a considerable number
of pull requests (PRs) with valid contributions are abandoned by their
contributors, wasting the effort and time put in by both the contributors and
maintainers. To better understand the underlying dynamics of
contributor-abandoned PRs, we conduct a mixed-methods study using both
quantitative and qualitative methods. We curate a dataset consisting of 265,325
PRs including 4,450 abandoned ones from ten popular and mature GitHub projects
and measure 16 features characterizing PRs, contributors, review processes, and
projects. Using statistical and machine learning techniques, we find that
complex PRs, novice contributors, and lengthy reviews have a higher probability
of abandonment and the rate of PR abandonment fluctuates alongside the
projects' maturity or workload. To identify why contributors abandon their PRs,
we also manually examine a random sample of 354 abandoned PRs. We observe that
the most frequent abandonment reasons are related to the obstacles faced by
contributors, followed by the hurdles imposed by maintainers during the review
process. Finally, we survey the top core maintainers of the studied projects to
understand their perspectives on dealing with PR abandonment and on our
findings.
Comment: Manuscript accepted for publication in ACM Transactions on Software
Engineering and Methodology (TOSEM)
A Dataset for GitHub Repository Deduplication: Extended Description
GitHub projects can be easily replicated through the site's fork process or
through a Git clone-push sequence. This is a problem for empirical software
engineering, because it can lead to skewed results or mistrained machine
learning models. We provide a dataset of 10.6 million GitHub projects that are
copies of others, and link each record with the project's ultimate parent. The
ultimate parents were derived from a ranking along six metrics. The related
projects were calculated as the connected components of an 18.2 million node
and 12 million edge denoised graph created by directing edges to ultimate
parents. The graph was created by filtering out more than 30 hand-picked and
2.3 million pattern-matched clumping projects. Projects that introduced
unwanted clumping were identified by repeatedly visualizing shortest path
distances between unrelated important projects. Our dataset identified 30
thousand duplicate projects in an existing popular reference dataset of 1.8
million projects. An evaluation of our dataset against another created
independently with different methods found a significant overlap, but also
differences attributed to the operational definition of what projects are
considered as related.
Comment: 33 pages, 33 figures, 17 listings
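The abstract above describes grouping related projects as the connected components of a graph whose edges point to ultimate parents. The following is a minimal sketch of that grouping step only, using a union-find structure. The function name `connected_components` and the repository names are hypothetical, and the paper's six-metric parent ranking and denoising of clumping projects are not reproduced here.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into connected components with union-find.

    `edges` is an iterable of (project, ultimate_parent) pairs; edge
    direction is ignored when forming components.
    """
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk to the
        # root while halving the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for child, ultimate in edges:
        root_child, root_parent = find(child), find(ultimate)
        if root_child != root_parent:
            parent[root_child] = root_parent

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Hypothetical repository names, for illustration only.
edges = [
    ("alice/fork", "org/project"),
    ("bob/fork", "org/project"),
    ("carol/copy", "dave/lib"),
]
components = connected_components(edges)
for component in components:
    print(sorted(component))
```

Union-find is chosen here because it handles a large edge list in a single pass without building adjacency lists; a breadth-first search over an explicit graph would give the same components.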