The Role of Data Filtering in Open Source Software Ranking and Selection
Faced with over 100M open source projects, most empirical investigations
select a subset. Most research papers in leading venues investigated filtering
projects by some measure of popularity with explicit or implicit arguments that
unpopular projects are not of interest, may not even represent "real" software
projects, or that less popular projects are not worthy of study. However, such
filtering may have enormous effects on the results of the studies if and
precisely because the sought-out response or prediction is in any way related
to the filtering criteria.
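
The effect described above can be illustrated with a minimal simulation. This is a sketch on synthetic data, not the study's data or method: the variable names `activity` and `popularity`, the linear relationship between them, and the cut-off at zero are all assumptions chosen for illustration. It fits a one-variable least-squares slope on the full sample and again on the subset kept by a popularity filter.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (single predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic data (assumption): popularity = activity + independent noise,
# so the slope in the full population is 1 by construction.
rng = random.Random(42)
n = 20000
activity = [rng.gauss(0, 1) for _ in range(n)]
popularity = [a + rng.gauss(0, 1) for a in activity]

# Control: fit on every project.
slope_all = ols_slope(activity, popularity)

# Filtered: keep only projects above a popularity threshold (hypothetical cut-off),
# i.e. the filter is directly related to the response.
kept = [(a, p) for a, p in zip(activity, popularity) if p > 0]
slope_filtered = ols_slope([a for a, _ in kept], [p for _, p in kept])

print(f"slope on all projects:      {slope_all:.2f}")
print(f"slope on filtered projects: {slope_filtered:.2f}")
```

Under these assumptions the slope estimated on the filtered subset is markedly smaller than the slope on the full sample, even though the underlying relationship is identical for every project.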
We exemplify the impact of this practice on research outcomes: how filtering
of projects listed on GitHub affects the assessment of their popularity. We
randomly sample over 100,000 repositories and use multiple regression to model
the number of stars (a proxy for popularity) based on the number of commits,
the duration of the project, the number of authors, and the number of core
developers. Comparing a control model fitted on the entire dataset with a
filtered model fitted only on projects having ten or more authors, we find
that while certain characteristics
of the repository consistently predict popularity, the filtering process
significantly alters the relationships between these characteristics and the
response. The number of commits exhibited a positive correlation with
popularity in the control sample but showed a negative correlation in the
filtered sample. These findings highlight the potential biases introduced by
data filtering and emphasize the need for careful sample selection in empirical
research on mining software repositories. We recommend that empirical work
should either analyze complete datasets such as World of Code, or employ
stratified random sampling from a complete dataset to ensure that filtering is
not biasing the results.
Comment: International Workshop on Methodological Issues with Empirical
Studies in Software Engineering (WSESE 2024)
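
The recommendation to use stratified random sampling could be sketched as follows. This is an illustrative outline only: the record layout (a dict with an `authors` field), the bucket boundaries in `author_bucket`, and the sampling fraction are hypothetical choices, not taken from the paper. The point is that every stratum of the complete dataset, including small or unpopular projects, contributes to the sample in proportion to its size.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw the same fraction at random from every stratum of `records`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)

    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Keep at least one project per stratum so no group disappears.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


def author_bucket(repo):
    """Hypothetical strata based on the number of authors."""
    n = repo["authors"]
    if n < 2:
        return "1"
    if n < 10:
        return "2-9"
    return "10+"


# Toy stand-in for a complete dataset (assumption): mostly single-author projects.
repos = [{"authors": a} for a in [1] * 900 + [5] * 90 + [20] * 10]
sample = stratified_sample(repos, author_bucket, fraction=0.1)
```

Here the sample retains projects from all three buckets, whereas a filter such as "ten or more authors" would keep only the smallest bucket.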