84 research outputs found
Automatic Detection of Public Development Projects in Large Open Source Ecosystems: An Exploratory Study on GitHub
Hosting over 10 million software projects, GitHub is one of the most
important data sources for studying the behavior of developers and software projects.
However, with the increase of the size of open source datasets, the potential
threats to mining these datasets have also grown. As the dataset grows, it
becomes increasingly unrealistic for humans to confirm the quality of all samples. Some
studies have investigated this problem and provided solutions to avoid threats
in sample selection, but some of these solutions (e.g., finding development
projects) require human intervention. When the amount of data to be processed
increases, these semi-automatic solutions become less useful since the effort
required for human intervention is far beyond what is affordable. To solve this problem,
we investigated the GHTorrent dataset and proposed a method to detect public
development projects. The results show that our method can effectively improve
the sample selection process in two ways: (1) We provide a simple model to
automatically select samples (with 0.827 precision and 0.947 recall); (2) We
also offer a complex model to help researchers carefully screen samples (with
63.2% less effort than manually confirming all samples, while achieving 0.926
precision and 0.959 recall).
Comment: Accepted by the SEKE2018 Conference
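For readers unfamiliar with the two metrics reported above, the following minimal sketch shows how precision and recall are computed from binary labels. The function name `precision_recall` and the toy labels are illustrative assumptions, not taken from the paper or the GHTorrent dataset.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = development project)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # true positives
    fp = sum(1 for t, p in pairs if not t and p)    # false positives
    fn = sum(1 for t, p in pairs if t and not p)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels (hypothetical, for illustration only).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

High recall means few genuine development projects are missed; high precision means few non-development repositories slip into the sample.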
Firms on SourceForge
This paper explores empirically what factors influence a firm’s decision to contribute to and take leadership in open source projects. Firms’ increasing participation in the development of open source software (OSS) is generally perceived as a puzzle. Assuming that firms face a “Make-or-Buy” decision before using OSS, we argue that contribution is in fact the best way for them to keep control of their supplier in a context where incomplete open source licenses govern transactions. Building on this proposition, we derive predictions on the drivers of firms’ contribution and leadership in open source projects, and test them on a unique dataset of 4,808 open source projects extracted from SourceForge. Our empirical findings confirm the predictions and lend support to our hypotheses.
Keywords: open source; transaction cost; governance; firm boundaries; software
Structural Complexity and Decay in FLOSS Systems: An Inter-Repository Study
Past software engineering literature has firmly established that software architectures and the associated code decay over time. Architectural decay is, potentially, a major issue in Free/Libre/Open Source Software (FLOSS) projects, since developers sporadically joining FLOSS projects do not always have a clear understanding of the underlying architecture, and may break the overall conceptual structure by several small changes to the code base.
This paper investigates whether the structure of a FLOSS system and its decay can also be influenced by the repository in which it is retained: specifically, two FLOSS repositories are studied to understand whether the complexity of the software structure in the sampled projects is comparable, or one repository hosts more complex systems than the other. It is also studied whether the effort to counteract this complexity is dependent on the repository, and the governance it gives to the hosted projects.
The results of the paper are two-fold: on one side, it is shown that the repository hosting larger and more active projects presents more complex structures. On the other side, these larger and more complex systems benefit from more anti-regressive work to reduce this complexity.
Pitfalls and Guidelines for Using Time-Based Git Data
Many software engineering research papers rely on time-based data (e.g., commit timestamps, issue report creation/update/close dates, release dates). Like most real-world data, however, time-based data is often dirty. To date, there are no studies that quantify how frequently such data is used by the software engineering research community, or that investigate the sources of dirty data and quantify how often it occurs. Depending on the research task and method used, including such dirty data could affect the research results. This paper presents an extended survey of papers that utilize time-based data, published in the Mining Software Repositories (MSR) conference series. Out of the 754 technical track and data papers published in MSR 2004–2021, we found that at least 290 (38%) utilized time-based data. We also observed that most time-based data used in research papers comes in the form of Git commits, often from GitHub. Based on those results, we then used the Boa and Software Heritage infrastructures to help identify and quantify several sources of dirty Git timestamp data. Finally, we provide guidelines/best practices for researchers utilizing time-based data from Git repositories.
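The abstract does not list the specific sources of dirty timestamps it identifies, so the sketch below is only an illustration of what such a screening step might look like. The three checks (a timestamp at or before the Unix epoch, a timestamp in the future, a commit dated earlier than its parent) are assumptions chosen for illustration, and the function name and data layout are hypothetical.

```python
from datetime import datetime, timezone


def flag_suspicious_commits(commits, now=None):
    """Return {commit_id: [reasons]} for commits whose timestamps look suspicious.

    `commits` maps a commit id to {"time": Unix seconds, "parents": [ids]}.
    The checks are illustrative heuristics, not the paper's method.
    """
    if now is None:
        now = datetime.now(timezone.utc).timestamp()
    flagged = {}
    for cid, commit in commits.items():
        reasons = []
        if commit["time"] <= 0:
            reasons.append("at or before the Unix epoch")
        if commit["time"] > now:
            reasons.append("in the future")
        for parent in commit["parents"]:
            # A child dated before its parent suggests clock skew or rewriting.
            if parent in commits and commits[parent]["time"] > commit["time"]:
                reasons.append(f"earlier than parent {parent}")
        if reasons:
            flagged[cid] = reasons
    return flagged


# Hypothetical commit history for illustration.
commits = {
    "a1": {"time": 1_600_000_000, "parents": []},
    "b2": {"time": 0, "parents": ["a1"]},
    "c3": {"time": 1_600_000_500, "parents": ["a1"]},
    "d4": {"time": 4_000_000_000, "parents": ["c3"]},
}
flagged = flag_suspicious_commits(commits, now=1_700_000_000)
for cid, reasons in flagged.items():
    print(cid, reasons)
```

A flagged commit is not necessarily wrong (rebases and misconfigured clocks both produce out-of-order dates), so a filter like this is better used to mark samples for inspection than to drop them silently.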
Similarities, challenges and opportunities of Wikipedia content and open source projects
Copyright © 2012 John Wiley & Sons, Ltd. Several years of research and evidence have demonstrated that Open Source Software (OSS) portals often contain a large number of software projects that simply do not evolve, developed by relatively small communities that struggle to attract a sustained number of contributors. These portals have increasingly started to act as storage for abandoned projects, and researchers and practitioners should try to point out how to take advantage of such content. Similarly, other online content portals (like Wikipedia) could be harvested for valuable content. In this paper we argue that, even with differences in the requested expertise, many projects reliant on content and contributions by users undergo a similar evolution, and follow similar patterns: when a project fails to attract contributors, it appears to be not evolving, or abandoned. Far from being a negative finding, even those projects could provide valuable content that should be harvested and identified based on common characteristics: by using the attributes of “usefulness” and “modularity” we isolate valuable content in both Wikipedia pages and OSS projects.
An Introduction to Software Ecosystems
This chapter defines and presents different kinds of software ecosystems. The
focus is on the development, tooling and analytics aspects of software
ecosystems, i.e., communities of software developers and the interconnected
software components (e.g., projects, libraries, packages, repositories,
plug-ins, apps) they are developing and maintaining. The technical and social
dependencies between these developers and software components form a
socio-technical dependency network, and the dynamics of this network change
over time. We classify and provide several examples of such ecosystems. The
chapter also introduces and clarifies the relevant terms needed to understand
and analyse these ecosystems, as well as the techniques and research methods
that can be used to analyse different aspects of these ecosystems.
Comment: Preprint of chapter "An Introduction to Software Ecosystems" by Tom
Mens and Coen De Roover, published in the book "Software Ecosystems: Tooling
and Analytics" (eds. T. Mens, C. De Roover, A. Cleve), 2023, ISBN
978-3-031-36059-6, reproduced with permission of Springer. The final
authenticated version of the book and this chapter is available online at:
https://doi.org/10.1007/978-3-031-36060-
Dynamics of Innovation in an “Open Source” Collaboration Environment: Lurking, Laboring and Launching FLOSS Projects on SourceForge
A systems analysis perspective is adopted to examine the critical properties of the Free/Libre/Open Source Software (FLOSS) mode of innovation, as reflected on the SourceForge platform (SF.net). This approach re-scales March’s (1991) framework and applies it to characterize the “innovation system” of a “distributed organization” of interacting agents in a virtual collaboration environment. The innovation system of the virtual collaboration environment is an emergent property of two “coupled” processes: one involves interactions among agents searching for information to use in designing novel software products, and the other involves the mobilization of individual capabilities for application in the software development projects. Micro-dynamics of this system are studied empirically by constructing transition probability matrices representing movements of 222,835 SF.net users among 7 different activity states. Estimated probabilities are found to form first-order Markov chains describing ergodic processes. This makes it possible to compute the equilibrium distribution of agents among the states, thereby suppressing transient effects and revealing persisting patterns of project-joining and project-launching.
Keywords: innovation systems, collaborative development environments, industrial districts, exploration and exploitation dynamics, open source software, FLOSS, SourceForge, project-joining, project-founding, Markov chain analysis.
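To illustrate what computing an equilibrium distribution involves, the sketch below approximates the stationary distribution of an ergodic first-order Markov chain by power iteration. The abstract does not state which numerical method the authors used, so this is only one possible approach; the 3-state transition matrix is made up for illustration and is not the paper's estimated 7-state matrix.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Approximate pi satisfying pi = pi P for a row-stochastic matrix P."""
    n = len(P)
    # Validate that P is square, non-negative, and that each row sums to 1.
    for row in P:
        if len(row) != n or min(row) < 0 or abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("P must be a square row-stochastic matrix")
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # One step of the chain: next[j] = sum_i pi[i] * P[i][j].
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")


# Hypothetical transition probabilities among three activity states.
P = [
    [0.90, 0.08, 0.02],
    [0.30, 0.60, 0.10],
    [0.20, 0.30, 0.50],
]
pi = stationary_distribution(P)
print([round(x, 4) for x in pi])
```

For an ergodic chain the result does not depend on the starting distribution, which is why the equilibrium suppresses transient effects.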
Antipatterns in software classification taxonomies
Empirical results in software engineering have long shown that findings are unlikely to be applicable to all software systems, or to any domain: results need to be evaluated in specified contexts, and limited to the type of systems that they were extracted from. This is a known issue, and it requires the establishment of a classification of software types. This paper makes two contributions: the first is to evaluate the quality of the current software classifications landscape. The second is to perform a case study showing how to create a classification of software types using a curated set of software systems. Our contributions show that existing, and very likely even new, classification attempts are doomed to fail because of one or more issues, which we name the ‘antipatterns’ of software classification tasks. We collected 7 of these antipatterns, which emerge from both our case study and the existing classifications. These antipatterns represent recurring issues in a classification, so we discuss practical ways to help researchers avoid these pitfalls. It becomes clear that classification attempts must also face the daunting task of formulating a taxonomy of software types, with the objective of establishing a hierarchy of categories in a classification.