2 research outputs found
A Complete Set of Related Git Repositories Identified via Community Detection Approaches Based on Shared Commits
To understand the state and evolution of the entirety of open source
software, we need to get a handle on the set of distinct software projects. Most
open source projects presently use Git, which is a distributed version
control system allowing easy creation of clones and resulting in numerous
repositories that are almost entirely based on some parent repository from
which they were cloned. Git commits are based on a Merkle tree, and two identical
commits are highly unlikely to be produced independently. Shared commits, therefore,
appear to be an excellent way to group cloned repositories and obtain an
accurate map for such repositories. We use the World of Code infrastructure
containing approximately 2B commits and 100M repositories to create and share
such a map. We discover that the largest group contains almost 14M repositories,
most of which are unrelated to each other. As it turns out, developers can
push Git objects to an arbitrary repository or pull objects from unrelated
repositories, thus linking unrelated repositories. To address this, we apply
the Louvain community detection algorithm to this very large graph consisting of
links between commits and projects. The approach successfully reduces the size
of the megacluster, with the largest group of highly interconnected projects
containing under 100K repositories. We expect that the resulting map
of related projects, as well as the tools and methods to handle the very large graph,
will serve as a reference set for mining software projects and other
applications. Further work is needed to determine different types of
relationships among projects induced by shared commits and other relationships,
for example, by shared source code or similar filenames.
Comment: 5 pages
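The grouping step described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation or the World of Code tooling. It is a minimal union-find example, under the assumption that each repository is given as a set of commit hashes, and all repository names and hashes are hypothetical. It shows how shared commits link clones into groups, and how a single commit pushed into an unrelated repository merges otherwise separate groups, which is the effect behind the megacluster.

```python
from collections import defaultdict


class DisjointSet:
    """Union-find with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, item):
        # Register unseen items as their own singleton set.
        if item not in self.parent:
            self.parent[item] = item
            self.size[item] = 1
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def group_by_shared_commits(repo_commits):
    """Return groups of repositories linked by at least one shared commit."""
    sets = DisjointSet()
    first_owner = {}  # commit hash -> first repository seen containing it
    for repo, commits in repo_commits.items():
        sets.find(repo)
        for commit in commits:
            if commit in first_owner:
                sets.union(repo, first_owner[commit])
            else:
                first_owner[commit] = repo
    groups = defaultdict(set)
    for repo in repo_commits:
        groups[sets.find(repo)].add(repo)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical data: two projects, each with one fork.
repo_commits = {
    "alice/app": {"c1", "c2", "c3"},
    "bob/app-fork": {"c1", "c2", "c4"},
    "carol/lib": {"d1", "d2"},
    "dave/lib-fork": {"d1", "d2", "d3"},
}
clean_groups = group_by_shared_commits(repo_commits)

# One commit from the app project ends up in an unrelated library fork,
# which links the two projects into a single group.
polluted = {repo: set(commits) for repo, commits in repo_commits.items()}
polluted["dave/lib-fork"].add("c4")
polluted_groups = group_by_shared_commits(polluted)

print(clean_groups)
print(polluted_groups)
```

Connected components of this kind treat every shared commit as equally strong evidence, so one stray link is enough to join two groups. A community detection method such as Louvain instead looks for densely connected regions of the graph, which is why the abstract turns to it to break up the megacluster.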
Code Reuse in Open Source Software Development: Quantitative Evidence, Drivers, and Impediments
The focus of existing open source software (OSS) research has been on how and why individuals and firms add to the commons of public OSS code, that is, on the “giving” side of this open innovation process. In contrast, research on the corresponding “receiving” side of the innovation process is scarce. We address this gap by studying how existing OSS code is reused and serves as an input to further OSS development. Our findings are based on a survey with 686 responses from OSS developers. Most notably, our multivariate analyses of developers’ code reuse behavior show that developers with larger personal networks within the OSS community and those who have experience in a greater number of OSS projects reuse more, presumably because both network size and broad project experience facilitate local search for reusable artifacts. Moreover, we find that a development paradigm that calls for releasing an initial functioning version of the software early (the “credible promise” in OSS) leads to increased reuse. Finally, we identify developers’ interest in tackling difficult technical challenges as detrimental to efficient reuse-based innovation. Beyond OSS, we discuss the relevance of our findings for companies developing software and for the receiving side of open innovation processes in general.