2 research outputs found

    A Complete Set of Related Git Repositories Identified via Community Detection Approaches Based on Shared Commits

    To understand the state and evolution of open source software as a whole, we need to get a handle on the set of distinct software projects. Most open source projects presently use Git, a distributed version control system that allows easy creation of clones and results in numerous repositories that are almost entirely based on the parent repository from which they were cloned. Git commits are based on a Merkle tree, so the same commit is highly unlikely to be produced independently in two repositories. Shared commits therefore appear to be an excellent way to group cloned repositories and obtain an accurate map of such repositories. We use the World of Code infrastructure, containing approximately 2B commits and 100M repositories, to create and share such a map. We discover that the largest group contains almost 14M repositories, most of which are unrelated to each other. As it turns out, developers can push Git objects to an arbitrary repository or pull objects from unrelated repositories, thus linking unrelated repositories. To address this, we apply the Louvain community detection algorithm to this very large graph consisting of links between commits and projects. The approach successfully reduces the size of the megacluster: the largest group of highly interconnected projects contains under 100K repositories. We expect that the resulting map of related projects, as well as the tools and methods to handle the very large graph, will serve as a reference set for mining software projects and other applications. Further work is needed to determine the different types of relationships among projects induced by shared commits and by other relationships, for example, shared source code or similar filenames. Comment: 5 pages
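The naive grouping step described in this abstract (treat any two repositories that share a commit as related, transitively) can be sketched as a union-find pass over a repository-to-commits mapping. This is a minimal illustration, not the authors' implementation: the function name `group_by_shared_commits`, the input layout, and the toy repository and commit names are all assumptions made for the example, and it covers only the connected-components grouping, not the Louvain step.

```python
from collections import defaultdict


def group_by_shared_commits(repo_commits):
    """Group repositories that are linked, directly or transitively,
    by at least one shared commit (hypothetical helper).

    repo_commits: dict mapping a repository name to a set of commit IDs.
    Returns a list of sorted groups, largest group first.
    """
    parent = {repo: repo for repo in repo_commits}

    def find(repo):
        # Follow parent links to the root, halving the path on the way.
        while parent[repo] != repo:
            parent[repo] = parent[parent[repo]]
            repo = parent[repo]
        return repo

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first repository seen for each commit; any later
    # repository containing the same commit is merged into its group.
    first_seen = {}
    for repo, commits in repo_commits.items():
        for commit in commits:
            if commit in first_seen:
                union(first_seen[commit], repo)
            else:
                first_seen[commit] = repo

    groups = defaultdict(set)
    for repo in repo_commits:
        groups[find(repo)].add(repo)
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


if __name__ == "__main__":
    # Toy data (made up): a project, its fork, and an unrelated library.
    repos = {
        "alice/app": {"c1", "c2"},
        "bob/app-fork": {"c1", "c2", "c3"},
        "carol/lib": {"c9"},
    }
    print(group_by_shared_commits(repos))

    # One repository holding objects from both sides links everything.
    repos["dave/misc"] = {"c3", "c9"}
    print(group_by_shared_commits(repos))
```

In the second call a single repository that holds commits from two otherwise unrelated histories merges all four repositories into one group, which is the small-scale version of the megacluster effect the abstract reports and the reason a community detection step is applied afterwards.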

    Code Reuse in Open Source Software Development: Quantitative Evidence, Drivers, and Impediments

    The focus of existing open source software (OSS) research has been on how and why individuals and firms add to the commons of public OSS code, that is, on the “giving” side of this open innovation process. In contrast, research on the corresponding “receiving” side of the innovation process is scarce. We address this gap by studying how existing OSS code is reused and serves as an input to further OSS development. Our findings are based on a survey with 686 responses from OSS developers. Most notably, our multivariate analyses of developers’ code reuse behavior show that developers with larger personal networks within the OSS community and those who have experience in a greater number of OSS projects reuse more, presumably because both network size and broad project experience facilitate local search for reusable artifacts. Moreover, we find that a development paradigm that calls for releasing an initial functioning version of the software early (the “credible promise” in OSS) leads to increased reuse. Finally, we identify developers’ interest in tackling difficult technical challenges as detrimental to efficient reuse-based innovation. Beyond OSS, we discuss the relevance of our findings for companies developing software and for the receiving side of open innovation processes in general.