61 research outputs found
A Dataset and an Approach for Identity Resolution of 38 Million Author IDs extracted from 2B Git Commits
The data collected from open source projects provide means to model large
software ecosystems, but often suffer from data quality issues, specifically,
multiple author identification strings in code commits might actually be
associated with one developer. While many methods have been proposed for
addressing this problem, they are either heuristics requiring manual tweaking,
or require too much calculation time to do pairwise comparisons for 38M author
IDs in, for example, the World of Code collection. In this paper, we propose a
method that finds all author IDs belonging to a single developer in this entire
dataset, and share the list of all author IDs that were found to have aliases.
To do this, we first create blocks of potentially connected author IDs and then
use a machine learning model to predict which of these potentially related IDs
belong to the same developer. We processed around 38 million author IDs and
found around 14.8 million IDs to have an alias, which belong to 5.4 million
different developers, with the median number of aliases being 2 per developer.
This dataset can be used to create more accurate models of developer behaviour
at the entire OSS ecosystem level and can be used to provide a service to
rapidly resolve new author IDs.
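The two-stage idea in the abstract above (block potentially connected author IDs, then classify pairs within each block) can be illustrated with a minimal sketch. This is not the authors' implementation: the blocking keys (exact name or exact email), the string-similarity stand-in for the machine learning model, the 0.8 threshold, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def parse_author_id(author_id):
    """Split 'Name <email>' into a lowercase (name, email) pair."""
    name, _, rest = author_id.partition("<")
    return name.strip().lower(), rest.rstrip(">").strip().lower()


def build_blocks(author_ids):
    """Blocking step (assumed keys): group IDs sharing an exact name or email."""
    blocks = defaultdict(set)
    for author_id in author_ids:
        name, email = parse_author_id(author_id)
        if name:
            blocks[("name", name)].add(author_id)
        if email:
            blocks[("email", email)].add(author_id)
    # Only blocks with at least two IDs can contain an alias pair.
    return [ids for ids in blocks.values() if len(ids) > 1]


def same_developer(a, b, threshold=0.8):
    """Hypothetical stand-in for a trained classifier: plain string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def resolve(author_ids, predict=same_developer):
    """Merge IDs that `predict` links within a block, using union-find."""
    parent = {a: a for a in author_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for block in build_blocks(author_ids):
        # Pairwise comparison happens only inside a block, not across all IDs.
        for a, b in combinations(sorted(block), 2):
            if predict(a, b):
                parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for a in author_ids:
        clusters[find(a)].add(a)
    # Report only developers that have at least one alias.
    return [ids for ids in clusters.values() if len(ids) > 1]
```

Restricting comparisons to blocks is what avoids the all-pairs cost the abstract mentions; any pairwise predictor can be passed in through the `predict` parameter in place of the similarity placeholder.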
Modeling User-Affected Software Properties for Open Source Software Supply Chains
Background: The Open Source Software development community relies heavily on users of the software and on contributors outside the core developers to produce top-quality software and provide long-term support. However, exactly how a software project and its contributors are related through dependencies, and how the users of a software project affect many of its properties, is not well understood.
Aim: My research covers a number of aspects related to answering the overarching question of modeling the software properties affected by users and the supply chain structure of software ecosystems, viz. 1) Understanding how software usage affects its perceived quality; 2) Estimating the effects of indirect usage (e.g. dependent packages) on software popularity; 3) Investigating the patch submission and issue creation patterns of external contributors; 4) Examining how the patch acceptance probability is related to the contributors' characteristics. 5) A related topic, the identification of bots that commit code, was also investigated, with the aim of improving the accuracy of these and other similar studies.
Methodology: Most of the Research Questions are addressed by studying the NPM ecosystem, with data from various sources such as the World of Code, GHTorrent, and the GitHub API. Different supervised and unsupervised machine learning models, including Regression, Random Forest, Bayesian Networks, and clustering, were used to answer the appropriate questions.
Results: 1) Software usage affects its perceived quality even after accounting for code complexity measures. 2) The numbers of dependents and dependencies of a software package were observed to predict the change in its popularity with good accuracy. 3) Users interact (contribute issues or patches) primarily with their direct dependencies, and rarely with transitive dependencies. 4) A user's earlier interaction with the repository to which they are contributing a patch, and their familiarity with related topics, were important predictors of the chance of a pull request getting accepted. 5) BIMAN, a systematic methodology for identifying bots, was developed.
Conclusion: Different aspects of how users and their characteristics affect different software properties were analyzed, which should lead to a better understanding of the complex interaction between software developers and users/contributors.
Software Supply Chain Development and Application
Motivation: Free Libre Open Source Software (FLOSS) has become a critical component in numerous devices and applications. Despite its importance, it is not clear why the FLOSS ecosystem works so well or whether it may cease to function. The majority of existing research focuses on a specific software project or a portion of an ecosystem, but FLOSS has not been investigated in its entirety. Such a view is necessary because of the deep and complex technical and social dependencies that go beyond the core of an individual ecosystem, and the tight inter-dependencies among ecosystems within FLOSS.
Aim: We, therefore, aim to discover underlying relations within and across FLOSS projects and developers in the open source community, mitigate potential risks induced by the lack of such knowledge, and enable systematic analysis over the entire open source community through the lens of the supply chain (SC).
Method: We utilize concepts from the field of supply chains to model risks of the FLOSS ecosystem. FLOSS, due to the distributed decision making of software developers, technical dependencies, and copying of code, has similarities to a traditional supply chain. Unlike in a traditional supply chain, where data is proprietary and distributed among players, we aim to measure the open-source software supply chain (OSSC) by operationalizing the supply chain concept in the software domain using traces reconstructed from version control data.
Results: We create a very large and frequently updated collection of version control data across the entire FLOSS ecosystem, named World of Code (WoC), that can completely cross-reference authors, projects, commits, blobs, dependencies, and the history of the FLOSS ecosystem, and that provides capabilities to efficiently correct, augment, query, and analyze that data. Various research efforts and applications (e.g., software technology adoption investigation) have been successfully implemented by leveraging the combination of WoC and OSSC.
Implications: With an SC perspective on FLOSS development and the increased visibility and transparency in OSSC, our work provides opportunities for researchers to conduct wider and deeper studies on OSS over the entire FLOSS community, for developers to build more robust software, and for students to learn technologies more efficiently and improve their programming skills.
The Geography of Open Source Software: Evidence from GitHub
Open Source Software (OSS) plays an important role in the digital economy.
Yet although software production is amenable to remote collaboration and its
outputs are easily shared across distances, software development seems to
cluster geographically in places such as Silicon Valley, London, or Berlin. And
while recent work indicates that OSS activity creates positive externalities
which accrue locally through knowledge spillovers and information effects,
up-to-date data on the geographic distribution of active open source developers
is limited. This presents a significant blind spot for policymakers, who tend to
promote OSS at the national level as a cost-saving tool for public sector
institutions. We address this gap by geolocating more than half a million
active contributors to GitHub in early 2021 at various spatial scales. Compared
to results from 2010, we find a significant increase in the share of developers
based in Asia, Latin America and Eastern Europe, suggesting a more even spread
of OSS developers globally. Within countries, however, we find significant
concentration in regions, exceeding the concentration of workers in high-tech
fields. Social and economic development indicators predict at most half of
regional variation in OSS activity in the EU, suggesting that clusters of OSS
have idiosyncratic roots. We argue that policymakers seeking to foster OSS
should focus locally rather than nationally, using the tools of cluster policy
to support networks of OSS developers.