6 research outputs found
The Software Heritage Graph Dataset: Large-scale Analysis of Public Software Development History
Software Heritage is the largest existing public archive of software source code and accompanying development history. It spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. These software artifacts were retrieved from major collaborative development platforms (e.g., GitHub, GitLab) and package repositories (e.g., PyPI, Debian, NPM), and stored in a uniform representation linking together source code files, directories, commits, and full snapshots of version control system (VCS) repositories as observed by Software Heritage during periodic crawls. This dataset is unique in terms of accessibility and scale, and allows researchers to explore a number of research questions on the long tail of public software development, instead of focusing solely on "most starred" repositories, as often happens.
Forking Without Clicking: on How to Identify Software Repository Forks
The notion of software "fork" has been shifting over time from the (negative) phenomenon of community disagreements that result in the creation of separate development lines and ultimately software products, to the (positive) practice of using distributed version control system (VCS) repositories to collaboratively improve a single product without stepping on each other's toes. In both cases the VCS repositories participating in a fork share parts of a common development history. Studies of software forks generally rely on metadata from hosting platforms, such as GitHub, as the source of truth for what constitutes a fork. These "forge forks", however, can only identify as forks repositories that have been created on the platform, e.g., by clicking a "fork" button on the platform user interface. The increased diversity in code hosting platforms (e.g., GitLab) and the habits of significant development communities (e.g., the Linux kernel, which is not primarily hosted on any single platform) call into question the reliability of trusting code hosting platforms to identify forks. Doing so might introduce selection and methodological biases in empirical studies. In this article we explore various definitions of "software forks", trying to capture forking workflows that exist in the real world. We quantify the differences in how many repositories would be identified as forks on GitHub according to the various definitions, confirming that a significant number could be overlooked by only considering forge forks. We study the structure and size of fork networks, observing how they are affected by the proposed definitions, and discuss the potential impact on empirical research.
Building the Universal Archive of Source Code: A global collaborative project for the benefit of all
Software is becoming the fabric that binds our personal and social lives, embodying a vast part of the technological knowledge that powers our industry and fuels innovation. Software is a pillar of most scientific research activities in all fields, from mathematics to physics, from chemistry to biology, from finance to social sciences. Software is also an essential mediator for accessing any digital information. In short, a rapidly increasing part of our collective knowledge is embodied in, or dependent on, software artifacts. Our ability to design, use, understand, adapt, and evolve systems and devices on which our lives have come to depend relies on our ability to understand, adapt, and evolve the source code of the software that controls them. Software source code is a precious, unique form of knowledge. It can be readily translated into a form executable by a machine, and yet it is human readable: Harold Abelson wrote "Programs must be written for humans to read", and source code is the preferred form for modification of software artifacts by developers.
SWHAP Workshop, September 14th and 15th, 2023: Proceedings - October 20th, 2023
In October 2022, Software Heritage hosted its inaugural SWHAP Days, a two-day conference dedicated to software preservation. In 2023, the Software Heritage team decided to organize a two-day, hands-on workshop in a closed committee format, scheduled for September 14th and 15th, 2023, at the Inria Paris centre. The workshop aimed to bring together professionals from diverse backgrounds, including conservation and heritage experts, researchers, and engineers. The objective was to foster collaboration, leveraging their collective knowledge and expertise to generate tangible and valuable outcomes for the community. Two topics were selected for this workshop: (1) building a guidebook on legacy software preservation and (2) telling the stories of legacy software.
Scholarly Infrastructures for Research Software: Report from the EOSC Executive Board Working Group (WG) Architecture Task Force (TF) SIRS
The TF on Scholarly Infrastructures of Research Software, as part of the Architecture WG of the European Open Science Cloud (EOSC) Executive Board, has established a set of recommendations to allow EOSC to include software, next to other research outputs like publications and data, in the realm of its research artifacts. This work is built upon a survey and documentation of a representative panel of current operational infrastructures across Europe, comparing their scopes and approaches. This report summarizes the state of the art, identifies best practices as well as open problems, and paves the way for federating the different approaches in view of supporting the software pillar of EOSC.