6 research outputs found

The Software Heritage Graph Dataset: Large-scale Analysis of Public Software Development History

Author: Abate Pietro
Abramatic Jean-François
Cosmo Roberto Di
Merkle Ralph C.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 05/10/2020

Software Heritage is the largest existing public archive of software source code and accompanying development history. It spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. These software artifacts were retrieved from major collaborative development platforms (e.g., GitHub, GitLab) and package repositories (e.g., PyPI, Debian, NPM), and stored in a uniform representation linking together source code files, directories, commits, and full snapshots of version control system (VCS) repositories as observed by Software Heritage during periodic crawls. This dataset is unique in terms of accessibility and scale, and makes it possible to explore a number of research questions on the long tail of public software development, instead of solely focusing on "most starred" repositories as often happens.

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Forking Without Clicking: on How to Identify Software Repository Forks

Author: Abramatic Jean-François
Boldi Paolo
Cosmo Roberto Di
Hammouda Imed
Lima Antonio
Merkle Ralph C.
Nyman Linus
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 05/10/2020

The notion of software "fork" has been shifting over time from the (negative) phenomenon of community disagreements that result in the creation of separate development lines and ultimately software products, to the (positive) practice of using distributed version control system (VCS) repositories to collaboratively improve a single product without stepping on each other's toes. In both cases the VCS repositories participating in a fork share parts of a common development history. Studies of software forks generally rely on hosting platform metadata, such as that of GitHub, as the source of truth for what constitutes a fork. Relying on these "forge forks", however, can only identify as forks those repositories that have been created on the platform, e.g., by clicking a "fork" button on the platform user interface. The increased diversity in code hosting platforms (e.g., GitLab) and the habits of significant development communities (e.g., the Linux kernel, which is not primarily hosted on any single platform) call into question the reliability of trusting code hosting platforms to identify forks. Doing so might introduce selection and methodological biases in empirical studies. In this article we explore various definitions of "software forks", trying to capture forking workflows that exist in the real world. We quantify the differences in how many repositories would be identified as forks on GitHub according to the various definitions, confirming that a significant number could be overlooked by only considering forge forks. We study the structure and size of fork networks, observing how they are affected by the proposed definitions, and discuss the potential impact on empirical research.

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Building the Universal Archive of Source Code: A global collaborative project for the benefit of all

Author: Abramatic Jean-François
Di Cosmo Roberto
Zacchiroli Stefano
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 26/09/2018

Software is becoming the fabric that binds our personal and social lives, embodying a vast part of the technological knowledge that powers our industry and fuels innovation. Software is a pillar of most scientific research activities in all fields, from mathematics to physics, from chemistry to biology, from finance to social sciences. Software is also an essential mediator for accessing any digital information. In short, a rapidly increasing part of our collective knowledge is embodied in, or dependent on, software artifacts. Our ability to design, use, understand, adapt, and evolve systems and devices on which our lives have come to depend relies on our ability to understand, adapt, and evolve the source code of the software that controls them. Software source code is a precious, unique form of knowledge. It can be readily translated into a form executable by a machine, and yet it is human readable: Harold Abelson wrote "Programs must be written for humans to read", and source code is the preferred form for modification of software artefacts by developers.

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

SWHAP Workshop, September 14th and 15th, 2023: Proceedings - October 20th, 2023

Author: Abramatic Jean-François
Astic Isabelle
Bermès Emmanuelle
Bobbio Jérémy
Di Cosmo Roberto
Fichen Mathilde
Françoise Camille
Gomez Claude
Granger Sabrina
Gruenpeter Morane
Hagenmaier Wendy
Miura Grégory
Montangero Carlo
Phipps Simon
Seals-Nutt Kenneth
Publication venue: HAL CCSD
Publication date: 20/10/2023

In October 2022, Software Heritage hosted its inaugural SWHAP Days, a two-day conference dedicated to software preservation. In 2023, the Software Heritage team decided to organize a two-day, hands-on workshop in a closed committee format, scheduled for September 14th and 15th, 2023, at the Inria Paris centre. The workshop aimed to bring together professionals from diverse backgrounds, including conservation and heritage experts, researchers, and engineers. The objective was to foster collaboration, leveraging their collective knowledge and expertise to generate tangible and valuable outcomes for the community. Two topics were selected for this workshop: (1) building a guidebook on legacy software preservation and (2) telling the stories of legacy software.

INRIA a CCSD electronic archive server

SWHAP Workshop, September 14th and 15th, 2023: Proceedings - October 20th, 2023

Author: Abramatic Jean-François
Astic Isabelle
Bermès Emmanuelle
Bobbio Jérémy
Di Cosmo Roberto
Fichen Mathilde
Françoise Camille
Gomez Claude
Granger Sabrina
Gruenpeter Morane
Hagenmaier Wendy
Miura Grégory
Montangero Carlo
Phipps Simon
Seals-Nutt Kenneth
Publication venue: HAL CCSD
Publication date: 20/10/2023

HAL-Université de Bretagne Occidentale

Scholarly Infrastructures for Research Software: Report from the EOSC Executive Board Working Group (WG) Architecture Task Force (TF) SIRS

Author: Abramatic Jean-François
Barborini Yannick
Candela Leonardo
Colom Miguel
Dalitz Wolfgang
Di Cosmo Roberto
Fenner Martin
Gonzalez Lopez Jose Benito
Graf Kay
Harrison Melissa
Jeangirard Eric
Maassen Jason
Manghi Paolo
Martínez Ortiz Carlos
Ronchieri Elisabetta
Schubotz Moritz
Tenhunen Ville
Wagner Michael
Yates Sam
Publication venue: HAL CCSD
Publication date: 01/12/2020

The TF on Scholarly Infrastructures of Research Software, as part of the Architecture WG of the European Open Science Cloud (EOSC) Executive Board, has established a set of recommendations to allow EOSC to include software, next to other research outputs like publications and data, in the realm of its research artifacts. This work is built upon a survey and documentation of a representative panel of current operational infrastructures across Europe, comparing their scopes and approaches. This report summarizes the state of the art, identifies best practices as well as open problems, and paves the way for federating the different approaches in view of supporting the software pillar of EOSC.

INRIA a CCSD electronic archive server