Context: When software is released publicly, it is common to include with it
either the full text of the license or licenses under which it is published, or
a detailed reference to them. Therefore, public licenses, including FOSS (free
and open source software) licenses, are usually publicly available in source
code repositories.

Objective: To compile a dataset containing as many documents as
possible that contain the text of software licenses, or references to the
license terms. Once compiled, characterize the dataset so that it can be used
for further research or for practical purposes related to license analysis.

Method: Retrieve from Software Heritage, the largest publicly available archive
of FOSS source code, all versions of all files whose names are commonly used to
convey licensing terms. All retrieved documents will be characterized in
various ways, using automated and manual analyses.

Results: The dataset consists of 6.9
million unique license files. Additional metadata about shipped license files
is also provided, including file length measures, MIME type, SPDX license
(detected using ScanCode), and oldest appearance, making the dataset ready to
use in various contexts. The results of a manual analysis of 8102 documents are
also included, providing a ground truth for further analysis. The
dataset is released as open data: an archive file containing all deduplicated
license files, plus several portable CSV files with metadata that reference
files via cryptographic checksums.

Conclusions: Thanks to the extensive coverage of
Software Heritage, the dataset presented in this paper covers a very large
fraction of all software licenses for public code. We have assembled a large
body of software licenses, characterized it quantitatively and qualitatively,
and validated that it is mostly composed of licensing information and includes
almost all known license texts. The dataset can be used for empirical studies
on open source licensing, training of automated license classifiers, natural
language processing (NLP) analyses of legal texts, and historical and
phylogenetic studies on FOSS licensing. It can also be used in practice to
improve tools that detect licenses in source code.
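
As an illustration of how metadata that references files via cryptographic checksums might be consumed, the following is a minimal sketch and not the dataset's actual schema. It assumes SHA-1 as the checksum, and the column names `sha1`, `mime_type`, and `length_bytes` are hypothetical; the real CSV files may use different names and hash functions.

```python
import csv
import hashlib
import io


def sha1_hex(content: bytes) -> str:
    """Return the hex SHA-1 digest of a file's raw bytes."""
    return hashlib.sha1(content).hexdigest()


def load_metadata(csv_text: str) -> dict[str, dict[str, str]]:
    """Index metadata rows by checksum (column names are hypothetical)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["sha1"]: row for row in reader}


# Tiny in-memory stand-in for one license file and its metadata row.
license_bytes = b"Permission is hereby granted, free of charge...\n"
checksum = sha1_hex(license_bytes)
metadata_csv = (
    "sha1,mime_type,length_bytes\n"
    f"{checksum},text/plain,{len(license_bytes)}\n"
)

# Look up the metadata for a file by hashing its content.
index = load_metadata(metadata_csv)
record = index[checksum]
print(record["mime_type"], record["length_bytes"])
```

Keying the index on the content checksum means that identical license texts found under different paths or in different repositories resolve to the same metadata row, which is consistent with the files being deduplicated.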