Using the Uniqueness of Global Identifiers to Determine the Provenance of Python Software Source Code
We consider the problem of identifying the provenance of free/open source
software (FOSS) and specifically the need of identifying where reused source
code has been copied from. We propose a lightweight approach to solve the
problem based on software identifiers, such as the names of variables, classes,
and functions chosen by programmers. The proposed approach is able to
efficiently narrow down to a small set of candidate origin products, to be
further analyzed with more expensive techniques to make a final provenance
determination. By analyzing the PyPI (Python Package Index) open source
ecosystem we find that globally defined identifiers are very distinct. Across
PyPI's 244 K packages we found 11.2 M different global identifiers (classes and
method/function names, with only 0.6% of identifiers shared between the two types
of entities); 76% of identifiers were used only in one package, and 93% in at
most 3. Randomly selecting 3 non-frequent global identifiers from an input
product is enough to narrow down its origins to at most 3 products in
89% of cases. We validate the proposed approach by mapping Debian source
packages implemented in Python to the corresponding PyPI packages; this
approach uses at most five trials, where each trial uses three randomly chosen
global identifiers from a randomly chosen Python file of the subject software
package, then ranks results using a popularity index and requires inspecting
only the top result. In our experiments, this method is effective at finding
the true origin of a project with a recall of 0.9 and precision of 0.77
- …
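
The lookup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (global_identifiers, build_index, candidate_origins), the max_freq cutoff used to define "non-frequent", and the popularity mapping are all hypothetical, and the sketch samples from a single source string rather than from a randomly chosen file of a package. It assumes that class and function/method definitions found by Python's standard ast module are a reasonable stand-in for the global identifiers the abstract refers to.

```python
from __future__ import annotations

import ast
import random
from collections import defaultdict


def global_identifiers(source: str) -> set[str]:
    """Collect the names of all classes and functions/methods defined in source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            names.add(node.name)
    return names


def build_index(packages: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each identifier to the set of package names that define it.

    packages maps a package name to the source text of its Python files.
    """
    index = defaultdict(set)
    for package, sources in packages.items():
        for source in sources:
            for name in global_identifiers(source):
                index[name].add(package)
    return index


def candidate_origins(source, index, popularity, sample_size=3,
                      max_freq=3, trials=5, rng=None):
    """Return candidate origin packages, most popular first.

    Assumption: an identifier counts as "non-frequent" when it occurs in at
    most max_freq indexed packages; the cutoff is a hypothetical parameter.
    """
    rng = rng or random.Random()
    # Keep identifiers that the index knows and that are rare enough.
    rare = sorted(
        name for name in global_identifiers(source)
        if 0 < len(index.get(name, ())) <= max_freq
    )
    if len(rare) < sample_size:
        return []
    for _ in range(trials):
        sample = rng.sample(rare, sample_size)
        # A candidate must define every sampled identifier.
        candidates = set.intersection(*(index[name] for name in sample))
        if candidates:
            return sorted(candidates, key=lambda p: popularity.get(p, 0),
                          reverse=True)
    return []
```

Inspecting only the first element of the returned list corresponds to the "top result" step in the abstract. A real index over an ecosystem the size of PyPI would need persistent storage rather than an in-memory dictionary.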