Software repository hosting services contain large amounts of open-source
software, with GitHub hosting more than 100 million repositories, from new to
established ones. Given this vast number of projects, there is a pressing need
for search based on the software's content and features. However, even though
GitHub offers various solutions to aid software discovery, most repositories do
not have any labels, reducing the utility of search and topic-based analysis.
Moreover, classifying software modules is becoming more important given
the rise of Component-Based Software Development. Previous work, however, has
focused on software classification using keyword-based approaches or proxies
for the project (e.g., the README), which are not always available. In this work, we
create a new annotated dataset of GitHub Java projects called LabelGit. Our
dataset uses information taken directly from the source code, such as the dependency graph
and neural representations of the source code computed from its identifiers. Using this
dataset, we hope to aid the development of solutions that do not rely on
proxies but use the entire source code to perform classification.

Comment: 5 pages, 2 figures, 1 table