Defectors: A Large, Diverse Python Dataset for Defect Prediction

Abstract

Defect prediction has been a popular research topic where machine learning (ML) and deep learning (DL) have found numerous applications. However, these ML/DL-based defect prediction models are often limited by the quality and size of their datasets. In this paper, we present Defectors, a large dataset for just-in-time and line-level defect prediction. Defectors consists of β‰ˆ\approx 213K source code files (β‰ˆ\approx 93K defective and β‰ˆ\approx 120K defect-free) that span across 24 popular Python projects. These projects come from 18 different domains, including machine learning, automation, and internet-of-things. Such a scale and diversity make Defectors a suitable dataset for training ML/DL models, especially transformer models that require large and diverse datasets. We also foresee several application areas of our dataset including defect prediction and defect explanation. Dataset link: https://doi.org/10.5281/zenodo.770898

    Similar works

    Full text

    thumbnail-image

    Available Versions